
ICSA Book Series in Statistics

Series Editors: Jiahua Chen · Ding-Geng (Din) Chen


Ding-Geng (Din) Chen
Jiahua Chen
Xuewen Lu
Grace Y. Yi
Hao Yu Editors

Advanced Statistical Methods in Data Science
ICSA Book Series in Statistics

Series editors
Jiahua Chen
Department of Statistics
University of British Columbia
Vancouver
Canada

Ding-Geng (Din) Chen
University of North Carolina
Chapel Hill, NC, USA
More information about this series at http://www.springer.com/series/13402
Ding-Geng (Din) Chen • Jiahua Chen •
Xuewen Lu • Grace Y. Yi • Hao Yu
Editors

Advanced Statistical
Methods in Data Science

Editors

Ding-Geng (Din) Chen
School of Social Work, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Jiahua Chen
Department of Statistics, University of British Columbia, Vancouver, BC, Canada

Grace Y. Yi
Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada

Xuewen Lu
Department of Mathematics and Statistics, University of Calgary, Calgary, AB, Canada

Hao Yu
Department of Statistics and Actuarial Science, Western University, London, ON, Canada

ISSN 2199-0980    ISSN 2199-0999 (electronic)
ICSA Book Series in Statistics
ISBN 978-981-10-2593-8    ISBN 978-981-10-2594-5 (eBook)
DOI 10.1007/978-981-10-2594-5

Library of Congress Control Number: 2016959593

© Springer Science+Business Media Singapore 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #22-06/08 Gateway East, Singapore 189721,
Singapore
To my parents and parents-in-law, who value higher education and hard work; to my wife Ke, for her love, support, and patience; and to my son John D. Chen and my daughter Jenny K. Chen for their love and support.
Ding-Geng (Din) Chen, PhD

To my wife, my daughter Amy, and my son Andy, whose admiring conversations transformed into lasting enthusiasm for my research activities.
Jiahua Chen, PhD

To my wife Xiaobo, my daughter Sophia, and my son Samuel, for their support and understanding.
Xuewen Lu, PhD

To my family, Wenqing He, Morgan He, and Joy He, for being my inspiration and offering everlasting support.
Grace Y. Yi, PhD
Preface

This book is a compilation of invited presentations and lectures that were presented
at the Second Symposium of the International Chinese Statistical Association–
Canada Chapter (ICSA–CANADA) held at the University of Calgary, Canada,
August 4–6, 2015 (http://www.ucalgary.ca/icsa-canadachapter2015). The Symposium
was organized around the theme "Embracing Challenges and Opportunities of
Statistics and Data Science in the Modern World" with a threefold goal: to promote
advanced statistical methods in big data sciences, to create an opportunity for the
exchange of ideas among researchers in statistics and data science, and to embrace the
opportunities inherent in the challenges of using statistics and data science in the
modern world.
The Symposium encompassed diverse topics in advanced statistical analysis
in big data sciences, including methods for administrative data analysis, survival
data analysis, missing data analysis, high-dimensional and genetic data analysis,
and longitudinal and functional data analysis; design and analysis of studies
with response-dependent and multiphase designs; time series and robust statistics;
and statistical inference based on likelihood, empirical likelihood, and estimating
functions. This book compiles 12 research articles generated from Symposium
presentations.
Our aim in creating this book was to provide a venue for timely dissemination
of the research presented during the Symposium to promote further research and
collaborative work in advanced statistics. In the era of big data, this collection
of innovative research not only has high potential to have a substantial impact on
the development of advanced statistical models across a wide spectrum of big data
sciences but also has great promise for fostering more research and collaborations
addressing the ever-changing challenges and opportunities of statistics and data
science. The authors have made their data and computer programs publicly available
so that readers can replicate the model development and data analysis presented
in each chapter, enabling them to readily apply these new methods in their own
research.


The 12 chapters are organized into three sections. Part I includes four chapters
that present and discuss data analyses based on latent variable models in data
sciences. Part II comprises four chapters that share a common focus on lifetime data
analyses. Part III is composed of four chapters that address applied data analyses in
big data sciences.
Part I Data Analysis Based on Latent or Dependent Variable Models (Chaps. 1,
2, 3, and 4)
Chapter 1 presents a weighted multiple testing procedure for clinical trials, where
multiple hypotheses are commonly tested. Given this wide use, many researchers have proposed
methods for making multiple testing adjustments to control family-wise error rates
while accounting for the logical relations among the null hypotheses. However, most
of those methods not only disregard the correlation among the endpoints within the
same family but also assume the hypotheses associated with each family are equally
weighted. Authors Enas Ghulam, Kesheng Wang, and Changchun Xie report on
their work in which they proposed and tested a gatekeeping procedure based on
Xie’s weighted multiple testing correction for correlated tests. The proposed method
is illustrated with an example to clearly demonstrate how it can be used in complex
clinical trials.
In Chap. 2, Abbas Khalili, Jiahua Chen, and David A. Stephens consider
the regime-switching Gaussian autoregressive model as an effective platform
for analyzing financial and economic time series. The authors first explain the
heterogeneous behavior in volatility over time and multimodality of the conditional
or marginal distributions and then propose a computationally more efficient regu-
larization method for simultaneous autoregressive-order and parameter estimation
when the number of autoregressive regimes is predetermined. The authors provide
a helpful demonstration by applying this method to analysis of the growth of the US
gross domestic product and US unemployment rate data.
Chapter 3 deals with a practical problem of healthcare use for understanding
the risk factors associated with the length of hospital stay. In this chapter, Cindy
Xin Feng and Longhai Li develop hurdle and zero-inflated models to accommodate
both the excess zeros and skewness of data with various configurations of spatial
random effects. In addition, these models allow for the analysis of the nonlinear
effect of seasonality and other fixed effect covariates. This research draws attention
to considerable drawbacks regarding model misspecifications. The modeling and
inference presented by Feng and Li use the fully Bayesian approach via Markov
Chain Monte Carlo (MCMC) simulation techniques.
Chapter 4 discusses emerging issues in the era of precision medicine and the
development of multi-agent combination therapy or polytherapy. Prior research has
established that, as compared with conventional single-agent therapy (monother-
apy), polytherapy often leads to a high-dimensional dose searching space, especially
when a treatment combines three or more drugs. To overcome the burden of
calibration of multiple design parameters, Ruitao Lin and Guosheng Yin propose
a robust optimal interval (ROI) design to locate the maximum tolerated dose (MTD)
in Phase I clinical trials. The optimal interval is determined by minimizing the
probability of incorrect decisions under the Bayesian paradigm. To tackle high-dimensional
drug combinations, the authors develop a random-walk ROI design to identify the MTD
combination in the multi-agent dose space. The authors of this chapter designed extensive
simulation studies to demonstrate the finite-sample performance of the proposed methods.
Part II Lifetime Data Analysis (Chaps. 5, 6, 7, and 8)
In Chap. 5, Longlong Huang, Karen Kopciuk, and Xuewen Lu present a new
method for group selection in an accelerated failure time (AFT) model with a
group bridge penalty. This method is capable of simultaneously carrying out feature
selection at the group and within-group individual variable levels. The authors
conducted a series of simulation studies to demonstrate the capacity of this group
bridge approach to identify the correct group and correct individual variable even
with high censoring rates. Real data analysis illustrates the application of the
proposed method to scientific problems.
Chapter 6 considers issues around Case I interval censored data, also known
as current status data, commonly encountered in areas such as demography,
economics, epidemiology, and medical science. In this chapter, Pooneh Pordeli and
Xuewen Lu first introduce a partially linear single-index proportional odds model to
analyze these types of data and then propose a method for simultaneous sieve max-
imum likelihood estimation. The resultant estimator of regression parameter vector
is asymptotically normal, and, under some regularity conditions, this estimator can
achieve the semiparametric information bound.
Chapter 7 presents a framework for general empirical likelihood inference of
Type I censored multiple samples. Authors Song Cai and Jiahua Chen develop
an effective empirical likelihood ratio test and efficient methods for distribution
function and quantile estimation for Type I censored samples. This newly developed
approach can achieve high efficiency without requiring risky model assumptions.
The maximum empirical likelihood estimator is asymptotically normal. Simulation
studies show that, as compared to some semiparametric competitors, the proposed
empirical likelihood ratio test has superior power under a wide range of population
distribution settings.
Chapter 8 provides readers with an overview of recent developments in the
joint modeling of longitudinal quality of life (QoL) measurements and survival
time for cancer patients that promise more efficient estimation. Authors Hui Song,
Yingwei Peng, and Dongsheng Tu then propose semiparametric estimation methods
to estimate the parameters in these joint models and illustrate the applications of
these joint modeling procedures to analyze longitudinal QoL measurements and
recurrence times using data from a clinical trial sample of women with early breast
cancer.
Part III Applied Data Analysis (Chaps. 9, 10, 11, and 12)
Chapter 9 presents an interesting discussion of a confidence weighting model
applied to multiple-choice tests commonly used in undergraduate mathematics and
statistics courses. Michael Cavers and Joseph Ling discuss an approach to multiple-
choice testing called the student-weighted model and report on findings based on
the implementation of this method in two sections of a first-year calculus course at
the University of Calgary (2014 and 2015).
Chapter 10 discusses parametric imputation in missing data analysis. Author
Peisong Han proposes to estimate and subtract the asymptotic bias to obtain
consistent estimators. Han demonstrates that the resulting estimator is consistent
if any of the missingness mechanism models or the imputation model is correctly
specified.
Chapter 11 considers one of the basic and important problems in statistics: the
estimation of the center of a symmetric distribution. In this chapter, authors Pengfei
Li and Zhaoyang Tian propose a new estimator by maximizing the smoothed
likelihood. Li and Tian’s simulation studies show that, as compared with the existing
methods, their proposed estimator has much smaller mean square errors under
uniform distribution, t-distribution with one degree of freedom, and mixtures of
normal distributions on the mean parameter. Additionally, the proposed estimator is
comparable to the existing methods under other symmetric distributions.
Chapter 12 presents the work of Jingjia Chu, Reg Kulperger, and Hao Yu in which
they propose a new class of multivariate time series models. Specifically, the authors
propose a multivariate time series model with an additive GARCH-type structure
to capture the common risk among equities. The dynamic conditional covariance
between series is aggregated by a common risk term, which is key to characterizing
the conditional correlation.
As a general note, the references for each chapter are included immediately
following the chapter text. We have organized the chapters as self-contained units
so readers can more easily and readily refer to the cited sources for each chapter.
The editors are deeply grateful to many organizations and individuals for their
support of the research and efforts that have gone into the creation of this collection
of impressive, innovative work. First, we would like to thank the authors of
each chapter for the contribution of their knowledge, time, and expertise to this
book as well as to the Second Symposium of the ICSA–CANADA. Second, our
sincere gratitude goes to the sponsors of the Symposium for their financial support:
the Canadian Statistical Sciences Institute (CANSSI), the Pacific Institute for the
Mathematical Sciences (PIMS), and the Department of Mathematics and Statistics,
University of Calgary; without their support, this book would not have become a
reality. We also owe big thanks to the volunteers and the staff of the University
of Calgary for their assistance at the Symposium. We express our sincere thanks
to the Symposium organizers: Gemai Chen, PhD, University of Calgary; Jiahua
Chen, PhD, University of British Columbia; X. Joan Hu, PhD, Simon Fraser
University; Wendy Lou, PhD, University of Toronto; Xuewen Lu, PhD, University
of Calgary; Chao Qiu, PhD, University of Calgary; Bingrui (Cindy) Sun, PhD,
University of Calgary; Jingjing Wu, PhD, University of Calgary; Grace Y. Yi,
PhD, University of Waterloo; and Ying Zhang, PhD, Acadia University. The editors
wish to acknowledge the professional support of Hannah Qiu (Springer/ICSA Book
Series coordinator) and Wei Zhao (associate editor) from Springer Beijing that made
publishing this book with Springer a reality.
We welcome readers' comments, including notes on typos or other errors, and
look forward to receiving suggestions for improvements to future editions of this
book. Please send comments and suggestions to any of the editors listed below.

Ding-Geng (Din) Chen, MSc, PhD
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Jiahua Chen, MSc, PhD
University of British Columbia, Vancouver, BC, Canada

Xuewen Lu, MSc, PhD
University of Calgary, Calgary, AB, Canada

Grace Y. Yi, MSc, MA, PhD
University of Waterloo, Waterloo, ON, Canada

Hao Yu, MSc, PhD
Western University, London, ON, Canada

July 28, 2016


Contents

Part I Data Analysis Based on Latent or Dependent Variable Models

1 The Mixture Gatekeeping Procedure Based on Weighted Multiple Testing Correction for Correlated Tests
   Enas Ghulam, Kesheng Wang, and Changchun Xie

2 Regularization in Regime-Switching Gaussian Autoregressive Models
   Abbas Khalili, Jiahua Chen, and David A. Stephens

3 Modeling Zero Inflation and Overdispersion in the Length of Hospital Stay for Patients with Ischaemic Heart Disease
   Cindy Xin Feng and Longhai Li

4 Robust Optimal Interval Design for High-Dimensional Dose Finding in Multi-agent Combination Trials
   Ruitao Lin and Guosheng Yin

Part II Life Time Data Analysis

5 Group Selection in Semiparametric Accelerated Failure Time Model
   Longlong Huang, Karen Kopciuk, and Xuewen Lu

6 A Proportional Odds Model for Regression Analysis of Case I Interval-Censored Data
   Pooneh Pordeli and Xuewen Lu

7 Empirical Likelihood Inference Under Density Ratio Models Based on Type I Censored Samples: Hypothesis Testing and Quantile Estimation
   Song Cai and Jiahua Chen

8 Recent Development in the Joint Modeling of Longitudinal Quality of Life Measurements and Survival Data from Cancer Clinical Trials
   Hui Song, Yingwei Peng, and Dongsheng Tu

Part III Applied Data Analysis

9 Confidence Weighting Procedures for Multiple-Choice Tests
   Michael Cavers and Joseph Ling

10 Improving the Robustness of Parametric Imputation
   Peisong Han

11 Maximum Smoothed Likelihood Estimation of the Centre of a Symmetric Distribution
   Pengfei Li and Zhaoyang Tian

12 Modelling the Common Risk Among Equities: A Multivariate Time Series Model with an Additive GARCH Structure
   Jingjia Chu, Reg Kulperger, and Hao Yu

Index
Contributors

Song Cai School of Mathematics and Statistics, Carleton University, Ottawa, ON,
Canada
Michael Cavers Department of Mathematics and Statistics, University of Calgary,
Calgary, AB, Canada
Jiahua Chen Big Data Research Institute of Yunnan University and Department of
Statistics, University of British Columbia, Vancouver, BC, Canada
Jingjia Chu Department of Statistical and Actuarial Sciences, Western University,
London, ON, Canada
Cindy Xin Feng School of Public Health and Western College of Veterinary
Medicine, University of Saskatchewan, Saskatoon, SK, Canada
Enas Ghulam Division of Biostatistics and Bioinformatics, Department of Envi-
ronmental Health, University of Cincinnati, Cincinnati, OH, USA
Peisong Han Department of Statistics and Actuarial Science, University of Water-
loo, Waterloo, ON, Canada
Longlong Huang Department of Mathematics and Statistics, University of Cal-
gary, Calgary, AB, Canada
Abbas Khalili Department of Mathematics and Statistics, McGill University,
Montreal, QC, Canada
Karen Kopciuk Department of Cancer Epidemiology and Prevention Research,
Alberta Health Services, Calgary, AB, Canada
Reg Kulperger Department of Statistical and Actuarial Sciences, Western Univer-
sity, London, ON, Canada
Longhai Li Department of Mathematics and Statistics, University of Saskatchewan,
Saskatoon, SK, Canada

Pengfei Li Department of Statistics and Actuarial Science, University of Waterloo,
Waterloo, ON, Canada
Ruitao Lin Department of Statistics and Actuarial Science, The University of
Hong Kong, Hong Kong, China
Joseph Ling Department of Mathematics and Statistics, University of Calgary,
Calgary, AB, Canada
Xuewen Lu Department of Mathematics and Statistics, University of Calgary,
Calgary, AB, Canada
Yingwei Peng Departments of Public Health Sciences and Mathematics and
Statistics, Queens University, Kingston, ON, Canada
Pooneh Pordeli Department of Mathematics and Statistics, University of Calgary,
Calgary, AB, Canada
Hui Song School of Mathematical Sciences, Dalian University of Technology,
Dalian, Liaoning, China
David A. Stephens Department of Mathematics and Statistics, McGill University,
Montreal, QC, Canada
Zhaoyang Tian Department of Statistics and Actuarial Science, University of
Waterloo, Waterloo, ON, Canada
Dongsheng Tu Departments of Public Health Sciences and Mathematics and
Statistics, Queens University, Kingston, ON, Canada
Kesheng Wang Department of Biostatistics and Epidemiology, East Tennessee
State University, Johnson City, TN, USA
Changchun Xie Division of Biostatistics and Bioinformatics, Department of
Environmental Health, University of Cincinnati, Cincinnati, OH, USA
Guosheng Yin Department of Statistics and Actuarial Science, The University of
Hong Kong, Hong Kong, China
Hao Yu Department of Statistical and Actuarial Sciences, Western University,
London, ON, Canada
Part I
Data Analysis Based on Latent
or Dependent Variable Models
Chapter 1
The Mixture Gatekeeping Procedure Based
on Weighted Multiple Testing Correction
for Correlated Tests

Enas Ghulam, Kesheng Wang, and Changchun Xie

Abstract Hierarchically ordered objectives often occur in clinical trials. Many
multiple testing adjustment methods have been proposed to control family-wise
error rates while taking into account the logical relations among the null hypotheses.
However, most of them disregard the correlation among the endpoints within the
same family and assume the hypotheses within each family are equally weighted.
This paper proposes a gatekeeping procedure based on Xie's weighted multiple testing
correction for correlated tests (Xie, Stat Med 31(4):341–352, 2012). Simulations
have shown that it has power advantages compared to those non-parametric methods
(which do not depend on the joint distribution of the endpoints). An example is given
to illustrate the proposed method and show how it can be used in complex clinical
trials.

1.1 Introduction

In order to obtain better overall knowledge of a treatment effect, the investigators
in clinical trials often collect many endpoints and test the treatment effect for each
endpoint. These endpoints might be hierarchically ordered and logically related.
However, the problem of multiplicity arises when multiple hypotheses are tested.
Ignoring this problem can cause false positive results. Currently, there are two
common types of multiple testing adjustment methods. One is based on controlling
family-wise error rate (FWER), which is the probability of rejecting at least one true
null hypothesis, and the other is based on controlling false discovery rate (FDR),
which is the expected proportion of false positives among all significant hypotheses
(Benjamini and Hochberg 1995; Benjamini and Yekutieli 2001). The gatekeeping
procedures we consider here belong to the type of FWER control.
Consider a clinical trial with multiple endpoints. The hypotheses associated with
these endpoints can be grouped into $m$ ordered families, $F_1, \ldots, F_m$, with
$k_1, \ldots, k_m$ hypotheses, respectively.

When the endpoints are hierarchically ordered with logical relations, many
gatekeeping procedures have been suggested to control the FWER, including serial
gatekeeping (Bauer et al. 1998; Maurer et al. 1995; Westfall and Krishen 2001), parallel
gatekeeping (Dmitrienko et al. 2003), and their generalization called tree-structured
gatekeeping (Dmitrienko et al. 2008, 2007). In a serial gatekeeping procedure, the
hypotheses in $F_i$ are tested only if all hypotheses in the previously examined family
$F_{i-1}$ are rejected; otherwise, the hypotheses in $F_i$ are accepted without testing. In a
parallel gatekeeping procedure, the hypotheses in $F_i$ are tested only if at least one
hypothesis in the previously examined family $F_{i-1}$ is rejected. In the tree-structured
gatekeeping procedure, a hypothesis in $F_i$ is tested only if all hypotheses in one
subset of $F_1, \ldots, F_{i-1}$ (called a serial rejection set) are rejected and at least one
hypothesis in another subset of $F_1, \ldots, F_{i-1}$ (called a parallel rejection set) is
rejected. Recently, Dmitrienko and Tamhane (2011) proposed a new approach to
gatekeeping based on mixtures of multiple testing procedures.

In this paper, we use the mixture method with Xie's weighted multiple testing
correction, which was proposed for a single family of hypotheses, as a component
procedure. We call the resulting procedure the WMTCc-based gatekeeping procedure.
Xie's WMTCc was proposed for multiple correlated tests with different weights and
is more powerful than the weighted Holm procedure. Thus the proposed WMTCc-based
gatekeeping procedure should have an advantage over mixture gatekeeping procedures
based on the Holm procedure, including the Bonferroni parallel gatekeeping multiple
testing procedure.

1.1.1 WMTCc Method

Assume that the test statistics follow a multivariate normal distribution with known
correlation matrix $\Sigma$. Let $p_1, \ldots, p_m$ be the observed p-values for the null
hypotheses $H_0^{(1)}, \ldots, H_0^{(m)}$, respectively, and let $w_i > 0$, $i = 1, \ldots, m$, be the
weight for null hypothesis $H_0^{(i)}$. Note that we do not require that $\sum_{i=1}^{m} w_i = 1$
because it can be seen from Eqs. (1.2) or (1.3) below that the adjusted p-values only
depend on the ratios of the weights. For each $i = 1, \ldots, m$, calculate $q_i = p_i/w_i$.
Then the adjusted p-value for the null hypothesis $H_0^{(i)}$ is

$$
P_{\mathrm{adj},i} = P\big(\min_j q_j \le q_i\big)
= 1 - P\big(\text{all } q_j > q_i\big)
= 1 - P\Big(\bigcap_{j=1}^{m} \{a_j \le X_j \le b_j\}\Big)
= 1 - P\big(\text{all } p_j > p_i w_j / w_i\big),
\tag{1.1}
$$

where $X_j$, $j = 1, \ldots, m$, are standardized multivariate normal with correlation
matrix $\Sigma$ and

$$
a_j = \Phi^{-1}\!\left(\frac{p_i w_j}{2 w_i}\right), \qquad
b_j = \Phi^{-1}\!\left(1 - \frac{p_i w_j}{2 w_i}\right)
\tag{1.2}
$$

for the two-sided case and

$$
a_j = -\infty, \qquad b_j = \Phi^{-1}\!\left(1 - \frac{p_i w_j}{w_i}\right)
\tag{1.3}
$$

for the one-sided case.

The WMTCc therefore first adjusts the $m$ observed p-values for multiple testing by
computing the $m$ adjusted p-values in (1.1). If $P_{\mathrm{adj},i} \le \alpha$, reject the
corresponding null hypothesis $H_0^{(i)}$. Suppose $k_1$ null hypotheses have been rejected;
we then adjust the remaining $m - k_1$ observed p-values for multiple testing after
removing the rejected $k_1$ null hypotheses, using the corresponding correlation matrix
and weights. The procedure continues until no null hypothesis is left after removing
the rejected null hypotheses or no remaining null hypothesis can be rejected.
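To make the computation in (1.1)–(1.2) concrete, the two-sided adjustment can be evaluated with the mvtnorm package used later in Sect. 1.2. The following R sketch is an assumed illustration rather than the authors' code; the helper name wmtcc_adjust and the numerical inputs of the example are hypothetical.

```r
# Two-sided WMTCc adjusted p-values of Eq. (1.1), assuming the test statistics
# are multivariate normal with known correlation matrix Sigma and weights w.
library(mvtnorm)

wmtcc_adjust <- function(p, w, Sigma) {
  m <- length(p)
  sapply(seq_len(m), function(i) {
    # Integration limits a_j, b_j of Eq. (1.2); capped at 1/2 so that a_j <= b_j
    frac <- pmin(p[i] * w / (2 * w[i]), 0.5)
    a <- qnorm(frac)
    b <- qnorm(1 - frac)
    # P_adj,i = 1 - P(a_j <= X_j <= b_j for all j)
    1 - pmvnorm(lower = a, upper = b, sigma = Sigma)[1]
  })
}

# Hypothetical example: three correlated endpoints with unequal weights
Sigma <- matrix(0.5, 3, 3); diag(Sigma) <- 1
wmtcc_adjust(p = c(0.010, 0.030, 0.045), w = c(0.5, 0.3, 0.2), Sigma = Sigma)
```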

1.1.2 Single-Step WMTCc Method

The single-step WMTCc adjusts the $m$ observed p-values for multiple testing by
computing the $m$ adjusted p-values in (1.1). If $P_{\mathrm{adj},i} \le \alpha$, reject the
corresponding null hypothesis $H_0^{(i)}$. Unlike the regular WMTCc, it does not remove
the rejected null hypotheses and recompute the adjusted p-values for the remaining
observed p-values.

1.2 Mixture Gatekeeping Procedures

Following Dmitrienko and Tamhane (2011), we consider a mixture procedure $\mathcal{P}$
constructed from $\mathcal{P}_1$ and $\mathcal{P}_2$ for testing the null hypotheses in the family
$F = F_1 \cup F_2$. Let $K_1 = \{1, 2, \ldots, k_1\}$, $K_2 = \{k_1+1, \ldots, k\}$ and
$K = K_1 \cup K_2 = \{1, \ldots, k\}$ be the index sets of the null hypotheses in $F_1$,
$F_2$ and $F$, respectively. Let $H(I) = \bigcap_{j \in I} H_j$, where $I = I_1 \cup I_2$ with
$I_i \subseteq K_i$, $i = 1, 2$. Let $m_1$ and $m_2$ be the numbers of null hypotheses in
$I_1$ and $I_2$, respectively. Suppose $\mathcal{P}_1$ is the single-step WMTCc and
$\mathcal{P}_2$ is the regular WMTCc procedure. The single-step WMTCc tests and rejects
any intersection hypothesis $H(I_1)$ at level $\alpha$ if
$p_1(I_1) = \min_{i=1}^{m_1} P_{\mathrm{adj},i} \le \alpha$. The regular WMTCc tests and
rejects any intersection hypothesis $H(I_2)$ at level $\alpha$ if
$p_2(I_2) = \min_{i=1}^{m_2} P_{\mathrm{adj},i} \le \alpha$. The mixture procedure tests
$H(I)$ using the intersection p-value

$$
\phi_I\big(p_1(I_1), p_2(I_2)\big) = \min\!\left(p_1(I_1), \frac{p_2(I_2)}{c(I_1, I_2 \mid \alpha)}\right),
\tag{1.4}
$$

where $0 \le c(I_1, I_2 \mid \alpha) \le 1$ must satisfy the equation

$$
P\{\phi_I(p_1(I_1), p_2(I_2)) \le \alpha \mid H(I)\}
= P\{p_1(I_1) \le \alpha \ \text{or} \ p_2(I_2) \le c(I_1, I_2 \mid \alpha)\,\alpha \mid H(I)\}
= \alpha.
\tag{1.5}
$$

The package mvtnorm (Genz et al. 2009) in the R software environment (R Core Team
2013) can be used to calculate $c(I_1, I_2 \mid \alpha)$. If we assume the hypotheses within
family $F_i$, $i = 1, 2$, are correlated but the hypotheses between families are not
correlated, $c(I_1, I_2 \mid \alpha)$ can be defined as $1 - e_1(I_1 \mid \alpha)/\alpha$, where
$e_1(I_1 \mid \alpha) = P\{p_1(I_1) \le \alpha \mid H(I_1)\}$. Note that $c(I_1, I_2 \mid \alpha)$
is then independent of $I_2$.

1.3 Simulation Study

In this section, simulations were performed to estimate the family-wise type I error
rate (FWER) and to compare the power of the two mixture gatekeeping procedures: the
Holm-based gatekeeping procedure and the proposed new WMTCc-based gatekeeping
procedure. In these simulations, two families are considered, each with two endpoints.

We simulated a clinical trial with two correlated endpoints per family and 240
individuals. Each individual had probability 0.5 of receiving the active treatment and
probability 0.5 of receiving a placebo. The two endpoints from each family were
generated from a multivariate normal distribution with correlation $\rho$ chosen as 0.0,
0.3, 0.5, 0.7, and 0.9. The treatment effect size was set to (0,0,0,0), (0.4,0.1,0.4,0.1),
(0.1,0.4,0.1,0.4) and (0.4,0.4,0.4,0.4), where the first two numbers are for the two
endpoints in family 1 and the last two numbers are for the two endpoints in family 2.
The corresponding weights for the four endpoints were (0.6, 0.4, 0.6, 0.4) and (0.9,
0.1, 0.9, 0.1). The observed p-values were calculated using two-sided t-tests of the
treatment coefficient ($\beta = 0$) in linear regressions. The adjusted p-values in the
Holm-based gatekeeping procedure were obtained using the weighted Bonferroni method
for family 1 and the weighted Holm method for family 2. The adjusted p-values in the
proposed WMTCc-based gatekeeping procedure were obtained using the single-step
WMTCc method for family 1 and the regular WMTCc method for family 2, where the
estimated correlations from the simulated data were used for both families. We
replicated the clinical trial 1,000,000 times independently and calculated the
family-wise type I error rate, defined as the number of clinical trials in which at
least one true null hypothesis was rejected, divided by 1,000,000. The results are
shown in Table 1.1.
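For readers who wish to reproduce a single replicate of this design, the R sketch below generates one trial under stated assumptions: the two families are generated independently (consistent with the between-family independence assumption of Sect. 1.2), the endpoints have unit variance so the effect sizes act as mean shifts, and the four raw p-values come from t-tests of the treatment coefficient in separate linear regressions. It is an assumed illustration, not the authors' simulation code; one_trial is a hypothetical helper.

```r
# One simulated trial: 240 subjects, two correlated endpoints per family,
# treatment assigned with probability 0.5, two-sided t-tests of beta = 0.
library(mvtnorm)

one_trial <- function(n = 240, rho = 0.5, effect = c(0.4, 0.1, 0.4, 0.1)) {
  trt <- rbinom(n, 1, 0.5)
  R   <- matrix(c(1, rho, rho, 1), 2, 2)              # within-family correlation
  Y1  <- rmvnorm(n, sigma = R) + trt %o% effect[1:2]   # family 1 endpoints
  Y2  <- rmvnorm(n, sigma = R) + trt %o% effect[3:4]   # family 2 endpoints
  Y   <- cbind(Y1, Y2)
  sapply(1:4, function(j)
    summary(lm(Y[, j] ~ trt))$coefficients["trt", "Pr(>|t|)"])
}

set.seed(1)
one_trial()   # four raw p-values to be fed into the two gatekeeping procedures
```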
From these simulations, we can conclude the following:
1. Both the Holm-based gatekeeping procedure and the proposed WMTCc-based gatekeeping procedure control the family-wise type I error rate very well. The proposed WMTCc-based gatekeeping procedure keeps the family-wise type I error rate at the 5 % level as the correlation between endpoints increases, whereas the family-wise type I error rate of the Holm-based gatekeeping procedure decreases, indicating a loss of power as the correlation increases.
2. The proposed WMTCc-based gatekeeping procedure has higher power of rejecting at least one of the four hypotheses in the two families compared with the Holm-based gatekeeping procedure, especially when the correlation between endpoints is high.
3. The proposed WMTCc-based gatekeeping procedure has a power advantage over the Holm-based gatekeeping procedure for each individual hypothesis in family 1, especially when the correlation between endpoints is high.
4. The proposed WMTCc-based gatekeeping procedure has an advantage over the Holm-based gatekeeping procedure for each individual hypothesis in family 2, especially when the correlation between endpoints is high.

1.4 Example

We follow Dmitrienko and Tamhane (2011)'s example of the schizophrenia trial.
Assume that the sample size per dose group (placebo, low dose and high dose) is
300 patients and the size of the classifier-positive subpopulation is 100 patients per
dose group. Further assume that the t-statistics for testing the null hypotheses of
no treatment effect in the general population and the classifier-positive subpopulation
are given by $t_1 = 2.04$, $t_2 = 2.46$, $t_3 = 2.22$ and $t_4 = 2.66$ with 897 d.f., 897
d.f., 297 d.f. and 297 d.f., respectively. We calculate two-sided p-values for the
four null hypotheses from these t-statistics instead of the one-sided p-values
considered by Dmitrienko and Tamhane (2011). The p-values are $p_1 = 0.042$,
$p_2 = 0.014$, $p_3 = 0.027$ and $p_4 = 0.008$. Dmitrienko and Tamhane (2011) considered
un-weighted procedures; however, for illustration purposes only, we assign different
weights to the tests and apply the weighted Holm-based mixture gatekeeping procedure
and the proposed WMTCc-based mixture gatekeeping procedure. The results are given
in Table 1.2.
Table 1.1 Simulated family-wise error rate and power of the WMTCc-based mixture gatekeeping procedure and the Holm-based mixture gatekeeping procedure

For each setting, the first block of three columns is the Holm-based gatekeeping procedure (Family 1: weighted Bonferroni; Family 2: weighted Holm; FWER or power) and the second block is the WMTCc-based gatekeeping procedure (Family 1: single-step WMTCc; Family 2: WMTCc; FWER or power).

Weight (0.6, 0.4, 0.6, 0.4), effect size (0, 0, 0, 0)
  ρ = 0.0:   3.0, 2.0 | 0.1, 0.1 | 5.0      ||   3.1, 2.0 | 0.1, 0.1 | 5.0
  ρ = 0.3:   3.0, 2.0 | 0.1, 0.0 | 4.8      ||   3.1, 2.1 | 0.1, 0.1 | 5.0
  ρ = 0.5:   3.0, 2.0 | 0.1, 0.1 | 4.7      ||   3.2, 2.2 | 0.1, 0.1 | 5.0
  ρ = 0.7:   3.0, 2.0 | 0.1, 0.1 | 4.3      ||   3.5, 2.3 | 0.1, 0.1 | 5.0
  ρ = 0.9:   3.0, 2.0 | 0.1, 0.1 | 3.7      ||   4.1, 2.7 | 0.2, 0.1 | 5.0

Weight (0.6, 0.4, 0.6, 0.4), effect size (0.4, 0.1, 0.4, 0.1)
  ρ = 0.0:  81.7, 6.0 | 63.2, 3.5 | 82.8    ||  81.8, 6.1 | 63.6, 6.2 | 82.9
  ρ = 0.3:  81.7, 6.0 | 62.7, 3.5 | 82.1    ||  82.1, 6.2 | 63.3, 6.6 | 82.5
  ρ = 0.5:  81.8, 6.1 | 62.7, 3.5 | 82.0    ||  82.6, 6.4 | 63.9, 6.8 | 82.8
  ρ = 0.7:  81.8, 6.0 | 62.6, 3.5 | 81.9    ||  83.4, 6.8 | 65.2, 7.0 | 83.5
  ρ = 0.9:  81.8, 6.1 | 62.6, 3.5 | 81.9    ||  85.0, 7.7 | 68.1, 7.2 | 85.2

Weight (0.6, 0.4, 0.6, 0.4), effect size (0.1, 0.4, 0.1, 0.4)
  ρ = 0.0:   8.2, 77.3 | 3.5, 53.1 | 79.1   ||   8.3, 77.4 | 4.7, 53.7 | 79.3
  ρ = 0.3:   8.2, 77.3 | 3.5, 52.5 | 78.1   ||   8.4, 77.7 | 5.0, 53.2 | 78.5
  ρ = 0.5:   8.2, 77.3 | 3.5, 52.2 | 77.6   ||   8.7, 78.1 | 5.2, 53.4 | 78.5
  ρ = 0.7:   8.2, 77.3 | 3.5, 52.2 | 77.5   ||   9.2, 79.1 | 5.3, 54.7 | 79.3
  ρ = 0.9:   8.2, 77.3 | 3.5, 52.2 | 77.4   ||  10.4, 80.8 | 5.5, 57.7 | 81.1

Weight (0.6, 0.4, 0.6, 0.4), effect size (0.4, 0.4, 0.4, 0.4)
  ρ = 0.0:  81.7, 77.3 | 75.8, 71.3 | 95.8  ||  81.8, 77.4 | 79.7, 78.9 | 95.9
  ρ = 0.3:  81.7, 77.3 | 74.1, 69.8 | 93.2  ||  82.1, 77.7 | 77.7, 76.8 | 93.4
  ρ = 0.5:  81.8, 77.3 | 72.7, 68.5 | 91.1  ||  82.6, 78.1 | 76.4, 75.4 | 91.6
  ρ = 0.7:  81.8, 77.3 | 71.1, 67.0 | 88.6  ||  83.4, 79.1 | 75.3, 74.3 | 89.8
  ρ = 0.9:  81.8, 77.3 | 68.7, 64.8 | 84.9  ||  85.0, 80.8 | 74.7, 73.6 | 87.8

Weight (0.9, 0.1, 0.9, 0.1), effect size (0, 0, 0, 0)
  ρ = 0.0:   4.5, 0.5 | 0.2, 0.0 | 5.0      ||   4.5, 0.5 | 0.2, 0.0 | 5.0
  ρ = 0.3:   4.5, 0.5 | 0.2, 0.0 | 5.0      ||   4.6, 0.5 | 0.2, 0.0 | 5.0
  ρ = 0.5:   4.5, 0.5 | 0.2, 0.0 | 5.0      ||   4.7, 0.5 | 0.2, 0.1 | 5.0
  ρ = 0.7:   4.5, 0.5 | 0.2, 0.0 | 5.0      ||   4.8, 0.5 | 0.2, 0.1 | 5.0
  ρ = 0.9:   4.5, 0.5 | 0.2, 0.0 | 4.5      ||   5.0, 0.5 | 0.2, 0.1 | 5.0

Weight (0.9, 0.1, 0.9, 0.1), effect size (0.4, 0.1, 0.4, 0.1)
  ρ = 0.0:  85.9, 2.1 | 73.0, 1.7 | 86.1    ||  85.9, 2.1 | 73.1, 8.4 | 86.2
  ρ = 0.3:  85.9, 2.1 | 72.9, 1.6 | 85.9    ||  86.0, 2.1 | 73.1, 9.0 | 86.0
  ρ = 0.5:  85.9, 2.1 | 72.9, 1.7 | 85.9    ||  86.2, 2.1 | 73.4, 9.3 | 86.2
  ρ = 0.7:  85.9, 2.1 | 72.9, 1.6 | 85.9    ||  86.5, 2.2 | 73.9, 9.4 | 86.5
  ρ = 0.9:  85.9, 2.1 | 72.9, 1.7 | 85.9    ||  86.8, 2.2 | 74.5, 9.5 | 86.9

Weight (0.9, 0.1, 0.9, 0.1), effect size (0.1, 0.4, 0.1, 0.4)
  ρ = 0.0:  11.1, 60.2 | 2.2, 24.6 | 64.6   ||  11.2, 60.3 | 2.3, 25.2 | 64.7
  ρ = 0.3:  11.1, 60.3 | 2.2, 23.9 | 62.8   ||  11.2, 60.5 | 2.4, 24.4 | 63.0
  ρ = 0.5:  11.1, 60.2 | 2.2, 23.6 | 61.6   ||  11.4, 60.6 | 2.4, 24.2 | 62.0
  ρ = 0.7:  11.1, 60.3 | 2.2, 23.3 | 60.9   ||  11.6, 61.1 | 2.4, 24.1 | 61.7
  ρ = 0.9:  11.1, 60.2 | 2.2, 23.2 | 60.5   ||  12.0, 61.5 | 2.5, 24.5 | 61.8

Weight (0.9, 0.1, 0.9, 0.1), effect size (0.4, 0.4, 0.4, 0.4)
  ρ = 0.0:  85.9, 60.2 | 78.4, 54.1 | 94.4  ||  85.9, 60.3 | 79.0, 75.1 | 94.4
  ρ = 0.3:  85.8, 60.3 | 76.8, 53.2 | 91.7  ||  86.0, 60.5 | 77.4, 73.1 | 91.8
  ρ = 0.5:  85.9, 60.2 | 75.7, 52.6 | 89.8  ||  86.2, 60.6 | 76.4, 72.0 | 90.0
  ρ = 0.7:  85.9, 60.3 | 74.6, 52.0 | 87.8  ||  86.5, 61.1 | 75.7, 71.4 | 88.4
  ρ = 0.9:  85.9, 60.2 | 73.6, 51.5 | 86.1  ||  86.8, 61.5 | 75.2, 72.0 | 87.0

Table 1.2 Adjusted p-values produced by the WMTCc-based mixture gatekeeping procedure and the Holm-based mixture gatekeeping procedure in the schizophrenia trial example with parallel gatekeeping restrictions

Family  Null hypothesis  Weight  ρ     Raw p-value   Adjusted p-value   Adjusted p-value
                                                     (Holm-based)       (WMTCc-based)
F1      H1               0.8     0.5   0.042         0.052              0.049
        H2               0.2           0.014         0.070              0.066
F2      H3               0.8           0.027         –                  0.033
        H4               0.2           0.008         –                  0.039

With FWER = 0.05, the weighted Holm-based mixture gatekeeping procedure does not
reject any of the four hypotheses, while the proposed WMTCc-based mixture
gatekeeping procedure rejects the 1st, 3rd and 4th hypotheses.
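The raw p-values of this example can be checked directly from the reported t-statistics, and the family $F_1$ column of Table 1.2 can be approximated with the wmtcc_adjust sketch from Sect. 1.1.1, using the weights (0.8, 0.2) and $\rho = 0.5$ shown in the table. The snippet below is an assumed illustration only; the family $F_2$ adjusted p-values additionally require the mixture constant $c(I_1, I_2 \mid \alpha)$ of Sect. 1.2 and are not reproduced here.

```r
# Raw two-sided p-values from the reported t-statistics and degrees of freedom
tval <- c(2.04, 2.46, 2.22, 2.66)
df   <- c(897, 897, 297, 297)
round(2 * pt(-abs(tval), df), 3)     # 0.042 0.014 0.027 0.008

# Single-step WMTCc adjustment for family F1 (weights 0.8/0.2, rho = 0.5),
# reusing the wmtcc_adjust() sketch from Sect. 1.1.1
Sigma1 <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
wmtcc_adjust(p = c(0.042, 0.014), w = c(0.8, 0.2), Sigma = Sigma1)
```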

1.5 Concluding Remarks and Discussions

In this paper, we proposed the WMTCc-based mixture gatekeeping procedure.
Simulations have shown that the proposed WMTCc-based gatekeeping procedure,
using the estimated correlation from the data, can control the family-wise type I
error rate very well, as summarized in Table 1.1.

The proposed WMTCc-based gatekeeping procedure has a power advantage over
the Holm-based gatekeeping procedure for each individual hypothesis in the two
families, especially when the correlation $\rho$ is high.
In conclusion, our studies show that the proposed WMTCc-based mixture
gatekeeping procedure based on Xie’s weighted multiple testing correction for
correlated tests outperforms the non-parametric methods in multiple testing in
clinical trials.

References

Bauer P, Röhmel J, Maurer W, Hothorn L (1998) Testing strategies in multi-dose experiments including active control. Stat Med 17(18):2133–2146
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol) 57:289–300
Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29:1165–1188
Dmitrienko A, Offen WW, Westfall PH (2003) Gatekeeping strategies for clinical trials that do not require all primary effects to be significant. Stat Med 22(15):2387–2400
Dmitrienko A, Tamhane AC (2011) Mixtures of multiple testing procedures for gatekeeping applications in clinical trials. Stat Med 30(13):1473–1488
Dmitrienko A, Tamhane AC, Wiens BL (2008) General multistage gatekeeping procedures. Biom J 50(5):667–677
Dmitrienko A, Wiens BL, Tamhane AC, Wang X (2007) Tree-structured gatekeeping tests in clinical trials with hierarchically ordered multiple objectives. Stat Med 26(12):2465–2478
Genz A, Bretz F, Miwa T, Mi X, Leisch F, Scheipl F, Hothorn T (2009) mvtnorm: multivariate normal and t-distributions. R package version 0.9-8. http://CRAN.R-project.org/package=mvtnorm
Maurer W, Hothorn L, Lehmacher W (1995) Multiple comparisons in drug clinical trials and preclinical assays: a-priori ordered hypotheses. Biometrie in der chemisch-pharmazeutischen Industrie 6:3–18
R Core Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Westfall PH, Krishen A (2001) Optimally weighted, fixed sequence and gatekeeper multiple testing procedures. J Stat Plan Inference 99(1):25–40
Xie C (2012) Weighted multiple testing correction for correlated tests. Stat Med 31(4):341–352
Chapter 2
Regularization in Regime-Switching Gaussian
Autoregressive Models

Abbas Khalili, Jiahua Chen, and David A. Stephens

Abstract Regime-switching Gaussian autoregressive models form an effective
platform for analyzing financial and economic time series. They explain the
heterogeneous behaviour in volatility over time and multi-modality of the conditional
or marginal distributions. One important task is to infer the number of regimes and
regime-specific parsimonious autoregressive models. Information-theoretic criteria
such as AIC or BIC are commonly used for such inference, and they typically
evaluate each regime/autoregressive combination separately in order to choose the
optimal model accordingly. However, the number of combinations can be so large
that such an approach is computationally infeasible. In this paper, we first use a
computationally efficient regularization method for simultaneous autoregressive-order
and parameter estimation when the number of autoregressive regimes is pre-determined.
We then use a regularized Bayesian information criterion (RBIC) to select the most
suitable number of regimes. The finite-sample performance of the proposed methods is
investigated via extensive simulations. We also analyze the U.S. gross domestic product
growth and the unemployment rate data to demonstrate this method.

2.1 Introduction

A standard Gaussian autoregressive (AR) model of order $q$ postulates that

$$Y_t = \phi_0 + \phi_1 Y_{t-1} + \cdots + \phi_q Y_{t-q} + \varepsilon_t \tag{2.1}$$

for a discrete-time series $\{Y_t; t = 1, 2, \ldots\}$, where $(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-q})$
and $\varepsilon_t$ are independent and $\varepsilon_t \sim N(0, \sigma^2)$. Under this model, the
conditional variance,
or volatility, of the series is $\mathrm{var}(Y_t \mid Y_{t-1}, \ldots, Y_{t-q}) = \sigma^2$,
which is constant with respect to time. In some financial and econometric applications,
the conditional variance of the time series clearly changes over time. ARCH (Engle 1982)
and GARCH (Bollerslev 1986) models were subsequently motivated to accommodate the
volatility changes. However, the time series may also exhibit heterogeneity in the
conditional mean or the conditional (or marginal) distribution. Such non-standard
behaviours call for more flexible models beyond (2.1), ARCH and GARCH.
Wong and Li (2000) introduced finite mixtures of Gaussian autoregressive (MAR)
models to accommodate the above non-standard behaviour. A MAR model combines
$K$ stationary or non-stationary Gaussian AR processes to capture heterogeneity
while ensuring stationarity of the overall model. Due to the presence of several
AR processes, a MAR model is also termed a regime-switching Gaussian
autoregressive model. MAR models generalize the Gaussian mixture transition
distributions of Le et al. (1996), which were designed to model time series with
non-Gaussian characteristics such as flat stretches, bursts of activity, outliers,
and change points. Wong and Li (2000) used the expectation-maximization (EM)
algorithm of Dempster et al. (1977) for maximum likelihood parameter estimation
in MAR models when the number of regimes is pre-determined. They examined,
through simulations, the performance of information-theoretic criteria such as AIC
and BIC for selecting the number of AR regimes and for the variable/model selection
within each regime.

A parsimonious AR model can be obtained by setting some of $\{\phi_1, \phi_2, \ldots, \phi_q\}$
in (2.1) to zero. Such variable selection is known to lead to more effective subsequent
statistical inferences. Information-theoretic criteria such as AIC or BIC are widely
used to choose the best subset of $\{\phi_1, \phi_2, \ldots, \phi_q\}$. They typically evaluate
$2^q$ possible AR submodels in an exhaustive calculation. When $q$ is large, this is
a formidable computational challenge. The straightforward application of AIC,
BIC or other information criteria to MAR model selection poses an even greater
computational challenge. To overcome the computational obstacle, Jirak (2012)
proposed simultaneous confidence intervals for parameter and order estimation;
Wang et al. (2007) and Nardi and Rinaldo (2011) used the least absolute shrinkage
and selection operator (LASSO) of Tibshirani (1996).

Regularization techniques such as the LASSO of Tibshirani (1996), the smoothly
clipped absolute deviation (SCAD) of Fan and Li (2001), and the adaptive LASSO of
Zou (2006) have been successfully applied in many situations. In this paper we first
present a regularized likelihood approach for simultaneous AR-order and parameter
estimation in MAR models when the number of AR regimes is pre-determined.
The new approach is computationally very efficient compared to existing methods.
Extensive simulations show that the method performs well in a wide range of
finite-sample situations. In some applications, the data analyst must also decide on the
best number of AR regimes ($K$) for a data set. We propose to use a regularized BIC
(RBIC) for choosing $K$. Our simulations show that the RBIC performs well in various
situations.

The rest of the paper is organized as follows. In Sect. 2.2, the MAR model and the
problems of model selection and parameter estimation are introduced. In Sect. 2.3,
we develop new methods for the problems of interest. Our simulation study is given
in Sect. 2.4. We analyze the U.S. gross domestic product (GDP) growth and U.S.
unemployment rate data in Sect. 2.5. Finally, Sect. 2.6 contains some discussion and
conclusions.

2.2 Terminology and Model

Consider an observable time series $\{Y_t : t = 1, 2, \ldots\}$ with corresponding realized
values $\{y_t : t = 1, 2, \ldots\}$, and a latent stochastic process $\{S_t : t = 1, 2, \ldots\}$
taking values in $\{1, 2, \ldots, K\}$, with $K$ being the number of regimes underlying the
time series. In a mixture of Gaussian autoregressive (MAR) model, $\{S_t : t = 1, 2, \ldots\}$
is an iid process, and the conditional distribution of $Y_t \mid (S_t = k, y_{t-1}, \ldots, y_{t-q})$
is presumed normal with variance $\sigma_k^2$ and mean

$$\mu_{t,k} = \phi_{k0} + \phi_{k1} y_{t-1} + \cdots + \phi_{kq} y_{t-q}, \qquad k = 1, 2, \ldots, K. \tag{2.2}$$

Here $q$ is the maximum order that is thought to be reasonable across all $K$ AR
regimes. Let $\boldsymbol{\Phi}_K = (\pi_1, \pi_2, \ldots, \pi_K; \sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2; \boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \ldots, \boldsymbol{\phi}_K)$
denote the vector of all parameters, where $\boldsymbol{\phi}_k = (\phi_{k0}, \phi_{k1}, \ldots, \phi_{kq})^\top$
is the coefficient vector of the $k$th AR regime. As in the usual finite mixture
formulation, in a MAR model the conditional distribution of $Y_t \mid (y_{t-1}, \ldots, y_{t-q})$
is a Gaussian mixture with density

$$f(y_t \mid y_{t-1}, \ldots, y_{t-q}; \boldsymbol{\Phi}_K) = \sum_{k=1}^{K} \pi_k\, \varphi(y_t; \mu_{t,k}, \sigma_k^2) \tag{2.3}$$

where $\Pr[S_t = k] = \pi_k \in (0,1)$ are mixing proportions that sum to one, and
$\varphi(\cdot\,; \mu_{t,k}, \sigma_k^2)$ is the density function of $N(\mu_{t,k}, \sigma_k^2)$.

We assume that in the true MAR model underlying the data some elements of the
vectors $\boldsymbol{\phi}_k$ (except the intercepts $\phi_{k0}$) are zero, which is referred to as a MAR
submodel, as formally defined below.

Subset-AR models, in which the parameter vector $\boldsymbol{\phi} = (\phi_0, \phi_1, \ldots, \phi_q)^\top$
in formulation (2.1) contains zeros, are often used in the time series literature
(Jirak 2012). For each subset $\mathcal{S} \subseteq \{1, 2, \ldots, q\}$, we denote its cardinality by
$|\mathcal{S}|$, introduce the column vector $\mathbf{y}_{t\mathcal{S}} = \{1, y_{t-j} : j \in \mathcal{S}\}^\top$ and the
coefficient sub-vector $\boldsymbol{\phi}[\mathcal{S}] = \{\phi_0, \phi_j : j \in \mathcal{S}\}^\top$; we write
$\boldsymbol{\phi}_k[\mathcal{S}_k]$ when the subset is applied to the $k$th regime. The regime-specific
conditional mean is then $(\boldsymbol{\phi}_k[\mathcal{S}_k])^\top \mathbf{y}_{t\mathcal{S}_k}$. Each combination of
$\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_K$ specifies a MAR submodel with conditional density function

$$f_{[\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_K]}(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-q}; \boldsymbol{\Phi}_K) = \sum_{k=1}^{K} \pi_k\, \varphi(y_t; \mu_{t,k}(\mathcal{S}_k), \sigma_k^2) \tag{2.4}$$

where

$$\mu_{t,k}(\mathcal{S}_k) = (\boldsymbol{\phi}_k[\mathcal{S}_k])^\top \mathbf{y}_{t\mathcal{S}_k} = \phi_{k0} + \sum_{j \in \mathcal{S}_k} \phi_{kj}\, y_{t-j}.$$

Let $(Y_1, Y_2, \ldots, Y_n) \equiv Y_{1:n}$ be a random sample from a MAR model (2.3), with a
joint density function that may be factorized as
$f_1(y_1, y_2, \ldots, y_q)\, f_2(y_{q+1}, \ldots, y_n \mid y_1, y_2, \ldots, y_q)$. As is standard in
time series, we work with the conditional density $f_2(\cdot)$, and the (conditional)
log-likelihood function in a MAR model is given by

$$
\ell_n(\boldsymbol{\Phi}_K) = \log\{f_2(y_{q+1}, \ldots, y_n \mid y_1, y_2, \ldots, y_q)\}
= \sum_{t=q+1}^{n} \log\{f(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-q}; \boldsymbol{\Phi}_K)\}
= \sum_{t=q+1}^{n} \log\left\{\sum_{k=1}^{K} \pi_k\, \varphi(y_t; \mu_{t,k}, \sigma_k^2)\right\}.
\tag{2.5}
$$

In principle, once $K$ is selected, we could carry out maximum (conditional)
likelihood estimation of $\boldsymbol{\Phi}_K$ by maximizing $\ell_n(\boldsymbol{\Phi}_K)$. However, since all of the
estimated AR coefficients would then be non-zero, such an approach does not provide
a MAR submodel as postulated. Instead, we may use information-theoretic
approaches such as AIC and BIC, based on (2.5), to select the MAR submodel (2.4), out
of $2^{Kq}$ possible candidates, that best balances model parsimony and goodness of
fit to the data. The value of $K$ itself may be chosen over a set of possible values
$\{1, 2, \ldots, K^*\}$ for an upper bound $K^*$ specified empirically. The difficulty with this
strategy, however, is that the total number of MAR submodels is $\sum_{K=1}^{K^*} 2^{Kq}$. If
AIC and BIC are used, one would have to compute the criterion for each separate
model, and the computational cost explodes even for moderately sized $K$ and $q$. This
observation motivates us to investigate regularization methods in later sections.
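As a small illustration of (2.5), the R sketch below evaluates the conditional log-likelihood of a MAR model at a given parameter value. The helper name mar_loglik, the storage convention (intercepts in the first row of the coefficient matrix), and the two-regime example are assumptions made for illustration, not part of the chapter.

```r
# Conditional log-likelihood (2.5) of a K-regime MAR model.
# phi: (q+1) x K matrix of AR coefficients (intercepts in row 1),
# pi_k: mixing proportions, sig2: regime variances.
mar_loglik <- function(y, q, pi_k, phi, sig2) {
  n  <- length(y)
  X  <- cbind(1, sapply(1:q, function(j) y[(q + 1 - j):(n - j)]))  # 1, y_{t-1}, ..., y_{t-q}
  yy <- y[(q + 1):n]
  dens <- sapply(seq_along(pi_k), function(k)
            pi_k[k] * dnorm(yy, drop(X %*% phi[, k]), sqrt(sig2[k])))
  sum(log(rowSums(dens)))   # sum over t = q+1, ..., n of the log mixture density
}

# Hypothetical example: a two-regime MAR(2) model evaluated on a simulated series
set.seed(1)
y   <- as.numeric(arima.sim(list(ar = 0.5), n = 300))
phi <- cbind(c(0, 0.5, 0), c(0, -0.3, 0.1))
mar_loglik(y, q = 2, pi_k = c(0.7, 0.3), phi = phi, sig2 = c(1, 2))
```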

2.3 Regularization in MAR Models

2.3.1 Simultaneous AR-Order and Parameter Estimation when K is Known

In the following sections, we investigate regularization of the conditional
likelihood (2.5) and study effective optimization procedures.

A penalty on the mixture component variances: As in conventional Gaussian mixture
models with unequal component variances $\sigma_k^2$, the conditional log-likelihood
$\ell_n(\boldsymbol{\Phi}_K)$ in (2.5) diverges to positive infinity when some component variance
$\sigma_k^2$ goes to 0. This can be avoided by introducing a penalty function as in
Hathaway (1985) and Chen et al. (2008):

$$\tilde{\ell}_n(\boldsymbol{\Phi}_K) = \ell_n(\boldsymbol{\Phi}_K) - \sum_{k=1}^{K} p_n(\sigma_k^2) \tag{2.6}$$

where $p_n(\sigma_k^2)$ is a smooth penalty function of $\sigma_k^2$ such that
$p_n(\sigma_k^2) \to +\infty$ as $\sigma_k \to 0$ or $+\infty$. We refer to (2.6) as the adjusted
conditional log-likelihood. Specifically, we follow Chen et al. (2008) and specify

$$p_n(\sigma_k^2) = \frac{1}{\sqrt{n}}\left\{\frac{V_n^2}{\sigma_k^2} + \log\!\left(\frac{\sigma_k^2}{V_n^2}\right)\right\} \tag{2.7}$$

where $V_n^2 = (n-q)^{-1} \sum_{t=q+1}^{n} (y_t - \bar{y})^2$ is the sample variance of the observed
series and $\bar{y} = (n-q)^{-1} \sum_{t=q+1}^{n} y_t$. From a Bayesian point of view, the use of
this penalty corresponds to a data-dependent Gamma prior on $1/\sigma_k^2$ with its mode at
$1/V_n^2$. In what follows, we work with $\tilde{\ell}_n(\boldsymbol{\Phi}_K)$.

AR-order selection and parameter estimation via regularization: If we directly
maximize the adjusted conditional log-likelihood $\tilde{\ell}_n(\boldsymbol{\Phi}_K)$, the estimates of some
of the AR coefficients $\phi_{kj}$ may be close to, but not exactly equal to, zero. The
resulting full model will not be as parsimonious as required in applications. We achieve
model selection by maximizing the regularized (or penalized) conditional log-likelihood

$$p\ell_n(\boldsymbol{\Phi}_K) = \tilde{\ell}_n(\boldsymbol{\Phi}_K) - r_n(\boldsymbol{\Phi}_K) \tag{2.8}$$

with regularization function

$$r_n(\boldsymbol{\Phi}_K) = \sum_{k=1}^{K} \pi_k \left\{\sum_{j=1}^{q} r_n(\phi_{kj}; \lambda_{nk})\right\} \tag{2.9}$$

for a pre-specified pair $K$ and $q$. The penalty function $r_n(\cdot\,; \lambda)$ is chosen to be
positive, continuous in $\phi$, and to have a spike at $\phi = 0$; $\lambda \ge 0$ is a tuning
parameter controlling the severity of the penalty. When $r_n(\cdot\,; \lambda)$ is appropriately
chosen, maximizing (2.8) leads to some of the $\phi_{kj}$'s having fitted values exactly
equal to zero. Furthermore, increasing the size of $\lambda_{nk}$ generally forces more of the
fitted values of the $\phi_{kj}$'s to be zero. Consequently, such a procedure performs
simultaneous AR-order and parameter estimation. This procedure does not evaluate
every possible MAR submodel and is thereby computationally feasible.

Examples of penalties: Forms of $r_n(\cdot\,; \lambda)$ with the desired properties are the
LASSO penalty

$$r_n(\phi; \lambda) = \lambda_n |\phi|,$$

and the SCAD penalty, which is most often characterized by its first derivative:

$$r_n'(\phi; \lambda) = \lambda_n \left\{I(|\phi| \le \lambda) + \frac{(a\lambda - |\phi|)_+}{(a-1)\lambda}\, I(|\phi| > \lambda)\right\} \mathrm{sgn}(\phi)$$

for some constant $a > 2$, where $I(\cdot)$ and $\mathrm{sgn}(\cdot)$ are the indicator and sign
functions, and $(\cdot)_+$ is the positive part of the input. Fan and Li (2001) showed that
the value $a = 3.7$ minimizes a Bayes risk criterion for the SCAD. This choice of $a$ has
since become standard in various model selection problems, and we used it in our
simulations and data analysis.

For the MAR model (2.3), where the $S_t$ are iid samples, the theoretical proportion
of $S_t = k$ is given by $\pi_k$. Thus, we choose the penalties in (2.9) to be proportional
to the mixing probabilities $\pi_k$ to control the level of the regime-specific penalty on
the $\phi_{kj}$'s. This improves the finite-sample performance of the method, and the
influence vanishes as the sample size $n$ increases.
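For concreteness, the LASSO and SCAD penalty derivatives described above can be coded as follows. This is an assumed illustration with $a = 3.7$, not code from the authors; the function names are hypothetical.

```r
# First derivatives of the LASSO and SCAD penalties r_n(phi; lambda)
lasso_deriv <- function(phi, lambda) lambda * sign(phi)

scad_deriv <- function(phi, lambda, a = 3.7) {
  u <- abs(phi)
  lambda * (as.numeric(u <= lambda) +
            pmax(a * lambda - u, 0) / ((a - 1) * lambda) * as.numeric(u > lambda)) *
    sign(phi)
}

# The SCAD derivative equals lambda near zero (which produces sparsity) and
# drops to zero for |phi| > a * lambda (leaving large coefficients nearly unbiased).
phi <- seq(-3, 3, by = 0.01)
plot(phi, scad_deriv(phi, lambda = 0.5), type = "l",
     ylab = "derivative", main = "SCAD penalty derivative, lambda = 0.5")
```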

2.3.2 Numerical Computations

The maximization of the penalized conditional log-likelihood for a $K$-regime MAR
model with maximal AR order $q$ is an optimization over a space of dimension
$(K-1) + K(q+2)$; for example, with $K = 5$ and $q = 10$, the number of parameters is 64.
This number is large, but direct optimization using Nelder–Mead or quasi-Newton
methods (via optim in R) is still possible when a local quadratic approximation to
the penalty is adopted: following Fan and Li (2001), the approximation

$$r_n(\phi_{kj}; \lambda_{nk}) \simeq r_n(\phi_0; \lambda_{nk}) + \frac{r_n'(\phi_0; \lambda_{nk})}{2\phi_0}\,(\phi_{kj}^2 - \phi_0^2) \tag{2.10}$$

holds in a neighbourhood of a current value $\phi_0$ and may be used. Coordinate-based
methods operating on the incomplete-data likelihood may also be useful.

In this paper, however, we use a modified EM algorithm for maximization of the
penalized log-likelihood $p\ell_n(\boldsymbol{\Phi}_K)$ in (2.8). Let
$\hat{\boldsymbol{\Phi}}_{n,K} = \arg\max\{p\ell_n(\boldsymbol{\Phi}_K)\}$ denote the maximum penalized conditional
likelihood estimator (MPCLE) of $\boldsymbol{\Phi}_K$. By tuning the level of the penalty
$\lambda_{nk}$, this estimator has its $\hat{\boldsymbol{\phi}}_k$ components containing various numbers of
zero fitted values, and parsimony is induced.

2.3.2.1 EM-Algorithm

For observation yt , let Ztk , with realization ztk , equal 1 if St D k, and equal 0
otherwise. The complete conditional adjusted log-likelihood function under the
MAR model is given by

X
K X
n   XK
Qlcn .˚ K / D ztk log k C log .yt I t;k ; k2 /  pn .k2 /
kD1 tDqC1 kD1

and thus the penalized complete conditional log-likelihood is plcn .˚ K / D Qlcn .˚ K / 


rn .˚ K /:
Let x> t D .1; yt1 ; yt2 ; : : : ; ytq /, X D .xqC1 ; xqC2 ; : : : ; xn /> , y D
.m/
.yqC1 ; yqC2 ; : : : ; yn /> . Given the current value of the parameter vector ˚ K , the
EM algorithm iterates as follows:

E-step: We compute the expectation of the latent Ztk variables conditional on the
other parameters and the data. At .mC1/-th iteration, the EM objective function is

X
K X
n  
.m/ .m/
Q.˚ K I ˚ K / D !tk log k C log .yt I t;k ; k2 /
kD1 tDqC1

X
K X
K X
q
 pn .k2 /  k rn .kj I nk /
kD1 kD1 jD1

with weights

.m/ .m/ 2.m/


.m/ .m/ k .yt I t;k ; k /
!tk D E.Ztk jyI ˚ K / D PK .m/ .m/ 2.m/
:
lD1 l .yt I t;l ; l /

M-step: Using the penalty p_n(σ_k²) in (2.7) and the quadratic approximation (2.10)
for r_n(φ_kj; λ_nk), we maximize the (approximated) Q(Φ_K; Φ_K^(m)) with respect to Φ_K.
Note that the quadratic approximation to the penalty provides a minorization of the
exact objective function, ensuring that the iterative algorithm retains the ascent
property and converges to a maximum of the penalized likelihood function.
The EM updates of the regime-specific AR-coefficients and variances, for k = 1, …, K, are

  φ_k^(m+1) = (X^⊤ W_k^(m) X + Σ_k^(m))^{-1} X^⊤ W_k^(m) y

  σ_k^{2(m+1)} = [Σ_{t=q+1}^{n} ω_tk^(m) (y_t − x_t^⊤ φ_k^(m+1))² + 2V σ̃_n²/√n] / [Σ_{t=q+1}^{n} ω_tk^(m) + 2/√n]

with diagonal matrices W_k^(m) = diag{ω_tk^(m); t = q+1, …, n} and
Σ_k^(m) = diag{0, π_k^(m) σ_k^{2(m)} r_n'(φ_kj^(m); λ_nk)/φ_kj^(m), j = 1, …, q}.
For the mixing probabilities, the updates are

  π_k^(m+1) = (1/(n − q)) Σ_{t=q+1}^{n} ω_tk^(m),   k = 1, 2, …, K,

which maximize the leading term in Q(Φ_K; Φ_K^(m)) and worked well in our simulation
study.
Starting from an initial value Φ_K^(0), the EM algorithm continues until some
convergence criterion is met. We used the stopping rule ‖Φ_K^(m+1) − Φ_K^(m)‖ ≤ ε
for a pre-specified small value ε, taken to be 10^{-5} in our simulations and data analysis.
Due to the sparse selection penalty r_n(φ_kj; λ_nk), some of the estimates φ̂_kj will be
very close to zero at convergence; these estimates are set to zero. Thus we achieve
simultaneous AR-order and parameter estimation.
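A compact R sketch of one M-step under the quadratic approximation is given below. Here Sigma_k is the diagonal matrix Σ_k^(m) built from the penalty derivative, and the constants attached to σ_k² are written generically as a_n and sigma2_tilde because they depend on the exact form of p_n(σ²) in (2.7), which is not restated here; all function and variable names are illustrative rather than the authors' own code.

```r
# One M-step update for regime k, given E-step weights w (length n-q),
# design matrix X with rows x_t' = (1, y_{t-1}, ..., y_{t-q}), response y,
# and penalty-derivative weights pen_w = r'(phi_kj; lambda)/phi_kj for j = 1..q.
m_step_regime <- function(X, y, w, pen_w, pi_k, sigma2_k, sigma2_tilde, a_n) {
  Sigma_k <- diag(c(0, pi_k * sigma2_k * pen_w))   # no penalty on the intercept phi_k0
  W <- diag(w)
  phi_new <- solve(t(X) %*% W %*% X + Sigma_k, t(X) %*% W %*% y)
  resid2  <- w * (y - X %*% phi_new)^2
  # Variance update; the a_n * sigma2_tilde terms stand in for the p_n(sigma^2) penalty.
  sigma2_new <- (sum(resid2) + 2 * a_n * sigma2_tilde) / (sum(w) + 2 * a_n)
  list(phi = drop(phi_new), sigma2 = sigma2_new)
}

# Mixing probabilities: pi_k^(m+1) is the average of the E-step weights for regime k.
update_pi <- function(omega) colMeans(omega)
```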

2.3.2.2 Tuning of λ in r_n(·; λ)

One remaining issue in the implementation of the regularization method is the
choice of the tuning parameter for each regime. We recommend a regime-specific
information criterion together with a grid search scheme as follows.
We first fit the full MAR model by finding the maximum point of l̃_n(Φ_K) defined
by (2.6) using the same EM algorithm as above. Once the maximum point Φ̃_K of l̃_n(Φ_K)
is obtained, we compute

  ω̃_tk = π̃_k φ(y_t; μ̃_t,k, σ̃_k²) / Σ_{l=1}^{K} π̃_l φ(y_t; μ̃_t,l, σ̃_l²).

This is the fitted probability of S_t = k, conditional on y, and based on the fitted full
model.
Next, we pre-choose a grid of λ-values {λ_1, λ_2, …, λ_M} for some M, say M = 10
or 15. For each λ_i and regime k, we define a regime-specific regularized likelihood
function

  l̃_k(φ, σ²) = Σ_{t=q+1}^{n} ω̃_tk log φ(y_t; μ_t, σ²) − p_n(σ²) − π̃_k Σ_{j=1}^{q} r_n(φ_j; λ_i)

with μ_t = x_t^⊤ φ. We then search for its maximum point with respect to φ and σ²,
denoted φ̄_k(λ_i) and σ̄_k²; compute μ̄_t,k,i = x_t^⊤ φ̄_k(λ_i), and the residual sum of squares (RSS)

  RSS_k(λ_i) = Σ_{t=q+1}^{n} ω̃_tk (y_t − μ̄_t,k,i)².

The weights ω̃_tk are included because observation y_t may not be from regime k.
The regime-specific information criterion is computed as

  IC(λ_i; k) = n_k log[RSS_k(λ_i)] + DF(λ_i) log(n_k)        (2.11)

where DF(λ_i) = Σ_{j=1}^{q} I(φ̄_kj ≠ 0) and n_k = Σ_{t=q+1}^{n} ω̃_tk. This information criterion
mimics the one used in linear regression by Zhang et al. (2010). We choose the value
of the tuning parameter for regime k as

  λ̃_k = argmin_{1 ≤ i ≤ M} IC(λ_i; k).
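In R, the tuning step can be sketched as a simple grid search; fit_penalized_regime stands in for a routine maximizing l̃_k(φ, σ²) at a fixed λ (not shown), so the fragment only illustrates how IC(λ_i; k) of (2.11) is assembled. Names are illustrative.

```r
# Regime-specific tuning: pick lambda minimizing IC(lambda; k) of (2.11).
select_lambda <- function(lambda_grid, X, y, w_tilde_k, fit_penalized_regime) {
  ic <- sapply(lambda_grid, function(lam) {
    fit <- fit_penalized_regime(X, y, w_tilde_k, lam)  # assumed to return list(phi, sigma2)
    mu_bar <- drop(X %*% fit$phi)
    rss <- sum(w_tilde_k * (y - mu_bar)^2)
    df  <- sum(fit$phi[-1] != 0)     # non-zero AR coefficients (intercept excluded)
    nk  <- sum(w_tilde_k)            # effective sample size for regime k
    nk * log(rss) + df * log(nk)
  })
  lambda_grid[which.min(ic)]
}
```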

2.3.3 Choice of the Mixture-Order or Number of AR Regimes K

The procedure presented in the last subsection is used when the number of AR
regimes K is pre-specified. However, a data-adaptive choice of K is needed in most
applications. We now propose a regularized BIC (RBIC) for choosing K.
Consider the situation where an upper bound K̄ on K can be specified. For
each K ∈ {1, 2, …, K̄}, we fit a MAR model as above with resulting estimates
denoted by Φ̂_n,K. Let N_K = Σ_{k=1}^{K} Σ_{j=1}^{q} I(φ̂_kj ≠ 0) be the total number of non-zero
φ̂_kj's, and

  RBIC(Φ̂_n,K) = l_n(Φ̂_n,K) − 0.5 (N_K + 3K − 1) log(n − q)        (2.12)

where 3K − 1 counts the number of parameters (π_k, σ_k², φ_k0) and l_n(·) is the conditional
log-likelihood given in (2.5). We then select the estimated number of AR regimes
K̂_n as

  K̂_n = arg max_{1 ≤ K ≤ K̄} RBIC(Φ̂_n,K).        (2.13)

We now have an estimated MAR model with K̂_n AR regimes, and the regime-specific
AR models are characterized by the corresponding φ̂_k, k = 1, 2, …, K̂_n.
Extensive simulation studies show that the RBIC performs well in various
situations. It is noteworthy that, for each K, RBIC is computed based on the outcome
Φ̂_n,K of the regularization method outlined in (2.8) and (2.9), which is obtained
after examining a grid of, say, M = 10 or 15 possible values for the tuning
parameters λ_nk. In comparison, the standard BIC adopted in Wong and Li (2000)
must examine 2^{Kq} possible MAR submodels. The RBIC thus offers a substantial
computational advantage unless K and q are both very small.
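The RBIC of (2.12)–(2.13) can be assembled from the fitted objects as in the R sketch below, where fits is assumed to be a list of penalized fits indexed by K, each carrying the conditional log-likelihood loglik and the coefficient matrix phi (one row per regime, intercept in the first column); this structure is hypothetical and used only for illustration.

```r
# Regularized BIC: RBIC(K) = l_n(Phi_hat) - 0.5 * (N_K + 3K - 1) * log(n - q).
rbic <- function(fit, n, q) {
  K  <- nrow(fit$phi)
  NK <- sum(fit$phi[, -1] != 0)      # non-zero AR coefficients across all regimes
  fit$loglik - 0.5 * (NK + 3 * K - 1) * log(n - q)
}

# Choose K maximizing RBIC over the fitted candidates K = 1, ..., K_max.
select_K <- function(fits, n, q) which.max(sapply(fits, rbic, n = n, q = q))
```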

2.4 Simulations

In this section we study, via simulations, the performance of the proposed regularization
method for AR-order and parameter estimation, and of the RBIC for selection of the
number of AR regimes (mixture-order). We generated time series data from five
MAR models. The parameter settings for the first four models are:

Model  (K*, q)   (π_1, π_2)   (σ_1, σ_2)   μ_t,1                       μ_t,2
1      (2, 5)    (.75, .25)   (5, 1)       .50y_{t−1}                  1.3y_{t−1}
2      (2, 5)    (.75, .25)   (5, 1)       .70y_{t−1} − .65y_{t−2}     .45y_{t−1} − 1.2y_{t−3}
3      (2, 6)    (.75, .25)   (5, 1)       .67y_{t−1} − .55y_{t−2}     .45y_{t−1} + .35y_{t−3} − .65y_{t−6}
4      (2, 15)   (.65, .35)   (3, 1)       .58y_{t−1} − .45y_{t−6}     .56y_{t−1} − .40y_{t−3} + .44y_{t−12}

Note that q in the above table is the pre-specified maximum AR-order, and K* is
the true number of AR regimes in a MAR model. By Theorem 1 of Wong and Li
(2000), a necessary and sufficient condition for a MAR model to be first-order
stationary is that all roots of 1 − Σ_{j=1}^{q} Σ_{k=1}^{K} π_k φ_kj z^j = 0 lie outside the unit
circle. The parameter values π_k and φ_kj in Models 1–4 are chosen to ensure, at least,
that this condition is satisfied.
The fifth MAR model under our consideration is a three-regime MAR with
parameter values:

Model  (K*, q)   (π_1, π_2, π_3)   (σ_1, σ_2, σ_3)   μ_t,1                      μ_t,2        μ_t,3
5      (3, 5)    (.4, .3, .3)      (1, 1, 5)         .9y_{t−1} − .6y_{t−2}      .5y_{t−1}    1.5y_{t−1} − .75y_{t−2}

The maximum AR-order specified in the regularization method for this model is
also q = 5, and the values of π_k and φ_kj are chosen such that the MAR model is, at
least, first-order stationary as defined above.
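For reference, a series from a K-regime MAR model such as Model 1 can be simulated with a short R function; the burn-in length and starting values below are arbitrary choices made for this illustration, and the Model 1 settings are taken from the table above.

```r
# Simulate n observations from a K-regime Gaussian MAR model.
# phi: K x (q+1) matrix of (intercept, AR coefficients); pi_k, sigma: length-K vectors.
sim_mar <- function(n, pi_k, phi, sigma, burn = 200) {
  q <- ncol(phi) - 1
  y <- numeric(n + burn + q)
  for (t in (q + 1):(n + burn + q)) {
    k  <- sample(length(pi_k), 1, prob = pi_k)               # latent regime S_t
    mu <- phi[k, 1] + sum(phi[k, -1] * y[(t - 1):(t - q)])    # mu_{t,k}
    y[t] <- rnorm(1, mean = mu, sd = sigma[k])
  }
  tail(y, n)
}

# Model 1: (K*, q) = (2, 5), with only the first AR lag active in each regime.
set.seed(1)
y1 <- sim_mar(400, pi_k = c(.75, .25),
              phi = rbind(c(0, .50, 0, 0, 0, 0), c(0, 1.3, 0, 0, 0, 0)),
              sigma = c(5, 1))
```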

2.4.1 Simulation when K* is Specified

In this section we assess the performance of the regularization method for
AR-order and parameter estimation when the number of AR regimes (K*) of the
MAR model is specified. We use the EM algorithm outlined in Sect. 2.3.2 to maximize the
regularized likelihood defined in (2.8). The regime-specific tuning parameters λ_nk
in r_n(·; λ_nk) are chosen by the criterion IC in (2.11). The computations are done
in C++ on a Mac OS X machine with a 2 × 2.26 GHz Quad-Core Intel Xeon
processor.

Table 2.1 Correct (Cor) and incorrect (Incor) numbers of estimated zero φ_kj's in Models 1, 2 and 3.
The numbers inside [ , ] are the true Cor in regimes Reg1 and Reg2 of each model

                      Model 1           Model 2           Model 3
Method  n    Regimes  Cor[4,4]  Incor   Cor[3,3]  Incor   Cor[4,3]  Incor
BIC     150  Reg1     3.83      .024    2.92      .001    3.87      .014
             Reg2     3.56      .025    2.81      .005    2.76      .023
        250  Reg1     3.92      .000    2.95      .000    3.93      .001
             Reg2     3.88      .000    2.90      .000    2.92      .002
        400  Reg1     3.95      .000    2.96      .000    3.93      .000
             Reg2     3.93      .000    2.95      .000    2.95      .000
SCAD    150  Reg1     3.93      .022    2.97      .000    3.89      .013
             Reg2     3.85      .002    2.91      .004    2.88      .013
        250  Reg1     3.99      .001    3.00      .000    3.98      .003
             Reg2     3.98      .000    2.99      .000    2.98      .003
        400  Reg1     4.00      .000    3.00      .000    4.00      .000
             Reg2     4.00      .000    3.00      .000    3.00      .000

The simulation results are based on 1000 randomly generated time series of
a given size from each of the five models, and they are reported in the form of
regime-specific correct (Cor) and incorrect (Incor) numbers of estimated zero AR-coefficients
φ_kj, and the empirical mean squared errors (MSE_k) of the estimators φ̂_k
of the vectors of AR-coefficients φ_k. The Reg_k in the tables represent the AR regimes of
each MAR model.
For Models 1–3, it is computationally feasible to implement the standard BIC of
Wong and Li (2000). Therefore, we also report the results based on BIC together
with their computational costs. For Models 4 and 5, the amount of computation required by
BIC is infeasible, so BIC is not included in those simulations.
Tables 2.1 and 2.2 contain the simulation results based on the SCAD regularization
method and the standard BIC for Models 1, 2 and 3. From Table 2.1, the
regularization method clearly outperforms BIC by having higher rates of correctly
(Cor) estimated zero AR-coefficients and lower rates of incorrectly (Incor) estimated
zero AR-coefficients, in both regimes Reg1 and Reg2 of the three MAR models.
Both methods improve as the sample size increases. Table 2.2 provides the regime-specific
empirical mean squared errors (MSE_k) of the estimators φ̂_k. For n = 150,
SCAD outperforms BIC in all three models, especially with respect to MSE_2.
When the sample size increases, the two methods have similar performance.
Regarding computational time, the regularization method took about 6–16 s for
n = 150 and 400, respectively, to complete 1000 replications for each of the three
models. The BIC took about 2.78 and 7.83 h for Models 1 and 2, and it took 22.9
and 143.8 h for Model 3, when n = 150 and 400, respectively.

Table 2.2 Regime-specific empirical mean squared errors (MSE) in Models 1, 2 and 3

                  Model 1         Model 2         Model 3
Method   n        MSE1    MSE2    MSE1    MSE2    MSE1    MSE2
BIC      150      .027    .121    .013    .014    .034    .029
         250      .008    .002    .006    .002    .015    .004
         400      .004    .001    .004    .001    .008    .002
SCAD     150      .024    .010    .014    .011    .035    .015
         250      .010    .001    .005    .002    .015    .005
         400      .006    .001    .003    .001    .007    .002

Table 2.3 Correct (Cor) and incorrect (Incor) numbers of estimated zero φ_kj's, and regime-specific
empirical mean squared errors (MSE) in Model 4. The numbers inside [ , ] are the true Cor in
regimes Reg1 and Reg2 of the model

                       Cor[13,12]        MSE
Method   n             Reg1     Reg2     MSE1    MSE2
SCAD     250   Cor     12.8     11.3     .066    .089
               Incor   .027     .100
         400   Cor     12.9     11.9     .014    .031
               Incor   .002     .058
         600   Cor     13.0     11.9     .007    .016
               Incor   .000     .037
         800   Cor     13.0     12.0     .005    .008
               Incor   .000     .014
         1000  Cor     13.0     12.0     .004    .003
               Incor   .000     .000
LASSO    250   Cor     12.8     11.3     .056    .076
               Incor   .019     .096
         400   Cor     12.9     11.8     .021    .027
               Incor   .004     .045
         600   Cor     12.9     11.9     .013    .014
               Incor   .000     .021
         800   Cor     13.0     12.0     .011    .008
               Incor   .000     .004
         1000  Cor     13.0     12.0     .009    .005
               Incor   .000     .000

Table 2.3 contains the simulation results for MAR Model 4, which has higher
AR-orders. Overall, the new method based on either SCAD or LASSO performed
very well. It took SCAD about 118 and 787 s, corresponding to n = 250 and
1000, respectively, to complete the 1000 replications. The LASSO had similar
computational times. The BIC is computationally infeasible here.

Table 2.4 Correct (Cor) and incorrect (Incor) numbers of estimated zero φ_kj's, and regime-specific
empirical mean squared errors (MSE) in Model 5. The numbers inside [ , , ] are the true Cor in
regimes Reg1, Reg2 and Reg3 of the model

                      Cor[3,4,3]                 MSE
Method   n            Reg1    Reg2    Reg3       MSE1    MSE2    MSE3
SCAD     150   Cor    2.97    3.96    2.70       .006    .002    .285
               Incor  .003    .000    .090
         250   Cor    2.99    4.00    2.89       .001    .001    .067
               Incor  .000    .000    .025
         400   Cor    3.00    4.00    2.97       .001    .000    .019
               Incor  .000    .000    .003
LASSO    150   Cor    2.96    3.95    2.51       .008    .002    .368
               Incor  .004    .000    .165
         250   Cor    2.99    4.00    2.67       .002    .000    .229
               Incor  .000    .000    .077
         400   Cor    3.00    4.00    2.80       .002    .000    .145
               Incor  .002    .000    .023

Model 5 has K* = 3, and its simulation results are in Table 2.4. The
regularization method performs reasonably well in both AR-order selection and
parameter estimation. Compared with regimes 1 and 2, the method has lower rates
of correctly estimated zero AR-coefficients, higher rates of incorrectly estimated
zero AR-coefficients and also larger mean squared errors for regime 3. This is more
evident for LASSO. This is expected because the noise level (σ_3 = 5) in Reg3
is much higher; consequently, it is harder to maintain the same level of accuracy
for AR-order selection and parameter estimation. When the sample size increases,
the regularization method has improved precision, whether SCAD or LASSO is
used. The regularization method took about 13 and 34 s, for n = 150 and 400,
respectively, to complete the simulations.

2.4.2 Selection of K

In this section we examine the performance of the estimator K̂_n in (2.13). We report
the observed distribution of K̂_n based on 1000 replications. The results for Models
1–3 and Model 5 are reported in Table 2.5. Model 4 has a more complex regime
structure; it is therefore examined more closely with additional sample sizes, and its
results are singled out in Table 2.6. For each model, K̂_n is calculated based on the
RBIC. Note that if we replace the factor log(n − q) in (2.12) by the number 2, we obtain
an AIC-motivated RAIC selection method. We also obtained simulation results
based on RAIC to serve as a potential yardstick.

Table 2.5 Simulated distribution of the mixture-order estimator K̂_n. Results for the true order K*
are in bold. Values in [ ] are the proportion of concurrently correct estimation of the regime-specific
AR-orders

Models:         1 (K* = 2)          2 (K* = 2)          3 (K* = 2)          5 (K* = 3)
n     K         RBIC       RAIC     RBIC       RAIC     RBIC       RAIC     RBIC       RAIC
150   1         .131       .000     .031       .004     .023       .000     .000       .000
      2         .855[.838] .207     .965[.962] .468     .957[.936] .215     .016       .000
      3         .014       .150     .004       .177     .019       .169     .945[.938] .589
      4 or 5    .000       .643     .000       .351     .001       .616     .039       .411
250   1         .008       .000     .002       .002     .000       .000     .002       .002
      2         .978[.973] .312     .995[.995] .572     .996[.994] .217     .000       .000
      3         .014       .190     .003       .171     .004       .185     .986[.985] .768
      4 or 5    .000       .498     .000       .255     .000       .598     .012       .230
400   1         .002       .000     .004       .004     .000       .000     .001       .001
      2         .998[.998] .327     .996[.996] .632     .999[.999] .220     .001       .001
      3         .000       .194     .000       .158     .001       .229     .996[.996] .836
      4 or 5    .000       .479     .000       .206     .000       .551     .002       .162

Table 2.6 Simulated distribution of the mixture-order estimator K̂_n for Model 4 (K* = 2). Results
for the true order K* are in bold. Values in [ ] are the proportion of concurrently correct estimation
of the regime-specific AR-orders

          n = 250            n = 400            n = 600            n = 800            n = 1000
K         RBIC      RAIC     RBIC      RAIC     RBIC      RAIC     RBIC      RAIC     RBIC      RAIC
1         .145      .002     .012      .000     .003      .000     .001      .000     .000      .000
2         .814[.808] .327    .949[.949] .431    .967[.967] .492    .983[.983] .530    .999[.999] .551
3         .038      .264     .035      .244     .024      .217     .011      .190     .000      .187
4 or 5    .003      .407     .004      .325     .006      .291     .005      .280     .001      .262

In the tables, the results corresponding to the true order K* are in bold, and the values
inside [ ] are the proportion of times that both the mixture-order and the regime-specific
AR-orders are selected correctly.
From Table 2.5, when the sample size is n = 150, the success rates of RBIC are
85.5%, 96.5% and 95.7% for Models 1–3, respectively. The success rate of RBIC
is 94.5% for Model 5. As the sample size increases to n = 400, all success rates
exceed 99%. Overall, the RBIC performs well. As expected, the RAIC tends to select
higher orders in all cases.
We now focus on the results in Table 2.6 for Model 4 (K* = 2). The success
rate of RBIC is 81.4% when n = 250, and it improves to 94.9% and 99.9% when
n = 400 and n = 1000. Note that RAIC severely over-estimates the order even when
n = 1000.

2.5 Real Data Examples

2.5.1 U.S. Gross Domestic Product (GDP) Growth

We analyze the data comprising the quarterly GDP growth rate (Y_t) of the U.S.
over the period from the first quarter of 1947 to the first quarter of 2011. The
data are obtained from the U.S. Bureau of Economic Analysis website http://www.
bea.gov. Figure 2.1 contains the time series plot, the histogram and the sample
autocorrelation function (ACF) of the 256 observations of Y_t. The time series plot shows
that the variation in the series changes over time, and the histogram of the series
is multimodal. This motivates us to consider fitting a MAR model to these data. The
ACF plot indicates that the sample autocorrelations at the first two lags are
significant. Thus, we set the maximum AR-order to q = 5, applied our method of Sect. 2.3,
and fitted MAR models with K = 1, 2, 3, 4 to this data set. The RBIC values for
K = 1, 2, 3, 4 are −351.66, −343.01, −344.14 and −345.36. Thus, we select K̂ = 2.

Fig. 2.1 (a) and (b) The time series plot and histogram of the U.S. GDP data. (c) The ACF of the
U.S. GDP data

The fitted conditional density function of the model is given by

  f(y_t | y_{t−1}) = .276 φ(y_t; .702, .286²) + .724 φ(y_t; .497 + .401y_{t−1}, 1.06²).

The standard errors of the estimators (φ̂_10, φ̂_20, φ̂_21) are (.060, .092, .080). The
estimated conditional variance of Y_t is

  Var̂(Y_t | y_{t−1}) = Σ_{k=1}^{2} π̂_k σ̂_k² + Σ_{k=1}^{2} π̂_k μ̂²_{k,t} − (Σ_{k=1}^{2} π̂_k μ̂_{k,t})²
                     = .838 − .033 y_{t−1} + .032 y²_{t−1}.

The conditional variance is plotted against time in Fig. 2.2. It is seen
that up to the year 1980 the time series has high volatility compared to the years
after 1980.
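The fitted conditional variance can be reproduced directly from the mixture formula; the following R lines are a small sketch using the parameter estimates quoted above.

```r
# Conditional variance of the fitted two-component MAR at lag value y_{t-1}:
# Var(Y_t | y_{t-1}) = sum_k pi_k sigma_k^2 + sum_k pi_k mu_k^2 - (sum_k pi_k mu_k)^2.
cond_var_gdp <- function(y_lag) {
  pi_k <- c(.276, .724)
  mu_k <- cbind(.702, .497 + .401 * y_lag)     # regime means at y_{t-1}
  sig2 <- c(.286, 1.06)^2
  drop(mu_k^2 %*% pi_k) + sum(pi_k * sig2) - drop(mu_k %*% pi_k)^2
}
cond_var_gdp(0)   # roughly .84, matching the intercept .838 up to rounding
```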
Fig. 2.2 The conditional variance of the fitted MAR model to the U.S. GDP data

Fig. 2.3 One-step predictive density at 4 quarters of year 2009

Figures 2.3 and 2.4 give a number of one-step predictive (conditional) density
functions. The time points correspond to the 4 quarters of year 2009 and 8 quarters in
years 1949 and 1950. Over the two periods, the conditional density function changes
from bimodal to unimodal or from unimodal to bimodal. It is interesting that
these changes occur when the time series experiences high volatility. The fitted MAR
model has successfully captured such behaviour.

Fig. 2.4 One-step predictive density at 8 quarters in years 1949–1950

2.5.2 U.S. Unemployment Rate

The data are monthly U.S. unemployment rates over the period 1948–2010,
obtained from http://www.bea.gov. The time series plot and histogram of the
observed series of length 755 are given in Fig. 2.5. The time series plot shows
increasing and decreasing trends in the series and also high volatility over time.
The histogram of the series is clearly multimodal, indicating that a MAR model
may be appropriate. As is commonly done in time series analysis, we use the first-difference
transformation of the series to remove the increasing and decreasing
trends. The time series plot and the ACF of the first difference
z_t = y_t − y_{t−1} are also given in Fig. 2.5. The trend in the mean series has been

Fig. 2.5 (a) and (b) The time series plot and histogram of the monthly U.S. unemployment rate.
(c) The time series plot of the first difference of the U.S. unemployment rate. (d) The ACF of the
first difference

successfully removed, but the variance still changes over time. In what follows
we fit a MAR model to the difference z_t. Based on the ACF of z_t in Fig. 2.5,
the autocorrelation at around lag five seems significant. Thus we let q = 5,
applied our method of Sect. 2.3, and fitted MAR models with K = 1, 2, 3, 4 to z_t.
The RBIC values for K = 1, 2, 3, 4 are −396.28, −359.25, −345.25 and −353.49.
Thus, we select K̂ = 3. The parameter estimates of the corresponding
fitted MAR model are (π̂_1, π̂_2, π̂_3; σ̂_1, σ̂_2, σ̂_3; φ̂_11, φ̂_12, φ̂_15, φ̂_22, φ̂_23, φ̂_31, φ̂_35) =
(.184, .742, .074; .152, .148, .256; .631, .659, .298, .143, .182, −1.15, .862). The
standard errors of the AR-coefficient estimators are (.128, .113, .126, .036, .033,
.486, .217).

Fig. 2.6 Fitted conditional variance of the MAR model to the U.S. unemployment rate

The fitted MAR model for the original series y_t has the conditional density function

  f(y_t | y_{t−1}, …, y_{t−6}) = .184 φ(y_t; 1.63y_{t−1} + .028y_{t−2} − .659y_{t−3} + .298y_{t−5} − .298y_{t−6}, .152²)
                               + .742 φ(y_t; y_{t−1} + .143y_{t−2} + .039y_{t−3} − .182y_{t−4}, .148²)
                               + .074 φ(y_t; −.150y_{t−1} + 1.15y_{t−2} + .862y_{t−5} − .862y_{t−6}, .256²).

The estimated conditional variance of Y_t, i.e. Var̂(Y_t | y_{t−1}, …, y_{t−6}), is plotted
against time in Fig. 2.6. It is seen that the unemployment rate has high volatility
over different time periods.
Figure 2.7 shows the one-step predictive density function of the series y_t for the
period November 1974 to April 1975. The shape of the predictive density changes
over this period, during which the unemployment rate y_t experienced a dramatic change
from 6.6 to 8.8. We observed similar behaviour of the one-step predictive
density over other time periods where the series has high volatility; for example,
in 1983 the unemployment rate decreased from 10.4 in January to 8.3 in
December. To save space, the related plots are not reported here.

Fig. 2.7 One-step predictive density for the period of November 1974 to April 1975

2.6 Summary and Discussion

Regime-switching Gaussian autoregressive (AR) models provide a rich class of
statistical models for time series data. We have developed a computationally
efficient regularization method for the selection of the regime-specific AR-orders
and the number of AR regimes. We evaluated the finite sample performance of the
proposed methods through extensive simulations. The proposed RBIC for selecting
the number of AR regimes performs well in the various situations considered in our
simulation studies, and it represents a substantial computational advantage compared to
the standard BIC. The proposed methodologies could be extended to situations
where there are exogenous variables x_t affecting the time series y_t. Large sample
properties, such as selection consistency and oracle properties of the proposed
regularization methods, are the subject of future research.

References

Bollerslev T (1986) Generalized autoregressive conditional heteroskedasticity. J Econom


31(3):307–327
Chen J, Tan X, Zhang R (2008) Inference for normal mixtures in mean and variance. Stat Sin
18:443–465
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM
algorithm (with discussion). J R Stat Soc B 39:1–38
Engle RF (1982) Autoregressive conditional heteroscedasticity with estimates of the variance of
United Kingdom inflation. Econometrica 50(4):987–1007
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties.
J Am Stat Assoc 96:1348–1360
Hathaway RJ (1985) A constraint formulation of maximum-likelihood estimation for normal
mixture distributions. Ann Stat 13:795–800
Jirak M (2012) Simultaneous confidence bands for Yule-Walker estimators and order selection.
Ann Stat 40(1):494–528
Le ND, Martin RD, Raftery AE (1996) Modeling flat stretches, bursts, and outliers in time series
using mixture transition distribution models. J Am Stat Assoc 91:1504–1514
Nardi Y, Rinaldo A (2011) Autoregressive process modeling via the Lasso procedure. J Multivar
Anal 102(3):528–549
Tibshirani R (1996) Regression shrinkage and selection via Lasso. J R Stat Soc B 58:267–288
Wang H, Li G, Tsai C-L (2007) Regression coefficient and autoregressive order shrinkage and
selection via the Lasso. J R Stat Soc B 69(1):63–78
Wong CS, Li WK (2000) On a mixture autoregressive model. J R Stat Soc B 62:95–115
Zou H (2006) The adaptive Lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Chapter 3
Modeling Zero Inflation and Overdispersion
in the Length of Hospital Stay for Patients
with Ischaemic Heart Disease

Cindy Xin Feng and Longhai Li

Abstract Ischaemic heart disease is the leading cause of death in the world;
however, quantifying its burden in a population is a challenge. Hospitalization data
provide a proxy for measuring the severity of ischaemic heart disease. Length of
stay (LOS) in hospital is often used as an indicator of hospital efficiency and
a proxy of resource consumption; LOS data may be characterized as zero-inflated
if there is an over-abundance of zeros, or zero-deflated if there are fewer zeros
than expected under a standard count model. Such data may also have a highly
right-skewed distribution for the nonzero values. Hurdle models and zero-inflated
models were developed to accommodate both the excess zeros and the skewness of
the data with various configurations of spatial random effects, as well as allowing
for analysis of the nonlinear effect of seasonality and other fixed effect covariates. We
draw attention to considerable drawbacks with regard to model misspecification.
Modeling and inference use the fully Bayesian approach via Markov chain Monte
Carlo (MCMC) simulation techniques. Our results indicate that both hurdle and
zero-inflated models accounting for clustering at the residential neighborhood
level outperform their counterparts without such clustering, and that modeling the count
component as a negative binomial distribution is significantly superior to using
a Poisson distribution. Additionally, hurdle models provide a better fit than
the counterpart zero-inflated models in our application.

C.X. Feng (✉)


School of Public Health, University of Saskatchewan, 104 Clinic Place, Saskatoon, SK, S7N5E5,
Canada
e-mail: [email protected]
L. Li
Department of Mathematics and Statistics, University of Saskatchewan, 106 Wiggins Road,
Saskatoon, SK, S7N 5E6, Canada
e-mail: [email protected]


3.1 Introduction

Ischaemic heart disease (IHD), also known as coronary heart disease (CHD) (Bhatia
2010), is a disease characterized by reduced blood supply to the heart due to buildup
of plaque along the inner walls of the coronary arteries. IHD is the leading cause of
death in the world (Mathers and Loncar 2006; Murray 1997). In 2004, the number of
IHD-related deaths was 7.2 million, accounting for 12.2% of all deaths and 5.8%
of all years of life lost, and 23.2 million people experienced moderate or severe
disability due to IHD (Mathers et al. 2008). As the most common type of heart
disease, the projected total cost of IHD will increase from 46.8 billion dollars
in 2015 to 106.4 billion dollars in 2030 (Go et al. 2013). In 2008, cardiovascular
disease accounted for 29% of all deaths in Canada, with 28% of all male deaths
and 29.7% of all female deaths, according to the mortality summary list of causes
released by Statistics Canada in 2011. Among those, 54% were due to ischemic
heart disease. In 2005/06 there were 160,323 hospitalizations for ischemic heart
disease (Tracking Heart Disease and Stroke in Canada 2009), which caused a huge
burden on medical care services.
Many studies on improving admission outcomes for ischemic heart disease
patients have tended to focus on reducing the duration of in-patient care, which is often
measured as length of stay (LOS), i.e., the duration of a hospital admission (the
difference in days between the date of admission and the date of discharge).
A shorter LOS often results in reduced costs of health resources. Therefore,
LOS is often used as an indicator of hospital efficiency and a proxy of resource
consumption; it is also a measure of the crucial recovery time for in-patient
treatment. The number of heart disease patients in need of surgery is increasing
due to the aging population and increasing prevalence (Chassot et al. 2002). To prepare for the
increasing demand for inpatient treatment from a service management perspective,
and with the advances in pharmaceutical and medical technologies and clinical
practice, health services provide same-day surgery, also known as ambulatory
surgery or outpatient surgery. This type of surgery does not require an overnight
hospital stay, so that surgery patients may go home without needing an overnight
hospital bed, leading to a decline in LOS. The purpose of outpatient surgery is to
create a cost reduction for the health system, as well as saving the patient time.
Analysis of LOS data may assist in monitoring and planning resource allocation
and in designing appropriate interventions. The potential risk factors for LOS can be
at either the patient or group level, some observed and others unobserved and possibly
spatially correlated. From a statistical point of view, several important features of
the LOS data must be considered. First, the data are potentially zero-inflated or
zero-deflated depending on the proportion of patients with day surgery (LOS = 0).
Second, because the patients are clustered within neighborhoods, within-cluster
correlation needs to be addressed. Within the context of health services research,
the study of regional variation in LOS can help suggest regional health care
inequalities, which can motivate further study examining the nature of these
inequalities. This type of geographical difference may be driven by socio-economic

determinants, availability of and access to health care, and health-seeking behavior.
Third, for ischaemic heart disease, temporal variation may be a factor in access to
care and therefore needs to be modeled flexibly to capture any non-linear trend. Finally,
it is unclear whether the probability of having day surgery for patients from certain areas
is correlated with the mean LOS for patients from the same geographic
areas, reflecting the needs of health care in those areas. If correlation is absent,
this implies that the health care seeking behaviors of outpatients and inpatients are
not geographically correlated. If correlation is present, failing to account for such
dependence may produce biased parameter estimates. Therefore, adequate statistical
modeling and analysis taking all these features of the data into account is needed.
Our empirical investigation indicated that the number of zeros (LOS = 0) is
greater than expected under a standard count distribution, so the data are zero-inflated.
The excess zeros often come from two sources: some may be from subjects who
choose not to stay in hospital overnight, thereby contributing 'sampling zeros',
while others are genuine non-users of inpatient services and are hence considered
'structural zeros'. Standard count distributions, such as the Poisson and negative
binomial, may fail to provide an adequate fit, since they cannot account for excess
zeros arising from two data-generating processes simultaneously; fitting these models
may lead to inflated variance estimates of the model parameters. If the data exhibit
only an excess of structural zeros, a hurdle model (Heilbron 1994; Mullahy 1986)
would be more appropriate. Hurdle models have been exploited in many disciplines,
such as drug and alcohol use studies (Buu et al. 2012) and health-care
utilization (Neelon et al. 2013; Neelon and O'Malley 2010). The model consists of
two components: a Bernoulli component modeling a point mass at zero and a zero-truncated
count distribution for the positive observations. Alternatively, if the excess
zeros comprise both structural and sampling zeros, a zero-inflated model
(Lambert 1992) can be used, which combines an untruncated count distribution
with a degenerate distribution at zero. This type of model has been used extensively
in many fields, such as environmental and ecological science (Ver Hoef and Jansen
2007), substance abuse (Buu et al. 2012), dentistry (Mwalili et al. 2008) and public
health (Musal and Aktekin 2013).
The distinction between zero-inflated and hurdle models may be subtle, but one
may be more appropriate than the other depending on how the zeros arise, and the different
models can yield different results with different interpretations. If zeros arise in only
one way, then a hurdle model may be more appropriate; in the context of our study,
this is the case if patients either decline or have never been referred to same-day surgery,
so that the zero observations come only from a 'structural' source. In contrast, if zeros
arise from two sources, namely patients who are not at risk of being hospitalized overnight
and patients who are at risk but nevertheless choose not to use services, a zero-inflated
model would be more desirable.
Model fitting can be carried out using either the EM algorithm or a Bayesian approach.
For each component, patient- or neighborhood-level fixed effect covariates, as well
as random effects at the neighborhood level accounting for clustering within neighborhoods,
can be included. The random effect terms for the two model components can be
modeled as independent and identically distributed (IID) normal variables.

To provide spatial smoothing and to borrow information across neighborhoods,
a spatially correlated random effect with a conditional autoregressive (CAR) prior (Besag
et al. 1991) can be applied to the count component or to both the Bernoulli and count components
(Agarwal et al. 2002; Rathbun 2006; Ver Hoef and Jansen 2007). Furthermore,
Neelon et al. (2013) developed a spatial hurdle model for exploring geographic
variation in emergency department utilization by linking the Bernoulli and Poisson
components of a hurdle model using a bivariate conditional CAR prior (Gelfand
and Vounatsou 2003; Mardia 1988). In our study, we investigate zero-inflated
and hurdle models with various random effect structures, accommodating
potential overdispersion by considering different parametric specifications of the count
distribution. We draw attention to considerable drawbacks with regard to model
misspecification.
The rest of the paper is organized as follows. We first describe the data. Next, we
specify the models and outline the Bayesian approach used for model estimation.
This is followed by the application of the models to the data, and the results are then
presented. A discussion of the results and the limitations of the study concludes this
chapter.

3.2 Methods

3.2.1 Data

We used a hospital discharge administrative database, provided by the Saskatchewan
Ministry of Health, with admission dates ranging between January 1 and December 31,
2011, for admissions due to IHD. These administrative databases, produced by
every acute care hospital in the province of Saskatchewan, provide the following
information for every single admission: age, gender, admission and discharge
dates, the patient's area of residence, and diagnosis and procedure codes [International
Classification of Diseases (ICD), 10th revision, Clinical Modification codes
I20–I25 for IHD]. The patients' areas of residence, i.e., postal codes, are confidential;
therefore, each case was matched to one of the 33 health districts in Saskatchewan.
Figure 3.1 presents the histogram of the LOS for IHD patients from
Saskatchewan in 2011. Of the 5777 hospitalized cases due to IHD, 1408 (24%)
had same-day surgery, which constitutes the zero counts in LOS. Among those
inpatients who stayed in hospital overnight, the number of days ranged from 1
to 156, with 75% having fewer than a week of stay. Suppose that the data were
generated under an independent and identically distributed Poisson regression with
mean parameter equal to 4.5 days, the mean of the LOS in our data.
Under such a model we would expect about 1% zeros, far fewer than
observed. The proportion of zeros and the right-skewed non-zero counts suggest
potential zero inflation relative to the conventional Poisson distribution, together with
overdispersion. Hence, special distributions are needed to provide an adequate fit to
the data.

Fig. 3.1 Empirical distribution of LOS in days

Table 3.1 provides summary statistics on patient characteristics. Of the 5777
hospitalized cases due to ischaemic heart disease, 3931 (68%) are males and 391 (6.8%)
are Aboriginal. During the study period, the number of IHD hospitalized cases tends
to vary slightly over time, and the median age is 69 years (interquartile range
(IQR): 59–79). These characteristics of the data are more or less the same for those
who had day surgery and those who stayed in hospital overnight. For those who
stayed in hospital overnight, the median LOS is around 4 days, with the IQR ranging
from 2 days to about a week.
The left panel in Fig. 3.2 displays the percentage of patients accessing same-day
surgery in each health district; higher values appear in the
south and in the middle of the province, but values are generally lower in the north.
The right panel of Fig. 3.2 presents the mean LOS per patient in
each health district; one of the health districts in the north west
and a cluster of health districts in the south east have higher
mean LOS compared with the rest of the health districts.

Table 3.1 Summary statistics of the data

Variable           Total         LOS = 0       LOS > 0
                   n (%)         n (%)         n (%)          Median (IQR)
Gender
  Male             3931(68.0)    1006(25.6)    2925(74.4)     3(0–5)
  Female           1846(32.0)    402(21.8)     1444(78.2)     3(1–6)
Ethnicity
  Aboriginal       391(6.8)      88(22.5)      303(77.5)      4(2–6)
  Non-Aboriginal   5386(93.2)    1320(24.5)    4066(75.5)     4(2–7)
Month
  Jan              529(9.2)      112(21.2)     417(78.8)      4(2–8)
  Feb              469(8.1)      103(22.0)     366(78.0)      4(2–7)
  Mar              520(9.0)      147(28.3)     373(71.7)      4(2–8)
  Apr              483(8.4)      112(23.2)     371(76.8)      4(2–8)
  May              490(8.5)      119(24.3)     371(75.7)      4(2–8)
  Jun              504(8.7)      136(27.0)     368(73.0)      4(2–6)
  Jul              445(7.7)      106(23.8)     339(76.2)      4(2–7)
  Aug              404(7.0)      110(27.2)     294(72.8)      4(2–6)
  Sep              460(8.0)      109(23.6)     351(76.3)      4(2–7)
  Oct              505(8.7)      119(23.6)     386(76.4)      4(2–7)
  Nov              503(8.7)      114(22.7)     389(77.3)      4(2–7)
  Dec              465(8.0)      121(26.0)     344(74.0)      4(2–7)
Age
  [18, 40)         54(0.93)      12(22.2)      42(77.8)       2(1–4)
  [40, 50)         368(6.4)      87(23.6)      281(76.4)      2.5(1–5)
  [50, 60)         1074(18.6)    268(25.0)     806(75.0)      2(1–5)
  [60, 70)         1457(25.2)    420(28.8)     1037(71.2)     2(0–5)
  [70, 80)         1434(24.8)    405(28.2)     1029(71.8)     2(0–6)
  80+              1390(24.1)    216(15.5)     1174(84.5)     4(1–8)

3.2.2 The Statistical Models


3.2.2.1 The Hurdle Model

The hurdle model (Heilbron 1994; Mullahy 1986) is a two-component mixture
model consisting of a zero mass and a component for the non-zero observations following
a conventional count distribution, such as the Poisson or negative binomial.
Let Y_ij denote the LOS in days for the ith patient, i = 1, …, n, from health district j,
j = 1, …, J. The general structure of a hurdle model is given by

  P(Y_ij = y_ij) = π_ij                                               if y_ij = 0,
  P(Y_ij = y_ij) = (1 − π_ij) p(y_ij; θ_ij) / [1 − p(0; θ_ij)]        if y_ij > 0,        (3.1)

Fig. 3.2 The panel on the left: percentage of patients with day surgery in each of the health districts
from Saskatchewan in 2011; the panel on the right: mean LOS among those inpatients
with at least one day of stay in hospital. The darker color represents higher values

where π_ij = P(Y_ij = 0) is the probability of a subject belonging to the zero
component, p(y_ij; θ_ij) represents a regular count distribution with a vector of
parameters θ_ij, and p(0; θ_ij) is this distribution evaluated at zero. If the count
distribution is Poisson, the probability distribution of the hurdle Poisson model is

  P(Y_ij = y_ij) = π_ij                                                              if y_ij = 0,
  P(Y_ij = y_ij) = (1 − π_ij) [e^{−μ_ij} μ_ij^{y_ij} / y_ij!] / (1 − e^{−μ_ij})      if y_ij > 0.        (3.2)

Alternatively, the non-zero count component can follow other distributions to
account for overdispersion; the negative binomial is the most commonly used. The
hurdle negative binomial (hurdle NB) model is given by

  P(Y_ij = y_ij) = π_ij                                                                                      if y_ij = 0,
  P(Y_ij = y_ij) = (1 − π_ij) [Γ(y_ij + r)/(Γ(r) y_ij!)] [μ_ij/(μ_ij + r)]^{y_ij} [r/(μ_ij + r)]^{r} / {1 − [r/(μ_ij + r)]^{r}}   if y_ij > 0,        (3.3)

where 1 + μ_ij/r is a measure of overdispersion. As r → ∞, the negative binomial
converges to a Poisson distribution. To model the association between a set of
predictors and the zero-modified response, both the hurdle Poisson and hurdle NB models
can be extended to a regression setting by modeling each component as a function
of covariates. The covariates appearing in the two components are not necessarily
the same. Let w_ij be the set of factors contributing to the outpatient outcome (LOS = 0)
and x_ij be the set of factors contributing to the inpatient non-zero LOS. The
parameter π_ij represents the probability of using day surgery: when π_ij = 0, no
patients receive day surgery and the data follow a truncated count distribution,
whereas when π_ij = 1, no patients stay in hospital overnight; π_ij ranges between
0 and 1. The parameter μ_ij measures the expected mean LOS (in days)
for those patients who stayed in hospital overnight, so as μ_ij increases, the average
LOS increases. Both logit(π_ij) and log(μ_ij) are assumed to depend on a function of
covariates. In addition, random effects at the health district level are introduced
in the model to account for possible correlation between the two components; the
random components also control the variation at the health district level. The model
can be written as

  logit(π_ij) = w_ij′ α + f_1(month_ij) + b_1j
  log(μ_ij)  = x_ij′ β + f_2(month_ij) + b_2j,        (3.4)

where w_ij′ and x_ij′ are patient-level fixed effect covariates for the logistic and Poisson
components, and α and β are the corresponding vectors of regression coefficients.
In our study context,

  w_ij′ α = α_0 + α_1 abor_ij + α_2 male_ij + α_3 I(age_ij ∈ [18, 40)) + α_4 I(age_ij ∈ [40, 50)) +
            α_5 I(age_ij ∈ [50, 60)) + α_6 I(age_ij ∈ [60, 70)) + α_7 I(age_ij ∈ [70, 80))
  x_ij′ β = β_0 + β_1 abor_ij + β_2 male_ij + β_3 I(age_ij ∈ [18, 40)) + β_4 I(age_ij ∈ [40, 50)) +
            β_5 I(age_ij ∈ [50, 60)) + β_6 I(age_ij ∈ [60, 70)) + β_7 I(age_ij ∈ [70, 80)),        (3.5)

where α_0 and β_0 represent the intercept terms for the excess-zero and count
components, respectively; abor is the indicator of Aboriginal status, with non-Aboriginal
as the reference category; male denotes male gender, with female as the reference
category; and the age variable is categorized into 6 categories, with 80 years and above
as the reference category. I(·) equals 1 if the condition in the brackets is true.
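To make the likelihood contributions explicit, a small R function for the hurdle NB probabilities in (3.3) is sketched below; it uses the (mu, size = r) parameterization of dnbinom and is written by us for illustration rather than taken from the authors' code.

```r
# Hurdle negative binomial probability mass function, eq. (3.3):
# P(Y = 0) = pi;  P(Y = y) = (1 - pi) * NB(y; mu, r) / (1 - NB(0; mu, r)) for y > 0.
dhurdle_nb <- function(y, pi, mu, r, log = FALSE) {
  p0  <- dnbinom(0, mu = mu, size = r)        # equals (r / (mu + r))^r
  out <- ifelse(y == 0,
                pi,
                (1 - pi) * dnbinom(y, mu = mu, size = r) / (1 - p0))
  if (log) log(out) else out
}

# Example with made-up values: probability of a 3-day stay when pi = .24, mu = 4.5, r = 1.2.
dhurdle_nb(3, pi = 0.24, mu = 4.5, r = 1.2)
```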
Seasonal variation has been observed in mortality due to coronary heart disease,
often characterized by a winter peak (Bull and Morton 1978; Rogot and Blackwelder
1979; Rose 1966). It has been postulated that temperature changes could account for
practically all of the seasonal variation observed in coronary heart disease deaths,
since lower environmental temperature may exert a direct effect on the heart or
an indirect effect via changes in blood pressure (Woodhouse et al. 1993). Even
though the existing literature contains a vast amount of evidence on the role of
seasonal variation in IHD mortality, little is currently available on the
possible effects of seasonal variation on LOS due to IHD. Hence, we flexibly
model the temporal effect of the month of admission to hospital using
a smooth function with a cubic B-spline basis. The specification of the
spline function in the logit(π_ij) and log(μ_ij) components is

  f_h(month_ij) = Σ_{k=1}^{K} c_kh B_k(month_ij),   h = 1, 2,        (3.6)

where B_k(month_ij), k = 1, …, K, denote the cubic B-spline basis functions with
a predefined number of equidistant knots for the excess-zero and count
components, respectively, and {c_kh; k = 1, …, K; h = 1, 2} denote the
corresponding regression coefficients for the basis functions of month. To ensure
enough flexibility, we choose K = 6.
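The month effect can be set up with standard B-spline utilities in R; the sketch below uses splines::bs to build a cubic basis with three equidistant interior knots, giving K = 6 basis functions, although the exact knot placement used by the authors is not stated beyond "equidistant".

```r
library(splines)

month <- 1:12
# Cubic B-spline basis with 3 equidistant interior knots -> K = 6 basis columns.
knots <- seq(min(month), max(month), length.out = 5)[2:4]
B <- bs(month, knots = knots, degree = 3, intercept = FALSE)
dim(B)   # 12 x 6: columns are B_k(month), k = 1, ..., 6, as in (3.6)

# The smooth month effect is then the linear combination f_h(month) = B %*% c_h
# for component-specific coefficients c_h, h = 1, 2.
```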
The parameters b_1j and b_2j in (3.4) are random effect terms that account for residual
variation at the areal level unexplained by the patient-level covariates: b_1j is
a latent areal-level variable contributing to the propensity of accessing day surgery
for patients living in health district j, and b_2j is a latent variable contributing to
the expected mean LOS for those inpatients from health district j. As such,
larger values of b_1j imply that patients living in health district j are more likely to
receive day surgery compared with patients in health districts with lower b_1j values.
Likewise, larger values of b_2j imply, on average, longer LOS among patients in the
jth health district compared with other health districts.
These random effect terms can account for unmeasured characteristics at the
health district level; studying their correlation is therefore of interest, as it
reflects the association between the propensity of accessing day surgery and the mean
length of hospital stay. For example, patients from some health districts could be
more likely to be referred to day surgery and also tend to stay in hospital longer than
patients from other health districts. Alternatively, patients from certain health districts
may be more likely to access day surgery rather than staying in hospital overnight, or
vice versa; or the spatial patterns for the propensity of receiving day surgery and the mean
LOS may not be statistically related. To account for the potential association, we assume
a joint multivariate normal distribution for b_j = (b_1j, b_2j)^T, j = 1, …, J, as
b_j ∼ MVN(0, Σ), where Σ is a 2 × 2 variance-covariance matrix with diagonal elements
Σ_11 and Σ_22 representing the conditional variances of b_1 = (b_11, …, b_1J)^T and
b_2 = (b_21, …, b_2J)^T respectively, and off-diagonal element Σ_12 representing the
within-area covariance between b_1 and b_2. The correlation between b_1 and b_2 is
ρ = Σ_12/√(Σ_11 Σ_22), which measures the strength of the association between the two
processes, −1 ≤ ρ ≤ 1. When ρ = 0, the two components of the hurdle model are
uncorrelated, so the propensity of using day surgery is unrelated to the mean length of
hospital stay within an area. When ρ > 0, health districts with a higher proportion of day
surgery users tend to have a higher mean length of hospital stay, and when ρ < 0, health
districts with a higher proportion of day surgery tend to have a lower mean length of
hospital stay.

To account for potential spatial correlation within each component and across the two
components, a bivariate intrinsic CAR prior distribution (Gelfand and Vounatsou
2003; Mardia 1988) can be used for b_j (Neelon et al. 2013):

  b_j | b_(−j), Σ ∼ MVN( (1/m_j) Σ_{ℓ∈δ_j} b_ℓ, (1/m_j) Σ )        (3.7)

where δ_j and m_j denote the set of labels of the "neighbors" of area j and the
number of neighbors, respectively. Σ is again a 2 × 2 variance-covariance matrix,
whose diagonal elements describe the spatial covariance structure characterizing
each component of the hurdle model. The off-diagonal element Σ_12 contains the cross-covariance,
the covariance between the two components at different areas, which
allows the covariance between the proportion of day surgery at area j
and the mean length of hospital stay at area j′ to differ from that
between the proportion of day surgery at area j′ and the mean length of hospital stay
at area j.

3.2.2.2 The Zero-Inflated Model

A zero-inflated model assumes that the zero observations have two different origins:
"structural" and "sampling". The sampling zeros are due to the usual Poisson
(or negative binomial) distribution, which assumes that those zero observations
happened by chance, while structural zeros are observed due to some specific structure
in the data. The general structure of a zero-inflated model is given as

  P(Y_ij = y_ij) = π_ij + (1 − π_ij) p(0; θ_ij)        if y_ij = 0,
  P(Y_ij = y_ij) = (1 − π_ij) p(y_ij; θ_ij)            if y_ij > 0,        (3.8)

which consists of a degenerate distribution at zero and an untruncated count
distribution with a vector of parameters θ_ij. If the count distribution is
Poisson, the zero-inflated Poisson (ZIP) model is given by

  P(Y_ij = y_ij) = π_ij + (1 − π_ij) e^{−μ_ij}               if y_ij = 0,
  P(Y_ij = y_ij) = (1 − π_ij) e^{−μ_ij} μ_ij^{y_ij}/y_ij!    if y_ij > 0,        (3.9)

where μ_ij is the mean of the standard Poisson distribution. As with hurdle models,
overdispersion can be modeled via the negative binomial distribution. The zero-inflated
negative binomial (ZINB) model is then given by

  P(Y_ij = y_ij) = π_ij + (1 − π_ij) [r/(μ_ij + r)]^{r}                                                     if y_ij = 0,
  P(Y_ij = y_ij) = (1 − π_ij) [Γ(y_ij + r)/(Γ(r) y_ij!)] [μ_ij/(μ_ij + r)]^{y_ij} [r/(μ_ij + r)]^{r}        if y_ij > 0.        (3.10)
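For comparison with the hurdle NB sketch above, the ZINB probabilities in (3.10) differ only in how the zeros are handled; a short R sketch (ours, not the authors') is:

```r
# Zero-inflated negative binomial probability mass function, eq. (3.10):
# zeros arise either from the point mass (probability pi) or from the NB itself.
dzinb <- function(y, pi, mu, r, log = FALSE) {
  out <- ifelse(y == 0,
                pi + (1 - pi) * dnbinom(0, mu = mu, size = r),
                (1 - pi) * dnbinom(y, mu = mu, size = r))
  if (log) log(out) else out
}
```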

3.2.3 Bayesian Posterior Computation

Fully Bayesian inference is adopted for model estimation, based on the
analysis of the posterior distribution of the model parameters. In general, the posterior
is high-dimensional and analytically intractable, which makes direct inference almost
impossible. This problem is circumvented by using Markov chain Monte Carlo
(MCMC) simulation techniques, where samples are drawn from the full
conditional distribution of each parameter given the rest of the parameters and the data.
At convergence, the MCMC procedure draws Monte Carlo samples from the joint posterior
distribution of the model parameters, which can then be used to obtain parameter estimates
and corresponding uncertainty intervals, thus avoiding the need for asymptotic assumptions
when assessing the sampling variability of parameter estimates.
To complete the model specification, we assign uniform priors to the intercept
parameters α_0 and β_0 and weakly informative proper N(0, 10) priors to the
remaining regression coefficients, including the spline parameters. For the spatial
covariance matrix Σ, we assume an inverse Wishart prior IW(2, I_2), where I_2
denotes the two-dimensional identity matrix. Updating the full conditionals of the
parameters is implemented in WinBUGS (Spiegelhalter et al. 2005). We ran two
parallel dispersed chains for 20,000 iterations each, discarding the first 10,000 as
burn-in. Convergence of the MCMC chains was assessed using
trace plots and Gelman-Rubin statistics, which indicated rapid convergence of the
chains.
To compare the various models, we employ the deviance information criterion
(DIC), defined as DIC = D̄ + p_D, where D̄ is the posterior mean of the deviance,
which measures goodness of fit (Spiegelhalter et al. 2002). The penalty term
p_D is the effective number of model parameters, a measure of model
complexity. Models with lower D̄ indicate good fit, and lower values of p_D indicate
a parsimonious model. Therefore, models with smaller values of DIC are preferred,
as they achieve a more optimal combination of fit and parsimony.
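Given MCMC output, the DIC can be computed from the sampled deviances. The sketch below assumes a vector dev of deviance values evaluated at each posterior draw and a value dev_at_mean evaluated at the posterior means of the parameters, which is the standard plug-in; variable names are ours.

```r
# DIC = Dbar + pD, with pD = Dbar - D(theta_bar) (Spiegelhalter et al. 2002).
dic_from_draws <- function(dev, dev_at_mean) {
  d_bar <- mean(dev)              # posterior mean deviance (fit)
  p_d   <- d_bar - dev_at_mean    # effective number of parameters (complexity)
  c(DIC = d_bar + p_d, Dbar = d_bar, pD = p_d)
}
```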

3.3 Analysis of the LOS Data

To analyze the LOS data, we initially considered fitting the Poisson, negative
binomial (NB), ZIP, ZINB, hurdle Poisson and hurdle NB regression models without
any random effect terms. To assess which distribution fits the data better and to evaluate
overdispersion in these fixed-effects-only models, Akaike's information criterion (AIC)
(Akaike 1973) and the Vuong statistic (Vuong 1989) were calculated.
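For the fixed-effects-only comparison, such fits and tests are available in standard R packages; a hedged sketch using the pscl package is shown below, assuming a data frame los_dat with the LOS outcome and covariates (the variable names are placeholders, not the authors' dataset).

```r
library(pscl)   # provides hurdle(), zeroinfl() and vuong()

# Fixed-effects-only fits of competing count models (no random effects).
f <- los ~ abor + male + age_group + factor(month)
m_zip  <- zeroinfl(f, data = los_dat, dist = "poisson")
m_zinb <- zeroinfl(f, data = los_dat, dist = "negbin")
m_hurd <- hurdle(f,  data = los_dat, dist = "negbin")

AIC(m_zip, m_zinb, m_hurd)   # smaller is better
vuong(m_hurd, m_zinb)        # Vuong test for non-nested model comparison
```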
Table 3.2 summarizes the statistics comparing the goodness of fit of the models.
A positive Vuong statistic means that the model in the row fits
better than the model in the column, and a negative value means that the model
in the column fits better than the model in the row.

Table 3.2 Criteria for evaluating the goodness of fit and for model selection of the six competing models
for analyzing the LOS of ischaemic heart disease patients from Saskatchewan in 2011. The second
column is Akaike's information criterion (AIC); the remaining columns present the Vuong
statistic, with p-values beneath. A negative number means that the model in the column fits better
than the model in the row, and a positive number means that the model in the row fits better than
the model in the column

              AIC      Vuong statistic
                       NB        ZIP       ZINB      hurdleP(a)   hurdleNB(b)
Poisson       52836    −14       −22       −14       −22          −14
                       <0.001    <0.001    <0.001    <0.001       <0.001
NB            29676    –         11        −3        11           −4
                                 <0.001    0.003     <0.001       <0.001
ZIP           45171    –         –         −11       0.9          −11
                                           <0.001    0.18         <0.001
ZINB          29665    –         –         –         11           −1.9
                                                     <0.001       0.028
hurdleP(a)    45172    –         –         –         –            −11
                                                                  <0.001
hurdleNB(b)   29631    –         –         –         –            –

(a) hurdleP denotes the hurdle Poisson model
(b) hurdleNB denotes the hurdle negative binomial model

The conventional Poisson model is inferior to the other models, as shown by all the
negative numbers in its row; the hurdle NB model shows superior fit compared to the other
models, with all the negative numbers in its column; and the zero-inflated models fit better
than their corresponding non-zero-inflated counterparts. This suggests that the best fitting
model needs to account for both zero-inflation and overdispersion in the observed
data. In addition, the hurdle Poisson and hurdle NB models fit better than their
corresponding ZIP and ZINB models, which suggests that the zero counts were best
modeled as being only structural zeros.
Furthermore, we included random effect terms in the models to compare the
goodness of fit of the ZIP, ZINB, hurdle Poisson and hurdle NB models under various
configurations of random effects, ranging from models without any random
effect terms, to models with a random effect term in one of the two components,
to models with random effect terms in both components. The random effect
terms are either IID normally distributed or assigned a CAR prior, or the two random
effect terms are correlated through a bivariate normal distribution or an MCAR prior
conditional on the predictors. The results are presented in Table 3.3, which shows
that the inclusion of random effects further improves the model fit despite the added
model complexity. Therefore, modeling the impact of fixed effect factors alone is
not sufficient to produce a satisfactory fit to the data, and random effects at the health
district level in both the Bernoulli and count components are needed to account
for areal-level heterogeneity. However, spatial correlation at the health district level
is not strong in either of the components, as the DIC scores for the MCAR models
are larger than those of the counterpart IID models.

Table 3.3 DIC and p_D (in parentheses) for competing models in the analysis of LOS for ischaemic heart disease
patients from Saskatchewan in 2011

Model                           Hurdle                       ZI
                                Poisson      NB              Poisson      NB
No (b_1, b_2)^T                 45172(28)    29631(29)       45172(28)    29655(9)
IID b_1                         45122(46)    29581(48)       45121(46)    29629(21)
IID b_2                         44284(46)    29534(48)       44289(46)    29606(21)
Independent CAR b_1             45122(46)    29579(46)       45121(46)    29642(28)
Independent CAR b_2             44292(62)    29540(53)       44296(62)    29611(31)
IID b_1 and b_2                 44233(76)    29484(71)       44231(76)    29543(50)
Independent CAR b_1 and b_2     44240(79)    29488(71)       44236(79)    29556(56)
Bivariate IID (b_1, b_2)^T      44235(80)    29486(76)       44233(80)    29540(50)
MCAR (b_1, b_2)^T               44243(82)    29492(74)       44241(82)    29540(54)

Table 3.4 presents the posterior means and 95% credible intervals for all
parameters except the B-spline coefficients, for the hurdle NB and hurdle Poisson
models with the bivariate IID random effect structure on (b_1, b_2)^T. Under the hurdle
NB model, after adjusting for the other predictors, males are more likely to access
day surgery, with posterior mean (95% CI) 0.143 (0.008, 0.276), whereas male
gender is not significantly associated with the mean LOS, with posterior mean (95%
CI) 0.011 (−0.073, 0.093). Aboriginal status had no impact on either the propensity
of receiving day surgery or the LOS. In general, as age decreases, the likelihood of
accessing day surgery increases, though not for ages under 40. In contrast,
as age increases, the LOS increases, which is intuitively sensible, as older patients
need more time to recover.
The variance components of the random effect terms indicate
that the two components are not statistically associated with each other at the
residential-neighborhood level, with the posterior mean (95% CI) of ρ estimated as
0.108 (−0.294, 0.489) under the hurdle NB model. This suggests that the probability
of accessing day surgery is not correlated with the mean LOS among users at
the health district level after adjusting for the various patient-level covariates. In
comparison with the hurdle NB model, the hurdle Poisson model yields relatively
smaller variance component estimates.
Figure 3.3 displays the temporal trend of the month of admission to hospital due
to IHD, on the linear predictor scale, for the two model components under the hurdle
NB model with the bivariate IID random effect structure on (b_1, b_2)^T. The horizontal
line at zero corresponds to no month effect. The log-odds of day surgery use do
not vary over time, with the point-wise credible interval covering zero. For the log
of the mean LOS, although a bimodal pattern appears in early spring and late
fall, the effect is not significant, as shown in Fig. 3.3. Under the counterpart hurdle
Poisson model, the log-odds of day surgery use are consistent with the hurdle NB
model; however, the temporal effect on the mean LOS becomes more pronounced,
as shown in Fig. 3.4. Therefore, the hurdle Poisson model yields a greater temporal effect
Table 3.4 Posterior mean estimates and 95 % credible intervals (in parentheses) for the parameters
from the hurdle NB and hurdle Poisson models with the random effect terms (b1, b2)^T following a
bivariate normal distribution

Variable        Parameter   Hurdle NB                  Hurdle Poisson
logit(π_ij)
Intercept       α0          −2.083(−2.365, −1.810)     −2.085(−2.385, −1.806)
Aboriginal      α1          −0.072(−0.345, 0.187)      −0.069(−0.352, 0.189)
Male            α2          0.143(0.008, 0.276)*       0.138(0.007, 0.284)*
Age 18–40       α3          0.403(−0.261, 1.044)       0.431(−0.261, 1.103)
Age 40–50       α4          0.474(0.189, 0.750)*       0.478(0.185, 0.754)*
Age 50–60       α5          0.549(0.334, 0.750)*       0.551(0.348, 0.765)*
Age 60–70       α6          0.762(0.575, 0.950)*       0.765(0.564, 0.966)*
Age 70–80       α7          0.746(0.548, 0.928)*       0.753(0.555, 0.947)*
log(μ_ij)
Intercept       β0          1.782(1.599, 1.961)        2.034(1.919, 2.167)
Aboriginal      β1          0.000(−0.160, 0.176)       −0.031(−0.091, 0.031)
Male            β2          0.011(−0.073, 0.093)       −0.004(−0.031, 0.023)
Age 18–40       β3          −1.035(−1.438, −0.596)*    −0.842(−1.010, −0.681)*
Age 40–50       β4          −0.815(−0.983, −0.646)*    −0.643(−0.704, −0.581)*
Age 50–60       β5          −0.615(−0.730, −0.490)*    −0.484(−0.522, −0.444)*
Age 60–70       β6          −0.499(−0.607, −0.390)*    −0.388(−0.421, −0.353)*
Age 70–80       β7          −0.233(−0.339, −0.131)*    −0.177(−0.209, −0.145)*
Variance component
var(b1j)        Σ11         0.117(0.061, 0.212)        0.119(0.057, 0.211)
var(b2j)        Σ22         0.101(0.057, 0.170)        0.085(0.050, 0.144)
cov(b1j, b2j)   Σ12         0.012(−0.037, 0.062)       0.009(−0.034, 0.053)
corr(b1j, b2j)  ρ           0.108(−0.294, 0.489)       0.085(−0.321, 0.459)

compared to the corresponding hurdle NB model, and similarly for the ZIP compared
with the ZINB model (not presented here), suggesting that failure to account for
overdispersion leads to over-estimation of the temporal effect. The seasonality
pattern is in contrast with the findings for coronary heart disease mortality in the
literature, which often report higher hospital mortality rates in winter than in other
seasons. Nevertheless, inpatients undergoing surgery stay in a hospital environment
whose conditions may be under control. Such a difference in seasonality pattern between
mortality and LOS warrants further investigation.
Figure 3.5 presents the posterior mean estimates of the random effects b1 (left
panel) and b2 (right panel) when (b1, b2)^T follows the bivariate IID structure, based on
the hurdle NB model. The left panel indicates that the health districts in red have
an increased propensity of accessing day surgery; these districts are distributed mainly in
the middle of the province, stretching towards the south east. The
right panel shows that a health district in the north west and some regions in the
south middle east have higher expected mean counts of LOS in days. The different
residual spatial patterns imply that the spatial distributions of the two components
Fig. 3.3 Temporal effect on the linear predictor scale for the binary component, logit(π) (left panel),
and the NB component, log(μ) (right panel), for the hurdle NB model with the random effect terms
(b1, b2)^T following a bivariate normal distribution. Dashed lines denote 95 % credible intervals
after accounting for the individual-level covariates do not share similar spatial
patterns.

3.4 Discussion

In this article, hurdle models and zero-inflated models were considered to model
the LOS for IHD hospitalizations. The models accommodate both excess zeros
and skewness of the data, with various configurations of fixed and random effects
allowing for analysis of nonlinear seasonality effects and spatial patterns. The
initial inspection of the observed data, as well as the fit statistics, suggested that
the distribution of the LOS was both overdispersed and zero-inflated. Our results
indicate that both hurdle and zero-inflated models including random effects at the areal
level for both model components outperform the models without those terms, and
that modeling the count component with a negative binomial distribution is significantly
superior to modeling it with a Poisson distribution.
Fig. 3.4 Temporal effect on the linear predictor scale for the binary component, logit(π) (left panel),
and the Poisson component, log(μ) (right panel), for the hurdle Poisson model with the random effect
terms (b1, b2)^T following a bivariate normal distribution. Dashed lines denote 95 % credible intervals
Hurdle models outperform the corresponding zero-inflated models in our application.
Min and Agresti (2005) suggested that hurdle models might provide a better
fit if there is evidence of zero deflation among subgroups of the population.
Zero-inflated models imply zero inflation at all levels of the covariates. Min
and Agresti (2005) also revealed the unstable nature of the zero-inflated formulation,
primarily because there is no distinct selection process leading to a zero or non-zero
value. On the contrary, the hurdle model has very stable behavior and performance.
Neelon and O'Malley (2010) give a detailed discussion comparing zero-inflated
and hurdle models in the health service research setting. The importance of accounting
for zero inflation and overdispersion clearly deserves further attention in the health
care utilization literature.
Our results highlight some important policy implications for management of
utilization of hospital services. By investigating the spatial pattern of propensity of
accessing day surgery and the means of LOS, policy makers can target communities
with greater needs for services such as day surgery centers to reduce the burden to
Fig. 3.5 Posterior mean estimates of the random effects b1 (left panel) and b2 (right panel) based
on the hurdle NB model with the random effect terms (b1, b2)^T following a bivariate normal
distribution
the primary health care facilities, which may be in great need in remote or rural
communities.
The estimates of the covariate effects differed in magnitude between models.
Of particular note, the Poisson models (ZIP or hurdle Poisson) estimated a significant
temporal effect of the month of admission to hospital on the mean LOS
component, whereas under the negative binomial models (ZINB or hurdle NB)
the temporal effect was not detected to be significant. This suggests that it is important to
account for overdispersion in the model, since ignoring extra dispersion in the data
will result in underestimation of the variance of the estimators. It illustrates the
risk of falsely identifying a significant effect if the chosen model does not capture the
spread of the data correctly. Although in our application time was not detected
to have a significant impact on either of the components, the models presented in this article
can be adapted to analyze other health indicators of similar structure in similar
settings.
A major limitation of our analysis is that the data used come from hospital
registers. In Saskatchewan, health care for registered First Nations patients is regulated by the
Canadian federal government, so the results may be biased towards urban areas
that are well covered by health facilities. Moreover, socio-demographic variables
are not contained in hospital registration data; a more representative
data source would link the hospital data with cross-sectional household survey data, which
would provide additional patient- or health district-level covariates reflecting patients'
deprivation level. However, such surveys are often carried out only every several years, and
the personal identifier is generally not released due to confidentiality. The
geographic unit in our application is restricted to the health district level because
postal codes cannot be released for confidentiality reasons. Stronger spatial autocorrelation might emerge
if a finer geographic unit, such as the census block, were available for this
study.

References

Agarwal DK, Gelfand A, Citron-Pousty S (2002) Zero-inflated models with application to spatial
count data. Environ Ecol Stat 9:341–355
Akaike H (1973) Information theory as an extension of the maximum likelihood principle. In:
Petrov BV, Csaki BF (eds) Second international symposium on information theory. Academiai
Kiado, Budapest
Besag J, York J, Mollie A (1991) Bayesian image restoration with two applications in spatial
statistics. Ann Inst Stat Math 43:1–21
Bhatia S (2010) Biomaterials for clinical applications. Springer, New York
Bull GM, Morton J (1978) Environment, temperature and death rates. Age Ageing 7:210–230
Buu A, Li R, Tan X, Zucker R (2012) Statistical models for longitudinal zero-inflated count data
with applications to the substance abuse field. Stat Med 31:4074–4086
Chassot P, Delabays A, Spahn DR (2002) Preoperative evaluation of patients with, or at risk of,
coronary artery disease undergoing non-cardiac surgery. Br J Anaesth 89:747–759
Gelfand AE, Vounatsou P (2003) Proper multivariate conditional autoregressive models for spatial
data analysis. Biostatistics 4:11–25
Go A, Mozaffarian D, Roger V, Benjamin E, Berry J, Borden W, Bravata D, Dai S, Ford E, Fox
C (2013) Heart disease and stroke statistics – 2013 update: a report from the American heart
association. Circulation 127:e6–e245
Heilbron DC (1994) Zero-altered and other regression models for count data with added zeros.
Biom J 36:531–547
Lambert D (1992) Zero-inflated Poisson regression with an application to defects in manufacturing.
Technometrics 34:1–14
Mardia KV (1988) Multi-dimensional multivariate gaussian Markov random fields with application
to image processing. J Multivar Anal 24:265–284
Mathers C, Fat D, Boerma J (2008) The global burden of disease: 2004 update. World Health
Organization, Geneva
Mathers C, Loncar D (2006) Projections of global mortality and burden of disease from 2002 to
2030. PLoS Med 3:2011–2030
Min Y, Agresti A (2005) Random effect models for repeated measures of zero-inflated count data.
Stat Modell 5:1–19
Mullahy J (1986) Specification and testing of some modified count data models. J Econom 33:341–
365
Murray CJ, Lopez AD (1997) Alternative projections of mortality and disability by cause 1990–
2020: global burden of disease study. Lancet 349:1498–1504
Musal M, Aktekin T (2013) Bayesian spatial modeling of HIV mortality via zero-inflated Poisson
models. Stat Med 32:267–281
Mwalili S, Lesaffre E, Declerck D (2008) The zero-inflated negative binomial regression model
with correction for misclassification: an example in caries research. Stat Methods Med Res
17:123–139
Neelon B, Ghosh P, Loebs P (2013) A spatial Poisson hurdle model for exploring geographic
variation in emergency department visits. J R Stat Soc Ser A 176:389–413
Neelon B, O’Malley A, Normand S (2010) A Bayesian model for repeated measures zero-inflated
count data with application to outpatient psychiatric service use. Stat Modelling 10:421–439
Rathbun S, Fei SL (2006) A spatial zero-inflated Poisson regression model for oak regeneration.
Environ Ecol Stat 13:409–426
Rogot E, Blackwelder WC (1979) Associations of cardiovascular mortality with weather in
Memphis, Tennessee. Public Health Rep 85:25–39
Rose G (1966) Cold weather and ischaemic heart disease. Br J Prev Soc Med 20:97–100
Spiegelhalter D, Thomas A, Best N, Lunn D (2005) WinBUGS user manual, version 1.4. MRC
Biostatistics Unit, Institute of Public Health and Department of Epidemiology & Public Health,
Imperial College School of Medicine. http://www.mrc-bsu.cam.ac.uk/bugs
Spiegelhalter DJ, Best NG, Carlin BP, Van der Linde A (2002) Bayesian measures of model
complexity and fit (with discussion). J R Stat Soc Ser B 64:583–640
Tracking Heart Disease and Stroke in Canada (2009) Public Health Agency of Canada. http://www.
phac-aspc.gc.ca/publicat/2009/cvd-avc/pdf/cvd-avs-2009-eng.pdf
Ver Hoef JM, Jansen J (2007) Space-time zero-inflated count models of harbor seals. Environmetrics
18:697–712
Vuong QH (1989) Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica
57:307–333
Woodhouse P, Khaw K, Plummer M (1993) Seasonal variation of blood pressure and its
relationship to ambient temperature in an elderly population. J Hypertens 85:1267–1274
Chapter 4
Robust Optimal Interval Design
for High-Dimensional Dose Finding
in Multi-agent Combination Trials

Ruitao Lin and Guosheng Yin

Abstract In the era of precision medicine, combination therapy is playing a
more and more important role in drug development. However, drug combinations
often lead to a high-dimensional dose searching space compared to conventional
single-agent dose finding, especially when three or more drugs are combined for
treatment. To overcome the burden of calibration of multiple design parameters,
which often intertwine with each other, we propose a robust optimal interval (ROI)
design to locate the maximum tolerated dose (MTD) in phase I clinical trials. The
optimal interval is determined by minimizing the probability of incorrect decisions
under the Bayesian paradigm. Our method only requires specification of the target
toxicity rate, which is the minimal design parameter. ROI neither imposes
any parametric assumption on the underlying distribution of the toxicity curve,
nor does it need to calibrate any other design parameters. To tackle high-dimensional
drug combinations, we develop a random-walk ROI design to identify the MTD
combination in the multi-agent dose space. Both the single- and multi-agent
ROI designs enjoy convergence properties with a large sample size. We conduct
simulation studies to demonstrate the finite-sample performance of the proposed
methods under various scenarios. The proposed ROI designs are simple and easy to
implement, while their performances are competitive and robust.

4.1 Introduction

The primary objective of phase I dose-finding trials is to determine the maximum
tolerated dose (MTD), which is typically defined as the dose with the dose-limiting
toxicity (DLT) probability closest to the target toxicity rate. Nowadays, combination
therapy is playing a more and more important role in drug development. After
demonstrating the clinical effectiveness of two agents separately, a natural follow-up

step is to evaluate their joint effects when used in combination, especially if they
target different disease pathways. In general, dose finding in two-drug combination
trials is much more complicated since the joint toxicity order of the combined doses
is only partially known. Due to the enormous amount of data from historical trials and
the emergence of precision medicine, there is a trend to combine three or more
drugs for the sake of improved efficacy as well as reduced side effects. However,
multi-agent combination brings new challenges to the phase I dose-finding design:
the dimension of the dose searching space expands multiplicatively with respect
to the number of drugs in the combination. For a three-drug combination trial,
a usual logistic model may need eight parameters to quantify the joint effect
of the combined therapy by including the main effects, and two- and three-way
interactions. More importantly, these parameters should satisfy several conditions
under the partial order constraints, which in fact become very challenging to set.
As the sample size of a phase I trial is typically small, it is difficult to estimate
a large number of unknown parameters accurately, let alone identify the
true MTDs in multi-agent dose finding.
Numerous statistical methods have been proposed for phase I single-agent dose-
finding trials, which can generally be classified as algorithm- and model-based
designs (Yin 2012). The algorithm-based methods, such as the well-known 3+3
design (Storer 1989), usually proceed based on a set of prespecified rules without
imposing any model assumption on the unknown toxicity curve. Despite simplicity
and dominance in practice, the 3+3 design has been criticized for its poor
performance (Ahn 1998). Alternatives to the 3+3 design include the accelerated
titration design (Simon et al. 1997), the biased coin design (Durham et al. 1997),
the group up-and-down design (Gezmu and Flournoy 2006), and so on. For a
comprehensive review on the algorithm-based methods, see Liu et al. (2013). By
contrast, model-based dose-finding methods typically aim to find the MTD by
estimating the toxicity curve based on an imposed parametric model. The most
prominent model-based method is the continual reassessment method (CRM) by
O’Quigley et al. (1990), which dynamically determines the possible MTD based on
the observed data. For various extensions of the CRM, see Heyd and Carlin (1999),
Leung and Wang (2002), and Yuan et al. (2007). Although the model-based designs
tend to have superior operating characteristics over the algorithm-based, Rogatko
et al. (2007) reported that only 1.6 % of the phase I cancer trials (20 of 1235 trials)
published between 1991 and 2006 used model-based designs such as the modified
CRM, while the remainder used variations of the 3+3 design.
Interval designs, which belong to the algorithm-based class, have recently
attracted enormous attention due to their simplicity and desirable properties. The
entire procedure of an interval design is guided by comparing the observed toxicity
rate (or the number of DLTs) with a prespecified toxicity tolerance interval. Yuan
and Chappell (2004) suggested utilizing an interval to determine dose escalation or
de-escalation. Ivanova et al. (2007) proposed a cumulative cohort design by modi-
fying the group up-and-down design. Ji et al. (2007) proposed a toxicity probability
interval method using penalties to determine the dose assignment, and Ji et al.
(2010) made a further modification based on the unit probability mass. However,
the specification of the tolerance interval is critical for the design performance. To
solve this problem, Liu and Yuan (2015) developed a Bayesian optimal interval
(BOIN) design by minimizing the probability of incorrect dose allocation under a
Bayesian decision-making framework. From a theoretical perspective, Oron et al.
(2011) showed that the MTD identified by an interval design converges almost
surely to one of the doses in the tolerance interval. Lin and Yin (2016) extended
BOIN to two-dimensional dose finding by comparing the posterior probability of
each dose combination falling inside the predetermined interval.
Most of the aforementioned methods require certain degrees of prespecification
of design parameters, which is crucial for the trial performance. However, very
limited literature is devoted to addressing the issues on parameter calibration. For
example, the CRM requires the prespecification of the toxicity rates (or the skeleton)
for the dose levels under consideration. Such prespecification can be arbitrary and
subjective and, as a result, the operating characteristics are sensitive to various
toxicity scenarios. To overcome the arbitrariness in the prespecification of toxicity
rates, Yin and Yuan (2009) proposed a Bayesian model averaging CRM approach,
which is robust to the misspecification of the skeleton and thus leads to competitive
trial performance. Besides the skeleton, other design specifications in the CRM
include the working model and the prior distributions of the unknown parameters,
which also affect the design properties. The BOIN design requires prespecifying two
parameters, φ1 and φ2, which denote the toxicity rates for underdosing
and overdosing, respectively. However, the performance of BOIN is sensitive to these two tuning
parameters, as they uniquely determine the optimal toxicity probability interval.
Similar to single-agent trial designs, two-agent dose finding in drug combina-
tion trials can also be classified as either algorithm- or model-based. Conaway
et al. (2004) proposed to use the pool-adjacent-violators algorithm to determine
dose allocation in drug combination trials. Ivanova and Wang (2004) applied the
Narayana design to find the MTD based on partial orders. Huang et al. (2007)
developed a two-agent 3+3 method by partitioning the dose space into separate
zones along the diagonal direction. Fan et al. (2009) proposed a three-stage 2+1+3
design. However, the dose-escalation rules in the existing algorithm-based two-
dimensional designs are rather ad hoc and typically lack theoretical support. Thus,
the performances of these methods are well below the satisfactory level.
Most of the model-based designs are developed under the CRM framework,
which continuously update the unknown parameters by assuming a certain model
for the joint toxicity surface. For example, Thall et al. (2003) considered a six-
parameter model for the joint toxicity rate of two drugs. Wang and Ivanova (2005)
proposed a log-linear working model for the dose-toxicity relationship. Yuan and
Yin (2008) applied the CRM to subtrials in a sequential order so that overly toxic
or overly safe doses can be eliminated in an efficient way. Yin and Yuan (2009)
utilized a copula-type regression method to characterize the interactive effects of
the two agents in combination. In a more general framework of 2 × 2 tables (Yin
and Yuan 2009; Yin and Lin 2014), many other copulae and bivariate binary models
can be applied to two-drug combination designs. Shi and Yin (2013) developed
a two-dimensional approach of escalation with overdose control on the basis of a
four-parameter logistic regression model. However, the number of unknown
parameters in a two- or multi-agent combination trial is relatively large in comparison to
the small sample size, such that the estimation may be unstable and the trial results
are sensitive to some prior specifications. The situation becomes worse when three
or more drugs are combined, as more unknown parameters need to be estimated
in order to characterize two-way, three-way, and four-way interactions. However,
there are very limited statistical methods for dose finding with three or more drugs
in combination.
Our research is motivated by a phase I dose-finding study of combined treatment
with mitoxantrone and genasense in patients with metastatic hormone-refractory
prostate cancer (Chi et al. 2001). One of the major goals of the prostate cancer trial
was to find the MTD of the combination therapy. To broaden application of interval
designs as well as to overcome the arbitrary specification of the design parameters φ1
and φ2 in the BOIN design, we propose a robust optimal interval (ROI) design
that only requires the specification of the target toxicity rate (the minimal design
specification for a trial). As a result, with fewer parameters to calibrate, the proposed
method is more robust to various design parameters and unknown toxicity curves.
In addition to the single-agent ROI design, we also develop a multi-agent random-
walk ROI (RW-ROI) design, which is applicable to dose finding with two or more
combined drugs. The proposed RW-ROI design adaptively searches for the MTD
using the accrued information, and it can be easily extended to high-dimensional
dose finding. We compare the RW-ROI method with existing approaches and
demonstrate its competitive and stable operating characteristics.
The rest of the paper is organized as follows. In Sect. 4.2, we propose the single-
agent ROI design, and in Sect. 4.3 we make an extension to multi-dimensional dose-
finding trials with RW-ROI. Simulation studies are conducted in Sect. 4.4 to examine
the operating characteristics of the new design as well as comparisons with existing
methods. Section 4.5 illustrates the proposed RW-ROI method with a trial example,
and Sect. 4.6 provides some concluding remarks.

4.2 Single-Agent Robust Optimal Interval Design

Consider a phase I dose-finding trial with J prespecified dose levels, whose toxicity
rates monotonically increase; that is, p1 < ··· < pJ, where pj is the true toxicity
rate at dose level j, j = 1, ..., J. Let φ be the target toxicity rate specified by the
investigator. The trial starts with treating the first cohort of patients at the lowest
dose level. Suppose the current dose level is j and the total number of patients treated
at dose level j is nj. The interval design proceeds by comparing yj, the cumulative
number of DLTs at level j, with the prespecified toxicity lower and upper boundaries
λL(nj) and λU(nj):
• If yj ≤ λL(nj), the dose for the next cohort is escalated to level j + 1.
• If yj ≥ λU(nj), the dose for the next cohort is de-escalated to level j − 1.
• If λL(nj) < yj < λU(nj), or the next dose assignment falls outside of the
prespecified dose range, the next cohort is treated at the same dose level j.
For safety, overly toxic dose levels, namely those for which Pr(pj > φ | yj) exceeds a
prespecified threshold probability and nj ≥ 3, are excluded from the trial. Based
on this safety constraint, we can obtain the dose elimination cutoffs λT(nj): if yj ≥
λT(nj), dose level j and all the higher levels are eliminated from the trial.
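As a concrete illustration, the elimination rule can be evaluated with a beta posterior for pj. The minimal sketch below assumes a Beta(1, 1) prior and a threshold of 0.95; both are natural choices rather than prescriptions of this chapter, and for φ = 0.3 the resulting cutoffs agree with the λT(nj) column of Table 4.5 for the sample sizes we checked.

from scipy.stats import beta

def eliminate(y, n, phi=0.3, threshold=0.95):
    """True if y DLTs out of n patients trigger elimination of the dose and all higher doses."""
    posterior = beta(1 + y, 1 + n - y)       # assumption: Beta(1, 1) prior on p_j
    return n >= 3 and 1 - posterior.cdf(phi) > threshold

def elimination_cutoff(n, phi=0.3, threshold=0.95):
    """Smallest DLT count that triggers elimination, i.e. lambda_T(n)."""
    return next((m for m in range(n + 1) if eliminate(m, n, phi, threshold)), None)

print([elimination_cutoff(n) for n in (3, 6, 9, 12)])    # e.g. [3, 4, 5, 7]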
To avoid arbitrary prespecifications of λL(nj) and λU(nj), Liu and Yuan (2015)
derived the lower and upper boundaries by casting the dose-finding problem in a
Bayesian hypothesis testing framework for each j,

H0 : pj = φ,    H1 : pj = φ1,    H2 : pj = φ2.

Their method requires prespecification of two design parameters, φ1 and φ2, which
are viewed as the highest toxicity rate (but subtherapeutic) such that the dose should
be escalated and the lowest toxicity rate (while still overly toxic) such that the dose
should be de-escalated, respectively. To enhance the robustness of the design as
well as to circumvent calibration of redundant parameters, we consider a hypothesis
setting with a single target rate parameter φ,

H0 : pj = φ,    H1 : pj < φ,    H2 : pj > φ,

where H0 , H1 and H2 indicate that the current dose level j is the MTD, below and
above the MTD, respectively. Under the Bayesian paradigm, we assume the three
hypotheses are a priori equally probable, i.e., Pr(H0) = Pr(H1) = Pr(H2) = 1/3.
Under the composite alternatives H1 and H2 , we assign noninformative uniform
prior distributions for pj ,

pj | H1 ~ Unif(0, φ)    and    pj | H2 ~ Unif(φ, 1),

while the prior distribution under H0 is a point mass at φ. Based on the accumulated
data at dose level j, the posterior probability of each hypothesis, πkj, is given by

πkj ≡ Pr(Hk | yj) = Pr(Hk) Pr(yj | Hk) / Σ_{k′=0}^{2} Pr(Hk′) Pr(yj | Hk′),    k = 0, 1, 2,

where the marginal likelihood Pr(yj | Hk) can be obtained by integrating out the
parameter pj with respect to its prior f(pj | Hk),

Pr(yj | Hk) ∝ ∫ pj^{yj} (1 − pj)^{nj − yj} f(pj | Hk) dpj,    k = 0, 1, 2.
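A minimal numerical sketch of these posterior probabilities, using the stated priors (a point mass at φ under H0 and uniform densities under H1 and H2, with equal prior weights), is given below; the binomial coefficient cancels and is therefore omitted.

from scipy import integrate

def hypothesis_posteriors(y, n, phi=0.3):
    """Return (pi_0j, pi_1j, pi_2j) after y DLTs among n patients at the current dose."""
    kernel = lambda p: p**y * (1 - p)**(n - y)
    m0 = kernel(phi)                                                   # Pr(y | H0)
    m1 = integrate.quad(lambda p: kernel(p) / phi, 0, phi)[0]          # Pr(y | H1), f(p|H1) = 1/phi
    m2 = integrate.quad(lambda p: kernel(p) / (1 - phi), phi, 1)[0]    # Pr(y | H2), f(p|H2) = 1/(1-phi)
    total = m0 + m1 + m2                                               # equal prior weights cancel
    return m0 / total, m1 / total, m2 / total

print(hypothesis_posteriors(y=1, n=3))   # H0 has the largest posterior probability here, so the dose would stay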
Given the accumulated data yj, the posterior probability of making incorrect
decisions is formulated as

Pr(Incorrect | yj) = π0j Pr(E or D | H0) + π1j Pr(S or D | H1) + π2j Pr(S or E | H2)
                   = π0j Pr(yj ≤ λL(nj) or yj ≥ λU(nj) | H0)
                     + π1j Pr(yj > λL(nj) | H1) + π2j Pr(yj < λU(nj) | H2),      (4.1)

where E, D and S stand for “Escalation”, “De-escalation” and “Stay”, respectively.
We can also take a weighting scheme to penalize more for de-escalation under H1
than staying at the same dose and for escalation under H2 than staying. If it is
further assumed that escalation is more dangerous than de-escalation, we can assign
asymmetric weights or penalties for escalation and de-escalation under H0 . The ROI
design aims to minimize the probability of making incorrect decisions at each step,
and the optimal interval boundaries for yj can be derived as
λL(nj) = max{ m : φ^m (1 − φ)^{nj − m} / ∫_0^φ p^m (1 − p)^{nj − m} f(p | H1) dp ≤ 1 },
λU(nj) = min{ m : ∫_φ^1 p^m (1 − p)^{nj − m} f(p | H2) dp / [φ^m (1 − φ)^{nj − m}] ≥ 1 },      (4.2)

which in fact do not depend on yj. We can see from (4.2) that the escalation rules
of ROI are equivalent to escalating the dose if π1j > π0j, and de-escalating the
dose if π2j > π0j. Let φL(nj) = λL(nj)/nj and φU(nj) = λU(nj)/nj, which are the
boundaries for the toxicity rate.
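The boundaries in (4.2) are straightforward to compute numerically. The sketch below is an illustration only, not the software used for the reported simulations; for φ = 0.3 it matches the λL(nj) and λU(nj) values of Table 4.5 (Sect. 4.5.2) for the sample sizes we checked.

from scipy import integrate

def roi_interval(n, phi=0.3):
    """Return (lambda_L(n), lambda_U(n)) for n patients at the current dose level."""
    def kernel(m, p):
        return p**m * (1 - p)**(n - m)
    def marg_H1(m):    # marginal likelihood of m DLTs under H1: p ~ Unif(0, phi)
        return integrate.quad(lambda p: kernel(m, p) / phi, 0, phi)[0]
    def marg_H2(m):    # marginal likelihood of m DLTs under H2: p ~ Unif(phi, 1)
        return integrate.quad(lambda p: kernel(m, p) / (1 - phi), phi, 1)[0]
    lam_L = max(m for m in range(n + 1) if kernel(m, phi) <= marg_H1(m))
    lam_U = min(m for m in range(n + 1) if marg_H2(m) >= kernel(m, phi))
    return lam_L, lam_U

# Escalate if y_j <= lambda_L(n_j), de-escalate if y_j >= lambda_U(n_j), otherwise stay.
print([roi_interval(n) for n in (3, 6, 9, 12)])   # [(0, 2), (1, 4), (1, 5), (2, 6)]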
Theorem 1 The values of φL(nj) and φU(nj) converge to φ almost surely, as nj → ∞.
The proof of Theorem 1 is based on the consistency of the posterior probability
of Hk as nj → ∞, which is straightforward and thus omitted. It indicates that the
optimal interval would shrink to the target toxicity rate as the sample size increases.
Several remarks are in place for comparisons between the ROI and BOIN
designs. First, πkj in (4.1) of BOIN is the prior probability of each hypothesis Hk, and
thus BOIN is developed based on the prior information at the trial planning stage,
while ROI aims to control the incorrect decisions based on the posterior distribution
using the accrued information. Second, BOIN requires prespecification of φ1 and φ2,
while there is no theoretical guidance for the selection of the two values. In addition,
the interpretations of φ1 and φ2 as well as the optimal intervals produced by BOIN
are somewhat counterintuitive: φ1 and φ2 are claimed to be the highest and lowest
toxicity rates corresponding to escalation and de-escalation respectively, but the trial
is conducted using the derived optimal boundaries, which lie inside (φ1, φ2), instead
of using φ1 and φ2 directly. By contrast, there is no ambiguity for ROI, since it only
requires the specification of φ, the target toxicity rate. Last but most importantly, the
limiting interval of BOIN depends on the values of φ1 and φ2 and does not shrink to
the target toxicity rate φ with an increasing sample size. As a result, BOIN would
randomly locate one of the dose levels that lie inside the limiting optimal interval,
while ROI converges almost surely to the true MTD because its optimal interval
indeed shrinks to the target.

4.3 Multi-agent Robust Optimal Interval Design

4.3.1 Combining Two Drugs

The decisions of dose escalation, de-escalation or retention based on the ROI
design only depend on the accumulated information at the current dose level, and
thus can be applied to a multi-agent combination trial in a straightforward way.
However, there are up to eight adjacent dose levels at a typical location in the two-
dimensional dosing space and the toxicity orders are partially known. To determine
an appropriate dose assignment, we propose a random walk rule to assign each
new cohort of patients to the level that has the maximum posterior probability of
being the MTD. More specifically, we consider combining J dose levels of drug A
and K levels of drug B in a two-dimensional dose-finding study. Let pjk denote the
toxicity probability of the two agents at dose level . j; k/, j D 1; : : : ; J; k D 1; : : : ; K.
Suppose the current dose combination level is .j; k/, and we define an admissible
escalation set as

AE = {(j + 1, k), (j, k + 1)},

and an admissible de-escalation set as

AD = {(j − 1, k), (j, k − 1)},

as shown in Fig. 4.1. The admissible dose escalation/de-escalation set only contains
the dose levels by upgrading or downgrading one drug by one dose level while fixing
the level of the other drug. We exclude the dose levels that are out of the dose range
from the admissible dose escalation/de-escalation set. For example, when j = 1,
the dose level (j − 1, k) should be excluded from the dose de-escalation set. The
random-walk robust optimal interval (RW-ROI) design begins with treating the first
cohort at the lowest dose combination (1, 1). Based on the cumulative number of
DLTs observed at dose level (j, k), yjk, the dose level for the next cohort of patients
is determined as follows:
1. If yjk ≤ λL(njk), escalate to the dose level in the admissible escalation set
which has the largest posterior probability Pr(H0 | yj′k′), (j′, k′) ∈ AE. If the
admissible escalation set contains untried dose levels (i.e., nj′k′ = 0), we set
Pr(H0 | yj′k′) = 1, which facilitates exploring the untried dose levels as well
as preventing the trial from being trapped in suboptimal doses.
Fig. 4.1 Admissible sets for dose escalation or de-escalation in the RW-ROI design

2. If yjk ≥ λU(njk), de-escalate to the dose level in the admissible de-escalation set
which has the largest posterior probability Pr(H0 | yj′k′), (j′, k′) ∈ AD. Similarly,
we take Pr(H0 | yj′k′) = 1 for the untried admissible dose levels.
3. Otherwise, if λL(njk) < yjk < λU(njk), the doses stay at the same level (j, k).
During the process of dose escalation and de-escalation, if there exist multiple
optimal dose levels, we randomly choose one with equal probability. The trial
continues until the total sample size is exhausted. Additionally, if the most recent
patients are treated at the lowest dose level (1, 1) and y11 ≥ λU(n11), the next dose
stays at the same dose level. Symmetrically, if the current dose level is the highest
dose level (J, K) and yJK ≤ λL(nJK), we still treat the next cohort of patients at the
same dose level.
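To make the algorithm concrete, the following sketch implements a single RW-ROI dose-assignment step for two agents, reusing roi_interval() and hypothesis_posteriors() from the sketches above. The bookkeeping (dictionaries keyed by dose pairs, random tie-breaking) is an illustration only, and the dose-elimination safety rule is omitted for brevity.

import random

def next_dose(current, y, n, J, K, phi=0.3):
    """current = (j, k); y and n are dicts of cumulative DLT counts and patient numbers per tried dose pair."""
    j, k = current
    lam_L, lam_U = roi_interval(n[current], phi)
    if y[current] <= lam_L:          # escalation: admissible set A_E
        candidates = [d for d in ((j + 1, k), (j, k + 1)) if d[0] <= J and d[1] <= K]
    elif y[current] >= lam_U:        # de-escalation: admissible set A_D
        candidates = [d for d in ((j - 1, k), (j, k - 1)) if d[0] >= 1 and d[1] >= 1]
    else:
        return current               # stay at the same dose level
    if not candidates:               # boundary of the dose space: stay
        return current
    def score(d):                    # Pr(H0 | data), set to 1 for untried dose levels
        return 1.0 if n.get(d, 0) == 0 else hypothesis_posteriors(y[d], n[d], phi)[0]
    best = max(score(d) for d in candidates)
    return random.choice([d for d in candidates if score(d) == best])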

4.3.2 Combining Three or More Drugs

Existing two-agent dose-finding methods can hardly be extended to the cases with
three or more drugs combined. By contrast, the proposed random walk rule is
suitable for any arbitrary number of drugs. For illustration, we consider a three-
dimensional dose-finding study that combines J dose levels of drug A, K levels of
drug B and L levels of drug C.
Suppose that yjkl out of njkl patients have experienced the DLT at the current
dose level (j, k, l), whose true toxicity probability is pjkl. As before, we define an
admissible escalation set by increasing one dose level of one drug while fixing the
other two,

AE = {(j + 1, k, l), (j, k + 1, l), (j, k, l + 1)}.


Similarly, the admissible de-escalation set is defined by decreasing one dose level
of one drug while fixing the other two,

AD = {(j − 1, k, l), (j, k − 1, l), (j, k, l − 1)}.

Following similar rules as the double-agent design, the RW-ROI for a triple-agent
trial begins with treating the first cohort of patients at the lowest dose combination
(1, 1, 1). Based on the cumulative number of DLTs observed at dose level (j, k, l),
yjkl, the dose level for the next cohort is determined as follows:
1. If yjkl ≤ λL(njkl), escalate to the dose level in the admissible escalation set
which has the largest posterior probability Pr(H0 | yj′k′l′), (j′, k′, l′) ∈ AE. If the
admissible escalation set contains untried dose levels (i.e., nj′k′l′ = 0), we set
Pr(H0 | yj′k′l′) = 1, which facilitates exploring the untried dose levels as well
as preventing the trial from being trapped in suboptimal doses.
2. If yjkl ≥ λU(njkl), we de-escalate to the dose level that lies inside the admissible
de-escalation set and also has the largest posterior probability Pr(H0 | yj′k′l′),
(j′, k′, l′) ∈ AD. Similarly, we take Pr(H0 | yj′k′l′) = 1 for the untried admissible
dose levels.
3. Otherwise, if λL(njkl) < yjkl < λU(njkl), the doses stay at the same level (j, k, l).
The trial continues until the total sample size is exhausted. During the process
of dose escalation and de-escalation, if there exist multiple optimal dose levels, we
randomly choose one with equal probability.
Such an algorithm can be straightforwardly extended to the drug-combination
trial with more than three drugs. Suppose the current dose level is (j, k, l, . . .), and
then the admissible escalation set is

AE = {(j + 1, k, l, . . .), (j, k + 1, l, . . .), (j, k, l + 1, . . .), . . .},

and the admissible de-escalation set is

AD = {(j − 1, k, l, . . .), (j, k − 1, l, . . .), (j, k, l − 1, . . .), . . .}.

The dose-finding rules remain unchanged.
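The admissible sets generalize directly to any number of combined drugs; a short construction, given purely as an illustration, is sketched below.

def admissible(current, direction, dims):
    """Admissible escalation (direction = +1) or de-escalation (direction = -1) set for
    a dose combination `current`, given `dims`, the number of levels of each drug."""
    step = lambda i: current[:i] + (current[i] + direction,) + current[i + 1:]
    return [step(i) for i in range(len(current)) if 1 <= current[i] + direction <= dims[i]]

print(admissible((2, 1, 2), +1, dims=(4, 3, 2)))   # [(3, 1, 2), (2, 2, 2)]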


After the trial by RW-ROI is completed, we perform the isotonic regression so
that the estimated toxicity rates satisfy partial ordering of the toxicity rates when
allowing only the dose of one drug to change and fixing the other drugs at certain
levels. Specifically, in a three-agent trial, we perform three-dimensional isotonic
regression (Dykstra and Robertson 1982) to the estimated toxicity rates p̂jkl, and let
p̃jkl denote the trivariate isotonic regression estimator. The MTD (j*, k*, l*) is finally
selected as the dose level whose toxicity rate p̃j*k*l* is closest to the target φ:

(j*, k*, l*) = arg min_{(j,k,l)∈N} | p̃jkl − φ |,


where the set N = {(j, k, l) : njkl > 0} contains all the tested dose levels.
When there are ties for p̃j*k*l* on the same row, the same column, or the same
layer, the highest dose combination satisfying p̃j*k*l* < φ, or the lowest dose
combination satisfying p̃j*k*l* > φ, is finally selected as the MTD. However, it is
difficult to distinguish the ties when they lie on different rows, columns, or layers,
e.g., (j + 1, k − 1, l) and (j − 1, k + 1, l). In this case, we select the one that has
the largest value of Pr(H0 | yj*k*l*), which is approximately equivalent to selecting the dose
combination that has been tested with more patients.
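A simple sketch of this final selection step follows; it takes the isotonized estimates p̃ as given (the multi-dimensional isotonic regression of Dykstra and Robertson 1982 is not re-implemented here) and breaks exact ties arbitrarily, whereas the text above describes the more refined tie-breaking rules. The numbers in the usage line are hypothetical.

def select_mtd(p_tilde, n, phi=0.3):
    """p_tilde and n are dicts keyed by dose-level tuples; only tested levels are eligible."""
    tested = [d for d in p_tilde if n.get(d, 0) > 0]
    # pick the dose whose isotonized toxicity estimate is closest to the target phi
    return min(tested, key=lambda d: abs(p_tilde[d] - phi))

mtd = select_mtd({(1, 1, 1): 0.05, (2, 1, 1): 0.18, (2, 2, 1): 0.31},
                 {(1, 1, 1): 3, (2, 1, 1): 6, (2, 2, 1): 9})
print(mtd)   # (2, 2, 1)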
Similar to the single-agent ROI design, the RW-ROI design has desirable large-
sample properties. Based on the accrued information in the trial, it can be shown
that the estimates of the posterior probabilities Pr(H0 | yjkl) in the RW-ROI design
would converge to their true values (either 1 or 0). Thus, RW-ROI would adaptively
assign patients to the dose level that is closer to the MTD instead of being trapped
in a local neighborhood, and the dose assignment converges to the MTD.

4.4 Simulation Study

4.4.1 Single-Agent ROI Versus BOIN

First, we conduct a simulation study of the single-agent ROI design with a
comparison to the BOIN design in terms of the operating characteristics. The trial
under consideration consists of eight dose levels with a target toxicity rate φ = 0.3.
The total sample size planned is 30 and patients are assigned in cohorts of size 3. For
the BOIN method, we consider three paired values for (φ1, φ2): the default interval
(φ1, φ2) = (0.6φ, 1.4φ) recommended by Liu and Yuan (2015), the narrow interval
(φ1, φ2) = (0.8φ, 1.2φ), and the wide interval (φ1, φ2) = (0.5φ, 1.5φ). In addition,
we impose a safety constraint by setting the threshold probability to 0.95 for both methods.
Table 4.1 shows the simulation results under three toxicity scenarios and each
scenario is replicated for 1000 times. In scenario 1, the seventh dose is the MTD, and
the MTD selection probabilities of BOIN using the three intervals are very different.
In particular, the BOIN design with the narrow interval has the lowest selection
percentage, while that based on the default interval is the best. Under scenario 2,
all three BOIN designs perform similarly, while the default one behaves slightly
worse. For scenario 3, the default and wide interval BOIN designs perform much
worse than the narrow interval BOIN. By contrast, the proposed ROI design does not
depend on any extra design parameters, such as φ1 and φ2, and it tends to perform
comparably with the best of the three BOIN designs under each scenario. These
findings suggest that the prespecified interval indeed plays a critical role in the BOIN
design, and the performance could be much compromised if the interval is chosen
inappropriately. However, it is difficult, if not impossible, to justify which interval
is more sensible in the trial planning stage.
Table 4.1 Comparison between BOIN and ROI for single-agent trials under three toxicity
scenarios with a target toxicity rate of 0.3
Recommendation percentage at dose level Average Average
Design 1 2 3 4 5 6 7 8 # DLTs # patients
Scenario 1          0.01 0.02 0.03 0.05 0.08 0.13 0.30 0.50
BOIN(0.6φ, 1.4φ)    0.0  0.0  0.0  0.3  2.0  25.3 53.0 19.4    3.9   30.0
# patients          3.1  3.2  3.3  3.7  4.1  5.3  5.3  2.1
BOIN(0.8φ, 1.2φ)    0.0  0.0  0.1  1.3  5.5  38.5 40.5 14.1    3.1   30.0
# patients          3.3  3.5  3.8  4.4  4.8  5.7  3.1  1.4
BOIN(0.5φ, 1.5φ)    0.0  0.0  0.0  0.6  3.3  27.9 50.8 17.4    3.8   30.0
# patients          3.1  3.2  3.4  3.8  4.2  5.3  5.1  2.0
ROI                 0.0  0.0  0.0  0.8  3.9  25.1 52.8 17.4    3.8   30.0
# patients          3.1  3.2  3.4  3.8  4.1  5.1  5.3  2.1
Scenario 2          0.15 0.30 0.42 0.55 0.65 0.68 0.70 0.80
BOIN(0.6φ, 1.4φ)    23.9 51.8 21.0 2.5  0.0  0.0  0.0  0.0     8.4   29.8
# patients          10.1 12.8 5.7  1.1  0.1  0.0  0.0  0.0
BOIN(0.8φ, 1.2φ)    20.9 55.0 20.5 2.8  0.2  0.0  0.0  0.0     7.8   29.9
# patients          12.6 11.9 4.5  0.9  0.0  0.0  0.0  0.0
BOIN(0.5φ, 1.5φ)    25.3 55.1 16.6 2.2  0.0  0.0  0.0  0.0     8.1   29.8
# patients          10.7 13.2 5.0  0.9  0.0  0.0  0.0  0.0
ROI                 23.1 54.6 18.8 2.7  0.0  0.0  0.0  0.0     8.4   29.8
# patients          9.6  13.5 5.6  1.0  0.1  0.0  0.0  0.0
Scenario 3          0.10 0.15 0.22 0.30 0.38 0.46 0.55 0.60
BOIN(0.6φ, 1.4φ)    1.6  10.5 31.0 32.4 17.6 5.8  0.9  0.0     6.6   30.0
# patients          4.8  6.7  8.3  6.3  2.8  0.9  0.1  0.0
BOIN(0.8φ, 1.2φ)    1.5  13.5 29.1 38.8 13.4 2.9  0.4  0.0     5.8   29.9
# patients          6.8  8.5  7.6  5.1  1.5  0.4  0.0  0.0
BOIN(0.5φ, 1.5φ)    2.5  12.0 33.3 31.8 15.3 4.3  0.6  0.0     6.4   30.0
# patients          5.2  7.3  8.4  5.8  2.4  0.7  0.1  0.0
ROI                 2.8  11.6 32.1 33.5 14.5 4.6  0.7  0.0     6.5   30.0
# patients          5.1  7.0  8.4  6.1  2.5  0.7  0.1  0.0
BOIN stands for the Bayesian optimal interval design, and the values in the parentheses are the
prespecified design parameters φ1 and φ2 in BOIN; ROI is the proposed robust optimal interval
design

4.4.2 Double-Agent RW-ROI Versus Model-Based Designs

For dose finding with two drugs in combination, we investigate the performance
of the proposed RW-ROI design with comparisons to four existing model-based
methods that are described as follows:
(1) Two-dimensional escalation with overdose control (TEWOC): Shi and Yin
(2013) proposed a TEWOC design for dose finding on the basis of a four-
parameter logistic regression model, under which the joint toxicity probability
at dose level (j, k) is given by

pjk = exp(β0 + β1 dAj + β2 dBk + β3 dAj dBk) / [1 + exp(β0 + β1 dAj + β2 dBk + β3 dAj dBk)],      (4.3)

where dAj and dBk are the dosages of the two agents in combination. The
assignment of the next dose level is based on the estimated MTD distribution
with respect to a prespecified quantile level a, which is set as a = 0.25.
In the simulation study, we consider (dA1, dA2, dA3) = (0.1, 0.2, 0.3) and
(dB1, . . . , dB5) = (0.1, 0.2, 0.3, 0.4, 0.5), and assign noninformative priors to the
unknown parameters: β0 ~ N(0, 2), β1, β2, β3 ~ Gamma(4, 0.8).


(2) Copula-type method: To model the toxicity surface, Yin and Yuan (2009)
proposed a copula-type regression method to capture drug–drug interactions.
Specifically, they used a Clayton copula regression function, which is given by

pjk = 1 − {(1 − aj^α)^(−γ) + (1 − bk^β)^(−γ) − 1}^(−1/γ),

where α, β, γ > 0 are unknown parameters, and aj (j = 1, . . . , J) and bk (k =
1, . . . , K) are the prespecified toxicity probabilities for each dose level of drug A
and drug B, respectively. The dose escalation decision is based on the posterior
probability given the cumulative data D, Pr(pjk < φ | D), and two prespecified
cutoffs ce and cd: if Pr(pjk < φ | D) > ce, the dose is escalated to an adjacent dose
combination with its toxicity rate higher than the current value as well as closest
to the target rate; similarly, if Pr(pjk < φ | D) < cd, the dose is de-escalated for
the next cohort of patients; otherwise, the current dose combination stays at
the same level. We set ce = 0.7 and cd = 0.55 to direct dose escalation and
de-escalation, respectively. (A brief numerical sketch of the logistic and copula-type
toxicity surfaces is given after this list.)
We take an even partition from 0 to 0.3 for both the aj's and bk's:
(a1, a2, a3) = (0.1, 0.2, 0.3), (b1, . . . , b5) = (0.06, 0.12, 0.18, 0.24, 0.3). We
specify Gamma(2, 2) as the prior distribution for α and β, and a relatively
noninformative Gamma(0.1, 0.2) as the prior distribution for γ.
(3) Log-linear model: Wang and Ivanova (2005) utilized a log-linear working
model for the dose–toxicity relationship in drug-combination trials:

pjk = 1 − (1 − aj^α)(1 − bk^β) exp{−γ log(1 − aj) log(1 − bk)},

where α, β, γ > 0. For comparison, all the trial specifications under the log-
linear model are identical to those in the copula-type method.
(4) Logistic model: We make a further comparison of the proposed method with
the logistic model in (4.3) while keeping the dose allocation rule the same as
the copula-type method.
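For concreteness, the sketch below (referred to in item (2) above) evaluates two of the quoted working models on a 3 × 5 grid: the four-parameter logistic model in (4.3) and the Clayton copula-type surface. The parameter values are arbitrary illustrations rather than fitted estimates, and these are not the comparator implementations used in the simulations.

import numpy as np

dA = np.array([0.1, 0.2, 0.3])                    # standardized doses of drug A
dB = np.array([0.1, 0.2, 0.3, 0.4, 0.5])          # standardized doses of drug B

def logistic_surface(b0, b1, b2, b3):
    """p_jk under the four-parameter logistic model (4.3)."""
    eta = b0 + b1 * dA[:, None] + b2 * dB[None, :] + b3 * dA[:, None] * dB[None, :]
    return 1 / (1 + np.exp(-eta))

a = np.array([0.10, 0.20, 0.30])                  # prespecified toxicity probabilities, drug A
b = np.array([0.06, 0.12, 0.18, 0.24, 0.30])      # prespecified toxicity probabilities, drug B

def clayton_surface(alpha, beta, gamma):
    """p_jk = 1 - {(1 - a_j^alpha)^(-gamma) + (1 - b_k^beta)^(-gamma) - 1}^(-1/gamma)."""
    A = (1 - a[:, None] ** alpha) ** (-gamma)
    B = (1 - b[None, :] ** beta) ** (-gamma)
    return 1 - (A + B - 1) ** (-1 / gamma)

print(np.round(logistic_surface(-3.0, 2.0, 2.0, 1.0), 3))   # illustrative, arbitrary coefficients
print(np.round(clayton_surface(1.0, 1.0, 1.0), 3))          # illustrative, arbitrary copula parameters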
Table 4.2 Ten toxicity scenarios for two-drug combinations with a target toxicity probability of
30 %. The MTDs are in boldface
                             Drug B
Dose level   1    2    3    4    5           1    2    3    4    5
             Scenario 1                      Scenario 2
        3    0.15 0.30 0.45 0.50 0.65        0.30 0.50 0.60 0.65 0.75
        2    0.10 0.15 0.30 0.45 0.50        0.15 0.30 0.45 0.52 0.60
        1    0.05 0.10 0.15 0.30 0.45        0.05 0.10 0.12 0.15 0.30
             Scenario 3                      Scenario 4
        3    0.10 0.15 0.30 0.45 0.55        0.12 0.15 0.17 0.30 0.50
        2    0.06 0.10 0.15 0.30 0.45        0.06 0.08 0.15 0.20 0.45
        1    0.04 0.06 0.10 0.15 0.30        0.02 0.06 0.10 0.15 0.30
             Scenario 5                      Scenario 6
Drug A  3    0.40 0.42 0.48 0.55 0.60        0.15 0.30 0.45 0.50 0.60
        2    0.30 0.40 0.43 0.48 0.55        0.08 0.12 0.15 0.30 0.45
        1    0.15 0.30 0.40 0.45 0.50        0.04 0.06 0.10 0.12 0.15
             Scenario 7                      Scenario 8
        3    0.50 0.60 0.70 0.75 0.80        0.08 0.15 0.45 0.60 0.70
        2    0.10 0.30 0.45 0.60 0.70        0.05 0.20 0.30 0.45 0.70
        1    0.06 0.10 0.15 0.30 0.40        0.02 0.10 0.15 0.40 0.50
             Scenario 9                      Scenario 10
        3    0.15 0.30 0.40 0.60 0.70        0.70 0.75 0.80 0.85 0.90
        2    0.02 0.05 0.08 0.12 0.15        0.45 0.55 0.60 0.65 0.70
        1    0.01 0.02 0.03 0.04 0.10        0.05 0.08 0.20 0.30 0.40

We compare the RW-ROI design with the four model-based methods in terms of the
operating characteristics under the 10 scenarios in Table 4.2, which involve various
numbers and locations of the MTDs. We take the maximum sample size to be 60
with a cohort size of 3, and the target toxicity probability is set at 0.3. To ensure
comparability across different methods, we do not impose any early stopping rule
so that we run the entire trial till the exhaustion of the maximum sample size. We
simulate 1000 replications for each scenario.
Table 4.3 presents the simulation results of our proposed RW-ROI design in
conjunction with those of existing model-based methods, which include three
performance statistics: the percentage of MTD selection, the percentage of patients
allocated at the MTDs, and the average number of DLTs. Among the four model-
based designs considered, the logistic method performs the best, with its MTD
selection and patient allocation percentages substantially greater than those of the
other model-based designs under scenarios 1, 3, 6, and 8. The log-linear method and
Clayton method have similar performance under the 10 scenarios. The performance
of the model-based methods is sensitive to the MTD locations. For example,
Table 4.3 Comparisons of the proposed two-dimensional RW-ROI design with the model-based
methods under ten scenarios with a target toxicity rate of 0.3. The best performance statistics are
in boldface
Scenarios
Method 1 2 3 4 5 6 7 8 9 10
Percentage of MTD selections
TEWOC       64.4 44.4 73.5 45.8 60.2 38.8 59.4 31.5 12.8 39.3
Logistic    74.7 52.7 76.2 43.0 63.8 56.4 57.1 43.8 5.1  22.6
Log-linear  59.1 56.8 59.1 32.7 64.2 47.0 52.1 27.3 15.8 32.5
Clayton     58.9 50.3 59.6 32.8 63.1 40.0 47.0 24.1 17.4 28.2
RW-ROI      63.4 66.2 67.9 52.5 55.3 51.7 58.2 31.5 35.2 33.5
Percentage of patients allocated at the MTDs
TEWOC       47.8 36.1 42.8 26.1 57.7 31.3 36.7 24.6 7.8  23.4
Logistic    48.2 32.4 47.3 25.4 43.2 32.6 35.4 27.2 4.6  12.9
Log-linear  44.7 36.1 42.6 24.0 50.4 30.3 41.8 16.7 13.1 17.4
Clayton     40.1 37.6 40.7 25.4 43.3 28.8 34.4 16.5 14.2 9.6
RW-ROI      42.5 42.6 40.2 26.4 43.6 31.9 37.3 19.2 21.8 15.2
Average number of DLTs
TEWOC       15.6 17.4 14.2 13.2 18.5 15.4 18.1 16.0 16.1 19.8
Logistic    17.3 17.4 16.4 15.7 19.0 16.5 17.8 17.3 16.3 17.8
Log-linear  16.2 16.3 15.2 14.6 19.4 14.9 16.5 16.4 14.7 17.3
Clayton     16.1 15.9 15.4 15.0 17.2 15.2 16.5 15.2 15.3 16.6
RW-ROI      16.6 17.4 15.2 14.8 18.7 15.4 18.2 16.9 14.9 18.9
TEWOC stands for the two-dimensional escalation with overdose control method, and RW-ROI
represents the random-walk robust optimal interval design

under scenarios 1–3, where three MTDs exist, the MTD selection percentages
under the TEWOC and logistic methods vary over a range of more than 20 % due to
different MTD locations. Similarly, the selection percentages of the four model-
based methods vary from 32 % to 64 % under scenarios 4–7, which have two
MTDs. These findings demonstrate that the model-based methods are not robust.
In addition, we find that the TEWOC and logistic models are also sensitive to the
design calibration parameters. By contrast, the MTD selection percentages based
on the RW-ROI design are more stable with respect to various toxicity scenarios.
For the first seven scenarios, the RW-ROI design has an average selection percentage
of 60 %, with improvements of between 5 % and 20 % over the log-linear and Clayton
methods. Scenarios 2, 4, and 7 are difficult ones because their toxicity surfaces are quite
irregular, and for these the proposed RW-ROI design has the best performance among
all the methods. When the MTD is unique, in scenarios 8–10, the proposed design is
also comparable with the model-based methods. Similar conclusions can be made
with respect to the percentage of patients allocated at the MTDs. The five designs
have similar operating characteristics in terms of the average number of DLTs.
Fig. 4.2 Relationship between the sample size (number of patients) and the percentage of MTD
selections for the RW-ROI design under the 10 scenarios in Table 4.2 (left panel: scenarios 1–5;
right panel: scenarios 6–10)

To examine the limiting performance of the proposed methods, we increase the
maximum sample size of the simulated trials. Figure 4.2 presents the trends of
the percentages of MTD selection with respect to an increasing sample size under
the ten scenarios in Table 4.2. Clearly, the performance of the proposed design
continuously improves by accumulating more data and would not be trapped in
any suboptimal doses. In general, the more MTDs in the two-dimensional space,
the higher the selection percentage. In scenarios 2 and 3, the percentages of MTD
selection increase from 40 % to 80 % as the sample size is enlarged from 40 to 180.
In scenarios 8 and 10, where only one MTD exists, the MTD selection percentages
using RW-ROI can still improve substantially as the sample size increases.

4.4.3 Triple-Agent RW-ROI

To investigate the operating characteristics of RW-ROI in multi-agent combination
trials, we expand the dosing space to three dimensions. Specifically, we consider
four dose levels of drug A, three levels of drug B, and two levels of drug C.
Therefore, there are 24 combination dose levels in total. Two scenarios with a
target toxicity rate of 0.3 are provided in Table 4.4, where four MTDs exist under
each scenario. The sample size is 90 patients with 3 patients in a cohort. Based on
5000 replications, the percentage of MTD selection for scenario 1 is 62.2 %, and
on average 41.4 % patients are allocated to the MTDs by RW-ROI. For scenario 2,
RW-ROI also achieves a 62.4 % correct selection percentage, and allocates 38.1 %
of the patients to the MTDs. The average numbers of DLTs are 27.8 and 26.3 under
scenarios 1 and 2, respectively, which are very close to the expectation as if all the 90
Table 4.4 Two toxicity scenarios for three-drug combinations with a target toxicity probability of
30 %. The MTDs are in boldface
                                Drug A
Dose level    1    2    3    4            1    2    3    4
              Scenario 1                  Scenario 2
              Drug C: Level 1             Drug C: Level 1
        3     0.15 0.30 0.45 0.60         0.12 0.15 0.30 0.50
        2     0.05 0.15 0.30 0.45         0.06 0.08 0.20 0.45
Drug B  1     0.01 0.05 0.10 0.15         0.02 0.06 0.15 0.30
              Drug C: Level 2             Drug C: Level 2
        3     0.45 0.55 0.65 0.80         0.17 0.30 0.45 0.65
        2     0.15 0.30 0.45 0.65         0.15 0.18 0.30 0.55
        1     0.05 0.15 0.30 0.45         0.10 0.15 0.18 0.45

patients were allocated to the MTDs (with the target toxicity rate of 0.3). The triple-
agent simulation results demonstrate that RW-ROI also has a desirable and robust
performance in multi-agent combination trials. With an even higher dimension,
RW-ROI is expected to still perform well and its implementation is simple and
straightforward.

4.5 Illustrative Example

4.5.1 Prostate Cancer Trial

For patients with metastatic hormone-refractory prostate cancer, mitoxantrone
has been demonstrated to be an active agent, but its prostate-specific antigen
response rate is low. Genasense is a phosphorothioate antisense oligonucleotide
complementary to the bcl-2 mRNA open reading frame, which contributes to
inhibiting expression of bcl-2, delaying androgen independence as well as enhanc-
ing chemosensitivity in prostate and other cancer models. As a result, a phase
I dose-finding study of combined treatment with mitoxantrone and genasense is
considered to meet the need for more effective treatment of the prostate cancer (Chi
et al. 2001). The goals of the trial were to evaluate the safety and biological effect of
the combination of genasense and mitoxantrone, and to determine the preliminary
antitumor activity. Specifically, three doses (4, 8, and 12 mg/m2) of mitoxantrone
and five doses (0.6, 1.2, 2.0, 3.1, 5.0 mg/kg) of genasense were investigated in this
trial. To identify the MTD combination, the trial selected seven combination doses:
(mitoxantrone, genasense) = (4, 0.6), (4, 1.2), (4, 2.0), (4, 3.1), (8, 3.1), (12, 3.1),
(12, 5.0), and applied the modified 3+3 dose escalation scheme. However, the
chosen dose pairs from the two-dimensional space are arbitrary, so that the true
MTD might have been excluded. Due to the limitation of the 3+3 design, only

one MTD can be identified in the trial, even though multiple MTDs may exist in
the drug-combination space. In addition, the 3 C 3 design does not even guarantee
the recommended MTD is correct. This example demonstrates the need for a more
effective dose-finding design in drug-combination trials.

4.5.2 Trial Illustration

For illustration, we apply the proposed RW-ROI design to the aforementioned
prostate cancer trial. As described previously, the trial examined 3 dose levels
of mitoxantrone and 5 dose levels of genasense, which results in a 3 × 5 drug-
combination space. The target toxicity rate is φ = 0.3, the cohort size is set as 3,
and 20 cohorts are planned for the trial. Based on the formulae in (4.2), the optimal
boundaries for the RW-ROI design are given in Table 4.5.
In addition, we impose a safety rule by setting the threshold probability to 0.95. The first
cohort of patients is treated at the lowest dose level (1, 1). Figure 4.3 shows the path
of the dose assignments for the subsequent cohorts, from which we can see that the
RW-ROI design can search for the MTD adaptively and treat most of the patients at
the right dose level. Specifically, three DLTs are observed for the 8th cohort at dose
level (3, 3), which is beyond the dose elimination cutoff. Therefore, dose level
(3, 3) and the higher dose combinations are eliminated from the trial, and dose de-
escalation should be made for the next cohort. Note that the admissible de-escalation
set is {(3, 2), (2, 3)}, while dose level (3, 2) has never been administered before, so
the RW-ROI design selects dose level (3, 2) for the next assignment. In addition,
dose escalation for the 14th cohort is based on a comparison between the posterior
probabilities Pr(H0 | y23) and Pr(H0 | y32), and finally chooses dose level (2, 3). At the

Table 4.5 Interval boundaries and dose elimination cutoffs for the number of DLTs in the robust
optimal interval design with a target toxicity rate φ = 0.3
nj        3  6  9  12 15 18 21 24 27 30
λL(nj)    0  1  1  2  2  3  4  4  5  6
λU(nj)    2  4  5  6  7  9  10 11 12 14
λT(nj)    3  4  5  7  8  9  10 11 12 14
nj        33 36 39 42 45 48 51 54 57 60
λL(nj)    6  7  8  8  9  10 11 11 12 13
λU(nj)    15 16 17 18 19 20 21 22 23 24
λT(nj)    15 16 17 18 19 20 21 22 23 24
Note: nj is the cumulative number of patients at dose level j
72 R. Lin and G. Yin

(3,5) 0/3 toxicities


(3,4) 1/3 toxicities
(3,3) 2/3 toxicities
(3,2) 3/3 toxicities
(3,1)
(2,5)
Dose level

(2,4)
(2,3)
(2,2)
(2,1)
(1,5)
(1,4)
(1,3)
(1,2)
(1,1)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Sequence of enrollment

Fig. 4.3 Illustration of RW-ROI for the prostate cancer trial with a target toxicity rate of 0.3.
Circle indicates patients without toxicity, triangle and diamond respectively denote one and two
toxicities, and cross represents three toxicities

end of the trial, the estimated toxicity probability matrix after implementing the two-
dimensional pool adjacent violators algorithm is given by
2 3
 0:68 1  
4  0:15 0:15 0:28  5 ;
0 0 0  

where “–” represents the dose levels that have not been administered in the trial.
Thus, dose level .2; 4/, which is 8 mg/m2 mitoxantone combined with 3.1 mg/kg
genasense, would be selected as the MTD.

4.6 Concluding Remarks

To simplify dose-finding procedure while still maintaining the trial performance,


a robust optimal interval design is developed. Its extension to double-, triple-, or
higher-dimensional drug combinations is straightforward, which greatly simplifies
the current practice of dose finding in combination treatment. The proposed ROI
and RW-ROI methods can substantially outperform the BOIN design, if the interval
parameters .1 ; 2 / of BOIN is poorly specified. The ROI design only requires
the prespecification of the target toxicity rate of the trial and thus it dramatically
improves the robustness of the existing dose-finding designs. In addition, we have
4 Robust Optimal Interval Design 73

demonstrated the good performance and operating characteristics of all the single-,
double- and triple-agent ROI designs by conducting extensive simulation studies.
An arguable point is that the allocation decisions by the ROI designs are solely
determined by the information at the current dose level, while the sequential dose-
escalation procedure as well as the isotonic regression at the end of the trial
implicitly account for the majority of information from other doses.

Acknowledgements We are grateful to a co-editor for many helpful suggestions that have
improved this chapter immensely. The research was supported in part by a grant (17125814) from
the Research Grants Council of Hong Kong.

References

Ahn C (1998) An evaluation of phase I cancer clinical trial designs. Stat Med 17:1537–1549
Chi KN, Gleave ME, Klasa R, Murray N, Bryce C, de Menezes DEL, D0 Aloisio S, Tolcher
AW (2001) A phase I dose-finding study of combined treatment with an antisense bcl-2
oligonucleotide (genasense) and mitoxantrone in patients with metastatic hormone-refractory
prostate cancer. Clin Cancer Res 7:3920–3927
Conaway MR, Dunbar S, Peddada SD (2004) Designs for single- or multiple-agent phase I trials.
Biometrics 60:661–669
Durham SD, Flournoy N, Rosenberger WF (1997) A random walk rule for phase I clinical trials.
Biometrics 53:745–760
Dykstra RL, Robertson T (1982) An algorithm for isotonic regression for two or more independent
variables. Ann Stat 10:708–716
Fan SK, Venook AP, Lu Y (2009) Design issues in dose-finding phase I trials for combinations of
two agents. J Biopharm Stat 19:509–523
Gezmu M, Flournoy N (2006) Group up-and-down designs for dose-finding. J Stat Plan Infer
136:1749–1764
Heyd JM, Carlin PB (1999) Adaptive design improvements in the continual reassessment method
for phase I studies. Stat Med 18:1307–1321
Huang X, Biswas S, Oki Y, Issa JP, Berry DA (2007) A parallel phase I/II clinical trial design for
combination therapies. Biometrics 63:429–436
Ivanova A, Wang K (2004) A non-parametric approach to the design and analysis of two-
dimensional dose-finding trials. Stat Med 23:1861–1870
Ivanova A, Flournoy N, Chung Y (2007) Cumulative cohort design for dose finding. J Stat Plan
Infer 137:2316–2327
Ji Y, Li Y, and Yin G (2007) Bayesian dose finding in phase I clinical trials based on a new statistical
framework. Stat Sinica 17:531–547
Ji Y, Liu P, Li Y, Bekele BN (2010) A modified toxicity probability interval method for dose-finding
trials. Clin Trials 7:653–663
Leung DHY, Wang YG (2002) An extension of the continual reassessment method using decision
theory. Stat Med 21:51–63
Lin R, Yin G (2016, in press) Bayesian optimal interval design for dose finding in drug-
combination trials. Stat Methods Med Res. doi: 10.1177/0962280215594494
Liu S, Cai C, Ning J (2013) Up-and-down designs for phase I clinical trials. Contemp Clin Trials
36:218–227
Liu S, Yuan Y (2015) Bayesian optimal interval designs for phase I clinical trials. J R Stat Soc Ser
C Appl Stat 64:507–523
74 R. Lin and G. Yin

Oron A, Azriel D, Hoff P (2011) Dose-finding designs: the role of convergence properties. Int J
Biostat 7, Article 39
O’Quigley J, Pepe M, Fisher L (1990) Continual reassessment method: a practical design for phase
1 clinical trials in cancer. Biometrics 46:33–48
Rogatko A, Schoeneck D, Jonas W, Tighiouart M, Khuri FR, Porter A (2007) Translation of
innovative designs into phase I trials. J Clin Oncol Res 25:4982–4986
Shi Y, Yin G (2013) Escalation with overdose control for phase I drug-combination trials. Stat Med
32:4400–4412
Simon R, Rubinstein L, Arbuck SG, Christian MC, Freidlin B, Collins J (1997) Accelerated
titration designs for phase I clinical trials in oncology. J Natl Cancer Inst 89:1138–1147
Storer BE (1989) Design and analysis of phase I clinical trials. Biometrics 45:925–937
Thall PF, Millikan RE, MRuller P, Lee SJ (2003) Dose-finding with two agents in phase I oncology
trials. Biometrics 59:487–496
Wang K, Ivanova A (2005) Two-dimensional dose finding in discrete dose space. Biometrics
61:217–222
Yuan Z, Chappell R (2004) Isotonic designs for phase I cancer clinical trials with multiple risk
groups. Clin Trials 1:499–508
Yin G (2012) Clinical trial design: bayesian and frequentist adaptive methods. John Wiley & Sons,
Hoboken
Yin G, Yuan Y (2009) Bayesian model averaging continual reassessment method in phase I clinical
trials. J Am Stat Assoc 104:954–968
Yin G, Yuan Y (2009) Bayesian dose finding in oncology for drug combinations by copula
regression. J R Stat Soc Ser C Appl Stat 61:217–222
Yin G, Yuan Y (2009) A latent contingency table approach to dose finding for combinations of two
agents. Biometrics 65:866–875
Yin G, Lin R (2014) Comments on “Competing designs for drug combination in phase I dose-
finding clinical trials” In: Riviere M-K, Dubois F, Zohar S (eds) Stat Med 34:13–17
Yuan Z, Chappell R, Bailey H (2007) The continual reassessment method for multiple toxicity
grades: a Bayesian quasi-likelihood approach. Biometrics 63:173–179
Yuan Y, Yin G (2008) Sequential continual reassessment method for two-dimensional dose finding.
Stat Med 27:5664–5678
Part II
Life Time Data Analysis
Chapter 5
Group Selection in Semiparametric Accelerated
Failure Time Model

Longlong Huang, Karen Kopciuk, and Xuewen Lu

Abstract In survival analysis, a number of regression models can be used to


estimate the effects of covariates on the censored survival outcome. When covariates
can be naturally grouped, group selection is important in these models. Motivated
by the group bridge approach for variable selection in a multiple linear regression
model, we consider group selection in a semiparametric accelerated failure time
(AFT) model using Stute’s weighted least squares and a group bridge penalty.
This method is able to simultaneously carry out feature selection at both the group
and within-group individual variable levels, and enjoys the powerful oracle group
selection property. Simulation studies indicate that the group bridge approach for
the AFT model can correctly identify important groups and variables even with
high censoring rate. A real data analysis is provided to illustrate the application of
the proposed method.

5.1 Introduction

Variable selection, an important objective of survival analysis, is to choose a


minimum number of important variables to model the relationship between a
lifetime response and potential risk factors. In an attempt to select significant
variables and estimate regression coefficients automatically and simultaneously,
a family of penalized or regularized approaches is proposed. Variable selection
is conducted by minimizing a penalized objective function by adding a penalty

L. Huang () • X. Lu
Department of Mathematics and Statistics, University of Calgary, 2500 University Drive NW,
T2N 1N4, Calgary, AB, Canada
e-mail: [email protected]; [email protected]
K. Kopciuk
Department of Cancer Epidemiology and Prevention Research, Alberta Health Services,
5th Floor, Holy Cross Centre Box ACB, 2210 2 St. SW, T2S 3C3, Calgary, AB, Canada
e-mail: [email protected]

© Springer Science+Business Media Singapore 2016 77


D.-G. (Din) Chen et al. (eds.), Advanced Statistical Methods in Data Science,
ICSA Book Series in Statistics, DOI 10.1007/978-981-10-2594-5_5
78 L. Huang et al.

function with the following form

min fLoss function C Penaltyg :

The popular choices of loss functions are least squares and negative log-likelihood.
Many different penalty functions have been used for penalized regression, such as
the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), the
bridge penalty (Fu 1998), the smoothly clipped absolute deviation (SCAD) method
(Fan and Li 2001), the elastic-net method (Zou and Hastie 2005), the minimax
concave penalty (MCP) (Zhang 2010) and the smooth integration of counting and
absolute deviation (SICA) method (Lv and Fan 2009). These methods are designed
for individual variables selection.
In many applications, covariates in X are grouped. For example, in multi-factor
analysis of variance (ANOVA) problem, in which each factor may have several
levels and can be expressed through a group of dummy variables, such as for
response Z with two factors ˛ and ˇ, the intercept  and the random error ",

Z D  C ˛j C ˇk C "; j D 1; : : : ; J; k D 1; : : : ; K;
˚
J
where ˛j jD1 and fˇk gKkD1 can be considered as two groups. Another example
is the additive model with polynomial or nonparametric components, where each
component in the additive model may be expressed as a linear combination of a
number of basis functions of the original measured variable, for example,

Z D  C 1 .W1 / C    C J .WJ / C ";


P ˚
m
where each function j .Wj / D m lD1 lj Bl .Wj /, here Bl .Wj / l are basis functions,
and considered as a group.
Ma and Huang (2007) pointed that complex diseases such as cancer are often
caused by mutations in gene pathways, it would be reasonable to select groups of
related genes rather than individual genes. Bakin (1999) proposed the group LASSO
and a computational algorithm. Later Yuan and Lin (2006) developed this method
and related group selection methods, such as group least angle regression and
group nonnegative garret methods. The group LASSO is a natural extension of the
LASSO, in which an L2 norm of the coefficients associated with a group of variables
is used as a component of the penalty function. Meier et al. (2008) studied the group
LASSO for logistic regression. Motivated by identifying transcriptional factors
that can explain the observed variation of microarray time course gene expression
over time during a given biological process, Wang et al. (2007) introduced a
group SCAD penalized estimation procedure for selecting variables with time-
varying coefficients in the context of functional response models. Zhao et al. (2009)
5 Group Selection in Semiparametric Accelerated Failure Time Model 79

introduced the Composite Absolute Penalties (CAP) family, which allows given
grouping and hierarchical relationships between the predictors to be expressed.
These studies only considered group selection, but did not take the individual
variable selection within groups into account. Huang et al. (2009) proposed the
group bridge method in a multiple linear regression model with data uncensored,
which is capable of carrying out variable selection at the group and within-group
individual variable levels simultaneously. Huang et al. (2014) studied the group
bridge for the Cox model. Breheny and Huang (2009) developed the group MCP
approach in the linear regression model with uncensored data to select important
groups as well as identifying important members of these groups. They refer to this
as bi-level selection.
In this paper, we consider the group bridge method for the AFT model with
right censored data. The Stute’s weighted least squares estimator in AFT models is
introduced in Sect. 5.2. In Sect. 5.3 we describe the group bridge method for the AFT
model and present the computation steps and tuning parameters selection methods.
The asymptotic properties are stated in Sect. 5.4. In Sect. 5.5 simulation studies are
produced to evaluate the proposed method comparing to the group LASSO method.
In Sect. 5.6 we apply the proposed methods to the primary biliary cirrhosis data set.
Summary and discussion are reported in Sect. 5.7.

5.2 Stute’s Weighted Least Squares Estimation


in The AFT Model

For i D 1; : : : ; n, let Ti represent the logarithm of the survival time for the ith subject,
Xi be the associated d-dimensional vector of covariates, Ci denote the logarithm of
the censoring time and ıi denote the event indicator, i.e., ıi D I.Ti  Ci /, which
takes value 1 if the event time is observed, or 0 if the event time is censored. Define
Yi as the minimum of the logarithm of the survival time and the censoring time, i.e.,
Yi D min.Ti ; Ci /. Then, the observed data are in the form .Yi ; ıi ; Xi /, i D 1; 2; : : : ; n,
which are assumed to be an independent and identically distributed (i.i.d.) sample
from .Y; ı; X/. Survival analysis focuses on the distribution of survival times and
the association between survival time and risk factors or covariates. The AFT model
directly relates the logarithm of the failure time linearly to the covariates, and
resembles a conventional linear model.
The AFT model is defined as

Ti D ˛ C X >
i ˇ C "i ; i D 1; : : : ; n; (5.1)

where ˛ is the intercept, ˇ is a d 1 regression parameter vector to be estimated,


and the "i ’s are independent identically distributed random errors with a common
distribution function. If the distribution function of the error term is known, this
80 L. Huang et al.

model is a parametric model. If the distribution function of error term is unspecified,


this model is considered as a semiparametric model.
In order to estimate the coefficients .˛; ˇ/ in the AFT model, there are three
popular approaches. One is the Buckley and James estimator (1979) that adjusts
censored observations using the Kaplan-Meier estimator. This original Buckley-
James approach has no theoretical justification and does not provide a reliable
numerical method for implementation. Later, Ritov (1990) studied the asymptotic
properties of the Buckley-James estimator. The second one is the rank based
estimators (Fygenson and Ritov 1994; Heller 2007; Tsiatis 1990; Ying 1993) that
are motivated by the score function of the partial likelihood. The existing rank based
methods are computationally intensive for semiparametric estimators.
Stute (1993) proposed a weighted least squares estimator in the semiparametric
AFT model for right censored data, which uses the Kaplan-Meier weights to account
for censoring in the least squares criterion in the AFT model. The weights in the
equation are the jumps in the Kaplan-Meier estimator, which is computationally
more feasible than the Buckley-James and rank based estimators.
Let FO n be the Kaplan-Meier estimator of the distribution function F of T
and assume Y.1/      Y.n/ are the order statistics of Yi ’s; ı.1/ ; : : : ; ı.n/ and
X.1/ ; : : : ; X.n/ are the associated censoring indicators and covariates of the ordered
Yi ’s, respectively. According O
P to Stute and Wang (1993) and Stute (1996), Fn can
be written as FO n .y/ D niD1 wni 1.Y.i/  y/, where the wni ’s are the jumps in the
Kaplan-Meier estimator and can be expressed as

ı.1/
wn1 D ;
n
i1 
Y ı. j/
ı.i/ nj
wni D ; i D 2; : : : ; n:
n  i C 1 jD1 n  j C 1

The wni ’s are also called the Kaplan-Meier weights. Then the Stute’s weighted least
squares objective function is

1X
n  2
wni Y.i/  ˛  X>
.i/ ˇ :
2 iD1

By centering X .i/ and Y.i/ with their wni -weighted means, the intercept becomes
0. Denote e 1=2 e 1=2
Pn X.i/ D .nw Pnin/ .X .i/  X w / and
PnY .i/ D .nwPnin/ .Y.i/  Y w /, where
X w D iD1 wni X .i/ = iD1 wni and Y w D iD1 wni Y.i/ = iD1 wni . We can rewrite
the Stute’s weighted least squares objective function as

1 X e 2
n
L.ˇ/ D Y .i/  e
X>
.i/ ˇ : (5.2)
2 iD1
5 Group Selection in Semiparametric Accelerated Failure Time Model 81

The Stute’s weighted least squares estimator of ˇ can be obtained by minimizing the
objective function (5.2). Since this objective function uses the least squares method,
it is easy to solve. Assuming that T and C are independent, Stute (1993, 1996)
showed that the estimator Ǒ is consistent and asymptotically normal as n ! 1.
Stute’s weighted least squares method can be used to construct a loss function,
and then by combining with penalty terms, variable selection in the AFT model will
be performed.

5.3 The Group Bridge Estimator in The AFT Model

When covariates are grouped, instead of individual variable selection, we should


treat the related covariates as a group. Let X k D .X1k ; : : : ; Xnk /> , k D 1; : : : ; d, be
the design vectors and T D .T1 ; : : : ; Tn /> be the response vector in (5.1), then the
regression model can be written as

T D ˛ C ˇ1 X1 C : : : C ˇd X d C "

with an error vector " D ."1 ; : : : ; "n /> . Let A1 ;    ; AJ be subsets of f1;    ; dg
representing known groupings of the design vectors and denote the regression
coefficients in the jth group by ˇ Aj D .ˇk ; k  Aj /> , j D 1; : : : ; J. For any k 1
vector a, kak1 denotes the L1 norm: kak1 D ja1 j C    C jak j.
Let ˇ be the parameters of interest in the AFT model (5.1). After adding the
group bridge penalty function proposed by Huang et al. (2009) to the Stute’s
weighted least squares loss function (5.2), the group bridge penalized Stute’s
weighted least squares objective function is

1 X e 2 X
n J
Y .i/  e

L n .ˇ/ D X>
.i/ ˇ C n cj kˇ Aj k1
2 iD1 jD1
!2
1X e X X
n d J
e 
D Y .i/  Xk ˇk C n cj kˇ Aj k1
2 iD1 kD1 jD1
2
1
e X e
X
d J

D Y  Xk ˇk C n cj kˇ Aj k1 ; (5.3)
2
kD1 2 jD1

where n > 0 is the penalty tuning parameter and cj ’s are constants for the
adjustment of the different dimensions of Aj . In the case of uncensored data,
Huang et al. (2009) suggested a simple choice of cj is cj / jAj j1 , where jAj j
82 L. Huang et al.

is the cardinality of Aj , and they also showed that when 0 <  < 1, the group
bridge penalty is able to carry out variable selection at the group and individual
variable levels simultaneously. For simplicity, we use e Xk D .e X .1/k ; : : : ; e
X .n/k /> and
e e e e >
Y D .Y .1/ ; Y .2/ ;    ; Y .n/ / . Then we can obtain the penalized estimator of ˇ by
minimizing L n .ˇ/.

5.3.1 Computation

Since the group bridge penalty is not a convex function for 0 <  < 1, direct
minimization of L n .ˇ/ is difficult. Following Huang et al. (2009), we formulate an
equivalent minimization problem that is easier to solve computationally. For 0 <
 < 1, define
2
1
e X e
X X
d J J
11= 1=
L1n .ˇ; / D Y  Xk ˇk C j cj kˇ Aj k1 C  j ;
2
kD1 jD1 2 jD1

where  is a penalty parameter.


Proposition 1 Suppose 0 <  < 1. If n D  1   .1   / 1 , then Ǒ minimizes
L n .ˇ/ if and only if . Ǒ ; /
O minimizes L1n .ˇ; / subject to j  0, for j D 1; : : : ; J.

When data are uncensored, i.e., wni D 1=n; i D 1; : : : ; n, Huang et al. (2009)
pointed out that minimizing L1n .ˇ; / with respect to .ˇ; / yields sparse solutions
at the group and individual variable levels, that is, the penalty is an adaptively
weighted L1 penalty, which conduct the sparsity in ˇ, and when 0 <  < 1, small
j will force ˇ Aj D 0 and leads to group selection.
Based on Proposition 1, for s D 1; 2; : : :, we have the iterative computation
algorithm as the following:
Step 1. Obtain an initial value ˇ .0/ .
Step 2. Compute
 
.s/ 1 .s1/ 
j D cj kˇ Aj k1 ; j D 1;    ; JI (5.4)


Step 3. Compute
8 2 9
<1 X
d J 
X 11= =
e e .s/ 1=
ˇ .sC1/ D arg min Y  X k ˇk C j cj kˇ Aj k1 I
ˇ :2 ;
kD1 2 jD1

Step 4. Repeat steps 2 and 3 until convergence.


5 Group Selection in Semiparametric Accelerated Failure Time Model 83

The original value in Step 1 could be obtained by least squares method or LASSO
 11=
.s/ 1=
approach. The main computation task is Step 3. Let !Aj D j cj , !k D
!Aj W 9j; such that k 2 Aj , e
X !k D e
X k =!k , ˇ!k D !k ˇk , ˇ !A D !Aj ˇ A . Rewrite
j j

ˇ .sC1/ in Step 3 as
8 2 9
<1 =
e X e X
d J

ˇ .sC1/ D arg min Y  X!k ˇ!k C 1 kˇ !Aj k1 : (5.5)
ˇ :2 ;
kD1 2 jD1

In Eq. (5.5), e
X !k is the weighted covariate matrix for kth covariate, ˇ!k is the
weighted coefficient for the covariate and ˇ !Aj is the weighted coefficient for each
group. Whenever !Aj is 0 or very small, we set ˇ!k D 0 and ˇk D 0, and remove the
associated X k . Now the objective function (5.5) becomes a LASSO problem with
tuning parameter fixed at 1, and this can be solved using the existing R function
“predict.lars” by setting “s D 1”. After the value of ˇO!k has been estimated, we
could calculate ˇOk D ˇO!k =!k , k D 1; : : : ; d.

5.3.2 Tuning Parameter Selection

Following the procedure in Huang et al. (2009), for a fixed n , let Ǒ D Ǒ . n /


be the group bridge estimator of ˇ. Let Oj , j D 1; : : : ; J, be the jth component of
O D .ˇ.
O e e
n // as defined in Step 2. Let X D .X 1 ; : : : ; X d / be the n d covariate
matrix. The Karush-Kuhn-Tucker condition for Step 3 is
8 P P 11= 1=
jWAj 2k .j /
2
<  1 @ke
ˆ Y dlD1 e
Xk ˇk k2 cj ˇk
C D 0 8ˇk ¤ 0
ˇ 2
Pd
@ˇk ˇ jˇ k j
ˇ 1 @ke
Y lD1 e

Xk ˇk k2 P 11= 1=
:̂ ˇˇ 2 @ˇk
ˇ
ˇ jWAj 2k j cj 8ˇk D 0;

which implies that


X
Oj cj sgn.ˇOk /; 8ˇOk ¤ 0:
11= 1=
.e
Y  X Ǒ />e
Xk D (5.6)
jWAj 3k

Since sgn.ˇk / D ˇk =jˇk j, then the fitted response vector is

b
e
Y D X Ǒ D X n ŒX > 1 >e
n X n C W n  X n Y;
84 L. Huang et al.

where X n is the sub matrix of X whose columns correspond to covariates with


nonzero estimated coefficients for the given n and W n is the diagonal matrix with
diagonal elements
X
Oj cj =jˇOk j; ˇOk ¤ 0:
11= 1=

k2Aj

Therefore, the number of effective parameters with a given n can be approxi-


mated by
  1 
d. n / D trace X n X >
n X n C W n X >
n :

An AIC-type criterion for choosing n is


( , )
2

AIC. n / D ln e
Y  X Ǒ . n / n C 2d. n /=n:
2

A BIC-type criterion for choosing n is


( , )
2

BIC. n / D ln e
Y  X Ǒ . n / n C ln.n/d. n /=n:
2

A generalized cross-validation (GCV)-type criterion for choosing n is


,
2 n o

GCV. n / D e
Y  X Ǒ . n / n .1  d. n /=n/2 :
2

The tuning parameter n is selected by minimizing the criteria AIC. n /, BIC. n /,


or GCV. n /.

5.3.3 Comparison With the Group LASSO

Yuan and Lin (2006) introduced the group LASSO method to select grouped
variables in the linear regression model with uncensored data, which uses an L2
norm of the coefficients associated with a group of variables in the penalty function.
5 Group Selection in Semiparametric Accelerated Failure Time Model 85

We propose the group LASSO estimator for the AFT model with censored data to be
2
1 Xd XJ

Q̌ D arg min e e
Y Xk ˇk C n cj kˇ Aj kKj ;2 ; (5.7)
ˇ 2
kD1 2 jD1

where n > 0 is the tuning parameter and Kj is a positive definite matrix and
kˇ Aj kKj ;2 D .ˇ >
Aj Kj ˇ Aj /
1=2
. Yuan and Lin (2006) suggested the choice of Kj is
Kj D jAj jIj with Ij being the jAj j jAj j identity matrix.
Similar to the group bridge approach, let  be a penalty parameter, then
2
1
e X e
X X
d J J
L2n .ˇ; / D Y  X k ˇk C j1 kˇ Aj k2Kj ;2 C  j : (5.8)
2
kD1 2
jD1 jD1

Proposition 2 If n D 2 1=2 , then Q̌ satisfies (5.7) if and only if . Q̌ ; /


Q minimizes
Q
L2n .ˇ; / subject to   0, for some   0.
From the penalized objective function (5.8) we see that the sum of the squared
coefficients in group j is penalized by j , and the sum of j ’s is penalized by
. The large j tends to keep all of the elements of ˇ Aj . Therefore, in order to
minimize (5.8), either ˇ Aj D 0, that is, the group is dropped from the model,
otherwise, with large j , ˇ Aj ¤ 0, which means all the elements of ˇ Aj are non-
zero and all the variables in group j are retained in the model. So the group LASSO
selects either the group with all the variables inside or deletes the whole group.
This is the reason why the group LASSO can conduct group selection, but it cannot
select individual variables within groups. Our simulation studies in Sect. 5.5 also
reexamine this property.

5.4 Asymptotic Properties of the Group Bridge Stute’s


Weighted Least Squares Estimator

Stute (1993, 1996) proved consistency and asymptotic normality of the weighted
least squares estimator with the Kaplan-Meier weights under some conditions.
Huang et al. (2009) derived the symptomatic properties of the group bridge
estimators with uncensored data. Combining the methods of these authors, we
derive the asymptotic distribution of the Stute’s weighted estimator under the group
bridge penalty. We can show that, for 0 <  < 1, the group bridge estimators
correctly select nonzero groups with probability converging to one under reasonable
conditions.
86 L. Huang et al.

According to Huang and Ma (2006)’s regularization estimation in the AFT


model for ungrouped variables, let H denote the distribution function of Y. By the
independence between T and C, 1  H.y/ D .1  F.y//.1  G.y//, where F and
G are the distribution functions of T and C, respectively. Let Y , T and C be the
endpoints of the support of Y, T and C, respectively. Let F 0 be the joint distribution
of .X; T/. Denote

F 0 .x; t/; t < Y
FQ 0 .x; t/ D
F 0 .x; Y / C F 0 .x; Y /1fY 2 Ag t  Y ;

with A denoting the set of atoms of H. Define two sub distribution functions:

Q 11 .x; y/ D P.X  x; Y  y; ı D 1/;


H

Q 0 .y/ D P.Y  y; ı D 0/:


H

For j D 0; : : : ; d, let
(Z )
y Q 0 .dw/
H
0 .y/ D exp ;
0 1  H.w/
Z
1 Q 11 .dx; dw/;
1;j .yI ˇ/ D 1w>y .w  x> ˇ/xj 0 .w/H
1  H.y/

1v<y;v<w .w  x> ˇ/xj 0 .w/ Q 0
2;j .y; ˇ/ D H .dv/HQ 11 .dx; dw/;
Œ1  H.v/2

l .y; ˇ/ D .l;0 .y; ˇ/; l;1 .y; ˇ/; : : : ; l;d .y; ˇ//> ; l D 1; 2:

Denote the true regression coefficients by ˇ 0n D .ˇ > > > >


0A ; ˇ 0B ; ˇ 0C / . For
o j D
˚

1; :::; J, let A D k 2 Aj W ˇ0k ¤ 0 ; B D k 2 Aj W ˇ0k D 0; ˇ 0Aj ¤ 0 ; C D


˚ o
k 2 Aj W ˇ0k D 0; ˇ 0Aj D 0 . So A contains the indices of nonzero coefficients,
B contains the indices of zero coefficients that belong to nonzero groups, and
C contains the indices of zero coefficients that belong to zero groups. We write
D D B [ C , which contains the indices of all zero coefficients. Since ˇ 0D D 0,
the true model is fully explained by the first A subset. Then Ǒ A and Ǒ D are the
estimates of ˇ A and ˇ D from the group bridge estimator Ǒ , respectively.
5 Group Selection in Semiparametric Accelerated Failure Time Model 87

Let W D diag.nw1 ; : : : ; nwn / be the diagonal matrix of the Kaplan-Meier


weights. Let X Aj D .e X k ; k 2 Aj / be the matrix with columns e
Xk ’s for k 2 Aj ,
and denote ˙Aj D X Aj WX Aj =n. For i D 1; : : : ; n, let ei D e
>
Y .i/  e
X>.i/ ˇ 0 and
Pn
k D e >
iD1 X ik ei ; 1  k  d. Define ˙n D X X =n and let n and n be the


smallest and largest eigenvalues of ˙n . We assume the following.


(A1) The number of nonzero coefficients q is finite;
(A2) (a) The observations .Yi ; Xi ; ıi /; i D 1; : : : ; n are independent and identically
distributed; (b) The random errors "1 ; : : : ; "n are independent and identically
distributed with mean 0 and finite variance  2 , and furthermore, there exist
K1 ; K2 > 0 such that the tail probabilities of "i satisfy P.j"i > uj/ 
K2 exp.K1 u2 / for all u  0 and all i;
(A3) (a) The distribution of i ’s are subgaussian; (b) The covariates are bounded,
that is, there exists a constant M > 0 such that jXik j  M; 1  i  n; 1 
k  d;
(A4) The covariate matrix X satisfies the sparse Riesz condition (SRC) with rank
q : there exist constants 0 < c < c < 1, such that for q D .3C4C/q, with
probability converging to 1, c  v> ˙Aj v=kvk22  c , 8Aj with jAj j D q

and v 2 Rq ; P
(A5) The maximum multiplicity Cn D maxk JjD1 Ifk 2 Aj g is bounded and

2n X
J1
2 2
c2 kˇ k jAj j  Mn ln.d/; Mn D O.1/I
nn jD1 j 0Aj 1

(A6) The constants cj are scaled so that min1jJ cj  1 and

n .n2 /1=2
1=2
! 1:
fln.d/g .q C n /1=2 n n=2

The model is sparse by assumption (A1) (Huang and Ma 2010; Ma and Du


2012), which is reasonable in genomic studies, that is, although the total number
of covariates may be large, the number of covariates with nonzero coefficients is
still small. The subgaussian assumption (A2) has been made in high dimensional
linear regression models (Zhang and Huang 2008). Assumption (A3) proposed by
Ma and Du (2012) shows the subguassian tail property still holds under censoring,
and it is required for Theorem 1. The SRC condition proposed by Zhang and Huang
(2008) in assumption (A4) implies that all the eigenvalues of any p p submatrix
of X > WX =n with p  q lie between c and c . It ensures that any model with
dimension no greater than q is identifiable. Similar assumptions (A5) and (A6)
were used under uncensored data in Huang et al. (2009). We allow ln(d) = o(n)
88 L. Huang et al.

or d = exp(o(n)), so our work is more general than that of Huang et al. (2009).
Also (A5) and (A6) put restrictions on the magnitude of the penalty parameter,
which is 0 <  < 1.
Theorem p1 (Group Bridge) Suppose that 0 <  < 1, conditions (A1)–(A6) hold
and n = n ! 0  0. Let X1 D .X k ; k 2 A / and ˙1 D E.X1 X>
1 /. Then

(i) (Zero Group Selection Consistency)

Prf Ǒ nC D 0g ! 1:

(ii) (Asymptotic Distribution of Nonzero Parameter Estimators in Nonzero Groups)


p
n. Ǒ nA  ˇ 0A / !D arg minfU1 .b/ W b 2 RjA j g;

where
1
U1 .b/ D b> V 1 C b> ˙1 b
2
X
J
 1
X
C 0 cj kˇ 0Aj k1 fbk sgn.ˇ0k /g:
jD1 k2A

Here

V 1  N.0; ˝1 /;

˝1 D Varfı0 .Y/.Y  X>


1 ˇ 0A /X 1 C .1  ı/1 .YI ˇ 0A /  2 .YI ˇ 0A /g:

Note that if 0 D 0, the penalty part is negligible, the asymptotic distribution of


nonzero parameters estimators in both zero and nonzero groups becomes that of
the Stute’s estimator. Part (i) of Theorem 1 states that the group bridge estimates
of the coefficients of the zero groups exactly equal 0 with probability converging
to one; part (ii) shows the normality property of the group bridge estimates of the
coefficients of the nonzero parameters in both zero and nonzero groups. Part (i) of
Theorem 1 implies that the group bridge estimator can distinguish nonzero groups
from zero groups correctly, but it does not address the zero coefficients in nonzero
groups, so the proposed method possesses group selection consistency but lacks
individual selection consistency. In order to archive individual selection consistency,
following Wang et al. (2009), an adaptive group bridge penalty is needed, this
issue will be explored in our future research. Our part (ii) of Theorem 1 is also
different from that in Theorem 1 (b) of Huang et al. (2009), where A is replaced by
B1 D A [B, which shows the asymptotic distribution of nonzero group estimators.
5 Group Selection in Semiparametric Accelerated Failure Time Model 89

However, we found that their asymptotic distribution results were hard to get under
the given conditions, and further investigation is needed to solve the problem.
We also present the Theorem 2 for the group LASSO estimator to compare the
asymptotic properties of the group bridge and group LASSO estimators.
Theorem 2 (Group LASSO) Suppose fˇ; d; Aj ; cj ; Kj ; j  Jg are all fixed. "i ’s are
iid errors with E."i / D 0 and Var."i / D  2 2 .0; 1/. Let ˙2 D E.XX > / and
suppose n1=2 n ! 0 > 0 when n ! 1. Then
p
n. Q̌ n  ˇ 0 / !D arg minfU2 .b/ W b 2 Rd g;

where
1
U2 .b/ D b> V 2 C b> ˙2 b
2
( > )
XJ
bAj Kj ˇ0Aj
C 0 cj I.ˇ Aj ¤ 0/ C kbAj kKj ;2 I.ˇ Aj D 0/ :
jD1
kˇ 0Aj kKj ;2

Here

V 2  N.0; ˝2 /;

˝2 D Varfı0 .Y/.Y  X > ˇ 0 /X C .1  ı/1 .YI ˇ 0 /  2 .YI ˇ 0 /g:

From Theorem 2 we notice that, when 0 D 0, the group LASSO estimator is


the same as the Stute’s estimator. When 0 > 0, the asymptotic distribution of Q̌ n
puts positive probability at 0 when ˇ Aj D 0. Since this positive probability is less
than one in general, which results in the non-consistency property in selecting the
nonzero groups.

5.5 Simulation Studies

In this section, simulations are conducted to compare the bi-level (group and within-
group individual variable levels) performance of the group bridge estimator and
the group LASSO estimator. Following Huang et al. (2009)’s simulations set up
for uncensored data, two scenarios are considered. Since the proposed method can
deal with right censored data, the logarithm of censoring times, C, are generated
by the logarithm of random variables from the exponential distribution with a rate
parameter , where  is chosen to obtain 20 %; 50 % and 70 % censoring rates for
both scenarios. In Scenario 1, the number of groups is moderately large, the group
90 L. Huang et al.

sizes are equal and relatively large, and within each group the coefficients are either
all nonzero or all zero. In Scenario 2, the group sizes vary and there are coefficients
equal to zero in a nonzero group. We use  D 0:5 in the group bridge estimator.
The sample size n D 200 in each scenario. The simulation results are based on 400
replications.
We calculate the average number of groups selected (No.Grp), the average
number of variables selected (No.Var), the percentage of occasions on which the
model produced contains the same groups as the true model (%Corr.Grp), the
percentage of occasions on which the model produced contains the same variables as
the true model (%Corr.Var) and the model error (Model Error), which is computed
as . Ǒ  ˇ 0 /> E.X> X/. Ǒ  ˇ 0 /, where ˇ 0 is the true coefficient value. Enclosed
in parentheses are the corresponding standard deviations. And the last line in each
table gives the true values used in the generation model. For example, in Scenario
1, there are 2 nonzero groups and 16 nonzero coefficients in the generation model.
For both of the group bridge estimator and group LASSO estimator, AIC, BIC
and GCV tuning parameter selection methods were used to evaluate performance.
The variable selection and coefficient estimation results based on GCV are similar
to those using AIC. A comparison of different tuning parameter selection methods
indicates that tuning based on BIC in general does better than that based on AIC and
GCV in terms of selection at the group and individual variable levels. We therefore
focus on the comparisons of the methods with BIC tuning parameter.

5.5.1 Scenario 1

In this experiment, there are five groups and each group consists of eight covariates.
The covariate vector is X D .X 1 ; : : : ; X 5 / and, for any j in 1; : : : ; 5, the subvector
of covariates that belong to the same group is X j D .x8.j1/C1 ; : : : ; x8.j1/C8 /.
To generate the covariates x1 ; : : : ; x40 , we first simulate 40 random variables
R1 ; : : : ; R40 independently from the standard normal distribution. Then Zj .j D
1; : : : ; 5/ are simulated with a normal distribution and an AR(1) structure such
that cov.Zj1 ; Zj2 / D 0:4jj1 j2 j , for j1 ; j2 D 1; : : : ; 5. The covariates x1 ; : : : ; x40 are
generated as
p
xj D .Zgj C Rj /= 2; j D 1; : : : ; 40;

where gj is the smallest integer greater than .j  1/=8 and the xj s with the same value
of gj belong to the same group. The logarithm of failure times are generated from
P
the log-Normal model, T D 40 2
jD1 xj ˇj C ", where the random error is "  N.0; 2 /,
5 Group Selection in Semiparametric Accelerated Failure Time Model 91

and

.ˇ1 ; : : : ; ˇ8 / D .0:5; 1; 1:5; 2; 2:5; 3; 3:5; 4/;

.ˇ9 ; : : : ; ˇ16 / D .2; 2; : : : ; 2/;

.ˇ17 ; : : : ; ˇ24 / D .ˇ25 ; : : : ; ˇ32 / D .ˇ33 ; : : : ; ˇ40 / D .0; 0; : : : ; 0/:

Thus, the coefficients in each group are either all nonzero or all zero.
Table 5.1 summarizes the simulation results for Scenario 1. From these results
we notice that as the censoring rate increases, the model error increases and the
percentage of correct variables selected decreases for both of the group bridge and
group LASSO methods. But comparing the group bridge approach with the group
LASSO approach, the group bridge method tends to more accurately select correct
groups as well as the variables in each group, even when the censoring rate is high
as 70 %. While the group LASSO method tends to select more groups and variables
than the true models, and when the censoring rate is high, the group LASSO method
performs poorer than the group bridge method. So in terms of the number of groups
selected, the number of variables selected, the percentage of correct models selected
and the correct variable selected, the group bridge considerably outperforms the
group LASSO, and the group bridge incurs smaller model error than the group
LASSO.

Table 5.1 Simulation results for Scenario 1


CR% Method No.Grp No.Var %Corr.Grp %Corr.Var Model error
20 GBridge 2.00(0.000) 15.98(0.156) 100.00(0.000) 97.50(0.156) 0.596(0.253)
GLASSO 2.23(0.508) 17.84(4.061) 80.50(0.397) 80.50(0.397) 2.025(0.269)
50 GBridge 2.00(0.000) 15.93(0.251) 100.00(0.000) 93.25(0.251) 0.900(0.369)
GLASSO 2.57(0.723) 20.52(5.780) 56.25(0.497) 56.25(0.497) 2.328(0.673)
70 GBridge 2.00(0.000) 15.86(0.352) 100.00(0.000) 86.50(0.342) 1.442(0.683)
GLASSO 2.30(0.544) 18.40(4.351) 73.75(0.441) 73.75(0.441) 2.710(1.014)
True 2 16 100 100 0
GBridge the group bridge method, GLASSO the group LASSO method, CR censoring rate,
No.Grp the average number of groups selected, No.Var the average number of variables selected,
%Corr.Grp the percentage of occasions on which the model produced contains the same groups as
the true model, %Corr.Var the percentage of occasions on which the model produced contains the
same variables as the true model, Model Error D . Ǒ  ˇ 0 /> E.X> X/. Ǒ  ˇ 0 /. Empirical standard
deviations are in the parentheses
92 L. Huang et al.

5.5.2 Scenario 2

In this experiment, the group size differs across groups. There are six groups made
up of three groups each of size 10 and three groups each of size 4. The covariate
vector is X D .X 1 ; : : : ; X 6 /, where the six subvectors of covariates are X j D
.x10.j1/C1 ; : : : ; x10.j1/C10 /, for j D 1; 2; 3, and X j D .x4.j4/C31 ; : : : ; x4.j4/C34 /,
for j D 4; 5; 6. To generate the covariates x1 ; : : : ; x42 , we first simulate Zi .i D
1; : : : ; 6/ and R1 ; : : : ; R42 independently from the standard normal distribution. For
j D 1; : : : ; 30, let gj be the largest integer less than j=10 C 1, then gj D 1; 2; 3,
and for j D 31; : : : ; 42, let gj be the largest integer less than .j  30/=4 C 4, then
gj D 4; 5; 6. The covariates .x1 ; : : : ; x42 / are obtained as
p
xj D .Zgj C Rj /= 2; j D 1; : : : ; 42:
P42
The logarithm of failure times are generated from T D jD1 xj ˇj C ", where the
random error is "  N.0; 22 /, and

.ˇ1 ; : : : ; ˇ10 / D .0:5; 2; 0:5; 2; 1; 1; 2; 1:5; 2; 2/;

.ˇ11 ; : : : ; ˇ20 / D .1:5; 2; 1; 2; 1:5; 0; 0; 0; 0; 0/;

.ˇ21 ; : : : ; ˇ30 / D .0; : : : ; 0/; .ˇ31 ; : : : ; ˇ34 / D .2; 2; 1; 1:5/;

.ˇ35 ; : : : ; ˇ38 / D .1:5; 1:5; 0; 0/; .ˇ39 ; : : : ; ˇ42 / D .0; : : : ; 0/:

Thus we consider the situation that the group size differs across groups and the
coefficients in a group can be either all zero, all nonzero or partly zero.
Table 5.2 summarizes the simulation results for Scenario 2. From Table 5.2 we
can see that when the censoring rate is low, the group bridge method choses more

Table 5.2 Simulation results for Scenario 2


CR% Method No.Grp No.Var %Corr.Grp Model error
20 GBridge 4.00(0.279) 24.50(1.460) 96.0(0.196) 1.569(0.583)
GLASSO 5.12(0.994) 35.38(5.491) 44.0(0.497) 3.045(0.882)
50 GBridge 4.08(0.473) 24.86(1.962) 90.5(0.293) 2.241(0.907)
GLASSO 5.42(0.909) 37.20(5.075) 29.0(0.454) 3.487(1.015)
70 GBridge 4.02(0.453) 22.51(2.212) 89.3(0.310) 4.113(1.823)
GLASSO 4.97(1.001) 34.35(5.393) 51.8(0.500) 3.775(1.122)
True 4 21 100 0
GBridge the group bridge method, GLASSO the group LASSO method, CR censoring rate,
No.Grp the average number of groups selected, No.Var the average number of variables selected,
%Corr.Grp the percentage of occasions on which the model produced contains the same groups as
the true model, Model Error = . Ǒ  ˇ 0 /> E.X> X/. Ǒ  ˇ 0 /. Empirical standard deviations are in
the parentheses
5 Group Selection in Semiparametric Accelerated Failure Time Model 93

Table 5.3 Simulation results No.Var


in each group for Scenario 2
CR% Method G1 G2 G3 G4 G5 G6
20 GBridge 9.8 7.9 0.0 4.0 2.7 0.0
GLASSO 10.0 10.0 5.5 4.0 4.0 1.9
50 GBridge 9.8 8.2 0.2 4.0 2.7 0.0
GLASSO 10.0 10.0 7.0 4.0 4.0 2.2
70 GBridge 9.3 7.5 0.1 3.9 1.6 0.0
GLASSO 10.0 10.0 4.7 4.0 4.0 1.6
True 10 5 0 4 2 0
GBridge the group bridge method, GLASSO the group
LASSO method, CR censoring rate, No.Var the average
number of variables selected, G1, . . . , G6 the six groups

accurate groups and variables than the group LASSO method. When the censoring
rate goes up, both of the group bridge and group LASSO approaches result in
higher model errors, but the group bridge method still performs better than the group
LASSO in terms of the number of groups selected, the number of variables selected
and the percentage of correct models selected. Table 5.3 gives the average variable
selected in each group. The group bridge estimator is closer to the true value while
the group LASSO method tends to choose more variables than the true variables in
each group.

5.6 Real Data Analysis

The PBC data can be found in Fleming and Harrington (2011), and is obtained by
using attach(pbc) inside the R {SMPractials} package. The data set is from the
Mayo Clinic trial in primary biliary cirrhosis (PBC) of the liver conducted between
1974 and 1984. In this study, 312 out of 424 patients who agreed to participate
in the randomized trial are eligible for the analysis. Among the 312 patients, 152
were assigned to the drug D-penicillanmine, while the others were assigned to a
control group with placebo drug. Some covariates, such as age, gender and albumin
level, were recorded. The primary interest was to investigate the effectiveness of
D-penicillanmine in curing PBC disease. To compare with the analysis of PBC data
in Huang et al. (2014), we restrict our attention to the 276 observations without
missing covariate values. The censoring rate is 60 %. All of the 17 risk factors are
naturally clustered into 9 different categories, measuring different aspects, such as
liver reserve function and demographics, etc. The definitions of the 10 continuous
and 7 categorical variables are given in the accompanying study Dictionary table
Table 5.4.
We fitted the PBC data in the AFT model, then used the group bridge and group
LASSO methods both with BIC to select the tuning parameter n . Huang et al.
(2014) fitted this data set in the Cox proportional hazards model with the group
bridge penalty under the BIC tuning parameter selection method. For comparison,
Table 5.5 includes these different penalties and different estimation results. For the
94 L. Huang et al.

Table 5.4 Dictionary of PBC data covariates


Group Variable Type Definition
Age(G1) X1 C Age(years)
Gender(G2) X2 D Gender(0 male; 1 female)
Phynotype(G3) X3 D Ascites(0 absence)
X4 D Hepatomegaly(0 absence; 1 presence)
X5 D Spiders(0 absence; 1 presence)
X6 D Edemaoed(0 no edema; 0.5 untreated/successfully treated)
Liver function X7 C Alkaline phosphatase(units/litre)
damage(G4)
X8 C Sgot(liver enzyme in units/ml)
Excretory function of X9 C Serum bilirubin(mg/dl)
the liver(G5)
X10 C Serum cholesterol(mg/dl)
X11 C Triglyserides(mg/dl)
Liver reserve X12 C Albumin(g/dl)
function(G6)
X13 C Prothrombin time(seconds)
Treatment(G7) X14 D Penicillamine v.s. placebo(1 control; 2 treatment)
Reflection(G8) X15 D Stage(histological stage of disease, graded 1,2,3 or 4)
X16 C Urine copper(ug/day)
Haematology(G9) X17 C Platelets(per cubic ml/1000)
Note: Type: type of variable (C: continuous; D: discrete)

Table 5.5 Estimation results of PBC data


AFT-BIC Cox-BIC
Group Covariate GroupLASSO GroupBridge GroupBridge
G1 Age 0:002 0 0
G2 Gender 0:454 0:25 0:945
asc 0:455 0:481 0:136
hep 0:136 0:071 0:146
G3
spid 0:404 0:36 0:102
oed 0:310 0:4 0:566
alk 0 0 0
G4
sgot 0 0 0
bil 0:059 0:047 0:060
G5 chol 0:001 0 0
trig 0:001 0 0
alb 1:134 1:12 1:289
G6
prot 0:268 0:322 0:124
G7 trt 0 0 0:237
stage 0:033 0 0
G8
cop 0:002 0:002 0
G9 plat 0 0 0
Note: The results for Cox-BIC are from Huang et al. (2014)
5 Group Selection in Semiparametric Accelerated Failure Time Model 95

PBC Data

BIC Tuning
1.0 Group Bridge
Group Lasso
Regression Coefficients
0.5
0.0
-0.5

Group 1
Group 2

Group 3

Group 4

Group 5

Group 6

Group 7

Group 8

Group 9
Fig. 5.1 Group bridge vs. Group LASSO estimation results of PBC data based on AFT model

AFT model, the group LASSO and group bridge methods obtain similar estimation
coefficients values, except that comparing with the group bridge method, the group
LASSO method selected one variable age in group 1, two variables chol and trig
in group 5 and the variable stage in group 8, while the group bridge method did
not select these variables. Using the group bridge with BIC under different models,
the AFT and Cox models selected almost the same groups and variables, except the
AFT model chose the variable cop in group 8, while the Cox model did not. The
Cox model selected the variable trt in group 7, while the AFT model did not.
In order to have a clear visual comparison, we plotted the coefficients based
on different models and different penalty functions. Figure 5.1 shows the estimated
coefficients based on the AFT model with the group LASSO penalty (blue triangles)
and the group bridge penalty (red circles), respectively. We could see that except
group 8, the group bridge and group LASSO choose the same groups and the
96 L. Huang et al.

PBC Data
GBridge-BIC
1.0 AFT Model
Cox Model
0.5
Regression Coefficients
-0.5 0.0
-1.0

Group 1

Group 2

Group 3

Group 4

Group 5

Group 6

Group 7

Group 8

Fig. 5.2 Group bridge estimation results of PBC data based on AFT vs. Cox model Group 9

estimated coefficients for each variable are very similar. Figure 5.2 contains the
estimated coefficients under the group bridge method in the AFT model and the
Cox proportional hazards model. The coefficients in the AFT model indicate the
relationship between the covariates and the logarithm of survival time and the
coefficients in the Cox model represent the relationship between the covariates and
the logarithm of hazard, their signs are opposite. From Fig. 5.2 we also see that
except for group 7, these two models select the same groups and variables based on
the group bridge method. Figure 5.3 combines Figs. 5.1 and 5.2 for a better visual
comparison.
5 Group Selection in Semiparametric Accelerated Failure Time Model 97

PBC Data

BIC Tuning
1.0 GBridge-AFT
GLasso-AFT
GBridge-Cox
0.5
Regression Coefficients
0.0
-0.5
-1.0

Group 1
Group 2

Group 3

Group 4

Group 5

Group 6

Group 7

Group 8

Group 9
Fig. 5.3 Comparison of the estimation results of PBC data based on three methods

5.7 Summary and Discussion

We have considered an extension of the group LASSO and group bridge regression
to the AFT model with right censored data. Stute’s weighted least squares estimator
with the group bridge penalty in AFT model is comparable to that in the Cox
regression model with group bridge penalty. The group bridge approach performs
better in selecting both the correct groups and individual variables than the group
LASSO method even when censoring rates are high. We have established the
asymptotic properties of the group bridge penalized Stute’s weighted least squares
estimators and allow the dimension of covariates to be larger than the sample size,
which is applicable for the high-dimensional genomic data.
98 L. Huang et al.

We focused on the group bridge penalty for the group and within-group variable
selection, and only compared it to the group LASSO penalty. It is possible to
consider the Stute’s weighted least squares estimators with the group SCAD penalty
(Wang et al. 2007), or with the group MCP penalty (Breheny and Huang 2009),
although the asymptotic properties of each penalized method need to be studied.
On the other hand, in many real life survival data sets, covariates have nonpara-
metric effects on the survival time, so the nonparametric or partial linear regressions
are of interest. In order to distinguish the nonzero components from the zero
components, the group bridge approach could also be applied in the nonparametric
and partial linear regressions. We are working on these projects now and the detailed
information will be reported elsewhere.

References

Bakin S (1999) Adaptive regression and model selection in data mining problems. PhD disserta-
tion, The Australian National University
Breheny P, Huang J (2009) Penalized methods for bi-level variable selection. Stat Interface 2:269–
380
Buckley J, James I (1979) Linear regression with censored data. Biometrika 66:429–436
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties.
J Am Stat Assoc 96:1348–1360
Fleming TR, Harrington DP (2011) Counting processes and survival analysis, vol 169. Wiley, New
York
Fu WJ (1998) Penalized regressions: the bridge versus the lasso. J Comput Graphic Stat 7:397–416
Fygenson M, Ritov Y (1994) Monotone estimating equations for censored data. The Ann Stat
22:732–746
Heller G (2007) Smoothed rank regression with censored data. J Am Stat Assoc 102(478):552–559
Huang J, Ma S, Xie H (2006) Regularized estimation in the accelerated failure time model with
high-dimensional covariates. Biometrics 62:813–820
Huang J, Ma S, Xie H, Zhang CH (2009) A group bridge approach for variable selection.
Biometrika 96:339–355
Huang J, Ma S (2010) Variable selection in the accelerated failure time model via the bridge
method. Lifetime Data Anal 16:176–195
Huang J, Liu L, Liu Y, Zhao X (2014) Group selection in the Cox model with a diverging number
of covariates. Stat Sinica 24:1787–1810
Lv J, Fan Y (2009) A unified approach to model selection and sparse recovery using regularized
least squares. Ann Stat 37:3498–3528
Ma S, Huang J (2007) Clustering threshold gradient descent regularization: with applications to
microarray studies. Bioinformatics 23:466–472
Ma S, Du P (2012) Variable selection in partly linear regression model with diverging dimensions
for right censored data. Stat Sinica 22:1003–1020
Meier L, Van De Geer S, Bühlmann P (2008) The group lasso for logistic regression. J Royal Stat
Soc: Ser B (Stat Methodol) 70:53–71
Ritov Y (1990) Estimation in a linear regression model with censored data. Ann Stat 18:303–328
Stute W (1993) Almost sure representations of the product-limit estimator for truncated data. Ann
Stat 21:146–156
Stute W (1996) Distributional convergence under random censorship when covariables are present.
Scand J Stat 23:461–471
Stute W, Wang JL (1993) The strong law under random censorship. Ann Stat 9:1591–1607
5 Group Selection in Semiparametric Accelerated Failure Time Model 99

Tibshirani R (1996) Regression shrinkage and selection via the lasso. J Royal Stat Soc. Ser B
(Methodol) 58:267–288
Tsiatis AA (1990) Estimating regression parameters using linear rank tests for censored data. Ann
Stat 18:354–372
Wang L, Chen G, Li H (2007) Group SCAD regression analysis for microarray time course gene
expression data. Bioinformatics 23:1486–1494
Wang S, Nan B, Zhu N, Zhu J (2009) Hierarchically penalized Cox regression with grouped
variables. Biometrika 96:307–322
Ying Z (1993) A large sample study of rank estimation for censored regression data. Ann Stat
21:76–99
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J
Royal Stat Soc: Ser B (Stat Methodol) 68:49–67
Zhang CH, Huang J (2008) The sparsity and bias of the Lasso selection in high-dimensional linear
regression. Ann Stat 36:1567–1594
Zhao P, Rocha G, Yu B (2009) The composite absolute penalties family for grouped and
hierarchical variable selection. Ann Stat 37:3468–3497
Zhang CH (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat
38:894–942
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J Royal Stat Soc:
Ser B (Stat Methodol) 67:301–320
Chapter 6
A Proportional Odds Model for Regression
Analysis of Case I Interval-Censored Data

Pooneh Pordeli and Xuewen Lu

Abstract Case I interval censored or current status data arise in many areas such
as demography, economics, epidemiology and medical science. We introduce a
partially linear single-index proportional odds model to analyze these types of data.
Polynomial smoothing spline method is applied to estimate the nuisance parameters
of our model including the baseline log-odds function and the nonparametric link
function with and without monotonicity constraint, respectively. Then, we propose
a simultaneous sieve maximum likelihood estimation (SMLE). It is also shown that
the resultant estimator of regression parameter vector is asymptotically normal and
achieves the semiparametric information bound, considering that the nonparametric
link function is truly a spline. A simulation experiment presents the finite sample
performance of the proposed estimation method, and an analysis of renal function
recovery data is performed for the illustration.

6.1 Introduction

The proportional odds (PO) model has been used widely as a major model for
analyzing survival data which is particularly practical in the analysis of categorical
data and is possibly the most popular model in the case of ordinal outcomes related
to survival data. Ordinal responses are very common in the medical, epidemiolog-
ical, and social sciences. The PO model was first proposed by McCullagh (1980)
where he extended the idea of constant odds ratio to more than two samples by
means of the PO model. Pettitt (1982) and Bennett (1983) generalized this model
to the survival analysis context and subsequently much effort and research has gone
into proposing reasonable estimators for the regression coefficients for this model.
Although the proportional hazards (PH) model is the most common approach used
for studying the relationship of event times and covariates, alternate models are
needed for occasions when it does not fit the data. As mentioned in Hedeker and
Mermelstein (2011), in analysis of failure time data, when subjects are measured

P. Pordeli () • X. Lu
Department of Mathematics and Statistics, University of Calgary, 2500 University Drive NW,
T2N 1N4, Calgary, AB, Canada
e-mail: [email protected]; [email protected]

© Springer Science+Business Media Singapore 2016 101


D.-G. (Din) Chen et al. (eds.), Advanced Statistical Methods in Data Science,
ICSA Book Series in Statistics, DOI 10.1007/978-981-10-2594-5_6
102 P. Pordeli and X. Lu

repeatedly at fixed intervals in terms of the occurrence of some event, or when


determination of the exact time of the event is only known within grouped intervals
of time, the PO model is a rather elegant and popular choice considering its ordered
categorical nature without any substantial increase in the difficulty of interpretation.
The regression parameter estimates have a nice interpretation as the additive change
in the log-odds (multiplicative effect on the odds) of survival associated with a one
unit change in covariate values.
Suppose T is the failure time of some event of interest and C is a random
censoring time. The observations of failure time, T, are from current status data
type where the only information that we have about them is that if the failure has
happened before or after the examination time C instead of being observed exactly.
Let V D .V1 ; : : : ; Vq /> is a q-dimensional linear covariate vector which is time
independent. The linear PO model is defined as

1  S.tjV/ 1  S0 .t/
D exp.˛ > V/:
S.tjV/ S0 .t/

Since logit.u/ D lnfu=.1  u/g, by taking natural logarithm of both sides, we can
write the model as follows

logit f1  S.tjV/g D logit f1  S0 .t/g C ˛ > V; (6.1)

where $S(t|V)$ is the survival function of $T$ conditional on covariate $V$, $\alpha=(\alpha_1,\ldots,\alpha_q)^\top$ is a $q$-dimensional regression coefficient vector, and $S_0(t)$ is the baseline survival function corresponding to $V=0$. Thus $\mathrm{logit}\{1-S_0(t)\}$ is the baseline log-odds function; it is monotone increasing because $1-S_0(t)=F_0(t)$ and $\mathrm{logit}(\cdot)$ are both increasing. In this model, $\alpha_j$, $j=1,\ldots,q$, is the increase in the log-odds of falling into or below any category of the response variable associated with a one unit increase in $V_j$, holding all other $V_{j'}$ ($j'\neq j$) constant. A positive slope therefore indicates a tendency for the response level to increase as the covariate increases. In other words, the PO model describes the effect that changes in the explanatory variables $V_1,\ldots,V_q$ have on the log-odds of $T$ falling in a lower rather than a higher category. A key advantage of this model is that the logit link yields constant odds ratios; hence the name proportional odds model.
One important property of the PO model is that the hazard ratio converges from $\exp\{\alpha^\top V\}$ to unity as time increases from zero to $\infty$. From (6.1) we can write

$$S(t|V) = \frac{1}{1+\dfrac{1-S_0(t)}{S_0(t)}\exp(\alpha^\top V)},$$

and since $S(t|V)=e^{-\Lambda(t|V)}$ and $S_0(t)=e^{-\Lambda_0(t)}$ we have

$$\Lambda(t|V) = \ln\left[1+\left\{e^{\Lambda_0(t)}-1\right\}\exp(\alpha^\top V)\right],$$

and then, with $\lambda(t|V)=\partial\Lambda(t|V)/\partial t$, it follows that

$$\frac{\lambda(t|V)}{\lambda_0(t)} = \frac{1}{1+\left\{\exp(-\alpha^\top V)-1\right\}S_0(t)}.$$

Thus, when $t=0$, $S_0(t)=1$ and $\lambda(t|V)/\lambda_0(t)=\exp\{\alpha^\top V\}$, and when $t=\infty$, $S_0(t)=0$ and $\lambda(t|V)/\lambda_0(t)=1$. This differs from the PH model, where the hazard ratio remains constant over time, $\lambda(t|V)/\lambda_0(t)=\exp(\alpha^\top V)$, which can be unreasonable in applications where initial effects, such as differences in the stage of disease or in treatment, disappear over time. In such cases the property of the PO model that the hazard ratio converges to 1 as $t$ increases to infinity makes more sense.
To analyze interval-censored data, a number of articles considered the PO model.
Dinse and Lagakos (1983) focused on score tests derived under model (6.1) which
expressed tumour prevalence as a function of time and treatment. Huang and
Rossini (1997) used sieve maximum likelihood estimation (SMLE) to estimate
the finite-dimensional regression parameter. They showed that the estimators are
asymptotically normal with $\sqrt{n}$ convergence rate and achieve the information
bound. Shen (1998) also developed an estimation procedure for the baseline func-
tion and the regression parameters based on a random sieve maximum likelihood
method for linear regression with an unspecified error distribution, taking the
PH and PO models as special cases. Their procedures used monotone splines
to approximate the baseline survival function. They implemented the proposed
procedures for right-censored and case II interval-censored data. The estimated
regression parameters are shown to be asymptotically normal and efficient.
For PO models with current status data, Huang (1995) used maximum likelihood
estimation (MLE). Rossini and Tsiatis (1996) treated the baseline log-odds of
failure time as the infinite-dimensional nuisance parameter of their model and
approximated it with a uniformly spaced non-decreasing step function, and then
proceeded with a maximum likelihood procedure. In Rabinowitz et al. (2000)
the basis for estimation of the regression coefficients of the linear PO model is
maximum likelihood estimation based on the conditional likelihood. Their approach
is applicable to both current status and more generally, interval-censored data. Wang
and Dunson (2011) used monotone splines for approximating the baseline log-
odds function, and McMahan et al. (2013) proposed new EM algorithms to analyze
current status data under two popular semiparametric regression models including
the PO model. They used monotone splines to model the baseline odds function and
provided variance estimation in a closed form.
However, many predictors in a regression model do not necessarily affect the response linearly, and accommodating non-linear covariate effects requires a more flexible model. The literature on this topic is limited, but as a special case of the transformation model, Ma and Kosorok (2005) presented a partly linear proportional odds (PL-PO) model defined as

$$\mathrm{logit}\{1-S(t|V,X_1)\} = \mathrm{logit}\{1-S_0(t)\} + \alpha^\top V + \psi(X_1), \qquad (6.2)$$

where $\psi$ is an unknown function of the one-dimensional non-linear covariate $X_1$. They used penalized MLE to estimate the parameters of their model and made inference based on a block jackknife method. To analyze right-censored data, Lu and Zhang (2010) studied a PL transformation model that contains (6.2) as a special case. They applied a martingale-based estimating equation approach, consisting of both global and kernel-weighted local estimating equations, to estimate the parameters of their model and derived the asymptotic properties of the estimators. They also used a resampling method to estimate the asymptotic variance-covariance matrix of the estimates. These models can handle only one non-linear covariate $X_1\in\mathbb{R}$.
Since many real applications involve more than one non-linear covariate, a model that can incorporate higher dimensionality is needed. In this chapter, we propose a partially linear single-index proportional odds (PLSI-PO) model for analyzing current status data. This model reduces the dimension of the data through a single-index term, and it involves the log-odds of the baseline survival function, which is unknown and has to be estimated. We use B-splines to approximate the negative log-odds of the baseline survival function, $-\ln\{S_0(\cdot)/(1-S_0(\cdot))\}$, and also to approximate the link function of the single-index term, $\psi(\cdot)$. Asymptotic properties of the estimators are derived using the theory of counting processes.

6.2 Model Assumptions and Methods

In many situations there is limited information about the event of interest for a single observation: we only know whether it occurred before or after the examination time. The failure time is either left- or right-censored rather than observed exactly, and we only observe whether or not the event time $T$ has occurred before some monitoring time $C$. In this case, we are dealing with current status data. Suppose $Z=(V^\top, X^\top)^\top$ is a covariate vector. In terms of the odds of $S(t|Z)$, the survival function of $T$ conditional on $Z$, we define the PLSI-PO model as

$$\frac{S(t|Z)}{1-S(t|Z)} = \frac{S_0(t)}{1-S_0(t)}\exp\{-\alpha^\top V - \psi(\beta^\top X)\}, \qquad (6.3)$$

where $\alpha=(\alpha_1,\ldots,\alpha_q)^\top$ and $\beta=(\beta_1,\ldots,\beta_p)^\top$ are $q$- and $p$-dimensional regression coefficient vectors, respectively, $S_0(t)$ is the baseline survival function corresponding to $V=0$, $X=0$, and $\psi(\cdot)$ is the unknown link function of the single-index term. Following Huang and Liu (2006) and Sun et al. (2008), we impose some constraints for identifiability of the model. We assume $\beta_1>0$ so that the sign of $\beta$ is identifiable, and because any constant scale can be absorbed into $\psi(\cdot)$, only the direction of $\beta$ is estimable and its scale is not identifiable; we therefore require $\|\beta\|=1$, where $\|a\|=(a^\top a)^{1/2}$ is the Euclidean norm of a vector $a$. In addition, since there are two nonparametric functions and a constant in one of them can be absorbed into the other, we assume $E(V)=0$ and $E\{\psi(\beta^\top X)\}=0$.
By taking the natural logarithm of both sides of (6.3), the model becomes

$$\ln\left\{\frac{1-S(t|Z)}{S(t|Z)}\right\} = \ln\left\{\frac{1-S_0(t)}{S_0(t)}\right\} + \alpha^\top V + \psi(\beta^\top X),$$

that is,

$$\mathrm{logit}\{1-S(t|Z)\} = \mathrm{logit}\{1-S_0(t)\} + \alpha^\top V + \psi(\beta^\top X), \qquad (6.4)$$

where $\mathrm{logit}(u)=\ln\{u/(1-u)\}$ for $0<u<1$ and $S_0(\cdot)=e^{-\Lambda_0(\cdot)}$ is the baseline survival function.
In the setting of current status data we do not observe the values of $T$ directly; the data consist of independent samples $\{C_i,\delta_i,V_i,X_i\}_{i=1}^n$ drawn from the population $\{C,\delta,V,X\}$, where the censoring time $C$ is continuous on the interval $[a_c,b_c]$ with hazard function $\lambda_c(t|Z)=\lambda(t|Z)$, and $\delta=I(C\le T)$ is the censoring indicator, with $\delta=1$ if the event of interest has not occurred by time $C$ and $\delta=0$ otherwise.
Since $H(\cdot)=\mathrm{logit}\{1-S_0(\cdot)\}=-\mathrm{logit}\{S_0(\cdot)\}$ and $\psi(\cdot)$ are the two unknown functions of model (6.4), both need to be estimated. We use the B-spline method to approximate $H(\cdot)$ and $\psi(\cdot)$, and then use a maximum likelihood approach to estimate the parameters of the model.
Let $\mathcal{M}_n$ be the collection of nonnegative and nondecreasing functions $\Lambda_0(\cdot)$ on $[a_c,b_c]$, and let $\mathbb{L}_n$ be the space of polynomial splines of order $m_L\ge 1$, where each element of this space is a polynomial on each subinterval of a partition of $[a_c,b_c]$. To obtain a faster convergence rate in the B-spline approximation, we assume that for $m_L\ge 2$ and $0\le r\le m_L-2$ each element of this space is $r$ times continuously differentiable on $[a_c,b_c]$. For each $\Lambda_0\in\mathcal{M}_n$, we have

$$\mathrm{logit}\{1-e^{-\Lambda_0(t)}\} = \mathrm{logit}\{1-S_0(t)\} = -\mathrm{logit}\{S_0(t)\} = \sum_{k=1}^{df_L}\gamma_k L_k(C) = \gamma^\top L(C), \qquad (6.5)$$

where $L(C)=(L_1(C),\ldots,L_{df_L}(C))^\top$ is the vector of B-spline basis functions with $L_k(C)\in\mathbb{L}_n$ for each $k=1,\ldots,df_L$, and $\gamma=(\gamma_1,\ldots,\gamma_{df_L})^\top$ is the vector of B-spline coefficients. Here $df_L=K_L+m_L$ is the degree of freedom (the number of basis functions) with $K_L$ interior knots and B-splines of order $m_L$.
Let $\Psi_n$ be the collection of functions $\psi(\cdot)$ on $[a_{xb},b_{xb}]$ and let $\mathbb{B}_n$ be the space of polynomial splines of order $m_B\ge 1$ with the same properties as $\mathbb{L}_n$. To satisfy the identifiability centering constraint $E\{\psi(\beta_0^\top X)\}=0$, we focus on the subspace of spline functions $S^0=\{s:\ s(x)=\sum_{\ell=1}^{df_B}\eta_\ell B_\ell(\beta^\top x),\ \sum_{i=1}^n s(X_i)=0\}$ with basis $\{B_1^*(\beta^\top X),\ldots,B_{df_B}^*(\beta^\top X)\}$, where $B_\ell^*(\beta^\top x)=B_\ell(\beta^\top x)-n^{-1}\sum_{i=1}^n B_\ell(\beta^\top X_i)$ for $\ell=1,\ldots,df_B$. Owing to the empirical version of the constraint, this subspace is $df_{B1}=df_B-1$ dimensional. Hence $\psi\in\Psi_n$ can be approximated at $\beta^\top X$ by

$$\psi(\beta^\top X) = \sum_{\ell=1}^{df_{B1}}\eta_\ell B_\ell(\beta^\top X) = \eta^\top B(\beta^\top X), \qquad (6.6)$$

where $B(\beta^\top X)=(B_1(\beta^\top X),\ldots,B_{df_{B1}}(\beta^\top X))^\top$ is the vector of local normalized (centered) B-spline basis functions with $B_\ell(\beta^\top X)\in\mathbb{B}_n$, and $\eta=(\eta_1,\ldots,\eta_{df_{B1}})^\top$ is the vector of B-spline coefficients. With $K_B$ interior knots and B-splines of order $m_B$, the degree of freedom of the B-spline is $df_B=K_B+m_B$.
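As a concrete illustration of (6.5) and (6.6), the following R sketch builds the two basis matrices with the splines package. The quantile-based placement of interior knots, the centering step, and the choice to drop one centered column to obtain a $(df_B-1)$-dimensional basis are assumptions consistent with the description above, not the authors' code; the function name make_bases is hypothetical.

```r
library(splines)

# Quadratic basis for H(C) = gamma' L(C) and cubic basis for psi(beta' X),
# as in (6.5)-(6.6); interior knots at sample quantiles (an assumption).
make_bases <- function(C, index, K_L = 2, K_B = 3) {
  L_mat <- bs(C, knots = quantile(C, probs = seq_len(K_L) / (K_L + 1)),
              degree = 2, intercept = TRUE)          # df_L = K_L + 3
  B_raw <- bs(index, knots = quantile(index, probs = seq_len(K_B) / (K_B + 1)),
              degree = 3, intercept = TRUE)          # df_B = K_B + 4
  # Centre each column so the empirical version of E{psi(beta'X)} = 0 holds,
  # then drop one column so that df_B - 1 basis functions remain.
  B_cen <- sweep(B_raw, 2, colMeans(B_raw))
  list(L = L_mat, B = B_cen[, -1, drop = FALSE])
}
```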
For identifiability we assume $\|\beta\|=1$ and apply the delete-one-component method, defining $\tilde\beta=(\beta_2,\ldots,\beta_p)^\top$ as the $(p-1)$-dimensional vector obtained by deleting the first component $\beta_1$. We also assume $\beta_1>0$, which is implemented by setting $\beta_1=\sqrt{1-\|\tilde\beta\|^2}=\sqrt{1-\sum_{k=2}^p\beta_k^2}$. Then $\beta=(\sqrt{1-\sum_{k=2}^p\beta_k^2},\beta_2,\ldots,\beta_p)^\top$, where the true parameter satisfies $\|\tilde\beta_0\|<1$; therefore $\beta$ is infinitely differentiable in a neighborhood of the true parameter $\tilde\beta_0$. Since we use a B-spline to estimate $H(\cdot)=\mathrm{logit}\{1-e^{-\Lambda_0(\cdot)}\}$, the baseline cumulative hazard function is $\Lambda_0(\cdot)=\ln\{1+e^{\gamma^\top L(\cdot)}\}$, which has to be positive and nondecreasing. Positivity is guaranteed by the logarithm, and the nondecreasing condition is satisfied by imposing the constraint $\gamma_1\le\cdots\le\gamma_{df_L}$ on the coefficients of the basis functions.
Under suitable smoothness assumptions, $\mathrm{logit}\{1-S_0(\cdot)\}$ and $\psi(\cdot)$ can be well approximated by functions in $\mathbb{L}_n$ and $\mathbb{B}_n$, respectively. We therefore seek members of $\mathbb{L}_n$ and $\mathbb{B}_n$, along with values of $\alpha$ and $\beta$, that maximize the semiparametric log-likelihood function.
We now substitute the B-spline approximations (6.5) and (6.6) for $\mathrm{logit}\{1-S_0(t)\}$ and $\psi(\beta^\top X)$ into Eq. (6.4). Since $\mathrm{logit}\{1-S(t|Z)\}=-\mathrm{logit}\{S(t|Z)\}$, (6.4) is equivalent to

$$\mathrm{logit}(p) = -\gamma^\top L(C) - \alpha^\top V - \eta^\top B(\beta^\top X), \qquad (6.7)$$

where $p=p(t)=S(t|Z)$. For subject $i$, $i=1,\ldots,n$, we have $S(C_i|Z_i)=\Pr(C_i\le T_i|Z_i)=E\{I(C_i\le T_i)|Z_i\}=E(\delta_i|Z_i)$, so by treating $\delta_i$ as a binary response we can regard model (6.7) as a generalized linear model (GLM) with the linear predictor given in (6.7) and logit link. We then use GLM methods, available in many software packages, to estimate the parameters $\alpha$, $\beta$, $\gamma$ and $\eta$; we use the glm function in R. The constraints imposed on the model are not enforced at this stage; the resulting estimates serve as initial values of the PLSI-PO parameters for the next step, which maximizes the semiparametric log-likelihood function subject to the stated constraints.
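A minimal sketch of this GLM initialization, assuming the make_bases helper from the earlier sketch and illustrative object names (delta, C, V, X, beta0), might look as follows; the fitted coefficients are read through (6.7), so their signs are interpreted accordingly.

```r
# Sketch of the GLM initialisation: with beta fixed at beta0, delta is a
# Bernoulli response whose logit-mean is linear in the spline bases and V.
init_fit <- function(delta, C, V, X, beta0) {
  bas <- make_bases(C, as.vector(X %*% beta0))
  M   <- cbind(bas$L, V, bas$B)                  # design matrix [L | V | B]
  fit <- glm(delta ~ M - 1, family = binomial("logit"))
  coef(fit)   # initial values for (gamma, alpha, eta), up to sign via (6.7)
}
```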
For current status data, the likelihood function at the observed censoring times $C_i$, given covariates $Z_i$, is proportional to $\prod_{i=1}^n\{S(C_i|Z_i)\}^{\delta_i}\{1-S(C_i|Z_i)\}^{1-\delta_i}$, where for each $i=1,\ldots,n$, $S(C_i|Z_i)=\exp\{-\Lambda(C_i|Z_i)\}=p_i(C_i)=p_i$. Thus the semiparametric log-likelihood function for the PLSI-PO model is

$$\ell_O = \ell_O(\alpha,\beta,S_0,\psi) = \sum_{i=1}^n\{\delta_i\ln(p_i)+(1-\delta_i)\ln(1-p_i)\}$$
$$= \sum_{i=1}^n\left[\delta_i\ln\left\{\frac{1}{1+\exp\left(-\mathrm{logit}\{S_0(C_i)\}+\alpha^\top V_i+\psi(\beta^\top X_i)\right)}\right\}
+(1-\delta_i)\ln\left\{\frac{\exp\left(-\mathrm{logit}\{S_0(C_i)\}+\alpha^\top V_i+\psi(\beta^\top X_i)\right)}{1+\exp\left(-\mathrm{logit}\{S_0(C_i)\}+\alpha^\top V_i+\psi(\beta^\top X_i)\right)}\right\}\right]. \qquad (6.8)$$

We then plug the B-spline approximations of $-\mathrm{logit}\{S_0(\cdot)\}$ and $\psi(\cdot)$ obtained from (6.5) and (6.6) into the semiparametric log-likelihood function (6.8), giving

$$\ell_O = \ell_O(\alpha,\beta,\gamma,\eta) = \sum_{i=1}^n\left[\delta_i\ln\left\{\frac{1}{1+\exp\left(\gamma^\top L(C_i)+\alpha^\top V_i+\eta^\top B(\beta^\top X_i)\right)}\right\}
+(1-\delta_i)\ln\left\{\frac{\exp\left(\gamma^\top L(C_i)+\alpha^\top V_i+\eta^\top B(\beta^\top X_i)\right)}{1+\exp\left(\gamma^\top L(C_i)+\alpha^\top V_i+\eta^\top B(\beta^\top X_i)\right)}\right\}\right]. \qquad (6.9)$$

We can now estimate the parameters of the regression model, $(\alpha,\beta,\gamma,\eta)$, by maximizing the log-likelihood function (6.9), which takes a parametric form once the infinite-dimensional nuisance parameters are replaced by their B-spline approximations. To maximize (6.9) we use a sieve method through an iterative algorithm, subject to the constraints on the coefficients $\gamma$ and $\beta$: $\gamma_1\le\cdots\le\gamma_{df_L}$ for monotonicity of $\Lambda_0(C)$, and $\beta_1>0$ with $\|\beta\|=1$ for identifiability of $\psi(\beta^\top X)$.
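For reference in the sieve maximization, a direct transcription of (6.9) could look like the sketch below. The packing order of the parameter vector and the reuse of make_bases are assumptions for illustration; dfL and dfB1 must match the dimensions of the bases actually produced.

```r
# Log-likelihood (6.9) for current status data under the PLSI-PO model.
# theta packs (beta_tilde, alpha, gamma, eta); beta_1 = sqrt(1 - ||beta_tilde||^2).
loglik_plsi_po <- function(theta, delta, C, V, X, p, q, dfL, dfB1) {
  beta_t <- theta[seq_len(p - 1)]
  beta   <- c(sqrt(max(1 - sum(beta_t^2), 0)), beta_t)
  alpha  <- theta[(p - 1) + seq_len(q)]
  gamma  <- theta[(p - 1) + q + seq_len(dfL)]
  eta    <- theta[(p - 1) + q + dfL + seq_len(dfB1)]
  bas    <- make_bases(C, as.vector(X %*% beta))   # dims must match dfL, dfB1
  xi     <- drop(bas$L %*% gamma + V %*% alpha + bas$B %*% eta)
  pr     <- 1 / (1 + exp(xi))                      # p_i = S(C_i | Z_i)
  sum(delta * log(pr) + (1 - delta) * log(1 - pr))
}
```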
Some regularity conditions are needed to establish the large sample properties of the estimators in the PLSI-PO model:
(C1) (i) $T$ and $C$ are independent given the covariate history $Z$. (ii) The censoring time $C$ has an absolutely continuous distribution on $[a_c,b_c]$, where $0<a_c<b_c<\infty$, with hazard function $\lambda(t)=\lambda(t,Z)=\lambda_c(t,Z)$ conditional on the covariate vector $Z$.
(C2) For any integer $s\ge 1$, the $s$th derivatives $\Lambda_0^{(s)}$ and $\psi^{(s)}$ exist and are continuous and positive. The finite-dimensional parameter spaces $\Theta_1$ for $\alpha$ and $\Theta_2$ for $\tilde\beta$ are bounded subsets of $\mathbb{R}^q$ and $\mathbb{R}^{p-1}$, respectively. The true regression parameter values $\alpha_0\in\Theta_1$ and $\tilde\beta_0\in\Theta_2$ are interior points, and the true functions satisfy $\Lambda_0\in\mathcal{M}_n$ and $\psi\in\Psi_n$.
(C3) (i) For any $\alpha_0\ne\alpha$ and $\beta_0\ne\beta$ we have $\Pr(\alpha_0^\top V\ne\alpha^\top V)>0$ and $\Pr(\beta_0^\top X\ne\beta^\top X)>0$. (ii) $E(V)=0$, and for the true parameter $\tilde\beta_0\in\Theta_2$ and the true function $\psi(\cdot)$, $E\{\psi(\beta_0^\top X)\}=0$.
(C4) (i) The covariates $V$ and $X$ have bounded supports, subsets of $\mathbb{R}^q$ and $\mathbb{R}^p$ respectively; that is, there exist $v_0$ and $x_0$ such that $\|V\|\le v_0$ and $\|X\|\le x_0$ with probability 1. (ii) Denoting the distribution of $T$ by $F_0$ with $F_0(0)=0$, the support of $C$ is strictly contained in the support of $F_0$; that is, for $t_{F_0}=\inf\{t: F_0(t)=1\}$ we have $0<a_c<b_c<t_{F_0}$.
(C5) For some small $\varepsilon>0$, $\Pr(T<a_c|C,V,X)>\varepsilon$ and $\Pr(T>b_c|C,V,X)>\varepsilon$ with probability one.
(C6) The baseline cumulative hazard function $\Lambda_0$ has a strictly positive derivative on $[a_c,b_c]$, and the joint distribution function $G(c,v,x)$ of $(C,V,X)$ has a bounded second-order partial derivative with respect to $c$.
Condition (C1) ensures non-informative censoring. (C2) ensures identifiability of the parameters, and (C3) imposes the characteristics needed to apply spline smoothing techniques. (C4) keeps the likelihood and score functions away from infinity at the boundaries of the support of the observed event time. (C5) is required so that the probability of being either left-censored or right-censored is positive and bounded away from zero regardless of the covariate values. (C6) requires the partial score functions (partial derivatives) of the nonparametric components in the least favorable direction to be close to zero, so that the $\sqrt{n}$ convergence rate and asymptotic normality of the finite-dimensional estimator can be obtained. Similar conditions for a linear PH model with current status data are discussed in Huang (1996).

6.3 Theory of Estimation

We first assume that (6.6) holds, i.e., the link function is a B-spline function with fixed knots and order. Model (6.3) then contains only one nonparametric function, $S_0(t)$, and we can calculate the information matrix of the estimators of the regression parameters. When $\psi$ is a smooth nonparametric function rather than a B-spline function, the derived theory serves as an approximation. Our simulation results indicate that the approximation is quite accurate and provides a practical solution for real data analysis.

6.3.1 Information Calculation

Given observations $\{C_i,\delta_i,V_i,X_i\}_{i=1}^n$, define two counting processes $N_{1i}(t)=\delta_i I(C_i\le t)$ and $N_{2i}(t)=(1-\delta_i)I(C_i\le t)$. Let $N_i(t)=N_{1i}(t)+N_{2i}(t)=I(C_i\le t)$ and let $Y_i(t)=I(t\le C_i)$ be the at-risk process at time $t$. The processes $N_{1i}(t)$ and $N_{2i}(t)$ have intensity processes $\lambda^{N_{1i}}(t)=Y_i(t)\lambda_i(t)p_i(t)=Y_i\lambda_i p_i$ and $\lambda^{N_{2i}}(t)=Y_i(t)\lambda_i(t)\{1-p_i(t)\}=Y_i\lambda_i(1-p_i)$, and we define $M_{1i}(t)$ and $M_{2i}(t)$ as the corresponding compensated counting processes

$$M_{1i}(t) = N_{1i}(t) - \int_0^t Y_i(s)\lambda_i(s)p_i(s)\,ds, \qquad
M_{2i}(t) = N_{2i}(t) - \int_0^t Y_i(s)\lambda_i(s)\{1-p_i(s)\}\,ds,$$

which are martingales, as shown in Martinussen and Scheike (2002). In the following we drop $t$ from $N_1(t)$, $N_2(t)$, $M_1(t)$, $M_2(t)$, $\lambda(t)$, $Y(t)$ and $p(t)$ unless it needs to be specified. The log-likelihood function given in (6.8) can be written as

$$\ell_O = \sum_{i=1}^n\left\{\int(\ln p_i)\,dN_{1i} + \int\ln(1-p_i)\,dN_{2i}\right\}, \qquad (6.10)$$

where $p_i=p(C_i)=S(C_i|Z_i)$ for each $i=1,\ldots,n$, with

$$S(C_i|Z_i) = \frac{1}{1+\exp\left[-\mathrm{logit}\{S_0(C_i)\}+\alpha^\top V_i+\psi(\beta^\top X_i)\right]}.$$

Because one of the two unknown functions of the model contains an unknown parameter, it is difficult, when deriving the efficient information bound, to project onto the sum space of two non-orthogonal $L_2$ spaces. We therefore substitute the B-spline approximation of $\psi(\beta^\top X)$ from (6.7) and treat $H_0(\cdot)=-\mathrm{logit}\{S_0(\cdot)\}=-\mathrm{logit}\{e^{-\Lambda_0(\cdot)}\}$ as the only infinite-dimensional nuisance parameter of the model. We consider simpler finite-dimensional parametric submodels contained within this semiparametric model: a parametric submodel for the nuisance parameter $H(\cdot)=H_0(\cdot)$ is a mapping of the form $\theta\mapsto H_\theta(\cdot)$ in $\{H_\theta(\cdot):\theta\in\mathbb{R}^{q+(p-1)+df_{B1}}\}$. We then characterize $H_\theta(\cdot)$ by a finite-dimensional
parameter $\theta$ such that

$$\frac{\partial H_\theta(t)}{\partial\theta} = a(t) = a. \qquad (6.11)$$

Since $\beta$ is re-parametrized as $\beta=(\beta_1,\tilde\beta^\top)^\top$, the marginal score vector for $\tilde\theta=(\tilde\beta^\top,\alpha^\top,\eta^\top)^\top$ is obtained by partially differentiating $\ell_O(\tilde\theta,H_\theta(\cdot))$ in (6.10), for one observation, with respect to $\tilde\theta$:

$$S_{\tilde\theta} = \frac{\partial\ell_O}{\partial\tilde\theta} = \left(S_{\tilde\beta}^\top, S_\alpha^\top, S_\eta^\top\right)^\top,$$

where, defining $\tilde X=(X_2,\ldots,X_p)^\top$ with $X_1$ the first element of $X$, and $\beta_1=\sqrt{1-\sum_{k=2}^p\beta_k^2}$, we have

$$S_{\tilde\beta} = \frac{\partial\ell_O}{\partial\tilde\beta} = \int \eta^\top B'(\beta^\top X)\left(\tilde X - X_1\frac{\tilde\beta}{\beta_1}\right)\{p\,dN_2-(1-p)\,dN_1\},$$
$$S_\alpha = \frac{\partial\ell_O}{\partial\alpha} = \int V\{p\,dN_2-(1-p)\,dN_1\},$$
$$S_\eta = \frac{\partial\ell_O}{\partial\eta} = \int B(\beta^\top X)\{p\,dN_2-(1-p)\,dN_1\},$$

with

$$p = p(C) = S(C|Z) = \frac{1}{1+\exp\{H(C)+\alpha^\top V+\eta^\top B(\beta^\top X)\}}.$$

Since $dN_1=dM_1+Yp\lambda\,dt$ and $dN_2=dM_2+Y(1-p)\lambda\,dt$, we can write

$$S_{\tilde\theta} = \int U^*\{p\,dM_2-(1-p)\,dM_1\},$$

where $U^*$ is a $\{q+(p-1)+df_{B1}\}\times 1$ vector defined by

$$U^* = \left[\left\{\eta^\top B'(\beta^\top X)\left(\tilde X - X_1\frac{\tilde\beta}{\beta_1}\right)\right\}^\top,\; V^\top,\; B^\top(\beta^\top X)\right]^\top.$$
Let $e_1=\exp\{\alpha^\top V+\eta^\top B(\beta^\top X)\}$, and let $\ell_o$ denote the expression inside the summation in (6.10), that is, $\ell_o=\int(\ln p)\,dN_1+\int\ln(1-p)\,dN_2$. For $H(\cdot)=-\mathrm{logit}\{S_0(\cdot)\}$ we have $S_H(a)=S_1(a)$, where $S_1(a)=\frac{\partial\ell_o}{\partial H}\frac{\partial H_\theta}{\partial\theta}$ and

$$\frac{\partial\ell_o}{\partial H}
= \frac{\partial}{\partial H}\left[\ln\left(\frac{1}{1+e^H e_1}\right)dN_1+\ln\left(\frac{e^H e_1}{1+e^H e_1}\right)dN_2\right]
= -\frac{e^H e_1}{1+e^H e_1}\,dN_1+\frac{1}{1+e^H e_1}\,dN_2
= p\,dN_2-(1-p)\,dN_1 = p\,dM_2-(1-p)\,dM_1.$$

Thus the score operator associated with $H$ is

$$S_H(a) = \int a(t)\{p\,dM_2-(1-p)\,dM_1\}.$$

Under conditions (C1)–(C6), the efficient score for the finite-dimensional parameter $\tilde\theta$ is the difference between its score vector $S_{\tilde\theta}$ and the score for a particular submodel of the nuisance parameter, $S_H(a)$: the particular submodel is the one for which this difference is uncorrelated with the scores of all other submodels of the nuisance parameter. Thus the efficient score for $\tilde\theta$ is

$$S^*_{\tilde\theta} = S_{\tilde\theta}-S_H(a).$$

We have to find $a=a^*(t)$ such that

$$S^*_{\tilde\theta} = S_{\tilde\theta}-S_H(a^*) = \int(U^*-a^*)\{p\,dM_2-(1-p)\,dM_1\} \qquad (6.12)$$

is orthogonal to any other $S_H(a)\in\mathcal{A}_H$, where $\mathcal{A}_H=\{S_H(a): a\in L_2(P_C)\}$ and $L_2(P_C)=\{a: E[\|a(C)\|^2 p(C)\{1-p(C)\}]<\infty\}$; that is, $E(S^*_{\tilde\theta}S_H)=0$. Thus,

$$E(S^*_{\tilde\theta}S_H) = E[\{S_{\tilde\theta}-S_H(a^*)\}S_H(a)] = 0 \qquad (6.13)$$

for any $a\in L_2(P_C)$. The orthogonality equation (6.13) is equivalent to

$$E\left[\int(U^*-a^*)\{p\,dM_2-(1-p)\,dM_1\}\int a\{p\,dM_2-(1-p)\,dM_1\}\right]
= E\left[\int(U^*-a^*)a\,p^2\,d\langle M_2\rangle+\int(U^*-a^*)a\,(1-p)^2\,d\langle M_1\rangle\right]=0. \qquad (6.14)$$

Then, since $d\langle M_1\rangle=p\lambda Y\,dt$ and $d\langle M_2\rangle=(1-p)\lambda Y\,dt$, this is equivalent to

$$\int aE\{(U^*-a^*)p(1-p)\lambda Y\}\,dt = 0,$$

and hence

$$\int a\left[E\{U^*p(1-p)\lambda Y\}-a^*E\{p(1-p)\lambda Y\}\right]dt = 0.$$

Since this holds for any $a\ne 0$, we obtain

$$a^* = \frac{E\{U^*p(1-p)\lambda Y\}}{E\{p(1-p)\lambda Y\}}.$$

Plugging $a^*$ into (6.12), the efficient score for $\tilde\theta$ is

$$S^*_{\tilde\theta} = \int\left[U^*-\frac{E\{U^*p(1-p)\lambda Y\}}{E\{p(1-p)\lambda Y\}}\right]\{p\,dM_2-(1-p)\,dM_1\}. \qquad (6.15)$$

The empirical version of the efficient score for $\tilde\theta$ is

$$S(\tilde\theta,\lambda) = \sum_{i=1}^n\int\left(U_i^*-\frac{S_1^{(\tilde\theta)}}{S_0^{(\tilde\theta)}}\right)\{p_i\,dM_{2i}-(1-p_i)\,dM_{1i}\}, \qquad (6.16)$$

where

$$S_u^{(\tilde\theta)} = S_u^{(\tilde\theta)}(t) = \sum_i(U_i^*)^{\otimes u}p_i(1-p_i)Y_i\lambda_i, \qquad u=0,1,$$

with $\otimes$ denoting the Kronecker operation defined by $b^{\otimes 0}=1$, $b^{\otimes 1}=b$ and $b^{\otimes 2}=bb^\top$. Since $\lambda_i=\lambda(t|Z_i)$ is an unknown function of the covariate vector, estimating $S_u^{(\tilde\theta)}$ requires an estimate of $\lambda_i$. Following the idea of Martinussen and Scheike (2002), we use a simple kernel estimator, replacing $\lambda_i(t)\,dt$ by the convolution of the kernel $K_b(\cdot)$ with $dN_i(s)$, so that $\hat\lambda(t|Z_i)\,dt=K_b(s-t)\,dN_i(s)$. The kernel function satisfies $K_b(\cdot)=(1/b)K(\cdot/b)$, where $b>0$ is the bandwidth. We also assume that $\int K_b(u)\,du=1$, $\int uK_b(u)\,du=0$, and that the kernel has compact support.
Therefore, after obtaining the semiparametric maximum likelihood estimator $\hat{\tilde\theta}$, the plug-in estimate of $S_u^{(\tilde\theta)}$ is

$$\hat S_u^{(\tilde\theta)} = \hat S_u^{(\tilde\theta)}(t) = \sum_{i=1}^n\int\hat p_i(s)\{1-\hat p_i(s)\}Y_i(s)(\hat U_i^*)^{\otimes u}K_b(s-t)\,dN_i(s), \qquad u=0,1,$$

where

$$\hat p_i(s) = \frac{1}{1+\exp\left[-\mathrm{logit}\{\hat S_0(s)\}+\hat\alpha^\top V+\hat\eta^\top B(\hat\beta^\top X)\right]}
= \frac{1}{1+\exp\left[-\mathrm{logit}\{e^{-\hat\Lambda_0(s)}\}+\hat\alpha^\top V+\hat\eta^\top B(\hat\beta^\top X)\right]},$$

with kernel function $K_b(\cdot)=(1/b)K(\cdot/b)$ and bandwidth $b>0$. Since $\hat\lambda_i(t)\,dt=\hat\lambda(t|Z_i)\,dt=K_b(s-t)\,dN_i(s)=K_b(C_i-t)I(C_i\ge t)$ for any $t\in[a_c,b_c]$, we have

$$\hat S_u^{(\tilde\theta)}(t) = \sum_{i=1}^n\hat p_i(C_i)\{1-\hat p_i(C_i)\}Y_i(C_i)\{\hat U_i^*(C_i)\}^{\otimes u}K_b(C_i-t), \qquad u=0,1.$$

The Epanechnikov kernel is used here, defined as

$$K_b(u) = (1/b)(3/4)\{1-(u/b)^2\}I(|u/b|\le 1).$$
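A small R sketch of the Epanechnikov kernel and of the kernel-weighted sums $\hat S_u^{(\tilde\theta)}(t)$ is given below, assuming the fitted quantities (phat, Ustar) are available from the fit; all names are illustrative and the at-risk indicator is taken as $I(t\le C_i)$.

```r
# Epanechnikov kernel and the kernel-weighted sums S_u used later in
# estimating Sigma_1; Ustar is the n x d matrix of U*_i(C_i), phat the fitted
# p_i(C_i), Ci the monitoring times, and b the bandwidth.
K_epa <- function(u, b) (3 / (4 * b)) * (1 - (u / b)^2) * (abs(u / b) <= 1)

S_u_hat <- function(t, u, Ci, phat, Ustar, b) {
  w <- phat * (1 - phat) * (t <= Ci) * K_epa(Ci - t, b)
  if (u == 0) sum(w) else colSums(w * Ustar)     # u = 0 or u = 1
}
```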

The information matrix at $\tilde\theta_0$ can then be written as

$$I(\tilde\theta_0) = E\left[\int\left\{U^*-E\{U^*Y\lambda p(1-p)\}E^{-1}\{Y\lambda p(1-p)\}\right\}^{\otimes 2}p(1-p)Y\lambda\,dt\right].$$

Using the central limit theorem for martingales, $n^{-1/2}S(\tilde\theta_0,\hat\lambda)$ converges in distribution to a normal distribution with mean zero and covariance matrix $\Sigma_1$, which can be consistently estimated by

$$\hat\Sigma_1 = \frac{1}{n}\sum_{i=1}^n\int\left(\hat U_i^*-\frac{\hat S_1^{(\tilde\theta)}}{\hat S_0^{(\tilde\theta)}}\right)\left(\hat U_i^*-\frac{\hat S_1^{(\tilde\theta)}}{\hat S_0^{(\tilde\theta)}}\right)^\top\left\{\hat p_i^2\,dN_{2i}+(1-\hat p_i)^2\,dN_{1i}\right\}.$$

Since $\hat\Sigma_1$ converges in probability to $\Sigma_1$, it follows that $n^{1/2}(\hat{\tilde\theta}-\tilde\theta_0)$ converges in distribution to a mean-zero normal distribution with covariance matrix $\Sigma=I^{-1}(\tilde\theta_0)\Sigma_1 I^{-1}(\tilde\theta_0)$. The robust sandwich estimator of the variance is $\hat\Sigma=\hat I^{-1}(\hat{\tilde\theta})\hat\Sigma_1\hat I^{-1}(\hat{\tilde\theta})$. With the consistent estimator of $\lambda$ we can conclude
that $\hat\Sigma_1=I(\hat{\tilde\theta})+o_p(1)$, and thus $n^{1/2}(\hat{\tilde\theta}-\tilde\theta_0)$ converges in distribution to a mean-zero random vector with covariance matrix $I^{-1}(\tilde\theta_0)$, estimated by $\hat I^{-1}(\hat{\tilde\theta})$. Therefore, the obtained estimators are efficient. A rigorous proof of the consistency and asymptotic normality of the semiparametric estimator $\hat{\tilde\theta}$ can be obtained using the theory developed by Huang (1996) for empirical processes with current status data, assuming that (6.6) holds, i.e., that the link function is a B-spline function with fixed knots and order.

6.3.2 Inference

To obtain the variance-covariance matrix of $\hat\theta=(\hat\beta^\top,\hat\alpha^\top,\hat\eta^\top)^\top$, define the map $G:(\tilde\beta,\alpha,\eta)\to(\beta,\alpha,\eta)$. After obtaining the variance-covariance matrix of $\hat{\tilde\theta}=(\hat{\tilde\beta}^\top,\hat\alpha^\top,\hat\eta^\top)^\top$ as $\hat I^{-1}(\hat{\tilde\theta})=\mathrm{Var}(\hat{\tilde\beta},\hat\alpha,\hat\eta)$, the delta method gives

$$\mathrm{Var}(\hat\beta,\hat\alpha,\hat\eta) = \mathrm{Var}\{G(\hat{\tilde\beta},\hat\alpha,\hat\eta)\} = G'(\hat{\tilde\beta},\hat\alpha,\hat\eta)\,\mathrm{Var}(\hat{\tilde\beta},\hat\alpha,\hat\eta)\,G'^\top(\hat{\tilde\beta},\hat\alpha,\hat\eta), \qquad (6.17)$$

where $G'$ is a $\{p+q+df_{B1}\}\times\{(p-1)+q+df_{B1}\}$ matrix given by

$$G'(\tilde\beta,\alpha,\eta) = \frac{\partial G(\tilde\beta,\alpha,\eta)}{\partial(\tilde\beta,\alpha,\eta)} = \frac{\partial(\beta,\alpha,\eta)}{\partial(\tilde\beta,\alpha,\eta)}
= \begin{pmatrix} -\beta_2/\beta_1 & \cdots & -\beta_p/\beta_1 & 0 & \cdots & 0\\ & & I_{(p-1)+q+df_{B1}} & & & \end{pmatrix}.$$

Then $\mathrm{Var}(\hat\beta,\hat\alpha,\hat\eta)$ is estimated using (6.17), and $\mathrm{Var}(\hat\beta)$, $\mathrm{Var}(\hat\alpha)$ and $\mathrm{Var}(\hat\eta)$ are estimated by the corresponding blocks of $\mathrm{Var}(\hat\beta,\hat\alpha,\hat\eta)$. Inferences such as Wald-type tests and confidence intervals for the parameters can be made using these estimated variances. Moreover, for any $s$ in the support of $\beta_0^\top X$, an approximate $(1-\vartheta)100\,\%$ confidence band for $\psi(s)=\eta^\top B(s)$ is

$$\hat\psi(s)\pm Z_{\vartheta/2}\,\mathrm{SE}\{\hat\psi(s)\},$$

where $\hat\psi(s)=\hat\eta^\top B(s)$, $\mathrm{SE}\{\hat\psi(s)\}=[\mathrm{Var}\{\hat\psi(s)\}]^{1/2}=\{B^\top(s)\mathrm{Var}(\hat\eta)B(s)\}^{1/2}$, and $Z_{\vartheta/2}$ is the upper $\vartheta/2$ quantile of the standard normal distribution.
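The pointwise confidence band for $\hat\psi(s)=\hat\eta^\top B(s)$ can be computed as in the sketch below, assuming a function B_of_s that evaluates the same centered basis used in the fit and the $\eta$ block Var_eta of the delta-method covariance matrix; both names are hypothetical.

```r
# Pointwise confidence band for psi(s) = eta' B(s): the standard error at each
# grid point is {B(s)' Var(eta_hat) B(s)}^{1/2}, following (6.17).
psi_band <- function(s_grid, eta_hat, B_of_s, Var_eta, level = 0.95) {
  z   <- qnorm(1 - (1 - level) / 2)
  Bs  <- B_of_s(s_grid)                        # length(s_grid) x df_B1 matrix
  est <- drop(Bs %*% eta_hat)
  se  <- sqrt(rowSums((Bs %*% Var_eta) * Bs))  # diagonal of Bs Var Bs'
  cbind(fit = est, lower = est - z * se, upper = est + z * se)
}
```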

6.3.3 Implementation

Because the unknown functions of the model are approximated by B-splines, the functional estimates may be numerically unstable where the data are sparse. The monitoring time $C$ tends to be sparse in the right tail of its distribution, so the estimator of $H(t)=\mathrm{logit}\{1-S_0(t)\}=-\mathrm{logit}\{S_0(t)\}$ may deviate from the true curve there. To overcome this problem, as suggested by Gray (1992), we add a penalty term to the estimation; the penalty pulls the estimates away from very extreme values, and the penalized estimates are less biased and better behaved. Since the distribution of $\beta^\top X$ is less sparse in the tails than that of $C$, we do not impose a penalty on $\psi(\cdot)$. From (6.10), the penalized likelihood function becomes

$$\ell_P(\alpha,\beta,\gamma,\eta) = \ell_O(\alpha,\beta,\gamma,\eta) - \frac{1}{2}\lambda_n\int\{H'(t)\}^2\,dt, \qquad (6.18)$$

where $H(t)$ is the quadratic B-spline defined in (6.5) and $\lambda_n$ is a penalty tuning parameter that controls the smoothing: when $\lambda_n$ is close to zero there is essentially no penalty, and as $\lambda_n\to\infty$ it forces $H(t)$ towards a constant. Because the penalty term in (6.18) is a quadratic function of $\gamma$, it can be written as

$$\frac{1}{2}\lambda_n\,\gamma^\top P\gamma,$$

where $P=\left(\int_{a_c}^{b_c}L_s'(t)L_r'(t)\,dt\right)_{1\le s\le df_L,\,1\le r\le df_L}$ is a $df_L\times df_L$ nonnegative definite matrix, which can be approximated by Monte Carlo integration.
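A Monte Carlo approximation of $P$ might be coded as follows, with the derivative of the quadratic basis obtained by finite differences; L_basis_fun, which evaluates the fitted L-basis at arbitrary points, is an assumed helper.

```r
# Monte Carlo approximation of P_{sr} = \int_{a_c}^{b_c} L_s'(t) L_r'(t) dt.
penalty_matrix <- function(L_basis_fun, a_c, b_c, n_mc = 5000, h = 1e-4) {
  t_mc <- runif(n_mc, a_c, b_c)
  Lp   <- (L_basis_fun(t_mc + h) - L_basis_fun(t_mc - h)) / (2 * h)
  (b_c - a_c) * crossprod(Lp) / n_mc      # (b_c - a_c) * mean of L'(t) L'(t)'
}
```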
We use the sieve method for maximum penalized log-likelihood estimation of the parameters and functions in the model. We maximize the penalized log-likelihood function $\ell_P(\cdot)$ in (6.18) as the objective function subject to constraints, denoted $g_u(\cdot)$, and apply the adaptive barrier algorithm for the constrained optimization through the function constrOptim.nl in the R package alabama.
The iterative algorithm used to maximize $\ell_P$ proceeds as follows (an R sketch of the loop is given after the list):
• Step 0: Assuming the direction of $\beta$ can be correctly estimated, obtain $\alpha^{(0)}$, $\gamma^{(0)}$, $\eta^{(0)}$ from the GLM method by fixing $\beta$ at an initial value $\beta^{(0)}$.
• Step 1: At iteration $k$, given the current values $\alpha^{(k)}$, $\gamma^{(k)}$, $\eta^{(k)}$, update $\beta^{(k)}$ by maximizing the log-likelihood function in (6.10) subject to the constraint $1-\sum_{\ell=2}^p\beta_\ell^2>0$, which enforces $\beta_1>0$ and $\|\beta\|=\left(\sum_{\ell=1}^p\beta_\ell^2\right)^{1/2}=1$. We use the barrier method implemented by the constrOptim.nl function in the R package alabama. Denote the updated value by $\beta^{(k+1)}$.
• Step 2: Given $\beta^{(k+1)}$, update $\alpha^{(k)}$, $\gamma^{(k)}$, $\eta^{(k)}$ simultaneously through a GLM with binary response $\delta$, logit link, and the linear predictor given in (6.7). Letting $\omega=(\alpha^\top,\gamma^\top,\eta^\top)^\top$, we use the Newton-Raphson method to obtain $\omega^{(k+1)}=(\alpha^{(k+1)\top},\gamma^{(k+1)\top},\eta^{(k+1)\top})^\top$ by maximizing the log-likelihood $\ell_O$ in (6.10) without imposing the constraints on $\gamma$ and $\beta$. This is implemented by the nlminb function in R, which uses a quasi-Newton algorithm.
• Step 3: Using the same procedure as Step 1, and imposing the constraints $\gamma_1^{(k+1)}\le\cdots\le\gamma_{df_L}^{(k+1)}$ on $\gamma$, further update the value of $\gamma^{(k+1)}$ using constrOptim.nl.
• Step 4: Further update $\alpha^{(k+1)}$ to obtain $\alpha^{(k+2)}$ by fixing the other parameters at their $(k+1)$-step values and maximizing the likelihood, using nlminb in R.
• Step 5: Repeat Steps 1 to 4 until a convergence criterion is met. Note that the two further updates in Steps 3 and 4, which respect the constraints on $\gamma$, produce better results than skipping these two steps.
Finally, after the algorithm converges at iteration $m$, we use $\beta^{(m)}$, $\gamma^{(m)}$, $\alpha^{(m)}$ and $\eta^{(m)}$ as the estimates of $\beta$, $\gamma$, $\alpha$ and $\eta$, respectively. Variances are estimated using the variance estimators given above.
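A skeleton of the outer loop, assuming a function neg_pen_loglik that returns minus the penalized log-likelihood (6.18) for given blocks, is sketched below; Steps 3-4 are indicated only in comments, and all names are illustrative.

```r
library(alabama)

# Skeleton of the iterative sieve maximisation (Steps 0-5).  q, dfL, dfB1 give
# the block lengths of alpha, gamma and eta.
fit_plsi_po <- function(beta_t, alpha, gamma, eta, neg_pen_loglik,
                        q, dfL, dfB1, max_iter = 50, tol = 1e-6) {
  for (k in seq_len(max_iter)) {
    old <- c(beta_t, alpha, gamma, eta)

    ## Step 1: update beta_tilde subject to 1 - ||beta_tilde||^2 > 0.
    beta_t <- constrOptim.nl(par = beta_t,
                fn  = function(b) neg_pen_loglik(b, alpha, gamma, eta),
                hin = function(b) 1 - sum(b^2))$par

    ## Step 2: update (alpha, gamma, eta) jointly by quasi-Newton (nlminb).
    w <- nlminb(start = c(alpha, gamma, eta),
                objective = function(w)
                  neg_pen_loglik(beta_t, w[1:q],
                                 w[q + 1:dfL], w[q + dfL + 1:dfB1]))$par
    alpha <- w[1:q]; gamma <- w[q + 1:dfL]; eta <- w[q + dfL + 1:dfB1]

    ## Steps 3-4: further constrained update of gamma (gamma_1 <= ... <= gamma_dfL,
    ## e.g. hin = function(g) diff(g) in constrOptim.nl) and of alpha; omitted here.

    if (max(abs(c(beta_t, alpha, gamma, eta) - old)) < tol) break   # Step 5
  }
  list(beta_t = beta_t, alpha = alpha, gamma = gamma, eta = eta)
}
```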

6.4 Simulation Studies

To evaluate the finite-sample performance of the estimators, we conduct a simulation study with current status data from the PLSI-PO model. The failure time $T$ is generated from model (6.3) by the inverse transform sampling method, with a Weibull baseline survival function with scale parameter 1 and shape parameter $\nu_w=2$. Thus $T$ has the form

$$T = \left[-\ln\left\{\frac{U\exp\{\alpha^\top V+\psi(\beta^\top X)\}}{(1-U)+U\exp\{\alpha^\top V+\psi(\beta^\top X)\}}\right\}\right]^{1/\nu_w},$$

where $U\sim\mathrm{Uniform}(0,1)$ and the single-index function is $\psi(\beta_0^\top X)=\sin(\beta_0^\top X)$. Two covariate vectors are considered: a $q=2$ dimensional linear covariate vector $V=(V_1,V_2)^\top$ and a $p=3$ dimensional non-linear covariate vector $X=(X_1,X_2,X_3)^\top$. We set $\alpha_0=(0.5,1)^\top$ and $\beta_0=(2,1,1)^\top/\sqrt{6}$, generate $X_1,X_2,X_3$ from the continuous uniform distribution on $(-4,4)$, and let $V_1\sim\mathrm{Uniform}(1,4)-2.5$ and $V_2\sim\mathrm{Bernoulli}(0.5)-0.5$. The covariates satisfy condition (C3): $E(V_1)=E(V_2)=E\{\psi(\beta_0^\top X)\}=0$. To satisfy the identifiability condition on the link function, we center $\psi(\beta^\top X)$ as $\psi(\beta^\top X_i)-(1/n)\sum_j\psi(\beta^\top X_j)$ for $i=1,\ldots,n$.
The censoring time $C$ is confined to the interval $[a_c,b_c]=[0.01,3.00]$ and generated from a truncated exponential distribution,

$$C = -(1/\lambda_c)\ln\left[\exp(-\lambda_c a_c)-U\{\exp(-\lambda_c a_c)-\exp(-\lambda_c b_c)\}\right],$$

where $\lambda_c=\lambda_{c0}+(0.5)(V_1+V_2)+(0.1)(X_1+X_2+X_3)$ and $\lambda_{c0}=1$. In the kernel estimation of the covariance we use the bandwidth $b=b_f\,n^{-1/5}\,\mathrm{sd}(C)$ computed from the data, where $b_f=1/15$.
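One replicate of the simulated data can be generated as in the following sketch, which transcribes the formulas above; the parameter values are taken as printed in the text and the function name gen_data is illustrative.

```r
# One simulated current-status data set: Weibull(shape 2, scale 1) baseline,
# psi(u) = sin(u), truncated-exponential censoring time on [ac, bc].
gen_data <- function(n, alpha0 = c(0.5, 1), beta0 = c(2, 1, 1) / sqrt(6),
                     nu_w = 2, ac = 0.01, bc = 3, lam0 = 1, bf = 1/15) {
  X  <- matrix(runif(3 * n, -4, 4), n, 3)
  V  <- cbind(runif(n, 1, 4) - 2.5, rbinom(n, 1, 0.5) - 0.5)
  lp <- drop(V %*% alpha0) + sin(drop(X %*% beta0))
  U  <- runif(n)
  Tt <- (-log(U * exp(lp) / ((1 - U) + U * exp(lp))))^(1 / nu_w)
  lc <- lam0 + 0.5 * rowSums(V) + 0.1 * rowSums(X)
  Uc <- runif(n)
  C  <- -(1 / lc) * log(exp(-lc * ac) - Uc * (exp(-lc * ac) - exp(-lc * bc)))
  b  <- bf * n^(-1/5) * sd(C)                   # kernel bandwidth
  list(delta = as.numeric(C <= Tt), C = C, V = V, X = X, b = b)
}
```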
We use cubic B-spline basis functions (order 4) for $\psi(\beta^\top X)$ and quadratic basis functions (order 3) for $H(C)=-\mathrm{logit}\{S_0(C)\}$ to approximate the two unknown curves. The BIC method is applied to find the optimal number of B-spline basis functions, indicated by the degrees of freedom $df$, and the number of interior knots, $K=df-$ (order of the B-spline basis). That is, we choose the value of $(df_L,df_B)$ that locally minimizes the BIC objective function

$$\mathrm{BIC}(df_L,df_B) = -2\ell_O+\ln(n)\{(p-1)+q+df_L+(df_B-1)\},$$

where $\ell_O$ is the log-likelihood function in (6.9). A large BIC value indicates lack of fit. Various forms of BIC have been proposed in the literature and tested for knot selection in semiparametric models (see, for example, He et al. 2002).
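The BIC criterion can be evaluated as in the sketch below; fit_and_loglik, which would refit the model for a candidate (df_L, df_B) and return the maximized log-likelihood (6.9), is a hypothetical helper.

```r
# BIC for choosing (df_L, df_B); loglik_max is the maximised log-likelihood (6.9).
bic_plsi_po <- function(loglik_max, n, p, q, dfL, dfB) {
  -2 * loglik_max + log(n) * ((p - 1) + q + dfL + (dfB - 1))
}

# Example grid search (fit_and_loglik is a hypothetical refitting helper):
# grid <- expand.grid(dfL = 4:7, dfB = 5:8)
# bics <- mapply(function(a, b) bic_plsi_po(fit_and_loglik(a, b), n, p, q, a, b),
#                grid$dfL, grid$dfB)
# grid[which.min(bics), ]
```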
Table 6.1 summarizes the simulation results based on 1000 replications with sample sizes 200, 400 and 800. The biases of the estimates of $\beta=(\beta_1,\beta_2,\beta_3)$ and $\alpha=(\alpha_1,\alpha_2)$ are reasonably small, and the Monte Carlo standard deviations of the estimates, shown as StDev, are very close to the estimated average standard errors, shown as Avg.{SE}.

Table 6.1 (PLSI-PO) Simulation results for estimation of $\beta$ and $\alpha$ using the sieve MLE

   n    Statistic     beta1 = 2/sqrt(6)  beta2 = 1/sqrt(6)  beta3 = 1/sqrt(6)  alpha1 = 0.5  alpha2 = 1.0
  200   Bias               0.0272             0.0608             0.0896           0.0222        0.1227
        StDev              0.1530             0.2438             0.2687           0.3968        0.6117
        Avg.{SE}           0.1638             0.2360             0.2364           0.4827        0.7811
        Cov. prob.         0.9327             0.9328             0.9267           0.9754        0.9769
  400   Bias               0.0103             0.0029             0.0050           0.0125        0.0438
        StDev              0.0614             0.0967             0.1006           0.2206        0.3735
        Avg.{SE}           0.0658             0.1028             0.1013           0.2619        0.4291
        Cov. prob.         0.9585             0.9351             0.9347           0.9704        0.9621
  800   Bias               0.0035             0.0013             0.0012           0.0084        0.0304
        StDev              0.0410             0.0641             0.0635           0.1381        0.2374
        Avg.{SE}           0.0416             0.0660             0.0687           0.1608        0.2647
        Cov. prob.         0.9593             0.9438             0.9385           0.9687        0.9643

Bias: empirical bias; StDev: sample empirical standard deviation; Avg.{SE}: estimated average standard error; Cov. prob.: empirical coverage probability of the 95 % confidence interval

Fig. 6.1 (PLSI-PO) True and estimated curves. Left: the estimated $\psi$ curves; Middle: the estimated curves for $-\mathrm{logit}\{S_0\}$; Right: the estimated $\Lambda_0$ curves; corresponding to sample sizes n = 200, 400, 800. Solid lines show the true curves, dashed lines the estimated curves, and dotted lines the 95 % point-wise confidence bands based on the Monte Carlo results

The Monte Carlo coverage probabilities of the 95 % confidence intervals, shown as Cov. prob., are very close to the nominal level, especially for the larger sample sizes.
The plots in Fig. 6.1 show the estimated nuisance functions $\psi(\cdot)$, $H(\cdot)=\mathrm{logit}\{1-\exp(-\Lambda_0(\cdot))\}=-\mathrm{logit}\{S_0(t)\}$ and $\Lambda_0(\cdot)$. The fitted curves match the true functions closely, indicating good performance of the proposed method. Figures 6.2 and 6.3 show histograms of the estimated values of $\beta$ and $\alpha$, respectively, which are close to normal probability density curves.
Fig. 6.2 (PLSI-PO) Histograms of the estimated $\beta$ values: $\hat\beta_1$ (left), $\hat\beta_2$ (middle) and $\hat\beta_3$ (right), for sample sizes n = 200, 400, 800

Fig. 6.3 (PLSI-PO) Histograms of the estimated $\alpha$ values: $\hat\alpha_1$ (left) and $\hat\alpha_2$ (right), for sample sizes n = 200, 400, 800

6.5 Real Data Analysis

In this section we apply the PLSI-PO model to an acute kidney injury (AKI) data set. AKI is a kidney disease syndrome with substantial impact on both short- and long-term clinical outcomes. A study was conducted at the University of Michigan Hospital on 170 hospitalized adult patients with AKI to identify risk factors associated with renal recovery in those who required renal replacement therapy (RRT); the aim is to help clinicians develop strategies to prevent non-recovery and improve patients' quality of life. Data collection included patient characteristics, laboratory data, details of the hospital course and the degree of fluid overload at RRT initiation. For each patient, the time of the inception of dialysis was recorded along with the time of hospital discharge, which may be regarded as a monitoring time. The investigators only observed a patient's current status of renal recovery at discharge but did not know exactly when renal function recovery occurred. More details about the study background and preliminary findings can be found in Heung et al. (2012) and Lu and Song (2015).
To assess the relationship between the hazard of renal recovery and the clinical factors, let $T$ be the failure time, the number of days from the start of dialysis to the date of renal recovery, and let $C$ be the monitoring time, the time of hospital discharge. The two nonlinear covariates are baseline serum creatinine (BScr) and age (Age), and the linear ones are use of vasopressor (VP) and gender (Gender); VP is coded as 1 for yes and 0 for no, and Gender is coded as 1 for male and 0 for female. Let $V=(V_1=\mathrm{VP},\ V_2=\mathrm{Gender})^\top$ be the linear covariate vector and $X=(X_1=\mathrm{BScr},\ X_2=\mathrm{Age})^\top$ the non-linear covariate vector. For the $i$th observation, $i=1,\ldots,n$, we standardize $X_{pi}$ as $\{X_{pi}-\min_i(X_{pi})\}/\{\max_i(X_{pi})-\min_i(X_{pi})\}$ for $p=1,2$, so that the support of $X_{pi}$ is $[0,1]$. Let $\delta=I(C\le T)$ be the indicator of renal recovery status at discharge, with $\delta=0$ meaning recovered and $\delta=1$ meaning not recovered at the time of hospital discharge, and let $\lambda(t;V,X)$ be the hazard function of the recovery time $T$. We apply the PLSI-PO model to relate these four covariates to the survival odds and, consequently, to the hazard function of $T$. The estimated parameter values of the fitted model are $\hat\alpha_1=1.3873$, $\hat\alpha_2=0.0375$, $\hat\beta_1=0.6314$ and $\hat\beta_2=0.7754$, with estimated standard errors $0.5909$, $0.5803$, $0.0953$ and $0.0776$, respectively. The Z-test statistics for $\alpha_1$ and $\alpha_2$ are $2.348$ and $0.0647$, with p-values $0.019$ and $0.948$, implying that the use of VP has an effect on the survival log-odds of renal recovery but that Gender is not significant in the PLSI-PO model.
The negative log-odds of the baseline survival function, the baseline cumulative hazard function and the link function of the single-index term are fitted by B-spline approximation with 5 and 7 knots, respectively, chosen by the BIC.
Figure 6.4 shows the estimated curves for $-\mathrm{logit}\{S_0(C)\}$ and $\psi(\beta^\top X)$. The curve for $\psi(\cdot)$ shows that its effect on the log-odds, and hence on the hazard, is mostly decreasing, specifically for values of the single-index less than $0.2$ and greater

Fig. 6.4 (PLSI-PO) Left: the estimated nonparametric function $\psi(\cdot)$ with 95 % pointwise confidence intervals; the solid line is the estimated function, the dashed line is the identity function, and the dotted lines are the 95 % pointwise confidence intervals. Middle: the estimated $-\mathrm{logit}\{S_0(\cdot)\}$ function. Right: the estimated cumulative hazard function $\Lambda_0(\cdot)$

than $0.8$, with a mild fluctuation in between. The 95 % point-wise confidence bands are wider before the index value of $0.8$. The identity link function lies outside the confidence band, which indicates that the linear PO model is not appropriate for this data set. The negative log-odds of the unknown baseline survival function and the baseline cumulative hazard function both show an increasing nonlinear trend.

6.6 Concluding Remarks

In this chapter we establish an efficient estimation method for the PLSI-PO model with current status data. The model can handle high-dimensional nonparametric covariate effects in predicting the survival odds of the failure time, and this partially linear formulation is more practical for current status data than models with only a single linear term of covariates or just a few nonlinear covariates. We use B-splines to approximate the link function of the single-index term and the negative logit of the baseline survival function; the splines for the negative logit of the baseline survival function, viewed as a function of the cumulative hazard, are restricted to monotone polynomial splines. By maximizing the log-likelihood function over the spline-spanned sieve spaces, we estimate the unspecified negative logit of the baseline survival function, the single-index link function, the orientation parameter and the parametric vector of regression coefficients. Under the assumption that the true nonparametric link function is a spline function, we show that the estimators of the regression coefficient vector and of the orientation parameter vector of the single-index term are semiparametrically efficient, by applying the theory of counting processes, martingales and empirical processes; the use of martingale theory is a new approach to the analysis of current status data through this model. To demonstrate the efficacy of the proposed model and the estimation algorithm, we present a simulation study and apply the model to a real clinical data set.

References

Bennett S (1983) Analysis of survival data by the proportional odds model. Stat Med 2:273–277
Dinse GE, Lagakos SW (1983) Regression analysis of tumour prevalence data. Appl Stat 32:236–
248
Gray RJ (1992) Flexible methods for analyzing survival data using splines, with applications to
breast cancer prognosis. J Am Stat Assoc 87:942–951
He X, Zhu ZY, Fung WK (2002) Estimation in a semiparametric model for longitudinal data with
unspecified dependence structure. Biometrika 89:579–590
Hedeker D, Mermelstein RJ (2011) Multilevel analysis of ordinal outcomes related to survival
data. In: Hox JJ, Roberts KJ (eds) Handbook of advanced multilevel analysis. Taylor & Francis
Group, New York, pp 115–136
Heung M, Wolfgram DF, Kommareddi M, Hu Y, Song PX-K, Ojo AO (2012) Fluid overload at
initiation of renal replacement therapy is associated with lack of renal recovery in patients with
acute kidney injury. Nephrol Dial Transpl 27:956–961
Huang J (1995) Maximum likelihood estimation for proportional odds regression model with
current status data. Lect Notes Monogr Ser 27:129–145
Huang J (1996) Efficient estimation for the proportional hazards model with interval censoring.
Ann Stat 24:540–568
Huang J, Rossini AJ (1997) Sieve estimation for the proportional-odds failure-time regression
model with interval censoring. J Am Stat Assoc 92:960–967
Huang JZ, Liu L (2006) Polynomial spline estimation and inference of proportional hazards
regression models with flexible relative risk form. Biometrics 62:793–802
Lu W, Zhang HH (2010) On estimation of partially linear transformation models. J Am Stat Assoc
105:683–691
Lu X, Song PX-K (2015) Efficient estimation of the partly linear additive hazards model with
current status data. Scand J Stat 42:306–328
Ma S, Kosorok MR (2005) Penalized log-likelihood estimation for partly linear transformation
models with current status data. Ann Stat 33:2256–2290
Martinussen T, Scheike TH (2002) Efficient estimation in additive hazards regression with current
status data. Biometrika 89:649–658
McCullagh P (1980) Regression models for ordinal data. J R Stat Soc Ser B Stat Methodol 42:109–
142
McMahan CS, Wang L, Tebbs JM (2013) Regression analysis for current status data using the EM
algorithm. Stat Med 32:4452–4466
Pettitt AN (1982) Inference for the linear model using a likelihood based on ranks. J R Stat Soc
Ser B Methodol 44:234–243
Rabinowitz D, Betensky RA, Tsiatis AA (2000) Using conditional logistic regression to fit
proportional odds models to interval censored data. Biometrics 56:511–518
Rossini AJ, Tsiatis AA (1996) A semiparametric proportional odds regression model for the
analysis of current status data. J Am Stat Assoc 91:713–721
Shen X (1998) Proportional odds regression and sieve maximum likelihood estimation. Biometrika
85:165–177
Sun J, Kopciuk KA, Lu X (2008) Polynomial spline estimation of partially linear single-index
proportional hazards regression models. Comput Stat Data Anal 53:176–188
Wang L, Dunson DB (2011) Semiparametric Bayes’ proportional odds models for current status
data with underreporting. Biometrics 67:1111–1118
Chapter 7
Empirical Likelihood Inference Under Density
Ratio Models Based on Type I Censored
Samples: Hypothesis Testing and Quantile
Estimation

Song Cai and Jiahua Chen

Abstract We present a general empirical likelihood inference framework for Type


I censored multiple samples. Based on this framework, we develop an effective
empirical likelihood ratio test and efficient distribution function and quantile
estimation methods for Type I censored samples. In particular, we pool information
across multiple samples through a semiparametric density ratio model and propose
an empirical likelihood approach to data analysis. This approach achieves high
efficiency without making risky model assumptions. The maximum empirical
likelihood estimator is found to be asymptotically normal. The corresponding
empirical likelihood ratio is shown to have a simple chi-square limiting distribution
under the null model of a composite hypothesis about the DRM parameters. The
power of the EL ratio test is also derived under a class of local alternative models.
Distribution function and quantile estimators based on this framework are developed
and are shown to be more efficient than the empirical estimators based on single
samples. Our approach also permits consistent estimations of distribution functions
and quantiles over a broader range than would otherwise be possible. Simulation
studies suggest that the proposed distribution function and quantile estimators are
more efficient than the classical empirical estimators, and are robust to outliers and
misspecification of density ratio functions. Simulations also show that the proposed
EL ratio test has superior power compared to some semiparametric competitors
under a wide range of population distribution settings.

S. Cai ()
School of Mathematics and Statistics, Carleton University, Ottawa, ON, Canada
e-mail: [email protected]
J. Chen
Big Data Research Institute of Yunnan University and Department of Statistics, University
of British Columbia, Vancouver, BC, Canada
e-mail: [email protected]


7.1 Introduction

Type I censored observations are often encountered in reliability engineering and


medical studies. In a research project on long-term monitoring of lumber quality,
lumber strength data have been collected from mills across Canada over a period
of years. The strength-testing machines were set to prefixed tension levels. Those
pieces of lumber that are not broken in the test yield Type I right-censored
observations. A primary task of the project is to monitor the quality index based
on lower quantiles, such as the 5 % quantile, as the years pass. Another important
task is to detect the possible change in the overall quality of lumber over time.
The statistical nature of the first task is quantile estimation, and that of the second
is testing for difference among distribution functions of different populations. We
are hence motivated to develop effective quantile estimation and hypothesis testing
methods based on multiple Type I censored samples. To achieve high efficiency
without restrictive model assumptions, we pool information in multiple samples via
a semiparametric density ratio model (DRM) and propose an empirical likelihood
(EL) approach to data analysis.
Suppose we have Type I censored samples from $m+1$ populations with cumulative distribution functions (CDFs) $F_k$, $k=0,1,\ldots,m$. Particularly for our target application, it is reasonable to assume that these CDFs satisfy the relationship

$$dF_k(x) = \exp\{\alpha_k+\beta_k^\top q(x)\}\,dF_0(x) \qquad (7.1)$$

for a pre-specified $d$-dimensional basis function $q(x)$ and model parameters $\theta_k=(\alpha_k,\beta_k^\top)^\top$. The baseline distribution $F_0(x)$ in the above model is left unspecified. Due to symmetry, the role of $F_0$ is equivalent to that of any of the $F_k$. When the data are subject to Type I censoring, it is best to choose the population with the largest censoring point as the baseline.
The DRM (7.1) has several advantages. First, it pools information across the different samples through a link between the population distributions without a restrictive parametric model assumption; highly efficient statistical data analysis methods are therefore possible compared with methods based on models without such links. Second, the DRM is semiparametric and hence very flexible: for example, every exponential family of distributions is a special case of this DRM. In life-data modeling, the moment-parameter family and the Laplace-transform-parameter family of distributions (Marshall and Olkin 2007, Sects. 7.H and 7.I) both satisfy the DRM, and the commonly used logistic regression model under case-control sampling is also closely related to the DRM (Qin and Zhang 1997). This flexibility makes the DRM resistant to misspecification of $q(x)$. Last but not least, standard nonparametric methods permit sensible inference about $F_k(x)$ only for $x\le c_k$, where $c_k$ is the censoring cut-off point of the $k$th sample; the proposed EL method based on the DRM extends this range to $x\le\max_k\{c_k\}$.
Data analysis under the DRM (7.1) based on fully observed data has attracted much attention. Qin (1998) introduced EL-based inference under a two-sample setting; Zhang (2002) subsequently developed a goodness-of-fit test; Fokianos (2004) studied the corresponding density estimation; quantile estimation was investigated by Chen and Liu (2013); and in Cai et al. (2016) a dual EL ratio test was developed for testing composite hypotheses about DRM parameters. Research on the DRM based on Type I censored data has been scarce. Under a two-sample setting with equal Type I censoring points, Wang et al. (2011) studied the properties of parameter estimation based on the full empirical likelihood. However, without building a link between the EL and its dual, as we do in this chapter, both the numerical solution and the analytical properties of the quantile estimation are technically challenging.
In this chapter, we first establish a general EL inference framework for Type
I censored multiple samples under the DRM. Based on this framework, we then
develop an effective EL ratio test and efficient CDF and quantile estimation
methods. Instead of a direct employment of the full EL function, we develop a dual
partial empirical likelihood function (DPEL). The DPEL is equivalent to the full
EL function for most inference purposes. However, unlike EL, it is concave and
therefore allows simple numerical solutions as well as facilitating deeper analytical
investigations of the resulting statistical methods. Using the DPEL, we show that
the maximum EL estimator of the $\theta_k$ is asymptotically normal. We also show
that the corresponding EL ratio has a chi-square and a non-central chi-square
limiting distribution under the null model and local alternative model of a composite
hypothesis about the DRM parameters, respectively. We further construct CDF and
quantile estimators based on DPEL, and show that they have high efficiency and nice
asymptotic properties. The DPEL-based approach is readily extended to address
other inference problems, such as EL density estimation and goodness-of-fit test.
The chapter is organized as follows. In Sect. 7.2, we work out the EL function
for the DRM based on Type I censored data, introduce the maximum EL estimators,
and study their properties. Section 7.3 presents the theory of EL ratio test under the
DRM. Sections 7.4 and 7.5 study the EL CDF and quantile estimations. Section 7.6
provides numerical solutions. Simulation results are reported in Sects. 7.7 and 7.8.
Section 7.9 illustrates the use of the proposed methods with real lumber quality data.

7.2 Empirical Likelihood Based on Type I Censored Observations

Consider the case where $n_k$ sample units are drawn from the $k$th population, of which $n_k-\tilde n_k$ are right-censored at $c_k$, with $\tilde n_k$ being the number of uncensored observations. Without loss of generality, we assume $c_0\ge c_k$ for all $k$. Denote the uncensored observations by $x_{kj}$ for $j=1,\ldots,\tilde n_k$, and write $dF_k(x)=F_k(x)-F_k(x^-)$. Based on the principle of empirical likelihood of Owen (2001), the
EL is defined to be

$$L_n(\{F_k\}) = \prod_{k=0}^m\left[\prod_{j=1}^{\tilde n_k}dF_k(x_{kj})\right]\{1-F_k(c_k)\}^{n_k-\tilde n_k}.$$

Under the DRM assumption (7.1), this EL can be written as

$$L_n(\{F_k\}) = \left\{\prod_{k=0}^m\prod_{j=1}^{\tilde n_k}dF_0(x_{kj})\right\}\left\{\prod_{k=0}^m\prod_{j=1}^{\tilde n_k}\exp\{\theta_k^\top Q(x_{kj})\}\right\}\left\{\prod_{k=0}^m\{1-F_k(c_k)\}^{n_k-\tilde n_k}\right\}, \qquad (7.2)$$

with $Q(x)=(1,q(x)^\top)^\top$ and $\theta_k=(\alpha_k,\beta_k^\top)^\top$.
Denote ˛ D .˛1 ; : : : ; ˛m / , ˇ D .ˇ 1 ; : : : ; ˇ m / , and  D .˛ ; ˇ / . For
convenience, we set ˛0 D 0 and ˇ 0 D 0. Further, let pkj D dF0 .xkj /, p D f pkj g,
&k D Fk .ck /, and & D f&k g. Finally, introduce the new notation
˚ |

'k .; x; c/ D exp  k Q.x/ 1.x  c/:

The EL is seen to be a function of , p, and &, and we will denote it Ln .; p; &/.
The maximum EL estimators (MELEs) are now defined to be
 X nQ k
m X

O pO ; &O Dargmax Ln .; p; &/ W
; pkj 'r .; xkj ; cr / D &r ;
; p; & kD0 jD1

pkj  0; 0 < &r  1; r D 0; : : : ; m : (7.3)

The constraints for (7.3) are given by a equality implied by the DRM assump-
tion (7.1) as follows:
Z Z
˚ |

&k D Fk .ck / D exp  k Q.x/ 1.x  ck /dF0 .x/ D 'k .; x; ck /dF0 .x/

for k D 0; 1; : : : ; m.

7.2.1 MELE and the Dual PEL Function

The MELE of $\zeta_k$ is $\hat\zeta_k=\tilde n_k/n_k$, a useful fact for a simple numerical solution to (7.3). To see this, factorize the EL (7.2) as

$$L_n(\theta,p,\zeta) = \left\{\prod_{k=0}^m\prod_{j=1}^{\tilde n_k}(p_{kj}/\zeta_0)\right\}\left\{\prod_{k=1}^m\prod_{j=1}^{\tilde n_k}(\zeta_0/\zeta_k)\varphi_k(\theta,x_{kj},c_k)\right\}\left\{\prod_{k=0}^m\zeta_k^{\tilde n_k}(1-\zeta_k)^{n_k-\tilde n_k}\right\} \qquad (7.4)$$
$$= PL_n(\theta,p,\zeta)\times L_n(\zeta). \qquad (7.5)$$

We call $PL_n(\theta,p,\zeta)$ the partial empirical likelihood (PEL) function. Under the constraints specified in (7.3), $\sup_{\theta,p}PL_n(\theta,p,\zeta)$ is constant in $\zeta$, as follows.

Proposition 1 Let $\tilde\zeta$ and $\breve\zeta$ be two values of the parameter $\zeta$ that satisfy the constraints in (7.3). Then

$$\sup_{\theta,p}PL_n(\theta,p,\tilde\zeta) = \sup_{\theta,p}PL_n(\theta,p,\breve\zeta).$$

Proof Suppose $\tilde p$ and $\tilde\theta$ form a solution to $\sup_{\theta,p}PL_n(\theta,p,\tilde\zeta)$, namely

$$PL_n(\tilde\theta,\tilde p,\tilde\zeta) = \sup_{\theta,p}PL_n(\theta,p,\tilde\zeta).$$

Let $\breve p$ and $\breve\theta$ be defined by

$$\breve p_{kj} = \tilde p_{kj}(\breve\zeta_0/\tilde\zeta_0), \qquad \breve\alpha_k = \tilde\alpha_k+\log(\tilde\zeta_0/\breve\zeta_0)-\log(\tilde\zeta_k/\breve\zeta_k).$$

It is easily verified that

$$PL_n(\breve\theta,\breve p,\breve\zeta) = PL_n(\tilde\theta,\tilde p,\tilde\zeta).$$

Hence we must have

$$\sup_{\theta,p}PL_n(\theta,p,\tilde\zeta) \le \sup_{\theta,p}PL_n(\theta,p,\breve\zeta).$$

Clearly, the reverse inequality also holds, and the proposition follows.

This proposition implies that $\hat\zeta=\mathop{\mathrm{argmax}}L_n(\zeta)$ and therefore that $\hat\zeta_k=\tilde n_k/n_k$. It further implies that $(\hat\theta,\hat p)=\mathop{\mathrm{argmax}}PL_n(\theta,p,\hat\zeta)$ under the same set of constraints. Because of this, we can compute $(\hat\theta,\hat p)$ with standard EL tools: we first obtain the profile function $\tilde\ell_n(\theta)=\sup_p\log\{PL_n(\theta,p,\hat\zeta)\}$ and then compute $\hat\theta=\mathop{\mathrm{argmax}}\tilde\ell_n(\theta)$.
More specifically, let $\tilde n=\sum_{k=0}^m\tilde n_k$ be the total number of uncensored observations. By the method of Lagrange multipliers, for each given $\theta$, the solution of $\sup_p PL_n(\theta,p,\hat\zeta)$ in $p$ is given by

$$p_{kj} = \tilde n^{-1}\left\{1/\hat\zeta_0+\sum_{r=1}^m\lambda_r\left(\varphi_r(\theta,x_{kj},c_r)-\hat\zeta_r/\hat\zeta_0\right)\right\}^{-1}, \qquad (7.6)$$

where the Lagrange multipliers $\{\lambda_k\}_{k=1}^m$ solve $\sum_{k=0}^m\sum_{j=1}^{\tilde n_k}p_{kj}\varphi_r(\theta,x_{kj},c_r)=\hat\zeta_r$ for $r=0,1,\ldots,m$. The resulting profile log-PEL is

$$\tilde\ell_n(\theta) = -\sum_{k=0}^m\sum_{j=1}^{\tilde n_k}\log\left[\tilde n\left\{1/\hat\zeta_0+\sum_{r=1}^m\lambda_r\left(\varphi_r(\theta,x_{kj},c_r)-\hat\zeta_r/\hat\zeta_0\right)\right\}\right]+\sum_{k=1}^m\sum_{j=1}^{\tilde n_k}\theta_k^\top Q(x_{kj}). \qquad (7.7)$$

At its maximum, $\theta=\hat\theta$, the profile log-PEL satisfies $\partial\tilde\ell_n(\theta)/\partial\alpha_k=0$ for $k=1,\ldots,m$. Some simple algebra then shows that, at $\theta=\hat\theta$, the corresponding Lagrange multipliers are $\hat\lambda_k=n_k/\tilde n$.

Put $\hat\rho_k=n_k/n$ and recall that $\hat\zeta_r=\tilde n_r/n_r$. We find that

$$\tilde\ell_n(\hat\theta) = -\sum_{k=0}^m\sum_{j=1}^{\tilde n_k}\log\left\{n\sum_{r=0}^m\hat\rho_r\varphi_r(\hat\theta,x_{kj},c_r)\right\}+\sum_{k=1}^m\sum_{j=1}^{\tilde n_k}\hat\theta_k^\top Q(x_{kj}).$$

For this reason, we define the dual PEL (DPEL) function

$$\ell_n(\theta) = -\sum_{k=0}^m\sum_{j=1}^{\tilde n_k}\log\left\{\sum_{r=0}^m\hat\rho_r\varphi_r(\theta,x_{kj},c_r)\right\}+\sum_{k=1}^m\sum_{j=1}^{\tilde n_k}\theta_k^\top Q(x_{kj}). \qquad (7.8)$$

Clearly, $\hat\theta=\mathop{\mathrm{argmax}}_\theta\ell_n(\theta)$, and the DPEL is a concave function of $\theta$. The relationship also implies that

$$\hat p_{kj} = n^{-1}\left\{\sum_{r=0}^m\hat\rho_r\varphi_r(\hat\theta,x_{kj},c_r)\right\}^{-1}. \qquad (7.9)$$

The DPEL is analytically simple, facilitating deeper theoretical investigation and simplifying the numerical problem associated with the data analysis.
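To illustrate how the concavity of (7.8) simplifies computation, the following R sketch maximizes the DPEL for $m+1$ Type I censored samples and recovers the baseline jumps (7.9); the data layout (a list xs of uncensored observations, full sample sizes n_full, censoring points cc, basis function qfun) and all object names are assumptions for illustration.

```r
# Minimal sketch: maximise the dual PEL (7.8) under the DRM with Type I
# censoring, then recover the fitted baseline jumps (7.9).
dpel_fit <- function(xs, n_full, cc, qfun) {
  m   <- length(xs) - 1
  n   <- sum(n_full)
  rho <- n_full / n
  xy  <- unlist(xs)
  kid <- rep(seq_along(xs) - 1L, lengths(xs))     # sample label of each obs.
  Q   <- cbind(1, qfun(xy))                       # rows are (1, q(x)')
  d   <- ncol(Q)

  neg_dpel <- function(par) {
    th  <- rbind(0, matrix(par, nrow = m, byrow = TRUE))  # theta_0 = 0
    lin <- Q %*% t(th)                                    # theta_k' Q(x_i)
    phi <- exp(lin) * outer(xy, cc, "<=")                 # phi_k(theta, x_i, c_k)
    mix <- drop(phi %*% rho)                              # sum_r rho_r phi_r(.)
    -(-sum(log(mix)) + sum(lin[cbind(seq_along(xy), kid + 1L)]))
  }

  opt <- optim(rep(0, m * d), neg_dpel, method = "BFGS")
  th  <- rbind(0, matrix(opt$par, nrow = m, byrow = TRUE))
  mix <- drop((exp(Q %*% t(th)) * outer(xy, cc, "<=")) %*% rho)
  list(theta = th[-1, , drop = FALSE],                    # (alpha_k, beta_k')
       p = 1 / (n * mix),                                 # baseline jumps (7.9)
       dpel = -opt$value)
}
```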

7.2.2 Asymptotic Properties of the MELE $\hat\theta$

Let $\theta^*$ be the true value of the parameter $\theta$. Suppose that, for some constants $\rho_k\in(0,1)$, $\hat\rho_k=n_k/n\to\rho_k$ as $n\to\infty$, for $k=0,\ldots,m$. Define the partial empirical information matrix $U_n=-n^{-1}\partial^2\ell_n(\theta^*)/\partial\theta\partial\theta^\top$. By the strong law of large numbers, $U_n$ converges almost surely to a matrix $U$, because it is an average over several independent and identically distributed (i.i.d.) samples. The limit $U$ serves as a partial information matrix.
Define

$$h(\theta,x) = \left(\rho_1\varphi_1(\theta,x,c_1),\ldots,\rho_m\varphi_m(\theta,x,c_m)\right)^\top,$$
$$s(\theta,x) = \sum_{k=0}^m\rho_k\varphi_k(\theta,x,c_k),$$
$$H(\theta,x) = \mathrm{diag}\{h(\theta,x)\}-h(\theta,x)h^\top(\theta,x)/s(\theta,x).$$

We partition the entries of $U$ in agreement with $\alpha$ and $\beta$ and denote the blocks by $U_{\alpha\alpha}$, $U_{\alpha\beta}$, $U_{\beta\alpha}$, and $U_{\beta\beta}$. The blockwise algebraic expressions of $U$ are

$$U_{\alpha\alpha} = E_0\{H(\theta^*,X)\},$$
$$U_{\beta\beta} = E_0\{H(\theta^*,X)\otimes q(X)q^\top(X)\},$$
$$U_{\alpha\beta} = U_{\beta\alpha}^\top = E_0\{H(\theta^*,X)\otimes q^\top(X)\},$$

where $E_0(\cdot)$ is the expectation operator with respect to $F_0(x)$ and $\otimes$ is the Kronecker product. These blockwise expressions reveal that $U$ is positive definite when $\int Q(x)Q^\top(x)\,dF_0(x)>0$.
We found that the MELE $\hat\theta$ is asymptotically normal, as summarized as follows.

Theorem 1 Suppose we have $m+1$ Type I censored random samples with censoring cutting points $c_k$, $k = 0, \ldots, m$, from populations with distributions satisfying the DRM assumption (7.1) with a true parameter value $\theta^*$ such that $\int \exp\{\beta_k^{\top}q(x)\}\,dF_0(x) < \infty$ for all $\theta$ in a neighborhood of $\theta^*$. Also, suppose $\int Q(x)Q^{\top}(x)\,dF_0(x) > 0$ and $\hat\rho_k = n_k/n \to \rho_k$ as $n \to \infty$ for some constants $\rho_k \in (0, 1)$. Then, as $n \to \infty$,
$$
\sqrt{n}\,(\hat\theta - \theta^*) \xrightarrow{d} N\big(0,\; U^{-1} - W\big),
$$
where
$$
W = \begin{pmatrix} T_{m\times m} & 0 \\ 0 & 0_{md\times md} \end{pmatrix}
\quad\text{with}\quad
T = \begin{pmatrix}
\rho_0^{-1}+\rho_1^{-1} & \rho_0^{-1} & \cdots & \rho_0^{-1} \\
\rho_0^{-1} & \rho_0^{-1}+\rho_2^{-1} & \cdots & \rho_0^{-1} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_0^{-1} & \rho_0^{-1} & \cdots & \rho_0^{-1}+\rho_m^{-1}
\end{pmatrix}.
$$
The asymptotic normality of $\hat\theta$ forms the basis for developing the EL–DRM hypothesis testing and estimation methods in the sequel.

7.3 EL Ratio Test for Composite Hypotheses About $\beta$

An appealing property of the classical EL inference for single samples is that the EL ratio has a simple chi-square limiting distribution (Owen 2001). For uncensored multiple samples satisfying the DRM assumption (7.1), the dual EL (DEL) ratio was also shown by Cai et al. (2016) to have an asymptotic chi-square distribution under the null hypothesis for a large class of composite hypothesis testing problems. This nice property of the EL ratio also carries over to the case of Type I censored multiple samples, as we show in this section.

As noted in the Introduction, a primary interest in our lumber project is to check whether the overall quality of the lumber changes over time. This amounts to testing the difference among the underlying distributions $\{F_k\}_{k=0}^{m}$ of random lumber samples collected from different years, i.e. testing $H_0: F_0 = \cdots = F_m$ against $H_a: F_i \ne F_j$ for some $i \ne j$. When the $\{F_k\}$ satisfy the DRM assumption (7.1), this hypothesis testing problem is equivalent to a hypothesis testing problem about the DRM parameter $\beta$: $H_0: \beta = 0$ against $H_a: \beta \ne 0$. Note that the parameter $\alpha$ is not included because it is just a normalizing constant, and $\beta_k = 0$ implies $\alpha_k = 0$.

Here we consider a more general composite hypothesis testing problem about $\beta$,
$$
H_0: g(\beta) = 0 \quad\text{against}\quad H_1: g(\beta) \ne 0,
\qquad (7.10)
$$
for some smooth function $g: \mathbb{R}^{md} \to \mathbb{R}^{q}$, with $q \le md$, the length of $\beta$. We assume that $g$ is thrice differentiable with a full rank Jacobian matrix $\partial g/\partial\beta$. The parameters $\{\alpha_k\}$ are usually not a part of the hypothesis, because their values are fully determined by the $\{\beta_k\}$ and $F_0$ under the DRM assumption.
We propose an EL ratio test for the above hypothesis. Let $(\tilde\theta, \tilde p, \tilde\zeta)$ be the MELE based on Type I censored samples under the null constraint $g(\beta) = 0$. Define the EL ratio statistic as
$$
R_n = 2\big\{\log L_n(\hat\theta, \hat p, \hat\zeta) - \log L_n(\tilde\theta, \tilde p, \tilde\zeta)\big\}.
$$
The following lemma shows that $R_n$ equals the DPEL ratio, a quantity that enjoys a much simpler analytical expression.

Lemma 1 The EL ratio statistic $R_n$ equals the DPEL ratio statistic, i.e.
$$
R_n = 2\big\{\ell_n(\hat\theta) - \ell_n(\tilde\theta)\big\},
$$
where $\ell_n(\theta)$ is the DPEL function (7.8).

Note that, except for the additional indicator terms, the expression of the DPEL is identical to that of the DEL function defined in Cai et al. (2016) for uncensored samples under the DRM. Hence the techniques for establishing the asymptotic properties of the DEL ratio in Cai et al. (2016) can be readily adapted here to prove our

next two theorems (Theorems 2 and 3) about the asymptotic properties of the EL ratio $R_n$ based on Type I censored samples.
Let $\chi^2_q$ denote a chi-square distribution with $q$ degrees of freedom, and $\chi^2_q(\delta^2)$ denote a non-central chi-square distribution with $q$ degrees of freedom and non-centrality parameter $\delta^2$. Partition the Jacobian matrix of $g(\beta)$ evaluated at $\beta^*$, $\nabla = \partial g(\beta^*)/\partial\beta$, into $(\nabla_1, \nabla_2)$, with $q$ and $md - q$ columns respectively. Without loss of generality, we assume that $\nabla_1$ has full rank. Let $I_k$ be an identity matrix of size $k\times k$ and $J = \big((-\nabla_1^{-1}\nabla_2)^{\top}, I_{md-q}\big)^{\top}$.

Theorem 2 Adopt the conditions postulated in Theorem 1. Then, under the null hypothesis $H_0: g(\beta) = 0$ of (7.10), we have
$$
R_n \xrightarrow{d} \chi^2_q
$$
as $n \to \infty$.
Theorem 2 is most useful for constructing an EL ratio test for the composite hypothesis testing problem (7.10). In addition, it can be used to construct a confidence region for the true DRM parameter $\beta^*$. Taking the null hypothesis to be $g(\beta) = \beta - \beta^* = 0$ for any given $\beta^*$, we have $R_n \xrightarrow{d} \chi^2_{md}$, and this result can be used to construct a chi-square confidence region for $\beta^*$. The advantage of a chi-square confidence region over a normal region is extensively discussed in Owen (2001), so we do not elaborate further.
We shall focus on the hypothesis testing problem since that is our primary goal
in application. The next theorem gives the limiting distribution of the EL ratio under
a class of local alternatives. It can be used to approximate the power of the EL ratio
test and to calculate the required sample size for achieving a certain power.
Theorem 3 Adopt the conditions postulated in Theorem 1. Let $\{\beta_k^{0}\}_{k=1}^{m}$ be a set of DRM parameter values that satisfy the null hypothesis of (7.10). Then, under the local alternative model
$$
\beta_k = \beta_k^{0} + n_k^{-1/2} c_k, \qquad k = 1, \ldots, m,
\qquad (7.11)
$$
where the $\{c_k\}$ are some constants, we have
$$
R_n \xrightarrow{d} \chi^2_q(\delta^2)
$$
as $n \to \infty$. The expression of the non-centrality parameter $\delta^2$ is given by
$$
\delta^2 =
\begin{cases}
\Delta^{\top}\big\{\tilde Q - \tilde Q J (J^{\top}\tilde Q J)^{-1} J^{\top}\tilde Q\big\}\Delta & \text{if } q < md, \\
\Delta^{\top}\tilde Q\,\Delta & \text{if } q = md,
\end{cases}
$$
where $\tilde Q = U_{\beta\beta} - U_{\beta\alpha}U_{\alpha\alpha}^{-1}U_{\alpha\beta}$ and $\Delta = \big(\rho_1^{-1/2}c_1^{\top}, \rho_2^{-1/2}c_2^{\top}, \ldots, \rho_m^{-1/2}c_m^{\top}\big)^{\top}$. Moreover, $\delta^2 > 0$ unless $\Delta$ is in the column space of $J$.
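As a quick illustration of how Theorem 3 is used for power or sample-size calculations, the following lines approximate the power of a level-0.05 EL ratio test from a postulated non-centrality parameter; the values of q and delta2 below are placeholders, not quantities from the chapter.

```r
## Approximate power of the EL ratio test at level 0.05 (Theorem 3):
## power ~ P{chi^2_q(delta^2) > chi^2_{q, 0.95}}.
q      <- 3                                  # dimension of g(beta), a placeholder
delta2 <- 8                                  # hypothetical non-centrality parameter
crit   <- qchisq(0.95, df = q)               # level-0.05 critical value
power  <- pchisq(crit, df = q, ncp = delta2, lower.tail = FALSE)
power
```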

In many situations, a hypothesis of interest may focus on characteristics of just a subset of the populations $\{F_k\}_{k=0}^{m}$. If so, should our test be based on all the samples or only on the samples of interest? An answer is found in the improved local power of the EL ratio test based on all samples over the test based on the subset of the samples. This answer is the same as that for the DEL ratio test under the DRM based on uncensored multiple samples; a rigorous treatment of this argument is given by Theorem 3 of Cai et al. (2016) for the uncensored case. Again, because of the similarity in the expressions of the DPEL and the DEL, that theorem also holds for the proposed EL ratio test for Type I censored samples. The details can be found therein.

7.4 Estimation of $F_k$

We now turn to the estimation of the CDFs $\{F_k\}$. The estimation of population quantiles is studied in the next section. Due to the DRM assumption, we naturally estimate $F_r$ at any $z \le c_0$ by
$$
\hat F_r(z)
= \sum_{k=0}^{m}\sum_{j=1}^{\tilde n_k}\hat p_{kj}\exp\big\{\hat\theta_r^{\top}Q(x_{kj})\big\}1(x_{kj}\le z)
= \sum_{k=0}^{m}\sum_{j=1}^{n_k}\hat p_{kj}\,\varphi_r(\hat\theta, x_{kj}, z).
\qquad (7.12)
$$

A few notational conventions have been and will continue to be used: $\sum_{k,j}$ will be regarded as summation over $k = 0, \ldots, m$ and $j = 1, \ldots, n_k$. When $\tilde n_k < j \le n_k$, the value of $x_{kj}$ is censored and we define
$$
\varphi_r(\theta, x_{kj}, z) = \exp\big\{\theta_r^{\top}Q(x_{kj})\big\}1(x_{kj}\le z) = 0 \text{ or } 1.
\qquad (7.13)
$$
Whether $\varphi_r(\theta, x_{kj}, z)$ takes the value 0 or 1 when $\tilde n_k < j \le n_k$ depends on whether it serves as an additive or a product term, and similarly for other quantities involving $x_{kj}$. With this convention, we may regard $\sum_{k,j}$ as a sum over $m+1$ i.i.d. samples.
Even though observations from the population $F_r$ are censored at $c_r \le c_0$, the connection between $F_r$ and $F_0$ through the DRM makes it possible to consistently estimate $F_r(z)$ for $z \in (c_r, c_0]$ when $c_r < c_0$.
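A minimal sketch of the estimator (7.12) follows. It assumes the fitted parameter matrix th (row r + 1 holding $(\hat\alpha_r, \hat\beta_r^{\top})$, with the first row zero), the pooled uncensored observations xu and their EL weights pw from (7.9), e.g. obtained with the hypothetical dpel sketch of Sect. 7.2.1; all object names are assumptions made here.

```r
## Sketch of the EL-DRM CDF estimator (7.12); object names are placeholders.
## xu: vector of all uncensored observations pooled over the m + 1 samples
## pw: vector of fitted EL weights p_kj for the same observations, from (7.9)
## th: (m + 1) x (d + 1) matrix of fitted DRM parameters; row 1 is zero
cdf_drm <- function(z, r, xu, pw, th, qfun) {
  Qmat <- cbind(1, do.call(rbind, lapply(xu, qfun)))  # rows Q(x_kj)^T = (1, q(x_kj)^T)
  tilt <- exp(as.vector(Qmat %*% th[r + 1, ]))        # exp{theta_r^T Q(x_kj)}
  sum(pw * tilt * (xu <= z))                          # F_r_hat(z) as in (7.12)
}
```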
Theorem 4 Assume the conditions of Theorem 1. For any pair of integers $0 \le r_1, r_2 \le m$ and any pair of real values $z_1, z_2 \le c_0$, we have
$$
\sqrt{n}\,\big\{\hat F_{r_1}(z_1) - F_{r_1}(z_1),\; \hat F_{r_2}(z_2) - F_{r_2}(z_2)\big\}^{\top} \xrightarrow{d} N(0, \Omega_{EL}).
$$

The expression for the $(i, j)$ entry of $\Omega_{EL}$ is
$$
\omega_{i,j}
= E_0\big\{\varphi_{r_i}(\theta^*, X, z_i)\varphi_{r_j}(\theta^*, X, z_j)/s(\theta^*, X)\big\}
+ \kappa_{r_i}^{\top}(z_i)\,U^{-1}\kappa_{r_j}(z_j)
- \rho_{r_i}^{-1}E_0\big\{\varphi_{r_i}(\theta^*, X, z_i)\big\}E_0\big\{\varphi_{r_j}(\theta^*, X, z_j)\big\}1(i = j),
$$
where, for $k = 0, \ldots, m$, $\kappa_k(z) = \big(\kappa_{k,1}^{\top}(z), \kappa_{k,2}^{\top}(z)\big)^{\top}$ is a vector with
$$
\kappa_{k,1}(z) = E_0\big[\big\{e_k 1(k \ne 0) - h(\theta^*, X)/s(\theta^*, X)\big\}\varphi_k(\theta^*, X, z)\big],
$$
$$
\kappa_{k,2}(z) = E_0\big[\big\{e_k 1(k \ne 0) - h(\theta^*, X)/s(\theta^*, X)\big\}\varphi_k(\theta^*, X, z)\otimes q(X)\big],
$$
and $e_k$ is a vector of length $m$ with the $k$th entry being 1 and the others 0.

The result of this theorem extends to multiple $r_1, \ldots, r_J$ with their corresponding $z_1, \ldots, z_J \le c_0$. For ease of presentation, we have given the result only for $J = 2$ above.
We have assumed, without loss of generality, that $c_r \le c_0$. When $c_r < c_0$, there is no direct information to estimate $F_r(z)$ for $z \in (c_r, c_0]$. The DRM assumption, however, allows us to borrow information from the samples from $F_0$ and the other $F_k$ to sensibly estimate $F_r(z)$ for $z$ in this range. This is an interesting result. When $z \le c_r$, we may estimate $F_r(z)$ via its empirical distribution based only on the sample from $F_r$. Theorem 4 can be used to show that the EL–DRM-based estimator has a lower asymptotic variance, i.e., $\Omega_{EL} \le \Omega_{EM}$, where $\Omega_{EM}$ is the covariance matrix of the empirical distribution.

7.5 Quantile Estimation

With a well-behaved EL–DRM CDF estimator in hand, we propose to estimate the $\tau$th, $\tau \in (0, 1)$, quantile $\xi_r$ of the population $F_r(x)$ by
$$
\hat\xi_r = \inf\{x: \hat F_r(x) \ge \tau\}.
\qquad (7.14)
$$
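Continuing the earlier sketch, the quantile estimator (7.14) simply inverts the fitted CDF over the pooled uncensored observations; cdf_drm is the hypothetical helper defined after (7.12), so this is again only an illustration.

```r
## Sketch of the quantile estimator (7.14): smallest observed x with F_r_hat(x) >= tau.
quantile_drm <- function(tau, r, xu, pw, th, qfun) {
  xs <- sort(unique(xu))
  Fs <- vapply(xs, cdf_drm, numeric(1), r = r, xu = xu, pw = pw, th = th, qfun = qfun)
  xs[which(Fs >= tau)[1]]                    # inf{x : F_r_hat(x) >= tau}
}
```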
The following theorem characterizes an important asymptotic property of $\hat\xi_r$.

Theorem 5 Assume the conditions of Theorem 1. Assume also that the density function $f_r(x)$ of $F_r(x)$ is positive and differentiable at $x = \xi_r$ for some $\tau \in (0, F_r(c_0))$. Then the EL quantile estimator (7.14) admits the following representation:
$$
\hat\xi_r = \xi_r - \big\{\hat F_r(\xi_r) - \tau\big\}/f_r(\xi_r) + O_p\big(n^{-3/4}\{\log n\}^{1/2}\big).
$$
A result of this nature was first obtained by Bahadur (1966) for the sample quantiles, hence the name Bahadur representation. It has been proven to hold much more

broadly, including for EL-based quantile estimators such as those in Chen and Chen (2000) and Chen and Liu (2013). Our result is of particular interest because the representation extends beyond the range $\xi_r < c_r$ to which Type I censorship would otherwise restrict estimation. The Bahadur representation sheds light on the large sample behavior of the quantile process, and it is most useful for studying the multivariate asymptotic normality of $\hat\xi_r$, as follows.
Theorem 6 Let $\xi_j$ be the $\tau_j$th quantile of $F_{r_j}$ with $\tau_j \in (0, F_{r_j}(c_0))$ for $j = 1, 2$. Under the conditions of Theorem 5, we have
$$
\sqrt{n}\,\big(\hat\xi_1 - \xi_1,\; \hat\xi_2 - \xi_2\big)^{\top} \xrightarrow{d} N(0, A\,\Omega_{EL}\,A)
$$
with $A = \mathrm{diag}\{f_{r_1}(\xi_1), f_{r_2}(\xi_2)\}^{-1}$.


Clearly, the improved precision of $\Omega_{EL}$ over $\Omega_{EM}$ leads to improved efficiency of the proposed $\hat\xi_r$ over the sample quantile.

7.6 Other Inferences on Quantiles

Theorem 6 has laid a solid basis for constructing approximate Wald-type confidence intervals and hypothesis tests for quantiles. For this purpose, we need consistent estimators of $\Omega_{EL}$ and of $f_r(\xi_r)$, the density function at $\xi_r$.

The analytical expression for $\Omega_{EL}$ is a function of $\theta^*$ and has the general form $E_0\{g(X, \theta^*)1(X \le c_0)\}$. To estimate $\Omega_{EL}$, it is most convenient to use the method of moments with the EL weights $\hat p_{kj}$:
$$
\hat E_0\{g(X, \hat\theta)1(X \le c_0)\} = \sum_{k,j}\hat p_{kj}\,g(x_{kj}, \hat\theta),
$$
where $\hat\theta$ is the MELE of $\theta$ and $\hat p_{kj}$ is as given in (7.9).


The value of $f_r(\xi_r)$ is most effectively estimated by the kernel method. Let $K(t)$ be a positive-valued function such that $\int K(t)\,dt = 1$ and $\int tK(t)\,dt = 0$. We estimate $f_r(z)$ at $z \le c_0$ by
$$
\hat f_r(z) = h_n^{-1}\sum_{k=0}^{m}\sum_{j=1}^{n_k}\hat p_{kj}\,\varphi_r(\hat\theta, x_{kj}, c_k)\,K\big\{(z - x_{kj})/h_n\big\}.
$$

When the bandwidth satisfies $h_n \to 0$ and $nh_n \to \infty$ as $n \to \infty$, this kernel estimator is easily shown to be consistent. The optimal bandwidth is of order $n^{-1/5}$ in terms of the asymptotic mean integrated squared error (Silverman 1986) at $z$ values that are not close to the boundary. In our simulations, we choose the density function of the standard normal distribution as the kernel and $n^{-1/5}$ as the bandwidth.
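A sketch of this kernel density estimator with the standard normal kernel and a bandwidth of order $n^{-1/5}$ is given below; it reuses the hypothetical xu, pw, th and qfun objects of the earlier sketches, and n_total stands for the total sample size $n$.

```r
## Sketch of the EL-weighted kernel density estimator of f_r(z).
dens_drm <- function(z, r, xu, pw, th, qfun, n_total) {
  hn   <- n_total^(-1/5)                              # bandwidth of order n^{-1/5}
  Qmat <- cbind(1, do.call(rbind, lapply(xu, qfun)))
  tilt <- exp(as.vector(Qmat %*% th[r + 1, ]))        # exp{theta_r^T Q(x_kj)}
  sum(pw * tilt * dnorm((z - xu) / hn)) / hn          # standard normal kernel K
}
```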

The covariance matrix $\Sigma_{EL}$ of the proposed quantile estimator is then estimated by substituting the estimated $\Omega_{EL}$ and the density estimates $\hat f_{r_i}(\hat\xi_i)$ into the expression of $\Sigma_{EL}$. The procedures have been implemented in our R software package "drmdel", which is available on the Comprehensive R Archive Network (CRAN).

7.7 Simulation Studies I: CDF and Quantile Estimation

We use simulation studies to demonstrate the advantages of combining information through the use of the DRM and the EL methodology. The simulation results for the proposed CDF and quantile estimation methods are presented in this section; those for the EL ratio test are given in the next section. Our estimation approach is particularly effective for efficient estimation of lower quantiles and is resistant to mild model misspecification and to the influence of large outliers. As individual samples are by nature sparse in observations at lower quantiles, it is particularly important to pool information from several samples. At the same time, the presence of Type I censorship matters little for estimating lower quantiles.

In all simulations, we set the number of samples to $m + 1 = 4$ and the sample sizes $n_k$ to $(110, 90, 100, 120)$. The number of simulation repetitions is set to 10,000.

7.7.1 Populations Satisfying the DRM Assumption

The DRM encompasses a large range of statistical models as subsets. In this simulation, we consider populations from a flexible parametric family, the generalized gamma distribution (Stacy 1962), denoted $GG(a, b, p)$. It has density function
$$
f(x) = \frac{a}{b^{ap}\Gamma(p)}\,x^{ap-1}\exp\{-(x/b)^a\}, \qquad x > 0.
$$

When $p = 1$, the generalized gamma distribution becomes the Weibull distribution; when $a = 1$, it becomes the gamma distribution. Generalized gamma distributions with a known shape parameter $a$ satisfy the DRM assumption with basis function $q(x) = (\log x, x^a)^{\top}$. In our simulations, we fix $a$ at 2 and choose parameter values such that the shapes and the first two moments of the populations closely resemble those of our lumber quality samples in real applications. We generate samples from four such populations and make these samples right-censored around their 75 % population quantiles. We also conduct simulations based on samples from Weibull, gamma, and normal populations, respectively.
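For readers wishing to replicate this setup, the sketch below draws a $GG(a, b, p)$ sample by using the fact that $(X/b)^a$ follows a Gamma$(p, 1)$ distribution under the density above, and then Type I censors it at the population 75 % quantile. The particular parameter values match the first population of Table 7.1, but the code is only an illustration, not the authors' simulation program.

```r
## Sketch: draw a GG(a, b, p) sample and Type I censor it at its 75 % quantile.
rgengamma <- function(n, a, b, p) b * rgamma(n, shape = p)^(1 / a)  # (X/b)^a ~ Gamma(p, 1)

set.seed(1)
a <- 2; b <- 4.3; p <- 2.8                       # GG(2, 4.3, 2.8) as in Table 7.1
x_full <- rgengamma(110, a, b, p)                # full sample of size n_0 = 110
cutoff <- b * qgamma(0.75, shape = p)^(1 / a)    # population 75 % quantile
x_obs  <- x_full[x_full <= cutoff]               # uncensored observations
n_cens <- sum(x_full > cutoff)                   # number of Type I censored values
```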
We first study CDF estimation, estimating $F_r(z)$ for all $r = 0, \ldots, m$ with $z$ being the 5 %, 10 %, 30 %, 40 %, and 50 % quantiles of the baseline population $F_0$. Based on the Type I censored data, the empirical distribution $\check F_r(z)$ is well defined for $z$ values smaller than the censoring point but not for larger values of $z$. We purposely selected

Table 7.1 Simulation results for CDF estimation based on generalized gamma samples
Columns: population (with censoring point); $z$; $F_r(z)$; $V(\hat F_r)$ $(10^{-3})$; $\sigma^2(\hat F_r)/nV(\hat F_r)$; $\hat\sigma^2(\hat F_r)/nV(\hat F_r)$; $V(\check F_r)/V(\hat F_r)$; $B(\hat F_r)$ (%); $B(\check F_r)$ (%)
GG.2; 4:3; 2:8/ 3.64 0.05 0.34 0.99 1.00 1:24 0:92 0.45
c0 D 8:2 4.26 0.10 0.67 0.96 0.98 1:21 1:36 0.40
5.68 0.30 1.62 0.98 0.98 1:20 0:08 0.03
6.23 0.40 1.92 0.99 0.98 1:14 0:07 0.03
6.76 0.50 2.10 0.99 0.98 1:09 0:05 0.06
GG.2; 2:6; 5/ 3.64 0.05 0.39 1.02 1.02 1:32 0:93 0.07
c1 D 6:5 4.26 0.13 1.05 0.96 0.97 1:26 0:34 0:16
5.68 0.52 2.43 1.00 1.00 1:14 0:10 0.05
6.23 0.68 2.07 1.02 1.01 1:15 0:01 0.01
6.76 0.80 3.27 0.95 1.02 NA 0:34 NA
GG.2; 3:8; 3:5/ 3.64 0.03 0.23 0.96 0.98 1:38 0:78 0.20
c2 D 8:1 4.26 0.07 0.54 0.94 0.95 1:27 0:21 0.22
5.68 0.28 1.62 1.00 0.99 1:21 0:20 0.06
6.23 0.39 2.00 1.01 1.01 1:18 0:01 0.11
6.76 0.50 2.23 1.02 1.01 1:10 0:15 0.05
GG.2; 2:5; 6/ 3.64 0.02 0.12 1.02 1.05 1:54 1:24 1.21
c3 D 6:8 4.26 0.07 0.45 0.98 0.98 1:31 0:52 0.39
5.68 0.41 1.76 1.01 1.00 1:14 0:02 0:05
6.23 0.59 1.77 1.01 1.01 1:13 0:10 0:06
6.76 0.74 1.55 1.01 1.01 1:03 0:03 0:05
$F_r$: true CDF value; $\hat F_r$ and $\check F_r$: EL–DRM and EM estimates; V: simulated variance; $\sigma^2$: theoretical variance; $\hat\sigma^2$: average of variance estimates; B: relative bias in percentage; NA: $\check F_r(z)$ not defined

populations such that at some $z$ values $\check F_r(z)$ is not well defined but the EL–DRM estimator $\hat F_r(z)$ is. We report various summary statistics from the simulation for both the empirical distribution and our proposed EL–DRM estimators.

The simulation results for data from the generalized gamma populations are reported in Table 7.1. The parameter values of the populations are listed in the leftmost column. Note that the basis function of the DRM for these populations is $q(x) = (\log x, x^2)^{\top}$. The censoring points for the four samples are $(8.2, 6.5, 8.1, 6.8)$. The $z$ values at which the CDFs are estimated are listed in the "$z$" column, and the corresponding true CDF values are listed in the "$F_r(z)$" column.

The fourth column is the variance of the CDF estimator observed in the simulation (over 10,000 repetitions). The fifth (respectively, sixth) column is the ratio of the theoretical asymptotic variance (respectively, the average estimated variance) to the variance observed in the simulation. All the values in these columns are close to 1, indicating that the variance estimator constructed in Sect. 7.4 and its numerical computation method given in Sect. 7.6 work very well. The seventh column is the ratio between two simulated variances, one based on the empirical distribution and the other on our proposed EL–DRM-based estimator. For small $z$ values, our

proposed estimator gains 20 %–55 % precision in terms of simulated variances. The gain in efficiency is smaller at large $z$ values but still noticeable.

The last two columns are relative biases. The relative biases of both the EL–DRM estimators and the empirical (EM) estimators are generally below 1.5 %, which seems rather satisfactory. Overall, compared to the variance, the bias is a very small portion of the mean squared error (MSE), so we omit a comparison of the two estimators in terms of MSE.

When the empirical distribution $\check F_r(z)$ is not defined for a particular $z$ due to Type I censorship, we put "NA"s in the corresponding cells of the table. As indicated, $z = 6.76$ is beyond the censoring point of $F_1$, so $F_1(6.76)$ cannot be estimated by the empirical distribution.
We next examine the results for quantile estimation. For each population, we estimate quantiles at levels $\tau = 0.05$, 0.1, 0.3, 0.5 and 0.6. The results for data from the generalized gamma populations are given in Table 7.2. The first column lists the parameter values of the populations, the second gives the levels at which the quantiles are estimated, and the third provides the corresponding true quantile values. The fourth column is the simulated variance. The fifth and sixth columns reflect the quality of the theoretical asymptotic variance and the variance estimation

Table 7.2 Simulation results for quantile estimation based on generalized gamma samples
Columns: population; $\tau$; $\xi_r$; $V(\hat\xi_r)$ $(10^{-2})$; $\sigma^2(\hat\xi_r)/nV(\hat\xi_r)$; $\hat\sigma^2(\hat\xi_r)/nV(\hat\xi_r)$; $V(\check\xi_r)/V(\hat\xi_r)$; $B(\hat\xi_r)$ (%); $B(\check\xi_r)$ (%)
GG.2; 4:3; 2:8/ 0.05 3.64 8.44 1.02 0.98 1.27 0:30 0:44
c0 D 8:2 0.10 4.26 6.75 0.96 0.99 1.23 0:10 0:93
0.30 5.68 5.38 0.97 1.02 1.17 0:07 0:40
0.50 6.76 5.89 1.00 1.01 1.09 0:09 0:33
0.60 7.31 6.44 0.99 1.04 1.06 0:03 0:36
GG.2; 2:6; 5/ 0.05 3.65 4.41 1.02 1.01 1.29 0:52 0:36
c3 D 6:5 0.10 4.06 3.30 0.99 1.00 1.27 0:25 0:69
0.30 4.96 2.43 0.98 1.03 1.22 0:03 0:38
0.50 5.62 2.53 1.01 1.04 1.15 0:01 0:32
0.60 5.95 2.72 0.99 1.06 1.12 0:02 0:34
GG.2; 3:8; 3:5/ 0.05 3.96 7.94 0.95 0.94 1.31 0:37 1:40
c2 D 8:1 0.10 4.52 6.07 0.92 0.96 1.28 0:14 0:86
0.30 5.81 4.47 1.01 1.03 1.20 0:12 0:40
0.50 6.77 4.80 1.05 1.06 1.13 0:00 0:35
0.60 7.25 5.34 1.02 1.06 1.08 0:03 0:32
GG.2; 2:5; 6/ 0.05 4.04 3.28 0.98 0.96 1.40 0:27 0:97
c1 D 6:8 0.10 4.44 2.38 0.98 0.99 1.30 0:16 0:54
0.30 5.31 1.72 1.00 1.04 1.21 0:07 0:23
0.50 5.95 1.76 1.02 1.05 1.12 0:02 0:19
0.60 6.27 1.91 0.99 1.05 1.10 0:02 0:19
$\xi_r$: true quantile; $\hat\xi_r$ and $\check\xi_r$: EL–DRM and EM quantile estimates; V: simulated variance; $\sigma^2$: theoretical variance; $\hat\sigma^2$: average of variance estimates; B: relative bias in percentage

relative to the simulated variance. Again, we see that all numbers in these columns
are close to 1, showing that the asymptotic variance and its estimates are good
approximations to the actual variance of the estimator.
The seventh column, which gives the ratio of two simulated variances (one based on the empirical distribution and the other on the EL–DRM quantile estimator), shows that the EL–DRM quantile estimator is uniformly more efficient than the EM quantile estimator (i.e., the sample quantile). In applications, the gain is particularly substantial and helpful at lower quantiles (5 % and 10 %), being between 23 % and 40 %. The biases of the EL–DRM quantile estimator are generally smaller than those of the EM estimator in absolute value, except in one case. In general, the biases of both estimators are satisfactorily low.
The asymptotic normality shown in Theorem 6 can be used to construct confi-
dence intervals for quantiles and quantile differences based on EL–DRM quantile
estimators. The same is true for the EM quantile estimators. The simulation results
for confidence intervals of quantiles and quantile differences are shown in Table 7.3.
In all cases, the EL–DRM intervals outperform the EM intervals in terms of both coverage probabilities and lengths. For single lower quantiles at $\tau = 0.05$ and 0.1, the coverage probabilities of the EL–DRM intervals are between 91 % and 93.5 %. In comparison, on average the EM intervals have around 2 % less coverage. Both methods have coverage probabilities closer to nominal for the higher quantiles. The EL–DRM intervals have shorter average lengths and are therefore more efficient. The simulation results for confidence intervals of quantile differences between $F_0$ and $F_1$ are also shown in the last two rows of Table 7.3. The coverage probabilities of the EL–DRM intervals are all between 94 % and 95 %, so they are very close to the nominal level. The coverage probabilities of the EM intervals are 1 % to 2 % less than those of the EL–DRM intervals, while the average lengths are somewhat greater. The simulation results for quantile differences between other populations are similar and are omitted.

Table 7.3 Simulation results for EL–DRM and EM confidence intervals of quantiles
$\tau$ 0.05 0.10 0.30 0.50 0.60
EL EM EL EM EL EM EL EM EL EM
0 Length: 1:09 1:22 0:99 1:10 0:91 0:96 0:94 0:97 0:99 1:01
Coverage: 91:1 89:0 92:9 91:1 93:9 92:3 93:4 92:7 94:2 92:8
1 Length: 0:80 0:92 0:70 0:80 0:61 0:67 0:63 0:67 0:65 0:71
Coverage: 91:6 90:2 93:0 91:9 94:4 93:4 94:7 93:5 94:4 94:7
2 Length: 1:04 1:21 0:93 1:05 0:83 0:91 0:87 0:91 0:91 0:94
Coverage: 91:5 89:6 92:8 90:9 94:2 92:8 94:0 92:8 94:0 93:4
3 Length: 0:68 0:80 0:59 0:68 0:52 0:57 0:53 0:56 0:55 0:59
Coverage: 92:3 91:8 93:3 93:1 94:8 94:1 94:9 94:6 94:8 95:6
0  1 Length: 1:27 1:47 1:13 1:30 1:05 1:12 1:08 1:12 1:14 1:18
Coverage: 94:7 92:7 94:7 93:1 94:3 94:0 94:3 93:7 94:8 94:5
Nominal level: 95 %; $\xi_0 - \xi_1$: difference between quantiles of $F_0$ and $F_1$; EL: EL–DRM intervals; EM: EM intervals

The simulation results for the Weibull, gamma and normal populations are quite similar to the reported results for the generalized gamma populations: the EL–DRM CDF and quantile point and interval estimators are generally more efficient than their empirical counterparts. The detailed results are omitted for brevity. It is worth noting that, compared to the EM quantile estimator, the efficiency gain of the EL–DRM quantile estimator for the Weibull distributions is as high as 100 % to 200 % for lower quantiles at $\tau = 0.05$ and 0.1. The efficiency gains of the EL–DRM estimators for both the normal and the gamma populations are between 10 % and 45 %.

7.7.2 Robustness to Model Misspecification

The DRM is flexible and includes a large number of distribution families as special cases. The model fits most situations, being applicable whenever the basis function $q(x)$ is sufficiently rich. Misspecifying $q(x)$ has an adverse effect on the estimation of the DRM parameters (Fokianos and Kaimi 2006). However, in the targeted applications, and likely many others, the coefficients of $q(x)$ in the model do not have direct interpretations, and the adverse effect of model misspecification on the estimation of population distributions and quantiles is limited.
In this simulation, we choose the population distributions from four different families: Weibull, generalized gamma, gamma, and log-normal. The parameter values are again chosen such that the shapes and the first two moments of the populations approximately match the real lumber data. These parameter values are listed in the first column of Table 7.4, where $W(\cdot\,,\cdot)$ stands for a Weibull distribution parameterized by its shape and scale, $G(a, b)$ stands for a gamma distribution with shape $a$ and rate $b$, and $LN(\mu, \sigma)$ stands for a log-normal distribution with mean $\mu$ and standard deviation $\sigma$ on the log scale. Note that the generalized gamma distribution in the current simulation is the $F_1$ used in the simulation in Sect. 7.7.1. These four distributions do not fit into any DRM, and therefore no true $q(x)$ exists. In applications, these four families are often chosen to model positively distributed populations, and none of them is necessarily true. Hence, taken in combination, they form a sensible example of a misspecified model.

In the simulation, the observations are censored around their 75 % population quantiles, which are 9.7, 6.5, 9.1, and 8.7. We choose $q(x) = (\log x, x, x^2)^{\top}$, which combines the basis function that suits gamma distributions and the one that suits the generalized gamma distributions in our previous simulation. The settings for this simulation are otherwise the same as those in the previous subsection.
The simulation results for quantile estimation are summarized in Table 7.4. As
the fifth column of the table shows, the EL–DRM-based variance estimates obtained
according to Theorem 6 closely match the simulated variances. The efficiency
comparison again strongly favours the EL–DRM quantile estimator, showing gains
in the range of 4 % to 45 %. The efficiency gain is most prominent at the 5 % and

Table 7.4 Simulation results for quantile estimation under misspecified DRM
Columns: population; $\tau$; $\xi_r$; $V(\hat\xi_r)$; $\hat\sigma^2(\hat\xi_r)/nV(\hat\xi_r)$; $V(\check\xi_r)/V(\hat\xi_r)$; $B(\hat\xi_r)$ (%); $B(\check\xi_r)$ (%)
W.4:5; 9/ 0.05 4.65 0.16 1.03 1.15 0:05 0:32
c0 D 9:7 0.10 5.46 0.11 1.07 1.16 0:06 1:00
0.30 7.16 0.07 1.06 1.15 0:12 0:35
0.50 8.30 0.06 1.00 1.06 0:00 0:29
0.60 8.83 0.06 1.03 1.04 0:01 0:30
GG.2; 2:6; 5/ 0.05 3.65 0.05 0.99 1.05 0:53 0:43
c1 D 6:5 0.10 4.06 0.04 1.01 1.12 0:24 0:70
0.30 4.96 0.03 1.04 1.08 0:05 0:35
0.50 5.62 0.03 1.02 1.06 0:05 0:27
0.60 5.95 0.03 1.03 1.06 0:06 0:30
.20; 2:5/ 0.05 5.30 0.06 1.00 1.44 0:39 1:10
c2 D 9:1 0.10 5.81 0.05 1.01 1.27 0:19 0:56
0.30 6.97 0.04 1.02 1.13 0:00 0:33
0.50 7.87 0.05 1.00 1.08 0:01 0:28
0.60 8.32 0.05 1.01 1.06 0:01 0:25
LN.2; 0:25/ 0.05 4.90 0.04 0.99 1.35 0:46 0:79
c3 D 8:7 0.10 5.36 0.04 0.97 1.20 0:23 0:46
0.30 6.48 0.03 1.02 1.15 0:00 0:26
0.50 7.39 0.04 1.06 1.09 0:00 0:24
0.60 7.87 0.05 1.06 1.05 0:02 0:26
$\xi_r$: true quantile; $\hat\xi_r$ and $\check\xi_r$: EL–DRM and EM quantile estimates; V: simulated variance; $\hat\sigma^2$: average of variance estimates; B: relative bias in percentage

10 % quantiles. The relative bias of the EL–DRM estimator is smaller than that of
the EM estimator in most cases, and both are reasonably small.
The results for quantile interval estimation are given in Table 7.5. The EL–DRM
intervals for both quantiles and quantile differences have closer to nominal coverage
probabilities in most cases compared to the EM intervals. In all cases, the EL–DRM
intervals are also superior to the EM intervals in terms of average lengths.
In conclusion, the simulation results show that the EL–DRM method retains its
efficiency gain even when there is a mild model misspecification.
We also conducted simulations for two-parameter Weibull populations and two-
component normal mixture populations under misspecified DRMs. The results are
similar to those presented and are omitted for brevity.

7.7.3 Robustness to Outliers

Often, the assumed model in an application fails to account for a small proportion
of comparatively very large values in the sample. These outliers may introduce

Table 7.5 Simulation results for confidence intervals of quantiles under misspecified DRM
$\tau$ 0.05 0.10 0.30 0.50 0.60
EL EM EL EM EL EM EL EM EL EM
0 Length: 1:52 1:56 1:32 1:38 1:02 1:07 0:95 0:97 0:95 0:96
Coverage: 89:9 87:2 93:0 90:6 94:1 92:2 93:1 92:5 93:3 92:8
1 Length: 0:87 0:91 0:75 0:80 0:65 0:67 0:65 0:67 0:67 0:70
Coverage: 89:6 89:8 91:9 92:3 94:1 93:8 93:8 93:3 93:1 94:6
2 Length: 0:93 1:11 0:85 0:95 0:79 0:83 0:83 0:86 0:88 0:91
Coverage: 92:2 90:1 92:6 91:8 94:1 93:1 93:8 92:4 93:1 92:9
3 Length: 0:79 0:92 0:73 0:81 0:72 0:76 0:79 0:82 0:86 0:88
Coverage: 91:9 91:5 92:5 92:7 94:2 93:2 94:2 93:1 93:9 93:8
0  1 Length: 1:65 1:70 1:46 1:51 1:16 1:20 1:09 1:12 1:11 1:14
Coverage: 90:7 90:2 94:0 92:5 94:7 93:6 94:0 93:7 94:1 94:2
Nominal level: 95 %; $\xi_0 - \xi_1$: difference between quantiles of $F_0$ and $F_1$; EL: EL–DRM intervals; EM: EM intervals

substantial instability into the classical optimal inference methods. Therefore,


specific robustified procedures are often developed to limit the influence of the
potential outliers.
The proposed EL–DRM method for Type I censored data is by nature robust
to large-valued outliers for lower quantile estimation. In fact, potential outliers
would be censored automatically. We may purposely induce Type I censoring on otherwise fully observed data to achieve robust estimation of lower quantiles.
In this simulation, we form new populations by mixing the $F_k$ of Sect. 7.7.1 with a 10 % subpopulation from a normal distribution. The mean of the normal subpopulation is set to 11, around the 95 % quantile of $F_0$, and its standard deviation is taken to be 1, about half of that of $F_0$. More precisely, the population distributions are the mixtures
$$
0.9F_k + 0.1N(11, 1), \qquad k = 0, 1, 2, 3.
$$

Accurately estimating the lower population quantiles remains our target. The observations are censored at the 85 % quantiles of the corresponding populations: 10.0, 7.8, 9.8, and 8.0. We compute the estimates for the 0.05, 0.1, and 0.15 population quantiles. The simulation results are reported in Table 7.6. The first column lists the parameter values of the populations, the second gives the levels at which the quantiles are estimated, and the third provides the corresponding true quantile values. The fourth and fifth columns are the relative biases of the EL–DRM estimators based on censored data ($\hat\xi_r$) and full data ($\hat\xi_r^{(f)}$). At the 0.05 quantiles, the relative biases of $\hat\xi_r$ are below 0.51 %, compared to between 4 and 8 % for $\hat\xi_r^{(f)}$. The sixth and seventh columns are the simulated variances of $\hat\xi_r$ and $\hat\xi_r^{(f)}$, respectively. The variances of $\hat\xi_r$ are slightly larger than those of $\hat\xi_r^{(f)}$ in general. The eighth and ninth columns reflect the precision of the variance estimators.

Table 7.6 Simulation results for quantile estimation when samples contain large outliers
Columns: $F_r$; $\tau$; $\xi_r$; $B(\hat\xi_r)$; $B(\hat\xi_r^{(f)})$; $V(\hat\xi_r)$; $V(\hat\xi_r^{(f)})$; $\hat\sigma^2(\hat\xi_r)/nV(\hat\xi_r)$; $\hat\sigma^2(\hat\xi_r^{(f)})/nV(\hat\xi_r^{(f)})$; $M(\hat\xi_r)$; $M(\hat\xi_r^{(f)})$; $M(\check\xi_r)$

0:9GG.2; 4:3; 2:8/ 0.05 3.73 0:44 7:64 0.86 0.71 0.98 0.61 0.86 1.52 1.12
C0:1N.11; 1/ 0.10 4.37 0:06 3:90 0.70 0.55 1.00 0.78 0.70 0.84 0.91
0.15 4.83 0:13 1:71 0.63 0.49 1.01 0.89 0.63 0.56 0.78
0:9GG.2; 2:6; 5/ 0.05 3.71 0:51 4:63 0.45 0.42 1.00 1.16 0.45 0.72 0.60
C0:1N.11; 1/ 0.10 4.13 0:23 2:81 0.35 0.34 1.00 1.26 0.36 0.47 0.47
0.15 4.42 0:17 1:74 0.31 0.31 1.01 1.30 0.31 0.37 0.39
0:9GG.2; 3:8; 3:5/ 0.05 4.04 0:28 5:37 0.76 0.60 1.00 0.66 0.76 1.07 1.15
C0:1N.11; 1/ 0.10 4.62 0:10 2:78 0.62 0.49 1.00 0.83 0.62 0.65 0.83
0.15 5.04 0:11 1:24 0.55 0.44 1.00 0.93 0.55 0.48 0.72
0:9GG.2; 3:8; 3:5/ 0.05 4.10 0:27 4:05 0.32 0.27 0.99 1.14 0.32 0.55 0.47
C0:1N.11; 1/ 0.10 4.51 0:18 2:62 0.25 0.23 1.02 1.28 0.25 0.36 0.33
0.15 4.79 0:12 1:78 0.21 0.20 1.04 1.36 0.21 0.28 0.28
$\xi_r$: true quantile; $\hat\xi_r$ and $\hat\xi_r^{(f)}$: EL–DRM quantile estimates based on censored and full data; $\check\xi_r$: EM quantile estimates; V: simulated variance on the scale of $10^{-1}$; $\hat\sigma^2$: average of variance estimates; B: relative bias in percentage; M: MSE on the scale of $10^{-1}$

Table 7.7 Simulation results for confidence intervals of quantiles when samples contain large outliers
$\tau$ 0.05 0.10 0.15
EL EL(f) EM EL EL(f) EM EL EL(f) EM
0 Length: 1:10 0:81 1:26 1:02 0:81 1:14 0:98 0:81 1:07
Coverage: 91:8 63:9 89:1 93:3 81:6 91:0 93:6 90:8 91:5
1 Length: 0:81 0:85 0:94 0:73 0:80 0:83 0:68 0:78 0:77
Coverage: 92:0 92:4 90:3 93:1 94:8 92:3 93:6 96:3 92:7
2 Length: 1:05 0:77 1:24 0:96 0:78 1:10 0:91 0:79 1:03
Coverage: 92:3 72:3 89:3 93:6 85:9 91:2 93:8 92:1 92:0
3 Length: 0:69 0:69 0:83 0:61 0:66 0:71 0:58 0:65 0:65
Coverage: 92:7 89:0 92:1 93:8 93:5 93:1 94:2 95:8 93:6
0  1 Length: 1:27 0:71 1:51 1:17 0:92 1:34 1:11 1:02 1:26
Coverage: 94:8 63:4 92:5 95:1 83:3 92:7 94:7 92:6 93:2
Nominal level: 95 %; $\xi_0 - \xi_1$: difference between quantiles of populations 0 and 1; EM: EM intervals; EL: EL–DRM intervals based on censored data; EL(f): EL–DRM intervals based on full data

The entries for $\hat\xi_r$ are all close to 1, showing that both the precision and the stability of the variance estimation for $\hat\xi_r$ are superior. The entries for $\hat\xi_r^{(f)}$ based on the full data fluctuate between 0.61 and 1.36, revealing non-robustness and inaccurate variance estimation.

We also report the MSEs of the estimators in the last three columns of Table 7.6. In most of the cases, $\hat\xi_r$ is superior to both $\hat\xi_r^{(f)}$ and $\check\xi_r$. The gains in MSE are most remarkable at the 0.05 quantile.


Table 7.7 shows the simulation results for quantile confidence intervals. Because of the outliers, the EL–DRM intervals based on the full data in most cases have much lower coverage probabilities than the nominal 95 % level for the 0.05 and 0.1 quantiles; the performance improves considerably for the 0.15 quantile. The EL–DRM intervals based on censored data have coverage probabilities much closer to nominal in general.

In conclusion, for lower quantile estimation, it makes sense to collect much cheaper censored data, as doing so results in a large gain in robustness with only a very mild loss of efficiency. If efficiency is deemed especially important, the money saved by collecting cheaper data could be used to increase the sample size.

7.8 Simulation Studies II: EL Ratio Test

We now carry out simulations to study the power of the EL ratio test under correctly specified and misspecified DRMs. As in the simulation studies for CDF and quantile estimation, we set the number of populations to $m + 1 = 4$ and consider populations with generalized gamma distributions. The parameters of these distributions are again chosen so that their first two moments closely match those of the lumber strength samples in our application. Again, we set the number of simulation repetitions to 10,000.
Since our primary application is to detect differences among lumber populations, we test the following hypotheses in our simulations,
$$
H_0: F_0 = F_1 = F_2 = F_3 \quad\text{against}\quad H_a: F_i \ne F_j \text{ for some } i, j = 0, \ldots, 3,
\qquad (7.15)
$$
at the significance level of 0.05. This is the same as (7.10) with $g(\beta) = \beta$.
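In terms of the hypothetical dpel sketch of Sect. 7.2.1, the ELRT of (7.15) takes only a few lines, since under $H_0: \beta = 0$ all $\alpha_k$ equal 0 as well and the constrained MELE is $\theta = 0$; the objects x, n, cc and qfun are the assumed data layout of that sketch, not code from the chapter.

```r
## Sketch of the EL ratio test of (7.15), i.e. H0: beta = 0, using the dpel sketch.
m   <- 3
d   <- length(qfun(x[[1]][1]))                   # dimension of the basis q(x)
fit <- optim(rep(0, m * (d + 1)), dpel, x = x, n = n, cc = cc, qfun = qfun,
             method = "BFGS", control = list(fnscale = -1))
Rn   <- 2 * (fit$value - dpel(rep(0, m * (d + 1)), x, n, cc, qfun))
pval <- pchisq(Rn, df = m * d, lower.tail = FALSE)   # chi-square with md df (Theorem 2)
```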
In all simulations, the four samples are Type I censored at the 0.9, 0.8, 0.87 and 0.83 population quantiles of the baseline distribution $F_0$, respectively.
When samples are Type I censored, the proposed EL ratio test does not have many nonparametric or semiparametric competitors. A straightforward competitor is a Wald-type test based on the asymptotic normality of $\hat\beta$ under the DRM. It uses the test statistic $n\hat\beta^{\top}\hat\Sigma^{-1}\hat\beta$, with $\hat\Sigma$ being a consistent estimator of the asymptotic covariance matrix of $\hat\beta$. The Wald-type test also has a chi-square limiting reference distribution under $H_0$ of (7.15). Probably the most commonly used semiparametric competitor is the partial likelihood ratio test based on the celebrated Cox proportional hazards (CoxPH) model (Cox 1972). The CoxPH model for multiple samples amounts to assuming
$$
h_k(x) = \exp(\beta_k)h_0(x),
$$
where $h_k(x)$ is the hazard function of the $k$th sample. Clearly, the CoxPH model imposes strong restrictions on how the $m + 1$ populations are connected. In comparison, the DRM is more flexible in that it allows the density ratio to be a function of $x$. This limitation of the CoxPH approach has been shown by Cai et al. (2016) for the case of uncensored multiple samples: the power of the partial likelihood ratio test under the CoxPH model is lower than that of the DEL ratio test under the DRM when the hazard ratio of the populations is not constant in $x$.

In our simulations, we compare the powers of the proposed EL ratio test under the DRM (ELRT), the Wald-type test under the DRM (Wald-DRM), and the partial likelihood ratio test under the CoxPH model. In the CoxPH model, we use $m$ dummy population indicators as covariates. The corresponding partial likelihood ratio has a $\chi^2_m$ limiting distribution under the null hypothesis of (7.15).
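For reference, this CoxPH competitor can be carried out with the survival package in R; the sketch below is only an illustration, with hypothetical vectors time, status (1 = uncensored) and pop (population label 0–3), and refers the partial likelihood ratio statistic to its $\chi^2_3$ limit under $H_0$ of (7.15).

```r
## Sketch of the CoxPH partial likelihood ratio test of (7.15); data objects are placeholders.
library(survival)
fit  <- coxph(Surv(time, status) ~ factor(pop))   # m = 3 dummy population indicators
lrt  <- 2 * diff(fit$loglik)                      # partial likelihood ratio statistic
pval <- pchisq(lrt, df = 3, lower.tail = FALSE)   # chi-square with m degrees of freedom
```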

7.8.1 Populations Satisfying the DRM Assumption

We consider two settings for population distributions: generalized gamma distributions with fixed $a = 2$ in the first setting, and generalized gamma distributions with fixed $a = 1$ in the second. Note that in the second setting, $a = 1$ gives regular

Table 7.8 Parameter values for power comparison under correctly specified DRMs (Sect. 7.8.1). $F_0$ remains unchanged across parameter settings 0–5; GG(a, b, p): generalized gamma distribution with parameters a, b and p
Columns: parameter settings 1–5, each giving (b, p)
$F_0 =$ GG(2, 6, 0.8), a = 2 in all settings:
  F1: 6.6 0.75 | 6.9 0.70 | 7.2 0.68 | 7.2 0.7 | 7.5 0.66
  F2: 6.35 0.66 | 6.35 0.63 | 6.65 0.61 | 6.6 0.65 | 6.8 0.60
  F3: 5.5 0.88 | 5.3 0.92 | 5.2 1 | 5.2 1.2 | 5.1 1.3
$F_0 =$ GG(1, 2, 2) (gamma), a = 1 in all settings:
  F1: 2.2 1.8 | 2.3 1.75 | 2.4 1.72 | 2.5 1.62 | 2.6 1.40
  F2: 1.75 2.2 | 1.7 2.3 | 1.7 2.35 | 1.6 2.45 | 1.55 2.5
  F3: 1.8 2.3 | 1.78 2.55 | 1.78 2.7 | 1.75 2.8 | 1.7 2.9

[Figure 7.1: power (y-axis) against parameter settings 0–5 (x-axis) for the ELRT, Wald-DRM and CoxPH tests; left panel: generalized gamma with shape a = 2, $q(x) = (\log x, x^2)^{\top}$; right panel: generalized gamma with shape a = 1 (gamma), $q(x) = (\log x, x)^{\top}$.]
Fig. 7.1 Power curves of the ELRT, Wald-DRM, and CoxPH; the ELRT and Wald-DRM are based on correctly specified DRMs; the parameter setting 0 corresponds to the null model and the settings 1–5 correspond to alternative models

gamma distributions. Under each setting of population distributions, we compare the power of the three competing tests for six parameter settings (settings 0–5), with parameter setting 0 under the null hypothesis $H_0$ and parameter settings 1–5 under the alternative hypothesis $H_a$. These parameter settings are given in Table 7.8.

The populations in the first distribution setting satisfy a DRM with basis function $q(x) = (\log(x), x^2)^{\top}$, and those in the second distribution setting satisfy a DRM with basis function $q(x) = (\log(x), x)^{\top}$. We fit DRMs with these basis functions to the corresponding Type I censored samples.
The power curves of the three testing methods are shown in Fig. 7.1. The type I errors of the ELRT and the CoxPH are very close to the nominal level, while that of the Wald-DRM is a little lower than the nominal level. The ELRT has the highest power under all the settings. The Wald-DRM is also more powerful than the CoxPH under most parameter settings. In summary, under our simulation settings, the DRM-based tests are more powerful than the CoxPH, and the proposed ELRT is the most powerful of all.

7.8.2 Robustness to Model Misspecification

We now examine the power of the ELRT under misspecified DRMs. As we have argued, since the DRM is very flexible, a mild misspecification of the basis function should not significantly reduce the power of a DRM-based test. We now demonstrate this idea by simulation studies.

We again consider two settings for the population distributions: three-parameter generalized gamma distributions in the first setting, and generalized gamma distributions with fixed $p = 1$ in the second. Note that in the second setting, $p = 1$ gives the two-parameter Weibull distribution. As in the last simulation study, under each setting of population distributions we compare the power of the three competing tests for six parameter settings (settings 0–5), with parameter setting 0 under the null hypothesis $H_0$ and parameter settings 1–5 under the alternative hypothesis $H_a$. These parameter settings are given in Table 7.9.
The populations in neither distribution setting satisfy the DRM assumption (7.1). In either case, we still fit a DRM with the basis function that is suitable for gamma populations, $q(x) = (\log(x), x)^{\top}$, to the censored samples. Such a basis function is suggested by the shapes of the histograms of the samples: both generalized gamma and Weibull samples have shapes similar to gamma samples, and hence are easily mistaken for samples from the gamma family.

The simulated rejection rates of the three tests are shown in Fig. 7.2. It is clear that, although the DRMs are misspecified, the ELRT still has the highest power while its type I error rates are close to the nominal level. Again, the Wald-DRM is not as powerful as the ELRT, but is more powerful than the CoxPH in most cases.

7.9 Analysis of Lumber Quality Data

Two important lumber strength measures are the modulus of tension (MOT) and the modulus of rupture (MOR). They measure, respectively, the tension and bending strengths of lumber. Three MOT and three MOR samples were collected in labs. The first MOT sample (MOT 0) is not subject to censoring, the second (MOT 1) is right-censored at $5.0 \times 10^3$ pounds per square inch (psi), and the third (MOT 2) at $4.0 \times 10^3$ psi. The size of each sample is 80. The numbers of uncensored observations for MOT 1 and MOT 2 are 52 and 38, respectively. The kernel density plots of these samples are shown in Fig. 7.3a.
Table 7.9 Parameter values for power comparison under misspecified DRMs (Sect. 7.8.2). $F_0$ remains unchanged across parameter settings 0–5; GG(a, b, p): generalized gamma distribution with parameters a, b and p
Columns: parameter settings 1–5, each giving (a, b, p)
$F_0 =$ GG(1.5, 2, 2):
  F1: 1.47 2.1 1.85 | 1.45 2.2 1.85 | 1.4 2.3 1.8 | 1.4 2.3 1.75 | 1.35 2.35 1.7
  F2: 1.42 2.2 1.8 | 1.4 2.3 1.75 | 1.35 2.4 1.65 | 1.3 2.43 1.6 | 1.3 2.55 1.55
  F3: 1.55 1.9 2.2 | 1.6 1.8 2.45 | 1.65 1.75 2.5 | 1.7 1.8 2.5 | 1.75 1.78 2.65
$F_0 =$ GG(2, 6, 1) (Weibull), p = 1 in all settings:
  F1: 1.75 5.7 1 | 1.7 5.6 1 | 1.65 5.55 1 | 1.6 5.45 1 | 1.5 5.4 1
  F2: 2.1 6.2 1 | 2.2 6.3 1 | 2.3 6.35 1 | 2.4 6.45 1 | 2.5 6.6 1
  F3: 1.87 5.9 1 | 1.8 5.8 1 | 1.75 5.7 1 | 1.7 5.7 1 | 1.6 5.6 1

[Figure 7.2: power (y-axis) against parameter settings 0–5 (x-axis) for the ELRT, Wald-DRM and CoxPH tests; left panel: generalized gamma populations, fitted basis $q(x) = (\log x, x)^{\top}$; right panel: generalized gamma with p = 1 (Weibull), fitted basis $q(x) = (\log x, x)^{\top}$.]
Fig. 7.2 Power curves of the ELRT, Wald-DRM, and CoxPH; the ELRT and Wald-DRM are based on misspecified DRMs; the parameter setting 0 corresponds to the null model and the settings 1–5 correspond to alternative models

[Figure 7.3: kernel density estimates; panel (a) MOT samples (MOT 0; MOT 1 with $c_1 = 5$; MOT 2 with $c_2 = 4$), density against modulus of tension ($10^3$ psi); panel (b) MOR samples (MOR 0, MOR 1, MOR 2), density against modulus of rupture ($10^3$ psi).]
Fig. 7.3 Kernel density plots of MOT and MOR samples

We fit a DRM with basis function $q(x) = (\log x, x, x^2)^{\top}$ to the samples. The quantile estimates are similar when other two- and three-dimensional basis functions are used, although the estimates of $\theta$ are quite different.
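As an illustration only, such a fit could be reproduced with the hypothetical dpel sketch of Sect. 7.2.1 (the authors' own implementation is the drmdel package mentioned in Sect. 7.6); the vectors mot0, mot1 and mot2 below stand for the observed, uncensored MOT values and are placeholders.

```r
## Sketch of fitting the DRM with q(x) = (log x, x, x^2)^T to the MOT samples.
qfun <- function(x) c(log(x), x, x^2)
x  <- list(mot0, mot1, mot2)        # uncensored observations of MOT 0, 1, 2
n  <- c(80, 80, 80)                 # full sample sizes
cc <- c(Inf, 5.0, 4.0)              # censoring points in 10^3 psi (MOT 0 uncensored)
m  <- length(x) - 1
fit <- optim(rep(0, m * 4), dpel, x = x, n = n, cc = cc, qfun = qfun,
             method = "BFGS", control = list(fnscale = -1))
theta_hat <- rbind(0, matrix(fit$par, nrow = m, byrow = TRUE))  # rows (alpha_r, beta_r)
```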
We compute the EL–DRM and EM quantile estimates ($\hat\xi_r$ and $\check\xi_r$), along with their estimated standard deviations, at $\tau = 0.05$, 0.1, 0.3, 0.5 and 0.6. Note that

Table 7.10 Quantile estimates for MOT populations (unit: $10^3$ psi)
Columns: $F_r$; $\tau$; $\hat\xi_r$; $\hat\sigma(\hat\xi_r)$; $\check\xi_r$; $\hat\sigma(\check\xi_r)$
MOT 0 0.05 2.11 3.38 2.08 5.33
0.10 2.47 3.15 2.47 3.67
0.30 3.45 2.82 3.28 2.58
0.50 4.11 2.68 4.11 3.12
0.60 4.36 3.21 4.38 3.31
MOT 1 0.05 2.11 2.49 2.10 2.51
0.10 2.45 2.50 2.30 2.85
0.30 3.33 3.61 3.48 3.85
0.50 4.19 3.27 4.15 3.11
0.60 4.60 4.13 4.55 4.89
MOT 2 0.05 1.72 5.17 1.66 5.84
0.10 2.29 3.82 2.19 3.97
0.30 3.38 3.24 3.37 3.70
0.50 4.09 2.85 NA NA
0.60 4.33 6.36 NA NA
$\hat\xi_r$: EL–DRM quantile estimates; $\check\xi_r$: EM quantile estimates; $\hat\sigma$: estimated standard deviation; NA: $\check\xi_r$ not well-defined

only 47.5 % of the observations in the third MOT sample are uncensored, invalidating the EM estimator at $\tau = 0.5$ and 0.6. The EL–DRM estimator, however, is effective at all $\tau$ values.

The results are shown in Table 7.10. The EL–DRM and EM quantile estimates are close. Except in two cases, we see that $\hat\sigma(\hat\xi_r)$, the estimated standard error of $\hat\xi_r$, is smaller than $\hat\sigma(\check\xi_r)$, the estimated standard error of $\check\xi_r$. On average, $\hat\sigma(\hat\xi_r)$ is 12 % smaller than $\hat\sigma(\check\xi_r)$. Such an efficiency gain is likely to be real, as implied by our earlier simulation studies.
We next analyze the non-censored MOR samples of sizes 282, 98, and 445, denoted MOR 0, MOR 1 and MOR 2 in the kernel density plots shown in Fig. 7.3b. There are no obvious large outliers.

In addition to being robust, the EL–DRM estimator based on Type I censored data is believed to lose little efficiency, compared to that based on full data, for lower quantile estimation. To illustrate this point, we induced right censoring in all the MOR samples around their 85 % quantiles at 8.4, 8.3 and 8.3 ($\times 10^3$ psi). We fitted a DRM with basis function $q(x) = (\log x, x, x^2)^{\top}$ to the censored and full samples, respectively.
The quantile estimates for the MOR populations and the corresponding estimated standard deviations ($\hat\sigma$) are shown in Table 7.11. The three estimators give similar quantile estimates. The standard deviations of $\hat\xi_r$ based on the censored observations are close to the ones based on the full data: the average of the ratio $\hat\sigma(\hat\xi_r)/\hat\sigma(\hat\xi_r^{(f)})$ is 0.98. Moreover, $\hat\sigma(\hat\xi_r)$ and $\hat\sigma(\hat\xi_r^{(f)})$ are 16 % and 18 % smaller than $\hat\sigma(\check\xi_r)$ on average. In conclusion, the EL–DRM estimator based on the censored data is almost as

Table 7.11 Quantile estimates for MOR populations (unit: $10^3$ psi)
Columns: $F_r$; $\tau$; $\hat\xi_r$; $\hat\sigma(\hat\xi_r)$; $\hat\xi_r^{(f)}$; $\hat\sigma(\hat\xi_r^{(f)})$; $\check\xi_r$; $\hat\sigma(\check\xi_r)$
MOR 0 0.05 4.52 3.62 4.53 3.70 4.40 5.50
0.10 4.96 3.54 5.00 3.44 5.00 4.10
0.30 5.99 2.69 6.02 2.55 6.05 2.82
0.50 6.74 3.06 6.72 2.84 6.67 3.14
0.60 7.11 3.56 7.06 3.16 7.23 4.44
MOR 1 0.05 4.57 5.86 4.55 5.61 4.70 8.01
0.10 5.03 5.35 4.97 5.20 5.17 4.92
0.30 5.93 3.76 5.93 3.89 5.88 3.86
0.50 6.59 4.25 6.62 4.41 6.46 5.32
0.60 6.88 4.71 6.95 4.79 6.92 6.46
MOR 2 0.05 3.54 4.08 3.58 3.91 3.58 4.43
0.10 4.03 3.91 4.03 3.84 4.08 3.74
0.30 5.36 3.25 5.34 3.18 5.32 3.58
0.50 6.26 2.82 6.21 2.80 6.27 3.35
0.60 6.72 3.08 6.72 3.13 6.76 3.01
$\hat\xi_r$ and $\hat\xi_r^{(f)}$: EL–DRM quantile estimates based on censored and full data; $\check\xi_r$: EM quantile estimates; $\hat\sigma$: estimated standard deviation

efficient as the one based on the full data, and both are more efficient than the EM
estimator.

7.10 Concluding Remarks

We have developed EL–DRM-based statistical methods for multiple samples with


Type I censored observations. The proposed EL ratio test is shown to have a simple
chi-square limiting distribution under the null model of a composite hypothesis
about the DRM parameters. The limiting distribution of the EL ratio under a class of
local alternative models is shown to be non-central chi-square. This result is useful
for approximating the power of the EL ratio test and calculating the required sample
size for achieving a certain power. Simulations show that the proposed EL ratio test
has a superior power compared to some semiparametric competitors under a wide
range of population distribution settings.
The proposed CDF and quantile estimators are shown to be asymptotically
normal, as well as to be more efficient and have a broader range of consistent
estimation than their empirical counterparts. Extensive simulations support these
theoretical findings. The advantages of the new methods are particularly remarkable
for lower quantiles. Simulation results also suggest that the proposed method is
robust to mild model misspecifications and useful when data are corrupted by large
outliers.

This work is motivated by a research project on the long-term monitoring


of lumber quality. The proposed methods have broad applications in reliability
engineering and medical studies, where Type I censored samples are frequently
encountered.

References

Bahadur RR (1966) A note on quantiles in large samples. Ann Math Stat 37(3):577–580
Cai S, Chen J, Zidek JV (2016) Hypothesis testing in the presence of multiple samples under
density ratio models. In press Stat Sin
Chen H, Chen J (2000) Bahadur representation of the empirical likelihood quantile process. J
Nonparametric Stat 12:645–665
Chen J, Liu Y (2013) Quantile and quantile-function estimations under density ratio model. Ann
Stat 41(3):1669–1692
Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B (Stat Methodol) 34(2):187–
220
Fokianos K (2004) Merging information for semiparametric density estimation. J R Stat Soc Ser B
(Stat Methodol) 66(4):941–958
Fokianos K, Kaimi I (2006) On the effect of misspecifying the density ratio model. Ann Inst Stat
Math 58:475–497
Marshall AW, Olkin I (2007) Life distributions – structure of nonparametric, semiparametric and
parametric families. Springer, New York
Owen AB (2001) Empirical likelihood. Chapman & Hall, New York
Qin J (1998) Inferences for case-control and semiparametric two-sample density ratio models.
Biometrika 85(3):619–630
Qin J, Zhang B (1997) A goodness-of-fit test for the logistic regression model based on case-control
data. Biometrika 84:609–618
Silverman BW (1986) Density estimation for statistics and data analysis, 1st edn. Chapman & Hall,
Boca Raton
Stacy EW (1962) A generalization of the gamma distribution. Ann Math Stat 33(3):1187–1192
Wang C, Tan Z, Louis TA (2011) Exponential tilt models for two-group comparison with censored
data. J Stat Plan Inference 141:1102–1117
Zhang B (2002) Assessing goodness-of-fit of generalized logit models based on case-control data.
J Multivar Anal 82:17–38
Chapter 8
Recent Development in the Joint Modeling
of Longitudinal Quality of Life Measurements
and Survival Data from Cancer Clinical Trials

Hui Song, Yingwei Peng, and Dongsheng Tu

Abstract In cancer clinical trials, longitudinal Quality of Life (QoL) measurements and survival time on a patient may be analyzed by joint models, which provide more efficient estimation than modeling QoL data and survival time separately, especially when there is strong association between the longitudinal measurements and the survival time. Most joint models in the literature assume a classical linear mixed model for the longitudinal measurements and Cox's proportional hazards model for the survival times. The linear mixed model with normally distributed random components may not be sufficient to model bounded QoL measurements. Moreover, when some patients are immune to recurrence or relapse and can be viewed as cured, the proportional hazards model is not suitable for the survival times. In this paper, we review some recent developments in joint models to deal with bounded longitudinal QoL measurements and survival times with a possible cure fraction. One such joint model assumes a linear mixed $t$ model for the longitudinal measurements and a promotion time cure model for the survival data, and the two parts are linked through a latent variable. Another joint model employs a simplex distribution to model the bounded QoL measurements and a classical proportional hazards model for the survival times, and the two parts share a random effect. Semiparametric estimation methods have been proposed to estimate the parameters in these models. The models are illustrated with QoL measurements and recurrence times from a clinical trial on women with early breast cancer.

H. Song ()
School of Mathematical Sciences, Dalian University of Technology, 116024, Dalian,
Liaoning, China
e-mail: [email protected]
Y. Peng • D. Tu
Departments of Public Health Sciences and Mathematics and Statistics, Queens University, K7L
3N6, Kingston, ON, Canada
e-mail: [email protected]; [email protected]


8.1 Introduction

In cancer clinical trials, the patient’s quality of life (QoL) is an important subjective
endpoint beyond the traditional objective endpoints such as tumor response and
relapse-free or overall survival. Specially, when the improvement in survival may be
limited by a new treatment for a specific type of cancer, patients’ QoL is important
to determine whether this new treatment is useful (Richards and Ramirez 1997).
Several studies (Dancey et al. 1997; Ganz et al. 2006) have also found that QoL
measurements such as overall QoL, physical well-being, mood and pain are of
prognostic importance for patients with cancer, so they may help to make a treatment
decision.
The quality of life of cancer patients is usually assessed by a questionnaire,
which consists of a large number of questions assessing various aspects of quality
of life, at different timepoints before, during, and after their cancer treatments. The
score for a specific domain or scale of QoL is usually calculated as the mean of
the answers from a patient to a set of the questions which define this QoL domain
or scale and, therefore, can be considered as a continuous measurement. In this
chapter, we only consider measurements of a specific QoL domain or scale, which
can be defined as $Y_i = (Y_{i1}, \ldots, Y_{in_i})^T$, where $Y_{ij}$ is the measurement of this QoL domain or scale from the $i$th subject at the $j$th occasion, $j = 1, \ldots, n_i$, $i = 1, \ldots, n$.
These longitudinal QoL measurements from cancer clinical trials can be analyzed
by standard statistical methods for repeated measurements, such as linear mixed
models (Fairclough 2010). These models provide valid statistical inference when
complete longitudinal measurements from all patients are available or the missing
longitudinal measurements can be assumed missing at random. In cancer clinical
trials, some seriously ill patients who have worse QoL may drop out of the study
because of disease recurrence or death. The QoL measurements are not available
from these patients. In this case, dropping out is directly related to what is being
measured, and the missing of QoL measurements caused by the dropout of these
patients is informative and may not be assumed as missing at random. Application
of a standard longitudinal data analysis in this context could give biased results.
Tu et al. (1994) analyzed the longitudinal QoL data of a breast cancer clinical trial
based on a nonparametric test that takes informative censoring into account.
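For reference, the standard linear-mixed-model analysis mentioned above can be fitted with the lme4 package in R; the data frame qol_long and its column names below are hypothetical, and, as discussed in this paragraph, such a fit is valid only when the missing QoL measurements can be treated as missing at random.

```r
## Sketch of a standard linear mixed model for longitudinal QoL scores (MAR assumed).
library(lme4)
## qol_long: one row per (patient, visit); columns qol, visit_time, treatment, id
fit <- lmer(qol ~ treatment * visit_time + (1 + visit_time | id), data = qol_long)
summary(fit)   # fixed effects for treatment and time trend; random intercept and slope
```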
Patients’ poor QoL can lead to both informative dropout for QoL measurements
and censoring in survival time. Let $T_i^0$ be the survival time and $\delta_i$ be the censoring indicator of the $i$th subject. A joint modeling framework for the longitudinal QoL measurements $Y_i$ and the survival data $(T_i^0, \delta_i)$ allows modeling the longitudinal measurements and the survival time jointly to accommodate the association
between them. It may provide a valid inference for data with missing measurements
caused by the dropout of patients. As pointed out in an introductory review by
Ibrahim et al. (2010), Tsiatis and Davidian (2004) and Diggle et al. (2008), joint
analysis of longitudinal QoL data together with data on a time to event outcome,
such as disease-free or overall survival would also provide more efficient estimates
of the treatment effects on both QoL and time to event endpoints and also reduce
bias in the estimates of overall treatment effect.

Since Schluchter (1992) proposed the analysis of the longitudinal measurements


jointly with the dropout time by a multivariate normal distribution, many approaches
have been proposed for jointly modeling longitudinal measurements and survival
data. Some recent reviews can be found in, for example, Wu et al. (2012) and Gould
et al. (2015).
In this paper, we review some of our recent contributions in this area, which were motivated by the analysis of data from a clinical trial on early breast cancer. We developed new joint models which take into account features of the QoL and survival data observed in the clinical trial. The models employ the Student's t distribution for the random components of the model for the longitudinal QoL data to better accommodate extreme observations. The simplex distribution is considered to accommodate longitudinal bounded QoL data. A promotion time cure model is considered for the survival data to allow a possible cure fraction in the patients. To simplify the model estimation, a penalized joint likelihood generated by the Laplace approximation is employed to estimate the parameters in the models. The details of these new models and the statistical procedures for the inference of parameters in these models are reviewed in the following sections, followed by an illustration with the breast cancer data. The paper ends with conclusions and discussion.

8.2 QoL and Survival Data from a Breast Cancer Clinical Trial
The data that motivated our recent contributions are from a clinical trial (MA.5) on
early breast cancer conducted by the NCIC Clinical Trials Group, which compared two chemotherapy regimens for early breast cancer, a new and intensive treatment of
cyclophosphamide, epirubicin, and fluorouracil (CEF) with the standard treatment
of cyclophosphamide, methotrexate, and fluorouracil (CMF) (Levine et al. 1998,
2005). In this study, 356 patients were randomly assigned to CEF and 360 patients
to the CMF arm. Both CEF and CMF were administered monthly for six months.
The survival time of interest is the time to relapse, or relapse-free survival (RFS) time, which was the primary endpoint of this trial. The median follow-up time of all patients is 59 months, and there are 169 and 132 uncensored RFS times from patients randomized to CEF and CMF, respectively, in the data set updated in 2002. The difference in 10-year relapse-free survival between the two treatment arms is 7 % (52 % on CEF and 45 % on CMF), which was considered moderate and may not completely determine the relative advantages of CEF over CMF. Therefore, information on QoL may be helpful in the process of decision-making (Brundage et al. 2005).
In MA.5, the quality of life of patients undergoing chemotherapy treatment was assessed by the self-answered Breast Cancer Questionnaire (BCQ), which consists of 30 questions measuring different dimensions of QoL and was administered at each of the clinical visits (every month during the treatment and then every 3 months after the completion of the treatment) until the end of the second year or until recurrence or death, whichever came first. The specific QoL scale of interest is the global QoL of patients, defined as the mean of the answers to all 30 questions. Because of the dropout of patients due to recurrence or death, joint analysis of the RFS and the global QoL would provide more robust and also more efficient inference on the difference in global QoL between the two treatment groups.
Joint modeling of longitudinal QoL measures and survival data was studied
previously by Zeng and Cai (2005). They assumed a linear mixed effects model for the longitudinal QoL measurements $Y_i$ and a multiplicative hazards model for the survival time $T_i^0$. These two models share a vector of common random effects, which reflects the unobserved heterogeneity across subjects, and its coefficient in the hazard model characterizes the correlation between the longitudinal QoL measurements and the survival times. They implemented the EM algorithm to calculate the maximum likelihood estimates. This simultaneous joint model with shared random effects established a basic framework for the joint modeling of longitudinal QoL measurements and survival data, but in the process of applying it to the data from MA.5, we found it necessary to extend this framework for several reasons. One reason is that, for early breast cancer, with advances in the development of new cancer treatments, the existence of cured patients becomes possible and, therefore, the models for the survival times need to take this into consideration. This can also be seen from the plateau of the Kaplan-Meier survival curves of the two treatments
shown in Fig. 8.1. Another reason is that QoL measurements may be restricted to an interval. For example, since each question on the BCQ is on a Likert scale from 0 to 7, with the best outcome marked as 7, the minimum and maximum of the scores for the global QoL are 0 and 7, respectively. Such data are often called bounded data (Aitchison 1986). In total there are 7769 QoL measurements from the two arms.

Fig. 8.1 Kaplan-Meier survival curves of patients in the two treatment groups of MA.5 (survival probability versus time to relapse in months, for the CMF and CEF arms)

8.3 Modeling QoL and Survival Data with a Cure Fraction

As mentioned above, Zeng and Cai (2005) considered the change in longitudinal
QoL measures over time and the time to event as two simultaneous processes. They
used a classical linear mixed model for the longitudinal QoL measurements $Y_i$ and a Cox proportional hazards model with random effects for the survival time $T_i^0$, and assumed that the random effects in these two models are the same. When there are potentially cured patients, this simultaneous joint model may not be suitable for modeling longitudinal QoL measures and survival times. To accommodate a possible cure fraction in the data, we proposed to use a linear mixed t model for the longitudinal QoL measurements $Y_i$ and a promotion time cure model for $T_i^0$ (Song et al. 2012). These two models are connected through a latent gamma variate, which defines a joint model for the longitudinal QoL measurements and survival time. Let $\xi_i$ be the latent gamma variate with density function $\pi(\xi_i|\eta,\eta)$ such that $E[\xi_i] = 1$ and $\mathrm{var}[\xi_i] = 1/\eta$. Given $\xi_i$, the longitudinal QoL measurements $Y_i$ are assumed to follow the linear mixed t model

$$Y_i = X_i\beta + \tilde{X}_i\alpha_i + e_i, \qquad (8.1)$$

where $X_i$ and $\tilde{X}_i$ are the design matrices of covariates for the fixed effects $\beta$ and the random effects $\alpha_i$, respectively, and $e_i$ is the random error. Given $\xi_i$, $\alpha_i$ and $e_i$ are assumed to be independent normal random variates with $\alpha_i|\xi_i \sim N(0, \sigma_\alpha^2/\xi_i)$ and $e_i|\xi_i \sim N(0, I_{n_i\times n_i}\sigma_e^2/\xi_i)$.

It can be shown that the marginal distributions of $\alpha_i$ and $e_i$ are respectively $t(0, \sigma_\alpha^2, 2\eta)$ and $t_{n_i}(0, \sigma_e^2 I_{n_i\times n_i}, 2\eta)$, where $t_{n_i}(0, \sigma_e^2 I_{n_i\times n_i}, 2\eta)$ is an $n_i$-dimensional t distribution (Pinheiro et al. 2001) with mean 0, positive definite scale matrix $\sigma_e^2 I_{n_i\times n_i}$ and $2\eta$ degrees of freedom, and $t(0, \sigma_\alpha^2, 2\eta)$ is the corresponding univariate distribution (Cornish 1954; Lange et al. 1989). Therefore, Eq. (8.1) is often called the linear mixed t model (Song 2007; Song et al. 2007; Zhang et al. 2009). Since the marginal variances of $\alpha_i$ and $e_i$ are respectively $\mathrm{var}[\alpha_i] = \frac{\eta}{\eta-1}\sigma_\alpha^2$ and $\mathrm{var}[e_i] = \frac{\eta}{\eta-1}\sigma_e^2 I_{n_i\times n_i}$, the condition $\eta > 1$ is required in the joint model to guarantee the existence of these variances.
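To make this hierarchical construction concrete, the following minimal Python sketch (ours, for illustration only; the parameter values are arbitrary and not related to MA.5) simulates the latent gamma variate and the conditional normal components, and checks numerically that the marginal variance is close to $\frac{\eta}{\eta-1}\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative values (not estimates from MA.5)
eta, sigma_a, sigma_e = 3.0, 0.4, 0.3
n_subjects = 200_000

# Latent gamma variate with mean 1 and variance 1/eta
xi = rng.gamma(shape=eta, scale=1.0 / eta, size=n_subjects)

# Random effect and one error component given xi: normal with variance divided by xi
alpha = rng.normal(0.0, sigma_a / np.sqrt(xi))
e = rng.normal(0.0, sigma_e / np.sqrt(xi))

# Marginally, alpha and e follow t distributions with 2*eta degrees of freedom,
# so their variances should be close to eta/(eta - 1) times the scale parameters.
print(np.var(alpha), eta / (eta - 1) * sigma_a**2)
print(np.var(e), eta / (eta - 1) * sigma_e**2)
```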
Given $\xi_i$, the survival time $T_i^0$ in the joint model is assumed to follow a conditional distribution with the survival function

$$S_{pop}(t|Z_i, \xi_i) = \exp[-\xi_i\theta(Z_i)F_0(t)], \qquad (8.2)$$

where $F_0(t)$ is an arbitrary proper distribution function, $\theta(Z_i) = \exp[\gamma^T Z_i]$ with $Z_i$ the vector of covariates, which may have partial or complete overlap with $X_i$, and $\gamma$ is the corresponding vector of regression coefficients. Eq. (8.2) is often called the promotion time cure model (Chen et al. 1999; Yakovlev et al. 1993; Yin 2005). The unconditional cure probability for a subject with covariates $Z_i$ under this model is $\lim_{t\to\infty}\int_0^{\infty} S_{pop}(t|Z_i,\xi_i)\pi(\xi_i|\eta,\eta)\,d\xi_i = (\theta(Z_i)/\eta + 1)^{-\eta}$.

Equations (8.1) and (8.2) are connected by the latent gamma variate $\xi_i$, which acts multiplicatively on the hazard function of $T_i^0$ and on the variances of $\alpha_i$ and $e_{ij}$ in the model for $Y_i$. These two equations together define a joint model for the longitudinal QoL measures and survival times, referred to as JMtt.
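The cure fraction implied by (8.2) after integrating out the gamma variate can be computed directly. The sketch below is a minimal illustration: the coefficients, the exponential stand-in for $F_0$, and the covariate values are all hypothetical and are not quantities from MA.5.

```python
import numpy as np

def marginal_survival(t, z, gamma, eta, F0):
    """Population survival after integrating the Gamma(eta, eta) variate out of (8.2):
    (1 + theta(z) * F0(t) / eta) ** (-eta)."""
    theta = np.exp(np.dot(gamma, z))
    return (1.0 + theta * F0(t) / eta) ** (-eta)

def cure_probability(z, gamma, eta):
    """Limit of the marginal survival as t -> infinity, using F0(inf) = 1."""
    theta = np.exp(np.dot(gamma, z))
    return (1.0 + theta / eta) ** (-eta)

# Hypothetical values: an intercept plus a binary treatment indicator
gamma = np.array([0.1, -0.3])
eta = 2.0
F0 = lambda t: 1.0 - np.exp(-0.05 * t)   # a proper distribution standing in for F0(t)

for trt in (0, 1):
    z = np.array([1.0, trt])
    print(trt, marginal_survival(120.0, z, gamma, eta, F0), cure_probability(z, gamma, eta))
```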
Conditional on $\alpha_i$ and $\xi_i$, the contribution of $y_i$ to the likelihood under (8.1) is

$$l_{li}(y_i|\alpha_i,\xi_i) = \frac{1}{\sqrt{(2\pi\sigma_e^2/\xi_i)^{n_i}}}\exp\Big[-\frac{1}{2\sigma_e^2/\xi_i}(y_i - X_i\beta - \tilde{X}_i\alpha_i)^T(y_i - X_i\beta - \tilde{X}_i\alpha_i)\Big]$$

and the contribution of $(t_i, \delta_i)$ to the likelihood under (8.2) and independent censoring is

$$l_{si}(t_i|\alpha_i,\xi_i) = [\xi_i\theta(Z_i)f_0(t_i)]^{\delta_i}\exp[-\xi_i\theta(Z_i)F_0(t_i)].$$

The observed likelihood for the proposed joint model based on all $n$ subjects is therefore

$$l = \prod_{i=1}^{n}\iint l_{li}(y_i|\alpha_i,\xi_i;\sigma_e^2,\beta)\, l_{si}(t_i|\alpha_i,\xi_i;\gamma,F_0(t))\, \varphi(\alpha_i|0,\sigma_\alpha^2/\xi_i)\, \pi(\xi_i|\eta,\eta)\, d\alpha_i\, d\xi_i,$$

where $\varphi(\cdot|0,\sigma_\alpha^2/\xi_i)$ is the density function of the normal distribution with mean 0 and variance $\sigma_\alpha^2/\xi_i$, and $\pi(\xi_i|\eta,\eta)$ is the density function of the gamma distribution with mean 1 and variance $1/\eta$. The unknown parameters in this model are denoted as $\Omega = (\beta, \gamma, \eta, \sigma_\alpha, \sigma_e, F_0(t))$.
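Because the observed likelihood integrates over both $\alpha_i$ and $\xi_i$, it can help to see one subject's contribution evaluated numerically. The sketch below does this by simple Monte Carlo, drawing $\xi_i$ from its gamma density and $\alpha_i$ from its conditional normal density; all inputs are made up, the random effect is taken to be a random intercept, and in practice the EM algorithm described next is used rather than brute-force integration.

```python
import numpy as np

rng = np.random.default_rng(1)

def subject_likelihood_mc(y, X, t, delta, beta, theta_z, eta,
                          sig_a, sig_e, F0, f0, n_draws=200_000):
    """Monte Carlo approximation of one subject's observed-likelihood contribution,
    taking tilde X_i to be a vector of ones (random intercept only)."""
    n_i = len(y)
    xi = rng.gamma(eta, 1.0 / eta, n_draws)            # xi_i ~ Gamma(eta, eta), mean 1
    alpha = rng.normal(0.0, sig_a / np.sqrt(xi))       # alpha_i | xi_i ~ N(0, sig_a^2 / xi_i)
    resid = (y - X @ beta)[None, :] - alpha[:, None]   # y_i - X_i beta - alpha_i
    rss = np.sum(resid**2, axis=1)
    l_li = (2 * np.pi * sig_e**2 / xi) ** (-n_i / 2) * np.exp(-xi * rss / (2 * sig_e**2))
    l_si = (xi * theta_z * f0(t)) ** delta * np.exp(-xi * theta_z * F0(t))
    return np.mean(l_li * l_si)

# Made-up inputs for one subject: three QoL measurements and a censored event time
y = np.array([0.6, 0.5, 0.7])
X = np.column_stack([np.ones(3), [0.0, 3.0, 6.0]])     # intercept and assessment time
beta = np.array([0.55, 0.01])
F0 = lambda t: 1.0 - np.exp(-0.05 * t)                 # stand-in for F0(t)
f0 = lambda t: 0.05 * np.exp(-0.05 * t)                # its density
print(subject_likelihood_mc(y, X, t=24.0, delta=0, beta=beta, theta_z=np.exp(0.2),
                            eta=3.0, sig_a=0.4, sig_e=0.3, F0=F0, f0=f0))
```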
The EM algorithm is used to obtain the maximum likelihood estimates of the parameters (Klein 1992). If $\alpha_i$ and $\xi_i$ were observed, the complete log-likelihood from the joint model would be

$$L_c = L_l + L_s + L_\alpha + L_\xi,$$

where

$$L_l = \sum_{i=1}^{n}\Big[\frac{n_i}{2}(\log\xi_i - \log\sigma_e^2) - \frac{\xi_i}{2\sigma_e^2}(y_i - X_i\beta - \tilde{X}_i\alpha_i)^T(y_i - X_i\beta - \tilde{X}_i\alpha_i)\Big],$$
$$L_s = \sum_{i=1}^{n}\Big[\delta_i\big(\log f_0(t_i) + \log\xi_i + \log\theta(Z_i)\big) - \xi_i\theta(Z_i)F_0(t_i)\Big],$$
$$L_\alpha = \sum_{i=1}^{n}\Big[\frac{1}{2}(\log\xi_i - \log\sigma_\alpha^2) - \frac{\xi_i\alpha_i^2}{2\sigma_\alpha^2}\Big],$$
$$L_\xi = \sum_{i=1}^{n}\Big[\eta\log\eta - \log\Gamma(\eta) + (\eta - 1)\log\xi_i - \eta\xi_i\Big].$$
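For one subject, the four pieces of the complete log-likelihood can be written down directly once $\alpha_i$ and $\xi_i$ are treated as known. The sketch below is a minimal illustration with a random intercept, additive constants dropped, and made-up inputs; it simply mirrors the four terms above.

```python
import numpy as np
from scipy.special import gammaln

def complete_loglik_subject(y, X, alpha, xi, t, delta, theta_z, beta,
                            sig_a2, sig_e2, eta, F0, f0):
    """One subject's contribution to L_c = L_l + L_s + L_alpha + L_xi,
    with additive constants dropped and a random intercept (tilde X_i = ones)."""
    n_i = len(y)
    resid = y - X @ beta - alpha
    L_l = 0.5 * n_i * (np.log(xi) - np.log(sig_e2)) - xi * resid @ resid / (2 * sig_e2)
    L_s = delta * (np.log(f0(t)) + np.log(xi) + np.log(theta_z)) - xi * theta_z * F0(t)
    L_a = 0.5 * (np.log(xi) - np.log(sig_a2)) - xi * alpha**2 / (2 * sig_a2)
    L_x = eta * np.log(eta) - gammaln(eta) + (eta - 1) * np.log(xi) - eta * xi
    return L_l + L_s + L_a + L_x

# Made-up inputs for a single subject
y = np.array([0.6, 0.5, 0.7])
X = np.column_stack([np.ones(3), [0.0, 3.0, 6.0]])
F0 = lambda t: 1.0 - np.exp(-0.05 * t)
f0 = lambda t: 0.05 * np.exp(-0.05 * t)
print(complete_loglik_subject(y, X, alpha=0.1, xi=0.9, t=24.0, delta=1,
                              theta_z=np.exp(0.2), beta=np.array([0.55, 0.01]),
                              sig_a2=0.16, sig_e2=0.09, eta=3.0, F0=F0, f0=f0))
```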
Denote the entire observed data as $O = \{y_i, t_i, \delta_i, X_i, \tilde{X}_i, Z_i\}$ and let $\Omega_k$ be the estimate of $\Omega$ in the $k$th iteration. The E-step in the $(k+1)$th iteration of the EM algorithm computes the conditional expectation of $L_c$ with respect to $\alpha_i$ and $\xi_i$, which is equivalent to evaluating the following four conditional expectations:

$$E[L_l|\Omega_k, O] = \sum_{i=1}^{n}\Big\{\frac{n_i}{2}(\tilde{a}_i - \log\sigma_e^2) - \frac{1}{2\sigma_e^2}\big[(y_i - X_i\beta)^T(y_i - X_i\beta)\tilde{r}_i - 2(y_i - X_i\beta)^T\tilde{X}_i\tilde{g}_i + \tilde{X}_i^T\tilde{X}_i\tilde{b}_i\big]\Big\};$$
$$E[L_s|\Omega_k, O] = \sum_{i=1}^{n}\Big\{\delta_i\big[\log f_0(t_i) + \tilde{a}_i + \log\theta(Z_i)\big] - \theta(Z_i)F_0(t_i)\tilde{r}_i\Big\};$$
$$E[L_\alpha|\Omega_k, O] = \sum_{i=1}^{n}\Big\{\frac{1}{2}\big[\tilde{a}_i - \log\sigma_\alpha^2\big] - \frac{\tilde{b}_i}{2\sigma_\alpha^2}\Big\};$$
$$E[L_\xi|\Omega_k, O] = \sum_{i=1}^{n}\Big[\eta\log\eta - \log\Gamma(\eta) + (\eta - 1)\tilde{a}_i - \eta\tilde{r}_i\Big],$$
where $\tilde{a}_i$, $\tilde{b}_i$, $\tilde{g}_i$ and $\tilde{r}_i$ are given in the Appendix of Song et al. (2012). In the M-step of the EM algorithm, maximizing $E[L_l|\Omega_k, O]$, $E[L_\alpha|\Omega_k, O]$, and $E[L_\xi|\Omega_k, O]$ can be accomplished through the Newton-Raphson method. To update the parameters $(\gamma, F_0(t))$ in $E[L_s|\Omega_k, O]$, it is assumed that $F_0(t)$ is a proper cumulative distribution function that increases only at the observed event times, and maximizing $E[L_s|\Omega_k, O]$ can be carried out using the semiparametric method of Tsodikov et al. (2003). The maximum likelihood estimates of the parameters are obtained after iterating the E-step and M-step until convergence. To calculate the variance estimates for the parameter estimates, the method of Louis (1982) is employed to obtain the observed information matrix. Extensive simulation studies showed that the proposed joint model and estimation method are more efficient than estimation methods based on separate models and also than the joint model proposed by Zeng and Cai (2005).
Other researchers have also proposed joint models to accommodate a possible cure fraction. Brown and Ibrahim (2003) considered a Bayesian method for a joint model that combines the promotion time cure model (8.2) for the survival data with a mixture model for the longitudinal data. Law et al. (2002) and Yu et al. (2008) proposed joint models which employ another mixture cure model for the survival data and a nonlinear mixed effects model for the longitudinal data, with Bayesian methods to estimate the parameters. Chen et al. (2004) investigated a joint model of multivariate longitudinal measurements and survival data through a Bayesian method, where the promotion time cure model with a log-normal frailty is assumed for the survival data. Due to the complexity of the joint models, the Bayesian methods used in these models can be very tedious. In contrast, the estimation method proposed by Song et al. (2012) for models (8.1) and (8.2) is simple to use and takes less time to obtain estimates due to the simplicity of the conditional distributions of the latent variables in the model, which greatly reduces the complexity of estimation.
8.4 Modeling QoL and Survival Data with the Simplex Distribution

As mentioned in the introduction, although QoL measurements can be considered continuous, their values are usually restricted to an interval and thus are bounded data. Therefore, classical linear mixed models with normal or t distributions may not be proper models for the QoL measurements. Recently, we explored a model which assumes that the longitudinal QoL measurements follow a simplex distribution, and developed estimation procedures for a joint model with simplex-distribution-based models for the longitudinal QoL measurements and Cox proportional hazards models for the survival times (Song et al. 2015). Specifically, given a random effect $\alpha_i$, the density function of a simplex distribution for a longitudinal QoL measurement $Y_{ij}$ has the following form (Barndorff-Nielsen and Jørgensen 1991):
$$f(y|\mu_{ij},\sigma^2) = \begin{cases} a(y;\sigma^2)\exp\Big[-\dfrac{d(y;\mu_{ij})}{2\sigma^2}\Big], & y \in (0,1),\\ 0, & \text{otherwise,}\end{cases} \qquad (8.3)$$

where $a(y;\sigma^2) = [2\pi\sigma^2(y(1-y))^3]^{-1/2}$, $d(y;\mu_{ij}) = \dfrac{(y-\mu_{ij})^2}{y(1-y)\mu_{ij}^2(1-\mu_{ij})^2}$, $0 < \mu_{ij} < 1$ is the location parameter and is equal to the mean of the distribution, and $\sigma^2 > 0$ is the dispersion parameter. The simplex distribution has support on (0, 1), which makes it suitable as a model for bounded data. Further details and properties of the simplex distribution can be found in Song and Tan (2000), Qiu et al. (2008), Jørgensen (1997), and Song (2007). We assume that the mean of the $Y_{ij}$'s satisfies

$$\mathrm{logit}(\mu_{ij}) = X_{ij}^T\beta + \alpha_i, \qquad (8.4)$$

where $X_{ij}$ is a vector of covariates with coefficient $\beta$ and $\alpha_i$ is a random effect which satisfies

$$\alpha_1, \ldots, \alpha_n \overset{i.i.d.}{\sim} \varphi(\alpha|0, \sigma_\alpha^2). \qquad (8.5)$$

This model is called a simplex model for $Y_{ij}$; it allows direct modeling of the effects of $X_{ij}$ on the mean of $Y_{ij}$, a desirable property that the existing joint modeling approaches for bounded longitudinal data do not have. The effect of $X_{ij}$ on the mean, measured by $\beta$, can be easily interpreted in a similar way as the log odds ratio in logistic regression.
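To illustrate the two building blocks just introduced, the short sketch below (our illustration; the covariate vector, coefficients and dispersion value are arbitrary) implements the simplex density (8.3) and the logit mean model (8.4), and checks numerically that the density integrates to one on (0, 1).

```python
import numpy as np
from scipy.integrate import quad

def simplex_pdf(y, mu, sigma2):
    """Simplex density (8.3) on (0, 1) with mean mu and dispersion sigma2."""
    a = (2 * np.pi * sigma2 * (y * (1 - y)) ** 3) ** (-0.5)
    d = (y - mu) ** 2 / (y * (1 - y) * mu**2 * (1 - mu) ** 2)
    return a * np.exp(-d / (2 * sigma2))

def mean_qol(x_ij, beta, alpha_i):
    """Mean mu_ij implied by the logit model (8.4)."""
    lin = np.dot(x_ij, beta) + alpha_i
    return 1.0 / (1.0 + np.exp(-lin))

mu = mean_qol(np.array([1.0, 0.5]), beta=np.array([0.8, -0.2]), alpha_i=0.1)
total, _ = quad(simplex_pdf, 0.0, 1.0, args=(mu, 0.5))
print(mu, total)   # total should be close to 1
```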
For the survival time $T_i^0$, we assume it follows the proportional hazards model

$$h(t|Z_i, \alpha_i) = h_0(t)\exp(Z_i^T\gamma + \nu\alpha_i), \qquad (8.6)$$

where $\gamma$ is the coefficient of $Z_i$, $\nu$ is the coefficient of the random effect $\alpha_i$, and $h_0(t)$ is an arbitrary unspecified baseline hazard function. The assumptions for the $\alpha_i$'s are the same as in (8.5). As before, the random effect $\alpha_i$ in the joint model (8.4) and (8.6) reflects the unobserved heterogeneity in the mean of $Y_{ij}$ and the hazard of $T_i^0$ for different subjects, and $\nu$ characterizes the correlation between the longitudinal bounded measurements and the survival times, which can be seen from the joint density function of $T_i^0$ and $Y_{ij}$:
of Ti0 and Yij

f .t; y/ D h0 .t/Œ2 2 .y.1  y//3 1=2 eZ


T

2 3
Z
T
eX ˇC˛
C1 T .y  /2
.t/eZ  C˛ 1CeX T ˇC˛
e˛H0 exp 4 T
5 '.˛/d˛:
e2.X ˇC˛/
1 2 2 y.1  y/ T
.1CeX ˇC˛ /4

When  D 0, the survival times and longitudinal measurements will be independent


R C1
conditional on the observed covariates since f .t; y/ D f .t/ 1 f .yj˛/'.˛/d˛ D
f .t/f .y/. The above joint model (8.3), (8.4), (8.5), and (8.6) is referred as JMSIM.
Since the simplex distribution is not in an exponential family, the joint model JMSIM is an extension of Viviani et al. (2013), who assumed that the distribution of the longitudinal response is in the exponential family and used a generalized linear mixed model together with the Cox model to construct the joint model. The parameters in the joint model based on the simplex distribution, denoted as $\Omega = (\beta, \nu, \gamma, h_0(t), \sigma^2, \sigma_\alpha^2)$, can be estimated from the marginal log-likelihood of the proposed joint model:

$$l(\Omega) = \sum_{i=1}^{n}\log\int\Bigg[\prod_{j=1}^{n_i} f(y_{ij}|\mu_{ij},\sigma^2)\Bigg]\big[h_0(t_i)e^{Z_i^T\gamma+\nu\alpha_i}\big]^{\delta_i} e^{-H_0(t_i)e^{Z_i^T\gamma+\nu\alpha_i}}\varphi(\alpha_i|0,\sigma_\alpha^2)\,d\alpha_i, \qquad (8.7)$$

where $H_0(t) = \int_0^t h_0(u)\,du$. Since (8.7) involves an integration that does not have a
closed form, it is difficult to maximize it directly. Instead, we considered the Laplace
approximation to (8.7) and obtained the following first-order and the second-order
penalized joint partial log-likelihoods (PJPL):
$$\widetilde{LL}_{PJPL}(\Omega') = -\frac{\sum_{i=1}^{n} n_i}{2}\log\sigma^2 - \frac{n}{2}\log\sigma_\alpha^2 + \sum_{i=1}^{n} PJPL(\hat{\alpha}_i),$$
$$LL_{PJPL}(\Omega') = -\frac{\sum_{i=1}^{n} n_i}{2}\log\sigma^2 - \frac{n}{2}\log\sigma_\alpha^2 + \sum_{i=1}^{n} PJPL(\hat{\alpha}_i) - \frac{1}{2}\sum_{i=1}^{n}\log\big|PJPL^{(2)}(\hat{\alpha}_i)\big|,$$
where $\Omega' = (\beta, \nu, \gamma, \sigma^2, \sigma_\alpha^2)$,

$$PJPL(\alpha_i) = -\frac{1}{2\sigma^2}\sum_{j=1}^{n_i} d(y_{ij};\mu_{ij}) + \delta_i\Big[Z_i^T\gamma + \nu\alpha_i - \log\Big(\sum_{k\in R(t_i)} e^{Z_k^T\gamma+\nu\alpha_k}\Big)\Big] - \frac{\alpha_i^2}{2\sigma_\alpha^2},$$

$$PJPL^{(2)}(\alpha_i) = -\frac{1}{2\sigma^2}\sum_{j=1}^{n_i}\frac{\partial^2 d(y_{ij};\mu_{ij})}{\partial\alpha_i^2} - \delta_i\nu^2\,\frac{e^{Z_i^T\gamma+\nu\alpha_i}}{\sum_{k\in R(t_i)} e^{Z_k^T\gamma+\nu\alpha_k}}\Bigg(1 - \frac{e^{Z_i^T\gamma+\nu\alpha_i}}{\sum_{k\in R(t_i)} e^{Z_k^T\gamma+\nu\alpha_k}}\Bigg) - \frac{1}{\sigma_\alpha^2},$$
where $R(t_i)$ denotes the risk set at time $t_i$ and $\hat{\alpha}_i = \arg\max_{\alpha_i} PJPL(\alpha_i)$. Following Ripatti and Palmgren (2000), Ye et al. (2008), and Rondeau et al. (2003), omitting the complicated term $\log|PJPL^{(2)}(\hat{\alpha}_i)|$ in $LL_{PJPL}(\Omega')$ has a negligible effect on the parameter estimation but makes the computation faster. The parameters $\beta, \gamma$ can be updated by maximizing $\widetilde{LL}_{PJPL}(\Omega')$; then $\sigma^2, \sigma_\alpha^2, \nu$ can be updated, with the current estimates of $\beta, \gamma$, by maximizing $LL_{PJPL}(\Omega')$. The maximum likelihood estimate of $\Omega'$ is obtained by iterating these two steps until convergence, and $h_0(t)$ can be estimated using Breslow's estimator (Breslow 1972). The variances of these parameter estimates are approximated by inverting the information matrices of $\widetilde{LL}_{PJPL}(\Omega')$ for the parameters $(\beta, \gamma)$ and of $LL_{PJPL}(\Omega')$ for the parameters $(\sigma^2, \sigma_\alpha^2, \nu)$.
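The penalized likelihoods above arise from a Laplace approximation of the integral over each random effect. The generic first-order idea, $\int e^{g(\alpha)}\,d\alpha \approx e^{g(\hat{\alpha})}\sqrt{2\pi/|g''(\hat{\alpha})|}$ with $\hat{\alpha} = \arg\max_\alpha g(\alpha)$, is illustrated below on a simple one-dimensional integrand; the function $g$ is made up for illustration and is not the PJPL itself.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.integrate import quad

# A made-up integrand playing the role of exp(PJPL(alpha_i)) for one subject
g = lambda a: -0.25 * a**4 - 0.5 * a**2 + a
g2 = lambda a: -3.0 * a**2 - 1.0              # analytic second derivative of g

# Mode of g, the analogue of alpha_i-hat
a_hat = minimize_scalar(lambda a: -g(a)).x

# First-order Laplace approximation of the integral of exp(g)
laplace = np.exp(g(a_hat)) * np.sqrt(2 * np.pi / abs(g2(a_hat)))

# Numerical value for comparison
exact, _ = quad(lambda a: np.exp(g(a)), -np.inf, np.inf)
print(laplace, exact)
```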
Another intuitive approach to dealing with longitudinal bounded data is to perform a logit transformation (Aitchison and Shen 1980; Lesaffre et al. 2007) of the bounded data before applying a joint model, such as that of Zeng and Cai (2005), to the transformed data. Song et al. (2015) explored this approach in detail. Numerical studies showed that the two approaches are comparable in performance and both are better than the simple approach that ignores the restricted nature of the longitudinal bounded data, such as the method of Ye et al. (2008). The joint model based on logit-transformed longitudinal bounded data is more robust to model misspecification. The approach based on the simplex distribution, however, has the advantage of a simpler and easier interpretation of covariate effects on the original scale of the data. Similar to the classical generalized linear model, this approach models the effects of covariates on the mean of the longitudinal bounded data while the dispersion of the distribution stays intact.
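For comparison, the logit-transformation alternative first rescales the bounded score to the open unit interval and then applies the logit. The sketch below is one conventional way to do this, including a small shrinkage of boundary values toward the interior; the shrinkage formula is a common choice and not necessarily the one used by Song et al. (2015).

```python
import numpy as np

def logit_transform(score, lower=0.0, upper=7.0, n=None):
    """Map a bounded score on [lower, upper] to the real line via the logit.

    If n (a sample size) is given, boundary values are shrunk toward the interior
    using the common (y * (n - 1) + 0.5) / n adjustment before taking the logit."""
    y = (np.asarray(score, dtype=float) - lower) / (upper - lower)
    if n is not None:
        y = (y * (n - 1) + 0.5) / n
    return np.log(y / (1.0 - y))

scores = np.array([0.0, 3.5, 6.8, 7.0])     # global QoL scores on the 0-7 BCQ scale
print(logit_transform(scores, n=len(scores)))
```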

8.5 Application to Data from a Breast Cancer Clinical Trial

As an illustration, the joint models and the estimation methods of JMtt and JMSIM are applied to analyze the data from MA.5 described in Sect. 8.2. Preliminary examination of the QoL data revealed that the patterns of change in the scores are different in different time periods and in different treatment groups
(Fig. 8.2).

Fig. 8.2 Averages of BCQ scores for patients in the two treatment groups of MA.5 (average BCQ summary score versus months from randomization, for the CMF and CEF arms)

This led to the piecewise polynomial linear predictors (8.8) and (8.9), used in the joint models JMtt and JMSIM respectively, to better describe the change in scores over time:

$$\text{JMtt}: \quad X_{ij}^T\beta + \alpha_i = [\beta_0 + \beta_1 x_i + \beta_2 t_{ij} + \beta_3 x_i t_{ij} + \beta_4 t_{ij}^2 + \beta_5 x_i t_{ij}^2]\, I_{t_{ij}\in[0,2)} + [\beta_6 + \beta_7 x_i + \beta_8 t_{ij} + \beta_9 t_{ij}^2]\, I_{t_{ij}\in[2,9)} + [\beta_{10} + \beta_{11} x_i + \beta_{12} t_{ij}]\, I_{t_{ij}\ge 9} + \alpha_i, \qquad (8.8)$$

$$\text{JMSIM}: \quad X_{ij}^T\beta + \alpha_i = [\beta_0 + \beta_1 x_i + \beta_2 t_{ij} + \beta_3 x_i t_{ij} + \beta_4 t_{ij}^2 + \beta_5 x_i t_{ij}^2]\, I_{t_{ij}\in[0,2)} + [\beta_6 + \beta_7 x_i + \beta_8 t_{ij} + \beta_9 x_i t_{ij} + \beta_{10} t_{ij}^2 + \beta_{11} x_i t_{ij}^2]\, I_{t_{ij}\in[2,9)} + [\beta_{12} + \beta_{13} x_i + \beta_{14} t_{ij} + \beta_{15} x_i t_{ij}]\, I_{t_{ij}\ge 9} + \alpha_i, \qquad (8.9)$$

where $x_i$ is the binary treatment indicator ($=1$ for CEF and $=0$ for CMF) and $t_{ij}$ denotes the time (in months from randomization) at which the QoL of a patient was assessed. The RFS times in the joint model are modeled by (8.2) in JMtt and by (8.6) in JMSIM, with $Z_i = x_i$.
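As a small aid to reading (8.8), the sketch below evaluates the JMtt piecewise linear predictor for a given treatment indicator and assessment time; the coefficient vector used here is a placeholder rather than the fitted values reported in Table 8.1.

```python
import numpy as np

def jmtt_linear_predictor(beta, x, t, alpha=0.0):
    """Piecewise polynomial linear predictor (8.8); beta holds beta_0, ..., beta_12."""
    if t < 2:
        lp = beta[0] + beta[1]*x + beta[2]*t + beta[3]*x*t + beta[4]*t**2 + beta[5]*x*t**2
    elif t < 9:
        lp = beta[6] + beta[7]*x + beta[8]*t + beta[9]*t**2
    else:
        lp = beta[10] + beta[11]*x + beta[12]*t
    return lp + alpha

beta = np.linspace(0.1, -0.1, 13)          # placeholder coefficients, not fitted values
for t in (1.0, 5.0, 12.0):
    print(t, jmtt_linear_predictor(beta, x=1, t=t))   # x = 1 corresponds to CEF
```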
Table 8.1 Estimates and standard errors of the parameters in the joint models JMtt and JMSIM with longitudinal parts (8.8) and (8.9) for MA.5, respectively

Joint model        β0      β1      β2      β3      β4      β5
JMtt   Est.      0.0382  0.0679  0.4426  0.5360  0.1380  0.2662
       Std.      0.0306  0.0433  0.0597  0.0841  0.0309  0.0436
JMSIM  Est.      1.0516  0.0350  0.2171  0.4345  0.0393  0.1904
       Std.      0.0316  0.0447  0.0624  0.0900  0.0332  0.0469

Joint model        β6      β7      β8      β9      β10     β11
JMtt   Est.      0.0278  0.0732  0.2174  0.0297  0.3630  0.1192
       Std.      0.0553  0.0355  0.0207  0.0019  0.0356  0.0361
JMSIM  Est.      0.7431  0.0152  0.0311  0.0105  0.0102  0.0016
       Std.      0.0787  0.1118  0.0316  0.0447  0.0028  0.0004

Joint model        β12     β13     β14     β15     γ0      γ1
JMtt   Est.      0.0062    –       –       –     0.0454  0.2863
       Std.      0.0015    –       –       –     0.0788  0.1163
JMSIM  Est.      1.4364  0.0110  0.0070  0.0012    –     0.2679
       Std.      0.0397  0.0565  0.0019  0.0027    –     0.1122

Joint model        σ_α²    η (ν)
JMtt   Est.      0.1499  3.6392
       Std.      0.0099  0.2991
JMSIM  Est.      0.1781  0.3444
       Std.      0.0101  0.1425

Table 8.1 summarizes the estimates of the parameters and their standard errors in the two joint models JMtt and JMSIM fitted to the data. The hazard of the RFS times was significantly lower for patients randomized to CEF than for those randomized to CMF (p-value 0.0138 from JMtt, p-value 0.0170 from JMSIM). In addition, both $\hat{\eta}$ in JMtt and $\hat{\nu}$ in JMSIM are significantly different from 0, which indicates a strong dependence between the longitudinal QoL and RFS data from MA.5. The fitted mean curves and their confidence bands of the longitudinal data for CEF and CMF from JMtt and JMSIM are shown in Fig. 8.3. The curves fitted by JMtt and JMSIM have a similar trend. The mean BCQ score in the CMF group decreases to a nadir after randomization and then increases steadily over the next 7 months. In contrast, the mean BCQ score in the CEF group decreases more quickly to a nadir and gradually increases in the remaining months of chemotherapy treatment. After the 6 months of chemotherapy treatment, the scores in both arms tend to be stable. This implies that patients treated with CEF had worse QoL than those treated with CMF at the very beginning but gradually recovered to a level slightly better than that of the CMF patients.
Fig. 8.3 Estimated mean curves and their confidence bands of the longitudinal data from JMtt and JMSIM for the CMF and CEF arms, respectively (BCQ summary score versus months from randomization; upper panel: JMtt fits, lower panel: JMSIM fits)

8.6 Discussions and Future Work

In this paper, we have reviewed our recent contributions to the joint modeling of longitudinal QoL and survival data that deal with bounded longitudinal QoL data and a possible cure fraction. Our work involves a linear mixed t model and a generalized linear mixed effects model with the simplex distribution for the longitudinal QoL data to better accommodate extreme and bounded QoL measurements. A promotion time cure model is considered for the survival data to accommodate a possible cure fraction. Semiparametric inference procedures with an EM algorithm and a penalized joint likelihood based on the Laplace approximation were developed for the parameters in the joint models. Simulation studies show that the estimation of parameters in the joint models is more efficient than that based on separate analyses or on existing joint models. The models also enjoy intuitive interpretation. The models are illustrated with the data from a breast cancer clinical trial.
Our work so far is limited to the definition of the new models and the inference of parameters in these models. Recently, motivated also by the analysis of data from MA.5, Park and Qiu (2014) developed model selection and diagnostic procedures for a joint model which uses a linear mixed model for the longitudinal measurements and a time-varying coefficient model for the survival times. Procedures for the assessment of fit for each component of a joint model were derived recently by Zhang et al. (2014). He et al. (2015) developed procedures which can be used to simultaneously select variables in both components of a joint model. Developing similar procedures for the models we proposed would be an interesting but also challenging future research topic.
Other future extensions of the work reviewed in this paper include treating the QoL measurements as categorical data based on their original scales. An item response model (Wang et al. 2002) may be considered in a joint model to accommodate categorical QoL measurements. It would also be interesting to develop smooth functions of time for the longitudinal trajectory and the baseline hazard function via splines. In addition, the relapse-free survival time is, by nature, interval-censored, since the assessment of recurrence is usually done only at each clinical visit. Extensions of the joint models reviewed here that take this interval censoring into account are of interest in applications.

Acknowledgements The authors wish to thank the editors and associate editors for their helpful
comments and suggestions. Hui Song was supported by National Natural Sciences Foundation of
China (Grant No.11601060), Dalian High Level Talent Innovation Programme (No.2015R051) and
Fundamental Research Funds for the Central Universities of China.

References

Aitchison J (1986) The statistical analysis of compositional data. Chapman and Hall, London
Aitchison J, Shen SM (1980) Logistic-normal distributions: some properties and uses. Biometrika
67:261–272
Barndorff-Nielsen OE, Jørgensen B (1991) Some parametric models on the simplex. J Multivar
Anal 39:106–116
Breslow NE (1972) Contribution to the discussion of D. R. Cox. J R Stat Soc Ser B 34:216–217
Brown ER, Ibrahim JG (2003) Bayesian approaches to joint cure-rate and longitudinal models with
applications to cancer vaccine trials. Biometrics 59:686–693
Brundage M, Feld-Stewart D, Leis A, Bezjak A, Degner L, Velji K, Zetes-Zanatta L, Tu D, Ritvo
P, Pater J (2005) Communicating quality of life information to cancer patients: a study of six
presentation formats. J Clin Oncol 28:6949–6956
Chen MH, Ibrahim JG, Sinha D (1999) A new Bayesian model for survival data with a surviving
fraction. J Am Stat Assoc 94:909–919
Chen MH, Ibrahim JG, Sinha D (2004) A new joint model for longitudinal and survival data with
a cure fraction. J Multivar Anal 91:18–34
Cornish EA (1954) The multivariate t-distribution associated with a set of normal sample deviates.
Aust J Phys 7:531–542
Dancey J, Zee B, Osoba D, Whitehead M, Lu F, Kaizer L, Latreille J, Pater JL (1997) Quality of life
score: an independent prognostic variable in a general population of cancer patients receiving
chemotherapy. Qual Life Res 6:151–158
Diggle P, Sousa I, Chetwynd A (2008) Joint modelling of repeated measurements and time-to-event
outcomes: the fourth Armitage lecture. Stat Med 27:2981–2998
Fairclough DL (2010) Design and analysis of quality of life studies in clinical trials. Chapman and
Hall/CRC, Boca Raton
Ganz PA, Lee JJ, Siau J (2006) Quality of life assessment: an independent prognostic variable for
survival in lung cancer. Cancer 67:3131–3135
Gould LA, Boye ME, Crowther MJ, Ibrahim JG, Quartey G, Micallef S, Bois FY (2015) Joint
modeling of survival and longitudinal non-survival data: current methods and issues. Report of
the DIA Bayesian joint modeling working group. Stat Med 34:2181–2195
He Z, Tu W, Wang S, Fu H, Yu Z (2015) Simultaneous variable selection for joint models of longitudinal and survival outcomes. Biometrics 71:178–187
Ibrahim JG, Chu H, Chen LM (2010) Basic concepts and methods for joint models of longitudinal
and survival data. J Clin Oncol 28:2796–2801
Jørgensen B (1997) The theory of dispersion models. Chapman and Hall, London
Klein JP (1992) Semiparametric estimation of random effects using the Cox model based on the
EM algorithm. Biometrics 48:795–806
Lange KL, Little RJA, Taylor JMG (1989) Robust statistical modeling using the t distribution. J
Am Stat Assoc 84:881–896
Law NJ, Taylor JMG, Sandler H (2002) The joint modelling of a longitudinal disease progression
marker and the failure time process in the presence of cure. Biostatistics 3:547–563
Lesaffre E, Rizopoulos D, Tsonaka R (2007) The logistic transform for bounded outcome scores.
Biostatistics 8:72–85
Levine MN, Bramwell VH, Pritchard KI, Norris BD, Shepherd LE, Abu-Zahra H, Findlay B,
Warr D, Bowman D, Myles J, Arnold A, Vandenberg T, MacKenzie R, Robert J, Ottaway
J, Burnell M, Williams CK, Tu DS (1998) Randomized trial of intensive cyclophosphamide,
epirubicin, and fluorouracil chemotherapy compared with cyclophosphamide, methotrexate,
and fluorouracil in premenopausal women with node-positive breast cancer. J Clin Oncol
16:2651–2658
Levine MN, Pritchard KI, Bramwell VH, Shepherd LE, Tu D, Paul N (2005) Randomized trial
comparing cyclophosphamide, epirubicin, and fluorouracil with cyclophosphamide, methotrex-
ate, and fluorouracil in premenopausal women with node-positive breast cancer: update of the National Cancer Institute of Canada Clinical Trials Group trial MA.5. J Clin Oncol 23:5166–5170
Louis TA (1982) Finding the observed information matrix when using the EM algorithm. J R Stat
Soc Ser B 44:226–233
Park KY, Qiu P (2014) Model selection and diagnostics for joint modeling of survival and longitudinal data with crossing hazard rate functions. Stat Med 33:4532–4546
Pinheiro JC, Liu CH, Wu YN (2001) Efficient algorithms for robust estimation in linear mixed-
effects models using the multivariate t-distribution. J Comput Graph Stat 10:249–276
Qiu Z, Song PXK, Tan M (2008) Simplex mixed-effects models for longitudinal proportional data.
Scand J Stat 35:577–596
Richards MA, Ramirez AJ (1997) Quality of life: the main outcome measure of palliative care.
Palliat Med 11:89–92
Ripatti S, Palmgren J (2000) Estimation of multivariate frailty models using penalized partial
likelihood. Biometrics 56:1016–1022
Rondeau V, Commenges D, Joly P (2003) Maximum penalized likelihood estimation in a gamma-
frailty model. Lifetime Data Anal 9:139–153
Schluchter MD (1992) Methods for the analysis of informatively censored longitudinal data. Stat
Med 11:1861–1870
Song PXK (2007) Correlated data analysis: modeling, analytics, and applications. Springer,
New York
Song PXK, Tan M (2000) Marginal models for longitudinal continuous proportional data.
Biometrics 56:496–502
Song PXK, Zhang P, Qu A (2007) Maximum likelihood inference in robust linear mixed-effects
models using multivariate t distributions. Stat Sin 17:929–943
Song H, Peng YW, Tu DS (2012) A new approach for joint modelling of longitudinal measure-
ments and survival times with a cure fraction. Can J Stat 40:207–224
Song H, Peng YW, Tu DS (2015, Online) Jointly modeling longitudinal proportional data and
survival times with an application to the quality of life data in a breast cancer trial. Lifetime
Data Anal. doi: 10.1007/s10985-015-9346-8
Tsiatis AA, Davidian M (2004) Joint modelling of longitudinal and time-to-event data: an
overview. Stat Sin 14:809–834
Tsodikov A, Ibrahim JG, Yakovlev AY (2003) Estimating cure rates from survival data: an
alternative to two-component mixture models. J Am Stat Assoc 98:1063–1078
Tu DS, Chen YQ, Song PX (2004) Analysis of quality of life data from a clinical trial on early
breast cancer based on a non-parametric global test for repeated measures with informative
censoring. Qual Life Res 13:1520–1530
Viviani S, Alfó M, Rizopoulos D (2013) Generalized linear mixed joint model for longitudinal and
survival outcomes. Stat Comput 24:417–427
Wang C, Douglas J, Anderson S (2002) Item response models for joint analysis of quality of life
and survival. Stat Med 21:129–142
Wu L, Liu W, Yi GY, Huang Y (2012) Analysis of longitudinal and survival data: joint modeling,
inference methods, and issues. J Probab Stat Article ID 640153. doi:10.1155/2012/640153
Yakovlev AY, Asselain B, Bardou VJ, Fourquet A, Hoang T, Rochefordiere A, Tsodikov AD (1993) A simple stochastic model of tumor recurrence and its application to data on premenopausal breast cancer. Biometrie et Analyse de Donnees Spatio-Temporelles 12:66–82
Ye W, Lin XH, Taylor JMG (2008) A penalized likelihood approach to joint modeling of
longitudinal measurements and time-to-event data. Stat Interface 1:33–45
Yin GS (2005) Bayesian cure rate frailty models with application to a root canal therapy study.
Biometrics 61:552–558
Yu MG, Taylor JMG, Sandler HM (2008) Individual prediction in prostate cancer studies using a
joint longitudinal survival-cure model. J Am Stat Assoc 103:178–187
Zeng D, Cai J (2005) Simultaneous modelling of survival and longitudinal data with an application
to repeated quality of life measures. Lifetime Data Anal 11:151–174
Zhang P, Qiu ZG, Fu YJ, Song PXK (2009) Robust transformation mixed-effects models for
longitudinal continuous proportional data. Can J Stat 37:266–281
Zhang D, Chen MH, Ibrahim JG, Boye ME, Wang P, Shen W (2014) Assessing model fit in joint
models of longitudinal and survival data with applications to cancer clinical trials. Stat Med
33:4715–4733
Part III
Applied Data Analysis
Chapter 9
Confidence Weighting Procedures for
Multiple-Choice Tests

Michael Cavers and Joseph Ling

Abstract Multiple-choice tests are extensively used in the testing of mathematics and statistics in undergraduate courses. This paper discusses a confidence weighting model of multiple-choice testing called the student-weighted model. In this model, students are asked to indicate an answer choice and their certainty of its correctness. This method was implemented in two first year Calculus courses at the University of Calgary in 2014 and 2015. The results of this implementation are discussed here.

9.1 Introduction

Multiple-choice exams are extensively used to test students' knowledge in mathematics and statistics. Typically a student has four or five options to select from, with exactly one option being correct. Using multiple-choice exams allows the examiners to cover a broad range of topics, to score the exam quickly, and to measure various learning outcomes. However, it is difficult to write questions that test mathematical ideas instead of factual recall. Shank (2010) is an excellent resource discussing the construction of good multiple-choice questions. It is also debated whether multiple-choice exams truly measure a student's knowledge, since in the conventional model there is no penalty for guessing (Echternacht 1972).
known as the student-weighted model. In this model, each multiple-choice question
has four or five options to select from with exactly one correct answer. The examinee
must also indicate their certainty of the correctness of their answers, for example on
a three-point scale. We first briefly discuss past studies where confidence testing was
applied.
In 1932, Hevner applied a confidence testing method to music appreciation
true/false tests. In 1936, Soderquist completed a similar study with true/false tests,
whereas Wiley and Trimble (1936) analyzed whether or not there are personality
factors in the confidence testing method. A five-point scale for confidence levels
was implemented by Gritten and Johnson in 1941.

M. Cavers • J. Ling
Department of Mathematics and Statistics, University of Calgary, T2N 1N4, Calgary, AB, Canada
e-mail: [email protected]; [email protected]


A particularly interesting study, completed in 1953 by Dressel and Schmid, compares four non-conventional multiple-choice models. Five sections of a course in physical science were used in the study, where each section contained about 90 students. In the first hour of the exam, each section wrote a common (conventional) multiple-choice test. In the second hour, five multiple-choice models were implemented, one for each of the five sections: a conventional multiple-choice test with one correct answer from five options; a free-choice test where each question had exactly one correct answer and examinees must mark as many answers as they felt they needed to be sure they had not omitted the correct answer; a degree-of-certainty test where each question had exactly one correct answer and examinees indicate how certain they are of their answer being correct; a multiple-answer test where any number of the options may be correct and the examinee is to mark each correct alternative; and a two-answer test where exactly two of the five options are known to be correct. After comparing the above models, it was found that students completed the conventional test the fastest, followed by degree-of-certainty, free-choice, two-answer, and multiple-answer. From most to least reliable, the models rank as multiple-answer, two-answer, degree-of-certainty, conventional, followed by free-choice.
Studies relevant to confidence weighting methods, including those described above, are summarized by Echternacht (1972), who notes that such methods can be used to discourage guessing since the expected score is maximized only if the examinee reveals the true degree of certainty in their responses. Frary (1989) reviews methods used in the 1970s and 1980s. Ling and Cavers (2015) describe overall results of an early implementation of the student-weighted model completed in 2014. The current paper compares the results from Ling and Cavers (2015) to our Spring 2015 study. This study was approved by the Conjoint Faculties Research Ethics Board (CFREB) at the University of Calgary.
In Sect. 9.2 we discuss the implementation of the student-weighted model at the
University of Calgary in 2014. We then discuss differences in how the method was
applied in 2015. Survey results and the effects of the method on class performance
are analyzed. In Sect. 9.3 we conclude with issues for future investigation.

9.2 The Student-Weighted Model

The student-weighted model was implemented on the multiple-choice sections of the midterms and final examination in two first year Calculus courses at the
University of Calgary in 2014 and 2015. In particular, we applied the method in
four semesters: Winter 2014 (one section of Math 249 and one section of Math
251), Spring 2014 (Math 249), Fall 2014 (five sections of Math 249), and Spring
2015 (one section of Math 249). This paper mainly focuses on the data collected
from the Fall 2014 and Spring 2015 semesters.
In what follows, we first describe the method and how scores are calculated. Second, we discuss the effect the relative weights had on the class average, along with the percentages of beneficial and detrimental cases. Third, we analyze student feedback from surveys conducted about the method.

9.2.1 Description of the Method

Out of five options, each question had exactly one correct answer. After each
question, the examinee was asked to assign a relative weight to the question (i.e.,
a confidence level) on a three-point scale. In an early implementation of the model
in the Winter 2014 semester, students indicated their answers on a custom answer
sheet. They were instructed to place an X in the box for their choice (options
A, B, C, D, E) along with an X in one of three relative weight boxes for that
question (options 1, 2, 3). Sample answer sheets with tallied scores were distributed
in advance to explain the scoring method to the students. Later implementations
made use of the standard university multiple-choice answer sheets where the odd
numbered items were exam questions and the even numbered items were the relative
weights. A sample exam question followed by a request for students to indicate a
relative weight is shown in Fig. 9.1.
Students were told to assign weights to questions based on their confidence level
for each problem. That is, when they felt very confident, “Relative Weight = 3”
should be assigned, whereas, when they did not feel confident, “Relative Weight
= 1” should be assigned. When the weighting part of the question is left blank, a
default weight of 2 was assigned.

Fig. 9.1 A sample exam question and follow-up asking examinees to indicate a relative weight. (Question 1 is a calculus question with answer options (a)–(e), where option (e) reads "There is insufficient information for us to determine the value"; Question 2 reads "Please assign a (relative) weight to Question 1 above", with options (a) Weight = 1, (b) Weight = 2, (c) Weight = 3.)

To calculate the multiple-choice score for each
student, the total sum of the relative weights for all correct answers was divided
by the total sum of the relative weights assigned to all questions. Thus, in such
a method, each student may have a different total sum of relative weights. These
scores were then multiplied by the portion of the test that is multiple-choice: in the
case the exam had a 30 % written component and 70 % multiple-choice component,
by 70; in the case the exam was all multiple-choice, by 100. Students were instructed
they could default to a conventional multiple-choice test by assigning the same
relative weight to each question, or by leaving the relative weights blank. Students
who scored 100 % on the multiple-choice did not see a benefit in grade from the
method since a perfect score would have been obtained regardless of the assignment
of relative weights.
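The scoring rule just described can be summarized in a few lines of code. The sketch below is ours, with hypothetical answers and weights; it divides the sum of the relative weights on correctly answered questions by the sum of all assigned weights (blank weights defaulting to 2) and scales by the multiple-choice portion of the exam.

```python
def student_weighted_score(answers, key, weights, mc_portion=70.0, default_weight=2):
    """Student-weighted multiple-choice score.

    answers: the student's choices; key: the correct choices;
    weights: relative weights in {1, 2, 3}, or None if left blank."""
    weights = [default_weight if w is None else w for w in weights]
    earned = sum(w for a, k, w in zip(answers, key, weights) if a == k)
    return mc_portion * earned / sum(weights)

# Hypothetical 5-question exam worth 70 % of the total mark
answers = ['A', 'C', 'B', 'E', 'D']
key     = ['A', 'C', 'D', 'E', 'D']
weights = [3, 2, 1, None, 3]               # the blank weight counts as 2
print(student_weighted_score(answers, key, weights))   # (3 + 2 + 2 + 3) / 11 * 70
```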

9.2.2 Effect on Class Average and Student Grades

Using actual data from midterms and examinations we are able to compare students’
scores between a conventional (uniform-weight) scoring method and the student-
weighted model. Here, we present the data collected from the Fall 2014 and Spring
2015 semesters. In the Fall 2014 semester, we implemented the method in five
sections of Math 249 (Introductory Calculus) for the two midterms and the final
examination. A total of 723 students wrote the first midterm, 657 students wrote
the second midterm and 603 students wrote the final examination. In the Spring
2015 semester, the method was implemented in one section of Math 249 for both
the midterm and final examination. A total of 87 students wrote the midterm and 81
students wrote the final examination. Note that at the University of Calgary, the Fall and Winter semesters last 13 weeks each whereas the Spring and Summer semesters range from 6 to 7 weeks each; thus, one midterm was given in the Spring 2015 semester whereas two midterms were given in the Fall 2014 semester.
To measure the effect of relative weights on student grades, for each student we first computed their grade assuming a conventional scoring method where all questions are weighted the same; we then compared this to the student-weighted model using the relative weights assigned by the student. Below we present three ways of looking at the impacts of relative weights on student grades (a small computational sketch follows the list):
• The change in the class average.
• The percentage of students that received a higher mark, a lower mark and the
same mark on tests and exams.
• The percentage of students that experienced a substantial difference (5 percent-
age points or more and 3 percentage points or more) in test and exam marks.
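A short sketch of how these three comparisons can be computed from two vectors of marks; the function and the example marks below are hypothetical and are not data from the study.

```python
import numpy as np

def summarize_impact(conventional, weighted):
    """Compare student-weighted marks with conventional marks:
    change in average, direction of change, and large (3- or 5-point) differences."""
    conventional = np.asarray(conventional, dtype=float)
    weighted = np.asarray(weighted, dtype=float)
    diff = weighted - conventional
    return {
        "average_change": diff.mean(),
        "pct_higher": 100 * np.mean(diff > 0),
        "pct_unchanged": 100 * np.mean(diff == 0),
        "pct_lower": 100 * np.mean(diff < 0),
        "pct_abs_diff_ge_3": 100 * np.mean(np.abs(diff) >= 3),
        "pct_abs_diff_ge_5": 100 * np.mean(np.abs(diff) >= 5),
    }

# Hypothetical marks for five students
print(summarize_impact([60, 72, 85, 90, 100], [66, 75, 83, 92, 100]))
```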
Fig. 9.2 Effect of relative weights on class average

Fig. 9.3 Percentage of students whose midterm/final examination mark increased, remained
unchanged or decreased after applying the student-weighted scoring method

Figure 9.2 shows the effect that relative weights had on the class average.
These results indicate that course performance as a whole is better when we use
the student-weighted scoring method compared to using the conventional scoring
method.
Figure 9.3 shows that a majority of students had an increase in their midterm
or final examination mark after applying the student-weighted model. In the Spring
2015 semester, eight students scored 100 % on the midterm and thus their mark is
indicated as unchanged when compared to an equal weight scheme. Additionally,
some students either left the relative weight options blank or assigned each question
the same relative weight; thus, their score is also unchanged when comparing the relative weight method to the conventional one. Note that some of the students who saw their mark decrease are top performers who had one or two incorrect responses but assigned a high relative weight to those particular problems.
In Figs. 9.4 and 9.5, we further analyze the effect that relative weights had
on student grades. Students whose multiple-choice mark is affected by at least
five percentage points would likely receive a different letter grade than if the
conventional scoring method were used. Here, we have separated the “beneficial”
cases from the “detrimental” cases.
Fig. 9.4 Percentage of beneficial cases

Fig. 9.5 Percentage of detrimental cases

9.2.3 Survey Results

During each course, a survey was conducted to solicit student feedback about the
student-weighted method. The main purpose of each survey is for us to learn about
the students’ perspectives on their experience with the student-weighted method.
We use student comments to help us improve both our teaching and future students’
learning experience. See Ling and Cavers (2015) for comments on survey questions
and responses from the Winter 2014, Spring 2014 and Fall 2014 semesters. Here, we summarize results from the Spring 2015 semester and compare them with those of the Fall 2014 semester.
In the Fall 2014 semester, one survey was conducted about two weeks after the second midterm test, but the response rate was low: 73 out of 646 students completed the survey. Questions were asked to measure students' perceptions of learning and course performance. Note that Figs. 9.2 and 9.3 illustrate the actual effect of the method. Because of the low response rate, we hesitate to draw any definite conclusions about general student perceptions of the scoring method.
In the Spring 2015 semester, we incorporated the use of the student-weighted format in the tutorials/labs. All lab problem sets were laid out in this format, so practice with the method was built into the entire course. In the Fall 2014 semester, a multi-page explanatory document was posted for students to read, but such a document was not posted in the Spring 2015 semester. Bonus effort marks were awarded for submitting individual or group solutions to the lab problem sets.
In the Spring 2015 semester, the number of surveys was increased from one to
two, and the number of survey questions from 6 to 30 for the first survey and 31 for
the second survey. By conducting two surveys, we are better able to track possible
change of feeling/experience during the semester. In the Spring 2015 semester we
also asked about impacts on study habits and exam strategies, impact on stress level,
and perception of fairness of the method.
Completion of the two surveys was built into the course grade; in particular, 1 % of the course mark was for completion of the surveys. As a result, the response rate to the surveys increased compared to the Fall 2014 semester: 72 out of 87 students (82.76 %) for survey one and 65 out of 81 students (80.25 %) for survey two. This gives us much more concrete information to help improve our future teaching. However, valid consent for research use of the data was low: twelve students gave consent to present their responses for both surveys. Three students consented to the use of their data in one of the surveys but not the other, while another three students consented but did not complete one (or both) of the surveys. We caution the reader that the information shared here is from the twelve consenting respondents who completed both surveys and may or may not be representative of all respondents.
In what follows, we highlight, report and reflect on a few survey questions from the consenting participants. We use a vector

(SA, A, N, D, SD)

to indicate the number of students that selected Strongly Agree (SA), Agree (A), Neutral (N), Disagree (D) and Strongly Disagree (SD), respectively, and a vector

(VCon, SCon, N, SClear, VClear)

to indicate the number of students that selected Very Confusing (VCon), Somewhat Confusing (SCon), Neither (N), Somewhat Clear (SClear) and Very Clear (VClear), respectively.
The first question asked on both surveys was “Which multiple-choice model,
traditional or weighted, do you prefer?”. On survey one, one student selected the
“Traditional method”, ten students selected the “Weighted method”, zero students
selected “Neither” and one student selected “No opinion”. Interestingly, on survey
two all twelve students chose the “Weighted method”.
The second question asked was "How would you rate your understanding of the weighting method?" for survey one and "How would you rate your understanding of the weighting method after the midterm test?" for survey two. We observe

(VCon, SCon, N, SClear, VClear) = (0, 1, 0, 6, 5)

for survey one and

(VCon, SCon, N, SClear, VClear) = (0, 1, 0, 5, 6)

for survey two. The student who chose "Somewhat Confusing" in survey one chose "Somewhat Clear" in survey two, while the student who chose "Somewhat Confusing" in survey two chose "Somewhat Clear" in survey one. Overall, most of the students understood the weighting.
The third question asked was "How would you rate the calculation of your grade using the weighting method?" for survey one and "How would you rate the calculation of your grade using the weighting method after the midterm test?" for survey two. We observe

(VCon, SCon, N, SClear, VClear) = (0, 2, 1, 7, 2)

for survey one and

(VCon, SCon, N, SClear, VClear) = (0, 1, 2, 4, 5)

for survey two. Five students selected the same option on both surveys while five students chose options indicating the calculation of grade became clearer after the midterm test. The remaining two students chose "Somewhat Clear" on survey one but, on survey two, one chose "Neither" while the other chose "Somewhat Confusing".
The fourth item on the surveys was "I paid attention to how I assigned weight when I did the questions in the lab problem sets" for survey one and "Since the midterm test, I have paid attention to how I assigned weight when I did the questions in the lab problem sets" for survey two. We observe (SA, A, N, D, SD) = (4, 6, 0, 1, 1) for survey one and (SA, A, N, D, SD) = (1, 4, 6, 0, 1) for survey two. Overall, most of the students paid more attention to weight assignment in the labs before the midterm, while still learning about the method.
The next set of questions asked students about double-checking their work. For the question "How often do you typically double-check your work on midterms (respectively, examinations)?", we observe

(Always, Very Frequently, Occasionally, Rarely, Never) = (7, 3, 0, 2, 0)

for survey one and

(Always, Very Frequently, Occasionally, Rarely, Never) = (5, 3, 4, 0, 0)

for survey two. For the statements "While working on the problem sets, I felt an increased need to double-check my work as I was to assign weights to my answers" for survey one and "Since the midterm test, I felt an increased need to double-check my work as I was to assign weights to my answers while working on the lab problem sets" for survey two, we observe (SA, A, N, D, SD) = (1, 2, 5, 4, 0) for survey one and (SA, A, N, D, SD) = (5, 2, 4, 1, 0) for survey two. Finally, for the question "During the midterm test (respectively, final examination), I felt an increased need to double-check my work as I was to assign weights to my answers" we observe (SA, A, N, D, SD) = (4, 5, 2, 1, 0) for survey one and (SA, A, N, D, SD) = (8, 4, 0, 0, 0) for survey two. Overall, although students already frequently double-check their work, students perceived the assignment of weights to cause an increased need to double-check their work.
The next question asked "During the midterm test (respectively, final examination), I experienced an increased level of stress as I was to assign weights to my answers", with a follow-up question requesting students to provide a reason for their choice. We observe (SA, A, N, D, SD) = (2, 4, 2, 2, 2) for survey one and (SA, A, N, D, SD) = (1, 4, 3, 3, 1) for survey two. The most common reasons given by those who selected (SA) or (A) are that the stakes are high and the weighting impacts the grade. For those who selected (SD) or (D), the most common reason given is that the method allows them to lower the weight when unsure.
We also asked students their perception of the effect of the weighted method on learning: "Assigning relative weights to multiple-choice questions was beneficial to my learning." We observe (SA, A, N, D, SD) = (3, 7, 1, 1, 0) for survey one and (SA, A, N, D, SD) = (6, 5, 1, 0, 0) for survey two. After the final exam, it appears that an increased number found the method beneficial. When asked how they thought the method was beneficial to their learning, the most common response was that it helped identify weak areas for more practice and study. We also asked students the following: "My test mark based on the weighting method is a more accurate reflection of my knowledge of the course material than the mark based on the traditional grading method would have been." We observe (SA, A, N, D, SD) = (2, 5, 4, 1, 0) on survey one; however, there was a mistake in two of the options on survey two, so we omit that data here.
The next question asked "Prior to writing the midterm test (respectively, final exam) I developed my own strategy to assign weights to my answers." We observe (SA, A, N, D, SD) = (2, 2, 4, 2, 2) for survey one and (SA, A, N, D, SD) = (2, 7, 2, 1, 0) for survey two. From this it appears that more students developed a strategy for the final exam than for the midterm. When asked if they used the same strategy, we found (SA, A, N, D, SD) = (4, 4, 0, 4, 0).
The next question asked "I personally believe that there is a positive correlation between one's knowledge of a subject and one's confidence in one's knowledge of that subject", with a follow-up question requesting students to provide a reason for their choice. We observe (SA, A, N, D, SD) = (9, 3, 0, 0, 0) for survey one and (SA, A, N, D, SD) = (9, 2, 0, 0, 1) for survey two. Note that the student with the (SD) response in survey two chose (SA) in survey one for the same question. Reasons are respectively "If you are confident you know something, you're more likely to have it stick to your brain instead of just learning it for the exam" and "If you think you don't know it, you probably don't."
When asked "Knowing that I could assign weights affected how I studied," we observe (SA, A, N, D, SD) = (0, 5, 2, 4, 1) for survey one and (SA, A, N, D, SD) = (1, 4, 4, 2, 1) for survey two. Reasons given for (SA) and (A) are that students focus more on content they weight a one and that, if there was a topic they did not understand, they did not stress as much about it knowing they could weight it a one. Reasons given for (SD) and (D) mostly mention that they would study all of the material regardless and that they need to know all concepts for future courses. On the other hand, when asked "Knowing that I could assign weights affected how I distrubuted (sic) my study time to different topics," we observe (SA, A, N, D, SD) = (1, 4, 2, 4, 1) for survey one and (SA, A, N, D, SD) = (2, 4, 2, 3, 1) for survey two. Comments given for (SA) and (A) are that they focused more time on topics they were not confident in, whereas comments for (D) and (SD) are that the weighting method does not affect the grade enough to completely ignore a topic.
To conclude, the consenting respondents found the method beneficial to their learning, mainly by helping them to identify areas of weakness; developed strategies as the term progressed; believed that there is a positive correlation between knowledge and confidence; thought that the weighting method reflected their level of knowledge more accurately than the traditional method; and seemed to be neutral in relation to the impact on stress level in tests and exams.

9.3 Issues for Future Investigation

After implementation of the student-weighted method, feedback from students and colleagues has sparked many questions; for example, see Ling and Cavers (2015). Results indicate that course performance as a whole is better when we use the student-weighted scoring method than with the conventional scoring method. However, how is one to interpret this in terms of student learning? More research targeted at specific issues is needed.

References

Dressel PL, Schmid J (1953) Some modifications of the multiple-choice item. Educ Psychol
Measurement 13:574–595
Echternacht GJ (1972) The use of confidence testing in objective tests. Rev Educ Res 42(2):217–
236
Frary RB (1989) Partial-credit scoring methods for multiple-choice tests. Appl Measur Educ
2(1):79–96
Gritten F, Johnson DM (1941) Individual differences in judging multiple-choice questions. J Educ
Psychol 32:423–430
Hevner KA (1932) A method of correcting for guessing in true-false tests and empirical evidence
in support of it. J Soc Psychol 3:359–362
Ling J, Cavers M (2015) Student-weighted multiple choice tests. In: Proceedings of the 2015
UC postsecondary conference on learning and teaching, PRISM: University of Calgary Digital
Repository
Shank P (2010) Create better multiple-choice questions, vol 27, Issue 1009. The American Society
for Training & Development, Alexandria
Soderquist HO (1936) A new method of weighting scores in a true-false test. J Educ Res 30:
290–292
Wiley LN, Trimble OC (1936) The ordinary objective test as a possible criterion of certain
personality traits. Sch Soc 43:446–448
Chapter 10
Improving the Robustness of Parametric
Imputation

Peisong Han

Abstract Parametric imputation is widely used in missing data analysis. When


the imputation model is misspecified, estimators based on parametric imputation
are usually inconsistent. In this case, we propose to estimate and subtract off the
asymptotic bias to obtain consistent estimators. Estimation of the bias involves
modeling the missingness mechanism, and we allow multiple models for it. Our
method simultaneously accommodates these models. The resulting estimator is
consistent if any one of the missingness mechanism models or the imputation model
is correctly specified.

10.1 Introduction

Imputation is a widely used method in missing data analysis, where the missing
values are filled in by imputed values and the analysis is done as if the data
were completely observed. Parametric imputation (Little and Rubin 2002; Rubin
1978, 1987), which imputes the missing values based on a parametric model, is
the most commonly taken form in practice, due to its simplicity and straight-
forwardness. However, parametric imputation is sensitive to misspecification of
the imputation model. The resulting estimators are usually inconsistent when the
model is misspecified. Because of this sensitivity, many researchers suggested
nonparametric imputation; for example, Cheng (1994), Lipsitz et al. (1998), Aerts
et al. (2002), Wang and Rao (2002), Zhou et al. (2008) and Wang and Chen (2009).
Despite the robustness against model misspecification, nonparametric imputation
usually suffers from the curse of dimensionality. In addition, for kernel-based
techniques, bandwidth selection could be a complicated problem. In this paper,
within the framework of parametric imputation, we propose a method to improve
the robustness against possible model misspecifications. The idea is to estimate and
subtract off the asymptotic bias of the imputation estimators when the imputation
model is misspecified. Estimation of the bias involves modeling the missingness


mechanism, and we allow multiple models for it. Our method simultaneously
accommodates these models. The resulting estimator is consistent if any one of
these models or the imputation model is correctly specified. A detailed numerical
study of the proposed method is presented.

10.2 Notations and Existing Methods

We consider the setting of estimating the population mean of an outcome of interest


that is subject to missingness. This simple yet important setting has been studied
in many recent works on missing data analysis, including Tan (2006, 2010), Kang
and Schafer (2007) and its discussion, Qin and Zhang (2007), Qin et al. (2008), Cao
et al. (2009), Rotnitzky et al. (2012), Han and Wang (2013), Chan and Yam (2014),
and Han (2014a).
Let Y denote an outcome that is subject to missingness, and X a vector of auxiliary variables that are always observed. Our goal is to estimate $\mu_0 = E(Y)$, the marginal mean of Y. Let R denote the indicator of observing Y; that is, $R = 1$ if Y is observed and $R = 0$ if Y is missing. Our observed data are $(R_i, R_i Y_i, X_i)$, $i = 1, \ldots, n$, which are independent and identically distributed. We assume the missingness mechanism to be missing-at-random (MAR) (Little and Rubin 2002) in the sense that

$$P(R = 1 \mid Y, X) = P(R = 1 \mid X),$$

and we use $\pi(X)$ to denote this probability. Under the MAR assumption, the sample average based on complete cases, namely $n^{-1}\sum_{i=1}^n R_i Y_i$, is not a consistent estimator of $\mu_0$.

Parametric imputation postulates a parametric model $a(\gamma; X)$ for $E(Y \mid X)$, and imputes the missing $Y_i$ by $a(\hat\gamma; X_i)$, where $\hat\gamma$ is some estimated value of $\gamma$. One typical way of calculating $\hat\gamma$ is the complete-case analysis, as $E(Y \mid X) = E(Y \mid X, R = 1)$ under the MAR mechanism. When $a(\gamma; X)$ is a correct model for $E(Y \mid X)$, in the sense that $a(\gamma_0; X) = E(Y \mid X)$ for some $\gamma_0$, the two imputation estimators of $\mu_0$,

$$\hat\mu_{imp,1} = \frac{1}{n}\sum_{i=1}^n a(\hat\gamma; X_i), \qquad
\hat\mu_{imp,2} = \frac{1}{n}\sum_{i=1}^n \{R_i Y_i + (1 - R_i)\, a(\hat\gamma; X_i)\},$$

are both consistent. When $a(\gamma; X)$ is a misspecified model, we have $\hat\gamma \xrightarrow{p} \gamma^*$ for some $\gamma^* \neq \gamma_0$. In this case neither $\hat\mu_{imp,1}$ nor $\hat\mu_{imp,2}$ is consistent. Their probability limits are $E\{a(\gamma^*; X)\}$ and $E\{RY + (1 - R)a(\gamma^*; X)\}$, respectively, and
their asymptotic biases are

$$\mathrm{bias}_1 = E\{a(\gamma^*; X) - Y\}, \qquad (10.1)$$

$$\mathrm{bias}_2 = E[\{1 - \pi(X)\}\{a(\gamma^*; X) - Y\}]. \qquad (10.2)$$

Thus, if a consistent estimator of $\mathrm{bias}_1$ (or $\mathrm{bias}_2$) can be found, subtracting this bias estimator off from $\hat\mu_{imp,1}$ (or $\hat\mu_{imp,2}$) leads to a consistent estimator of $\mu_0$.

Noticing that

$$\mathrm{bias}_1 = E\left[\frac{R}{\pi(X)}\{a(\gamma^*; X) - Y\}\right], \qquad
\mathrm{bias}_2 = E\left[\frac{R}{\pi(X)}\{1 - \pi(X)\}\{a(\gamma^*; X) - Y\}\right]$$

under the MAR mechanism, one straightforward way to obtain a consistent estimator of the asymptotic bias is to model $\pi(X)$. Let $\pi(\alpha; X)$ denote a parametric model for $\pi(X)$, and $\hat\alpha$ the maximizer of the Binomial likelihood

$$\prod_{i=1}^n \{\pi(\alpha; X_i)\}^{R_i}\{1 - \pi(\alpha; X_i)\}^{1 - R_i}. \qquad (10.3)$$

The two biases in (10.1) and (10.2) can be respectively estimated by

$$\widetilde{\mathrm{bias}}_1 = \frac{1}{n}\sum_{i=1}^n \left[\frac{R_i}{\pi(\hat\alpha; X_i)}\{a(\hat\gamma; X_i) - Y_i\}\right], \qquad
\widetilde{\mathrm{bias}}_2 = \frac{1}{n}\sum_{i=1}^n \left[\frac{R_i}{\pi(\hat\alpha; X_i)}\{1 - \pi(\hat\alpha; X_i)\}\{a(\hat\gamma; X_i) - Y_i\}\right].$$

It is easy to see that, when $\pi(\alpha; X)$ is a correctly specified model in the sense that $\pi(\alpha_0; X) = \pi(X)$ for some $\alpha_0$, $\widetilde{\mathrm{bias}}_1 \xrightarrow{p} \mathrm{bias}_1$ and $\widetilde{\mathrm{bias}}_2 \xrightarrow{p} \mathrm{bias}_2$, and thus both $\hat\mu_{imp,1} - \widetilde{\mathrm{bias}}_1$ and $\hat\mu_{imp,2} - \widetilde{\mathrm{bias}}_2$ are consistent estimators of $\mu_0$. On the other hand, when $a(\gamma; X)$ is a correctly specified model for $E(Y \mid X)$, we have

$$\widetilde{\mathrm{bias}}_1 \xrightarrow{p} E\left[\frac{R}{\pi(\alpha^*; X)}\{a(\gamma_0; X) - Y\}\right]
= E\left[\frac{E(R \mid X)}{\pi(\alpha^*; X)}\{a(\gamma_0; X) - E(Y \mid X)\}\right] = 0,$$

where $\alpha^*$ is the probability limit of $\hat\alpha$ and may not be equal to $\alpha_0$, and the second last equality follows from $R \perp Y \mid X$ under the MAR mechanism. Similarly, we have $\widetilde{\mathrm{bias}}_2 \xrightarrow{p} 0$. So $\hat\mu_{imp,1} - \widetilde{\mathrm{bias}}_1$ and $\hat\mu_{imp,2} - \widetilde{\mathrm{bias}}_2$ are again consistent estimators
of $\mu_0$. Therefore, $\hat\mu_{imp,1} - \widetilde{\mathrm{bias}}_1$ and $\hat\mu_{imp,2} - \widetilde{\mathrm{bias}}_2$ are more robust than $\hat\mu_{imp,1}$ and $\hat\mu_{imp,2}$ against possible misspecification of $a(\gamma; X)$.

Actually, $\hat\mu_{imp,1} - \widetilde{\mathrm{bias}}_1$ and $\hat\mu_{imp,2} - \widetilde{\mathrm{bias}}_2$ are both equal to

$$\hat\mu_{aipw} = \frac{1}{n}\sum_{i=1}^n \left\{\frac{R_i}{\pi(\hat\alpha; X_i)}Y_i - \frac{R_i - \pi(\hat\alpha; X_i)}{\pi(\hat\alpha; X_i)}\,a(\hat\gamma; X_i)\right\},$$

which is the augmented inverse probability weighted (AIPW) estimator (Robins et al. 1994). The fact that $\hat\mu_{aipw} \xrightarrow{p} \mu_0$ if either $\pi(X)$ or $E(Y \mid X)$ is correctly modeled is known as the double robustness property (Scharfstein et al. 1999; Tsiatis 2006). The improvement of robustness is achieved by introducing an extra model $\pi(\alpha; X)$ in addition to $a(\gamma; X)$.
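To make the preceding estimators concrete, the following is a minimal R sketch (not the authors' code) of $\hat\mu_{imp,1}$, $\hat\mu_{imp,2}$ and the bias-corrected (AIPW) estimator. The data-frame layout (columns y, r, x1-x4) and the linear and logistic working models are illustrative assumptions.

## dat: data.frame with columns y (NA when missing), r (response indicator 0/1),
##      and covariates x1-x4; the working models below are assumptions.
aipw.example <- function(dat) {
  ## outcome (imputation) model a(gamma; X), fitted on the complete cases
  fit.a  <- lm(y ~ x1 + x2 + x3 + x4, data = dat, subset = r == 1)
  a.hat  <- predict(fit.a, newdata = dat)
  ## missingness model pi(alpha; X), fitted by logistic regression
  fit.pi <- glm(r ~ x1 + x2 + x3 + x4, family = binomial, data = dat)
  pi.hat <- fitted(fit.pi)

  y0 <- ifelse(dat$r == 1, dat$y, 0)      # missing Y's never enter the sums below
  mu.imp1 <- mean(a.hat)
  mu.imp2 <- mean(y0 + (1 - dat$r) * a.hat)
  ## estimated asymptotic biases (the tilde-bias terms) and the corrected
  ## estimators, which both coincide with the AIPW estimator
  bias1 <- mean(dat$r / pi.hat * (a.hat - y0))
  bias2 <- mean(dat$r / pi.hat * (1 - pi.hat) * (a.hat - y0))
  c(imp1 = mu.imp1, imp2 = mu.imp2,
    aipw = mu.imp1 - bias1, aipw.check = mu.imp2 - bias2)
}

Any parametric working models can be substituted; only the fitted values a.hat and pi.hat enter the estimators, and the last two returned components agree exactly, as noted above.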

10.3 The Proposed Method

In observational studies, the correct model for $\pi(X)$ is typically unknown. To increase the likelihood of correct specification, multiple models $\pi^j(\alpha^j; X)$, $j = 1, \ldots, J$, may be fitted instead of just one. Refer to Robins et al. (2007) for more discussion on some practical scenarios where multiple models may be fitted. Our goal is to propose a method that simultaneously accommodates all these models, so that $\mathrm{bias}_1$ and $\mathrm{bias}_2$ in (10.1) and (10.2) are consistently estimated if one of the $\pi^j(\alpha^j; X)$ is correct or $a(\gamma; X)$ is correct.
Because of

$$0 = E\left(\frac{R}{\pi(X)}\left[\pi^j(\alpha^{j*}; X) - E\{\pi^j(\alpha^{j*}; X)\}\right]\right)
= E\left(\frac{1}{\pi(X)}\left[\pi^j(\alpha^{j*}; X) - E\{\pi^j(\alpha^{j*}; X)\}\right] \,\Big|\, R = 1\right)P(R = 1),$$

the models $\pi^j(\alpha^j; X)$, $j = 1, \ldots, J$, satisfy

$$E\left(\frac{1}{\pi(X)}\left[\pi^j(\alpha^{j*}; X) - E\{\pi^j(\alpha^{j*}; X)\}\right] \,\Big|\, R = 1\right) = 0. \qquad (10.4)$$

Let $m = \sum_{i=1}^n R_i$ be the number of subjects who have their outcome observed, and index those subjects by $i = 1, \ldots, m$ without loss of generality. Let $\hat\alpha^j$ denote the maximizer of the Binomial likelihood (10.3) with $\pi^j(\alpha^j; X)$. We construct the
empirical version of (10.4) as

$$w_i \ge 0 \ (i = 1, \ldots, m), \qquad \sum_{i=1}^m w_i = 1,$$
$$\sum_{i=1}^m w_i\{\pi^j(\hat\alpha^j; X_i) - \hat\theta^j(\hat\alpha^j)\} = 0 \quad (j = 1, \ldots, J), \qquad (10.5)$$

where $w_i$, $i = 1, \ldots, m$, are positive weights on the complete cases that sum to one, and $\hat\theta^j(\hat\alpha^j) = n^{-1}\sum_{i=1}^n \pi^j(\hat\alpha^j; X_i)$. The $w_i$ naturally accommodate all models $\pi^j(\alpha^j; X)$.
Being positive and sum-to-one, the $w_i$ may be viewed as an empirical likelihood on the complete cases. Applying the principle of maximum likelihood, we maximize $\prod_{i=1}^m w_i$ subject to the constraints in (10.5). Write $\hat\alpha^T = \{(\hat\alpha^1)^T, \ldots, (\hat\alpha^J)^T\}$ and

$$\hat g_i(\hat\alpha)^T = \{\pi^1(\hat\alpha^1; X_i) - \hat\theta^1(\hat\alpha^1), \ldots, \pi^J(\hat\alpha^J; X_i) - \hat\theta^J(\hat\alpha^J)\}.$$

From empirical likelihood theory (Owen 1988, 2001; Qin and Lawless 1994) the maximizer is given by

$$\hat w_i = \frac{1}{m}\,\frac{1}{1 + \hat\rho^T \hat g_i(\hat\alpha)} \quad (i = 1, \ldots, m), \qquad (10.6)$$

where $\hat\rho^T = (\hat\rho_1, \ldots, \hat\rho_J)$ is the J-dimensional Lagrange multiplier solving

$$\frac{1}{m}\sum_{i=1}^m \frac{\hat g_i(\hat\alpha)}{1 + \rho^T \hat g_i(\hat\alpha)} = 0. \qquad (10.7)$$

We propose to estimate $\mathrm{bias}_1$ and $\mathrm{bias}_2$ in (10.1) and (10.2) respectively by

$$\widehat{\mathrm{bias}}_1 = \sum_{i=1}^m \hat w_i\{a(\hat\gamma; X_i) - Y_i\}, \qquad
\widehat{\mathrm{bias}}_2 = \sum_{i=1}^m (\hat w_i - n^{-1})\{a(\hat\gamma; X_i) - Y_i\}.$$

To discuss the large sample properties of $\widehat{\mathrm{bias}}_1$ and $\widehat{\mathrm{bias}}_2$, let $\alpha^{j*}$, $\theta^{j*}$ and $\rho^*$ denote the probability limits of $\hat\alpha^j$, $\hat\theta^j(\hat\alpha^j)$ and $\hat\rho$, respectively. It is clear that $\theta^{j*} = E\{\pi^j(\alpha^{j*}; X)\}$. Write $\alpha^{*T} = \{(\alpha^{1*})^T, \ldots, (\alpha^{J*})^T\}$ and

$$g(\alpha^*)^T = \{\pi^1(\alpha^{1*}; X) - \theta^{1*}, \ldots, \pi^J(\alpha^{J*}; X) - \theta^{J*}\}. \qquad (10.8)$$
When $a(\gamma; X)$ is a correct model for $E(Y \mid X)$, we have $\hat\gamma \xrightarrow{p} \gamma_0$. Therefore,

$$\widehat{\mathrm{bias}}_1 = \sum_{i=1}^m \hat w_i\{a(\hat\gamma; X_i) - Y_i\}
= \frac{n}{m}\cdot\frac{1}{n}\sum_{i=1}^n \frac{R_i}{1 + \hat\rho^T \hat g_i(\hat\alpha)}\{a(\hat\gamma; X_i) - Y_i\}$$
$$\xrightarrow{p} \frac{1}{P(R = 1)}E\left[\frac{R}{1 + \rho^{*T}g(\alpha^*)}\{a(\gamma_0; X) - Y\}\right]
= \frac{1}{P(R = 1)}E\left[\frac{E(R \mid X)}{1 + \rho^{*T}g(\alpha^*)}\{a(\gamma_0; X) - E(Y \mid X)\}\right] = 0,$$

where the second last equality follows from $R \perp Y \mid X$ under the MAR mechanism. Similarly,

$$\widehat{\mathrm{bias}}_2 = \sum_{i=1}^m (\hat w_i - n^{-1})\{a(\hat\gamma; X_i) - Y_i\}
= \frac{n}{m}\cdot\frac{1}{n}\sum_{i=1}^n R_i\left\{\frac{1}{1 + \hat\rho^T \hat g_i(\hat\alpha)} - \frac{m}{n}\right\}\{a(\hat\gamma; X_i) - Y_i\}$$
$$\xrightarrow{p} \frac{1}{P(R = 1)}E\left[R\left\{\frac{1}{1 + \rho^{*T}g(\alpha^*)} - P(R = 1)\right\}\{a(\gamma_0; X) - Y\}\right] = 0.$$

Hence, both $\hat\mu_{imp,1} - \widehat{\mathrm{bias}}_1$ and $\hat\mu_{imp,2} - \widehat{\mathrm{bias}}_2$ are consistent estimators of $\mu_0$ when $a(\gamma; X)$ is a correct model for $E(Y \mid X)$.
In the following, we show that $\widehat{\mathrm{bias}}_1$ and $\widehat{\mathrm{bias}}_2$ are consistent estimators of $\mathrm{bias}_1$ and $\mathrm{bias}_2$, respectively, when any one of the $\pi^j(\alpha^j; X)$ is a correct model for $\pi(X)$. Without loss of generality, let the correct model be $\pi^1(\alpha^1; X)$. It is easy to see that

$$\frac{1}{m}\sum_{i=1}^m \frac{\hat g_i(\hat\alpha)}{1 + \rho^T \hat g_i(\hat\alpha)}
= \frac{\hat\theta^1(\hat\alpha^1)}{m}\sum_{i=1}^m \frac{\hat g_i(\hat\alpha)}{\pi^1(\hat\alpha^1; X_i) + \left\{\hat\theta^1(\hat\alpha^1)\rho_1 - 1,\ \hat\theta^1(\hat\alpha^1)\rho_2,\ \ldots,\ \hat\theta^1(\hat\alpha^1)\rho_J\right\}^T\hat g_i(\hat\alpha)}$$
$$= \frac{\hat\theta^1(\hat\alpha^1)}{m}\sum_{i=1}^m \frac{\hat g_i(\hat\alpha)/\pi^1(\hat\alpha^1; X_i)}{1 + \left\{\hat\theta^1(\hat\alpha^1)\rho_1 - 1,\ \hat\theta^1(\hat\alpha^1)\rho_2,\ \ldots,\ \hat\theta^1(\hat\alpha^1)\rho_J\right\}^T\hat g_i(\hat\alpha)/\pi^1(\hat\alpha^1; X_i)}.$$
Because $\hat\rho$ solves (10.7), if we define $\hat\eta_1 = \hat\theta^1(\hat\alpha^1)\hat\rho_1 - 1$ and $\hat\eta_t = \hat\theta^1(\hat\alpha^1)\hat\rho_t$, $t = 2, \ldots, J$, then $\hat\eta^T = (\hat\eta_1, \ldots, \hat\eta_J)$ solves

$$\frac{\hat\theta^1(\hat\alpha^1)}{m}\sum_{i=1}^m \frac{\hat g_i(\hat\alpha)/\pi^1(\hat\alpha^1; X_i)}{1 + \eta^T\hat g_i(\hat\alpha)/\pi^1(\hat\alpha^1; X_i)} = 0. \qquad (10.9)$$

In terms of $\hat\eta$, the $\hat w_i$ given by (10.6) can be re-expressed as

$$\hat w_i = \frac{1}{m}\,\frac{\hat\theta^1(\hat\alpha^1)/\pi^1(\hat\alpha^1; X_i)}{1 + \hat\eta^T\hat g_i(\hat\alpha)/\pi^1(\hat\alpha^1; X_i)}.$$

Since $\pi^1(\alpha^1; X)$ is a correct model for $\pi(X)$, we have $\alpha^{1*} = \alpha^1_0$, and thus $\eta = 0$ is the solution to

$$E\left[\frac{R\,g(\alpha^*)/\pi^1(\alpha^{1*}; X)}{1 + \eta^T g(\alpha^*)/\pi^1(\alpha^{1*}; X)}\right] = 0,$$

where $g(\alpha^*)$ is given by (10.8). In addition, the left-hand side of (10.9) converges in probability to the left-hand side of the above equation. Therefore, $\hat\eta$ solving (10.9) has probability limit 0 and is of order $O_p(n^{-1/2})$ from the M-estimator theory (e.g., van der Vaart 1998). With this fact, we have

$$\widehat{\mathrm{bias}}_1 = \sum_{i=1}^m \hat w_i\{a(\hat\gamma; X_i) - Y_i\}
= \frac{n\hat\theta^1(\hat\alpha^1)}{m}\left[\frac{1}{n}\sum_{i=1}^n \frac{R_i/\pi^1(\hat\alpha^1; X_i)}{1 + \hat\eta^T\hat g_i(\hat\alpha)/\pi^1(\hat\alpha^1; X_i)}\{a(\hat\gamma; X_i) - Y_i\}\right]$$
$$\xrightarrow{p} E\left[\frac{R}{\pi(X)}\{a(\gamma^*; X) - Y\}\right] = \mathrm{bias}_1$$

and

$$\widehat{\mathrm{bias}}_2 = \sum_{i=1}^m (\hat w_i - n^{-1})\{a(\hat\gamma; X_i) - Y_i\}
= \frac{n\hat\theta^1(\hat\alpha^1)}{m}\left[\frac{1}{n}\sum_{i=1}^n \frac{R_i}{\pi^1(\hat\alpha^1; X_i)}\left\{\frac{1}{1 + \hat\eta^T\hat g_i(\hat\alpha)/\pi^1(\hat\alpha^1; X_i)} - \frac{m}{n\hat\theta^1(\hat\alpha^1)}\pi^1(\hat\alpha^1; X_i)\right\}\{a(\hat\gamma; X_i) - Y_i\}\right]$$
$$\xrightarrow{p} E\left[\frac{R}{\pi(X)}\{1 - \pi(X)\}\{a(\gamma^*; X) - Y\}\right] = \mathrm{bias}_2.$$
Thus, $\widehat{\mathrm{bias}}_1$ and $\widehat{\mathrm{bias}}_2$ are consistent estimators of $\mathrm{bias}_1$ and $\mathrm{bias}_2$, respectively, which makes $\hat\mu_{imp,1} - \widehat{\mathrm{bias}}_1$ and $\hat\mu_{imp,2} - \widehat{\mathrm{bias}}_2$ consistent estimators of $\mu_0$.
As a matter of fact, simple algebra shows that $\hat\mu_{imp,1} - \widehat{\mathrm{bias}}_1$ is equal to $\hat\mu_{imp,2} - \widehat{\mathrm{bias}}_2$. Let $\hat\mu_{mr}$ denote this difference, where "mr" stands for multiple robustness. Based on our arguments above, $\hat\mu_{mr}$ is a consistent estimator of $\mu_0$ if any one of $\pi^j(\alpha^j; X)$, $j = 1, \ldots, J$, is a correct model for $\pi(X)$ or $a(\gamma; X)$ is a correct model for $E(Y \mid X)$. Therefore, $\hat\mu_{mr}$ improves the robustness over the imputation estimators and the AIPW estimator. The asymptotic distribution of $\hat\mu_{mr}$ depends on which model is correctly specified, but such information is usually unavailable in real studies. This makes the asymptotic distribution of little practical use for inference. Hence, we choose not to derive the asymptotic distribution, but rather recommend the bootstrapping method to calculate the standard error of $\hat\mu_{mr}$. Numerical performance of the bootstrapping method will be evaluated in the next section.
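As a concrete illustration, a nonparametric bootstrap standard error can be computed as in the following minimal R sketch; the function mu.mr(), which returns the proposed estimator for a given data set, is hypothetical here and stands for any implementation of $\hat\mu_{mr}$.

## Bootstrap standard error for a generic estimator mu.mr (a hypothetical
## function returning the point estimate from a data.frame); B is the
## re-sampling size used in Sect. 10.4.
boot.se <- function(dat, mu.mr, B = 200) {
  n   <- nrow(dat)
  est <- replicate(B, mu.mr(dat[sample(n, n, replace = TRUE), , drop = FALSE]))
  sd(est)
}
## A 95% confidence interval is then mu.mr(dat) +/- 1.96 * boot.se(dat, mu.mr).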
A main step in the numerical implementation of the proposed method is to calculate $\hat\rho$. Since $\hat\rho$ solves (10.7) and the $\hat w_i$ are positive, $\hat\rho$ is actually the minimizer of

$$F_n(\rho) = -\frac{1}{n}\sum_{i=1}^n R_i\log\{1 + \rho^T\hat g_i(\hat\alpha)\}$$

over the region $D_n = \{\rho : 1 + \rho^T\hat g_i(\hat\alpha) > 0,\ i = 1, \ldots, m\}$. Han (2014a) showed that the minimizer of $F_n(\rho)$ over $D_n$ indeed exists, and $\hat\rho$ is the unique and global minimizer, at least when n is large. On the other hand, it is easy to verify that $D_n$ is an open convex set and $F_n(\rho)$ is a strictly convex function. Therefore, calculating $\hat\rho$ pertains to a convex minimization problem. Refer to Chen et al. (2002) and Han (2014a) for a detailed description of the numerical implementation using the Newton-Raphson algorithm.
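The following is a minimal R sketch (our own naming, not the authors' implementation) of this convex minimization by Newton-Raphson with step halving to stay inside $D_n$; gmat is assumed to be the $m \times J$ matrix whose i-th row is $\hat g_i(\hat\alpha)$ for the i-th complete case, and n is the full sample size.

## Empirical likelihood weights via Newton-Raphson minimization of F_n(rho).
el.weights <- function(gmat, n) {
  m   <- nrow(gmat)
  rho <- rep(0, ncol(gmat))                        # rho = 0 lies inside D_n
  for (iter in 1:100) {
    denom <- 1 + drop(gmat %*% rho)                # 1 + rho' g_i, complete cases only
    grad  <- -colSums(gmat / denom) / n            # gradient of F_n(rho)
    hess  <- t(gmat / denom) %*% (gmat / denom) / n  # Hessian of F_n(rho)
    step  <- solve(hess, grad)
    s <- 1                                         # step halving keeps 1 + rho' g_i > 0
    while (any(1 + drop(gmat %*% (rho - s * step)) <= 0)) s <- s / 2
    rho.new <- rho - s * step
    if (max(abs(rho.new - rho)) < 1e-10) { rho <- rho.new; break }
    rho <- rho.new
  }
  w <- 1 / (m * (1 + drop(gmat %*% rho)))          # the weights w-hat_i of (10.6)
  list(rho = rho, w = w)
}

The proposed estimator is then obtained by plugging the returned weights into $\widehat{\mathrm{bias}}_1$ and subtracting the result from $\hat\mu_{imp,1}$.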

10.4 Simulation Study

Our simulation setting follows that in Kang and Schafer (2007). The data are generated with $X = \{X^{(1)}, \ldots, X^{(4)}\} \sim N(0, I_4)$, $Y \mid X \sim N\{E(Y \mid X), 1\}$, and $R \mid X \sim \mathrm{Ber}\{\pi(X)\}$, where $I_4$ is the $4\times4$ identity matrix,

$$\pi(X) = [1 + \exp\{X^{(1)} - 0.5X^{(2)} + 0.25X^{(3)} + 0.1X^{(4)}\}]^{-1},$$
$$E(Y \mid X) = 210 + 27.4X^{(1)} + 13.7\{X^{(2)} + \cdots + X^{(4)}\}.$$

This $\pi(X)$ leads to approximately 50% of the subjects having missing Y. As in Kang and Schafer (2007), we calculate $Z^{(1)} = \exp\{X^{(1)}/2\}$, $Z^{(2)} = X^{(2)}/[1 + \exp\{X^{(1)}\}] + 10$, $Z^{(3)} = \{X^{(1)}X^{(3)}/25 + 0.6\}^3$ and $Z^{(4)} = \{X^{(2)} + X^{(4)} + 20\}^2$. The correct models for $\pi(X)$ and $E(Y \mid X)$ are given by

$$\pi^1(\alpha^1; X) = [1 + \exp\{\alpha^1_1 + \alpha^1_2 X^{(1)} + \cdots + \alpha^1_5 X^{(4)}\}]^{-1},$$
$$a^1(\gamma^1; X) = \gamma^1_1 + \gamma^1_2 X^{(1)} + \cdots + \gamma^1_5 X^{(4)},$$
respectively. The incorrect models are fitted by $\pi^2(\alpha^2; Z)$ and $a^2(\gamma^2; Z)$, which replace X in $\pi^1(\alpha^1; X)$ and $a^1(\gamma^1; X)$ by $Z = \{Z^{(1)}, \ldots, Z^{(4)}\}$. As in Robins et al. (2007) and Rotnitzky et al. (2012), we also consider the scenario where Y is observed when $R = 0$ instead of when $R = 1$. The estimators under our comparison include the inverse probability weighted (IPW) estimator (Horvitz and Thompson 1952)

$$\hat\mu_{ipw} = \frac{\sum_{i=1}^n R_i Y_i/\pi(\hat\alpha; X_i)}{\sum_{i=1}^n R_i/\pi(\hat\alpha; X_i)},$$

the imputation estimators $\hat\mu_{imp,1}$ and $\hat\mu_{imp,2}$, the AIPW estimator $\hat\mu_{aipw}$, and the estimator $\hat\mu_{rlsr}$ proposed by Rotnitzky et al. (2012). We use a four-digit superscript to distinguish estimators constructed using different postulated models, with each digit, from left to right, indicating whether $\pi^1(\alpha^1; X)$, $\pi^2(\alpha^2; Z)$, $a^1(\gamma^1; X)$ and $a^2(\gamma^2; Z)$ is used, respectively. We take sample sizes $n = 200, 800$ and conduct 2000 replications for the simulation study. The results are summarized in Table 10.1.
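For reference, the data-generating process just described can be coded as in the following minimal R sketch (the function name and the data-frame layout are our own choices, not part of the chapter).

## Kang and Schafer (2007) setting: X ~ N(0, I_4), Y | X normal, R | X Bernoulli.
gen.ks <- function(n) {
  X  <- matrix(rnorm(n * 4), n, 4)
  y  <- 210 + 27.4 * X[, 1] + 13.7 * (X[, 2] + X[, 3] + X[, 4]) + rnorm(n)
  pi <- 1 / (1 + exp(X[, 1] - 0.5 * X[, 2] + 0.25 * X[, 3] + 0.1 * X[, 4]))
  r  <- rbinom(n, 1, pi)
  ## transformed covariates entering the misspecified working models
  Z  <- cbind(exp(X[, 1] / 2),
              X[, 2] / (1 + exp(X[, 1])) + 10,
              (X[, 1] * X[, 3] / 25 + 0.6)^3,
              (X[, 2] + X[, 4] + 20)^2)
  data.frame(y = ifelse(r == 1, y, NA), r = r, X = X, Z = Z)
}
## The true mean is mu_0 = 210; about half of the Y's are missing.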
Both imputation estimators $\hat\mu_{imp,1}$ and $\hat\mu_{imp,2}$ have large bias when $E(Y \mid X)$ is incorrectly modeled by $a^2(\gamma^2; Z)$, especially in the second scenario where Y is observed if $R = 0$. Using the correct model $\pi^1(\alpha^1; X)$ for $\pi(X)$, our proposed estimators $\hat\mu^{1001}_{mr}$ and $\hat\mu^{1101}_{mr}$ are able to significantly reduce the bias. While the existing estimators $\hat\mu^{1001}_{aipw}$ and $\hat\mu^{1001}_{rlsr}$ have similar capability, they explicitly require knowing which model for $\pi(X)$ is correct. Our estimator $\hat\mu^{1101}_{mr}$, on the contrary, accommodates both $\pi^1(\alpha^1; X)$ and $\pi^2(\alpha^2; Z)$ to reduce the bias without requiring such knowledge. This is important in practice, as it is usually impossible to tell which one among the multiple fitted models is correct. The estimator $\hat\mu^{1101}_{mr}$ also has high efficiency, illustrated by its significantly smaller root mean square error compared to the AIPW estimator $\hat\mu^{1001}_{aipw}$. It is well known that this AIPW estimator using an incorrect model for $E(Y \mid X)$ can be very inefficient (e.g., Tan 2006, 2010; Cao et al. 2009), and the estimator $\hat\mu^{1001}_{rlsr}$ was proposed to improve the efficiency over $\hat\mu^{1001}_{aipw}$. Our estimator $\hat\mu^{1101}_{mr}$ has efficiency even comparable to $\hat\mu^{1001}_{rlsr}$, judging by their root mean square errors.
When only the incorrect model $\pi^2(\alpha^2; Z)$ is used in addition to $a^2(\gamma^2; Z)$, our estimator $\hat\mu^{0101}_{mr}$ is not consistent, the same as $\hat\mu^{0101}_{aipw}$ and $\hat\mu^{0101}_{rlsr}$. In this case, similar to $\hat\mu^{0101}_{rlsr}$, $\hat\mu^{0101}_{mr}$ has much more stable numerical performance than $\hat\mu^{0101}_{aipw}$ in the first scenario where Y is observed if $R = 1$. Here the poor performance of $\hat\mu^{0101}_{aipw}$ is because some $\pi^2(\hat\alpha^2; Z_i)$ for a few subjects with $R_i = 1$ are erroneously close to zero, yielding extremely large weights $R_i/\pi^2(\hat\alpha^2; Z_i)$ (Robins et al. 2007).

Table 10.1 Comparison of different estimators based on 2000 replications. Each digit of the four-digit superscript, from left to right, indicates if $\pi^1(\alpha^1; X)$, $\pi^2(\alpha^2; Z)$, $a^1(\gamma^1; X)$ and $a^2(\gamma^2; Z)$ is used, respectively. The numbers have been multiplied by 100

                          Y observed if and only if R = 1     |  Y observed if and only if R = 0
                          n = 200           |  n = 800         |  n = 200           |  n = 800
Estimator                 Bias  RMSE  MAE   |  Bias  RMSE  MAE |  Bias  RMSE  MAE   |  Bias  RMSE  MAE
$\hat\mu_{ipw}^{1000}$    9     388   241   |  5     202   126 |  15    427   253   |  8     195   125
$\hat\mu_{ipw}^{0100}$    154   863   315   |  505   1252  265 |  381   509   387   |  371   407   373
$\hat\mu_{imp,1}^{0010}$  4     261   180   |  1     127   84  |  4     261   179   |  1     127   85
$\hat\mu_{imp,1}^{0001}$  52    338   227   |  81    184   131 |  497   584   500   |  496   518   496
$\hat\mu_{imp,2}^{0010}$  4     261   180   |  1     127   84  |  4     261   179   |  1     127   85
$\hat\mu_{imp,2}^{0001}$  52    338   227   |  81    184   131 |  497   584   500   |  496   518   496
$\hat\mu_{aipw}^{1010}$   4     261   179   |  1     127   84  |  4     261   179   |  1     127   85
$\hat\mu_{aipw}^{1001}$   35    358   233   |  4     190   116 |  43    432   252   |  16    192   120
$\hat\mu_{aipw}^{0110}$   3     261   179   |  10    456   86  |  4     261   180   |  1     127   85
$\hat\mu_{aipw}^{0101}$   693   5688  361   |  2397  37088 510 |  326   462   350   |  308   348   311
$\hat\mu_{rlsr}^{1010}$   4     261   179   |  1     127   84  |  4     261   179   |  1     127   85
$\hat\mu_{rlsr}^{1001}$   31    297   202   |  7     137   91  |  116   312   210   |  34    141   94
$\hat\mu_{rlsr}^{0110}$   3     262   178   |  2     129   84  |  4     261   178   |  1     127   86
$\hat\mu_{rlsr}^{0101}$   170   356   244   |  237   345   261 |  254   396   287   |  157   221   163
$\hat\mu_{mr}^{1001}$     49    344   233   |  12    170   113 |  62    308   214   |  18    150   97
$\hat\mu_{mr}^{0101}$     244   417   295   |  314   432   314 |  300   435   328   |  277   318   278
$\hat\mu_{mr}^{1101}$     8     309   214   |  1     156   107 |  85    308   211   |  27    148   94
$\hat\mu_{mr}^{1010}$     4     261   179   |  1     127   84  |  4     261   179   |  1     127   85
$\hat\mu_{mr}^{0110}$     4     261   179   |  1     127   85  |  4     261   179   |  1     127   85
$\hat\mu_{mr}^{1110}$     4     261   180   |  1     127   84  |  4     261   179   |  1     127   85
RMSE root mean square error, MAE median absolute error

This also explains the problematic performance of the corresponding IPW estimator $\hat\mu^{0100}_{ipw}$. Our estimator $\hat\mu^{0101}_{mr}$ is not affected much by the close-to-zero $\pi^2(\hat\alpha^2; Z_i)$ because it uses weights $\hat w_i$ that maximize $\prod_{i=1}^m w_i$. The maximization prevents the occurrence of extreme weights for our proposed method.
When the correct model $a^1(\gamma^1; X)$ for $E(Y \mid X)$ is used, the proposed estimators $\hat\mu^{1010}_{mr}$, $\hat\mu^{0110}_{mr}$ and $\hat\mu^{1110}_{mr}$ have almost identical performance to the imputation estimators $\hat\mu^{0010}_{imp,1}$ and $\hat\mu^{0010}_{imp,2}$.
Table 10.2 summarizes the performance of the bootstrapping method in calculating the standard error of the proposed estimator. The re-sampling size is 200. The means of the bootstrapping-based standard errors over the 2000 replications are close to the corresponding empirical standard errors. In addition, except for the case where the proposed estimator is inconsistent (i.e. $\hat\mu^{0101}_{mr}$), the 95% bootstrapping-based confidence intervals have coverage probabilities very close to 95%. These observations demonstrate that the bootstrapping method provides a reliable way to make statistical inference.

Table 10.2 Bootstrapping method for the calculation of standard errors based on 2000 replications with re-sampling size 200. Each digit of the four-digit superscript, from left to right, indicates if $\pi^1(\alpha^1; X)$, $\pi^2(\alpha^2; Z)$, $a^1(\gamma^1; X)$ and $a^2(\gamma^2; Z)$ is used, respectively. Except for the percentages, the numbers have been multiplied by 100

                        Y observed if and only if R = 1     |  Y observed if and only if R = 0
                        n = 200           |  n = 800         |  n = 200           |  n = 800
Estimator               EMP  EST  PER     |  EMP  EST  PER   |  EMP  EST  PER     |  EMP  EST  PER
$\hat\mu_{mr}^{1001}$   338  326  94.3%   |  171  167  94.4% |  295  285  93.3%   |  148  146  93.5%
$\hat\mu_{mr}^{0101}$   346  334  87.4%   |  226  176  54.3% |  303  298  82.3%   |  152  151  55.0%
$\hat\mu_{mr}^{1101}$   311  309  94.2%   |  156  155  94.5% |  284  281  93.5%   |  145  143  93.8%
$\hat\mu_{mr}^{1010}$   253  255  95.2%   |  125  128  95.2% |  253  255  95.1%   |  125  128  94.9%
$\hat\mu_{mr}^{0110}$   253  255  95.2%   |  125  128  95.3% |  253  255  95.0%   |  125  128  94.9%
$\hat\mu_{mr}^{1110}$   253  255  95.2%   |  125  128  95.2% |  253  255  95.1%   |  125  128  94.9%
EMP empirical standard error, EST mean of estimated standard error, PER percentage out of 2000 replications that the 95% confidence interval based on the estimated standard error covers $\mu_0$

10.5 Discussion

In the literature, many researchers model both $\pi(X)$ and $E(Y \mid X)$ to improve the robustness against model misspecification over the IPW estimator and the imputation estimator. It has been a common practice to propose and study new estimators by incorporating the imputation approach into the weighting approach (e.g., Robins et al. 1994; Tan 2006, 2010; Tsiatis 2006; Qin and Zhang 2007; Cao et al. 2009; Rotnitzky et al. 2012; Han and Wang 2013; Chan and Yam 2014; Han 2014a,b). We took an alternative view by incorporating the weighting approach into the imputation approach. This is similar to Qin et al. (2008). But the estimator proposed by Qin et al. (2008) can only take one model for $\pi(X)$, resulting in double robustness.
Although the proposed idea was described in the setting of estimating the population mean, it can be easily extended to regression settings with missing responses and/or covariates. We leave this straightforward extension to empirical researchers who choose to apply the proposed idea.

References

Aerts M, Claeskens G, Hens N, Molenberghs G (2002) Local multiple imputation. Biometrika


89:375–388
Cao W, Tsiatis AA, Davidian M (2009) Improving efficiency and robustness of the doubly robust
estimator for a population mean with incomplete data. Biometrika 96:723–734
Chan KCG, Yam SCP (2014) Oracle, multiple robust and multipurpose calibration in a missing
response problem. Stat Sci 29:380–396
Chen J, Sitter RR, Wu C (2002) Using empirical likelihood methods to obtain range restricted
weights in regression estimators for surveys. Biometrika 89:230–237

Cheng PE (1994) Nonparametric estimation of mean functionals with data missing at random. J
Am Stat Assoc 89:81–87
Han P (2014a) A further study of the multiply robust estimator in missing data analysis. J Stat Plan
Inference 148:101–110
Han P (2014b) Multiply robust estimation in regression analysis with missing data. J Am Stat
Assoc 109:1159–1173
Han P, Wang L (2013) Estimation with missing data: beyond double robustness. Biometrika
100:417–430
Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite
universe. J Am Stat Assoc 47:663–685
Kang JDY, Schafer JL (2007) Demystifying double robustness: a comparison of alternative
strategies for estimating a population mean from incomplete data (with discussion). Stat Sci
22:523–539
Lipsitz SR, Zhao LP, Molenberghs G (1998) A semiparametric method of multiple imputation. J
R Stat Soc Ser B 60:127–144
Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York
Owen A (1988) Empirical likelihood ratio confidence intervals for a single functional. Biometrika
75:237–249
Owen A (2001) Empirical likelihood. Chapman & Hall/CRC Press, New York
Qin J, Lawless J (1994) Empirical likelihood and general estimating equations. Ann Stat 22:300–
325
Qin J, Shao J, Zhang B (2008) Efficient and doubly robust imputation for covariate-dependent
missing responses. J Am Stat Assoc 103:797–810
Qin J, Zhang B (2007) Empirical-likelihood-based inference in missing response problems and its
application in observational studies. J R Stat Soc Ser B 69:101–122
Robins JM, Rotnitzky A, Zhao LP (1994) Estimation of regression coefficients when some
regressors are not always observed. J Am Stat Assoc 89:846–866
Robins JM, Sued M, Gomez-Lei Q, Rotnitzky A (2007) Comment: performance of double-robust
estimators when “inverse probability” weights are highly variable. Stat Sci 22:544–559
Rotnitzky A, Lei Q, Sued M, Robins JM (2012) Improved double-robust estimation in missing data
and causal inference models. Biometrika 99:439–456
Rubin DB (1978) Multiple imputations in sample surveys. In: Proceedings of the survey research
methods section, American Statistical Association, Washington, DC, pp 20–34
Rubin DB (1987) Multiple imputation for nonresponse in surveys. Wiley, New York
Scharfstein DO, Rotnitzky A, Robins JM (1999) Adjusting for nonignorable drop-out using
semiparametric nonresponse models. J Am Stat Assoc 94:1096–1120
Tan Z (2006) A distributional approach for causal inference using propensity scores. J Am Stat
Assoc 101:1619–1637
Tan Z (2010) Bounded efficient and doubly robust estimation with inverse weighting. Biometrika
97:661–682
Tsiatis AA (2006) Semiparametric theory and missing data. Springer, New York
van der Vaart AW (1998) Asymptotic statistics. Cambridge University Press, Cambridge
Wang D, Chen SX (2009) Empirical likelihood for estimating equations with missing values. Ann
Stat 37:490–517
Wang Q, Rao JNK (2002) Empirical likelihood-based inference under imputation for missing
response data. Ann Stat 30:896–924
Zhou Y, Wan ATK, Wang X (2008) Estimating equations inference with missing data. J Am Stat
Assoc 103:1187–1199
Chapter 11
Maximum Smoothed Likelihood Estimation
of the Centre of a Symmetric Distribution

Pengfei Li and Zhaoyang Tian

Abstract Estimating the centre of a symmetric distribution is one of the basic


and important problems in statistics. Given a random sample from the symmetric
distribution, natural estimators of the centre are the sample mean and sample
median. However, these two estimators are either not robust or inefficient. Other
estimators, such as Hodges-Lehmann estimator (Hodges and Lehmann, Ann Math
Stat 34:598–611, 1963), the location M-estimator (Huber, Ann Math Stat 35:73–
101, 1964) and Bondell (Commun Stat Theory Methods 37:318–327, 2008)’s
estimator, were proposed to achieve high robustness and efficiency. In this paper,
we propose an estimator by maximizing a smoothed likelihood. Simulation studies
show that the proposed estimator has much smaller mean square errors than the
existing methods under uniform distribution, t-distribution with one degree of
freedom, and mixtures of normal distributions on the mean parameter, and is
comparable to the existing methods under other symmetric distributions. A real
example is used to illustrate the proposed method. The R code for implementing
the proposed method is also provided.

11.1 Introduction

Estimating the centre of a symmetric distribution is one of the basic and important problems in statistics. When the population distribution is symmetric, the centre represents the population mean (if it exists) and the population median. Let $X_1, \ldots, X_n$ be independent and identically distributed random variables. Assume that their probability density function is $f(x - \theta)$ with $f$ being a probability density function that is symmetric with respect to the origin. Our interest is to estimate $\theta$.


Symmetric populations are seen in many applications. For example, Naylor and Smith (1983) and Niu et al. (2015) analyzed a dataset of biochemical measurements using a mixture of normal distributions on the scale parameter, which is a special case of a symmetric distribution. More description and further analysis of this dataset will be given in Sect. 11.4. Symmetric distributions also naturally appear in paired comparisons. Two data sets of the same kind, such as the heights of two groups of people or the daily outputs of a factory in two different months, can be regarded as coming from the same distribution family with different location parameters. Estimating the location difference between those two data sets is one of the focuses. Assume that X and Y are two independent random variables from those two populations respectively, and let $\theta = E(Y) - E(X)$. Then X and $Z = Y - \theta$ have the same distribution. By symmetry,

$$P(X - Z < t) = P(Z - X < t) = P(X - Z > -t).$$

Replacing Z with $Y - \theta$ gives that

$$P(Y - X < \theta + t) = P(Y - X > \theta - t).$$

That is, the distribution of $Y - X$ is symmetric with respect to $\theta$. A natural question is whether $\theta = 0$ or not. An accurate estimator of $\theta$ can be used to construct a reliable testing procedure for $H_0: \theta = 0$.
With the random sample $X_1, \ldots, X_n$ from the symmetric distribution $f(x - \theta)$, a simple and also traditional estimator of $\theta$ is the sample mean. This estimator is the maximum likelihood estimator of $\theta$ under the normality assumption on $f(x)$. In such situations, the sample mean has many nice properties. For example, the sample mean is normally distributed and its variance attains the Cramer-Rao lower bound. Even if the underlying distribution is not normal, the central limit theorem implies that the sample mean is asymptotically normal as long as $\mathrm{Var}(X_1) < \infty$. However, when the data exhibit heavy tails, the sample mean may have poor performance. Because of that, many other estimators have been proposed in the literature. For example, the sample median of $X_1, \ldots, X_n$ is another natural choice. It performs much better than the sample mean when the data appear heavy tailed. Although the sample median displays high robustness, its efficiency is sometimes disappointing. Other nonparametric methods have then been proposed to improve the efficiency. Two popular choices are the Hodges-Lehmann (HL) estimator (Hodges and Lehmann 1963) and the location M-estimator of Huber (1964). Recently, Bondell (2008) proposed an estimator based on the characteristic function, which was shown to be more robust than the HL estimator and M-estimator in simulation studies.
Although the HL estimator, M-estimator, and Bondell's estimator are shown to be robust and efficient in simulation studies, to the best of our knowledge, none of them has a nonparametric likelihood interpretation. We feel that there is room for improvement. On one hand, the empirical likelihood method (Owen 2001) seems to be a quite natural choice. On the other hand, it is quite challenging to incorporate the symmetry information on $f(x)$. See the discussion in Sect. 10.1 of Owen (2001).
In this paper, we propose to estimate $\theta$ by maximizing a smoothed likelihood. The concept of smoothed likelihood was first proposed by Eggermont and LaRiccia (1995). This method gives a nonparametric likelihood interpretation for classic kernel density estimation. Later on, it has been used to estimate monotone and unimodal densities (Eggermont and LaRiccia 2000), the component densities in multivariate mixtures (Levine et al. 2011), and the component densities in mixture models with known mixing proportions (Yu et al. 2014). These papers demonstrated that the smoothed likelihood method can easily incorporate constraints on the density function. This motivates us to use it to estimate the centre $\theta$ by incorporating the symmetry assumption through the density function.
The organization of the paper is as follows. In Sect. 11.2, we present the idea of smoothed likelihood and apply it to obtain the maximum smoothed likelihood estimator of $\theta$. In Sects. 11.3 and 11.4, we present some simulation results and a real data analysis, respectively. Conclusions are given in Sect. 11.5. The R (R Development Core Team 2011) code for implementing the proposed method is given in the Appendix.

11.2 Maximum Smoothed Likelihood Estimation

In this section, we first review the idea of smoothed likelihood and then apply it to estimate the centre $\theta$ of a symmetric distribution.

11.2.1 Idea of Smoothed Likelihood

Given a sample $X_1, \ldots, X_n$ from the probability density function $g(x)$, the log-likelihood function of $g(x)$ is defined to be

$$\sum_{i=1}^n \log\{g(X_i)\}$$

subject to the constraint that $g(x)$ is a probability density function. Maximizing such a log-likelihood, however, does not lead to a consistent solution, since we can make the log-likelihood function arbitrarily large by setting $g(X_i) \to \infty$ for a specific i or even every $i = 1, \ldots, n$.
The unboundedness of the likelihood function can be tackled by the following smoothed log-likelihood approach (see Eggermont and LaRiccia 1995 and Eggermont and LaRiccia 2001, Chapter 4). Define the nonlinear smoothing operator $N_h g(x)$ of a density $g(x)$ by

$$N_h g(x) = \exp\left\{\int K_h(u - x)\log g(u)\,du\right\},$$

where $K_h(x) = \frac{1}{h}K(x/h)$, $K(\cdot)$ is a symmetric kernel function, and h is the bandwidth for the nonlinear smoothing operator. The smoothed likelihood of $g(x)$ is defined as

$$\sum_{i=1}^n \log N_h g(X_i) = n\int \tilde g_n(x)\log g(x)\,dx,$$

where $\tilde g_n(x) = n^{-1}\sum_{i=1}^n K_h(x - X_i)$ is the usual kernel density estimator of $g(x)$. Interestingly, the smoothed likelihood function is maximized at $g(x) = \tilde g_n(x)$. This gives a nonparametric likelihood interpretation for the usual kernel density estimator.
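As a small numerical illustration (not part of the original text), the nonlinear smoothing operator can be evaluated directly by one-dimensional integration; here we assume a standard normal kernel and take g to be the N(0,1) density.

## N_h g(x) = exp{ integral of K_h(u - x) * log g(u) du }, normal kernel K.
Nh.g <- function(x, g, h) {
  integrand <- function(u) dnorm((u - x) / h) / h * log(g(u))
  exp(integrate(integrand, lower = x - 10 * h, upper = x + 10 * h)$value)
}
## Example: Nh.g(0, dnorm, h = 0.3) is slightly below dnorm(0), reflecting the
## smoothing; as h -> 0, N_h g(x) -> g(x) for continuous g.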

11.2.2 Maximum Smoothed Likelihood Estimator of the Centre

Suppose we have a random sample $X_1, \ldots, X_n$ from the population with probability density function $f(x - \theta)$. In this subsection, we consider estimating the centre $\theta$ using the maximum smoothed likelihood method. Following the principle of the smoothed likelihood presented in the last subsection, we define the smoothed likelihood of $\{f, \theta\}$ as follows:

$$l_n(f, \theta) = \sum_{i=1}^n \log N_h f(X_i - \theta).$$

After some calculation, $l_n(f, \theta)$ can be written as

$$l_n(f, \theta) = n\int \tilde f_n(x + \theta)\log f(x)\,dx,$$

where $\tilde f_n(x) = n^{-1}\sum_{i=1}^n K_h(x - X_i)$ is the kernel density estimator of $f(x - \theta)$. The maximum smoothed likelihood estimator of $\{f(x), \theta\}$ is defined to be

$$\{\hat f_{Smo}(x), \hat\theta_{Smo}\} = \arg\sup_{f(x),\,\theta} l_n(f, \theta)$$

subject to the constraint that $f(x)$ is a probability density function symmetric with respect to the origin.
We proceed in two steps to get $\hat\theta_{Smo}$. In the first step, we fix $\theta$ and maximize $l_n(f, \theta)$ subject to the constraint that $f(x)$ is a symmetric probability density function around 0. Since $f(x)$ is symmetric around 0, we have

$$\int \tilde f_n(x + \theta)\log f(x)\,dx = \int \tilde f_n(-x + \theta)\log f(-x)\,dx = \int \tilde f_n(-x + \theta)\log f(x)\,dx.$$

Hence the smoothed likelihood function can be written as

$$l_n(f, \theta) = n\int 0.5\{\tilde f_n(x + \theta) + \tilde f_n(-x + \theta)\}\log f(x)\,dx.$$

Note that $0.5\{\tilde f_n(x + \theta) + \tilde f_n(-x + \theta)\}$ is a probability density function and is symmetric around 0. Hence $l_n(f, \theta)$ is maximized at

$$f(x) = 0.5\{\tilde f_n(x + \theta) + \tilde f_n(-x + \theta)\}.$$

In the second step, we plug $f(x) = 0.5\{\tilde f_n(x + \theta) + \tilde f_n(-x + \theta)\}$ into $l_n(f, \theta)$ and obtain the profile smoothed likelihood function of $\theta$:

$$pl_n(\theta) = n\int \tilde f_n(x + \theta)\log\{0.5\tilde f_n(x + \theta) + 0.5\tilde f_n(\theta - x)\}\,dx
= n\int \tilde f_n(x)\log\{0.5\tilde f_n(x) + 0.5\tilde f_n(2\theta - x)\}\,dx.$$

The maximum smoothed likelihood estimator of $\theta$ can be equivalently defined as

$$\hat\theta_{Smo} = \arg\sup_{\theta} pl_n(\theta).$$

To implement the above method, we need to specify the kernel density $K(x)$ and select the bandwidth h. The commonly used kernel density is the standard normal density, which is used in our implementation. Methods for choosing a bandwidth for kernel density estimation are readily available in the literature. In our implementation, we have used the function dpik() in the R package KernSmooth to choose the bandwidth h. This package essentially implements the kernel methods in Wand and Jones (1995). We have written an R function to calculate $pl_n(\theta)$ and then use optim() to numerically calculate $\hat\theta_{Smo}$. This code is provided in the Appendix.

11.3 Simulation Study

We conduct simulations to test the efficiency of the maximum smoothed likelihood method and compare it with five existing methods: the sample mean, the sample median, the HL estimator, the M-estimator, and Bondell (2008)'s estimator.

Table 11.1 Mean square errors (× sample size) for six estimates under thirteen symmetric distributions with the sample size equal to 50

Distribution                      Mean    Median  HL      M-estimator  Bondell  $\hat\theta_{Smo}$
N(0, 1)                           1.092   1.613   1.130   1.110        1.147    1.140
DE                                1.906   1.138   1.300   1.379        1.361    1.434
U(-2, 2)                          1.376   3.780   1.556   1.454        1.686    0.682
t(1)                              5634    2.701   3.786   4.128        3.017    2.940
t(2)                              71.13   1.986   1.925   2.102        1.834    2.010
t(3)                              2.855   1.821   1.580   1.506        1.552    1.678
0.5N(-1, 0.5) + 0.5N(1, 0.5)      1.555   5.088   1.749   1.574        1.858    1.150
0.5N(-1, 0.75) + 0.5N(1, 0.75)    1.722   4.148   1.956   2.074        1.991    1.610
0.5N(0, 1) + 0.5N(0, 3)           2.058   2.461   1.885   1.990        1.892    2.016
0.5N(0, 1) + 0.5N(0, 5)           3.080   3.135   2.644   2.585        2.684    2.903
0.5N(0, 1) + 0.5N(0, 10)          5.705   3.891   4.014   4.220        4.166    4.123
0.9N(0, 1) + 0.1N(0, 9)           1.680   1.663   1.247   1.232        1.178    1.211
0.8N(0, 1) + 0.2N(0, 9)           2.574   1.939   1.697   1.728        1.680    1.802

Table 11.2 Mean square errors (× sample size) for six estimates under thirteen symmetric distributions with the sample size equal to 100

Distribution                      Mean    Median  HL      M-estimator  Bondell  $\hat\theta_{Smo}$
N(0, 1)                           0.991   1.626   1.042   1.013        1.045    1.032
DE                                2.035   1.071   1.357   1.466        1.418    1.392
U(-2, 2)                          1.351   3.872   1.483   1.456        1.633    0.572
t(1)                              20572   2.476   3.306   3.749        2.801    2.487
t(2)                              9.517   2.050   2.045   2.132        1.944    1.980
t(3)                              2.829   1.772   1.503   1.677        1.486    1.589
0.5N(-1, 0.5) + 0.5N(1, 0.5)      1.590   5.318   1.770   1.519        1.887    1.122
0.5N(-1, 0.75) + 0.5N(1, 0.75)    1.766   4.328   2.035   1.898        2.064    1.553
0.5N(0, 1) + 0.5N(0, 3)           1.996   2.588   1.864   1.843        1.883    1.978
0.5N(0, 1) + 0.5N(0, 5)           3.058   3.026   2.548   2.766        2.600    2.730
0.5N(0, 1) + 0.5N(0, 10)          5.493   3.538   3.585   3.887        3.703    3.443
0.9N(0, 1) + 0.1N(0, 9)           1.868   1.736   1.274   1.243        1.205    1.260
0.8N(0, 1) + 0.2N(0, 9)           2.568   2.091   1.694   1.690        1.644    1.717

We choose thirteen symmetric distributions with centre zero: the standard normal distribution N(0, 1), the double exponential (DE) distribution, the uniform distribution over (-2, 2) (denoted by U(-2, 2)), three t distributions (denoted by t(v) with v being the degrees of freedom), two mixtures of normal distributions on the mean parameter, and five mixtures of normal distributions on the scale parameter. We consider three sample sizes: 50, 100, and 200. We calculate the mean square errors for each of the six estimates based on 1000 replications. The results are summarized in Tables 11.1, 11.2 and 11.3, in which the mean square errors are multiplied by the sample size.
It is seen from Tables 11.1, 11.2 and 11.3 that the mean square errors of the sample mean are significantly larger under some specific distributions, such as t(1)

Table 11.3 Mean square errors (× sample size) for six estimates under thirteen symmetric distributions with the sample size equal to 200

Distribution                      Mean    Median  HL      M-estimator  Bondell  $\hat\theta_{Smo}$
N(0, 1)                           1.003   1.537   1.031   1.085        1.034    1.033
DE                                1.847   1.069   1.256   1.403        1.316    1.260
U(-2, 2)                          1.407   4.125   1.512   1.410        1.707    0.448
t(1)                              74611   2.653   3.608   3.881        2.908    2.575
t(2)                              13.17   1.977   1.914   2.011        1.821    1.815
t(3)                              3.470   1.745   1.537   1.756        1.492    1.552
0.5N(-1, 0.5) + 0.5N(1, 0.5)      1.471   5.426   1.611   1.605        1.736    0.980
0.5N(-1, 0.75) + 0.5N(1, 0.75)    1.758   4.521   2.007   1.945        2.041    1.483
0.5N(0, 1) + 0.5N(0, 3)           1.893   2.257   1.745   1.912        1.766    1.860
0.5N(0, 1) + 0.5N(0, 5)           3.224   3.115   2.650   2.480        2.691    2.738
0.5N(0, 1) + 0.5N(0, 10)          5.322   3.679   3.568   4.006        3.709    3.226
0.9N(0, 1) + 0.1N(0, 9)           1.790   2.036   1.377   1.370        1.340    1.375
0.8N(0, 1) + 0.2N(0, 9)           2.795   2.213   1.793   1.693        1.714    1.714

Fig. 11.1 Histograms of maximum smoothed likelihood estimates under the t(3) and 0.9N(0, 1) + 0.1N(0, 9) distributions

and t(2), as the variances of those distributions do not exist or are very large. On average, the efficiency of the sample median is low compared to the other methods, except for the double exponential distribution. Compared with the HL estimator, M-estimator, and Bondell (2008)'s estimator, the maximum smoothed likelihood estimator has smaller mean square errors under the uniform distribution, t(1), and mixtures of normal distributions on the mean parameter. For other distributions, the new estimator has comparable performance to the HL estimator, M-estimator, and Bondell (2008)'s estimator.
As we can see from Tables 11.1, 11.2 and 11.3, the products of the sample sizes and the mean square errors of $\hat\theta_{Smo}$ remain almost constant. Hence we conjecture that the mean square error of $\hat\theta_{Smo}$ is of the order $O(n^{-1})$. In Fig. 11.1, we plot the histograms of the maximum smoothed likelihood estimates under t(3) and 0.9N(0, 1) + 0.1N(0, 9). It is seen that their histograms behave similarly to a normal distribution. Hence we also conjecture that the asymptotic distribution of $\hat\theta_{Smo}$ is normal. We leave these two conjectures for future study.

11.4 Real Application

In clinical chemistry, the clinical assessment of biochemical measurements is


typically carried out by reference to a “normal range”, which is the 95 % prediction
interval of the mean measurement for a “healthy” population (Naylor and Smith
1983). One way of obtaining such a normal range is to first collect a large sample
of biochemical measurements from a healthy population. However, in practice, it
may be difficult to collect measurements only from healthy individuals. Instead,
measurements from a contaminated sample, containing both healthy and unhealthy
individuals, are obtained. Because of the potential heterogeneity in the contaminated
sample, mixtures of normal distributions are widely used in such analyses.
Naylor and Smith (1983) and Niu et al. (2015) used a mixture of two normal
distributions on the scale parameter to model a contaminated sample of 542 blood-
chloride measurements collected during routine analysis by the Department of
Chemical Pathology at the Derbyshire Royal Infirmary. Note that the mixture
of two normal distributions on the scale parameter is a symmetric distribution.
Here we illustrate our method by estimating the centre of this data set. The
maximum smoothed likelihood estimate and the other five existing estimates are shown
in Table 11.4. As we can see, all six estimates are close to each other. According
to our simulation experience on the mixtures of normal distributions on the scale
parameter, we expect that the variances of the proposed method, Bondell’s method,
HL estimator, and M-estimator are similar and may be smaller than those of sample
mean and sample median.

Table 11.4 Six estimates of the centre of 542 blood-chloride measurements

Mean     Median    HL        M-estimator  Bondell   $\hat\theta_{Smo}$
99.985   100.017   100.033   100.068      100.087   100.100

11.5 Conclusion

In this paper, we proposed the maximum smoothed likelihood estimator for the centre of a symmetric distribution. The proposed method performs better than widely used estimators such as the HL estimator, M-estimator, and Bondell (2008)'s estimator under the uniform, t(1), and mixtures of normal distributions on the mean parameter. It has comparable performance to the HL estimator, M-estimator, and Bondell (2008)'s estimator under other symmetric distributions.
We admit that so far the proposed method lacks theoretical justification. Further work on the consistency and asymptotic distribution of $\hat\theta_{Smo}$ will provide solid theoretical support for its application. There is also room for improvement in its computational efficiency.

Acknowledgements Dr. Li’s work is partially supported by the Natural Sciences and Engineering
Research Council of Canada grant No RGPIN-2015-06592.

Appendix: R code for calculating $\hat\theta_{Smo}$

library("KernSmooth")
library("ICSNP")

norm.kern=function(x,data,h)
{
out=mean( dnorm( (x-data)/h ) )/h
out
}

dkern=function(x,data)
{
h=dpik(data,kernel="normal")
out=lapply(x,norm.kern,data=data,h=h)
as.numeric(out)
}

dfint=function(x,data,mu)
{
p1=dkern(x,data)
p2=log(0.5*dkern(x,data)+0.5*dkern(2*mu-x,data)+1e-100)
p1*p2
}

pln=function(mu,data)
{
h=dpik(data,kernel="normal")
204 P. Li and Z. Tian

out=integrate(dfint, lower=min(data)-10*h,
upper=max(data)+10*h, data=data, mu=mu)
-out$value
}

hatmu.smooth=function(data)
{
##Input: data set
##Output: maximum smoothed likelihood estimate

hl.est=hl.loc(data)
est=optim(hl.est,pln,data=data,method="BFGS")$par
est
}

##Here is an example
set.seed(1221)
data=rnorm(100,0,1)
hatmu.smooth(data)
##Result: 0.1798129

References

Bondell HD (2008) On robust and efficient estimation of the center of symmetry. Commun Stat
Theory Methods 37:318–327
Eggermont PPB, LaRiccia VN (1995) Maximum smoothed likelihood density estimation for
inverse problems. Ann Stat 23:199–220
Eggermont PPB, Lariccia VN (2000) Maximum likelihood estimation of smooth monotone and
unimodal densities. Ann Stat 28:922–947
Eggermont PPB, Lariccia VN (2001) Maximum penalized likelihood estimation: volume I: density
estimation. Springer, New York
Hodges JL, Lehmann EL (1963) Estimates of location based on rank tests. Ann Math Stat 34:598–
611
Huber PJ (1964) Robust estimation of a location parameter. Ann Math Stat 35:73–101
Levine M, Hunter D, Chauveau D (2011) Maximum smoothed likelihood for multivariate mixtures.
Biometrika 98:403–416
Naylor JC, Smith AFM (1983) A contamination model in clinical chemistry: an illustration of a
method for the efficient computation of posterior distributions. J R Stat Soc Ser D 32:82–87
Niu X, Li P, Zhang P (2016) Testing homogeneity in a scale mixture of normal distributions. Stat
Pap 57:499–516
Owen AB (2001) Empirical likelihood. Chapman and Hall/CRC, New York
R Development Core Team (2011) R: a language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna
Wand MP, Jones MC (1995) Kernel smoothing. Chapman and Hall, London
Yu T , Li P, Qin J (2014) Maximum smoothed likelihood component density estimation in mixture
models with known mixing proportions. arXiv preprint, arXiv:1407.3152
Chapter 12
Modelling the Common Risk Among Equities:
A Multivariate Time Series Model with an
Additive GARCH Structure

Jingjia Chu, Reg Kulperger, and Hao Yu

Abstract The DCC GARCH models (Engle, J Bus Econ Stat 20:339-350, 2002) have been well studied to describe conditional covariance and correlation matrices, while the common risk among series cannot be captured intuitively by the existing multivariate GARCH models. A new class of multivariate time series models with an additive GARCH type structure is proposed. The dynamic conditional covariances between series are aggregated by a common risk term, which is the key to characterizing the conditional correlation.

12.1 Introduction

The volatility of log returns needs to be estimated for financial modelling. The Black-Scholes-Merton (BSM) model (Black and Scholes 1973; Merton 1973) assumes that the log returns are independently and identically (i.i.d.) normally distributed, while real market data show volatility clustering and heavy tails, which violates this assumption. Conditional heteroscedasticity models are widely used in finance today to estimate the volatility while taking these features of financial return series into consideration. Let $\mathcal{F}_t$ be the information set ($\sigma$-field) of all information up to time point t and assume $\{x_t : t \in T\}$ is the observed process; then the general form of the model is in a multiplicative structure given by

$$x_t = \sigma_t\epsilon_t, \qquad E(x_t \mid \mathcal{F}_{t-1}) = 0, \qquad E(x_t^2 \mid \mathcal{F}_{t-1}) = \sigma_t^2, \qquad (12.1)$$

where the innovations $\{\epsilon_t : t \in T\}$ are i.i.d. white noise with mean 0 and variance 1. The innovations are independent of $\mathcal{F}_{t-1}$, and the $\sigma_t$ are $\mathcal{F}_{t-1}$-adapted.


Engle (1982) introduced the autoregressive conditional heteroscedasticity (ARCH) model with the unique ability of capturing volatility clustering in financial time series. The ARCH(q) model defines the conditional variance of $x_t$ to be

$$\sigma_t^2 = \omega + \sum_{i=1}^q \alpha_i x_{t-i}^2.$$

The lag q tends to be large when the model is applied to real data. Subsequently, the GARCH model introduced by Bollerslev (1986) extends the formula for $\sigma_t$ by adding autoregressive terms of $\sigma_t^2$. The conditional variance of the univariate GARCH(p,q) model is defined as

$$\sigma_t^2 = \omega + \sum_{i=1}^q \alpha_i x_{t-i}^2 + \sum_{j=1}^p \beta_j\sigma_{t-j}^2.$$

When there is more than one time series, it is necessary to understand the co-movements of the returns, since the volatilities of stock returns are correlated with each other. In contrast to the univariate case, multivariate volatility models based on a GARCH dependence are much more flexible. The multivariate GARCH models are specified based on the first two conditional moments, as in the univariate case. The first multivariate volatility model is the half-Vec (vech) multivariate GARCH defined by Bollerslev et al. (1988), which is also one of the most general forms of multivariate GARCH models:

$$x_t = H_t^{1/2}\eta_t, \qquad h_t = c + \sum_{i=1}^q A_i\xi_{t-i} + \sum_{j=1}^p B_j h_{t-j}, \qquad (12.2)$$

where the $\eta_t$'s are white noise with mean 0 and variance $I_m$,

$$h_t = \mathrm{vech}(H_t), \qquad \xi_t = \mathrm{vech}(x_t x_t^{\top}).$$

In this class of models, the conditional covariance matrix is modelled directly. The number of parameters in the general m-dimensional case increases at a rate proportional to $m^4$, which makes it difficult to obtain the estimates.
There are simpler forms of multivariate GARCH obtained by specifying $H_t$ in different ways. The constant conditional correlation (CCC) GARCH model was presented by Bollerslev (1990), who assumes that the conditional correlation matrix $R = (\rho_{i,j})_{i,j=1,\ldots,m}$ is time-invariant, which reduces the number of parameters to
$O(m^2)$. The model is defined as

$$x_t = H_t^{1/2}\eta_t, \qquad H_t = S_t R S_t, \qquad \sigma_t^2 = c + \sum_{i=1}^q A_i x_{t-i}^2 + \sum_{j=1}^p B_j\sigma_{t-j}^2, \qquad (12.3)$$

where $\sigma_t^2$ is the vector containing the diagonal elements of $H_t$ and $S_t$ is the diagonal matrix of $\sqrt{\sigma_t^2}$. A less restrictive time-varying conditional correlation version, called the dynamic conditional correlation (DCC) GARCH, was studied by Engle (2002) and Tse and Tsui (2002). Both the CCC-GARCH and DCC-GARCH models are built by modelling the conditional variance of each series and the conditional correlation between series. However, none of these multivariate GARCH models and their extensions has a simple way to capture the common risk among different stocks.
Information flows around the world almost instantaneously, so most markets (Asian, European, and American) will react to the same events (good news or bad news). Most stock prices now go up or down together following big events (random shocks). The strong positive association between equity variance and several variables was confirmed by Christie (1982). Recently, Carr and Wu (2009) found that a common stochastic variance risk factor exists among stocks by using market option premiums. Several different approaches have been used in finance and economics to describe this common driving process in the literature. The first approach is the asset pricing model (Treynor, Market value, time, and risk. Unpublished manuscript, pp 95-209, 1961; Toward a theory of market value of risky assets. Unpublished manuscript, p 40, 1962; Sharpe 1964; Treynor 2008) and its generalizations (Burmeister and Wall 1986; Fama and French 1993, 2015), which quantify the process by market indices or some macroeconomic factors. Another approach is the volatility jump models (Duffie et al. 2000; Tankov and Tauchen 2011), which assume that the effect of news is a discrete process occurring by chance over time. Different multivariate GARCH models have been introduced to describe the same underlying driving process in financial time series (Engle et al. 1990; Girardi and Tolga Ergun 2013; Santos and Moura 2014). These models either involve other market variables, model the underlying risk as a discrete process, or characterize the common risk implicitly.
We develop a simple common risk model which keeps the GARCH structure and involves the returns only. In the light of Carr and Wu (2009), the innovations are divided into two parts. A new additive GARCH type model is proposed, using a common risk term to characterize the internal relationship among series explicitly. The common risk term can be used as an indicator of the shock among series. The conditional correlations aggregated by this common risk term change dynamically. This model is also able to capture the conditional correlation clustering phenomenon described in So and Yip (2012). The common risk term shows latency after reaching a peak since it follows a GARCH type structure, which means that a big shock takes some time to calm down.
The notation and the new common underlying risk model are introduced in Sect. 12.2. In Sect. 12.3, we discuss that the model is identifiable and that the estimates based on the Gaussian quasi-likelihood are unique under certain assumptions. In Sect. 12.4, the results of a Monte Carlo simulation study are shown and the estimated conditional volatility is compared with some other GARCH models based on a bivariate dataset.

12.2 Model Specification

Consider an Rm -valued stochastic process fxt ; t 2 Zg on a probability space


.˝; A ; P/ and a multidimensional parameter  in the parameter space 
Rs .
We say that xt is a common risk model with an additive GARCH structure if, for all
t 2 Z, we have
8
ˆ
ˆ x1;t D
1;t 1;t C
0;t 0;t
ˆ
ˆ
ˆ
< x2;t D
2;t 2;t C
0;t 0;t
(12.4)
ˆ
ˆ 
ˆ
ˆ

xm;t D
m;t m;t C
0;t 0;t

2 2
where 1;t ;    ; m;t are following a GARCH type structure,
8 2
ˆ
ˆ 1;t D ˛1 g.x1;t1 /2 C ˇ1 1;t1
2
ˆ
ˆ
ˆ
ˆ
ˆ
ˆ
2
2;t D ˛2 g.x2;t1 /2 C ˇ2 2;t1
2
ˆ
<
 (12.5)
ˆ
ˆ
ˆ
ˆ  2 D ˛ g.x 2 2
ˆ
ˆ m;t1 / C ˇm m;t1
ˆ
ˆ
m;t m
:̂ 2 2 2
0;t D !0 C ˇ01 1;t C    C ˇ0m m;t :

The size of the effect on σ_{0,t}^2 increases linearly with each observed element of x_t. The conditional volatilities implied by this model can explode to infinity, and the mean-reversion property will hardly hold, when one of the σ_{i,t} terms becomes larger than 1. The function g is therefore taken to be a continuous bounded function in order to avoid this kind of situation.
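For concreteness, the following is a minimal simulation sketch of the recursions in Eqs. (12.4)–(12.5) for the bivariate case m = 2. The helper names (bounded_g, simulate_common_risk), the truncation-type choice of g (anticipating the specific g used in Sect. 12.4.1), and the default parameter values (taken loosely from the 'True' column of Table 12.2) are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def bounded_g(x, cap=0.01):
    """Bounded news-impact function: identity below the cap, equal to the cap otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < cap, x, cap)

def simulate_common_risk(n=1000, alpha=(0.0669, 0.0190), beta=(0.893, 0.958),
                         omega0=1.51e-5, beta0=(0.434, 0.434), rho=-0.45,
                         init_sig=(0.01, 0.01), seed=0):
    """Simulate a bivariate common risk model, Eqs. (12.4)-(12.5); defaults loosely follow Table 12.2."""
    rng = np.random.default_rng(seed)
    alpha, beta, beta0 = np.asarray(alpha), np.asarray(beta), np.asarray(beta0)
    R = np.array([[1.0, rho], [rho, 1.0]])          # correlation of the individual shocks
    L = np.linalg.cholesky(R)
    x = np.zeros((n, 2))
    sig2 = np.asarray(init_sig, dtype=float) ** 2   # starting individual variances
    for t in range(1, n):
        sig0_2 = omega0 + beta0 @ sig2              # common risk variance (last line of Eq. 12.5)
        eps_ind = L @ rng.standard_normal(2)        # correlated individual shocks
        eps0 = rng.standard_normal()                # common shock
        x[t] = np.sqrt(sig2) * eps_ind + np.sqrt(sig0_2) * eps0
        sig2 = alpha * bounded_g(x[t]) ** 2 + beta * sig2
    return x

returns = simulate_common_risk()
print(returns.std(axis=0))                          # unconditional scale of each simulated series
```

Because g is bounded, the variance recursion stays finite even for extreme simulated returns, which is exactly the role the bounding function plays in the model.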
The (m + 1)-dimensional innovation terms {ε_t, −∞ < t < ∞} are independent and identically distributed with mean 0 and covariance Σ, where Σ has the same parameterization as a correlation matrix,

\[
\Sigma = \begin{pmatrix} R & 0 \\ 0 & 1 \end{pmatrix}.
\]

The innovation can be divided into two parts, ε_t = (ε_{t,ind}^⊤, ε_{0,t})^⊤: the first part is an m-dimensional vector of correlated individual shocks, ε_{t,ind}, and the second part is a univariate common shock term, ε_{0,t}.
Define the following notation: D_t = diag{σ_{1,t}, σ_{2,t}, …, σ_{m,t}}, 1 = (1, 1, …, 1)^⊤, and

\[
\varepsilon_{t,\mathrm{ind}} =
\begin{pmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \\ \vdots \\ \varepsilon_{m,t} \end{pmatrix}_{m\times 1},
\qquad
\varepsilon_t =
\begin{pmatrix} \varepsilon_{t,\mathrm{ind}} \\ \varepsilon_{0,t} \end{pmatrix}_{(m+1)\times 1}.
\]

Then Eq. 12.4 can be written in matrix form:

\[
x_t = D_t\,\varepsilon_{t,\mathrm{ind}} + \sigma_{0,t}\,\varepsilon_{0,t}\,\mathbf{1}. \tag{12.6}
\]
So the model can be specified either by Eqs. 12.6 and 12.5 together, or by Eqs. 12.4 and 12.5 together. The conditional covariance matrix of x_t can be computed by definition as H_t = cov(x_t | F_{t−1}):

\[
H_t = \begin{pmatrix}
\sigma_{0,t}^2+\sigma_{1,t}^2 & \sigma_{0,t}^2+\rho_{1,2}\sigma_{1,t}\sigma_{2,t} & \cdots & \sigma_{0,t}^2+\rho_{1,m}\sigma_{1,t}\sigma_{m,t}\\
\sigma_{0,t}^2+\rho_{1,2}\sigma_{1,t}\sigma_{2,t} & \sigma_{0,t}^2+\sigma_{2,t}^2 & \cdots & \sigma_{0,t}^2+\rho_{2,m}\sigma_{2,t}\sigma_{m,t}\\
\vdots & \vdots & \ddots & \vdots\\
\sigma_{0,t}^2+\rho_{1,m}\sigma_{1,t}\sigma_{m,t} & \sigma_{0,t}^2+\rho_{2,m}\sigma_{2,t}\sigma_{m,t} & \cdots & \sigma_{0,t}^2+\sigma_{m,t}^2
\end{pmatrix}.
\]

H_t can be written as the sum of two parts, H_t = σ_{0,t}^2 J + D_t R D_t, where J is the m × m matrix with all elements equal to 1 (i.e., J = 11^⊤).
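As a quick numerical sanity check on this decomposition, the sketch below builds H_t both elementwise and as σ_{0,t}^2 J + D_t R D_t and verifies that the two constructions agree; the particular values of σ_{0,t}, the σ_{i,t}'s, and R are arbitrary illustrative choices.

```python
import numpy as np

m = 3
sig0 = 0.012                                   # common risk volatility (illustrative)
sig = np.array([0.010, 0.015, 0.020])          # individual volatilities (illustrative)
R = np.array([[1.0, 0.3, -0.2],
              [0.3, 1.0, 0.1],
              [-0.2, 0.1, 1.0]])               # correlation of the individual shocks

D = np.diag(sig)
J = np.ones((m, m))
H_decomp = sig0**2 * J + D @ R @ D             # H_t = sigma_{0,t}^2 J + D_t R D_t

# Elementwise construction following the displayed matrix
H_elem = np.empty((m, m))
for i in range(m):
    for j in range(m):
        H_elem[i, j] = sig0**2 + (sig[i]**2 if i == j else R[i, j] * sig[i] * sig[j])

print(np.allclose(H_decomp, H_elem))           # True
```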
The number of parameters grows at the rate O(m^2), which is of the same order as in the CCC-GARCH model. We can separate the vector of unknown parameters into two parts: the parameters in the innovation correlation matrix Σ and the coefficients in Eq. 12.5. The total number of parameters is s = s_1 + 3m + 1, where s_1 = m(m−1)/2 is the number of parameters in R.
The conditional correlation between series i and series j can be expressed through the elements of the H_t matrix. The dynamic correlation between series i and j is calculated as

\[
\rho_{ij,t} = \frac{\operatorname{cov}(x_{i,t}, x_{j,t})}{\sqrt{\operatorname{var}(x_{i,t})\,\operatorname{var}(x_{j,t})}}
= \frac{\sigma_{0,t}^2 + \rho_{i,j}\,\sigma_{i,t}\,\sigma_{j,t}}{\sqrt{(\sigma_{0,t}^2 + \sigma_{i,t}^2)(\sigma_{0,t}^2 + \sigma_{j,t}^2)}}
= \frac{1 + \rho_{i,j}\bigl(\tfrac{\sigma_{i,t}}{\sigma_{0,t}}\bigr)\bigl(\tfrac{\sigma_{j,t}}{\sigma_{0,t}}\bigr)}
       {\sqrt{1 + \bigl(\tfrac{\sigma_{i,t}}{\sigma_{0,t}}\bigr)^2}\,\sqrt{1 + \bigl(\tfrac{\sigma_{j,t}}{\sigma_{0,t}}\bigr)^2}}.
\]
From the equations above, the conditional correlation matrix R_t = (ρ_{ij,t})_{i,j=1,…,m} tends to the matrix J defined above when the common risk term σ_{0,t} is much larger than both σ_{i,t} and σ_{j,t}. In that case the common risk term is dominant and all the log return series are nearly perfectly correlated. On the contrary, the conditional correlation matrix approaches the constant correlation matrix R when the common risk term is much smaller and close to 0; the conditional correlation then becomes time invariant, as in a CCC-GARCH model. Mathematically,

\[
R_t \to J \ \text{ as } \ \sigma_{0,t} \to \infty, \qquad R_t \to R \ \text{ as } \ \sigma_{0,t} \to 0.
\]
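These two limits can be checked numerically from the bivariate version of the correlation formula above; the sketch below evaluates ρ_{12,t} over a grid of common risk levels σ_{0,t}, with σ_{1,t}, σ_{2,t}, and ρ_{1,2} fixed at arbitrary illustrative values.

```python
import numpy as np

def rho_12_t(sig0, sig1, sig2, rho12):
    """Conditional correlation between series 1 and 2 implied by the common risk model."""
    num = sig0**2 + rho12 * sig1 * sig2
    den = np.sqrt((sig0**2 + sig1**2) * (sig0**2 + sig2**2))
    return num / den

sig1, sig2, rho12 = 0.015, 0.020, -0.45
for sig0 in [1e-6, 0.01, 0.05, 0.5]:
    print(f"sigma_0 = {sig0:8.6f}  ->  rho_12,t = {rho_12_t(sig0, sig1, sig2, rho12):.3f}")
# As sigma_0 -> 0 the value approaches rho12 (= -0.45); as sigma_0 grows it approaches 1.
```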

12.3 Gaussian QMLE

A distribution must be specified for the innovation process ε_t in order to form the likelihood function. The maximum likelihood (ML) method is particularly useful for statistical inference in such models because it provides an estimator that is both consistent and asymptotically normal. The quasi-maximum likelihood (QML) method can draw statistical inference based on a misspecified distribution of the innovations, whereas the ML method assumes that the true distribution of the innovations is the specified one. The ML method is essentially a special case of the QML method with no specification error.
We can construct the Gaussian quasi-likelihood function based on the density of the conditional distribution of x_t given F_{t−1}. The vector of parameters

\[
\theta = (\rho_{1,2}, \dots, \rho_{m-1,m},\ \alpha_1, \dots, \alpha_m,\ \beta_1, \dots, \beta_m,\ \omega_0,\ \beta_{01}, \dots, \beta_{0m})^{\top} \tag{12.7}
\]

belongs to a parameter space of the form

\[
\Theta \subseteq [0, \infty)^{\frac{m(m-1)}{2} + 3m + 1}. \tag{12.8}
\]

The true value of the parameter is unknown and is denoted by

\[
\theta_0 = (\rho_{1,2}^{(0)}, \dots, \rho_{m-1,m}^{(0)},\ \alpha_1^{(0)}, \dots, \alpha_m^{(0)},\ \beta_1^{(0)}, \dots, \beta_m^{(0)},\ \omega_0^{(0)},\ \beta_{01}^{(0)}, \dots, \beta_{0m}^{(0)})^{\top}.
\]

12.3.1 The Distribution of the Observations
The observations x_t are assumed to follow a realization of an m-dimensional common risk process, and the ε_t's are i.i.d. normally distributed with mean 0 and covariance Σ. Equation 12.4 shows that, conditionally on the past, the observations are linear combinations of normally distributed variables, so the conditional distribution of the observations is multivariate normal as well, i.e. x_t | F_{t−1} ∼ N(0, H_t). The model in Sect. 12.2 can be rewritten in the form

\[
\begin{cases}
x_t = H_t^{1/2}\,\eta_t,\\[4pt]
H_t = \begin{pmatrix}
\sigma_{0,t}^2+\sigma_{1,t}^2 & \sigma_{0,t}^2+\rho_{1,2}\sigma_{1,t}\sigma_{2,t} & \cdots & \sigma_{0,t}^2+\rho_{1,m}\sigma_{1,t}\sigma_{m,t}\\
\sigma_{0,t}^2+\rho_{1,2}\sigma_{1,t}\sigma_{2,t} & \sigma_{0,t}^2+\sigma_{2,t}^2 & \cdots & \sigma_{0,t}^2+\rho_{2,m}\sigma_{2,t}\sigma_{m,t}\\
\vdots & \vdots & \ddots & \vdots\\
\sigma_{0,t}^2+\rho_{1,m}\sigma_{1,t}\sigma_{m,t} & \sigma_{0,t}^2+\rho_{2,m}\sigma_{2,t}\sigma_{m,t} & \cdots & \sigma_{0,t}^2+\sigma_{m,t}^2
\end{pmatrix},
\end{cases}
\tag{12.9}
\]

where the innovations η_t are a sequence of i.i.d. m-dimensional standard normal variables. The quasi log-likelihood function is then given by

\[
L_n(\theta) = -\frac{1}{2n}\sum_{t=1}^{n}\bigl\{\log|H_t(\theta)| + x_t^{\top} H_t(\theta)^{-1} x_t\bigr\}
            = -\frac{1}{2n}\sum_{t=1}^{n} l_t(\theta). \tag{12.10}
\]

The driving noises ε_t are i.i.d. N(0, Σ), so the conditional distribution of x_t is N(0, H_t(θ)). The QML estimator is defined as

\[
\hat{\theta}_n = \arg\max_{\theta\in\Theta} L_n(\theta)
             = \arg\min_{\theta\in\Theta} \frac{1}{2n}\sum_{t=1}^{n}\bigl\{\log|H_t(\theta)| + x_t^{\top} H_t(\theta)^{-1} x_t\bigr\}
             = \arg\min_{\theta\in\Theta} \frac{1}{2n}\sum_{t=1}^{n} l_t(\theta). \tag{12.11}
\]
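A minimal sketch of how the quasi log-likelihood in Eqs. (12.10)–(12.11) could be evaluated and minimized is given below, written for a bivariate model with a single common loading β_0 (the simplification introduced in Sect. 12.4.1). The parameter packing, the helper name neg_quasi_loglik, the crude variance initialization, and the use of scipy.optimize.minimize are implementation assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def neg_quasi_loglik(theta, x, cap=0.01):
    """(1/2n) * sum_t { log|H_t| + x_t' H_t^{-1} x_t } for a bivariate model with beta_01 = beta_02."""
    rho, a1, a2, b1, b2, w0, b0 = theta
    n = x.shape[0]
    sig2 = np.var(x, axis=0)                       # crude initialization of sigma_{i,0}^2
    total = 0.0
    for t in range(n):
        sig0_2 = w0 + b0 * sig2.sum()              # common risk variance
        s1, s2 = np.sqrt(sig2)
        H = np.array([[sig0_2 + sig2[0], sig0_2 + rho * s1 * s2],
                      [sig0_2 + rho * s1 * s2, sig0_2 + sig2[1]]])
        sign, logdet = np.linalg.slogdet(H)
        if sign <= 0:
            return 1e10                            # penalize non-positive-definite H_t
        total += logdet + x[t] @ np.linalg.solve(H, x[t])
        g = np.where(np.abs(x[t]) < cap, x[t], cap)
        sig2 = np.array([a1, a2]) * g**2 + np.array([b1, b2]) * sig2
    return total / (2 * n)

# Usage sketch (x is an (n, 2) array of centered log returns):
# theta0 = np.array([-0.4, 0.05, 0.05, 0.9, 0.9, 1e-5, 0.4])
# fit = minimize(neg_quasi_loglik, theta0, args=(x,), method="Nelder-Mead")
```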
12.3.2 Identifiability

We start this section with the concept of parameter identifiability.

Definition 1. Let H_t(θ) be the conditional second moment of x_t and let Θ be the parameter space. Then H_t(θ) is identifiable if, for all θ_1, θ_2 ∈ Θ, H_t(θ_1) = H_t(θ_2) a.s. implies θ_1 = θ_2.

It is necessary to study the conditions for parameter identification since the parameter estimates are obtained by maximizing the likelihood function: the solution needs to be unique when the likelihood function attains its maximum.

Theorem 1. Assume that:

Assumption 1. For all θ ∈ Θ, α_i > 0 and β_i ∈ [0, 1) for i = 1, …, m.

Assumption 2. The model in Eq. 12.4 is stationary and ergodic.

Then there exists a unique solution θ ∈ Θ which maximizes the quasi-likelihood function for n sufficiently large.

If Assumption 1 is satisfied, then the conditional second moment of x_t, H_t, is identifiable in the quasi-likelihood function. Suppose that θ_0 is the true value of the parameters and H_t is identifiable; then E(L_n(θ_0)) > E(L_n(θ)) for all θ ≠ θ_0. If the time series x_t is ergodic and stationary, there is a unique solution θ_0 in the parameter space Θ which maximizes the likelihood function when the sample size n is sufficiently large.
12.4 Numeric Examples

12.4.1 Model Modification

To reduce the number of parameters and simplify the model, the contributions from each individual stock to the common risk indicator σ_{0,t}^2 can be assumed equal, β_{01} = β_{02} = ⋯ = β_{0m} = β_0. In this case, the number of parameters in σ_{0,t}^2 is reduced from m + 1 to 2, and the last line in Eq. 12.5 becomes

\[
\sigma_{0,t}^2 = \omega_0 + \beta_0\,(\sigma_{1,t-1}^2 + \cdots + \sigma_{m,t-1}^2).
\]
The function g used in this section is chosen as the piecewise function defined by

\[
g(x) = \begin{cases} x, & |x| < 0.01,\\ 0.01, & |x| \ge 0.01. \end{cases}
\]
The effect of the observed data is bounded within 10 % once the observed data reach extremely large values (larger than 10 %). If the daily log return of a stock exceeds 10 % in the real world, we would consider doing more research on that stock, since such a move is unusual.
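A direct implementation of this bounding function (restating the helper used in the earlier simulation sketch) might look as follows; the function name bounded_g is an illustrative choice.

```python
import numpy as np

def bounded_g(x, cap=0.01):
    """Piecewise news-impact function: g(x) = x when |x| < cap, and cap otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < cap, x, cap)

print(bounded_g([-0.004, 0.02, -0.15]))   # approximately [-0.004  0.01  0.01]
```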

12.4.2 Real Data Analysis

A bivariate example is presented in this subsection, based on the centered log returns of two equity series (two stocks on the New York Stock Exchange: International Business Machines Corporation (IBM) and Bank of America (BAC)) from 1995 to 2007 (Fig. 12.1). The conditions for stationarity and ergodicity have not been derived yet; the ergodicity of the process can be partly verified by numerical results, while stationarity is commonly assumed for financial log returns. The default search parameter space is chosen to be Θ = [−1, 1] × [0, 1]^7, and numerical checks are used to verify the positive definiteness constraints on the H_t and R matrices.
A numerical study based on the parametric bootstrap (or Monte Carlo simulation) is used to examine the asymptotic normality of the maximum likelihood estimator. The histograms of the estimates in Figs. 12.2 and 12.3 are well shaped as normal distributions, which supports the asymptotic normality of the estimator in this model empirically.
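Such a parametric bootstrap can be sketched as follows, reusing the illustrative helpers simulate_common_risk and neg_quasi_loglik introduced earlier: simulate many datasets from the fitted parameter values, re-estimate the model on each, and inspect the histograms and percentiles of the estimates. The sample size, the number of replications, and the optimizer settings are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Fitted parameter vector (rho, alpha1, alpha2, beta1, beta2, omega0, beta0);
# values here follow the 'True' column of Table 12.2.
theta_hat = np.array([-0.45, 0.0669, 0.0190, 0.893, 0.958, 1.51e-5, 0.434])

boot_estimates = []
for b in range(1000):                              # 1000 replications, as in Figs. 12.2-12.3
    x_b = simulate_common_risk(n=3000, alpha=theta_hat[1:3], beta=theta_hat[3:5],
                               omega0=theta_hat[5], beta0=(theta_hat[6], theta_hat[6]),
                               rho=theta_hat[0], seed=b)
    fit = minimize(neg_quasi_loglik, theta_hat, args=(x_b,), method="Nelder-Mead")
    boot_estimates.append(fit.x)

boot_estimates = np.asarray(boot_estimates)
# Histograms of the columns correspond to the panels of Figs. 12.2 and 12.3;
# percentile intervals give bootstrap confidence intervals as in Table 12.2.
print(np.percentile(boot_estimates, [2.5, 97.5], axis=0))
```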
The vertical lines in Fig. 12.4 mark some big events in the global stock markets over this time period. The 1997 mini-crash in the global market was triggered by the economic crisis in Asia on Oct 27, 1997. The period between the two solid lines, October 1999 to October 2001, covers the bursting of the Internet bubble. The last line marks the peak before the financial crisis, on Oct 9, 2007. The estimated conditional variances differ significantly at some time points. During the 1997 mini-crash, the estimated conditional variances from the DCC-GARCH model differ from those of the common risk model: under the DCC-GARCH model the conditional variance of IBM was high while that of BAC was relatively low, whereas under the common risk model the conditional variances of both log return series were quite high.
Fig. 12.1 Centered log returns of IBM and BAC from Jan. 1, 1995 to Dec. 31, 2007. The solid black line represents the centered log returns of IBM and the cyan dashed line represents the centered log returns of BAC.
Fig. 12.2 The histograms of 1000 parameter estimates from the Monte Carlo simulations (α_1, α_2, β_1, β_2).
It is difficult to tell which model fits the data better, since the main uses of these models are all based on the conditional volatilities or the conditional correlations (Fig. 12.5).
Denote by V_mod1 the conditional variance estimated from model 1 in the following equation. Define a variable measuring the relative difference between the estimated conditional variances of two models:

\[
\text{Relative difference} = \frac{\frac{1}{n}\sum_{t=1}^{n}\,|V_{\text{mod1},t} - V_{\text{mod2},t}|}{\frac{1}{n}\sum_{t=1}^{n} V_{\text{mod2},t}}.
\]
Table 12.1 is not a symmetric table, since the entries in symmetric positions have different denominators according to the definition above. The estimated conditional variances for the IBM and BAC log return series from the traditional models are very close to each other, while the relative differences between our new model and the other models are large.
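A small helper implementing the relative-difference measure could look as follows; its asymmetry in the two arguments is exactly why Table 12.1 is not symmetric. The function name relative_difference and the toy variance paths are illustrative assumptions.

```python
import numpy as np

def relative_difference(v_mod1, v_mod2):
    """Mean absolute difference of conditional variances, scaled by the mean of model 2's variances."""
    v_mod1, v_mod2 = np.asarray(v_mod1), np.asarray(v_mod2)
    return np.mean(np.abs(v_mod1 - v_mod2)) / np.mean(v_mod2)

v_common = np.array([1.2e-4, 3.5e-4, 2.0e-4])   # toy conditional variance paths
v_dcc    = np.array([1.0e-4, 3.0e-4, 2.2e-4])
print(relative_difference(v_common, v_dcc), relative_difference(v_dcc, v_common))
```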
Fig. 12.3 The histograms of 1000 parameter estimates from the Monte Carlo simulations (ρ_{1,2}, ω_0, β_0).
Fig. 12.4 The estimated conditional variances of IBM and BAC together with σ_{0,t}^2, under the common risk model (upper panel) and the DCC-GARCH model (lower panel); vertical lines mark the 1997 mini crash, the Internet bubble bursting, and the peak before the financial crisis.
Fig. 12.5 The estimated conditional correlations between IBM and BAC from the CommonRisk, DCC-GARCH, and CCC-GARCH models; vertical lines mark the 1997 mini crash, the Internet bubble bursting, and the peak before the financial crisis.
Table 12.1 The relative difference of the BAC series between models

                               Model2
  Model1        CommonRisk   CCCGARCH   DCCGARCH   GARCH(1,1)
  CommonRisk    –            15.06 %    15.06 %    15.00 %
  CCCGARCH      15.11 %      –          0.01 %     0.12 %
  DCCGARCH      15.11 %      0.01 %     –          0.13 %
  GARCH(1,1)    15.06 %      0.12 %     0.13 %     –
Table 12.2 The 95 % confidence intervals of the estimates obtained using the parametric bootstrap

            ρ̂_{1,2}   100·α̂_1   100·α̂_2   10·β̂_1   10·β̂_2   10^5·ω̂_0   10·β̂_0
  ‘True’    −0.45      6.69       1.90       8.93      9.58      1.51        4.34
  LB        −0.78      5.18       1.22       8.68      9.30      0.24        2.34
  UB        −0.12      9.41       3.74       9.13      9.71      7.47        6.04
It is worthwhile to build such a complicated model, since it will change the investment strategy dramatically.
12.4.3 Numeric Ergodicity Study

This example demonstrates the ergodicity and the long-term behavior of the model. The data were simulated from the ‘True’ parameter values in Table 12.2. The plots illustrate the behavior of log returns from two common risk models (denoted by M1 and M2) starting from different initial volatilities. Denote the log returns of the first simulated bivariate common risk model M1 by (x_1, x_2); the initial value (σ_{1,0}, σ_{2,0}, σ_{0,0}, x_{1,0}, x_{2,0}) of this model is (0.020, 0.018, 0.013, 0.0079, 0.0076). The log returns simulated from M2 are denoted by (y_1, y_2), and the initial value of M2 is (0.01, 0.01, 0.01, 0.009, 0.009).
In Figs. 12.6 and 12.7, we can see that the effect of the starting volatilities vanishes after a sufficiently long burn-in period.
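This burn-in behaviour can be reproduced with the earlier simulation sketch by varying only the initial volatilities; the call below, which reuses the illustrative simulate_common_risk helper with a shared random seed, is an assumption about how such a check might be coded rather than the authors' procedure.

```python
import numpy as np

# Two runs that differ only in their starting volatilities (cf. M1 and M2 above);
# the same seed is used so that the innovation sequences coincide.
x_m1 = simulate_common_risk(n=1000, init_sig=(0.020, 0.018), seed=42)
x_m2 = simulate_common_risk(n=1000, init_sig=(0.010, 0.010), seed=42)

# The early paths differ, but after a burn-in the two paths become numerically
# indistinguishable, illustrating that the effect of the initial values dies out.
print(np.max(np.abs(x_m1[:50] - x_m2[:50])), np.max(np.abs(x_m1[-50:] - x_m2[-50:])))
```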
Fig. 12.6 The simulated σ's from the two sets of initial values M1 and M2: the upper plot is σ_{1,t}, the middle plot is σ_{2,t}, and the bottom plot is σ_{0,t}. The solid black lines represent the simulated values from M1, while the red dashed lines show the simulated values from M2.
Fig. 12.7 The simulated bivariate log returns from the two different initial values M1 and M2: the simulated path of x_{1,t} is shown in the upper plot and the simulated path of x_{2,t} in the lower plot. The solid black lines represent the simulated values from M1, while the red dashed lines show the simulated values from M2.
References

Black F, Scholes M (1973) The pricing of options and corporate liabilities. J Polit Econ 81:637–654
Bollerslev T (1986) Generalized autoregressive conditional heteroskedasticity. J Econ 31:307–327
Bollerslev T (1990) Modelling the coherence in short-run nominal exchange rates: a multivariate
generalized arch model. Rev Econ Stat 72:498–505
Bollerslev T, Engle RF, Wooldridge JM (1988) Capital asset pricing model with time-varying
covariances. J Polit Econ 96:116–131
Burmeister E, Wall KD (1986) The arbitrage pricing theory and macroeconomic factor measures.
Financ Rev 21:1–20
Carr P, Wu L (2009) Variance risk premiums. Rev Financ Stud 22:1311–1341
Christie AA (1982) The stochastic behavior of common stock variances. Value, leverage and
interest rate effects. J Financ Econ 10:407–432
Duffie D, Pan J, Singleton K (2000) Transform analysis and asset pricing for affine jump-
diffusions. Econometrica 68:1343–1376
Engle RF (1982) Autoregressive conditional heteroscedasticity with estimates of the variance of
United Kingdom inflation. Econometrica 50:987–1007
Engle RF (2002) Dynamic conditional correlation: a simple class of multivariate generalized
autoregressive conditional heteroskedasticity models. J Bus Econ Stat 20:339–350
Engle RF, Ng VK, Rothschild M (1990) Asset pricing with a factor-arch covariance structure.
Empirical estimates for treasury bills. J Econ 45:213–237
Fama EF, French KR (1993) Common risk factors in the returns on stocks and bonds. J Financ
Econ 33:3–56
Fama EF, French KR (2015) A five-factor asset pricing model. J Financ Econ 116:1–22
Girardi G, Tolga Ergun A (2013) Systemic risk measurement: multivariate GARCH estimation of
CoVaR. J Bank Financ 37:3169–3180
Merton RC (1973) An intertemporal capital asset pricing model. Econometrica 41:867–887
Santos AAP, Moura GV (2014) Dynamic factor multivariate GARCH model. Comput Stat Data
Anal 76:606–617
Sharpe WF (1964) Capital asset prices: a theory of market equilibrium under conditions of risk. J
Financ 19:425–442
So MKP, Yip IWH (2012) Multivariate GARCH models with correlation clustering. J Forecast
31:443–468
Tankov P, Tauchen G (2011) Volatility jumps. J Bus Econ Stat 29:356–371
Treynor JL (2008) Treynor on institutional investing. John Wiley & Sons, Hoboken
Tse YK, Tsui AKC (2002) A multivariate generalized autoregressive conditional heteroscedasticity
model with time-varying correlations. J Bus Econ Stat 20:351–362