Missing and Modified Data in Nonparametric Estimation
With R Examples

MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY
Sam Efromovich
The University of Texas at Dallas
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface xiii
1 Introduction 1
1.1 Density Estimation for Missing and Modified Data 2
1.2 Nonparametric Regression with Missing Data 7
1.3 Notions and Notations 14
1.4 Software 21
1.5 Inside the Book 23
1.6 Exercises 26
1.7 Notes 28
4 Nondestructive Missing 99
4.1 Density Estimation with MCAR Data 101
4.2 Nonparametric Regression with MAR Responses 107
4.3 Nonparametric Regression with MAR Predictors 112
4.4 Conditional Density Estimation 117
4.5 Poisson Regression with MAR Data 124
4.6 Estimation of the Scale Function with MAR Responses 127
4.7 Bivariate Regression with MAR Responses 129
4.8 Additive Regression with MAR Responses 133
4.9 Exercises 137
4.10 Notes 142
References 423
Missing and modified data are familiar complications in statistical analysis, while the statistical
literature and college classes are primarily devoted to the case of direct observations when
the data are elements of a matrix. Missing data make some elements of the matrix unavailable,
while a modification replaces elements by modified values. Classical examples of missing are
missing completely at random, when an element of the direct data is missed by pure chance
regardless of the values of the underlying direct data, and missing not at random, when the
likelihood of missing depends on the values of the missed data. Classical examples of data modifications
are biased data, truncated and censored data in survival analysis, amplitude-modulation of
time series and measurement errors. Of course, missing may be considered as a particular
case of data modification, while a number of classical data modifications, including biased
and truncated data, are created by hidden missing mechanisms. This explains why it is pru-
dent to consider missing and modification together. Further, these statistical complications
share one important feature that one must know: missing or modification of data
may be destructive and make consistent (feasible) estimation impossible. If the latter
is the case, this should be recognized and dealt with appropriately. On the other hand, it is
a tradition in statistical literature to separate missing from modification. Further, because
missing may affect already modified data and vice versa, it is convenient to know how to
deal with both complications in a unified way.
This book is devoted solely to nonparametric curve estimation with main examples being
estimation of the probability density, regression, scale (volatility) function, conditional and
joint densities, hazard rate function, survival function, and spectral density of a time series.
Nonparametric curve estimation means that no assumption about shape of a curve (like
in linear regression or in estimation of a normal density) is made. The unique feature of
the nonparametric estimation component of the book is that for all considered statistical
models the same nonparametric series estimator is used whose statistical methodology is
based on estimation of Fourier coefficients by a corresponding sample mean estimator. While
the approach is simple and straightforward, the asymptotic theory shows that no other
estimator can produce a better nonparametric estimation under standard statistical criteria.
Further, using the same nonparametric estimator will allow the reader to concentrate on
different models of missing and data modification rather than on the theory and methods
of nonparametric statistics. The approach used, however, has a couple of downsides. There
are a number of popular methods of dealing with missing data, such as maximum likelihood,
expectation-maximization, partial deletion, imputation and multiple imputation, or the
product-limit (Kaplan–Meier) methodology of dealing with censored data. Similarly, there
exist equally interesting and potent methods of nonparametric curve estimation like kernel,
spline, wavelet and local polynomials. These topics are not covered in the book and only
corresponding references for further reading are provided.
The complementary R package is an important part of the book; it makes the learning
hands-on and brings the discussed topics into the realm of reproducible research. Virtually
every claim and development mentioned in the book is illustrated with graphs that are
available for the reader to reproduce and modify. This makes the material fully transparent
and allows one to study it interactively. It is important to stress that no knowledge of R is
needed for using the R package and the Introduction explains how to install and use it.
The book is self-contained, has a brief review of necessary facts from probability, para-
metric and nonparametric statistics, and is appropriate for a one-semester course for diverse
classes with students from statistics and other sciences including engineering, business, so-
cial, medical, and biological among others (in this case a traditional intermediate calculus
course plus an introductory course in probability, on the level of the book by Ross (2015),
are the prerequisites). It also may be used for teaching graduate students in statistics (in
this case an intermediate course in statistical inference, on the level of the book by Casella
and Berger (2002), is the prerequisite). A large number of exercises, with different levels of
difficulty, will guide the reader and help an instructor to test understanding of the material.
Some exercises are based on using the R package.
There are 10 chapters in the book. Chapter 1 presents examples of basic models, contains
a brief review of necessary facts from the probability and statistics, and explains how to
install and use the R software. Chapter 2 presents the recommended nonparametric estima-
tion methodology for the case of direct observations. Its Section 2.2 is a must read because
it explains construction of the nonparametric estimator which is used throughout the book.
Further, from a statistical point of view, the only fact that the reader should believe in is
that a population mean may be reliably estimated by a sample mean. Chapter 3 is primarily
devoted to the case of biased data that serve as a bridge between direct data and missed,
truncated and censored data. Chapters 4 and 5 consider nonparametric curve estimation
for missing data. While Chapter 4 is devoted to the cases when a consistent estimation
is possible based solely on missing data, Chapter 5 explores more complicated models of
missing when extra information is necessary for consistent estimation. Chapters 6 and
7 are devoted to classical truncated and censored data modifications traditionally studied
in survival analysis, and in Chapter 7 the modified data may be also missed. Modified and
missing time series are discussed in Chapters 8 and 9. Models of missing and modified data,
considered in Chapters 3-9, do not slow down the rates of convergence under traditional statistical criteria. But
some modifications, like measurement errors or current status censoring, do affect the rate.
The corresponding problems, called ill-posed, are discussed in Chapter 10.
Let us also note that the main emphasis of the book is placed on the study of small
samples and data-driven estimation. References on asymptotic results and further reading
may be found in the “Notes” sections.
For the reader who would like to use this book for self-study and who is venturing for
the first time into this area, the advice is as follows. Begin with Section 1.3 and review
basic probability facts. Sections 2.1 and 2.2 will explain the nonparametric estimator used in the book.
Please keep in mind that the same estimator is used for all problems. Then read Section 3.1
that explains the notion of biased data, how to recover the distribution from biased data,
and that a consistent estimation, based solely on biased data, is impossible. Also, following
the explanation of Section 1.4, install the book’s R package and test it using figures in
the above-mentioned sections. Now you are ready to read the material of your interest in
any chapter. During your first reading of the material of interest, you may skip probability
formulas and go directly to simulations and data analysis based on figures and R package;
this will make the learning process easier. Please keep in mind that a majority of figures
are based on simulations, and hence you know the underlying model and can appreciate the
available data and nonparametric estimates.
All updates to this book will be posted on the web site http://www.utdallas.edu/~efrom
and the author may be contacted by electronic mail at [email protected].
Acknowledgments
I thank everyone who, in various ways, has had an influence on this book. My biggest
thanks go to my family for the constant support and understanding. My students, col-
leagues and three reviewers graciously read and gave comments on a draft of the book.
John Kimmel provided invaluable assistance through the publishing process. The support
by NSF Grants DMS-0906790 and DMS-1513461, NSA Grant H982301310212, and actuarial
research Grants from the CAS, TAF and CKER are greatly appreciated.
Sam Efromovich
Dallas, Texas, USA, 2017
Chapter 1
Introduction
Nonparametric curve estimation allows one to analyze data without assuming the shape
of an estimated curve. Methods of nonparametric estimation are well developed for the
case of directly observed data, while much less is known for the case of missing and modified
data. The chapter presents several examples that shed light upon the nature of missing and
modified data and raises important questions that will be answered in the following chapters.
Section 1.1 explores a number of examples with missing, truncated and censored data when
the problem is to estimate an underlying probability density. The examples help us to
understand that missing and/or modified data not only complicate the estimation problem
but may also preclude consistent estimation of an underlying density,
and this makes a correct treatment and an understanding of the specifics of the data paramount. It
is also explained why statistical simulations should be a part of the learning process. The
latter is the primary role of the book’s R package which allows the reader to repeat and
modify simulations presented in the book. Section 1.2 explains the problem of nonparametric
regression with missing data via real and simulated examples. It is stressed that different
estimation procedures are needed for cases where either responses or predictors are missed.
Further, the problem of estimation of nuisance functions is also highlighted. Main notions
and notations, used in the book, are collected in Section 1.3. It also contains a brief primer
in elementary probability and statistics. Here the interested reader may also find references
to good introductory texts, but overall the material of the book is self-explanatory. Installation
of the book’s R package and a short tutorial on its use can be found in Section 1.4. Section
1.5 briefly reviews what is inside other chapters. Exercises in Section 1.6 will allow the
reader to review basics of probability and statistics.
Before proceeding to the above-outlined sections, let us make two general remarks. The
first remark is about the nonparametric approach used for missing and modified data versus
a parametric approach. For the case of direct data, it is often possible to argue that an un-
derlying function of interest is known up to a parameter (which may be a vector) and then
estimate the parameter. Missing and/or modified data require estimation of a number of
nuisance functions characterizing missing and/or modification, and then a parametric ap-
proach would require assuming parametric models for those nuisance functions. The latter
is a challenging problem on its own. Nonparametric estimation is based solely on data and
does not require parametric models either for the function of interest or for nuisance func-
tions. This is what makes this book’s discussion of the nonparametric approach so valuable
for missing and modified data. Further, even if a parametric approach is recommended for
estimation of a function of interest, it is prudent to use nonparametric estimation for more
elusive nuisance functions describing data modification. The second remark is about the
terminology of missing and modified data used in the book. Missing data are, of course, modified data.
Further, one may argue that truncated or censored data, our main examples of modified
data, are special cases of missing data. Nonetheless, it is worthwhile to separate the classical
notion of missing data, when some observations are not available (missed), from data mod-
ified by truncation, censoring, biasing, measurement errors, etc. Further, it is important to
stress that historically different branches of statistical science study missing and modified
data. In the book a unified nonparametric estimation methodology is used for statistical
analysis of missing and modified data, but at the same time it is stressed that each type of
data has its own specifics that must be taken into account.
Figure 1.1 Directly observed and biased data exhibited by two top and two bottom histograms, cor-
respondingly. Sample size of direct observations is n while N is the number of biased observations.
Direct observations of X ∗ are generated by the Normal density defined in Section 2.1, the bias-
ing function is max(0, min(a + bx, 1)) with a = 0 and b = 0.7. {Here and in all other captions
information in curly brackets is devoted to explanation of arguments of the figure that can be
changed. For Figure 1.1 the simulation can be repeated and results visualized by using the book’s
R package (see how to install and use it in Section 1.4). When the R package is installed, start R
and simply enter (after the R prompt >) ch1(fig=1). Note that using ch1(fi=1) or ch1(f=1)
yields the same outcome. Default values of arguments, used by the package for Figure 1.1 are
shown in the square brackets. All indicated default values may be changed. For instance, after
entering ch1(fig=1,n=100,a=0.1,b=0.5) a sample of size n = 100 with the biasing function
max(0, min(0.1 + 0.5x, 1)) will be generated and exhibited.} [n = 400, a = 0, b = 0.7].
is used throughout the book and the reader will be able to repeat simulations and change
parameters of estimators using the R package discussed in Section 1.4. Further, each figure,
in its caption, contains information about an underlying simulation and its parameters that
may be changed.
Now let us return to the discussion of the study of the ratio of alcohol in the blood of
liquor-intoxicated drivers. The above-described scenario of collecting data is questionable
because the police must stop all drivers during the time period of collecting data. A more
realistic scenario is that the researcher gets data available from routine police reports on
arrested drivers charged with driving under the influence of alcohol (a routine report means
Figure 1.2 Repeated simulation of Figure 1.1. {This figure is created by the call > ch1(f=1).}
that there are no special police operations to reveal all intoxicated drivers). Because a
drunker driver has a larger chance of attracting the attention of the police, it is clear that
the data will be biased toward higher ratios of alcohol in the blood. To understand this
phenomenon, let us simulate a corresponding sample using the direct data shown in the
top diagram in Figure 1.1. Assume that if X ∗ = x is the ratio of alcohol, then with the
probability max(0, min(a + bx, 1)) the driver will be stopped by the police. The above-
presented formula looks complicated, but it simply reflects the fact that the probability
of any event is nonnegative and less than or equal to one. The third from the top diagram in
Figure 1.1 shows us results of the simulation which created the biased observations; here
the particular values a = 0 and b = 0.7 are used. As could be expected, the histogram
shows that available observations are skewed to the right with respect to the direct data
because a “drunker” driver has a larger probability of being stopped. Further, the sample size
of the biased sample X1 , . . . , XN is just N = 146, in other words, among n = 400 drivers
using the highway during the period of time T of the study, only N = 146 were stopped.
Note that the value of N , as well as values of parameters a and b, are shown in the title. An
important conclusion from the histogram is that it is impossible to restore the underlying
density of X ∗ based solely on the biased data. This message becomes even more clear from
the bottom diagram where the same histogram of biased observations is overlaid by the
underlying density. Further, we again see how the underlying density sheds light on the
data and the estimation problem, and note that this learning tool is available only in a
simulation.
What we have observed in Figure 1.1 is an example of data modified by biasing and
also a specific example of missing data because now only N = 146 observations, from the
hidden direct sample of size n = 400, are available. It is clear that a reliable (in statistics
we may say consistent) estimator must take this modification of direct data into account.
How to do this will be explained in Chapters 3-5.
The book’s R package (see Section 1.4 on how to install and use it) allows us to repeat
the simulation of Figure 1.1. To illustrate this possibility, diagrams in Figure 1.2 present
another simulation produced by R function ch1(fig=1) of the package. Arguments n, a
and b, used to create Figure 1.2, have the same default values of n = 400, a = 0 and
b = 0.7 as in Figure 1.1. Note that values of these parameters are shown in Figure 1.2. As a
result, diagrams in the two figures are different only due to the stochastic nature of the underlying
simulations. Let us look at diagrams in Figure 1.2. The direct sample creates a relatively
flat middle part of the histogram, and it clearly does not look symmetric about 0.5. Further,
the biased data is even more pronouncedly skewed to the right. Just imagine how difficult
it would be for an estimator to restore the symmetric and bell-shaped underlying density
shown by the solid line. The reader is advised to repeat this figure (and others) multiple
times to learn possible patterns of random samples.
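For readers who want to experiment outside the book's package, the biasing mechanism of Figures 1.1 and 1.2 can be sketched in a few lines of base R. The sketch below is illustrative only and is not the package's code; in particular, a bell-shaped density truncated to [0, 1] stands in for the book's Normal corner density of Section 2.1, while the biasing function max(0, min(a + bx, 1)) is used exactly as in the caption of Figure 1.1.

# Minimal sketch (not the book's package code): simulate direct data X*
# and thin it with the biasing function w(x) = max(0, min(a + b*x, 1)).
set.seed(1)
n <- 400; a <- 0; b <- 0.7
xstar <- rnorm(n, mean = 0.5, sd = 0.2)      # placeholder for the Normal corner density
xstar <- pmin(pmax(xstar, 0), 1)             # keep observations inside [0, 1]
w <- pmin(pmax(a + b * xstar, 0), 1)         # biasing (stopping) probabilities
stopped <- rbinom(n, size = 1, prob = w) == 1
x <- xstar[stopped]                          # biased sample of random size N <= n
c(n = n, N = length(x))
par(mfrow = c(2, 1))
hist(xstar, main = "Direct observations")
hist(x, main = paste("Biased observations, N =", length(x)))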
Now let us consider another example of data modification. Suppose that we are interested
in the distribution of an insured loss X ∗ due to hail. We can get information about payments
for the losses from insurance companies, but the issue is that insurance companies do not
know all insured losses due to the presence of a deductible D in each hail insurance policy.
A deductible is the amount a policyholder agrees to pay out-of-pocket after making a claim
covered by the insurance policy. For example, if the deductible is $5,000 and the accident
damage costs $25,000, the policyholder will pay $5,000 and the insurance company will pay
the remaining $20,000. The presence of deductible in an insurance policy implies that if
the loss X ∗ is not larger than the deductible D then no payment occurs, while otherwise
the compensation (payment) for the loss is equal to X ∗ − D. As a result losses that do not
exceed policy deductibles are typically missed (it makes no sense to file a claim for a loss
less than the deductible because not only does the policyholder get no compensation but
the next insurance premium will also likely go up). Further, for a policy with a payment we
can calculate the loss as the payment plus the policy deductible. We conclude that, based
on policies with payments, we observe the loss X = X ∗ only given X ∗ > D. The latter is
an important actuarial example of biased and missing data.
In statistical literature, the above-described effect of a deductible on availability of a
hidden loss is called a left truncation or simply a truncation. (The only pure notational
difference is that traditionally, under the left truncation, the loss is observed if it is larger
or equal to the deductible. In what follows we are going to use this approach.) In short,
truncation is a special missing mechanism which is defined by the variable of interest and
another truncating variable (here the deductible). Further, deductibles in insurance policies
may imply a destructive modification of data when no consistent estimation of the density
of losses is possible. Indeed, if the deductible in all policies is larger than a positive constant
d_0, then the density f^{X*}(x) of the loss cannot be recovered for x < d_0 because we always
miss the small losses. The latter is an important property of data truncation that should
be taken care of.
Deductible is not the only complication that one may deal with while working with
insurance data. An insurance policy may also include a limit L on the payment which
implies that a compensation for insurable loss cannot exceed the limit L. If a policy with
limit L made a payment on loss X ∗ , then the available information is the pair (V, ∆) where
V := min(X ∗ , L) and ∆ := I(X ∗ ≤ L) (here and in what follows, I(B) is the indicator
of event B which is equal to one if the event B occurs and it is equal to zero otherwise).
Figure 1.3 Hidden, truncated and censored insured losses due to hail. The top diagram exhibits
the histogram of hidden losses X1∗ , . . . , Xn∗ distributed according to the density shown by the solid
line. The middle diagram shows the histogram of these losses truncated by deductibles D1 , . . . , Dn
that are generated from the density shown by the dashed line. In other words, this is the histogram
of X_1, . . . , X_N which is a subsample of X_l^* satisfying X_l^* ≥ D_l, l = 1, . . . , n, and the number of
available truncated losses is N := Σ_{l=1}^{n} I(X_l^* ≥ D_l) = 198 as shown in the title. The bottom
diagram shows right censored losses. It is based on the underlying losses shown in the top diagram
and a new sample L1 , . . . , Ln of policy limits generated from the density shown by the dashed line
in the middle diagram. The observations are n pairs (V1 , ∆1 ), . . . , (Vn , ∆n ) shown by circles in the
bottom diagram. Here V_l := min(X_l^*, L_l), ∆_l := I(X_l^* ≤ L_l). The title also shows the number of
uncensored losses M := Σ_{l=1}^{n} ∆_l = 206. {In all figures that may be repeated using the book’s R
package, at the end of their caption the reader can find in square brackets arguments that may be
changed. For instance, this Figure 1.3 allows one to repeat simulations with different sample sizes
n. All other parameters of the above-outlined simulation remain the same.} [n = 400].
In statistics this type of data modification is called the right censoring, and note that
the censoring does not decrease the total number of observations but it does decrease the
number of observed values of X ∗ . Further, if P(L > l0 ) = 0, then the distribution of X ∗
beyond l0 cannot be restored.
The above-presented actuarial example, as well as notions of truncation and censoring,
may be a bit confusing. Let us complement the explanation with a simulated example shown
in Figure 1.3. The top diagram in Figure 1.3 exhibits the histogram of simulated losses
X1∗ , . . . , Xn∗ generated according to the density shown by the solid line. The middle diagram
in Figure 1.3 shows us the histogram of left truncated losses that are generated as follows.
Deductibles D1 , . . . , Dn are simulated according to the density shown by the dashed line,
and then for each l = 1, . . . , n if Xl∗ < Dl then Xl∗ is skipped and otherwise Xl∗ is observed
and added to the sample of truncated losses. As a result, we get the truncated sample
X_1, . . . , X_N of size N := Σ_{l=1}^{n} I(X_l^* ≥ D_l). As could be expected, the middle diagram
indicates that the left truncated losses are skewed to the right and the histogram does not
resemble the underlying density. We will discuss in Chapters 6 and 7 how to restore the
underlying density of hidden losses based on truncated data.
The bottom diagram in Figure 1.3 exhibits the example of right censored losses with the
limit L distributed according to the density shown by the dashed line in the middle diagram
and the hidden losses shown in the top diagram. The observed pairs (Vl , ∆l ), l = 1, . . . , n
are shown by circles, and recall that Vl := min(Xl∗ , Ll ) and ∆l := I(Xl ≤ Ll ). The top row
of circles shows us underlying (uncensored) losses corresponding to cases Xl ≤ Ll , while
the bottom row of circles shows us limits on payments corresponding to cases Xl > Ll .
Note that the available losses are clearly biased, they are skewed to the left with respect to
the hidden losses, and their number M := Σ_{l=1}^{n} ∆_l = 206 is dramatically smaller than the
number n = 400 of the hidden losses.
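A short simulation may also help to keep left truncation and right censoring apart. The following sketch is illustrative only: exponential distributions are arbitrary placeholders for the losses, deductibles and limits (the book's own densities are introduced later), but the truncation and censoring mechanics are exactly those described above for Figure 1.3.

# Illustrative sketch of left truncation and right censoring (placeholder
# exponential distributions; not the distributions used in Figure 1.3).
set.seed(2)
n <- 400
xstar <- rexp(n, rate = 1)                 # hidden losses X*
d <- rexp(n, rate = 2)                     # deductibles D (truncating variable)
lim <- rexp(n, rate = 0.5)                 # policy limits L (censoring variable)
# Left truncation: X* is observed only when X* >= D.
x_trunc <- xstar[xstar >= d]
N <- length(x_trunc)
# Right censoring: observe V = min(X*, L) and Delta = I(X* <= L).
v <- pmin(xstar, lim)
delta <- as.numeric(xstar <= lim)
M <- sum(delta)                            # number of uncensored losses
c(n = n, N.truncated = N, M.uncensored = M)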
Clearly, special statistical methods should be used to recover the distribution of a random
variable whose observations are modified by truncation and/or censoring, and corresponding
statistical methods are collectively known as survival analysis. While survival data and
missing data have a lot in common, in the book we are following the tradition to consider
them separately; at the same time let us stress that the proposed method of nonparametric
analysis is the same for both types of data. Furthermore, we will consider missing survival
data when these two mechanisms of data modification perform together.
Let us make a final remark. The above-presented examples are simulated. The author is
a strong believer in learning statistics via simulated examples because in those examples you
know the underlying model and then can appreciate the complexity of the problem and the quality
of its solution. This is why the approach of simulated examples is used in the book. Further,
the book’s R software will allow the reader to test theoretical assertions, change parameters
of underlying statistical models and choose parameters of recommended estimators. This
will help the reader to gain the necessary experience in dealing with missing and modified
data.
If no assumption about the shape of m(x) is made, then (1.2.1) is called the nonparametric
regression of response Y on predictor X. Another popular interpretation of (1.2.1) is the
equation Y = m(X) + ε where ε is the regression error satisfying E{ε|X} = 0. The familiar
alternative to the nonparametric regression is a parametric linear regression when it is
[Figure 1.4 appears here: three scattergrams. The top diagram is plotted versus Month, the middle versus Age, and the bottom (Wages) plots log(wage) versus Age.]
Figure 1.4 Parametric and nonparametric regressions for three real datasets. Observations are
shown by circles, linear regression is shown by the solid line, nonparametric regression is shown by
the dashed line, and in the bottom diagram quadratic regression is shown by the dotted line. Sample
sizes in the top, middle and bottom diagrams are 108, 124, and 205, respectively.
assumed that m(x) := β0 + β1 x and then parameters β0 and β1 are estimated. While
parametric regressions are well known, nonparametric regression is less familiar and often not
studied in undergraduate or even graduate statistics classes. As a result, let us
begin with considering several examples that shed light on nonparametric regression.
Figure 1.4 presents three classical datasets. For now let us ignore curves and concentrate
on observations (pairs (Xl , Yl ), l = 1, 2, . . . , n) shown by circles. A plot of pairs (Xl , Yl ) in
the xy-plane, called scattergram or scatter plot, is a useful tool to get a first impression
about studied data. Let us begin with analysis of the top diagram which presents monthly
housing starts in the USA from January 1966 to December 1974. Note that the predictor
is deterministic and its realizations are equidistant meaning that Xl+1 = Xl + 1. It is also
reasonable to assume that monthly housing starts are dependent variables, and then we
observe a classical time series of housing starts. The simplest classical decomposition model
of a time series is
Yl = m(Xl ) + S(Xl ) + εl , (1.2.2)
where m(x) is a slowly changing function known as a trend component, S(x) is a periodic
function with period T (that is, S(x + T ) = S(x)) known as a seasonal (cyclical) component
(it is also customarily assumed that the sum of its values over the period is zero), and εl
are random and possibly dependent components with zero mean; more about time series
and its statistical analysis can be found in Chapters 8 and 9. While a traditional time series
problem is to analyze the random components, here we are interested in estimation of the
trend. Note that E(Yl |Xl = x) = m(x) + S(x), by its definition the trend is a “slowly”
changing (with respect to the seasonal component) function, and therefore the problem of
interest is the regression problem with the so-called fixed design when we are interested
in a low-frequency component of the regression. Now, please use your imagination and try
to draw the trend m(x) via the scattergram. Note that the period of the seasonal component
is 12 months and this may simplify the task. Now look at the solid line which exhibits
the classical least-squares linear regression. Linear regression is a traditional tool used in
time series analysis, and here it may look like a reasonable curve to predict the housing
market. Unfortunately, back in the seventies, too many believed that this is the curve which
describes the housing market. The dashed line shows the estimated nonparametric trend
whose construction will be explained in the following chapters. The nonparametric trend
exhibits the famous boom and the tragic collapse of the housing market in the seventies,
and it nicely fits the scattergram by showing two modes in the trend. Further, note how
well the nonparametric trend allows us to visualize the seasonal component with the period
of 12 months.
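To see how the decomposition (1.2.2) produces data resembling the housing-starts scattergram, one can simulate a toy series with a smooth trend, a seasonal component with period 12 and random noise. The particular trend, amplitude and noise level below are arbitrary choices made only for illustration.

# Toy simulation of model (1.2.2): Y_l = m(X_l) + S(X_l) + eps_l with a
# 12-month seasonal component; all numerical choices are illustrative.
set.seed(3)
n <- 108
x <- 1:n                                   # months (fixed, equidistant design)
m <- 100 + 40 * sin(pi * x / n)            # slowly changing trend
S <- 10 * cos(2 * pi * x / 12)             # seasonal component with period 12
y <- m + S + rnorm(n, sd = 5)              # observed series
plot(x, y, xlab = "Month", ylab = "Y")
lines(x, m, lwd = 2)                       # overlay the underlying trend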
The middle diagram presents a regression dataset of independent pairs of observations.
Its specific feature is that the predictors are not uniformly distributed and the volatility of the
response depends on the predictor. Such a regression is called heteroscedastic. The scatter-
gram exhibits automobile insurance claims data. The dependent variable Y is the amount
paid on a closed claim, in (US) dollars, and the predictor X is the age of the driver. Only
claims larger than $10,000 are analyzed. Because the predictor is the random variable, the
regression (1.2.1) may be referred to as a random design regression. An appealing nature
of the regression problem is that one can easily appreciate its difficulty. To do this, try
to draw a curve m(x) through the middle of the cloud of circles in the scattergram that,
according to your own understanding of the data, gives a good fit (describes a relationship
between X and Y ) according to the model (1.2.1). Clearly such a curve depends on your
imagination as well as your understanding of the data at hand. Now we are ready to com-
pare our imagination with performance of estimators. The solid line shows us the linear
regression. It indicates that there is no statistically significant relationship between the age
and the amount paid on a closed claim (the estimated slope of 14.22 is insignificant with
p-value equal to 0.7). Using linear regression for this data looks like a reasonable approach,
but let us stress that it is up to the data analyst to justify that the relationship between the
amount paid on a claim and the age of the operator is linear and not of any other shape.
Now let us look at the dashed line which exhibits a nonparametric estimate whose shape is
defined by the data. How this estimate is constructed will be explained shortly in Chapter
2. The nonparametric regression exhibits a pronounced shape which implies an interesting
conclusion: the amount paid on closed claims is largest for drivers around 68 years old and
then it steadily decreases for both younger and older drivers. (Of course, it is possible that
drivers of this age buy higher limits of insurance, or there are other lurking variables that
we do not know. If these variables are available, then a multivariate regression should be
used.) Now, when we have an opinion of the nonparametric estimator, please look one more
time at the data and you may notice that the estimator’s conclusion has merit.
The bottom diagram in Figure 1.4 presents another classical regression dataset of 1971
wages as a function of age. The linear regression (the solid line) indicates an increase in
wages with age, which corresponds to a reasonable opinion that wages reflect the worker’s
experience. In economics, the traditional age wage equation is modeled as a quadratic in
age, and the corresponding parametric quadratic regression m2 (x) := β0 + β1 x + β2 x2
is shown by the dotted line. The quadratic regression indicates that the wages do initially
increase with age (experience) but then they eventually go down to the level of the youngest
workers. Note that the two classical parametric models present different pictures of the age
wage patterns. The dashed line presents the nonparametric regression. Note that overall
it follows the pattern of the quadratic regression but adds some important nuances about
modes in the age wage relationship.
Figure 1.4 allows us to understand the performance of statistical estimators. At the same
time, it is important to stress that analysis of real datasets does not allow us to appreciate
accuracy of statistical estimates because an underlying regression function is unknown, and
hence we can only speculate about the quality of this or that estimate.
Similarly to our previous conclusion in Section 1.1 for density estimation, analysis of
real data does not allow us to fully appreciate how well a particular regression estimator
performs. To overcome this drawback, we are going to use numerical simulations with a
known underlying regression function. Let us consider several simulated examples.
We begin with the study of the likelihood (probability) of an insurable event, which
may be a claim, payment, accident, early prepayment on mortgage, default on a payment,
reinsurance event, early retirement, theft, loss of income, etc. Let Y be the indicator of an
insurable event (claim), that is Y = 1 if the event occurs and Y = 0 otherwise, and X
be a covariate which may affect the probability of claim; for instance, X may be general
economic inflation, or deductible, or age of roof, or credit score, etc. We are interested in
estimation of the conditional probability P(Y = 1|X = x) =: m(x). At first glance, this
problem has nothing to do with regression, but as soon as we realize that Y is a Bernoulli
random variable, then we get the regression E{Y |X = x} = P(Y = 1|X = x) = m(x). The
top diagram in Figure 1.5 illustrates this problem. Here X is uniformly distributed on [0, 1]
and then n = 200 pairs of independent observations are generated according to Bernoulli
distribution with P(Y = 1|X) = m(X) and m(x) is shown by the dotted line. Because the
regression function is known, we may learn how to recognize it in the scattergram. Linear
regression (the solid line), as it could be expected, gives us no hint about the underlying
regression while the nonparametric estimate (the dashed line), which will be defined in
Section 2.4, nicely exhibits the unimodal shape of the regression.
Now let us consider our second example where Y is the number of claims (events) of
interest, or it may be the number of: noncovered losses, payments on an insurance contract,
payments by the reinsurer, defaults on mortgage, early retirees, etc. For a given X = x,
the number of claims is modeled by Poisson distribution with parameter λ(x); that is,
P(Y = k|X = x) = e−λ(x) [λ(x)]k /k!, k = 0, 1, 2, . . . The aim is to estimate λ(x), and
because E{Y |X = x} = λ(x) this problem again can be considered as a regression problem.
A corresponding simulation with n = 200 observations is shown in the bottom diagram of
Figure 1.5. The underlying regression function (the dotted line) is bimodal. Try to use your
imagination and draw a regression curve through the scattergram, or even simpler, using
the dotted line try to understand the scattergram. The nonparametric estimate (the dashed
line) exhibits two modes. The estimate is not perfect but it does fit the scattergram. The
linear regression (the solid line) sheds no light on the data.
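The two simulations behind Figure 1.5 are easy to mimic. The sketch below is not the package's code: it uses an arbitrary unimodal m(x) for the Bernoulli case and an arbitrary bimodal λ(x) for the Poisson case, and it only illustrates how the conditional expectation E{Y|X = x} plays the role of the regression function.

# Illustrative simulation of Bernoulli and Poisson regressions with
# arbitrary regression functions (not the book's corner functions).
set.seed(4)
n <- 200
x <- runif(n)                                          # Uniform(0,1) predictors
m <- function(x) 0.2 + 0.6 * exp(-50 * (x - 0.5)^2)    # unimodal P(Y = 1|X = x)
y_bern <- rbinom(n, size = 1, prob = m(x))             # Bernoulli responses
lambda <- function(x) 2 + 4 * exp(-60 * (x - 0.25)^2) + 4 * exp(-60 * (x - 0.75)^2)
y_pois <- rpois(n, lambda = lambda(x))                 # Poisson responses
par(mfrow = c(2, 1))
plot(x, y_bern, main = "Likelihood of Claim")
curve(m(x), add = TRUE, lty = 3)                       # underlying regression
plot(x, y_pois, main = "Number of Claims")
curve(lambda(x), add = TRUE, lty = 3)                  # underlying regression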
In summary, nonparametric regression can be used for solving a large and diverse number
of statistical problems. We will continue our discussion of the nonparametric regression in
Chapter 2, and now let us turn our attention to regression estimation with missing data
when values of responses or predictors may be not available (missed).
[Figure 1.5 appears here: two diagrams, “Likelihood of Claim” and “Number of Claims”, each plotting the response Y versus the predictor.]
Figure 1.5 Linear and nonparametric regressions for two simulated regression datasets, n = 200.
In each diagram, observations are shown by circles overlaid by the underlying regression function
(the dotted line), linear regression (the solid line) and nonparametric regression (the dashed line).
Let us recall our example with intoxicated drivers discussed in the previous section when
the police stops with some likelihood drivers and then the ratio of alcohol is measured. A
Bernoulli random variable A describes the stopping mechanism and if A = 1, then a car is
stopped and if A = 0, then the car passes by. In other words, if A = 1 then the ratio of alcohol
is available and otherwise it is missed. Further, in general the availability A and the ratio
of alcohol X are dependent random variables. Then, as explained in Section 1.1, based
solely on available observations the density of the ratio of alcohol cannot be consistently
estimated. One possibility to overcome this obstacle is to obtain extra information. Let us
conjecture that the probability of a car being stopped is a function of the car speed S,
and during the experiment the police measure the speeds of all passing cars. Further, let us
additionally assume that the speed is a function of the ratio of alcohol. To be specific in our
simulated example, set S := c + dX + v(U − 0.5) where U is a standard uniform random
variable supported on [0, 1] and X is the ratio of alcohol level. Then we can write for the
availability likelihood,
P(A = 1|S = s) = max(0, min(a + bs, 1)).
Figure 1.6 Missing data in intoxicated drivers example when the availability A is a function of car
speed S. The ratio of alcohol X is generated as in Figure 1.1, the car speed S = c + dX + v(U − 0.5)
where U is the Uniform random variable on [0, 1] and independent of X, and given S = s the avail-
ability random variable A is equal to one with the probability max(0, min(a+bs, 1)). The top diagram
shows n = 400 pairs (A1 X1 , S1 ), . . . , (An Xn , Sn ) where only N = 165 pairs are complete (the ratio
of alcohol is available). The middle scattergram shows pairs (S1 , A1 X1 ), . . . , (Sn , An Xn ). The bot-
tom diagram is a traditional scattergram of n complete pairs (S1 , A1 ), . . . , (Sn , An ). {Parameters
n, a, b, c, d and v can be changed. Used (default) values of the parameters are shown in the subtitle
of the top diagram.} [n = 400, a = 0, b = 0.005, c = 60, d = 50, v = 20].
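The data-generating mechanism of Figure 1.6 can be sketched directly from its caption. The code below is an illustration only; a bell-shaped density truncated to [0, 1] again stands in for the book's Normal corner density. The point of the sketch is that the availability A depends solely on the always-observed speed S, so the ratio of alcohol is missed at random (MAR).

# Illustrative sketch of the MAR mechanism of Figure 1.6: the availability A
# depends only on the always-observed speed S.
set.seed(5)
n <- 400; a <- 0; b <- 0.005; c0 <- 60; d <- 50; v <- 20
x <- pmin(pmax(rnorm(n, 0.5, 0.2), 0), 1)   # ratio of alcohol (placeholder density)
u <- runif(n)
s <- c0 + d * x + v * (u - 0.5)             # car speed S = c + dX + v(U - 0.5)
w <- pmin(pmax(a + b * s, 0), 1)            # availability likelihood w(s)
A <- rbinom(n, 1, w)                        # A = 1 means the ratio of alcohol is recorded
sum(A)                                      # number of complete cases
plot(s, A * x, xlab = "S", ylab = "AX")     # compare with the middle diagram of Figure 1.6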
The expectation is also called the first moment, and in general the kth moment of g(X)
is defined as E{[g(X)]^k}. Assuming that the second moment of g(X) exists (is finite), the
variance of g(X) is defined as
V(g(X)) := E{[g(X) − E{g(X)}]^2} = E{[g(X)]^2} − [E{g(X)}]^2. (1.3.4)
The equality on the right side of (1.3.4) is a famous relation between the variance of g(X)
and its two first moments.
If B is a set of numbers, then we may use notation I(X ∈ B) for the indicator function
(or simply indicator) of the event X ∈ B (which reads as “X belongs to set B”), that is by
definition I(X ∈ B) = 1 if the event X ∈ B occurs and I(X ∈ B) = 0 if the event X ∈ B fails
to occur. Then the following relation between the expectation and the probability holds,
E{I(X ∈ B)} = P(X ∈ B). (1.3.5)
Several classical discrete random variables are used in the book. The Bernoulli random
variable takes on only two values that are traditionally classified as either a “success” or
a “failure.” If A is a Bernoulli random variable, then a success is traditionally written as
A = 1 and a failure is traditionally written as A = 0. We may say that the set {0, 1} of the
two numbers is the support of a Bernoulli probability mass function because it is positive
only on these two numbers. A Bernoulli random variable is completely described by the
probability of success w := P(A = 1); note that the probability of failure is 1 − w. Further,
we have E{A} = P(A = 1) = w and V(A) = w(1 − w).
If we have k mutually independent and identically distributed (iid) Bernoulli random
variables A_1, . . . , A_k, then their sum B := Σ_{l=1}^{k} A_l is called a Binomial random variable,
whose distribution is completely defined by the parameters k and w, its mean is kw and the
variance is kw(1 − w). The support of the Binomial random variable is the set {0, 1, . . . , k}
and on this set the probability mass function is p^B(x) = [k!/(x!(k − x)!)] w^x (1 − w)^{k−x}.
Another discrete random variable of interest, which takes on nonnegative integer num-
bers, is called Poisson. It is characterized by a single parameter λ which is both the mean
and the variance of this random variable. If X is Poisson with parameter λ, we can sim-
ply write Poisson(λ). The corresponding probability mass function is p^X(x) = e^{−λ} λ^x/x!,
x ∈ {0, 1, 2, . . .}.
A random variable X is called continuous if there exists a nonnegative function f^X(x)
such that ∫_{−∞}^{∞} f^X(x)dx = 1 and the cumulative distribution function F^X(x) may be written
as
F^X(x) = ∫_{−∞}^{x} f^X(y)dy. (1.3.6)
The function f^X(x) is called the probability density function or simply density. The support
of a continuous random variable X is the set of points x where the density f X (x) is positive.
In other words, X may take values only from its support. In what follows the lower and
upper bounds of the support of X are denoted as αX and βX , respectively. From the
definition of a continuous random variable X we get that P(X = x) = 0 for any number x,
and this represents the main distinction between continuous and discrete random variables.
Both the cumulative distribution function and the density give a complete description of
the corresponding continuous random variable. The mean of a function g(X) is defined as
E{g(X)} := ∫_{−∞}^{∞} g(x)f^X(x)dx = ∫_{−∞}^{∞} g(x)dF^X(x), (1.3.7)
and its variance is defined by (1.3.4). Let us also recall that a square root of the vari-
ance is called a standard deviation, and a standard deviation has the same units as the
corresponding random variable.
Let us define several specific continuous random variables. A random variable X is said
to be uniformly distributed over the interval [a, a+b] if f^X(x) = b^{−1} I(x ∈ [a, a+b]); we may
write that X is Uniform(a, a + b). Note that [a, a + b] is the support of X, E{X} = a + b/2
and V(X) = b^2/12. A random variable X is a normal random variable with mean µ and
variance σ^2 (in short, it is Normal(µ, σ^2)) if its density is
f^X(x) = (2πσ^2)^{−1/2} e^{−(x−µ)^2/(2σ^2)} =: d_{µ,σ}(x), −∞ < x < ∞. (1.3.8)
The density is unimodal, the support is a real line, its mean, median and mode coincide
and equal to µ, the variance is equal to σ 2 , the graph of the density is symmetric about
µ and bell-shaped, and the density practically vanishes (becomes very small) whenever
|x − µ| > 3σ, which is the so-called rule of three sigma. A useful property to know is that
the sum of two independent normal random variables with parameters (µ1 , σ12 ) and (µ2 , σ22 )
is again a normal random variable with parameters (µ1 + µ2 , σ12 + σ22 ). Further, a normal
random variable is called standard normal if its mean is zero and variance is equal to 1.
A random variable X is Laplace (double exponential) with parameter b if its density is
f^X(x) = (2b)^{−1} e^{−|x|/b}, −∞ < x < ∞. Its mean is zero and its variance is 2b^2.
If we would like to consider two random variables X and Y , then their joint distribution
is completely defined by the joint cumulative distribution function
F^{X,Y}(x, y) := P({X ≤ x} ∩ {Y ≤ y}) =: P(X ≤ x, Y ≤ y). (1.3.9)
Note that the right side of (1.3.9) simply introduces a convenient and traditional nota-
tion for the probability of the intersection of two events. The joint cumulative distribution
function allows us to introduce marginal cumulative distribution functions for X and Y,
F^X(x) := F^{X,Y}(x, ∞) and F^Y(y) := F^{X,Y}(∞, y). (1.3.10)
Random variables X and Y are independent if F X,Y (x, y) = F X (x)F Y (y) for all x and y,
and otherwise they are dependent. All these notions are straightforwardly extended to the
case of k variables X_1, . . . , X_k, for instance
F^{X_1,...,X_k}(x_1, . . . , x_k) := P({X_1 ≤ x_1} ∩ · · · ∩ {X_k ≤ x_k}) =: P(X_1 ≤ x_1, . . . , X_k ≤ x_k). (1.3.11)
Random variables X and Y are jointly continuous if there exists a bivariate nonnegative
function f X,Y (x, y) on a plane (−∞, ∞)×(−∞, ∞) (two-dimensional or bivariate probability
density) such that
F^{X,Y}(x, y) =: ∫_{−∞}^{x} ∫_{−∞}^{y} f^{X,Y}(u, v)dv du. (1.3.12)
Similarly to the marginal cumulative distribution functions (1.3.10), we may define marginal
densities of X and Y ,
f^X(x) := ∫_{−∞}^{∞} f^{X,Y}(x, u)du and f^Y(y) := ∫_{−∞}^{∞} f^{X,Y}(u, y)du. (1.3.13)
Random variables X and Y are independent if f X,Y (x, y) = f X (x)f Y (y) for all x and y
and otherwise they are dependent. The conditional probability density f X|Y (x|y) of X given
Y = y is defined from the relation
f^{X,Y}(x, y) =: f^Y(y) f^{X|Y}(x|y). (1.3.14)
(If f Y (y0 ) = 0, then f X|Y (x|y0 ) may be formally defined as a function in x which is
nonnegative and integrates to 1.) Given Y = y, the conditional density f^{X|Y}(x|y) becomes
a regular density and has all its properties. In particular, the conditional expectation of
g(X, Y ) given Y = y is calculated by the formula
E{g(X, Y)|Y = y} := ∫_{−∞}^{∞} g(x, y)f^{X|Y}(x|y)dx. (1.3.15)
Using (1.3.15) we can also introduce a new random variable E{g(X, Y )|Y } which is a func-
tion of Y .
Conditional expectation often helps us to calculate probabilities and expectations for
bivariate events and functions, namely for a set B ⊂ (−∞, ∞) × (−∞, ∞) we can write
P((X, Y) ∈ B) = E{I((X, Y) ∈ B)} = ∫∫_B f^Y(y)f^{X|Y}(x|y)dx dy
= ∫_{−∞}^{∞} [∫_{−∞}^{∞} I((x, y) ∈ B)f^{X|Y}(x|y)dx] f^Y(y)dy = E{E{I((X, Y) ∈ B)|Y}}, (1.3.16)
and similarly for a bivariate function g(x, y),
E{g(X, Y)} = E{E{g(X, Y)|Y}}. (1.3.17)
In some statistical problems we are dealing with a vector of random variables where
some components are continuous and others are discrete. For instance, consider a pair
(X, Z) where X is continuous and Z is discrete and takes on nonnegative integer numbers.
Then we can introduce a joint mixed density f X,Z (x, z) defined on (−∞, ∞) × {0, 1, . . .}
and such that
F^{X,Z}(x, z) =: Σ_{k=0}^{z} ∫_{−∞}^{x} f^{X,Z}(u, k)du. (1.3.18)
Furthermore, the marginal density and the marginal probability mass function are defined
as
f^X(x) := Σ_{k=0}^{∞} f^{X,Z}(x, k) and p^Z(z) := ∫_{−∞}^{∞} f^{X,Z}(x, z)dx, (1.3.20)
and the conditional density and the conditional probability mass function are
f^{X,Z}(x, z) =: p^Z(z)f^{X|Z}(x|z) and f^{X,Z}(x, z) =: f^X(x)p^{Z|X}(z|x). (1.3.21)
To finish the probability part, let us stress that much of the previous material carries
over to the case of a k-dimensional random vector (X_1, . . . , X_k).
Now let us recall basic concepts of parametric statistics that deal with a sample of n
independent and identically distributed random variables X1 , . . . , Xn . If X has the same
distribution as these variables, then we may say that Xl is the lth realization of X or that it
is generated according to the distribution of X. If the realizations (sample values) are placed
in ascending order from the smallest to the largest, then they are called ordered statistics
and denoted by X(1) , . . . , X(n) , where X(1) ≤ X(2) ≤ . . . ≤ X(n) . The main assumption
of parametric statistics is that the cumulative distribution function F_θ^X(x) of X is known
up to the parameter θ ∈ Θ where the set Θ is known. The mean of a normal distribution with
known variance is a classical example of an unknown parameter.
The main aim of parametric statistics is to estimate an unknown parameter θ of the
distribution of X based on a sample from X. The parameter may be referred to as the
estimand because this is the quantity that we would like to estimate. The estimator of
a parameter θ is a function of a sample and, given values of observations in the sample,
the value of the estimator is called the estimate (in the literature these two notions are
often used interchangeably). Different diacritics above θ such as θ̄, θ̂, θ̃, etc. are used to
denote estimators or statistics based on a sample. Similarly, if any functional or function
is estimated, then the estimators are denoted by a diacritic above the estimand, say if the
density f X (x) is the estimand then fˆX (x) denotes its estimator.
The mean squared error MSE(θ̂, θ) := E_θ{(θ̂ − θ)^2} is traditionally used to measure
the goodness of estimating θ by an estimator θ̂, and it is one of many possible risks. Here E_θ{·}
denotes the expectation according to the cumulative distribution function FθX , and we use
the subscript to stress that the underlying distribution is defined by the parameter θ. At
the same time, when skipping the subscript does not cause a confusion, it may be skipped.
One of the attractive methods of constructing a parametric estimator is the sample mean
method. It works as follows. Suppose that there exists a function g(x) such that an estimand
may be written as the expectation of g(X), namely
θ = Eθ {g(X)}, θ ∈ Θ. (1.3.22)
Then the sample mean estimator, based on a sample of size n from X, is defined as
θ̂ := n^{−1} Σ_{l=1}^{n} g(X_l). (1.3.23)
If the function g(x) is unknown but may be estimated by some statistic g̃(x), then we may
use a plug-in sample mean estimator
θ̃ := n^{−1} Σ_{l=1}^{n} g̃(X_l). (1.3.24)
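As a quick numerical illustration of the sample mean method (1.3.22)-(1.3.23), suppose the estimand is the second moment θ = E{X^2}, so that g(x) = x^2. The toy example below is not from the book; it simply compares the sample mean estimate with the known true value for a standard normal sample.

# Toy example of the sample mean estimator (1.3.23) with g(x) = x^2,
# so the estimand is theta = E{X^2} (equal to 1 for a standard normal X).
set.seed(6)
n <- 1000
x <- rnorm(n)                 # sample of size n from X
g <- function(x) x^2
theta_hat <- mean(g(x))       # sample mean estimator of theta = E{g(X)}
c(estimate = theta_hat, truth = 1)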
If g is concave, then
E{g(X)} ≤ g(E{X}). (1.3.35)
In nonparametric statistics, specifically in nonparametric curve estimation, the estimand
(recall that the estimand is what we want to estimate) is not a parameter but a function
(curve). For instance, the probability density f X (x) may be the estimand. Similarly, let
Y := m(X) + ε where X and ε are independent with known distributions. Then m(x) may
be the estimand. Our main nonparametric estimation approach is based on an orthonormal
series approximation. As an example, suppose that X is supported on [0, 1]. Introduce a
classical cosine orthonormal basis ϕ_0(x) := 1, ϕ_j(x) := 2^{1/2} cos(πjx), j = 1, 2, . . . on [0, 1].
Note that ∫_0^1 ϕ_j(x)ϕ_i(x)dx = I(j = i). Then any square integrable on [0, 1] function m(x),
that is a function satisfying ∫_0^1 [m(x)]^2 dx < ∞, can be written as a Fourier series
m(x) = Σ_{j=0}^{∞} θ_j ϕ_j(x). (1.3.36)
In (1.3.36) parameters
θ_j := ∫_0^1 m(x)ϕ_j(x)dx (1.3.37)
are called Fourier coefficients of function m(x). Note that knowing m(x) is equivalent to
knowing the infinite number of Fourier coefficients, and this fact explains the notion of
nonparametric curve estimation. The traditional risk, used for evaluating the quality of estimating
m(x), x ∈ [0, 1] by an estimator m̃(x), is the MISE (mean integrated squared error) which
is defined as
MISE(m̃, m) := E_m{∫_0^1 [m̃(x) − m(x)]^2 dx}. (1.3.38)
There are two important corollaries from (1.3.39). The former is that the integrated
squared function can be expressed via the sum of its squared Fourier coefficients, namely
the Parseval identity implies that
∫_0^1 [m(x)]^2 dx = Σ_{j=0}^{∞} θ_j^2, (1.3.40)
and note that (1.3.40) is also traditionally referred to as the Parseval identity. The latter is
that the Parseval identity implies the following relation for the MISE,
E{∫_0^1 [m̃(x) − m(x)]^2 dx} = Σ_{j=0}^{∞} E{(θ̃_j − θ_j)^2}, (1.3.41)
where θ̃_j := ∫_0^1 m̃(x)ϕ_j(x)dx are Fourier coefficients of the estimator m̃(x). Relation (1.3.41)
is the foundation of the theory of series estimation when a function is estimated via its
Fourier coefficients.
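The cosine basis and the sample mean idea combine naturally. For a density f^X supported on [0, 1] the Fourier coefficients satisfy θ_j = ∫_0^1 f^X(x)ϕ_j(x)dx = E{ϕ_j(X)}, so each coefficient can be estimated by a sample mean. The sketch below is a simplified illustration of this principle with a fixed cutoff J and a placeholder Beta density; the book's data-driven estimator, constructed in Chapter 2, additionally chooses the cutoff and shrinks the estimated coefficients, which is not shown here.

# Simplified illustration: estimate Fourier coefficients of a density on [0, 1]
# by sample means of the cosine basis, then form a truncated series estimate.
set.seed(7)
n <- 400
x <- rbeta(n, 2, 3)                               # placeholder density on [0, 1]
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
J <- 6                                            # fixed cutoff (not data-driven)
theta_hat <- sapply(0:J, function(j) mean(phi(j, x)))   # theta_j = E{phi_j(X)}
xgrid <- seq(0, 1, length.out = 201)
fhat <- rowSums(sapply(0:J, function(j) theta_hat[j + 1] * phi(j, xgrid)))
plot(xgrid, fhat, type = "l", xlab = "x", ylab = "density estimate")
curve(dbeta(x, 2, 3), add = TRUE, lty = 3)        # underlying density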
Finally, let us explain notions and terminology used in dealing with randomly miss-
ing data. Consider an example of regression when a hidden underlying sample (H-sample)
(X1 , Y1 ), . . . , (Xn , Yn ) from (X, Y ) cannot be observed, and instead a sample with missing
data (M-sample) (X1 , A1 Y1 , A1 ), . . . , (Xn , An Yn , An ) from (X, AY, A) is observed. Here X
is the predictor, Y is the response, A is the availability which is a Bernoulli random variable
such that
P(A = 1|X = x, Y = y) =: w(x, y), (1.3.42)
and the function w(x, y) is called the availability likelihood. If Al = 1, then we observe
(Xl , Yl ) and the case (realization, observation) is called complete, and otherwise if Al = 0,
then we observe (Xl , 0) (or in R language this will be written as (Xl ,NA) where the R
string 00 NA00 stands for Not Available) and hence the case (realization, observation) is called
incomplete.
Depending on the availability likelihood, three basic types of random missing mecha-
nisms are defined. (i) Missing completely at random (MCAR) when missing a variable occurs
by chance that does not depend on underlying variables. In our particular case (1.3.42),
this means that w(x, y) = w, that is the availability likelihood is constant. (ii) Missing at
random (MAR) when missing a variable occurs by chance depending only on other always
observed variables. In our example this means that w(x, y) = w(x). Note that MCAR is
a particular case of MAR. (iii) Missing not at random (MNAR) when missing a variable
occurs by chance which depends on its value and may also depend on other variables. In
our example this means that the availability likelihood w(x, y) depends on y and may also
depend on x. Let us make several comments about MNAR because the notion may be
confusing. First, MNAR simply serves as a complement to MAR, that is, if a missing mechanism is
not MAR, then it is called MNAR. Second, the term MNAR means that the probability
of missing may be defined by factors that are simply unknown. Further, MNAR in no way
means that an underlying missing mechanism is not stochastic, and in the book we are
considering only random missing mechanisms.
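To make the three mechanisms concrete, the sketch below generates an M-sample (X, AY, A) for the regression setting (1.3.42) under MCAR, MAR and MNAR. The regression function and the particular availability likelihoods are arbitrary illustrations and are not the book's defaults.

# Illustration of MCAR, MAR and MNAR missing of responses: generate an
# H-sample (X, Y) and then an M-sample (X, A*Y, A) for each mechanism.
set.seed(8)
n <- 300
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)          # arbitrary regression model
make_M <- function(w) {                            # w is the availability likelihood
  A <- rbinom(n, 1, w)
  data.frame(X = x, AY = ifelse(A == 1, y, NA), A = A)
}
mcar <- make_M(rep(0.7, n))                        # w(x, y) = 0.7 (constant)
mar  <- make_M(pmin(pmax(0.3 + 0.6 * x, 0), 1))    # w(x, y) = w(x)
mnar <- make_M(pmin(pmax(0.5 + 0.3 * y, 0), 1))    # w(x, y) depends on y
sapply(list(MCAR = mcar, MAR = mar, MNAR = mnar), function(d) mean(d$A))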
We will also use several inequalities for sums. The first one is the analog of (1.3.33) and
it is also called the Cauchy–Schwarz inequality,
|Σ_{l=1}^{n} u_l v_l|^2 ≤ (Σ_{l=1}^{n} u_l^2)(Σ_{l=1}^{n} v_l^2). (1.3.43)
is another useful inequality, and here γ^{−1} := 1/γ. The Minkowski inequality for numbers is
[Σ_{k=1}^{n} |u_k + v_k|^r]^{1/r} ≤ [Σ_{k=1}^{n} |u_k|^r]^{1/r} + [Σ_{k=1}^{n} |v_k|^r]^{1/r}, r ≥ 1. (1.3.45)
There is also a useful Minkowski inequality for the sum of two random variables,
[E{|X + Y|^r}]^{1/r} ≤ [E{|X|^r}]^{1/r} + [E{|Y|^r}]^{1/r},   r ≥ 1.   (1.3.46)
implies the following inequality for not necessarily independent random variables
X1 , . . . , Xn ,
E{|∑_{k=1}^n Xk|^r} ≤ n^{r−1} ∑_{k=1}^n E{|Xk|^r},   r ≥ 1.   (1.3.48)
If additionally the random variables X1 , . . . , Xn are independent and zero mean, that is
E{Xk } = 0 for k = 1, . . . , n, then
E{|∑_{k=1}^n Xk|^r} ≤ C(r)[∑_{k=1}^n E{|Xk|^r} + (∑_{k=1}^n E{Xk^2})^{r/2}],   r ≥ 2,   (1.3.49)
and
E{|∑_{k=1}^n Xk|^r} ≤ C(r) n^{r/2−1} ∑_{k=1}^n E{|Xk|^r},   r ≥ 2,   (1.3.50)
where C(r) denotes a finite constant depending only on r.
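The inequalities for sums are easy to check numerically. A quick R sanity check of (1.3.43) and (1.3.45) for arbitrarily chosen vectors (the vectors and the exponent r below are illustrative choices):

u <- rnorm(10); v <- runif(10); r <- 3
c(lhs = sum(u * v)^2, rhs = sum(u^2) * sum(v^2))                                    # (1.3.43)
c(lhs = sum(abs(u + v)^r)^(1/r), rhs = sum(abs(u)^r)^(1/r) + sum(abs(v)^r)^(1/r))   # (1.3.45)

In both printed pairs the left-hand side does not exceed the right-hand side.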
1.4 Software
R is a commonly used free statistical software environment. It allows you to carry out statistical analyses
in an interactive mode, as well as to write simple programs. You can find detailed
installation instructions in the R Installation and Administration manual on CRAN (www.r-
project.org, http://cran.r-project.org). After installing R (or if it was already installed), you
need to choose your working directory for the book’s software package. For instance, for
MAC you can use "/Users/Me/Book2". Download to this directory file book2.r from the
author’s web site www.utdallas.edu/∼efrom. The web site also contains relevant information
about the book. By downloading the package, the user agrees to consider it as a “black-box”
and employ it for educational purposes only.
Now you need to start R on your computer. This should bring up a new window,
which is the R console. In the R console you will see:
>
This is the R prompt. Now, only once while you are working with the book, you need to
install several standard R packages. Type the following command in the R console to install
required packages
> install.packages("MASS")
> install.packages("mvtnorm")
> install.packages("survival")
> install.packages("scatterplot3d")
These packages are installed just once; you do not need to repeat this step when you start
your next R session.
Next you need to source (install) the book’s package, and you do this with R operating
in the chosen working directory. To do this, first type
> getwd()
This R command will allow you to see a working directory in which R is currently operating.
If it is not your chosen working directory "/Users/Me/Book2", type
> setwd("/Users/Me/Book2")
and then R will operate in the chosen directory for the book. Then type
> source("book2.r")
and the book’s R software will be installed in the chosen directory and you are ready to use
it. This sourcing should be done every time you start a new R session.
For the novice, it is important to stress that no knowledge of R is needed to use the
package and repeat/modify figures. Nonetheless, a bit of information may be useful to
understand the semantics of calling a figure. What you are typing is the name of an R
function and then in its parentheses you assign values to its arguments. For instance,
> ch1(fig=1, n=300)
is the call to R function ch1 that will be run with arguments fig=1 and n=300, and all
other arguments of the function will be equal to their default values (here a=0 and b=0.7)
indicated in the caption of the corresponding Figure 1.1. Note that ch1 indicates that the
figure is from Chapter 1 while its argument fig=1 indicates that it is the first figure in
Chapter 1. In other words, ch1(fig=1,n=300) runs Figure 1.1 with the sample size n = 300.
In R, arguments of a function may be either scalars (for instance, a=5 implies that a
function will use value 5 for argument a), or vectors (for instance, vec = c(2,4) implies that
a function will use vector vec with the first element equal to 2 and the second element equal
to 4, and note that c() is a special R function called “combine” that creates vectors), or a
string (say den="normal" implies that a function will use the name normal for its argument
den). R functions are smart in terms of allowing shorter names for their arguments; for instance,
fig=5, fi=5 and f=5 will all be correctly recognized and imply the same value 5 for the
argument fig unless there is a confusion with other arguments.
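As a toy illustration of this partial matching of argument names (the function toy() below is not part of the book's package):

toy <- function(fig = 1, n = 100) c(fig = fig, n = n)
toy(fig = 5)   # full argument name
toy(f = 5)     # the abbreviation f is matched to the argument fig
toy(5)         # positional matching gives the same result

All three calls return fig = 5 together with the default n = 100.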
In general, to repeat Figure k.j, which is jth figure in Chapter k, and to use the default
values of its arguments outlined in square brackets of the caption of Figure k.j, type
> chk(f=j)
If you want to use new values for two arguments, say n=130 and a=5, type
> chk(f=j, n=130, a=5)
Note that values of other arguments will be equal to default ones. Further, recall that typing
f=j and fig=j implies the same outcome.
To finish the R session, type
> q()
Finally a short comment about useful captions and figures. Figure 2.1 exhibits four corner
(test) functions: the Uniform, the Normal, the Bimodal and the Strata. Figure 2.2 shows
first eight elements of the cosine basis. The caption to Figure 2.3 explains how to create a
custom-made corner function. Figure 2.4 shows the sequence of curves and their colors on
the monitor. The caption to Figure 2.5 explains arguments used by the E-estimator.
Finally, if the availability likelihood is unknown but may be estimated, then its estimator is
plugged in (1.5.4). The same approach is used for all other problems considered in the book,
and then the main statistical issue to explore is how to construct a sample mean estimator
for Fourier coefficients.
Now let us briefly review the content of the chapters and begin with a general remark. Each
chapter contains exercises (more difficult ones are highlighted by an asterisk) and ends with
a Notes section that contains short historical reviews, useful references for further reading, and
a discussion of relevant results.
Chapter 2 presents a brief review of basic nonparametric estimation problems with direct
data. The reader familiar with the topic, and specifically with the book Efromovich (1999a),
may skip this chapter with the exception of Sections 2.2 and 2.6. Section 2.2 defines the
E-estimator and its parameters that are used in all figures. Let us stress that the proposed
E-estimator may be improved a bit for specific problems, and Efromovich (1999a) explains
how to do that. On the other hand, the E-estimator is simple, reliable and it performs well
for all studied settings. Further, Section 2.6 describes confidence bands used in the book. A
novice to nonparametric curve estimation is advised to become familiar with Sections 2.1,
2.3 and 2.4, get used to the book’s software and do the exercises because this is a necessary
preparation for next chapters.
Chapter 3 serves as a bridge between settings with direct data and missing and modified
data. Sections 3.1, 3.2 and 3.3 are a must read because they discuss pivotal problems of
density and regression estimation based on biased data. Another section to study is Section
3.7 which explores the possibility of estimating Bernoulli regression with unavailable failures.
As we will see shortly, this problem is the key for understanding a proposed solution for a
number of applied regression problems with missing data. All other sections explore special
topics that often occur in applications.
Chapters 4 and 5 are devoted solely to nonparametric estimation with missing data.
Chapter 4 explores cases of nondestructive missing when consistent estimation of an estimand
is possible based solely on the data with missing observations. Let us stress that while all proposed procedures are
explained for the case of small sample sizes and illustrated via simulated examples, they are
supported by asymptotic theory which indicates optimality of proposed solutions among
all possible statistical estimators. Section 4.1 is devoted to the classical topic of density
estimation. Regression problems with missing responses and predictors are considered in
separate Sections 4.2 and 4.3 because proposed solutions are different. For the case of miss-
ing responses, the proposed estimator is based on complete cases and it ignores incomplete
cases. This is an interesting proposition because there is a rich literature on how to use
incomplete cases for “better” estimation. For the case of missing predictors, a special esti-
mation procedure is proposed that takes into account incomplete cases. At the same time,
it is also explained what may be done if incomplete cases are not available. All remaining
sections are devoted to special topics that expand a class of possible practical applications.
Chapter 5 is devoted to cases of destructive missing. As follows from the definition,
here we explore problems where the data with missing observations alone are not enough for consistent
estimation. Hence, the main emphasis is on the minimal additional information sufficient for
consistent estimation. As a result, the first three sections are devoted to density estimation
with different types of extra information. Here all issues related to destructive missing are
explained in great detail. Other sections are devoted to regression problems with destructive
missing.
Chapter 6 is devoted to nonparametric curve estimation for truncated and censored
data. The reader familiar with survival analysis knows that product-limit estimators, with
Kaplan–Meier estimator being the most popular one, are the main tool for solving practically
all nonparametric problems related to truncation and/or censoring. While being a
powerful and comprehensive methodology, it is complicated to understand and to make
inferences with. To overcome this complexity and to replace the product-limit method-
ology by E-estimation, we need to begin with a new (but very familiar in survival analysis)
problem of estimating the hazard rate. The hazard rate, similarly to the probability density
and cumulative distribution function, presents a complete description of a random variable.
Further, it is of great practical interest on its own even for the case of direct observations.
The reason why we begin with the hazard rate is that E-estimation is a perfect fit for this
problem and because it is straightforward to propose a reliable sample mean estimator of its Fourier
coefficients for the case of direct observations. Further, the E-estimator naturally extends
to the case of truncated and censored data. As a result of this approach, estimation of
the hazard rate is considered in the first 4 sections. Then Sections 6.5-6.9 are devoted to
estimation of distributions and regressions.
What happens if some of the truncated and/or censored observations are missing? This is the
topic of Chapter 7 where a large number of possible models are considered. The overall
conclusion is that the E-estimator again may be used to solve these complicated problems.
So far all considered problems have dealt with classical samples of independent observations.
What happens if observations are dependent? This is the topic of the science called
the theory of time series. Chapter 8 explores a number of stationary time series problems
with missing data, both rich in history and new ones. Section 8.1 contains a brief review of
the theory of classical time series. One of the classical tools for the analysis of stationary
time series is spectral analysis. It is discussed in Section 8.2. Several models of missing data
in time series are discussed in Sections 8.3 and 8.4. Time series with censored observations
are discussed in Section 8.5. More special topics are discussed in other sections.
Chapter 9 continues the discussion of time series, only here the primary interest is in
the analysis of nonstationary (changing in time) time series and missing mechanisms. Also,
Section 9.2 introduces the reader to the world of continuous-time stochastic processes.
Special topics include the Simpson paradox and sequential design.
The last Chapter 10 considers more complicated cases of modified data that slow down
the rate of the MISE convergence. This phenomenon is called ill-posedness. As we will
see, measurement errors in data typically imply ill-posedness, and this explains practical
importance of the topic. Further, some types of censoring, for instance the current status
censoring, may make the problem (data) ill-posed. In some cases data do not correctly fit
an estimand, and this also causes ill-posedness. For instance, in a regression setting we
have data that fit the problem of estimating the regression function. At the same time,
the regression data do not fit the problem of estimating the derivative of the regression
function, and the latter is another example of an ill-posed problem (data). Another example
is a sample from a continuous random variable X. This sample perfectly fits the problem
of estimating the cumulative distribution function F X (x), which can be estimated with the
parametric rate n−1 of the mean squared error (MSE) convergence. At the same time, this
rate slows down if we want to estimate the probability density f X (x). The rate slows even
more if we want to estimate the derivative of the density. As a result, we get a ladder of ill-
posed problems, and this and other related issues are considered in the chapter. The main
conclusion of Chapter 10 is that while ill-posedness is a serious issue, for small samples the
proposed E-estimators may be reasonable for relatively simple shapes of underlying curves.
On the other hand, there is no chance to visualize curves with more nuanced shapes, like
ones with closely located modes of similar magnitude. This is where the R package will
help the reader to gain practical experience in dealing with these complicated statistical
problems and understanding what can and cannot be done for ill-posed problems.
1.6 Exercises
Here and in all other chapters, the asterisk denotes a more difficult exercise.
1.1.1 What is the difference between direct and biased observations in Figure 1.1?
1.1.2∗ How are direct and biased samples in Figure 1.1 generated? Suggest a probability
model for observations generated by a continuous random variable.
1.1.3 In Figure 1.1, is the biased sample skewed to the right or to the left with respect to
the direct sample? Why?
1.1.4 Repeat Figure 1.1 with arguments that will imply a biased sample skewed to the left.
1.1.5 Repeat Figure 1.1 with sample sizes n ∈ {25, 50, 100, 200}. For what sample size does
the histogram better exhibit shape of the underlying Normal density?
1.1.6 What is a histogram? Is it a nonparametric or parametric estimator?
1.1.7 Do you think that histograms in Figure 1.1 are over-smoothed or under-smoothed?
If you could change the number of bins, would you increase or decrease the number?
1.1.8 Can the biased sample generated by Figure 1.1 be considered as a hidden sample of
direct observations with missing values?
1.1.9 Explain how the biased sample in Figure 1.1 can be generated using the availability
random variable A. Is the availability a Bernoulli random variable?
1.1.10 What is the definition of the availability likelihood function?
1.1.11∗ Is it possible to estimate the density of a hidden variable based on its biased
observations? If the answer is negative, propose a feasible solution based on additional
information.
1.1.12 Consider the data exhibited in Figure 1.2. Can the left tail of the underlying density
f X∗ (x) be estimated?
1.1.13 Define a deductible in an insurance policy. Does it increase or decrease payments
for insurable losses?
1.1.14 Can we say that a deductible causes left truncation of the payment for insurable
losses?
1.1.15∗ If X is the loss X∗ truncated by the deductible D, then what is the formula for F X (x)?
What is the expected payment? Hint: Assume that the distributions of D and X ∗ are known.
1.1.16 Suppose that you got data about payments for occurred losses from an insurance
company. Can you realize from the data that they are truncated?
1.1.17 What is the definition of a limit in an insurance policy? Does the limit increase or
decrease a payment for insurable losses?
1.1.18 Give an example of right censored data.
1.1.19∗ Suppose that (V1 , ∆1 ), . . . , (Vn , ∆n ) is a sample of censored variables. How is the
joint distribution of V and ∆ related to the joint distribution of an underlying (hidden)
variable of interest X and a censoring variable L? Find the expectation of V . Hint: Assume
that the variables are continuous and the joint density is known.
1.1.20 What is the probability that a variable X will be censored by a variable L?
1.1.21 Explain how data in Figure 1.3 are simulated.
1.1.22 In Figure 1.3 the truncated losses are skewed to the right. Why? Is it always the
case for left truncated data?
1.1.23 What is shown in the bottom diagram in Figure 1.3?
1.1.24 Look at the bottom diagram in Figure 1.3. Why are there no losses with values
greater than 0.85? Is it a typical or atypical outcome? Explain, and then repeat Figure 1.3
and check your conclusion.
1.1.25∗ The numbers N and M of available hidden losses after truncation or censoring,
shown in Figure 1.3, are only about 50% of the size n = 400 of hidden losses. Why are the
numbers so small? How can one change the simulation to increase these numbers?
1.2.1∗ Verify the assertion that m(x), defined in (1.2.1), minimizes the conditional mean
squared error E{(Y − µ(x))2 |X = x} among all possible predictors µ(x). Hint: Begin with
the proof that E{(X − E{X})2 } ≤ E{(X − c)2 } for any constant c.
1.2.2 What is the difference, if any, between parametric and nonparametric regressions?
1.2.3 What is a linear regression? How many parameters do you need to specify for a linear
regression?
1.2.4 Are nonparametric regressions, shown in Figure 1.4, justified by data?
1.2.5∗ Explain components of the time series decomposition (1.2.2). Then propose a method
for their estimation. Hint: See Chapters 8 and 9.
1.2.6 Do you think that responses in the top diagram in Figure 1.4 are dependent or
independent?
1.2.7 Let Y be the indicator of an insurable event. Suppose that Y is a Bernoulli random
variable with P(Y = 1) = w. Verify that E{Y } = P(Y = 1).
1.2.8 In Figure 1.5 the number of insurance claims is generated by a Poisson random
variable. Do you think that this is a reasonable distribution for a number of claims?
1.2.9 Explain the underlying idea behind assumption (1.2.3).
1.2.10 Compare the assumption (1.2.3) with the assumption when the left side of (1.2.3)
is equal to w(x). What is the difference between these missing mechanisms? Which one
creates more complications?
1.2.11 Explain simulations that created Figure 1.6.
1.2.12∗ What changes may be expected in Figure 1.6 if one chooses negative argument d ?
1.2.13 What can the bottom diagram in Figure 1.6 be used for?
1.2.14 Repeat Figure 1.6 with smaller sample sizes. Does this affect visualization of under-
lying regression functions?
1.3.1 Give definitions of the cumulative distribution function and the survival function.
Which one is decreasing and which one is increasing?
1.3.2 Verify the equality in (1.3.4). Do you need any assumption for its validity?
1.3.3 Prove that V(X) ≤ E{X 2 }. When do we have the equality?
1.3.4 Is E{X 2 } smaller than [E{X}]2 , or vice versa, or impossible to say?
1.3.5 Verify (1.3.5).
1.3.6 Consider two independent Bernoulli random variables with probabilities of success
being w1 and w2 . What are the mean and variance of their difference?
1.3.7∗ Give the definition of a Poisson(λ) random variable. What are its mean, second
moment, variance, and third moment?
1.3.8 What is the definition of a continuous random variable? Is it defined by the probability
density?
1.3.9 Suppose that function f (x) is a bona fide probability density. What are its properties?
1.3.10 Verify the equality in (1.3.7).
1.3.11 What is the relationship between variance and standard deviation?
1.3.12 Consider a pair (X, Y ) of continuous random variables with the joint cumulative
distribution function F X,Y (x, y). What is the marginal cumulative distribution function of
Y ? What is the probability that X = Y ? What is the probability that X < Y ?
1.3.13 It is given that F X,Y (1, 2) = 0.06, F X (1) = 0.3, F Y (1) = 0.1 and F Y (2) = 0.2. Are
the random variables X and Y independent, dependent or not enough information to make
a conclusion?
1.3.14 Suppose that f Y (y) is positive. Define the conditional probability density f X|Y (x|y)
via the joint density of (X, Y ) and the marginal density of Y .
1.3.15 Verify (1.3.16).
1.3.16 Verify (1.3.17).
1.3.17 Explain the notion of mixed probability density.
1.3.18 Assume that Z is Bernoulli(w), and given Z = z the random variable X has normal
distribution with mean z and unit variance. What is the formula for the joint cumulative
distribution function of (X, Z)?
1.3.19 What is the definition of a sample of size n from a random variable X?
1.3.20 What is the difference (if any) between an estimand and an estimator?
1.3.21 What does the abbreviation MSE stand for?
1.3.22 What is the definition of an unbiased estimator?
1.3.23 Consider a sample mean estimator θ̂ := n^{−1} ∑_{l=1}^n g(Xl) proposed for a sample from
X. It is known that this estimator is unbiased. Suppose that X is distributed according to
the density fθX (x). What is the estimand for θ̂?
1.3.24∗ Consider a sample mean estimator (1.3.23). What is its mean and variance? Then
solve the same problem for the plug-in estimator (1.3.24). Hint: Make convenient assump-
tions.
1.3.25 Present a condition under which a sample mean estimator is consistent.
1.3.26 Prove the generalized Chebyshev inequality. Hint: Begin with the case of a continuous
random variable, write
P(|X| ≥ t) = ∫_{|x|≥t} f X (x)dx = ∫_{−∞}^{∞} I(g(|x|) ≥ g(t)) f X (x)dx,
1.7 Notes
This book is a natural continuation of the book Efromovich (1999a) where the case of direct
data is considered and a number of possible series estimators, as well as other nonparametric
estimators like spline, kernel, nearest neighbor, etc., are discussed. A relatively simple,
brief and introductory level discussion of nonparametric curve estimation can be found in
Wasserman (2006). Mathematically more rigorous statistical theory of series estimation can
be found in Tsybakov (2009) and Johnstone (2017).
There is a large choice of good books devoted to missing data where the interested reader
may find the theory and practical recommendations. Let us mention Allison (2002), Little
and Rubin (2002), Longford (2005), Tsiatis (2006), Molenberghs and Kenward (2007), Tan,
Tian and Ng (2009), Enders (2010), Graham (2012), van Buuren (2012), Bouza-Herrera
(2013), Berglund and Heeringa (2014), Molenberghs et al. (2014), O’Kelly and Ratitch
(2014), Zhou, Zhou and Ding (2014), and Raghunathan (2016). These books discuss a
large number of statistical methods and methodologies of dealing with missing data with
emphasis on parametric and semiparametric models. Imputation and multiple imputation
are the hottest topics in the literature, and some procedures are proved to be optimal. On
the other hand, for nonparametric models it is theoretically established that no estimator
can outperform the E-estimation methodology. In other words, any other estimator may at
best be on par with an E-estimator.
There are many good sources to read about survival analysis. The literature is primarily
devoted to the product-limit methodology and practical aspects of survival analysis in dif-
ferent sciences. Let us mention books by Miller (1981), Cox and Oakes (1984), Kalbfleisch
and Prentice (2002), Klein and Moeschberger (2003), Gill (2006), Martinussen and Scheike
(2006), Aalen, Borgan and Gjessing (2008), Hosmer, Lemeshow and May (2008), Allison
(2010, 2014), Guo (2010), Mills (2011), Royston and Lambert (2011), van Houwelingen and
Putter (2011), Chen, Sun and Peace (2012), Crowder (2012), Kleinbaum and Klein (2012),
Liu (2012), Klein et al. (2014), Lee and Wang (2013), Li and Ma (2013), Collett (2014),
Zhou (2015), and Moore (2016).
A nice collection of results for the sum of random variables can be found in Petrov
(1975).
Finally, let us mention the classical mathematical book Tikhonov (1998) and more recent
Kabanikhin (2011) on ill-posed problems. Nonparametric statistical analysis of ill-posed
problems may be found in Meister (2009) and Groeneboom and Jongbloed (2014).
1.2 In Figure 1.4, the “Monthly Housing Starts” dataset is discussed in Efromovich
(1999a), the “Auto Insurance Claims” dataset is discussed in Efromovich (2016c), and the
“Wages” dataset is discussed in the R np package.
Chapter 2
Estimation for Directly Observed Data
This chapter presents pivotal results for the classical case of directly observed data. It
is explicitly assumed that observations are neither modified nor missing. All presented
results will serve as a foundation for more complicated cases considered in other chapters.
The chapter is intended to: (i) overview basics of orthonormal series approximation; (ii)
present a universal method of orthonormal series estimation of nonparametric curves which
is used throughout the book; and (iii) explain adaptive estimation of the probability density
and regression function for the case of complete data. Section 2.1 considers a cosine series
approximation which is used throughout the book. It also reminds the reader how to use the
book’s R package and how to repeat and modify graphics. Section 2.2 explains the problem
of nonparametric density estimation. Here the nonparametric E-estimator, which will be
used for all problems considered in the book, is introduced and explained. Section 2.3 is
devoted to the classical problem of nonparametric regression estimation. It is explained how
the E-estimator, proposed for the density model, can be used for the regression model. As
a result, even if the reader is familiar with regression problems, it is worthwhile to read this
section and understand the underlying idea of E-estimation methodology.
In many applied settings with modified and/or missing data, a special type of a non-
parametric regression, called a Bernoulli (binary) regression, plays a key role. This is why
Section 2.4 is devoted to this important topic. The E-estimator used for estimation of multivariate
functions is defined and discussed in Section 2.5. Nonparametric estimation of functions,
similarly to the classical parametric inference, may be complemented by confidence bands.
This topic is discussed in Section 2.6.
1. Uniform 2. Normal 3. Bimodal 4. Strata
Figure 2.1 The corner functions. {This set may be seen on the monitor by calling (after the R
prompt) > ch2(f=1). A corner function may be substituted by a custom-made one, see explanation
in the caption of Figure 2.3.}
2. Normal. This is a normal density with mean 0.5 and standard deviation 0.15, that is,
f2 (x) := d0.5,0.15 (x) defined in (1.3.8). The normal (bell-shaped) curve is the most widely
recognized curve. Recall the rule of three standard deviations, which states that a normal
density dµ,σ (x) practically vanishes whenever |x − µ| > 3σ. This rule helps us to understand
the curve. It also explains why we do not divide f2 by its integral over the unit interval,
because this integral is very close to 1.
3. Bimodal. This is a mixture of two normal densities, f3 (x) := 0.5d0.4,0.12 (x)
+0.5d0.7,0.08 (x). The curve has two closely located modes. As we will see throughout the
book, this is one of the more challenging corner functions.
4. Strata. This is a function supported over two separated subintervals. In the case of a
density, this corresponds to two distinct strata in the population. This is what differentiates
the Strata from the Bimodal. The curve is obtained by a mixture of two normal densi-
ties, namely, f4 (x) := 0.5d0.2,0.06 (x) + 0.5d0.7,0.08 (x). (Note how the rule of three standard
deviations was used to choose the parameters of the normal densities in the mixture.)
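For the reader who wishes to experiment outside the package, the three non-uniform corner functions can be written out directly in R from the formulas above; here d(x, mu, sigma) plays the role of d_{mu,sigma}(x), and the plotting call at the end is only a convenience.

d <- function(x, mu, sigma) dnorm(x, mean = mu, sd = sigma)        # normal density d_{mu,sigma}(x)
f2 <- function(x) d(x, 0.5, 0.15)                                  # the Normal
f3 <- function(x) 0.5 * d(x, 0.4, 0.12) + 0.5 * d(x, 0.7, 0.08)    # the Bimodal
f4 <- function(x) 0.5 * d(x, 0.2, 0.06) + 0.5 * d(x, 0.7, 0.08)    # the Strata
x <- seq(0, 1, length = 201)
matplot(x, cbind(f2(x), f3(x), f4(x)), type = "l", ylab = "density")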
Now, let us recall that a function f (x) defined on an interval (the domain) is a rule that
assigns to each point x from the domain exactly one element from the range of the function.
Three traditional methods to define a function are a table, a formula, and a graph. For
instance, we used both formulae and graphs to define the corner functions.
The fourth (unconventional) method of describing a function f (x) is via a series expan-
sion. Here and in what follows we always assume that the domain is [0, 1] and a function is
square integrable on the unit interval, that is ∫_0^1 [f(x)]^2 dx < ∞. The latter is a mild restric-
tion because in statistical applications we are primarily dealing with bounded functions.
Then
f(x) = ∑_{j=0}^∞ θj ϕj(x), x ∈ [0, 1], where θj := ∫_0^1 f(x)ϕj(x)dx.   (2.1.1)
Here the functions ϕj (x) are known, fixed, and referred to as the orthonormal functions or
elements of the orthonormal basis (or simply basis) {ϕ0 , ϕ1 , . . .}, and the θj are called the
Fourier coefficients of f (x), x ∈ [0, 1]. A system of functions is called orthonormal if the
integral ∫_0^1 ϕs(x)ϕj(x)dx = 0 for s ≠ j and ∫_0^1 (ϕj(x))^2 dx = 1 for all j.
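Orthonormality of the cosine system is easy to verify numerically; in the following R sketch the grid-based integration is only an approximation of the exact integrals, and the grid size is an arbitrary choice.

phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
x <- (1:10000 - 0.5) / 10000                   # midpoint grid on [0,1]
G <- sapply(0:4, function(j) phi(j, x))        # columns are phi_0, ..., phi_4 on the grid
round(t(G) %*% G / length(x), 3)               # approximately the 5 x 5 identity matrix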
Note that to describe a function via an infinite orthogonal series expansion (2.1.1) one
needs to know the infinite number of Fourier coefficients. No one can store or deal with an
infinite number of coefficients. Instead, a truncated (finite) orthonormal series (or so-called
partial sum)
fJ(x) := ∑_{j=0}^J θj ϕj(x)   (2.1.2)
is used to approximate f, namely the integrated squared error (ISE) ∫_0^1 [fJ(x) − f(x)]^2 dx → 0
as J → ∞. The integer parameter J is called the cutoff. Also, here and in what follows the
mathematical symbol := means “equal by definition”; in other words (2.1.2) is the definition
of the partial sum fJ(x).
The advantage of this approach is the possibility to compress the data and describe a
function using just several Fourier coefficients. In statistical applications this also leads to
the estimation of a relatively small number of Fourier coefficients. Roughly speaking, the
main statistical issue will be how to choose a cutoff J and estimate Fourier coefficients θj .
Correspondingly, the rest of this section is devoted to the issue of how a choice of J affects
visualization of series approximations and what are the known mathematical results which
shed light on the choice. This will give us a necessary understanding and experience in
choosing reasonable cutoffs.
In what follows, the cosine orthonormal basis on [0, 1]
ϕ0(x) := 1 and ϕj(x) := √2 cos(πjx) for j = 1, 2, . . .   (2.1.3)
is used. The first eight elements are shown in Figure 2.2. It is not easy to believe that such
elements may be good building blocks for approximating different functions, and this is
where our corner functions become so handy.
Series approximations (2.1.2) for J = 0, 1, 2, 3, 4, 5, 6, 7, 8 are shown in Figure 2.3. In
this and all other figures an underlying corner function is always shown by the solid line.
As a result, some other curves may be “hidden” behind a solid line whenever they coincide.
Further, it is always better to visualize figures using the R software because of colored
curves.
We do not show approximation of the Uniform function because it is the element ϕ0 (x) of
the basis, so it is perfectly approximated by any partial sum. Indeed, the Uniform is clearly
described by the single Fourier coefficient θ0 = 1, all other θj being equal to zero because
∫_0^1 ϕj(x)dx = 0 whenever j > 0 (recall that the antiderivative of cos(πjx) is (1/πj) sin(πjx);
thus ∫_0^1 √2 cos(πjx)dx = √2 (πj)^{−1} [sin(πj·1) − sin(πj·0)] = 0 for any positive integer j).
Thus, there is no surprise that the Uniform is perfectly fitted by the cosine system; this is
why we can skip the study of its approximations. At the same time, surprisingly enough, as
we shall see shortly in the next chapters, this function is a difficult one for reliable statistical
estimation. The reason for this is that if any estimated Fourier coefficient θj , j > 0 is not
zero, it is easy to realize that the corresponding estimate is wrong because it is not a flat
curve.
In Figure 2.3 the Uniform is replaced by the custom-made function, and the caption ex-
plains how any custom-made function can be created and then analyzed using the software.
The function has a pronounced shape (look at the solid line), it is aperiodic, not differen-
tiable, and its right tail is flat. Beginning with J = 3 we get a clear understanding of the
underlying shape, and J = 4 gives us a very satisfactory visualization. This corner function
also allows us to discuss the approximation of a function near the boundary points. As we
see in the left bottom diagram, the partial sums are flattened out near the edges. This is
because derivatives of any partial sum (2.1.2) are zeros at the boundary points (derivatives
of cos(πjx) are equal to −πj sin(πjx) and therefore they are zeros for x = 0 and x = 1).
In other words, the visualization of a cosine partial sum always reveals small flat plateaus
near the edges (you may notice them in all approximations). Increasing the cutoff helps
to decrease the length of the plateaus and improve the visualization. This is the boundary
effect which exists, in this or that form, for all bases. A number of methods have been
proposed on how to improve visualization of approximations near boundaries, see Chapters
2 and 3 in Efromovich (1999a); at the same time if we are aware about boundary effects and
know how to recognize them, it is better to simply ignore them. Overall, here and in what
follows we are going to use often the famous Voltaire’s aphorism “...better is the enemy of
the good...” as a guide in finding reasonable solutions.
Returning to Figure 2.3, approximation of the Normal is a great success story for the
cosine system. Even the approximation based on the cutoff J = 3, where only 4 Fourier
coefficients are used, gives us a fair visualization of the underlying function, and the cutoff
J = 5 gives us an almost perfect fit. Just think about a possible compression of the data in
a familiar table for a normal density into only several Fourier coefficients.
Now let us consider the approximations of the Bimodal and the Strata. Note that here
partial sums with small cutoffs “hide” the modes. This is especially true for the Bimodal,
whose modes are less pronounced and separated. In other words, approximations with small
cutoffs oversmooth an underlying curve. Overall, about ten Fourier coefficients are necessary
to get a fair approximation. On the other hand, even the cutoff J = 5 gives us a correct
impression about a possibility of two modes for the Bimodal and clearly indicates two
strata for the Strata. Note that you can clearly see the dynamic in approximations as J
increases. Further, if we know that an underlying function is nonnegative, then by truncating
negative values of approximations (just imagine that you replace negative values by zero) the
approximations are dramatically improved. We will always use this and other opportunities
to improve estimates.
Note that while cosine approximations are not perfect for some corner functions, under-
standing how these partial sums perform may help us to “read” messages of these approxi-
mations and guess about underlying functions. Overall, for the given set of corner functions,
the cosine system does an impressive job in both representing the functions and the data
compression.
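The reader may also reproduce such approximations by hand. The following R sketch computes cosine partial sums (2.1.2) for the Normal corner function and their ISEs; the grid-based integrals and the particular cutoffs are illustrative choices.

phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
f2 <- function(x) dnorm(x, 0.5, 0.15)                       # the Normal corner function
x <- (1:10000 - 0.5) / 10000
theta <- sapply(0:10, function(j) mean(f2(x) * phi(j, x)))  # Fourier coefficients theta_0,...,theta_10
fJ <- function(J) colSums(theta[1:(J + 1)] * t(sapply(0:J, function(j) phi(j, x))))
round(sapply(c(3, 5, 8), function(J) mean((f2(x) - fJ(J))^2)), 5)   # ISE for cutoffs J = 3, 5, 8

The ISE drops quickly as the cutoff J grows, in agreement with the visual impression from Figure 2.3.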
Figure 2.3 Approximations of the custom-made and three corner functions by cosine series with
different cutoffs J. The solid lines are the underlying functions. Short-dashed, dotted, dot-dashed
and long-dashed lines correspond to cutoffs J = 1 through J = 4 in the top diagrams and J = 5
through J = 8 in the bottom diagrams, respectively. The first function is custom-made, and the
explanation of how to construct it follows in the curly brackets. {The optional argument CFUN
allows one to substitute a corner function by a custom-made corner function. For instance, the
choice CFUN = list(3, "2*x-3*cos(x)") would imply that the third corner function (the
Bimodal) is substituted by the positive part of 2x − 3 cos(x) divided by its integral over [0,1], i.e.,
the third corner function will be (2x − 3 cos(x))+ / ∫_0^1 (2u − 3 cos(u))+ du. Any valid R formula in
x (use only the lowercase x) may be used to define a custom-made corner function. This option is
available for all figures where corner functions are used. Figure 2.4 allows one to check the sequence
of curves and their color when the R software is used, and the curves are well recognizable on a
color monitor. Try > ch2(f=4) to test colors and repeat Figure 2.4. Here and in what follows,
arguments of a corresponding function can be found at the end of the caption in square brackets.}
[CFUN = list(1, "2-2*x-sin(8*x)")]
Now let us add some theoretical results to shed light on Fourier coefficients. The beauty
of a theoretical approach is that it allows us to analyze simultaneously large classes of
functions; in particular recall that we are interested in functions f that are square integrable
on [0, 1], i.e., ∫_0^1 f^2(x)dx < ∞. For square-integrable functions the famous Parseval identity
Figure 2.4 Sequence of curves (lines) used in the book. Five horizontal lines with y-intercepts equal
to k where k is the kth curve in the sequence used in all other figures. In a majority of diagrams
we are using only the first four curves, and then, to make references simple and short, we refer
to them as solid, dashed, dotted and dot-dashed curves (note that the dotted curve may look like
short-dashed curve and the dot-dashed curve may look like short-dashed-intermediate-long-dashed
curve). If all five curves are used, then we refer to them as solid, short-dashed, dotted, dot-dashed
and long-dashed. Depending on your computer and software, the curves may look different and using
this figure helps to realize this. {To repeat this figure, call after the R prompt > ch2(f=4)}
states that
∫_0^1 f^2(x)dx = ∑_{j=0}^∞ θj^2  and  ISE(f, fJ) := ∫_0^1 (f(x) − fJ(x))^2 dx = ∑_{j>J} θj^2,   (2.1.4)
where fJ is the partial sum (2.1.2) and the ISE stands for the integrated squared error (of
the approximation fJ (x)).
Thus, the faster Fourier coefficients decrease, the smaller cutoff J is needed to get a
good approximation of f by a partial sum fJ (x) in terms of the ISE.
Let us explain the main characteristics of a function f that influence the rate at which
its Fourier coefficients decrease. Namely, we would like to understand what determines the
rate at which Fourier coefficients of an integrable function f decrease as j → ∞.
To analyze θj , let us recall the technique of integration by parts. If u(x) and v(x) are
both continuously differentiable functions, then the following equality, called integration by
parts, holds:
∫_0^1 u(x)dv(x) = [u(1)v(1) − u(0)v(0)] − ∫_0^1 v(x)du(x).   (2.1.5)
Here du(x) := u^{(1)}(x)dx is the differential of u(x), and u^{(k)}(x) denotes the kth derivative
of u(x).
Assume that f (x) is differentiable. Using integration by parts and the relations
We established the first rule (regardless of a particular f ) about the rate at which the
Fourier coefficients decrease. Namely, if f is differentiable and ∫_0^1 |f^{(1)}(x)|dx < ∞, then |θj|
decreases with rate at least j^{−1}.
Let us continue the calculation. Assume that f is twice differentiable. Then using the
method of integration by parts on the right-hand side of (2.1.7), we get
θj = −(√2/(πj)) ∫_0^1 sin(πjx) f^{(1)}(x)dx = (√2/(πj)^2) ∫_0^1 f^{(1)}(x) d cos(πjx)
= (√2/(πj)^2)[f^{(1)}(1) cos(πj) − f^{(1)}(0) cos(0)] − (√2/(πj)^2) ∫_0^1 cos(πjx) f^{(2)}(x)dx.   (2.1.9)
We conclude that if f (x) is twice differentiable, then
|θj| ≤ j^{−2} [ |f^{(1)}(1)(−1)^j − f^{(1)}(0)| + ∫_0^1 |f^{(2)}(x)|dx ],   j ≥ 1.   (2.1.10)
Thus, the Fourier coefficients θj of smooth (twice differentiable) functions decrease with
a rate not slower than j^{−2}.
So far, boundary conditions (i.e., values of f (x) near boundaries of the unit interval [0, 1])
have not affected the rate. The situation changes if f is smoother, for instance, it has three
derivatives. In this case, integration by parts can be used again. However, now the decrease
of θj may be defined by boundary conditions, namely by the term [f^{(1)}(1)(−1)^j − f^{(1)}(0)]
on the right-hand side of (2.1.9). This term is equal to zero if f^{(1)}(1) = f^{(1)}(0) = 0.
This is the boundary condition that allows θj to decrease faster than j^{−2}. Otherwise, if the
boundary condition does not hold, then θj cannot decrease faster than j^{−2} regardless of how
smooth the underlying function f is.
Now we know two main factors that define the decay of Fourier coefficients of the cosine
system and therefore the performance of an orthonormal approximation: smoothness and
boundary conditions. These are the fundamentals that we need to know about the series
approximation.
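A small numerical experiment makes these two factors visible. In the R sketch below, f.a is smooth but has nonzero boundary derivatives, while f.b has zero first derivatives at 0 and 1; both functions and the grid are illustrative choices, and the displayed coefficients are numerical approximations.

phi <- function(j, x) sqrt(2) * cos(pi * j * x)
x <- (1:100000 - 0.5) / 100000
f.a <- function(x) exp(x)              # smooth, but f'(0) and f'(1) are not zero
f.b <- function(x) (x * (1 - x))^2     # smooth with f'(0) = f'(1) = 0
th.a <- sapply(1:32, function(j) mean(f.a(x) * phi(j, x)))
th.b <- sapply(1:32, function(j) mean(f.b(x) * phi(j, x)))
round(rbind(a = th.a[c(2, 4, 8, 16, 32)], b = th.b[c(2, 4, 8, 16, 32)]), 6)

The coefficients of f.a decrease roughly like j^{−2}, while those of f.b decrease markedly faster because the boundary term in (2.1.9) vanishes.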
The topic of how the decay of Fourier coefficients depends on various properties of an
underlying function is a well-developed branch of mathematics, see the Notes. Here we
introduce two classical functional classes that we will refer to later. The Sobolev function
class (ellipsoid) Sα,Q , 0 ≤ α, 0 < Q < ∞ is
Sα,Q := { f : ∑_{j=0}^∞ (1 + (πj)^{2α}) θj^2 ≤ Q }.   (2.1.11)
For integer α the class is motivated by α-fold differentiable functions. Another important
example is the class of analytic functions,
Analytic functions have derivatives of all orders; this class is a subclass of Sobolev’s
functions and it is used to describe functions that are smoother than Sobolev’s functions. In
particular, the class nicely fits mixtures of normal densities and describes spectral densities
of ARMA processes discussed in Chapter 8.
Let us present an example which explains why it is convenient to work with the above-
introduced classes. The relation (2.1.4) explains that the partial sum fJ(x) := ∑_{j=0}^J θj ϕj(x)
always approximates a square-integrable function f(x), x ∈ [0, 1] in terms of the ISE.
Indeed, (2.1.4) shows that the ISE(f, fJ) := ∫_0^1 (f(x) − fJ(x))^2 dx = ∑_{j>J} θj^2 vanishes as
J increases. At the same time, we cannot quantify the decrease of the ISE without additional
information about the function f. The theoretical importance of the two function classes is
that they allow us to quantify the decrease of the ISE uniformly over all functions from the
classes. Namely, for any J ≥ 1 and f from the Sobolev class Sα,Q the ISE decreases at least as
a power of J, while for any function from the analytic class Ar,Q the ISE decreases exponentially in J.
Further, the function classes allow us to establish that the partial sum fJ (x) approx-
imates f (x) uniformly at all points x ∈ [0, 1]. Let us explain this assertion. Using the
Cauchy–Schwarz inequality (1.3.43) we can write for functions from the Sobolev class Sα,Q
with α > 1/2,
|f(x) − fJ(x)| = |∑_{j>J} θj ϕj(x)| ≤ 2^{1/2} ∑_{j>J} |θj|
≤ 2^{1/2} [ ∑_{j>J} j^{−2α} ∑_{j>J} j^{2α} θj^2 ]^{1/2} < [Q/(2α − 1)]^{1/2} J^{−α+1/2}.   (2.1.15)
Also, the variance V(g(X)) may be estimated by the familiar sample variance estimator
(n − 1)^{−1} ∑_{l=1}^n [g(Xl) − n^{−1} ∑_{r=1}^n g(Xr)]^2. Then, according to (2.2.2), multiply this estimator
by n^{−1} and get the sample variance estimator of the variance of n^{−1} ∑_{l=1}^n g(Xl).
Now let us return to the nonparametric estimation. For all statistical problems consid-
ered in the book (density estimation, regression, spectral density, etc.) the same three-step
approach is used for construction of a nonparametric data-driven series estimator. The
corresponding estimator will be referred to as the E-estimator.
E-estimator of a function f(x), x ∈ [0, 1]:
1. Consider a series expansion f(x) = ∑_{j=0}^∞ θj ϕj(x) explained in Section 2.1. Suggest a
sample mean estimator θ̂j of the Fourier coefficients θj := ∫_0^1 f(x)ϕj(x)dx. Then calculate a
corresponding sample variance estimator v̂jn of the variance vjn := V(θ̂j) of the sample
mean estimator.
2. The E-estimator is defined as
f̂(x) := ∑_{j=0}^{Ĵ} θ̂j I(θ̂j^2 > cTH v̂jn) ϕj(x).   (2.2.3)
Here {ϕj(x), j = 0, 1, . . .} is the cosine basis {ϕ0(x) := 1, ϕj(x) := √2 cos(πjx), j =
1, 2, . . .}. Recall that J is called the cutoff and θj is called the jth Fourier coefficient of
f X (x) corresponding to the jth element ϕj (x) of the used basis.
Step 1 of the E-estimation methodology is to estimate θj using a sample mean estimator.
To find such an estimator, we express the Fourier coefficient as an expectation. Write
θj := ∫_0^1 ϕj(x) f X(x)dx = E{ϕj(X)}.   (2.2.6)
We can conclude that the parameter of interest θj is the expectation (population mean) of
the random variable ϕj (X). This yields the sample mean estimator
θ̂j := n^{−1} ∑_{l=1}^n ϕj(Xl).   (2.2.7)
Note that θ0 = 1 and there is no need to estimate it, but regardless θ̂0 = 1. This is the
recommended, according to Step 1, method of moments estimator of Fourier coefficients.
(Do you see that θ0 = 1 for any underlying density?) Let us also stress that all known
series density estimators, proposed in the literature, use the recommended Fourier estimator
(2.2.7). As soon as the Fourier estimator is chosen, its variance can be estimated. According
to (2.2.7), we can use the classical sample variance estimator and set
v̂jn := n^{−1} [ (n − 1)^{−1} ∑_{l=1}^n [ϕj(Xl) − θ̂j]^2 ].   (2.2.8)
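Before assembling the full estimator, it may help to see (2.2.7) and (2.2.8) written directly in R; the simulated sample from the Normal corner density below is purely illustrative.

phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
set.seed(2)
X <- rnorm(300, 0.5, 0.15)                       # a sample of size n = 300, Normal corner density
thetahat <- function(j) mean(phi(j, X))          # the sample mean estimator (2.2.7)
vhat <- function(j) var(phi(j, X)) / length(X)   # the sample variance estimator (2.2.8)
round(sapply(1:6, function(j) c(theta.hat = thetahat(j), v.hat = vhat(j))), 4)

Note that var() already uses the divisor n − 1, so dividing by n reproduces (2.2.8) exactly.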
and calculate its MISE. The Parseval identity and unbiasedness of θ̂j allow us to write,
MISE(f̃JX , f X) = ∑_{j=0}^J E{(θ̂j − θj)^2} + ∑_{j>J} θj^2 = ∑_{j=0}^J vjn + ∑_{j>J} θj^2.   (2.2.10)
Consider the two sums on the right-hand side of (2.2.10). The first sum is the integrated
variance of f˜JX (x) which is equal to the sum of J + 1 variances V(θ̂j ). The second sum in
(2.2.10) is the integrated squared bias
ISBJ(f X) := ∑_{j>J} θj^2.   (2.2.11)
It is impossible to estimate the integrated squared bias directly because it contains infinitely
many terms. Instead, let us note that the Parseval identity allows us to rewrite this infinite
sum via a finite sum,
ISBJ(f) = ∫_0^1 [f X(x)]^2 dx − ∑_{j=0}^J θj^2.   (2.2.12)
The term ∫_0^1 [f X(x)]^2 dx is a constant. Thus, the problem of finding a cutoff J that minimizes
(2.2.10) is equivalent to finding a cutoff that minimizes ∑_{j=0}^J (vjn − θj^2). Hence, we can
rewrite (2.2.10) as
MISE(f̃JX , f X) = ∑_{j=0}^J [vjn − θj^2] + ∫_0^1 [f X(x)]^2 dx,   (2.2.13)
and we need to choose J which minimizes the sum in the right-hand side of (2.2.13). Esti-
mation of θj2 is based on the relation which is valid for any θ̌j (recall (1.3.4)),
The Fourier estimator θ̂j is unbiased, and this together with (2.2.14) yield
This implies that we can propose the unbiased estimator θ̂j2 − v̂jn of θj2 .
Using this result in (2.2.13) we conclude that the search for the cutoff which minimizes
the MISE is converted into finding a cutoff J which minimizes ∑_{j=0}^J [2v̂jn − θ̂j^2]. This is
exactly the sum on the right side of (2.2.4). Further, it was explained in Section 2.1 that
we do not need to use very large cutoffs J to get a good approximation of a function by its
partial sum, and this leads us to the search of the cutoffs over the set defined in (2.2.4).
Now let us explain why the indicator function is used in (2.2.3). For practically all
functions some Fourier coefficients are very small or even zero. For instance, the Uniform
corner function has all Fourier coefficients, apart from θ0 = 1, equal to zero. Due to symmetry,
the Normal has all even Fourier coefficients equal to zero. As a result, the indicator performs
the so-called thresholding which allows us to remove small θ̂j from the estimator. The default
value of the coefficient of thresholding cT H = 4, and the R software allows us to change it,
as well as parameters cJ0 and cJ1 , in each figure.
Finally, step 3 is the projection on a class of bona fide functions. For the density this is
the class of nonnegative functions that are integrated to 1. The projection is
f̄ X(x) := max(0, f̂ X(x) − u), where the constant u is such that ∫_0^1 f̄ X(x)dx = 1.   (2.2.16)
This procedure cuts off values of fˆX (x) that are smaller than u, and this may create
unpleasant bumps in f¯X (x). There is a procedure, described in Section 3.1 of Efromovich
(1999a), which removes the bumps, and it is used by the R software.
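Putting the three steps together, here is a minimal R sketch of the density E-estimator. It follows (2.2.7), (2.2.8), the thresholding in (2.2.3) and the projection (2.2.16); the simple cutoff search over 0, . . . , Jmax and the use of uniroot() for the constant u are illustrative simplifications and do not reproduce the exact set (2.2.4) or the bump-removing procedure used by the book's software.

phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
e.estimator <- function(X, xgrid = seq(0, 1, length = 201), Jmax = 10, cTH = 4) {
  n <- length(X)
  theta <- sapply(0:Jmax, function(j) mean(phi(j, X)))        # Fourier estimates (2.2.7)
  v <- sapply(0:Jmax, function(j) var(phi(j, X)) / n)         # their variance estimates (2.2.8)
  J <- which.min(cumsum(2 * v - theta^2)) - 1                 # cutoff minimizing the sum of [2v - theta^2]
  keep <- as.numeric(theta^2 > cTH * v); keep[1] <- 1         # thresholding; theta_0 = 1 is always kept
  fhat <- sapply(xgrid, function(x)
    sum(theta[1:(J + 1)] * keep[1:(J + 1)] * sapply(0:J, phi, x = x)))
  u <- uniroot(function(u) mean(pmax(0, fhat - u)) - 1, c(-10, 10))$root  # projection (2.2.16)
  pmax(0, fhat - u)
}
X <- rnorm(300, 0.5, 0.15)        # an illustrative sample from the Normal corner density
plot(seq(0, 1, length = 201), e.estimator(X), type = "l", xlab = "x", ylab = "density estimate")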
Let us make one more remark about the following property of the variance vjn,
vjn = n^{−1}[1 + oj(1)].   (2.2.17)
Here and in what follows os(1) is the little-o notation for generic sequences which tend to
zero as s → ∞. To prove (2.2.17), we use the trigonometric equality 2cos^2(πjx) = 1 + cos(2πjx),
which implies ϕj^2(x) = 1 + 2^{−1/2}ϕ2j(x) and hence E{ϕj^2(X)} = 1 + 2^{−1/2}θ2j.
As we know from the previous section, Fourier coefficients θj decay as j increases and this,
together with (2.2.2), proves (2.2.17). The conclusion is that while we do not know values
of individual vjn, we have n lim_{j→∞} vjn = 1; actually for many densities only the first several
nvjn may be far from 1.
This finishes our explanation of the E-estimation methodology and how to construct
a density E-estimator. Several questions immediately arise. First, how does E-estimator
perform for small sample sizes? Second, is it possible to suggest a better estimator? We are
considering these questions in turn.
To evaluate performance of the E-estimator for small samples, we use Monte Carlo
simulations where samples are generated according to the corner densities shown in Figure
2.1. E-estimates for sample sizes 100, 200, and 300 are shown in Figure 2.5 whose caption
explains the diagrams. This figure exhibits results of 4 times 3 (that is, 12) independent
Monte Carlo simulations. Note that the estimates are based on simulations, hence another
simulation will yield different estimates. A particular outcome, shown in Figure 2.5 (as well
as in all other figures), is chosen primarily with the objective of the discussion of a variety
of possible E-estimates.
We begin with the estimates for the Uniform density shown in Diagram 1. As we see,
while for n = 100 and n = 200 the estimates are perfect, the estimate for n = 300 is
bad. On one hand, this outcome is counterintuitive; on the other, it is a great teachable
moment with two issues to discuss. The former is that for a particular simulation a larger
sample size may lead to a worse estimate. To understand this phenomenon, let us consider
a simple example. Suppose an urn contains 5 chips and we know that 3 of them have one
color (the “main” color) and that the other two chips have another color. We know that
the colors are red and blue but do not know which one is the main color. We draw a chip
from the urn and then want to make a decision about the main color. The natural bet
is that the color of the drawn chip is the main color (after all, the chances are 3/5 that the
answer is correct). Now let us draw two more chips. Clearly, a decision based on three drawn
chips should only be better. However, there is a possible practical caveat. Assume that the
main color is blue and the first chip drawn is also blue. Then the conclusion is correct. On
the other hand, if the two next chips are red (and this happens with probability 1/6), the
conclusion will be wrong despite the increased “sample size.” Of course, if we repeat this
experiment many times, then on average our bet will prevail, but in a particular experiment
the proposed “reasonable” solution may imply a wrong answer. The latter issue is that we
can realize what is the underlying series estimate for n = 300. Please return to Figure 2.2,
where the basis functions are exhibited, and think about a formula for the series estimate
shown by the dot-dashed line. This estimate is fˆ(x) = 1 + θ̂2 ϕ2 (x) with θ̂2 ≈ 0.14. We
know that θ2 = 0 for the Uniform density, so why was θ̂2 included? The answer is because
θ̂2^2 ≈ 0.02 > cTH v̂2n ≈ 4/300 ≈ 0.013. Hence, we would need to use cTH ≈ 6 to threshold
this large Fourier estimate, and while this may be good for the Uniform (actually, any
larger threshold coefficient cTH will benefit estimation of the Uniform), that can damage
1. Uniform 2. Normal 3. Bimodal 4. Strata
0, 0, 0.02 0.014, 0.0051, 0.031 0.17, 0.024, 0.0079 0.26, 0.14, 0.075
Figure 2.5 Probability density E-estimates for sample sizes 100, 200 and 300 and 4 underlying
corner densities. The corresponding ISEs (integrated squared errors) of the estimates are shown
in the title. In a diagram the solid line is an underlying density while dashed, dotted and dot-
dashed lines correspond to E-estimates based on samples of sizes n = 100, n = 200, and n = 300,
respectively. (Note that the sequence of lines is the same as in Figure 2.4 which may be used to
check curves on your computer.) Note that a solid line may “hide” other lines, and this implies
a perfect estimation. For instance, in the first diagram the Uniform density (the solid line) hides
the dashed and dotted lines and makes them invisible. {Recall that this figure may be repeated (with
other simulated datasets) by calling after the R-prompt > the R-function > ch2(f=5). Also, see
the caption of Figure 2.3 about a custom-made density. All the arguments, shown below in square
brackets, may be changed. Let us review these arguments. The argument set.n allows one to choose
three (or fewer) sample sizes. The arguments cJ0, cJ1, and cTH control the parameters c_{J0}, c_{J1},
and c_{TH} used by the E-estimator. Note that R does not recognize subscripts, so we use cJ0 instead
of c_{J0}, etc. Also recall that below in the square brackets the default values for these arguments are
given. Thus, after the call > ch2(f=5) the estimates will be calculated with these values of the
coefficients. If one would like to change them, for instance to use a different threshold level, say
c_{TH} = 3, make the call > ch2(f=5, cTH=3).} [set.n = c(100,200,300), cJ0 = 3, cJ1 = 0.8, cTH
= 4]
estimation of other densities. Another comment is as follows. The peculiarity of the Uniform
is that any deviation of an estimate from the Uniform looks like a “tragic” mistake despite
the fact that the deviation may be very small in terms of the ISE. It is worthwhile to compare
the ISE = 0.02 for the case n = 300 with ISEs for other corner functions to appreciate this
comment.
In Diagram 2 for the underlying Normal density, the estimates nicely exhibit the
symmetric and unimodal shape of the Normal. Curiously, here again the worst estimate is for
the largest sample size n = 300. Now, let us compare the visual appeal of the estimates with
the corresponding ISEs and ask ourselves the following question. Does the ISE reflect the
quality of nonparametric estimation? This is an important question because the expected
ISE, which we call the MISE, is our main criterion in finding a good estimator. Overall, if you
1. Uniform 2. Normal 3. Bimodal 4. Strata
Figure 2.6 Performance of density E-estimator for simulated samples of size n = 100. A sample is
shown by the histogram overlaid by the underlying density (solid line) and its E-estimate (dashed
line). [n = 100, cJ0 = 3, cJ1 = 0.8, cTH = 4]
repeat Figure 2.5 several times, it becomes clear that the MISE is a reasonable criterion.
Returning to Diagram 2, note that the dashed curve has funny tails. It is impossible to
explain them without knowing the underlying sample, and this is what our next Figure 2.6
will allow us to do.
Estimates for the Bimodal, exhibited in Diagram 3, are of a special interest. For the
case n = 100 the E-estimate (the dashed line) oversmooths the Bimodal and shows a single
mode. This is a typical outcome and it indicates that this sample size is too small to
reveal the closely located modes of the Bimodal. The larger sample sizes have allowed
the E-estimator to exhibit the bimodal shape of the density. Also note how ISEs reflect
the quality of estimation. The E-estimates for the Strata exhibit two pronounced strata.
This clearly shows the flexibility of the proposed E-estimator because in the Strata we have
two high spikes, a pronounced flat valley, and rapidly vanishing tails. The E-estimator is
clearly capable of showing us these characteristics of the density. Please keep in mind that the
E-estimator does not know the underlying density; it has only the data, and the densities change
from the flat Uniform to the rough Strata. Also, note how well the projection on the class
of bona fide densities performs here (compare with the ideal Fourier approximations of the
Strata in Figure 2.2).
Figure 2.6 is another tool to understand how the E-estimator performs. Here in each
diagram we can see the E-estimate which overlays the frequency histogram of an underlying
sample from the density (the solid line) indicated in the title. A frequency histogram (or
simply histogram) is a special nonparametric density estimator that uses vertical columns
above bins to show frequencies of observations falling within bins. The histogram is a popular
statistical method to visualize data; more about the histogram can be found in Section 8.1
of Efromovich (1999a).
For the Uniform we observe the monotonic E-estimate (the dashed line). This is a bad
estimate of the flat Uniform density, and also look at the huge deviation of the estimate from
the Uniform. But is this the fault of the proposed estimation procedure? The histogram helps
us to answer the question because it shows us the data. The answer is “no.” Recall that
the density estimate describes the underlying sample, and the sample is heavily skewed
to the left. Hence the E-estimate is consistent with the data. Two comments are due.
First, the studied nonparametric estimation may be viewed as a correct smoothing
of the histogram; this explains why statisticians working on topics of nonparametric curve
estimation are often called the “smoothers.” The second comment is that, similarly to our
analysis of Diagram 1 in Figure 2.5, we can figure out the functional form of the shown
series estimate, and here it is 1 + θ̂_1 ϕ_1(x). Repeated simulations show that such a skewed
sample is a rare occurrence, but it is a real possibility to be ready for.
An interesting simulation is shown in the Normal diagram. The symmetric bell-shaped
density created a sample with curious and asymmetric tails. This is what we also see in the
otherwise very good E-estimate.
For the Bimodal density the E-estimate shows two pronounced modes and this fact
should be appreciated. On the other hand, the magnitudes of modes are shown incorrectly.
Is this the fault of the E-estimator? To answer the question, try to smooth the histogram
using your imagination (and forget about the underlying density shown by the solid line);
most likely you will end up with a curve resembling the E-estimate. Indeed, there is nothing
in the data that may tell us about the larger left mode or the smaller right mode. Further, let
us look at the left tail of the estimate. Here we see how the projection on the bona fide
class performs. Indeed, recall that the E-estimator fˆX (x), defined in (2.2.3), is an extremely
smooth function in x. Hence, the left tail shows us that the underlying E-estimate fˆX (x)
takes on negative values to reflect the smallest values in the sample that are separated from
the main part of the sample, and then the negative values are cut off.
Finally, consider the Strata in Figure 2.6. The E-estimate is good but not as sharp as
one would like it to be. Here again it is worthwhile to put yourself in the shoes of the
E-estimator. Look at this particular dataset, compare it with the underlying density (the
solid line), and then try to answer the following question: How can the E-estimator perform
better for this sample? One possibility here is to consider a larger cutoff (by increasing
parameters c_{J0} and c_{J1}) and decreasing the thresholding coefficient c_{TH}. This and other
figures allow one to make these changes and then test performance of the E-estimator via
repeated simulations. Of course, a change that benefits one density may hurt another one,
and this is when simultaneous analysis of different underlying densities becomes beneficial.
Two practical conclusions may be drawn from the analysis of these particular simu-
lations. First, the visualization of a particular estimate is useful and sheds light on the
estimator. Second, a conclusion may not be robust if it is based only on the analysis of
just several simulations. The reason is that for any estimator one can find a dataset where
an estimate is perfect or, conversely, very bad; this is why it is important to complement
your reading of the book by repeating figures using the R software and analyzing the re-
sults. Every experiment (every figure) should be repeated many times; the rule of thumb,
based on the author’s experience, is that the results of at least 20 simulations should be
analyzed before making a conclusion. Also, try to understand the parameters of each figure
and change them to understand their role.
Remark 2.2.2. In the case where the density is estimated over a given support [a, b] (or data
are given only for this interval), to convert the problem to the case of the [0, 1] support one
first should rescale the data and compute Yl := (Xl −a)/(b−a). The rescaled observations are
distributed according to a density f Y (y), which is then estimated over the unit interval [0, 1].
Let f˜Y (y), y ∈ [0, 1] be the obtained estimate of the density f Y (y). Then the corresponding
estimate of f X (x) over the interval [a, b] is defined by f˜X (x) := (b−a)−1 f˜Y ((x−a)/(b−a)),
x ∈ [a, b].
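To illustrate Remark 2.2.2, the following R lines sketch the rescaling and the back-transformation; the fixed-cutoff cosine projection used here is only a stand-in for the E-estimator, and all names (a, b, theta, and so on) are illustrative rather than part of the book's software.
# Sketch for Remark 2.2.2: density estimation over a known support [a, b].
set.seed(1)
a <- 2; b <- 5
X <- runif(300, a, b)                              # data supported on [a, b]
Y <- (X - a) / (b - a)                             # rescale to [0, 1]
J <- 5                                             # fixed cutoff (illustrative, not data-driven)
phi <- function(j, y) if (j == 0) rep(1, length(y)) else sqrt(2) * cos(pi * j * y)
theta <- sapply(0:J, function(j) mean(phi(j, Y)))  # sample mean Fourier estimates for f^Y
x <- seq(a, b, length.out = 101)
y <- (x - a) / (b - a)
fY.hat <- sapply(y, function(u) sum(theta * sapply(0:J, function(j) phi(j, u))))
fX.hat <- fY.hat / (b - a)                         # back-transform: f~X(x) = (b-a)^{-1} f~Y((x-a)/(b-a))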
Remark 2.2.3. Consider the setting where one would like to estimate an underlying density
over its finite support [a, b], which is unknown. In other words, both the density and its
support are of interest. A simple and reliable solution is to set ã := X(1) , b̃ := X(n)
and then use the method proposed in the previous remark. Here X(1) ≤ X(2) ≤ · · · ≤ X(n)
are the ordered observations. Let us also explain a more sophisticated solution. It is known
that P(a < X < X(1) ) = P(X(n) < X < b) = 1/(n + 1). Thus, a = X(1) − d1 and
b = X(n) + d2 , where both d1 > 0 and d2 > 0 should be estimated. Let us use the
following approach. If an underlying density is flat near the boundaries of its support, then
for a sufficiently small positive integer s we have (X(1+s) − X(1) )/s ≈ X(1) − a = d1 , and
similarly (X(n) − X(n−s) )/s ≈ b − X(n) = d2 . The default value of s is 1. Thus, we may set
dˆ1 := (X(1+s) − X(1) )/s and dˆ2 := (X(n) − X(n−s) )/s. More precise estimation of d1 and d2
requires estimation of both the density and its derivatives near X(1) and X(n) , and this is
a complicated problem for the case of small sample sizes.
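The following R sketch implements both solutions of Remark 2.2.3; the variable names are illustrative, and the Beta sample is used only to have data with a finite support.
# Sketch for Remark 2.2.3: estimating an unknown finite support [a, b].
set.seed(2)
X <- rbeta(100, 2, 2)                         # data with unknown support (here truly [0, 1])
Xs <- sort(X)                                 # ordered observations X_(1) <= ... <= X_(n)
n <- length(Xs)
a.simple <- Xs[1]; b.simple <- Xs[n]          # simplest solution: the sample extremes
s <- 1                                        # default spacing parameter
d1 <- (Xs[1 + s] - Xs[1]) / s                 # estimate of d1 = X_(1) - a
d2 <- (Xs[n] - Xs[n - s]) / s                 # estimate of d2 = b - X_(n)
a.hat <- Xs[1] - d1; b.hat <- Xs[n] + d2      # corrected support estimates
c(a.hat, b.hat)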
Remark 2.2.4. We shall see that for different settings, including regression, missing and
modified data, the parameter d := limj,n→∞ nE{(θ̂j − θj )2 }, where θ̂j is an appropriate
sample mean estimator of θj , defines the factor in changing a sample size that makes es-
timation of an underlying curve comparable, in terms of the same precision of estimation,
with the density estimation model over the known support [0, 1] when d = 1. In other words,
the problem of density estimation may be considered as a benchmark for analyzing other
models. Thus we shall refer to the coefficient d as the coefficient of difficulty. We shall see
that it is a valuable tool, which allows us to appreciate the complexity of a problem based
on our experience with the density estimation.
We are finishing this section with the asymptotic analysis of the MISE of the recommended projection estimator f̃_J^X(x) = Σ_{j=0}^{J} θ̂_j ϕ_j(x), and then comment on its consistency.
The MISE is calculated in (2.2.10) and we would like to evaluate it for the case of Sobolev
and analytic function classes defined in (2.1.11) and (2.1.12), respectively. We begin with
the Sobolev’s densities. Using (2.1.11) and (2.2.17) we can write that
MISE(f˜X , f X ) ≤ c∗ [Jn−1 + J −2α ], f X ∈ Sα,Q .
J (2.2.20)
Here c is a finite positive constant which depends on (α, Q). Then the cutoff J ∗ , which
∗
minimizes the right-hand side of (2.2.20), is proportional to n1/(2α+1) . Further, the optimal
cutoff yields the classical result for Sobolev densities: the MISE vanishes with the rate
n−2α/(2α+1) . This rate is also optimal meaning that no other estimator can improve it
for the Sobolev densities. This result is intuitively clear because a single parameter can
be estimated with the variance not smaller in the order than n−1 , and this together with
(2.1.11) gives us a lower bound for the MISE which is, up to a constant, the same as the
right-hand side of (2.2.20). We conclude that the proposed E-estimation methodology is
rate optimal for Sobolev’s functions. Further, it is explained in Efromovich (1999a) that the
used thresholding decreases the MISE of the E-estimator.
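To shed additional light on the choice of the cutoff, here is a short calculation sketch (J is treated as a continuous variable) that minimizes the right-hand side of (2.2.20). Setting the derivative to zero,
d/dJ [Jn^{-1} + J^{-2α}] = n^{-1} − 2α J^{-2α−1} = 0,
gives J* = (2αn)^{1/(2α+1)}, which is proportional to n^{1/(2α+1)}, and substituting J* back yields
J* n^{-1} + (J*)^{-2α} = c(α) n^{-2α/(2α+1)}
for a constant c(α), which is the rate n^{-2α/(2α+1)} discussed above.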
For analytic densities the projection estimator is not only rate but also sharp optimal
meaning that its MISE converges with the optimal rate and constant (see more in Efro-
movich 1999a). Let us shed light on this assertion. Using (2.1.12) and (2.2.17) we can write
for any Jn → ∞ as n → ∞,
MISE(f̃_{J_n}^X, f^X) = Σ_{j=0}^{J_n} v_{jn} + Σ_{j>J_n} θ_j²
= n^{-1} J_n (1 + o_n(1)) + Σ_{j>J_n} θ_j²
≤ [n^{-1} J_n + (Q²/2r) e^{-2rJ_n}] + o_n(1) J_n n^{-1},  f^X ∈ A_{r,Q}.  (2.2.21)
The cutoff which minimizes the right-hand side of (2.2.21) is J_n* = (1/2r) ln(n)(1 + o_n(1)).
Note that only a logarithmic in n number of Fourier coefficients is necessary to estimate
the density, and the latter is the theoretical justification for the set of cutoffs used by the
procedure (2.2.4). Using this optimal cutoff we can continue (2.2.21) and get
MISE(f̃_{J_n*}^X, f^X) ≤ (2rn)^{-1} ln(n)(1 + o_n(1)),  f^X ∈ A_{r,Q}.  (2.2.22)
Further, this upper bound is sharp, that is there is no other estimator for which the right-
hand side of (2.2.22) may be smaller for the analytic functions.
Now let us comment on consistency of the projection estimator (2.2.9) for functions
from a Sobolev class Sα,Q , α > 1/2. (Recall that for the density θ̂0 = θ0 = 1, but, whenever
possible, we are presenting general formulas that may be used for the analysis of any problem
like regression or spectral density estimation.) Write,
|f̃_J^X(x) − f^X(x)| = |Σ_{j=0}^{J} (θ̂_j − θ_j)ϕ_j(x) + Σ_{j>J} θ_j ϕ_j(x)|
≤ 2^{1/2} Σ_{j=0}^{J} |θ̂_j − θ_j| + |Σ_{j>J} θ_j ϕ_j(x)|.  (2.2.23)
Using (2.1.15), (2.2.19) and E{|θ̂j − θj |} ≤ [E{(θ̂j − θj )2 }]1/2 we conclude that for some
constant cα,Q < ∞
Inequality (2.2.24), together with the Markov inequality (1.3.26), yields that for any t > 0
This inequality allows us to establish consistency of the projection estimator for a wide class
of increasing to infinity cutoffs J = Jn → ∞ as n → ∞. A similar calculation can be made
for analytic densities.
The presented asymptotic results give a theoretical justification of the E-estimation
methodology. Another useful conclusion of the asymptotic theory is that the MISE con-
verges slower than the traditional rate n−1 known for parametric problems like estimation
of the mean or the variance of a random variable. This is what makes nonparametric prob-
lems so special and more challenging than classical parametric problems.
The last assertion follows from the fact that for any random variable Z with finite second
moment and any constant c we have
E{(Z − c)²} = V(Z) + (E{Z} − c)² ≥ V(Z).  (2.3.3)
It is easy to see why (2.3.5) holds for the model (2.3.4), and here we are proving this formula
for the general model (2.3.2). Write,
E{Y ϕ_j(X)/f^X(X)} = E{ϕ_j(X)[f^X(X)]^{-1} E{Y|X}}
= ∫_0^1 ϕ_j(x)[f^X(x)]^{-1} m(x) f^X(x) dx = θ_j.  (2.3.6)
Here in the first equality we used the standard formula E{g(X, Y )} = E{E{g(X, Y )|X}} of
writing the expectation of a bivariate function g(x, y) via the expectation of the conditional
expectation.
If the design density f X (x) is known, then the sample mean Fourier estimator of θj is
θ̃_j := n^{-1} Σ_{l=1}^{n} Y_l ϕ_j(X_l)/f^X(X_l).  (2.3.7)
If the design density is unknown then its E-estimator fˆ(x) of Section 2.2, based on the
sample X1 , . . . , Xn , may be used. This yields a plug-in sample mean Fourier estimator
θ̂_j := n^{-1} Σ_{l=1}^{n} Y_l ϕ_j(X_l)/max(f̂^X(X_l), c/ln(n)).  (2.3.8)
In (2.3.8) fˆX (x) is truncated from below because it is used in the denominator, c is the
additional parameter of the regression E-estimator with the default value c = 1, and all
corresponding figures allow the user to choose any c > 0.
The variance of θ̂j can be estimated by the sample variance based on statistics
{Yl ϕj (Xl )/ max(fˆX (Xl ), c/ ln(n)), l = 1, . . . , n}. This yields the regression E-estimator
m̂(x). Further, if some bona fide restrictions are known (for instance, it is known that the
corner functions are nonnegative), then a projection on the bona fide functions is performed.
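The next R lines sketch how the plug-in Fourier estimator (2.3.8) may be computed; the fixed-cutoff projection estimate of the design density is only a stand-in for the density E-estimator of Section 2.2, and the regression function, the design density and all names are illustrative assumptions.
# Sketch of the plug-in Fourier estimator (2.3.8) for nonparametric regression.
set.seed(3)
n <- 100; c0 <- 1                                  # c0 plays the role of c in (2.3.8)
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
X <- rbeta(n, 2, 1)                                # design density f^X(x) = 2x on [0, 1]
m <- function(x) 1 + cos(pi * x)                   # illustrative regression function
Y <- m(X) + 0.3 * rnorm(n)                         # homoscedastic responses
Jd <- 3                                            # fixed cutoff for the design density (stand-in)
kappa <- sapply(0:Jd, function(j) mean(phi(j, X))) # Fourier estimates of the design density
fX.at.X <- sapply(X, function(u) sum(kappa * sapply(0:Jd, function(j) phi(j, u))))
den <- pmax(fX.at.X, c0 / log(n))                  # truncation from below as in (2.3.8)
theta.hat <- sapply(0:5, function(j) mean(Y * phi(j, X) / den))
theta.hat                                          # estimated Fourier coefficients of m(x)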
Remark 2.3.1. Let θk be the kth Fourier coefficient of the regression function m(x), and
k ≠ j. Then it is possible to show that
This observation points to a more accurate procedure of estimation which, theoretically,
yields asymptotically efficient estimation. The simplest and most valuable step in imple-
mentation of the idea is to calculate θ̂0 and then subtract it from Yl , l = 1, 2, . . . , n and
use the differences for estimating Fourier coefficients θj , j ≥ 1. We are utilizing this step
in the regression E-estimator. There is also another idea of estimation of the Fourier coefficients based on using the formula θ_j = ∫_0^1 m(x)ϕ_j(x)dx. Namely, it is possible to replace an
unknown m(Xl ) by its observation Yl and then use responses in a numerical integration.
This approach is convenient for the case of fixed-design predictors. More about these and
other methods can be found in Chapter 4 of Efromovich (1999a).
Figure 2.7 allows us to look at regression data and check how the regression E-estimator
performs for small sample sizes. A plot of the pairs (Xl , Yl ) in the xy-plane (so-called
scattergram or scatter plot) is a useful tool to get a first impression about a dataset at
hand. Consider Monte Carlo simulations of observations according to (2.3.4) with n = 100;
underlying experiments are explained in the caption. Four scattergrams are shown by circles.
For now let us ignore the lines and concentrate on the data. An appealing feature of the
regression problem is that one can easily appreciate its difficulty. To do this, try to use your
imagination and draw a line m(x) through the middle of the cloud of circles in a scattergram
that, according to your understanding of the regression problem, gives a good fit according to
model (2.3.4). Or even simpler, because in the diagrams the underlying regression functions
are shown by the solid line, try to recognize them in the cloud of circles. If you are not
successful in this imagination and are confused, do not be upset because these particular
scattergrams are difficult to read due to a large scale function (just look at the range of
responses).
Let us examine the four diagrams in turn where the dashed line shows E-estimates. For
the Uniform case (here the regression function is the Uniform) the estimate is good. Can you
see that there are more observations near the right tail than the left one? This is because the
design density is increasing. The scattergram for larger predictors may also suggest that an
underlying regression should have a decreasing right tail, but this is just an illusion created
[Figure 2.7: four scattergrams titled Uniform, ISE = 0.025; Normal, ISE = 0.021; Bimodal, ISE = 0.14; and Strata, ISE = 0.24; in each, Y is plotted against X over [0, 1].]
Figure 2.7 Heteroscedastic nonparametric regression. Observations (the scattergram) are shown by
circles overlaid by the underlying regression (the solid line) and its regression E-estimate (the dashed
line). The underlying model is Y = m(X) + σs(X)ε where m(x) is a corner function indicated in
the title, ε is standard normal and independent of X, and σs(x) is the scale function. ISE is the
integrated squared error ∫_0^1 (m(x) − m̂(x))² dx. The E-estimator knows that an underlying regression
function is nonnegative. {The arguments are: n controls the sample size, sigma controls σ, the string
scalefun defines the shape of a custom function s(x) which is truncated from below by the value dscale
and then rescaled into the bona fide density supported on [0, 1], the string desden controls the shape
of the design density which is then truncated from below by the value dden and then rescaled into a
bona fide density. Argument c controls parameter c in (2.3.8). Arguments cJ0, cJ1 and cTH are
explained in Figure 2.6.} [n = 100, desden = "1+0.5*x", scalefun = "3-(x-0.5)^2", sigma
= 1, dscale = 0, dden = 0.2, c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 4]
by the regression errors. Actually, here one can imagine a number of interesting shapes of
the regression. For the Normal regression the estimate is not perfect but it correctly shows
the unimodal shape. Because the Normal regression has a pronounced shape and its range is
comparable with the regression noise, we can see the correct shape in the scatter plot. The
reason why the magnitude of the estimate is smaller than it should be is the large
noise, which does not allow the E-estimator to use higher frequencies corresponding to smaller
Fourier coefficients (recall Figure 2.3). Also note how the E-estimator uses the fact that
the underlying regression function is nonnegative by truncating negative values of m̂(x).
The E-estimate for the Bimodal regression is respectable keeping in mind the relatively small
sample size. Note that it would be tempting to oversmooth the scattergram and indicate
just one mode. The Strata case is interesting. The E-estimator does a superb job here. To
realize this, just try to draw a regression line through the scattergram; it would be a difficult
task to ignore the outliers.
It is a good exercise to repeat this figure with different parameters and get used to the
nonparametric regression and the E-estimator.
2.4 Bernoulli Regression
Suppose that we are interested in a relationship between a continuous random variable X
(the predictor) and a Bernoulli random variable A. Bernoulli regression (often referred to
as binary or probit regression; also recall the parametric logistic regression) is an important statistical
model that is used in various fields, including actuarial science, machine learning, engineer-
ing, most medical fields, and social sciences. For example, the likelihood of an insurable
event, as a function of a covariate, is a key topic for the insurance industry. The likelihood of
admission to a university, as a function of the SAT score, is another example. Clinical trials,
whose aim is to understand the effectiveness of a new drug as a function of known covariates,
are another classical example. Bernoulli regression is widely used in engineering, especially
for predicting the probability of failure of a given process, system or product. Prediction of
a customer’s propensity to purchase a product or halt a subscription is important for mar-
keting. Furthermore, as we will see shortly, Bernoulli regression often occurs in statistical
problems with modified and missing data.
We begin with reviewing the notion of a classical Bernoulli random variable. Suppose
that a trial, or an experiment, whose outcome can be classified as either a “success” or
as a “failure,” is performed. Introduce a random variable A which is equal to 1 when the
outcome is the success and A = 0 otherwise. Set P(A = 1) = w and correspondingly
P(A = 0) = 1 − w for some constant w ∈ [0, 1]; note that w is the probability of the success.
A direct calculation shows that E{A} = w and V(A) = w(1 − w). If the probability w of
success is unknown and a sample A_1, . . . , A_n from A is available, then the sample mean estimator ŵ := n^{-1} Σ_{l=1}^{n} A_l may be used for estimation of the parameter w. This estimator
is unbiased and enjoys an array of optimal properties. These are the basic facts that we
need to know about a Bernoulli random variable.
Now let us translate the classical parametric Bernoulli setting into a nonparametric one.
Assume that the probability of the success w is the function of a predictor X which is a
continuous random variable supported on the unit interval [0, 1] with the design density
f^X(x) ≥ c_* > 0, x ∈ [0, 1]. In other words, we assume that for x ∈ [0, 1] and a ∈ {0, 1}
P(A = a|X = x) = [w(x)]^a [1 − w(x)]^{1−a}.  (2.4.1)
Note that here w(x) may be considered as either the probability of success given the pre-
dictor equal to x or as the conditional expectation of A given the predictor equal to x, that
is
w(x) := P(A = 1|X = x) = E{A|X = x}. (2.4.2)
This explains why the problem of estimation of the conditional probability of success w(x)
may be solved using nonparametric regression.
The Bernoulli regression problem is to estimate the function w(x) using a sample of size
n from the pair (X, A).
To solve the problem using the E-estimation methodology, we begin with writing down
Fourier coefficients θj of w(x) as an expectation. Write,
θ_j := ∫_0^1 w(x)ϕ_j(x)dx = E{Aϕ_j(X)/f^X(X)}.  (2.4.3)
If the design density is known, (2.4.3) yields the sample mean estimator
θ̃_j := n^{-1} Σ_{l=1}^{n} A_l ϕ_j(X_l)/f^X(X_l).  (2.4.4)
The design density may be known in controlled regressions and some other situations, but in
general it is unknown. In the latter case we may estimate the design density using the
Figure 2.8 Bernoulli regression. In each diagram a scattergram is overlaid by the underlying re-
gression w(x) (the solid line) and its E-estimate (the dashed line). {n controls the sample size, the
string desden defines the shape of the design density which is then truncated from below by dden
and then rescaled into a bona fide density.} [n = 100, desden = "1+0.5*x", dden = 0.2, c = 1,
cJ0 = 3, cJ1 = 0.8, cTH = 4]
E-estimator fˆX (x) of Section 2.2 and plug it in (2.4.4). Because the density E-estimate is
used in the denominator, it is prudent to truncate it from below away from zero. This yields
θ̂_j := n^{-1} Σ_{l=1}^{n} A_l ϕ_j(X_l)/max(f̂^X(X_l), c/ln(n)).  (2.4.5)
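A small R simulation sketch of the sample mean estimator (2.4.4) for the case of a known design density is shown below; the choice w(x) = 0.2 + 0.6x and the design density f^X(x) = 2x are illustrative assumptions, and the plug-in estimator (2.4.5) is obtained by replacing the known design density with its truncated E-estimate exactly as in the regression case.
# Sketch of the Bernoulli regression Fourier estimator (2.4.4), known design density.
set.seed(4)
n <- 200
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
X <- sqrt(runif(n))                 # design density f^X(x) = 2x on [0, 1]
w <- function(x) 0.2 + 0.6 * x      # illustrative probability of success w(x)
A <- rbinom(n, 1, w(X))             # Bernoulli responses
# In practice the denominator is truncated from below as in (2.4.5).
theta.tilde <- sapply(0:5, function(j) mean(A * phi(j, X) / (2 * X)))
round(theta.tilde, 3)               # note that theta_0 = integral of w(x) equals 0.5 here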
Consider two univariate bases {φ_j(x), j = 0, 1, . . .} and {η_j(x), j = 0, 1, . . .} on the unit interval [0, 1]. Then products of
elements from these two bases,
{ϕ_{j_1 j_2}(x_1, x_2) := φ_{j_1}(x_1) η_{j_2}(x_2), j_1, j_2 = 0, 1, . . .},  (2.5.1)
create a basis on [0, 1]² which is called a tensor-product basis. This useful mathematical fact
implies a great flexibility in creating convenient bivariate bases. In the previous section,
the cosine basis on [0, 1] was used; for bivariate functions we will use the cosine tensor-
product basis with elements ϕj1 j2 (x1 , x2 ) := ϕj1 (x1 )ϕj2 (x2 ). Recall that ϕ0 (x) := 1 and
ϕj (x) := 21/2 cos(πjx), j = 1, 2, . . .
If a function f(x_1, x_2) is square integrable on [0, 1]², then its partial sum approximation with cutoffs J_1 and J_2 is
f_{J_1 J_2}(x_1, x_2) = Σ_{j_1=0}^{J_1} Σ_{j_2=0}^{J_2} θ_{j_1 j_2} ϕ_{j_1 j_2}(x_1, x_2),  (2.5.2)
and the Fourier coefficients θj1 j2 are defined by the formula
θ_{j_1 j_2} := ∫_{[0,1]²} f(x_1, x_2) ϕ_{j_1 j_2}(x_1, x_2) dx_1 dx_2.  (2.5.3)
This formula immediately yields the sample mean estimator of Fourier coefficients
for the case of a bivariate density f^{X_1 X_2}(x_1, x_2) supported on [0, 1]². Denote by
(X11 , X21 ), . . . , (X1n , X2n ) a sample of size n from (X1 , X2 ) and define the sample mean
Fourier estimator
θ̂_{j_1 j_2} := n^{-1} Σ_{l=1}^{n} ϕ_{j_1 j_2}(X_{1l}, X_{2l}).  (2.5.4)
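The following R sketch computes the bivariate sample mean Fourier estimator (2.5.4) for the cosine tensor-product basis; the underlying bivariate density (a product of two Beta densities) and all names are illustrative.
# Sketch of the bivariate sample mean Fourier estimator (2.5.4).
set.seed(5)
n <- 100
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
X1 <- rbeta(n, 2, 2); X2 <- rbeta(n, 3, 2)   # a sample from a bivariate density on [0, 1]^2
J1 <- 3; J2 <- 3
theta.hat <- outer(0:J1, 0:J2,
                   Vectorize(function(j1, j2) mean(phi(j1, X1) * phi(j2, X2))))
round(theta.hat, 2)                          # (J1+1) x (J2+1) matrix of estimates
# The projection estimate is then the double sum of theta.hat[j1+1, j2+1]*phi(j1, x1)*phi(j2, x2).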
Note that the equality in the bottom line of (2.5.5) is due to the Parseval identity. The de-
crease in the ISB is similar to what we have in the case of univariate differentiable functions,
and this is the good news. The bad news is that now (J1 + 1)(J2 + 1) Fourier coefficients
[Figure 2.9: two columns of diagrams showing, from top to bottom, the data (scattergrams of X1 versus X2), the underlying bivariate densities, and their E-estimates.]
Figure 2.9 E-estimation of the bivariate density. Two samples of size n = 100 are generated and
the corresponding estimation is exhibited in the two columns of diagrams. The first row of diagrams
shows the data (scattergrams). The second row shows the underlying bivariate densities. The third
row shows corresponding bivariate density E-estimates. {n controls the sample size. An underlying
bivariate density is the product of two corner densities defined by parameters c11 and c12 for the
left column of diagrams and c21 and c22 for the right column.} [n = 100, c11 = 1, c12 = 2, c21
= 2, c22 = 3, cJ0 = 3, cJ1 = 0.8, cTH = 4]
must be estimated, and this yields the variance of the projection bivariate estimator to be
of order n^{-1} J_1 J_2. If we optimize the MISE with respect to J_1 and J_2, then we get that the optimal cutoffs and the corresponding MISE are
J_1* = J_2* = O_n(1) n^{1/(2β+2)} and MISE = O_n(1) n^{-2β/(2β+2)}.  (2.5.6)
Here O_n(1) are generic sequences in n such that 0 < c_* < O_n(1) < c* < ∞. Recall that in the univariate case the MISE decreases faster, with the rate n^{-2β/(2β+1)}. In the case of a d-variate density, the MISE slows down to n^{-2β/(2β+d)}.
What is the conclusion for the case of multivariate functions? We have seen that using
tensor-product bases makes the problem of series approximation of multivariate functions
similar to the univariate case. Nonetheless, several technical difficulties arise. First, apart
from bivariate functions (surfaces), there is no simple tool to visualize a multidimensional
curve. Second, we have seen in Section 2.1 that to approximate fairly well a smooth univari-
ate function, about 5 to 10 Fourier coefficients are needed. For the case of a d-dimensional
curve this translates into 5^d to 10^d Fourier coefficients. Since all these coefficients must be
estimated, the rate of the MISE convergence slows down as d increases. Third, suppose that
n = 100 points are uniformly distributed over the five-dimensional unit cube [0, 1]5 . What
is the probability of having some points in a neighborhood of reasonable size, say a cube
with side 0.2? Since the volume of such a cube is 0.2^5 = 0.00032, the expected number of
points in this neighborhood is n times 0.2^5, i.e., 0.032. As a result, no averaging over that
neighborhood can be performed. For this example, to get on average of 5 points in a cube,
its side should be 0.55, that is, more than a half of the range along each coordinate. This
shows how sparse multivariate observations are. Fourth, the notion of a small sample for
multivariate problems mutates. Suppose that for a univariate regression a grid of 50 points
is considered sufficient. Then this translates into 50 points along each axis, i.e., into 50^d data
points. These complications present a challenging problem, which is customarily referred to
as the curse of dimensionality. However, in no way does the curse imply that the situation is
always hopeless, as we have seen for the case of bivariate functions.
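The sparsity arithmetic of the previous paragraph can be checked with a few lines of R (a sketch of the calculation, nothing more).
# Expected number of uniform points in a subcube of side 0.2 in dimension d = 5.
n <- 100; d <- 5
n * 0.2^d                 # = 0.032, essentially no points to average over
(5 / n)^(1 / d)           # ~ 0.55, the side length needed to capture 5 points on average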
Here α ∈ (0, 1) and its typical values are 0.01, 0.02, 0.05 and 0.1, z_{α/2} is a function of α such that ∫_{z_{α/2}}^{∞} (2π)^{-1/2} e^{-x²/2} dx = α/2, and σ̂ := (σ̂²)^{1/2}. Note that the confidence interval is
random (it is a statistic of observations) and its main property is the following relation
P([μ̂ − z_{α/2} σ̂ n^{-1/2} ≤ μ ≤ μ̂ + z_{α/2} σ̂ n^{-1/2}] | μ) = 1 − α + o_n(1).  (2.6.2)
Recall that on (1) denotes a vanishing sequence in n, and P(D|µ) or P(D|f ) denote the prob-
abilities of event D given an estimated parameter µ or an estimated function f , respectively.
Relation (2.6.2) sheds light on the notion of the 1−α confidence interval (2.6.1), namely,
this confidence interval covers an underlying parameter µ with the probability 1 − α. Note
how the half-length of the confidence interval, called the margin of error, sheds light on the
used sample mean estimator µ̂, namely, it tells us how far the estimator can be from the
underlying parameter of interest. Finally, let us stress that while a confidence interval is
based on the estimated variance, its length is proportional to the standard deviation σ̂.
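A minimal R sketch of this classical confidence interval is as follows; the exponential sample is used only as an example of data with a finite second moment, and all names are illustrative.
# Normal-approximation (1 - alpha) confidence interval for a mean.
set.seed(6)
n <- 200; alpha <- 0.05
Z <- rexp(n, rate = 2)                      # any sample with a finite second moment
mu.hat <- mean(Z)
sigma.hat <- sd(Z)
z <- qnorm(1 - alpha / 2)                   # z_{alpha/2}
c(mu.hat - z * sigma.hat / sqrt(n), mu.hat + z * sigma.hat / sqrt(n))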
Now let us return to our problem of nonparametric estimation of the density. Suppose
that for a particular sample of size n we get an E-estimate fˆX (x) = 1 + θ̂j ϕj (x); recall that
θ0 = 1 for any density supported on [0, 1]. This estimate allows us to conjecture that an
underlying density is fjX (x) := 1 + θj ϕj (x). Assume that the conjecture is correct. Then
(2.6.1) and (2.6.2) yield that (we use notation ϕ_j²(x) := [ϕ_j(x)]²)
P(|f_j^X(x) − f̂^X(x)| ≤ z_{α/2}[n^{-1} V(ϕ_j(X)) ϕ_j²(x)]^{1/2} | f_j^X) = 1 − α + o_n(1),  x ∈ [0, 1],  (2.6.3)
and
P(max_{x∈[0,1]} |f_j^X(x) − f̂^X(x)| [n^{-1} V(ϕ_j(X)) ϕ_j²(x)]^{-1/2} ≤ z_{α/2} | f_j^X) = 1 − α + o_n(1).  (2.6.4)
Then, using the Parseval identity, the MISE (the mean integrated squared error) of the
estimator fˆX (x) = 1 + θ̂j ϕj (x) can be written as the sum of the variance and the integrated
squared bias (ISB) of the estimator, namely
E{∫_0^1 (f̂^X(x) − f^X(x))² dx} = V(θ̂_j) + Σ_{k∈{1,2,...}\{j}} θ_k².  (2.6.6)
In (2.6.3)-(2.6.4) the term ISB is ignored, and this explains the variance-band terminology.
There are several traditional justifications for using the variance-band approach. The first
one is that in nonparametric inference it is difficult (even theoretically) to take into account
the bias term because it involves an infinite number of unknown Fourier coefficients. Un-
fortunately, any approach that takes into account the bias may lessen the effect of the bias but
cannot eliminate it. The second justification is that for many densities and cases of small
to moderate sample sizes, the bias of the E-estimate is relatively small, and simulations
show that the variance-band approach implies reliable confidence bands. Finally, as we will
see in the following chapters, in a number of applications the variance-band is of interest
on its own. This is why the variance-band approach is recommended for nonparametric
estimation. Furthermore, knowing flaws of the variance-band approach allows us to make
reasonable inferences about E-estimates.
Now let us expand the considered example of a single Fourier coefficient in an E-
estimate of the density to a general case. Suppose that an E-estimate is f̂^X(x) = 1 + Σ_{j∈N} θ̂_j ϕ_j(x). The estimate allows us to conjecture that an underlying density is f_N^X(x) = 1 + Σ_{j∈N} θ_j ϕ_j(x). Then we can write that f̂^X(x) − f_N^X(x) = Σ_{j∈N} (θ̂_j − θ_j)ϕ_j(x).
The expectation of this difference is zero and the variance is (below a general formula for
calculation of the variance of a sum of random variables is used)
σ_N²(x) := V(Σ_{j∈N} (θ̂_j − θ_j)ϕ_j(x)) = Σ_{i,j∈N} ϕ_i(x)ϕ_j(x) Cov(θ̂_i, θ̂_j),  (2.6.7)
where the covariance Cov(θ̂i , θ̂j ) := E{(θ̂i − E{θ̂i })(θ̂j − E{θ̂j })} := σij (n) =: σij . This
[Figure 2.10: four diagrams for the corner densities (the first is titled 1. Uniform, ISE = 0) with the sample size n = 200.]
Figure 2.10 Pointwise and simultaneous confidence bands for the density E-estimator. Four di-
agrams correspond to the four underlying corner densities. The solid and dashed lines show the
underlying density and its E-estimate, respectively. Two dotted lines show a pointwise band and
two dot-dashed lines show a simultaneous band. Each title shows the integrated squared error of the
E-estimate which is ∫_0^1 (f̂^X(x) − f^X(x))² dx. {Argument alpha controls α.} [n = 200, alpha = 0.05,
cJ0 = 3, cJ1 = 0.8, cTH = 4]
formula allows us to calculate sample mean estimates σ̂ij of σij and then plug them in
(2.6.7) and get an estimate σ̂N (x) for σN (x). This, together with the central limit theorem,
implies the following pointwise variance-band (or in short pointwise band)
P(|f_N^X(x) − f̂^X(x)| ≤ z_{α/2} σ̂_N(x) | f_N^X) = 1 − α + o_n(1),  x ∈ [0, 1].  (2.6.8)
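To make (2.6.7) and (2.6.8) concrete, the following R sketch computes a pointwise variance band for a simple cosine-series density estimate; the fixed set N = {1, 2, 3} of included Fourier coefficients and the plug-in covariance estimate are illustrative assumptions rather than the adaptive choice made by the E-estimator.
# Sketch of a pointwise variance band (2.6.8) for a cosine-series density estimate.
set.seed(7)
n <- 200; alpha <- 0.05; J <- 3
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
X <- rbeta(n, 2, 2)                            # sample from a density on [0, 1]
B <- sapply(1:J, function(j) phi(j, X))        # n x J matrix of basis values
theta.hat <- colMeans(B)                       # Fourier estimates, j = 1, ..., J
S <- cov(B) / n                                # plug-in estimate of Cov(theta.hat_i, theta.hat_j)
x <- seq(0, 1, length.out = 101)
P <- sapply(1:J, function(j) phi(j, x))        # 101 x J matrix of phi_j(x)
f.hat <- 1 + P %*% theta.hat                   # estimate 1 + sum_j theta.hat_j phi_j(x)
sigma.N <- sqrt(rowSums((P %*% S) * P))        # sigma.hat_N(x) from (2.6.7)
z <- qnorm(1 - alpha / 2)
band <- cbind(lower = f.hat - z * sigma.N, upper = f.hat + z * sigma.N)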
2.7 Exercises
2.1.1 Describe the four corner functions. What are their special characteristics and differ-
ences?
2.1.2 How can a sample from the Bimodal density be generated?
2.1.3 Prove that if function f (x) is square integrable on [0, 1] and {ϕj (x), j = 0, 1, . . .} is
the basis on this interval, then ∫_0^1 f²(x)dx = Σ_{j=0}^{∞} θ_j².
2.1.4 Use Figures 2.1 and 2.2 and suggest a reasonable Fourier approximation of the Normal
corner function using a minimal number of elements of the cosine basis.
2.1.5 Use Figure 2.3 to choose minimal cutoffs that imply a reasonable approximation of
corner functions. Explain why the cutoffs are different.
2.1.6 Check tails of approximations in Figure 2.3 and explain why they are poor for some
functions and good for others.
2.1.7 Explain how the integration by parts formula (2.1.5) can be verified.
2.1.8 Verify (2.1.7).
2.1.9∗ Assume that a function f (x), x ∈ [0, 1] has k bounded derivatives on [0, 1]. Explain
how its Fourier coefficients θ_j decrease. Do they decrease with the rate j^{-k} or faster?
2.1.10∗ How do boundary conditions affect the decrease of Fourier coefficients?
2.1.11 Define Sobolev and Analytic function classes.
2.1.12 Do Fourier coefficients from a Sobolev class decrease faster than from an Analytic
class?
2.1.13 Verify (2.1.13). Can a sharper inequality be proposed?
2.1.14 Verify (2.1.14). Explain how it may be used for the analysis of the MISE.
2.1.15 Prove validity of (2.1.5) or correct it.
2.1.16 Consider an analytic class Ar,Q and evaluate |f (x) − fJ (x)|. Hint: Follow (2.1.15).
2.1.17∗ Let r := r(α) denote the largest integer strictly less than a positive number α,
and L > 0 be another positive number. Introduce a Hölder class Hα,L of functions f (x),
x ∈ [0, 1] whose rth derivative f^{(r)}(x) satisfies
|f^{(r)}(x) − f^{(r)}(y)| ≤ L|x − y|^{α−r},  (x, y) ∈ [0, 1]².  (2.7.1)
Further, for a positive integer k introduce a (Sobolev) class Dk,L of functions f (x), x ∈ [0, 1]
such that f^{(k−1)}(x) is absolutely continuous and
∫_0^1 [f^{(k)}(x)]² dx ≤ L².  (2.7.2)
Prove that the function class D_{k,L} contains the Hölder function class H_{k,L}.
2.1.18∗ Using notation of Exercise 2.1.17∗ , introduce a periodic function class
D*_{k,L} := {f : f ∈ D_{k,L}, f^{(r)}(0) = f^{(r)}(1), r = 0, 1, . . . , k − 1}.  (2.7.3)
Further, introduce the classical trigonometric basis {η_0(x) = 1, η_{2j−1}(x) = 2^{1/2} sin(2πjx), η_{2j}(x) = 2^{1/2} cos(2πjx), j = 1, 2, . . .} on [0, 1]. Show that if f ∈ D*_{k,L} then
f(x) = Σ_{j=0}^{∞} ν_j η_j(x),  (2.7.4)
where ν_j := ∫_0^1 f(x)η_j(x)dx are Fourier coefficients and
Σ_{j=1}^{∞} (2πj)^{2k} [ν_{2j−1}² + ν_{2j}²] ≤ L².  (2.7.5)
2.1.19∗ Using notation of Exercise 2.1.18∗, prove that if (2.7.4)-(2.7.5) hold, then f ∈ D*_{k,L}.
2.1.20 Combine assertions of the previous three exercises and make a conclusion about the
introduced function classes.
2.1.21∗ Let a function f(x), x ∈ [0, 1] be r-times differentiable on [0, 1], ∫_0^1 [f^{(r)}(x)]² dx < ∞, and for all positive and odd integers k < r
f^{(k)}(0) = f^{(k)}(1) = 0.  (2.7.6)
Further, let ϕj (x), j = 0, 1, . . . be elements of the cosine basis (2.1.3) on [0, 1]. Prove that
Fourier coefficients θ_j := ∫_0^1 f(x)ϕ_j(x)dx of the function f(x) satisfy the Parseval-type identity
Σ_{j=1}^{∞} (πj)^{2r} θ_j² = ∫_0^1 [f^{(r)}(x)]² dx.  (2.7.7)
2.1.23∗ Explain how the result of the previous exercise may help to accelerate convergence
of the cosine Fourier series, or in other words may help to avoid the boundary problem
discussed in Section 2.1.
2.1.24∗ Consider the cosine basis {ϕj (x), j = 0, 1, . . .} on [0, 1], and introduce an analytic
(exponential) ellipsoid
E_{α,Q} := {f(x) : Σ_{j=0}^{∞} e^{2αj} θ_j² ≤ Q, θ_j := ∫_0^1 f(u)ϕ_j(u)du, x ∈ [0, 1]}.  (2.7.12)
Consider a series approximation f_J(x) := Σ_{j=0}^{J} θ_j ϕ_j(x) for functions f from the ellipsoid,
and then explore its ISB and |fJ (x) − f (x)|.
2.2.1 Give definitions of a discrete, continuous and mixed random variables.
2.2.2 For a continuous random variable, what is the relationship between the cumulative
distribution function, the survivor function and the probability density?
2.2.3 Suppose that the density of a random variable X is known. Define the expectation
and variance of a function g(X).
2.2.4 What is the relationship between the variance, first and second moments of a random
variable?
2.2.5 Is the variance of a random variable less, equal or larger than the second moment?
2.2.6 Consider a sample X_1, . . . , X_n from a random variable X. Is the sample mean n^{-1} Σ_{l=1}^{n} g(X_l) an unbiased estimator of E{g(X)}? Do you need any assumption to support
your conclusion?
2.2.7 Consider a sample X_1, . . . , X_n from a random variable X. What is the variance of the statistic n^{-1} Σ_{l=1}^{n} g(X_l)? Do you need any assumption to support your conclusion?
2.2.8 Propose a sample mean estimator of a Fourier coefficient θ_j := ∫_0^1 f^X(x)ϕ_j(x)dx
based on a sample of size n from the random variable X supported on [0, 1].
2.2.9∗ Consider the problem of Exercise 2.2.8 where now it is no longer assumed that [0, 1]
is the support of X. Suggest the sample mean estimate for θj . Hint: Write down θj as the
expectation and recall the use of the indicator function I(·), for instance, we can write
∫_0^1 g(x)dx = ∫_{−∞}^{∞} I(x ∈ [0, 1]) g(x)dx.
2.2.10 Explain the first step of the E-estimator.
2.2.11∗ Explain the second step of the E-estimator.
2.2.12 Explain the third step of the E-estimator.
2.2.13 What is the reason for having the indicator function in (2.2.3)?
2.2.14∗ Why do we have a factor 2 in the right side of (2.2.4)?
2.2.15 Is the Fourier estimator (2.2.7) unbiased?
2.2.16∗ Why is the sum in (2.2.8) divided by (n − 1) despite the fact that there are n terms
in the sum?
2.2.17∗ Verify formula (2.2.10).
2.2.18 What is the definition of the integrated squared bias of a density estimate?
2.2.19 What is the definition of the MISE? Comment on this criterion.
2.2.20 Verify relation (2.2.13).
2.2.21∗ Prove that (2.2.16) is the projection on the class of densities under the ISE criterion.
2.2.22 Formula (2.2.16) defines the projection of a function on a class of densities. Draw
a graphic of a continuous density estimate that takes on both positive and negative values
and is integrated to one. Then explain how projection (2.2.16) will look like. Hint: Think
about a horizontal line such that the curve above it is the projection.
2.2.23 Verify formula (2.2.19). What can one conclude from the formula?
2.2.24 Explain all parameters of the E-estimator used in Figure 2.5.
2.2.25∗ Use Figure 2.5 and choose an optimal argument cJ0 for each combination of sample
sizes 100, 200, 300 and four corner functions. Comment on your recommendations.
2.2.26∗ Use Figure 2.5 and choose an optimal argument cJ1 for each combination of sample
sizes 100, 200, 300 and all four corner functions. Comment on your recommendations.
2.2.27∗ Use Figure 2.5 and choose an optimal argument cTH for each combination of sample
sizes 100, 200, 300 and all four corner functions. Comment on your recommendations.
2.2.28∗ Use Figure 2.6 and repeat Exercises 2.2.25-27.
2.2.29 Suppose that a random variable of interest Y is supported on a given interval [a, b].
How can the software, developed for the unit interval [0, 1], be used for estimation of the
density f Y (y)?
2.2.30∗ Consider the question of Exercise 2.2.29, only now the support is unknown.
2.2.31∗ Explain and prove formulas (2.2.20), (2.2.21) and (2.2.22).
2.2.32 What is the definition of a consistent density estimator?
2.2.33∗ Prove (2.2.25) for the case of Sobolev densities. Is it possible to establish the
consistency of a projection estimator for densities from a Sobolev class of order α ≤ 1/2?
2.2.34∗ For a class of analytic densities, establish an inequality similar to (2.2.25).
2.2.35∗ Prove consistency of the E-estimator for a Sobolev class of order α > 1/2.
2.3.1 Explain the problem of a nonparametric regression. What is the difference between
the classical linear regression and nonparametric one?
2.3.2 Why is the regression function E{Y |X = x} a reasonable predictor of Y given X = x?
2.3.3 Verify (2.3.3). What does the inequality tell us about the variance? Does this inequal-
ity shed light on the MSE in (2.3.1)?
2.3.4∗ Is (2.3.4) a more general regression model than (2.3.2)? Justify your answer. Can
these two models be equivalent?
2.3.5 Define homoscedastic and heteroscedastic regressions. Which one, in your opinion, is
more challenging?
2.3.6 Explain every step in establishing (2.3.6).
2.3.7 Is (2.3.7) a sample mean estimator? Explain your answer.
2.3.8∗ What is the variance of the estimator (2.3.7)? What is the coefficient of difficulty?
2.3.9 Explain the motivation of the estimator (2.3.8) and why it uses truncation from the
zero of the design density estimator.
2.3.10∗ Evaluate the mean, variance and the coefficient of difficulty of Fourier estimator
(2.3.8). Hint: Begin with the case of known design density.
2.3.11∗ Explain the modification of the E-estimate proposed in Remark 2.3.1. Then suggest
a procedure which uses not only θ̂0 but also other Fourier estimates.
2.3.12∗ Propose a procedure for estimation of θj using the idea of numerical integration.
Then explore its mean and variance. Hint: Use (2.3.6) and then replace m(Xl ) by Yl .
2.3.13 Explain arguments of Figure 2.7.
2.3.14∗ For each corner function and each sample size, find optimal parameters of the
E-estimator used in Figure 2.7. Explain the results.
2.3.15 Repeat Figure 2.7 with different design densities that make estimation of the Bimodal
more and less complicated.
2.3.16 Using Figure 2.7, explain how the scale function affects estimation of each corner
function.
2.4.1 What is the definition of a Bernoulli random variable?
2.4.2 What is the mean and variance of a Bernoulli random variable?
2.4.3 Consider the sum of n identically distributed and independent Bernoulli random
variables. What is the mean and variance of the sum?
2.4.4 Present several examples of the Bernoulli regression.
2.4.5 Why is the Bernoulli regression problem equivalent to estimation of the conditional
probability of the success given the predictor?
2.4.6 Describe the E-estimator for Bernoulli regression.
2.4.7∗ Explain the motivation behind the Fourier estimator (2.4.5). Then find its mean,
variance and the coefficient of difficulty.
2.4.8 Estimates in Figure 2.8 are not satisfactory. Can they be improved by some innova-
tions, or is there nothing that can be done better for the data at hand?
2.4.9∗ Repeat Figure 2.8 for different sample sizes and for each corner function choose a
minimal sample size that implies a reliable regression estimation.
2.4.10 If our main concern is the shape of curves, which of the corner functions is more
difficult and simpler for estimation?
2.4.11 Find better values for parameters of the E-estimator used in Figure 2.8.
2.4.12 Compare presentation of diagrams in Figures 2.7 and 2.8. Which one do you
prefer? Explain your choice.
2.5.1 How does one construct a tensor-product basis?
2.5.2 Write down Fourier approximation of a three-dimensional density, supported on [0, 1]3 ,
using a tensor-product cosine basis.
2.5.3∗ Explain the sample mean Fourier estimator (2.5.4). Find the mean, variance, and
the coefficient of difficulty.
2.5.4 Consider the problem of estimation of a bivariate density supported on [0, 1]². Explain
construction of the E-estimator.
2.5.5 Repeat Figure 2.9 several times, and for each simulation compare your understanding
of the underlying density, exhibited by data, with the E-estimate.
2.5.6 Using Figure 2.9, suggest optimal parameters for the E-estimator.
2.5.7 Repeat Figure 2.9 with underlying densities generated by different densities for the
variables X1 and X2 . Does this make estimation of the bivariate density more complicated
or simpler?
2.5.8 For the setting of Exercise 2.5.7, would you recommend using the same or different
parameters of the E-estimator?
2.5.9 Verify relations (2.5.5).
2.5.10∗ Verify expressions for the optimal cutoffs and the MISE presented in (2.5.6).
2.5.11 What is the curse of multidimensionality?
2.5.12 How can the curse of multidimensionality be overcome?
2.5.13∗ Figure 2.9 allows us to estimate a density supported on [0, 1]2 . How can a density
with unknown support be estimated using the software? Will this change the coefficient of
difficulty of the E-estimator?
2.5.14∗ Consider a bivariate regression function m(x1 , x2 ) := E{Y |X1 = x1 , X2 = x2 }.
A sample of size n from a triplet (X1 , X2 , Y ) is available. Suggest a bivariate regression
E-estimate. Hint: Propose convenient assumptions.
2.6.1 Consider estimation of the mean of a bounded function g(X) based on a sample of
size n from X. Propose an estimator and its (1 − α) confidence interval.
2.6.2∗ Suppose that we have a sample of size n from a normal random variable X with
unknown mean θ. Let g(x) be a bounded and differentiable function. Propose an estimator
of g(θ) and its 1 − α confidence interval.
2.6.3 How is the Central Limit Theorem used in construction of confidence intervals?
2.6.4 Explain the approach leading to (2.6.3).
2.6.5 Prove (2.6.4).
2.6.6 What is the definition of a nonparametric pointwise confidence band?
2.6.7 What is the definition of a nonparametric simultaneous confidence band?
2.6.8 Explain how the relation (2.6.6) is obtained.
2.6.9 Define the variance and the ISB (integrated squared bias) of a nonparametric esti-
mator.
2.6.10 Why may an unknown ISB prevent us from constructing a reliable confidence
band for small samples?
2.6.11 Using Figure 2.3 explain why a variance-band, which ignores the ISB, may be a
reasonable approach for the corner densities.
2.6.12∗ An analytic class of densities was introduced in (2.1.12). Consider an analytic
density, write down the MISE of a projection estimate, and then evaluate the ISB using
(2.1.12). Using the obtained expression and the Central Limit Theorem, suggest a reasonable
confidence band. Hint: Note that the inference for analytic functions justifies the variance-
band approach.
2.6.13∗ Verify (2.6.7). Hint: Recall the rule of calculating the variance of a sum of dependent
random variables.
2.6.14∗ Write down a reasonable estimator for the covariance σij defined below line (2.6.7).
Then explore its statistical properties like the mean and variance.
2.6.15∗ Verify (2.6.8). Hint: Use the Central Limit Theorem. Further, note that the set N
is random, so first consider the conditional expectation given the set.
2.6.16 Explain the underlying idea of the proposed simultaneous confidence band defined
in (2.6.10).
2.6.17∗ Propose an algorithm for calculating the simultaneous confidence band introduced
in (2.6.10). In other words, think about a program that will do this. Hint: You need to find
coefficients cj,α/2,n satisfying (2.6.9). Note that Zj in (2.6.9) are components of a specific
normal random vector defined above line (2.6.9).
2.6.18 Repeat Figure 2.10. Find a simulation when the E-estimate for the Normal is not
perfect, and explain the shape of the bands.
2.6.19 If the parameter α is decreased, then will a corresponding band be wider or narrower?
Check your answer using Figure 2.10.
2.6.20∗ If you look at a band in Figure 2.10, then it is plain to notice that its width is not
constant. Why?
2.6.21 Titles of diagrams in Figure 2.10 show values of the ISE. What is the definition of
the ISE and why is it used to quantify quality of a nonparametric estimate?
2.6.22 How is the ISE of an E-estimate related to a band? Answer this question heuristically
and using mathematical analysis, and then check your answers using Figure 2.10.
2.6.23∗ Fourier estimators θ̂j , j = 1, 2, . . . of the density E-estimator are dependent random
variables. In particular, verify the following formula or find a correct expression,
may be useful.
2.6.24∗ Explore the problem of Exercise 2.6.23 for a regression model.
2.8 Notes
In what follows the subsections correspond to sections in the chapter.
2.1 Fourier series approximation is one of the cornerstones of the mathematical science.
The presented material is based on Chapter 2 in Efromovich (1999a). The basic idea of
Fourier series is that a periodic function may be expressed as a sum of sines and cosines.
This idea was known to the Babylonians, who used it for the prediction of celestial events.
The history of the subject in more recent times begins with d’Alembert, who in the eigh-
teenth century studied the vibrations of a violin string. Fourier’s contributions began in
1807 with his studies of the problem of heat flow. He made a serious attempt to prove
that any function may be expanded into a trigonometric sum. A satisfactory proof was
found later by Dirichlet. These and other historical remarks may be found in the book by
Dym and McKean (1972). Also, Section 1.1 of that book gives an excellent explanation
of the Lebesgue integral, which should be used by readers with advanced mathematical
background. The mathematical books by Krylov (1955), Bary (1964) and Kolmogorov and
Fomin (1957) give a relatively simple discussion (with rigorous proofs) of Fourier series.
There are many good books on approximation theory. Butzer and Nessel (1971) and Nikol-
skii (1975) are the classical references, and DeVore and Lorentz (1993), Temlyakov (1993),
and Lorentz, Golitschek and Makovoz (1996) may be recommended as solid mathematical
references. An interesting topic of wavelet bases and series wavelet estimators are discussed
in Walter (1994), Donoho and Johnstone (1995), Efromovich (1997c; 1999a; 2000a,c; 2004e;
2007a,b; 2009b), Härdle et al. (1998), Mallat (1998), Vidakovic (1999), Efromovich et al.
(2004), Efromovich et al. (2008), Nason (2008), Efromovich and Valdez-Jasso (2010), and
Efromovich and Smirnova (2014a,b).
2.2 This section is based on Chapter 3 of Efromovich (1999a). The first result about op-
timality of Fourier series estimation of nonparametric densities is due to Chentsov (1962).
Professor Chentsov was not satisfied with the fact that this estimate could take on negative
values, and later proposed to estimate g(x) := log(f (x)) by a series estimator ĝ(x) and then
set fˆ(x) := eĝ(x) ; see Chentsov (1980) and also Efron and Tibshirani (1996). Clearly, the
last estimator is nonnegative. Recall that we dealt with this issue by using the projection
(2.2.17), see also a discussion in Efromovich (1999a) and Glad, Hjort and Ushakov (2003).
An interesting discussion of the influence of Kolmogorov’s results and ideas in approxi-
mation theory on density estimation may be found in Chentsov (1980) and Ibragimov and
Khasminskii (1981). Series density estimators are discussed in the books by Devroye and
Györfi (1985), Devroye (1987), Thompson and Tapia (1990), Tarter and Lock (1993), Hart
(1997), Wasserman (2006), Tsybakov (2009), Hollander, Wolfe and Chicken (2013). Asymp-
totic justification of using adaptive Fourier series density estimators is given in Efromovich
(1985; 1989; 1996b; 1998a; 1999d; 2000b; 2008b; 2009a; 2010a,c; 2011c) and Efromovich
and Pinsker (1982). A discussion of plug-in estimation may be found in Bickel and Dok-
sum (2007). Projection procedures are discussed in Birgé and Massart (1997), Efromovich
(1999a, 2010c), and Tsybakov (2009). There is also a rich literature on other approaches to
density estimation, see a discussion in the classical book by Silverman (1986) as well as in
Efromovich (1999a), Scott (2015), and more recent results in Sakhanenko (2015, 2017). See
also an interesting practical application in Efromovich and Salter-Kubatko (2008).
Let us specifically stress that the used default values of parameters of the E-estimator
are not necessarily “optimal,” and even using the word optimal is questionable because
there is no feasible risk for the analysis of the E-estimator for small samples. It is up to the
reader to choose parameters that imply better fit for the reader’s favorite corner densities
and sample sizes. And this is where the R package becomes so handy because it allows one
to find custom-chosen parameters for every considered setting.
For the more recent development in the sharp minimax estimation theory, including the
superefficiency, see Efromovich (2014a, 2016a, 2017).
2.3 The discussed nonparametric regression is based on Chapter 4 of Efromovich (1999a).
There are also many good books where different applied and theoretical aspects of non-
parametric regression are discussed. These books include Carroll and Ruppert (1988),
Eubank (1988), Müller (1988), Härdle (1990), Wahba (1990), Green and Silverman (1994),
Wand and Jones (1995), Simonoff (1996), Nemirovskii (1999), Györfi et al. (2002), Takezawa
(2005), Li and Racine (2009), Berk (2016), Faraway (2016), Matloff (2017) and Yang (2017)
among others. A chapter-length treatment of orthogonal series estimates may be found in
Eubank (1988, Chapter 3). Asymptotic justification of orthogonal series estimation for the
regression model is given in Efromovich (1986; 1992; 1994b; 1996a; 2001c; 2002; 2005a;
2007d,e), where it is established that for smooth functions a data-driven series estima-
tor outperforms all other possible data-driven estimators. Practical applications are also
discussed. The heteroscedastic regression was studied in Efromovich (1992, 2013a) and
Efromovich and Pinsker (1996), where it is established that asymptotically a data-driven
orthogonal series estimator outperforms any other possible data-driven estimators whenever
an underlying regression function is smooth. Also, Efromovich and Pinsker (1996) present
results of numerical comparison between series and local linear kernel estimators. In some
applications it is known that an underlying regression function satisfies some restrictions
on its shape, like monotonicity or unimodality. Nonparametric estimation under the shape
restrictions is discussed in Efromovich (1999a, 2001a), Horowitz and Lee (2017), and in
the book Groeneboom and Jongbloed (2014) solely devoted to the estimation under shape
restrictions.
2.4 Bernoulli regression is considered in Chapter 4 of Efromovich (1999a), see also a dis-
cussion and further references in Mukherjee and Sen (2018). It also may be referred to as
binary regression. The asymptotic justification of the E-estimator is given in Efromovich
(1996a) and Efromovich and Thomas (1996) where also an interesting application to the
analysis of the sensitivity can be found. As we will see shortly, this regression is a pivot in
solving a number of statistical problems with missing data.
2.5 This section is based on Chapter 6 of Efromovich (1999a) which contains discussion of
a number of multivariate problems arising in nonparametric estimation. Asymptotic theory
can be found in Efromovich (2000b; 2002; 2010c,d). Although generalization of most of the
univariate series estimators to multivariate series estimators appears to be feasible, we have
seen that serious problems arise due to the curse of multidimensionality, as it was termed
by Bellman (1961). The curse is discussed in the books by Silverman (1986) and Hastie
and Tibshirani (1990). Many approaches have been suggested aimed at a simplification and
overcoming the curse: additive and partially linear modeling, principal components analysis,
projection pursuit regression, classification and regression trees (CART), multivariate adap-
tive regression splines, etc. A discussion may be found in the book Izenman (2008). Many
of these methods are supported by a number of specialized R packages, see also Everitt
and Hothorn (2011). Approximation theory is discussed in the books by Nikolskii (1975),
Temlyakov (1993), and Lorentz, Golitschek, and Makovoz (1996). A book-length discussion
of multivariate density estimators (with a particular emphasis on kernel estimators) is given
in Scott (2015).
2.6 Inference about nonparametric estimators, based on the analysis of the MISE and
its decomposition into variance and integrated squared bias, can be found in the books
Efromovich (1999a) and Wasserman (2006). The latter contains an interesting overview of
different approaches to construction of confidence bands as well as a justification of the
variance-band approach. Theoretical analysis of nonparametric confidence bands, including
proofs of impossibility of constructing efficient data-driven bands, can be found in (mathe-
matically involved) papers Cai and Low (2004), Genovese and Wasserman (2008), Giné and
Nickl (2010), Hoffmann and Nickl (2011), Cai and Guo (2017), and Efromovich and Chu
(2018a,b). Nonparametric hypotheses testing is explored in the book by Ingster and Suslina
(2003). Bayesian approach is discussed in Yoo and Ghosal (2016).
Chapter 3
Estimation for Basic Models of Modified Data
This chapter considers basic models where underlying observations are modified and the
process of modification is known. The studied topics serve as a bridge between the case of
direct data and the case of missing, truncated and censored data considered in the following
chapters. Section 3.1 is devoted to density estimation based on biased data. Biased data is
a classical statistical example of modified data. The interesting aspect of the presentation
is that biased sampling is explained via a missing mechanism. Namely, there are underlying
hidden realizations of a random variable of interest X ∗ , and then a hidden realization may
be observed or skipped with the likelihood depending on the value of the realization. This
sampling mechanism creates biased data because the distribution of the observed X is
different from the distribution of the hidden X ∗ . As we will see in the following chapters,
missing, truncation, censoring and other modifications typically imply biased data. Section
3.2 considers regression with biased responses. Section 3.3 explores regression with biased
predictors and responses. Other sections are devoted to special topics. Among them, results
of Section 3.7, where Bernoulli regression with unavailable failures is considered, will be
often used in the next chapters.
To estimate the density f^{X*}(x), we need to understand how it is related to the density
f^X(x) of biased observations. In what follows we assume that X* is supported on [0, 1].
This formula points to a simple sample mean estimator of P(A = 1).
Note that the above-presented formulas are based on the sequential model of creating bi-
ased data. In general, instead of exploring the process of collecting biased data, the problem
of estimation of f^{X*}(x) based on a biased sample from X is formulated via the following relation:

f^X(x) = f^{X*}(x)B(x) / ∫_0^1 f^{X*}(u)B(u)du.  (3.1.6)
In this case B(x) is not necessarily a probability and may take on values larger than 1. On
the other hand, according to (3.1.6), the biasing function can always be rescaled to make it
not larger than 1, and this rescaling does not change the probability model.
In what follows we will use model (3.1.3)-(3.1.4) rather than (3.1.6) to stress the fact
that biased data may be explained via a sequential missing mechanism. As was already
explained, this does not affect the generality of the considered model. Also, we will use the
notation B^{-1}(x) := 1/B(x).
Now, given known B(x), (3.1.1) and (3.1.3)-(3.1.4), we are in a position to explore the
problem of estimation of the density f^{X*} based on a biased sample X_1, . . . , X_n from X.
First of all, let us stress that if the biasing function B(x) is unknown, then no consistent
estimation of the density of interest is possible. This immediately follows from (3.1.6), which
shows that only the product f^{X*}(x)B(x) is estimable. Unless the nuisance function is known
(or may be estimated), we cannot factor out f^{X*}(x) from the product f^{X*}(x)B(x). This
is the pivotal moment in our understanding of the biasing modification of data, and we
will observe it in many particular examples of missing and modified data. Only if the
biasing function is known (for instance from previous experiments) or may be estimated
based on auxiliary data does the formulated estimation problem become feasible and consistent
estimation of f^{X*}(x) become possible.
The E-estimation methodology of Section 2.2 tells us that we need to propose a sample
mean estimator of Fourier coefficients θ_j of the density of interest f^{X*}(x). To do this, we
should first write down θ_j as an expectation and then mimic the expectation by a sample
mean estimator. To make the first step, using (3.1.3) let us write down Fourier coefficients
θ_j as follows:

θ_j := ∫_0^1 ϕ_j(x)f^{X*}(x)dx = P(A = 1) ∫_0^1 ϕ_j(x)f^X(x)B^{-1}(x)dx = P(A = 1)E{ϕ_j(X)B^{-1}(X)}.  (3.1.7)

Here ϕ_j(x) are elements of the cosine basis on [0, 1] (the definition and discussion can be
found in Section 2.1). Note that θ_0 = ∫_0^1 ϕ_0(x)f^{X*}(x)dx = ∫_0^1 f^{X*}(x)dx = 1, and hence we
need to estimate only Fourier coefficients θ_j, j ≥ 1. Formula (3.1.7) implies the following
plug-in sample mean estimator of Fourier coefficients:

θ̂_j := P̂ n^{-1} Σ_{l=1}^n ϕ_j(X_l)B^{-1}(X_l),  (3.1.8)

where P̂ is the sample mean estimator of P(A = 1) mentioned earlier.
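For readers who like to see the computation spelled out, here is a minimal R sketch of the Fourier estimator (3.1.8). It is not the full E-estimator; the biasing function B(x) = 0.2 + 0.8x (the default of Figure 3.1), a Beta(3,3) hidden density standing in for a corner density, and the sample-mean estimate of P(A = 1) are all illustrative assumptions.

```r
# Minimal sketch of (3.1.8). Assumptions: B(x) = 0.2 + 0.8*x as in Figure 3.1, a Beta(3,3)
# hidden density for X*, and P(A = 1) estimated by 1/mean(1/B(X)), which is consistent
# with the normalization in (3.1.6).
set.seed(1)
n <- 100
B <- function(x) 0.2 + 0.8 * x
X <- numeric(0)
while (length(X) < n) {                      # sequential missing: keep X* only if A = 1
  xstar <- rbeta(1, 3, 3)
  if (rbinom(1, 1, B(xstar)) == 1) X <- c(X, xstar)
}
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
P.hat <- 1 / mean(1 / B(X))                  # sample-mean estimate of P(A = 1)
theta.hat <- sapply(1:10, function(j) P.hat * mean(phi(j, X) / B(X)))  # formula (3.1.8)
```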
Figure 3.1 Density estimation for biased data with the biasing function B(x) = a + bx. Four
diagrams correspond to different underlying densities f^{X*}(x) indicated in the title and shown by
the solid line. The biased observations are shown by the histogram and the density E-estimate
f̂^{X*}(x) by the dashed line. {Parameters of the biasing function are controlled by set.B = c(a,b).}
[n = 100, set.B = c(0.2,0.8), cJ0 = 4, cJ1 = .5, cTH = 4]
The Normal diagram in Figure 3.1 is another interesting example of length-biased
data. Again we see the right-skewed histogram of the biased data, and the E-estimator
correctly restores the symmetric shape of the Normal density. The funny tails are due to
outliers. The Bimodal is another teachable example. This density is difficult to estimate
even in the case of direct observations, and here look at how the E-estimator correctly
increases (with respect to the histogram) the left mode and decreases the right mode. A
similar outcome is observed in the Strata diagram where the E-estimator also corrects the
histogram. Let us also shed light on the underlying coefficients of difficulty. For instance, for
the case of B(x) = 0.1 + 0.9x, the coefficients of difficulty for our 4 corner densities are 1.4, 1.1,
1.1 and 1.3, respectively.
Figure 3.2 allows us to zoom in on biased data and quantify the quality of estimation. The
diagrams are similar to those in Figure 3.1, only here confidence bands are added (note that
they are cut below the bottom line of the histogram). In the top diagram the underlying
density is the Bimodal shown by the solid line. The histogram clearly exhibits the biased
data that are skewed to the right. The E-estimate (the dashed line) exhibits the underlying
density fairly well, and the 0.95-level confidence bands shed additional light on the quality
of estimation. The subtitle shows the integrated squared error (ISE) of the E-estimate,
the estimated coefficient of difficulty and the sample size. The bottom diagram exhibits
a simulation with the underlying density of interest being the Strata. The left stratum is
estimated worse than the right one, and we see that the underlying left stratum is beyond
the pointwise band but still within the simultaneous one. While the E-estimate is far from
being perfect, it does indicate the two pronounced strata and even shows that the left mode
is larger than the right one despite the heavily right-skewed biased data.
Figure 3.2 Density estimation for biased data. Results of two simulations are exhibited for the
Bimodal and the Strata underlying densities f^{X*}(x). Simulations and the structure of diagrams are
similar to Figure 3.1 only here 1−α confidence bands, explained in Section 2.6, are added. The 1−α
pointwise and simultaneous bands are shown by the dotted and dot-dashed lines, respectively. The
exhibited confidence bands are truncated from below by the bottom line of the histogram. {Parameters
of the biasing function are controlled by set.B = c(a,b), underlying densities are chosen by set.corn,
and α is controlled by the argument alpha.} [n = 200, set.B = c(0.2,0.8), set.corn = c(3,4), alpha
= 0.05, cJ0 = 3, cJ1 = 0.8, cTH = 4]
One theoretical remark is due about the proposed E-estimator. Its natural competitor
is the ratio estimator, which is based on formula (3.1.3). The ratio estimator is defined as
the E-estimator f̂^X(x), based on a biased sample, divided by B(x)/P̂. It is possible to show
that the proposed E-estimator is more efficient than the ratio estimator, and this is why it
is recommended. On the other hand, the appealing feature of the ratio estimator is its
simplicity.
We finish this section with a remark about the relation between biased and
missing data. Recall that a biased sample may be generated by sequential sampling from
X* when some of the realizations are missed. Further, the sample size n of biased observations
corresponds to a larger random sample size N (stopping time) of the hidden sample
from X*. One may think that N − n observations in the hidden sample from X* are missed.
This thinking bridges the biased data and the missing data. Furthermore, there is an important
lesson that may be learned from the duality between biasing and missing. We know
that unless the biasing function is known, consistent estimation of the density of X* is impossible.
Hence, missing data may preclude us from consistent estimation of an underlying
density f^{X*} unless some additional information about the missing mechanism is available.
In other words, missing may completely destroy information about density contained in a
hidden sample. In the next chapters we will consider such missing mechanisms and refer to
them as destructive missing.
The function D(x) makes the expression in the square brackets a bona fide conditional
density, and it is defined as

D(x) := 1/E{B(X, Y*)|X = x} = E{[1/B(X, Y)]|X = x},  (3.2.3)

and f^X(x) is the design density of the predictor X supported on [0, 1], and it is assumed
that f^X(x) ≥ c_* > 0.
Let us explain how a sample with biased responses, satisfying (3.2.1), may be generated.
First, a sample X_1, . . . , X_n from X is generated according to the design density f^X(x).
Then for each X_l, a single biased observation Y_l is generated according to the algorithm of
Section 3.1 with (notation of that section is used) the density f^{X*}(y) = f^{Y*|X}(y|X_l) and
the biasing function B(y) := P(A = 1|X* = y, X_l) = B(X_l, y), where A is the Bernoulli
variable. Note that the difference between this sampling and the sampling in Section 3.1 is that
here biased responses are generated n times with different underlying densities and different
biasing functions.
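The following R sketch mimics this sampling scheme. The particular acceptance probability B(x, y) and the hidden conditional density of Y* given X_l below are illustrative assumptions (they are not the settings of Figure 3.3), and B is clipped to [0, 1] so that it is a valid probability.

```r
# Minimal sketch of generating biased responses by sequential missing.
set.seed(2)
n <- 100
B <- function(x, y) pmin(pmax(0.3 + 0.05 * x + 0.1 * y, 0), 1)  # illustrative, clipped to [0,1]
X <- runif(n)                                  # Uniform design density on [0, 1]
Y <- numeric(n)
for (l in 1:n) {
  repeat {                                     # draw hidden Y* until it is accepted (A = 1)
    ystar <- 1 + X[l] + 0.5 * rnorm(1)         # hidden f^{Y*|X}(y|X_l), an assumption
    if (rbinom(1, 1, B(X[l], ystar)) == 1) { Y[l] <- ystar; break }
  }
}
```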
It is a nice exercise to check that the above-described biasing mechanism implies (3.2.1).
Write,

f^{Y|X}(y|x) = f^{Y*|A,X}(y|1, x) = f^{Y*,A|X}(y, 1|x)/P(A = 1|X = x)
= f^{Y*|X}(y|x)P(A = 1|Y* = y, X = x)/P(A = 1|X = x).  (3.2.4)
According to the above-described biasing algorithm, P (A = 1|Y ∗ = y, X = x) = B(x, y),
and we also get that
Figure 3.3 Regression with biased responses. The biasing function is B(x, y) = b1 + b2 x + b3 y,
the design density is the Uniform, and the hidden regression is Y ∗ = [m(X) + 3σ] + σξ where
m(x) is a corner function, ξ is a standard normal regression error which is independent of the
predictor X, and σ is a parameter. The two diagrams correspond to the Uniform and the Normal
functions m(x), and the regression functions m(x) + 3σ are shown by the solid lines. Biased data
are shown by circles. Regression E-estimate (the dashed line) and the naïve regression E-estimate
(the dotted line), based on biased data, have integrated squared errors indicated as ISE and ISEN ,
respectively. {Parameters of the biasing function are controlled by set.B = c(b1 , b2 , b3 ) and note that
the biasing function must be positive, underlying regressions are chosen by the argument set.corn, σ
is controlled by the argument sigma.} [n = 100, sigma = 1, set.B = c(0.3,0.5,2), set.corn = c(1,2),
c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 4]
As usual, we use E-estimation methodology for estimating D(x). Using (3.2.3) we can
write for its Fourier coefficients

κ_j := ∫_0^1 D(x)ϕ_j(x)dx = ∫_0^1 E{[1/B(x, Y)]|X = x}ϕ_j(x)dx = E{ϕ_j(X)/[f^X(X)B(X, Y)]}.  (3.2.6)

This implies the plug-in sample mean estimator

κ̂_j := n^{-1} Σ_{l=1}^n ϕ_j(X_l)/[max(f̂^X(X_l), c/ln(n))B(X_l, Y_l)],  (3.2.7)

where f̂^X(x), x ∈ [0, 1] is the E-estimator of the density f^X(x) based on X_1, . . . , X_n (recall
that the predictor is not biased).
The Fourier estimator (3.2.7) yields the E-estimator D̂(x), x ∈ [0, 1].
Now we are ready to explain how we can estimate Fourier coefficients of the regression
function m(x) := E{Y*|X = x}. Using (3.2.1), Fourier coefficients of m(x), x ∈ [0, 1] can be
written as follows,

θ_j := ∫_0^1 m(x)ϕ_j(x)dx = ∫_0^1 [∫_{-∞}^{∞} y f^{Y*|X}(y|x)dy] ϕ_j(x)dx
= ∫_0^1 [∫_{-∞}^{∞} y f^{Y|X}(y|x)[B(x, y)D(x)]^{-1} dy] ϕ_j(x)dx
= E{Yϕ_j(X)/[f^X(X)B(X, Y)D(X)]}.  (3.2.8)

This yields the following plug-in sample mean estimator of θ_j,

θ̂_j := n^{-1} Σ_{l=1}^n Y_l ϕ_j(X_l)/[max(f̂^X(X_l), c/ln(n))B(X_l, Y_l)D̂(X_l)].  (3.2.9)
There is one useful remark about the plug-in D̂(X_l). It follows from (3.2.2) and (3.2.3) that
if B(x, y) ≤ c_B < ∞ (recall that the biasing function is known), then D(x) ≥ 1/c_B.
Then D̂(X_l), used in (3.2.9), may be truncated from below by 1/c_B, that is, we may plug
in max(D̂(X_l), 1/c_B).
Fourier estimator (3.2.9) yields the regression E-estimator m̂(x), x ∈ [0, 1] defined in
Section 2.3.
As we see, the E-estimator for the regression with biased response is more complicated
than the one for the regular regression proposed in Section 2.3 because now we need to
estimate the nuisance function D(x).
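To make the bookkeeping concrete, the following is a minimal R sketch of the two plug-in steps (3.2.7) and (3.2.9). The functions fX.hat and D.hat stand for the density and nuisance E-estimators, which are assumed to be available and are not reproduced here; cc plays the role of the parameter c.

```r
# Minimal sketch of the Fourier estimators (3.2.7) and (3.2.9). fX.hat(x) and D.hat(x)
# are assumed to be the E-estimators of f^X(x) and D(x) (not shown); B(x, y) is the
# known biasing function.
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
kappa.hat <- function(j, X, Y, fX.hat, B, cc = 1) {
  d <- pmax(fX.hat(X), cc / log(length(X)))              # truncated density estimate
  mean(phi(j, X) / (d * B(X, Y)))                        # formula (3.2.7)
}
theta.hat <- function(j, X, Y, fX.hat, D.hat, B, cc = 1) {
  d <- pmax(fX.hat(X), cc / log(length(X)))
  mean(Y * phi(j, X) / (d * B(X, Y) * D.hat(X)))         # formula (3.2.9)
}
```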
Figure 3.3 allows us to test the proposed estimator, and its caption explains the simu-
lation and the diagrams. The top diagram shows the scattergram of regression with biased
responses when the underlying regression is the Uniform plus 3. Note the high volatility
of the biased data. The biasing clearly skews data up, and this is highlighted by the naïve
regression E-estimate of Section 2.3 (the dotted line) based solely on the biased data. As we
know, without information about biasing, a regression estimator cannot be consistent. The
proposed regression estimator (the dashed line) is almost perfect, and this is highlighted
by the small ISE. The bottom diagram shows a similar simulation for the Normal plus 3
underlying regression function. Here we have an interesting divergence between the two es-
timates. The naı̈ve one is better near the mode and worse otherwise. The integrated squared
errors quantify the quality of estimation.
It is worthwhile to repeat Figure 3.3 with different parameters and learn to read scat-
tergrams with biased responses.
where

D = 1/E{B(X*, Y*)} = E{1/B(X, Y)}  (3.3.2)

is a constant that makes the joint density bona fide. Further, the biasing function B(x, y)
is known and is bounded below from zero.
As a result, unless E{B(x, Y ∗ )|X ∗ = x} is a constant (an example of the latter is B(x, y) =
B(y)), the observed predictor is also biased. This is what differentiates models (3.3.1) and
(3.2.1).
Let us present a particular example and then explain how biased data may be generated
via a sequential missing algorithm. Recall the example of Section 3.1 about the distribution
of the ratio of alcohol in the blood of liquor-intoxicated drivers based on routine police
reports on arrested drivers. It was explained that the data in reports was biased given that
a drunker driver was more likely to be stopped by the police. Suppose that now we are
interested in the relationship between the level of alcohol and the age (or income level) of
the driver. If it is reasonable to assume that both the level of alcohol and the age (income
level) are the factors defining the likelihood of the driver to be stopped (as the thinking
goes, your wheels give clues to your age, gender, income level and marital status) then both
the level of alcohol and age (income) in the reports are biased.
A possible method of simulation of the biased data is based on sequential missing.
There is an underlying hidden sequential sampling from the triplet (X*, Y*, A) where A is a
Bernoulli random variable such that P(A = 1|X* = x, Y* = y) = B(x, y) satisfying (3.3.3).
If (X_1*, Y_1*, A_1) is the first hidden realization of the triplet, then we observe (X_1, Y_1) :=
(X_1*, Y_1*) if A_1 = 1 and skip the hidden realization otherwise. Then the hidden simulation
continues until n observations of (X, Y) are available.
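A minimal R sketch of this sequential missing mechanism is given below. The acceptance probability B(x, y) and the hidden pair (X*, Y*) are illustrative assumptions; B is clipped to [0, 1] so that it is a valid probability.

```r
# Minimal sketch of sequential missing for biased predictors and responses.
set.seed(3)
n <- 100
B <- function(x, y) pmin(pmax(0.2 + 0.3 * x + 0.1 * y, 0), 1)  # illustrative, clipped to [0,1]
X <- Y <- numeric(0)
while (length(X) < n) {
  xstar <- runif(1)                            # hidden predictor X*
  ystar <- 1 + xstar + 0.5 * rnorm(1)          # hidden response Y* (an assumption)
  if (rbinom(1, 1, B(xstar, ystar)) == 1) {    # A = 1: the hidden pair is observed
    X <- c(X, xstar); Y <- c(Y, ystar)
  }                                            # A = 0: the hidden pair is skipped
}
```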
Let us check that the simulated sample satisfies (3.3.1). For the joint density of the
observed pair (X, Y) we can write,

f^{X,Y}(x, y) = f^{X*,Y*|A}(x, y|1) = f^{X*,Y*,A}(x, y, 1)/P(A = 1)
= f^{X*,Y*}(x, y)P(A = 1|X* = x, Y* = y)/P(A = 1).  (3.3.5)
If we compare (3.3.5) with (3.3.1), then we can conclude that the formulas are identical
because B(x, y) = P(A = 1|X ∗ = x, Y ∗ = y) and D = 1/P(A = 1).
Now let us explain how an underlying regression function m(x) := E{Y ∗ |X ∗ = x}
can be estimated by the regression E-estimator of Section 2.3. Following the E-estimation
methodology, we need to understand how to estimate Fourier coefficients θj of the regression
function. The approach is to write down Fourier coefficients as an expectation and then
mimic the expectation by a sample mean estimator. Write,
θ_j := ∫_0^1 m(x)ϕ_j(x)dx = ∫_0^1 [∫_{-∞}^{∞} y[f^{X*,Y*}(x, y)/f^{X*}(x)]dy] ϕ_j(x)dx.  (3.3.6)
Using (3.3.1) we get the following expression for the marginal density f^{X*}(x) (compare with (3.3.4)),

f^{X*}(x) = f^X(x)E{[1/B(X, Y)]|X = x}/D.  (3.3.7)
Using this formula in (3.3.6), together with (3.3.1), we continue,

θ_j = ∫_0^1 [∫_{-∞}^{∞} y f^{X,Y}(x, y)/[D B(x, y) f^{X*}(x)] dy] ϕ_j(x)dx
= E{Yϕ_j(X)/[B(X, Y)f^X(X)E{[1/B(X, Y)]|X}]}.  (3.3.8)
This is a pivotal formula that sheds light on the possibility of estimating θ_j. First, we need
to estimate two nuisance functions: f^X(x), x ∈ [0, 1], and D(x) := E{[1/B(X, Y)]|X = x}.
In its turn, Fourier estimator (3.3.11) yields the E-estimator D̂(x), x ∈ [0, 1].
The two nuisance functions are estimated, and then (3.3.8) yields the following plug-in
sample mean estimator of Fourier coefficients θj of the regression function m(x),
θ̂_j := n^{-1} Σ_{l=1}^n Y_l ϕ_j(X_l)/[B(X_l, Y_l) max(f̂^X(X_l), c/ln(n)) D̂(X_l)].  (3.3.12)
One remark about (3.3.12) is due. If the biased data is created by a missing mechanism,
then D(x) ≥ 1, and then its E-estimator may be truncated from below by 1.
Apart from estimation of the regression, in applied examples it may be of interest to
estimate the marginal densities f^{X*} and f^{Y*} of the hidden predictor X* and the hidden
response Y*. We estimate these densities in turn.
Estimation of the hidden design density f^{X*}(x) is based on the following useful probability
formula. We divide both sides of (3.3.1) by DB(X, Y), then integrate both sides with
respect to y, use (3.3.2), (3.3.9) and get

f^X(x) = f^{X*}(x) D/E{[1/B(X, Y)]|X = x} = f^{X*}(x) D/D(x).  (3.3.13)
Formula (3.3.13) tells us that the density f^X(x) of the observable variable X is biased
with respect to the density of interest f^{X*}(x) with the biasing function 1/D(x). Because
we already constructed the E-estimator D̂(x), it can be used in place of the unknown D(x),
and then f^{X*}(x), x ∈ [0, 1] can be estimated by the plug-in density E-estimator of Section 3.1.
For the density of the hidden response Y*, again using (3.3.1) we obtain the following
useful formula (compare with (3.3.13))

f^Y(y) = f^{Y*}(y) D/E{[1/B(X, Y)]|Y = y}.  (3.3.14)
[Figure 3.4 panels. Subtitles of the two columns: n = 100, ISE = 0.0086, ISEN = 0.014 (left) and n = 100, ISE = 0.0019, ISEN = 0.013 (right).]
Figure 3.4 Regression with biased predictors and responses. The underlying regression and the bi-
asing function B(x, y) are the same as in Figure 3.3 whose caption explains the parameters. Two
columns of diagrams correspond to simulations with different underlying regression functions. A top
diagram shows the scattergram of biased data by circles overlaid by the underlying regression (the
solid line), the proposed regression E-estimate (the dashed line) and the naïve regression E-estimate
of Section 2.3 based on biased data and not taking into account the biasing. The corresponding in-
tegrated squared errors of the two estimates are shown as ISE and ISEN. The middle and bottom
diagrams show histograms of biased predictors and responses overlaid by the underlying marginal
densities (the solid line) and their E-estimates (the dashed line). These densities are shown over
the empirical range of biased observations. [n = 100, sigma = 0.5, set.B = c(0.2,0.5,1), set.corn
= c(1,2), c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 4]
Sure enough, the density f^Y(y) is biased with respect to the density of interest f^{Y*}(y) with the
biasing function 1/E{[1/B(X, y)]|Y = y}. The conditional expectation may be estimated
similarly to how D(x) was estimated, and then the density E-estimator of Section 3.1 may
be used.
Let us stress one more time that we need to know the biasing function B(x, y) to solve
the regression and density estimation problems.
Figure 3.4 allows us to test performance of the proposed regression estimator and the
marginal density estimators. Its diagrams and curves are explained in the caption, and note
that while the underlying regression and even the biasing function are the same as in Figure
3.3, the biased data are generated according to (3.3.1) and hence the data are different from
those generated according to formula (3.2.1) in Section 3.2.
Now let us look at the scattergrams. We begin with the left column where the underlying
regression is created by the Uniform and it is shown by the solid line in the left-top diagram.
First of all, note that we are dealing with a large volatility in the biased data. It looks like
there are several modes in an underlying regression, but we do know that this is not the
case. The dotted line shows the naïve regression estimate of Section 2.3 which ignores
the known information that the data are biased. It does indicate modes, and recall that
the E-estimator follows the data while we know the underlying simulation and the hidden
regression function. The proposed regression E-estimator, which takes into account the
biased nature of predictors and responses, also shows the same modes but it is much closer
to the underlying regression. The latter is also supported by the values ISE = 0.0086 and
ISEN = 0.014 indicated in the subtitle. Overall, despite a not quite perfect shape, the performance of
the E-estimator is impressive keeping in mind the large volatility in the biased data. The
middle and bottom diagrams are devoted to estimation of the hidden marginal densities.
The density E-estimate of predictor is perfect despite the histogram of biased predictors
exhibiting several modes. The bottom diagram is even more interesting because it allows
us to look at the elusive marginal density of the response. Note that the histogram is
asymmetric, and the density E-estimate is also skewed to the left, but overall it is a good
estimate.
The right column shows us results of a simulation with the regression function created
by the Normal. Again, the top diagram highlights the large volatility of data. Further,
the scattergram and the underlying regression (the solid line) highlight the biased nature
of the data (just notice that the data are skewed up). The latter is highlighted by the
dotted line of the naïve regression estimate which goes above the underlying regression
(the large volatility attenuates the difference). The proposed estimator (the dashed line)
practically coincides with the underlying regression. Further, look at the corresponding
integrated squared errors and note that taking into account the biased nature of the data yields
almost a seven-fold decrease in the integrated squared error. The middle diagram is of
special interest on its own. Here we can visualize the histogram of biased predictors and note
the influence of the regression function on the distribution of observed biased predictors.
It may be a good exercise to write down the density f X (x) for this simulation and then
analyze it. The proposed marginal density E-estimator does a perfect job for this heavily
biased data, and recall that this estimator is rather involved and based on estimation of the
nuisance function D(x). The bottom diagram is even more interesting because here we are
dealing with the elusive marginal density of responses. Note that the hidden marginal density
f^{Y*}(y) has a peculiar asymmetric shape with two modes. Further, look at how “disturbed”
the histogram is, and how it magnifies the tiny right mode of the underlying density and, at
the same time, diminishes the main left mode. Keeping in mind complexity of the marginal
density estimation, which involves estimation of a nuisance conditional expectation, the E-
estimator does an impressive job in exhibiting the shape of the underlying marginal density
of the response Y ∗ .
The studied nonparametric regression problem with biased predictors and responses is
a complicated one, both in terms of the model and the solution. It is highly advisable to
repeat Figure 3.4, with both default and new arguments, and learn more about the biased
regression and its consequences.
A remark about a regression with biased predictors is due. It is worthwhile to explain the
problem via a particular example of the corresponding data modification. There is a hidden
simulation from pair (X ∗ , A) where X ∗ is a continuous variable (predictor) supported on
[0, 1] and f^{X*}(x) ≥ c_* > 0, and A is a Bernoulli random variable generated according to
the conditional probability P(A = 1|X* = x) =: B_0(x) ≥ c_0 > 0. Denote the first realization
of the pair as (X_1*, A_1). If A_1 = 1, then X_1 := X_1* is observed, next the response Y_1
is generated according to the conditional density f^{Y|X*}(y|X_1), and the first realization
(X_1, Y_1) of a regression sample with biased predictors is obtained. If A_1 = 0, then the
realization (X_1*, A_1) is skipped. Then the next realization of the pair (X*, A) occurs. The
sequential sampling stops whenever n realizations (X_1, Y_1), . . . , (X_n, Y_n) are collected. The
problem is to estimate the regression of the response Y on the hidden predictor X*, that
is, we want to estimate m(x) := E{Y|X* = x} = ∫_{-∞}^{∞} y f^{Y|X*}(y|x)dy.
Let us explore the regression function for the considered regression model with biased
predictors. To do this, it suffices to find a convenient expression for the conditional density
f Y |X (y|x). We begin with the corresponding joint density,
f^{Y,X}(y, x) = f^{Y,X*|A}(y, x|1) = f^{Y,X*,A}(y, x, 1)/P(A = 1)
= f^{Y,X*}(y, x)P(A = 1|Y = y, X* = x)/P(A = 1).  (3.3.15)
According to the considered biased sampling, the equality P(A = 1|Y = y, X ∗ = x) = P(A =
1|X ∗ = x) holds. This equality, together with the relation f Y,X (y, x) = f Y |X (y|x)f X (x),
the inequality f X (x) > 0, x ∈ [0, 1] and (3.3.15), yield
f^{Y|X}(y|x) = f^{Y|X*}(y|x)f^{X*}(x)P(A = 1|X* = x)/[f^X(x)P(A = 1)].  (3.3.16)
Equality (3.3.18) sheds light on the case of biased predictors in the regression setting.
Here, despite the fact that X is biased, we have the equality (3.3.17) which implies that the
regressions of Y on X and Y on X* are the same. If you think about this outcome, it may
seem either plain or confusing. If the latter is the feeling, then think about the fact that X
is equal to X* whenever X* is observed, and then Y is generated according to f^{Y|X*}. Of
course, the same conclusion can be made from our general formula (3.3.1) when the biasing
function B(x, y) = B(x).
Finally, let us note that there is a special (and quite different) notion of unbiased predictors
in finance theory, namely that forward exchange rates are unbiased predictors of future
spot rates. In general, forward exchange rates are widely regarded as a good predictor of
future spot rates. For instance, any international transaction involving foreign exchange is
risky due to unexpected changes in currency exchange rates. A forward contract can be used
to lower such risk, and as a result, the relation between the forward exchange rate and the
corresponding future spot rate is of great concern for investors, portfolio managers, and
policy makers. Forward rates are often expected to be an unbiased estimator of corresponding
future spot rates. It is possible to explore this problem, using our nonparametric technique,
via statistical analysis of the joint distribution of the forward and spot rates.
3.4 Ordered Grouped Responses
So far we have explored problems where underlying data were modified by a biasing
mechanism caused, for instance, by observing a realization of a hidden sampling only if a
specific event occurs. In this section we are considering another type of modification when
it is only known that an underlying observation belongs to a specific group of possible
observations. A group, depending on a situation and tradition, can be referred to by many
names, for instance the stratum, category, cluster, etc.
Let us present several motivating examples. Strata, categories or clusters may define the
socioeconomic statuses of a population: (i) Lowest 25 percent of wage earners; (ii) Middle 50
percent of wage earners; and (iii) Highest 25 percent of wage earners. A car may be driven
with speed below 25, between 25 and 45, or above 45 miles per hour. A patient may have
no pain, mild pain, moderate pain, severe pain, or acute pain. A patient in a study drinks
no beer a day, 1 beer a day, more than 1 but fewer than 2 beers a day, and at least 2 beers
a day. The overall rating of a proposal can be poor, fair, good, very good, or excellent.
In the above-presented examples, there is a logical ordering of the groups and hence they
may be referred to as ordinal responses. To finish with the terminology, nominal responses
have no natural logical ordering; examples are the color of eyes or the place of birth of a
respondent to a survey.
Classical examples of nonparametric regression with grouped responses are the predic-
tion of how a dosage of this or that medicine affects pain, or how the length of a rehabilitation
program affects drug addiction, or how the quality of published papers affects the rating of
a proposal.
To shed light on grouped (categorical, strata, cluster) nonparametric regression, let us
consider the numerically simulated data shown in Figure 3.5. The left diagram shows an
example of simulated classical additive regression Y ∗ = m(X)+ση which is explained in the
caption. The small sample size n = 30 is chosen to improve visualization of each observation.
The scatter plot is overlaid by boundaries for 4 ordered groups: Y ∗ < −1, −1 ≤ Y ∗ < 1,
1 ≤ Y ∗ < 3, and 3 ≤ Y ∗ . Then the data are modified by combining the responses into
the above-highlighted groups (categories) shown in the right diagram. Thus, instead of the
hidden underlying pairs (Xl , Yl∗ ), where Yl∗ = m(Xl ) + σηl , we observe modified pairs
(Xl , Yl ) where Yl is the number of a group (cell, category, stratum, etc.) for an unobserved
Yl∗ . Figure 3.5 visually stresses the loss of information about the underlying regression
function, because grouped data give no information on how underlying unobserved responses
are spread out over cells. Please look at the right diagram and imagine that you need to
visualize an underlying regression function. Further, the fact that heights of cells may be
different makes the setting even more complicated.
The interesting (and probably unexpected) feature of the grouped regression is that the
regression noise may help to recover an underlying regression. Indeed, consider a case where
a regression function is m(x) = 0, σ = 0 and cells are as shown in Figure 3.5. Then the
available observations are (Xl , 2), l = 1, 2, . . . , n and there is no way to estimate the under-
lying regression function. Further, even if there are additive errors but their range is not
large enough, for instance σηl are uniform U (−0.99, 0.99), then the modified observations
are again (Xl , 2), l = 1, . . . , n.
It is a good exercise to repeat Figure 3.5 with different arguments and get used to this
special type of data modification.
Now we are ready to explain how an underlying regression function may be estimated
based on observed grouped responses.
In what follows it is assumed that the underlying regression model is
Y = m(X) + ε, (3.4.1)
where X is supported on [0, 1] and f X (x) ≥ c∗ > 0, and the regression error ε is a continuous
random variable with zero mean, finite variance and independent of the predictor X.
We begin with the parametric case m(x) = θ and the model of grouped data shown in
Figure 3.5. Let p̄ be the proportion of observations that have categories 3 or 4. Then the
probability P(θ + ε ≥ 1) =: p, which is the theoretical proportion of observations in the
third and fourth categories, is
p = P(ε ≥ 1 − θ) = 1 − F^ε(1 − θ).  (3.4.2)

By solving this equation we get a natural estimate of θ,

θ̄ = 1 − Q^ε(1 − p̄),  (3.4.3)

where Q^ε(α) is the quantile function, that is, P(ε ≤ Q^ε(α)) = α.
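As a quick numerical illustration, assuming a standard normal regression error (so that qnorm plays the role of Q^ε), formula (3.4.3) is a one-liner in R; the value of p̄ below is purely illustrative.

```r
# Sketch of the parametric estimate (3.4.3) assuming a standard normal regression error,
# so that the quantile function Q^eps is qnorm; p.bar is the observed proportion of
# responses in categories 3 or 4 (illustrative value).
p.bar <- 0.62
theta.bar <- 1 - qnorm(1 - p.bar)   # formula (3.4.3)
```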
Note that we converted the problem of grouped regression into Bernoulli regression
discussed in Section 2.4. The latter is the underlying idea of the proposed solution.
There are three steps in the proposed regression estimator for regression with grouped
responses.
Figure 3.6 E–estimates for nonparametric regression with grouped responses. The underlying sim-
ulation is the same as in Figure 3.5. Dotted and dashed lines show estimates p̂ and m̂ of the binary
probabilities and the regression function, respectively. The solid line is the underlying regression
function. [n = 100, set.corn = c(2,3), sigma = 1, bound.set = c(-50,-1,1,3,50), a = 0.005, b =
0.995, cJ0 = 3, cJ1 = 0.8, cTH = 4]
Step 1. Combine the ordered groups into two groups of “successes” and “failures.” Ide-
ally, the boundary in responses that separates these two groups should be such that both
successes and failures spread over the domain of predictors. For instance, for the example
shown in Figure 3.5, the only reasonable grouping is {(1, 2), (3, 4)}.
Step 2. Use the Bernoulli regression E-estimator p̂(x) of Section 2.4 to estimate the
probability of a success as a function of x. If no information about the regression error ε is
given, this is the last step. If the distribution of ε is given, then go to step 3.
Step 3. This step is based on the assumption that the distribution of ε is known. Assume
that an observed Y_l belongs to the success group iff Y_l ≥ c_*, where c_* is a constant. Then

m̂(x) = c_* − Q^ε(1 − [p̂(x)]_a^b).  (3.4.4)

Here [z]_a^b = max(a, min(z, b)) is the truncation (or we can say projection) of z onto the interval
[a, b]. The truncation allows us to avoid infinite values for m̂. The “default” values of a and
b are 0.005 and 0.995.
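A minimal R sketch of Step 3 is shown below. It assumes a standard normal regression error (so Q^ε is qnorm) and the success boundary c_* = 1 of Figure 3.5, and it takes the Step 2 estimate p̂(x), evaluated on a grid, as given.

```r
# Minimal sketch of Step 3, formula (3.4.4). Assumptions: standard normal error (Q^eps
# is qnorm), success boundary c* = 1 as in Figure 3.5; p.hat is the Step 2 Bernoulli
# regression estimate evaluated on a grid of x values.
m.hat.from.p <- function(p.hat, c.star = 1, a = 0.005, b = 0.995) {
  p.trunc <- pmax(a, pmin(p.hat, b))   # the truncation [p.hat(x)]_a^b avoids infinite quantiles
  c.star - qnorm(1 - p.trunc)          # formula (3.4.4)
}
```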
Let us check how the proposed estimator performs. Figure 3.6 exhibits results of two
simulations in two columns of diagrams. Underlying regression functions are the Normal
and the Bimodal shown by the solid lines in the bottom diagrams. The regression errors
are standard normal. The estimates p̂(x) and m̂(x) are shown by dotted and dashed lines,
respectively. The datasets are simulated according to Figure 3.5, only here the sample size
n = 100. The estimates p̂(x) (the dotted lines) look not too impressive but not too bad either
keeping in mind the complexity of the grouped data. After all, we could observe similar
shapes in estimates based on n = 100 direct observations. The estimate for the Bimodal
(see the right-bottom diagram) has a wrong and confusing left tail, but it corresponds to
the left tail of the grouped data exhibited in the right-top diagram.
Knowing the distribution of regression error ε dramatically improves the visual appeal
of estimates m̂(x) shown in the bottom diagrams of Figure 3.6 by the dashed lines. The
estimate for the Normal is truly impressive keeping in mind both complexity of the setting
and the small sample size. The estimate for the Bimodal is also a significant improvement
both in terms of the two pronounced modes and their magnitudes (just compare with the
dotted line which shows the estimate p̂(x)).
The reader is advised to repeat this figure with different arguments and get used to this
particular data modification and the proposed estimates.
The proposed estimator is not optimal because it is based on creating just two groups
from the existing groups. Nevertheless, the suggested estimator is asymptotically rate optimal;
it is a good choice for the case of small sample sizes, where typically only several groups
contain a majority of responses, and its simplicity is appealing.
3.5 Mixture
This section presents a new type of data modification that occurs in a number of practical
applications, and it will be explained via a regression example.
There is an underlying sample of size n from pair (X, Y ) where X is the predictor and
Y is the response. It is known that X is a continuous random variable supported on [0, 1],
Y is Bernoulli and P(Y = 1|X = x) = m(x). The problem is to estimate the conditional
probability m(x). As we know from Section 2.4, the problem may be treated as a Bernoulli
regression because
m(x) := E{Y |X = x}. (3.5.1)
If the sample from (X, Y ) is available, then the E-estimator of Section 2.4 can be used. In
the considered mixture model, the responses are hidden and instead we observe realizations
from (X, Z) where
Z = Y ζ + (1 − Y )ξ. (3.5.2)
Here ζ and ξ are random variables with known and different mean values µζ and µξ .
As we can see, the mixture (3.5.2) is a special modification of an underlying variable of
interest Y .
One of the classical practical examples of the mixture is a change-point problem in
observed time series where Xl = l/n is time and Y = 1 if an object functions normally and
Y = 0 if the object functions abnormally. Then Equation (3.5.2) tells us that while we do
not observe Y directly, observations of ζ correspond to the case where the object functions
normally and observations of ξ correspond to the case where it functions abnormally. Then
changing the regression m(X) from 0 to 1 implies that the object recovers from abnormal
functioning.
Now let us propose an E-estimator for the underlying regression function (3.5.1). In
what follows it is assumed that in model (3.5.2) µζ ≠ µξ and that X is independent of ζ
and ξ. Introduce a scaled version of the observed Z defined as
Z′ := (Z − µξ)/(µζ − µξ).  (3.5.3)

The underlying idea of the new random variable Z′ is based on the following relation:

E{Z′|X = x} = E{(Z − µξ)/(µζ − µξ)|X = x} = E{(Yζ + (1 − Y)ξ − µξ)/(µζ − µξ)|X = x}
= [m(x)µζ + (1 − m(x))µξ − µξ]/(µζ − µξ) = m(x).
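In R the rescaling (3.5.3) is a single line; the values of µζ and µξ below are those used in Figure 3.7, and the regression E-estimator that would then be applied to Z′ is not reproduced here.

```r
# Minimal sketch of the rescaling (3.5.3); mu.zeta and mu.xi are the values used in
# Figure 3.7. The rescaled Z' has conditional mean m(x) and can be treated as a direct
# response by the regression E-estimator of Section 2.3 (not shown here).
mu.zeta <- 0; mu.xi <- 1
Zprime <- function(Z) (Z - mu.xi) / (mu.zeta - mu.xi)
```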
Figure 3.7 E-estimation for mixture regression. The underlying regression (the solid line) and its E-estimate
(the dashed line) overlay the scattergram shown by circles. In the top and bottom diagrams
the underlying regressions are the Normal and the Bimodal divided by their maximum
value. If the Uniform is chosen, then it is equal to 3/4. Realizations of X are equidistant and
hence observations imitate a time series. {Random variable ξ is Normal(muxi, (sdxi)^2) and ζ is
Normal(muzeta, (sdzeta)^2).} [n = 100, set.corn = c(2,3), muxi = 1, muzeta = 0, sdxi = 0.6, sdzeta
= 0.9, cJ0 = 4, cJ1 = 0.8, cTH = 4]
Here ε is a zero-mean, unit-variance random variable independent of X, the nonnegative
function σ(x) is called the scale (spread or volatility) function, and the predictor X is
supported on [0, 1] with f^X(x) ≥ c_* > 0.
The traditional regression problem is to estimate the function m(x), and then the design
density f^X(x) and the scale function σ(x) become nuisance functions. These two functions may
be of interest on their own. We know from Section 2.2 how to estimate f^X(x), x ∈ [0, 1]
based on the observed predictors. In this section our task is to estimate the scale σ(x),
x ∈ [0, 1].
In the statistical literature the same problem of estimating the scale function may be re-
ferred to as either estimation of a nuisance function in a regression problem or as estimation
based on data modified by a nuisance regression function.
Let us explain the latter formulation of the problem of scale estimation. There are hidden
observations Z_l = σ(X_l)ε_l of the scale function. If they were available, we could convert
estimation of the scale into a regression problem. To do this, we write

Z_l^2 = σ^2(X_l) + σ^2(X_l)(ε_l^2 − 1),  l = 1, 2, . . . , n,  (3.6.2)

where the new additive error σ^2(X_l)(ε_l^2 − 1) has zero conditional mean because ε has unit variance.
Figure 3.8 Estimation of the scale function. Each diagram exhibits observations (scattergram) of
a heteroscedastic regression generated by the Uniform design density, a regression function (the
Normal and the Bimodal in the top and bottom diagrams), and the scale function σ(x) = s + σf (x)
where f (x) is the Normal, and s and σ are positive constants. In each diagram an underlying
regression function is shown by the dotted line and its E-estimate by the dot-dashed line, while an
underlying scale function and its E-estimate are shown by the solid and dashed lines, respectively.
{Underlying regression functions are controlled by the argument set.corn, parameter σ by sigma,
parameter s by s, and the choice of f is controlled by argument scalefun.} [n = 100, set.corn =
c(2,3), sigma = 1, s = 0.3, scalefun = 2, cJ0 = 3, cJ1 = 0.8, cTH = 4]
As a result, (3.6.2) is the regression model discussed in Section 2.3 and we can use the
regression E-estimator for estimation of the regression σ^2(x). Further, if the E-estimator
takes on negative values, then they are replaced by zero. Taking the square root of
the above-defined estimator yields the estimator σ̃(x, Z_1^n) of the scale function, and here
Z_1^n := (Z_1, Z_2, . . . , Z_n). Of course, in the regression model (3.6.1) realizations of Z are
hidden; instead we observe pairs (X_l, Y_l) such that

Y_l = m(X_l) + Z_l,  l = 1, 2, . . . , n.  (3.6.4)

Equation (3.6.4) explains how the hidden observations Z_l are modified by the nuisance and
unknown function m(X_l).
A natural possible solution of a problem with data modified by nuisance functions is
to first estimate them and then plug them in. In our case we can estimate the regression
function m(x) by the regression E-estimator m̂(x), and then replace the unknown Z_l by
Ẑ_l := Y_l − m̂(X_l).
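A minimal R sketch of this plug-in procedure is given below; loess is used only as a stand-in for the regression E-estimator, which is not reproduced here, and the data (X, Y) are assumed to come from the heteroscedastic model (3.6.1).

```r
# Sketch of plug-in scale estimation: estimate m(x), form residuals Z.hat, regress the
# squared residuals on X, replace negative values by zero, and take the square root.
# loess() is only a stand-in for the regression E-estimator of Section 2.3.
scale.estimate <- function(X, Y) {
  m.fit <- loess(Y ~ X)                  # stand-in estimate of the regression function m(x)
  Z.hat <- Y - predict(m.fit)            # estimated hidden observations Z_l = Y_l - m.hat(X_l)
  Z2 <- Z.hat^2
  s2 <- pmax(predict(loess(Z2 ~ X)), 0)  # regression of squared residuals; negative values -> 0
  sqrt(s2)                               # scale estimate at the design points X_l
}
```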
to estimate the regression function w(x). The proposed E-estimator is based on the E-estimation
methodology of constructing a sample mean estimator of Fourier coefficients
of an underlying regression function. Following the methodology, a Fourier coefficient
θ_j := ∫_0^1 w(x)ϕ_j(x)dx of the regression function w(x), x ∈ [0, 1] can be written as

θ_j = ∫_0^1 E{A*|X* = x}ϕ_j(x)dx = E{A*ϕ_j(X*)/f^{X*}(X*)}.  (3.7.3)

If the design density f^{X*}(x) is unknown, then it is replaced by its E-estimator f̂^{X*}(x) of
Section 2.2 truncated from below by c/ln(n). In its turn, the plug-in Fourier estimator
(3.7.4) yields the regression E-estimator w̃(x) of Section 2.4.
Now we are ready to consider the Bernoulli regression problem with unavailable failures.
The aim is still to estimate the regression function w(x), x ∈ [0, 1] defined in (3.7.1), but
now the sample (X1∗ , A∗1 ), . . . , (Xn∗ , A∗n ) is hidden. Instead, a subsample X1 , . . . , XN of the
predictors X1∗ , . . . , Xn∗ , corresponding to successes, is available and also the sample size n of
the hidden sample is known. The subsampling is done as follows. If A∗1 = 1 then X1 := X1∗ ,
and otherwise X1∗ is skipped. Then this subsampling continues, and finally if A∗n = 1 then
XN := Xn∗ and otherwise Xn∗ is skipped. Note that the number N of available predictors in
the subsample is

N := Σ_{l=1}^n A_l^*.  (3.7.5)
Further, as usual we do not consider settings with N = 0 because there are no data, and
in general we also exclude cases with relatively small N that are not feasible for nonpara-
metric estimation. Let us also note that the available data may be equivalently written as
A∗1 X1∗ , . . . , A∗n Xn∗ or as (A∗1 X1∗ , A∗1 ), . . . , (A∗n Xn∗ , A∗n ).
It is convenient to use a different notation X for the observed predictor in a success case
because the distribution of X is different from the distribution of the underlying predictor
X ∗ . Indeed,
f^X(x) := f^{X*|A*}(x|1) = f^{X*, A*=1}(x)/P(A* = 1) = f^{X*}(x) w(x)/P(A* = 1).  (3.7.6)
This result implies that the observed predictor X has a biased distribution with respect
to the hidden predictor X ∗ , and the biasing function is equal to the regression function
w(x).
Recall that biased distributions and biased data were discussed in Section 3.1. As we
know from that section (and this also follows from (3.7.6)), based on the biased data we
can consistently estimate only the product f^{X*}(x)w(x). The pivotal conclusion is that we
need to know the design density f^{X*}(x) or its estimate for consistent estimation of w(x).
As a result, we are exploring the following path for solving the problem. Formulas (3.7.3)
and (3.7.4) tell us that to estimate Fourier coefficients of the regression function w(x) (and
hence to construct a regression E-estimator), it is sufficient to know only the predictors X_l*
corresponding to A_l* = 1. As a result, it is sufficient to know only the observed predictors
X_1, . . . , X_N. This is good news. The bad news is that we need to know the underlying
density f^{X*}(x) which, as we already know, cannot be estimated based on the available
data.
Suppose that we know the values f^{X*}(X_l), l = 1, . . . , N. Then the regression function may
be estimated solely from the available predictors corresponding to successes in the hidden Bernoulli
sample. Indeed, we may rewrite (3.7.4) as

θ̃_j = n^{-1} Σ_{l=1}^N ϕ_j(X_l)/f^{X*}(X_l).  (3.7.7)
This Fourier estimator yields the regression E-estimator w̃(x), x ∈ [0, 1]. In some practi-
cal applications, when design of predictors is controlled, this conclusion allows us to use this
regression E-estimator. Further, theoretically this E-methodology implies asymptotically (in
n) optimal regression estimation.
If the design density f^{X*}(x) is unknown, then in some situations it may be possible to
get an extra sample X_{E1}*, . . . , X_{Ek}* of size k ≪ n from X*; here ≪ means “significantly
smaller.” Then we may use the extra sample to calculate the density E-estimator f̂^{X*}(x)
and plug it in (3.7.7). Because the density estimator is used in the denominator, it is prudent
to truncate it from below by c/ln(n) where c is the new parameter of the E-estimator. Then
the (plug-in) sample mean estimator of Fourier coefficients of w(x), x ∈ [0, 1] is

θ̂_j := n^{-1} Σ_{l=1}^N ϕ_j(X_l)/max(f̂^{X*}(X_l), c/ln(n)).  (3.7.8)
This Fourier estimator yields the regression estimator ŵ(x), x ∈ [0, 1]. The asymptotic
theory shows that, under a mild assumption, this approach is consistent and implies optimal
MISE (mean integrated squared error) convergence.
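The following R sketch assembles (3.7.8); the built-in density() function is used only as a simple stand-in for the density E-estimator of Section 2.2 applied to the E-sample, and cc plays the role of the parameter c.

```r
# Minimal sketch of the Fourier estimator (3.7.8). X holds the N available predictors
# (successes), n is the size of the hidden Bernoulli sample, and X.extra is the E-sample
# of size k. density() is only a stand-in for the density E-estimator of Section 2.2.
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
theta.hat <- function(j, X, n, X.extra, cc = 1) {
  dens <- density(X.extra)                                    # stand-in density estimate
  fXstar.hat <- approx(dens$x, dens$y, xout = X, rule = 2)$y  # evaluate it at X_1, ..., X_N
  sum(phi(j, X) / pmax(fXstar.hat, cc / log(n))) / n          # formula (3.7.8)
}
```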
One more remark is due. In all future applications of the Bernoulli regression with
unavailable failures, we need to know w(x) only for x ∈ {X1 , . . . , Xn }. This is important
information to know because it means that the range of observations in the E-sample should
be close to the range of available observations X1 , . . . , Xn .
Let us test the proposed E-estimator on several simulated examples. Figure 3.9 presents
the first set of four simulations; its caption explains the diagrams and the simulation. Here
a left diagram shows the histogram of an extra sample of size k from X*; the extra sample is
referred to as an E-sample. An E-sample is used to estimate f^{X*}(X_l), l = 1, . . . , N. Values
of an underlying design density f^{X*}(X_l) and its E-estimate f̂^{X*}(X_l) are shown by circles
and crosses, respectively. A corresponding right diagram shows via circles the observed pairs
(X_l, 1). The size of a hidden sample n and the number of available predictors N = Σ_{l=1}^n A_l^*
are shown in the title. Further, the solid and dashed lines show the underlying regression
w(x) and its oracle-estimate based on all n hidden realizations of (X*, A*). Crosses show
values of ŵ(X_l).
Now we can look at specific simulations and outcomes shown in Figure 3.9. The top row
shows the case of the constant regression w(x) = 3/4. The extra E-sample is tiny (k = 30)
for a nonparametric estimation of the density. The default histogram stresses complexity
of the density estimation which is a linear function shown by the circles. The E-estimate
is surprisingly good here (look at the crosses). It is fair to say that visualization of data
(the histogram) does not help us to recognize the density, and this is why the density
E-estimate is impressive. Then the estimated values of the underlying design density are
plugged in the regression E-estimator (3.7.8), and results are shown in the right-top diagram.
First of all, let us compare the solid line (the underlying regression) and crosses showing
values ŵ(Xl ). The regression estimate is perfect, and this is despite the fact that only
N = 75 observations are available. Interestingly, the oracle’s E-estimate, based on hidden
Figure 3.9 Bernoulli regression with unavailable failures and an extra sample (E-sample) of predictors.
Four rows of diagrams exhibit results of simulations with different underlying regression
functions w(x) shown by the solid line and named in the title of a right diagram. The sizes of a hidden
Bernoulli regression and E-sample are n and k, respectively. A left diagram shows the histogram
of the E-sample and values of the design density f^{X*}(X_l) and its E-estimate f̂^{X*}(X_l), l = 1, . . . , N by
circles and crosses, respectively. A right diagram shows by circles available observations in the
Bernoulli regression, the underlying regression function w(x) (the solid line), oracle's regression
E-estimate (the dashed line) based on n hidden observations, and by crosses values of the proposed
E-estimator ŵ(X_l), l = 1, . . . , N. {The figure allows one to choose different parameters of the E-estimator
for the design density (they are controlled by traditional arguments) and the regression E-estimator.
Further, for the regression E-estimator, parameters cJ0 and cJ1 can be specified for each of the 4
experiments. The latter is done by arguments setw.cJ0 and setw.cJ1. The argument desden controls
the shape of the design density which is then truncated from below by the value dden and rescaled
into a bona fide density. The argument set.k controls sample sizes of E-samples for each row.} [n =
100, set.k = c(30,30,50,50), desden = "1 + 0.5*x", dden = 0.2, c = 1, cJ0 = 3, cJ1 = 0.8, cTH =
4, setw.cJ0 = c(3,3,3,3), setw.cJ1 = c(0.3,0.3,0.3,0.3)]
pairs (X_l*, A_l*), l = 1, 2, . . . , n, is much worse (look at the oscillating dashed line). This is
an interesting outcome but it is rare; in general the oracle-estimate is much better.
The second (from the top) row of diagrams in Figure 3.9 considers the same setting only
with the Normal regression function. Here again only k = 30 extra observations of X* are
available for estimating the design density f^{X*}. Note that the histogram clearly deviates
from the underlying linear design density. However, fortunately for us, we need values of the
density E-estimator only for X_l in the interval [0.2, 0.9], and within this interval the E-estimate
is satisfactory. As a result, the right diagram shows us a fair regression estimate despite the
fact that only N = 36 observations from hidden n = 100 are available. The oracle estimate
of the regression (the dashed line) is good. In the third row of diagrams the case of the
Bimodal regression is considered. Here the larger sample size k = 50 of E-sample is used
and the design density estimate is fair. Unfortunately, this cannot help the regression E-
estimator because the size N = 34 of available observations is too small and the regression
function is too complicated (recall our simulations in Section 2.3). The poor oracle estimate,
based on n = 100 hidden observations, sheds additional light on the difficult task. In the
bottom diagram we explore the case of the Strata regression function. The design density
estimate is fair, and the regression E-estimate is truly impressive given that only N = 36
observations are available. Further, this estimate is on par with the oracle estimate.
It is advisable to repeat Figure 3.9 with different parameters and get used to this chal-
lenging problem. Further, it is of interest to explore the relation between k and n that
implies a reliable estimation comparable with the oracle’s estimation. Further, Figure 3.10
allows us to use different parameters for the density estimator and regression E-estimators
used in each row. The latter is a nice feature if we want to take into account different sample
sizes and shapes of the underlying curves.
In many applications the support of the predictor may not be known. We have discussed
this situation in Chapter 2, and let us continue it here because this case may imply some
additional complications for our regression E-estimator. Namely, so far it has been explicitly
assumed that the design density is bounded below from zero (recall that in Figure 3.9 the
design density is not smaller than the argument dden). Let us relax these two assumptions
and explain how this setting may be converted into the above-considered one.
Suppose that the hidden predictor X* is a continuous random variable supported on the
real line. Then our methodology of E-estimation is as follows. First, we combine the N available
predictors X_l and the k extra observations X_{El} and find among these N + k observations
the smallest and largest values X_S and X_L, respectively. Then, using the transformation
(X − X_S)/(X_L − X_S) we rescale the two available samples onto [0, 1], and repeat all steps of
the above-proposed regression E-estimation. The only new element here is that the obtained
f̂^{X*}(x) should be divided by (X_L − X_S) to restore its values to the original interval.
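A minimal R sketch of this rescaling step is shown below; the helper name and the returned list are illustrative choices, not the book's code.

```r
# Sketch of the rescaling onto [0, 1] described above. The density estimate computed on
# the [0, 1] scale is divided by (XL - XS), i.e., multiplied by back.factor, to restore
# it to the original scale.
rescale.design <- function(X, X.extra) {
  XS <- min(c(X, X.extra)); XL <- max(c(X, X.extra))
  list(X = (X - XS) / (XL - XS),              # rescaled available predictors
       X.extra = (X.extra - XS) / (XL - XS),  # rescaled E-sample
       back.factor = 1 / (XL - XS))
}
```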
Figure 3.10 illustrates this setting and the proposed solution. Its structure is similar to
Figure 3.9, only here the regression function is the same in all 4 experiments (it is a custom-made
function), and other differences are explained in the caption. Let us look at the top
row of diagrams. Here the density E-estimate is fair, keeping in mind the small sample size
k = 30, and its deviation from the underlying one is explained by the histogram. Please pay
attention to the fact that 30 observations from a normal density may not be representative
of an underlying density (as we see from the heavily skewed histogram). The deficiency in
the density estimate is inherited by the regression estimate. Namely, note that the regression
E-estimate (shown by crosses) is significantly smaller for positive values of X, and this is
due to larger values of the density E-estimate. In the second row of diagrams results of
an identical simulation are shown. Here the density estimate, at the required values Xl ,
is almost perfect. Of course, recall that the smallest values are truncated from below to
avoid almost zero values in the denominator. The corresponding regression E-estimate is
better. Overall, keeping in mind the small sample sizes N = 61 and N = 68 of available
observations, the two regression estimates are fairly good and correctly indicate the sigmoid
shape of the regression.
Simulations in the two bottom rows in Figure 3.10 use larger size k = 50 of E-samples.
The second from the bottom row of diagrams exhibits a teachable outcome which stresses
the fact that outcomes of small random samples may present surprises. Here we observe the
worst density and regression estimates despite the largest N = 82. Note how the shortcom-
Figure 3.10 Bernoulli regression with unavailable failures. The structure of the diagrams is the
same as in Figure 3.9. The difference is that the design density is Normal(0, σ 2 ), in all rows the
same underlying regression function w(x) is used and it is controlled by the string w, and in a right
diagram triangles show available observations while circles and crosses show values of the underlying
regression function w(Xl ) and its E-estimate ŵ(Xl ), l = 1, . . . , N . {The argument sigma controls
the standard deviation σ of the normal design density. The string w defines the regression function
w(x), and note that w(x) ∈ [0, 1]. All other arguments are the same as in Figure 3.9.} [n = 100, set.k = c(30,30,50,50), sigma = 2, w = "0.2+0.8*exp(1+2*x)/(1+exp(1+2*x))", c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 4, setw.cJ0 = c(3,3,3,3), setw.cJ1 = c(0.3,0.3,0.3,0.3)]
Note how the shortcomings in the density estimate are inherited by the regression E-estimate. The bottom row of
diagrams exhibits another outcome of the same simulation, and here both the density and
the regression E-estimates are very good.
The simulations indicate that the main issue to be aware of is that the range of the E-sample should be close to the range of the available predictors. This remark may be useful if sequential E-sampling is possible.
Overall, we may conclude that if a relatively small extra sample of hidden predictors is available, then the proposed regression E-estimator is a feasible solution to the otherwise unsolvable problem of Bernoulli regression with unavailable failures.
It is highly advisable to repeat Figure 3.10 with different parameters and learn more about this important problem, which will play a key role in the statistical analysis of missing data.
3.8 Exercises
3.1.1 What is the definition of biased data?
3.1.2 Present an example of biased data.
3.1.3 Suppose that an underlying random variable X ∗ is observed only if it is larger than
another independent random variable T . Are the observed realizations of X ∗ biased?
3.1.4 Verify (3.1.2) and (3.1.3).
3.1.5 Explain all components of formula (3.1.5).
3.1.6∗ For the setting of Exercise 3.1.3, write down a formula that relates the density of
X ∗ with the density of X.
3.1.7 Is (3.1.8) a reasonable estimator of the Fourier coefficient (3.1.7)?
3.1.8 Find the mean of θ̂j defined in (3.1.8). Hint: Begin with the case when P := P(A = 1)
is given, and then look at how using the plug-in estimate P̂ affects the mean.
3.1.9∗ Evaluate the variance of θ̂j defined in (3.1.8). Hint: Prove formula (3.1.10).
3.1.10 Verify inequality (3.1.11).
3.1.11 Repeat Figure 3.1 for different biasing functions and explain outcomes.
3.1.12 Repeat Figure 3.2 with different underlying corner densities and biasing functions.
What combinations imply worse and better estimates?
3.1.13 How do the parameters of the biasing function, used in Figure 3.2, affect the coefficient of difficulty for the four corner densities?
3.1.14 A naive estimate for biased data first estimates the density of the observed random variable X and then corrects it using a known biasing function B(x). Write down a formula for this estimator. Hint: The density E-estimator can be used for estimating f^X(x), then use the formula

f^{X∗}(x) = f^X(x)B(x)/B,   (3.8.1)

where B is a constant which makes the density bona fide (integrate to 1).
3.2.1 Explain a regression setting with biased responses.
3.2.2 Is the predictor, the response, or both biased under the model (3.2.1)?
3.2.3 Explain all functions in (3.2.1).
3.2.4 How is formula (3.2.3) obtained?
3.2.5 Explain how formula (3.2.4) is obtained. What is its relation to (3.2.1)?
3.2.6 How can a simulation of regression with biased responses be designed?
3.2.7 Consider a model where an underlying response Y ∗ is observed only if Y ∗ > T where
T is an independent random variable. Is this a sampling with biased responses?
3.2.8 For the setting of the previous exercise, what is the formula for the joint density of
the observed pair of random variables (predictor and response)?
3.2.9 Suppose that in the setting of Exercise 3.2.7 the random variable T depends on
predictor X. Does this information make a difference in your conclusions about the biased
data? Is this a response-biased sampling?
3.2.10 Verify formula (3.2.8).
3.2.11 What is the underlying idea of the estimator (3.2.9)?
3.2.12 Evaluate the bias of estimator (3.2.9).
3.2.13 What is the underlying idea of the estimator D̂(x)?
3.2.14 What is the bias of the estimator D̂(x)?
3.2.15∗ The corresponding coefficient of difficulty of the proposed regression E-estimator is

d := E{[1/(f^X(X)B(X, Y)D(X))]^2} = ∫_0^1 ∫_{−∞}^{∞} f^{Y∗|X}(y|x)/[f^X(x)B(x, y)D(x)] dy dx.   (3.8.2)
Prove this assertion, or show that it is wrong and then suggest a correct formula. Hint:
Begin with the case when all nuisance functions (like f X (x) or D(x)) are known.
94 ESTIMATION FOR BASIC MODELS OF MODIFIED DATA
3.2.16 Explain all arguments used by Figure 3.3.
3.2.17 Repeat Figure 3.3 a number of times using different regression functions. Which
one is more difficult for estimation? Hint: Use both visual analysis and ISEs to make a
conclusion.
3.2.18∗ Use Figure 3.3 to answer the following question. For each underlying regression
function, what are the parameters of the biasing function that make estimation less and more
challenging? Confirm your observations using theoretical analysis based on the coefficient
of difficulty.
3.2.19∗ Consider the case B(x, y) = B ∗ (y) when the biasing is defined solely by the value of
the underlying response. Present all related probability formulas for this case, and propose
an E-estimator.
3.2.20∗ In the literature, statisticians often consider a model where

f^{X|Y}(x|y) = f^{X∗|Y∗}(x|y).   (3.8.3)
3.9 Notes
3.1 Biased data is a familiar topic in statistical literature, see a discussion in Efromovich
(1999a), Comte and Rebafka (2016) and Borrajo et al. (2017). A review of possible estima-
tors may be found in the book by Wand and Jones (1995). Efromovich (2004a) has proved that the E-estimation methodology is asymptotically efficient, and that a plug-in estimator of the cumulative distribution function is even second-order efficient; see a discussion and the proof in Efromovich (2004c). A combination of biasing and other modifications is also pop-
ular in the literature, see Luo and Tsai (2009), Brunel et al. (2009), Ning et al. (2010) and
Chan (2013).
3.2-3.3 Biased responses commonly occur in technological, actuarial, biomedical, epidemi-
ological, financial and social studies. In a response-biased sampling, observations are taken
according to the values of the responses. For instance, in a study of possible dependence
of levels of hypertension (response) on intake of a new medicine (covariate), sampling from
patients in a hospital is response-biased with respect to a general population of people with
hypertension. Another familiar example of sampling with selection bias in economic and social studies is that the wage is observed only for employed people. An interesting
discussion may be found in Gill, Vardi and Wellner (1988), Bickel and Ritov (1991), Wang
(1995), Lawless et al. (1999), Luo and Tsai (2009), Tsai (2009), Ning, Qin and Shen (2010),
Chaubey et al. (2017), Kou and Liu (2017), Qin (2017) and Shen et al. (2017).
3.4 Nonparametric estimation for ordered categorical data is considered in the books Si-
monoff (1996) and Efromovich (1999a). Efromovich (1996a) presents asymptotic justifica-
tion of E-estimation.
3.5 A discussion of parametric mixture models can be found in the book by Lehmann and
Casella (1998). Nonparametric models are discussed in books by Prakasa Rao (1983) and
Efromovich (1999a). The asymptotic justification of the E-estimation is given in Efromovich
(1996a). For possible further developments see Chen et al. (2016).
3.6 Asymptotic justification of using E-estimation for nuisance functions may be found in
Efromovich (1996a; 2004b; 2007a,f).
Chapter 4
Nondestructive Missing
In this chapter nonparametric problems with missing data that allow a consistent estimation are considered. By missing we mean that some cases in the data (you may think about rows in a matrix) are incomplete: instead of numbers, some elements in a case are missed (empty). In the R language missed elements are denoted by the logical flag “NA,” which stands for “Not Available,” and this is why we say that some elements in a case are available (not missed) and others are not available (missed); we may similarly say that a case is complete (all elements are available) or incomplete (some elements are not available).
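As a small illustration of this terminology (the toy data below are made up), in R the flag NA marks a missed element, and complete cases can be identified with the base function complete.cases():

# A toy data matrix with two variables (columns) and five cases (rows);
# NA marks elements that are not available (missed).
data <- cbind(X = c(1.2, NA, 0.7, 2.1, NA), Y = c(3.0, 1.5, NA, 0.9, 2.2))
cc <- complete.cases(data)   # TRUE for complete cases, FALSE for incomplete ones
data[cc, ]                   # the complete cases
sum(!cc)                     # the number of incomplete cases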
With missing data, on top of all the earlier discussed issues with nonparametric estimation, we must address the new issue of dealing with incomplete cases. Of course, using the E-estimation methodology converts the new problem into proposing a sample mean Fourier estimator based on missing data. As we will see shortly, for some settings incomplete cases can be ignored (and this may also be the best solution), while for others a special statistical procedure, which takes the missing into account, is necessary for consistent estimation.
Let us recall that if for missing data a consistent estimation is possible, then we refer to the missing as nondestructive. The meaning of this definition is that while a nondestructive missing may affect the accuracy (quality) of estimation via increasing the MISE and other nonparametric risks, at least it allows us to propose a consistent estimator. Some types of missing may destroy all useful information contained in the underlying (hidden) data and hence make a consistent estimation impossible. In this case the missing is called destructive, and then some extra information is needed for a consistent estimation; destructive missing is discussed in the next chapter. Traditional examples of nondestructive missing are settings where observations are missed completely at random (MCAR), when the probability of an observation to be available (not missed) does not depend on its value. Another example is settings with missing at random (MAR), when the probability of an observation to be available (not missed) depends only on the value of another, always observed (never missed) random variable. In some special cases missing not at random (MNAR), when the probability of a variable to be available depends on its value, also implies a nondestructive missing.
In this chapter we often encounter Bernoulli and Binomial random variables. Let us
briefly review basic facts about these random variables (more can be found in Section 1.3).
A Bernoulli random variable A may be equal to zero (often coded as a “failure”) with the
probability 1 − w or 1 (often coded as a “success”) with the probability w. Parameter w,
the probability that A is equal to 1, describes this random variable. In short, we can say
that A is Bernoulli(w). The mean of A is equal to w, that is E{A} = w, and the variance of A is equal to w(1 − w). The sum N := Σ_{l=1}^n A_l of n independent and identically distributed Bernoulli(w) random variables has a binomial distribution with P(N = k) = [n!/(k!(n − k)!)]w^k(1 − w)^{n−k}, k = 0, 1, . . . , n. In short, we can write that N is Binomial(n, w). The mean value of N is nw (indeed, E{N} = E{Σ_{l=1}^n A_l} = Σ_{l=1}^n E{A_l} = nE{A} = nw), and the variance of N is nw(1 − w) (it is the sum of the variances of the A_l). Another useful result about
a Binomial distribution is that for large n it can be approximated by a Normal distribution
with the same mean and variance. The rule of thumb is that if min(nw, n(1 − w)) ≥ 5 or
n ≥ 30, then the distribution of N is approximately Normal with the mean nw and the variance nw(1 − w).
Hoeffding’s inequality states that for any positive constant t,

P(N/n − w < −t) ≤ e^{−2nt^2},   P(|N/n − w| > t) ≤ 2e^{−2nt^2}.   (4.0.1)

In its turn, Hoeffding’s inequality yields that for any constant δ ∈ (0, 1],

P(N/n < w − [ln(1/δ)/(2n)]^{1/2}) ≤ δ.   (4.0.2)
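These bounds are easy to check numerically. The short R simulation below (written for this text, with n, w and δ chosen only for illustration) generates Binomial(n, w) counts N and compares the empirical frequency of the event in (4.0.2) with δ.

# Monte Carlo check of the Hoeffding-type bound (4.0.2).
set.seed(1)
n <- 100; w <- 0.7; delta <- 0.05
t <- sqrt(log(1/delta)/(2*n))            # deviation implied by delta in (4.0.2)
N <- rbinom(10000, size = n, prob = w)   # 10000 realizations of the number of complete cases
mean(N/n < w - t)                        # empirical probability; by (4.0.2) it cannot exceed 0.05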
In this and the following chapters, a Bernoulli(w) random variable A describes an un-
derlying missing mechanism such that if A = 1 (the success) then a hidden observation is
available (this is why we use the letter A which stands for “Availability”) and if A = 0
(the failure), then the hidden observation is not available. Also recall our explanation that in R and some other statistical software packages the logical flag NA is used to indicate a missed (not available) value. The probability w of the success may be referred to as the availability likelihood. If there are n realizations in a hidden sample of interest, the number of complete cases is N := Σ_{l=1}^n A_l, which has a Binomial(n, w) distribution. These facts explain why
Bernoulli and Binomial distributions are pivotal in statistical analysis of missing data.
In what follows we refer to an underlying and hidden sample as H-sample, and to a sam-
ple with missing observations as M-sample. Typically an M-sample is created from a corre-
sponding H-sample by an underlying missing mechanism implying that H- and M-samples
may be dependent. Let us also comment about notation used in this and the next sections.
Suppose that X is a continuous random variable of interest and X1 , . . . , Xn is a sample
from X. Suppose that A is the availability (Bernoulli random variable) and A1 , . . . , An is a
sample from A. Then the sample from X is the H-sample, and (A1 X1 , A1 ), . . . , (An Xn , An )
is the M-sample. Further, because P(A = I(AX ≠ 0)) = 1, the sample (A1X1), . . . , (AnXn) is also an M-sample and the two M-samples are equivalent. In graphics, we may use AX and X[A == 1] as axis labels. The label AX means that all observations in the M-sample are considered, while X[A == 1] means that only not missed observations in the M-sample are considered. The latter notation corresponds to the R operation of extracting the elements of a vector X that correspond to unit elements of a vector A.
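In R this notation is literal; a minimal sketch (the sample size and availability likelihood are illustrative):

set.seed(2)
n <- 10
X <- runif(n)              # hidden H-sample from X
A <- rbinom(n, 1, 0.7)     # availabilities: A = 1 means the observation is not missed
AX <- A * X                # the M-sample written as the products A_l X_l
AX[A == 1]                 # available (not missed) observations; identical to X[A == 1]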
Let us make one important remark. In the previous chapters we have learned that a feasible sample size n, used to solve a nonparametric curve estimation problem, cannot be small. With missing data the number N of complete cases plays the role of the sample size n, and hence an M-sample with a relatively small N should not be taken lightly even if the size n of the hidden H-sample is relatively large. In other words, even if n is large, for missing data it is prudent to look at the number of complete cases N and only then decide on the feasibility of using a nonparametric estimator. If the sample size n should be chosen a priori, then (4.0.1), (4.0.2) and the probability inequalities of Section 1.3 may help us to understand how large the size n should be to avoid a prohibitively small N. Further, numerical simulations (and in particular those in this and the following chapters) become an important tool in gaining the necessary experience in the statistical analysis of missing data.
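For example, inequality (4.0.2) can be inverted numerically to choose n a priori. The R sketch below (the targets w = 0.7, N ≥ 50 and δ = 0.05 are chosen only for illustration) finds the smallest n for which (4.0.2) guarantees at least 50 complete cases with probability at least 0.95.

# Smallest n such that, by (4.0.2), N >= N0 with probability at least 1 - delta.
w <- 0.7; N0 <- 50; delta <- 0.05
guaranteed <- function(n) n * (w - sqrt(log(1/delta)/(2*n)))  # size guaranteed with prob >= 1 - delta
n <- N0
while (guaranteed(n) < N0) n <- n + 1
n    # required size of the hidden H-sample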
The content of this chapter is as follows. Section 4.1 considers the classical problem of
a univariate density estimation where elements of an H-sample may be missed purely at
random meaning that the missing mechanism does not depend on values of observations in
an underlying H-sample. This is the case of MCAR and it is not difficult to understand that
a complete-case approach implies a consistent density estimation. Nonetheless, because it
is our first problem with missing data, everything is thoroughly explained. In particular, it
is explained how to deal with the random number N of available observations. Section 4.2
considers a case of missing responses in nonparametric regression when the probability of
missing may depend on predictor. The main conclusion is that the simplest procedure of
estimation based on complete cases (and ignoring incomplete ones) dominates all other pos-
sible approaches. The case of missed predictors, considered in Section 4.3, is more involved.
The latter is not surprising because in regression analysis a deviation from basic assump-
tions about predictors typically causes major statistical complications. It is explained that a
multi-step statistical methodology of regression estimation, involving estimation of nuisance
functions, is required. Further, a complete-case approach no longer implies a consistent estimation. Estimation of the conditional density, which is a bivariate estimation problem, is
discussed in Section 4.4. A special topic of regression with discrete responses is discussed
in Section 4.5 via the classical and practically important example of Poisson regression.
Scale (volatility) estimation in a regression setting is explored in Section 4.6. Multivariate
regression is discussed in Sections 4.7 and 4.8.
Let us comment on (4.1.1). The data from (AX, A) is collected as follows. First an
observation of X is generated according to the density f X , then independently of X an
observation of A is generated according to the Bernoulli distribution with P(A = 1) =
w. If A = 1 then the observation of X is available (not missed) and the joint density
is f^{AX,A}(x, 1) = f^X(x)P(A = 1|X = x) = f^X(x)w. Otherwise, if A = 0 then the observation of X is not available (missed) and f^{AX,A}(0, 0) = P(X ∈ [0, 1], A = 0) = ∫_0^1 f^X(x)P(A = 0|X = x)dx = ∫_0^1 f^X(x)P(A = 0)dx = 1 − w.
It is important to stress that the MCAR yields a random number

N := Σ_{l=1}^n A_l   (4.1.2)
of available (not missed) observations of X in the M-sample, or equivalently we may say that
we have only N complete cases in the M-sample. The distribution of N is Binomial(n, w),
E{N } = nw, V(N ) = nw(1 − w). In a particular simulation the number of available ob-
servations N can be small with respect to n; this is the main complication of the MCAR.
Inequalities (4.0.1) and (4.0.2) may be used to evaluate the probability of small N , and
also recall that, according to the Central Limit Theorem, if n is sufficiently large then the
distribution of N is close to the Normal(nw, nw(1 − w)).
Now we are ready to explain how to use our E-estimation methodology for construction of the E-estimator of the density f^X(x), x ∈ [0, 1], for an MCAR sample.
First, we need to propose a sample mean estimator of the Fourier coefficients θ_j := ∫_0^1 f^X(x)ϕ_j(x)dx, j ≥ 1. Recall that ϕ_0(x) = 1, ϕ_j(x) = 2^{1/2} cos(πjx), j = 1, 2, . . . are elements of the cosine basis on [0, 1], and the Fourier coefficient θ_0 = ∫_0^1 f^X(x)dx = 1 is always known for a density supported on [0, 1].
The idea of construction of a feasible sample mean estimator of θj is based on the
assertion that the distribution of available (not missed) observations in M-sample is the
same as the distribution of the variable of interest X. This assertion may look plain due to
the independence between X and A, but because this is our first example of missing data,
let us prove it. Consider the cumulative distribution function of an available observation of
X in M-sample, that is an observation of (AX, A) given A = 1. Write,
F^{AX,A|A}(x, 1|1) := P(AX ≤ x|A = 1) = P(A = 1, X ≤ x)/P(A = 1)
= P(A = 1)P(X ≤ x)/P(A = 1) = P(X ≤ x) = F^X(x).   (4.1.3)
Taking the derivative of both sides in (4.1.3) yields the important equality between the
involved densities,
f AX,A|A (x, 1|1) = f X (x). (4.1.4)
This is what was wished to prove. Of course, we may also get (4.1.4) from (4.1.1) using
definition of the conditional density.
Note that f AX,A|A (x, 1|1) = f AX|A (x|1), and hence we may use the latter density in
(4.1.4), and also recall that a sample from AX is equivalent to the sample from the pair
(AX, A).
We conclude that observations in a complete-case subsample have the distribution of the
underlying random variable of interest X. This allows us to write down a Fourier coefficient
of f X (x) as
θ_j := ∫_0^1 f^X(x)ϕ_j(x)dx = ∫_0^1 f^{AX,A|A}(x, 1|1)ϕ_j(x)dx = E{Aϕ_j(AX)|A = 1}.   (4.1.5)
Assume that the availability likelihood w is known. Then (4.1.5) implies a classical
sample mean estimator of θ_j,

θ̌_j := [n^{-1} Σ_{l=1}^n A_l ϕ_j(A_l X_l)]/w.   (4.1.6)
To polish our technique of dealing with missing data, let us formally check that estimator
(4.1.6) is unbiased. Write,
E{θ̌_j} = E{Aϕ_j(AX)/w} = P(A = 1)E{ϕ_j(X)}/P(A = 1) = θ_j.   (4.1.7)
Using independence between A and X, as well as θj = E{ϕj (X)} and that A takes on values
0 or 1, we continue,
E{θ̂_j | N > k} = E{[Σ_{l=1}^n A_l E{ϕ_j(A_l X_l)|A_l}]/[Σ_{l=1}^n A_l] | N > k}
= E{[Σ_{l=1}^n A_l E{ϕ_j(X_l)}]/[Σ_{l=1}^n A_l] | N > k} = θ_j E{[Σ_{l=1}^n A_l]/[Σ_{l=1}^n A_l] | N > k} = θ_j.   (4.1.10)
We conclude that if we restrict our attention to M-samples with N > k, the plug-in
sample mean estimator (4.1.8) is unbiased. This is a nice and encouraging theoretical result
for missing data.
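To make the construction concrete, the following minimal R sketch (written for this text, not part of the book's software) computes the complete-case plug-in Fourier coefficients θ̂_j = N^{-1} Σ_{l: A_l = 1} ϕ_j(X_l) for a simulated MCAR sample; the underlying density is an illustrative choice, and the data-driven smoothing of the density E-estimator of Section 2.2 is not reproduced here.

# Complete-case sample mean Fourier estimator for an MCAR density sample,
# using the cosine basis on [0, 1].
set.seed(3)
n <- 100; w <- 0.7
X <- rbeta(n, 2, 2)          # hidden H-sample supported on [0, 1] (illustrative density)
A <- rbinom(n, 1, w)         # MCAR availabilities
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
N <- sum(A)                  # number of complete cases (here N > 0; the text discusses remedies otherwise)
theta.hat <- sapply(0:5, function(j) sum(A * phi(j, A * X)) / N)
theta.hat                    # estimates of theta_0, ..., theta_5; theta_0 = 1 by construction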
The third remedy that may be used in dealing with zero ŵ is to plug in max(ŵ, c/ ln(n))
with some positive constant c. Recall that we used a similar remedy in dealing with small
estimates of the design density in regression problems. Then a theoretical justification of
this remedy is based on the assumption w ≥ c0 > 0 and that P(ŵ < c/ ln(n)) decreases
exponentially in n according to (4.0.1).
While these three remedies are different, in applications they yield similar outcomes
according to (4.0.1) and (4.0.2), and we will be able to check this via simulations shortly.
The Fourier estimator (4.1.8) yields the density E-estimator fˆ(x) of Section 2.2.
Now let us make one more comment about the Fourier estimator (4.1.8) and, correspond-
ingly, about the density E-estimator. These are complete-case estimators that are based on
N available observations of the random variable X in an M-sample. Note that even the
sample size n of M-sample is not used. As a result, the random number N of available
observations plays the role of fixed n in the proposed density E-estimator of Section 2.2. We
know from Section 2.2 that the sample size n should be relatively large for a chance to get a
feasible nonparametric density estimation. The same is true for the missing data, and only
M-samples with large N must be considered. In practical applications N is known, but in
a simulation, according to (4.0.1), this means that both n and w should be relatively large.
We may conclude that the MCAR does not change the procedure of E-estimation which
simply uses available observations and ignores missed ones. Interestingly, a complete-case
approach is the default approach in all major statistical software packages including R. The recom-
mended complete-case approach, as the asymptotic theory confirms, is optimal and cannot
be improved by any other method.
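Indeed, the factory default na.action in base R is "na.omit," so model-fitting functions silently keep only the complete cases; a small illustration with a made-up data frame:

getOption("na.action")    # the factory default is "na.omit"
df <- data.frame(x = c(0.1, 0.4, NA, 0.8), y = c(1.0, NA, 2.1, 1.7))
nrow(na.omit(df))         # only the two complete cases remain
lm(y ~ x, data = df)      # fitted using the complete cases only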
Now let us check how the recommended complete-case approach performs in simula-
tions. Note that no new software is needed because we are using the E-estimator of Section
2.2. Figure 4.1 allows us to look at two simulations with the underlying densities being
the Normal and the Bimodal. The caption explains the simulation and diagrams. The cor-
responding sample sizes are chosen to be small so we can observe the underlying hidden
H-sample from X and the M-sample from (AX, A). The two top diagrams exhibit a simula-
tion and E-estimates for the Normal density based on H-sample and M-sample. The sample
size n = 50 is small and the size N = 34 of the M-sample is almost “parametric,” but still
it is far from being in single digits. It is an important teachable lesson to repeat this figure
with different parameters and pay attention to N . H-sample and M-sample, shown in the
top diagram, are reasonable, but we may see some issues with the tails (there are more
large realizations than small ones). Nonetheless, the corresponding E-estimates, shown in
the second (from the top) diagram, are very reasonable. Visually the estimates are too close
to see any difference, and this is when their empirical ISEs help us to recognize the effect
of the MCAR on estimation. The two bottom diagrams exhibit results for the Bimodal
density. Here a larger sample size n = 75 is used.
Figure 4.1 MCAR data and density E-estimation. The two top and two bottom diagrams correspond
to samples from the Normal and the Bimodal densities, and the missing is created by Bernoulli
availability A with the availability likelihood P(A = 1|X = x) = w = 0.7. The top diagram shows
both the H-sample and M-sample via the scattergram (X1 , A1 ), . . . , (Xn ,P
An ) where available cases
n
are shown by circles and missing cases are shown by crosses. N := l=1 Al is the number of
available observations in the M-sample. In the second from the top diagram the underlying density,
E-estimate based on the M-sample, and E-estimate based on the H-sample are shown by the solid,
dashed and dotted lines, respectively. The ISEs of E-estimates, based on the H-sample and M-sample,
are denoted as ISEH and ISEM, respectively. The two bottom diagrams have the same structure.
{Recall that the simulation may be repeated by calling (after the R prompt >) the R function >
ch4(f=1). All default arguments, shown below in the square brackets, may be changed. Let us review
these arguments. The argument set.c controls the choice of underlying corner densities; recall that
the caption of Figure 2.3 explains how to use a custom-made density. The argument set.n allows
one to choose 2 different sample sizes. The argument w controls the availability likelihood P (A = 1),
that is the likelihood of a hidden realization of X to be observed. The arguments cJ0, cJ1 and cTH
control the parameters cJ0 , cJ1 , and cT H used by the E-estimator defined in Section 2.2. Note that
R language does not recognize subscripts, so we use cJ0 instead of cJ0 , etc. To repeat this figure
with the default arguments, make the call > ch4(f=1). If one would like to change arguments, for instance to use a different threshold level, say cTH = 3, make the call > ch4(f=1, cTH=3). To change sample sizes to 100 and 150, make the call > ch4(f=1, set.n=c(100,150)).} [set.c =
c(2,3), set.n = c(50,75), w = 0.7, cJ0 = 3, cJ1 = 0.8, cTH = 4]
Otherwise it is difficult to find a sample which indicates two modes of the Bimodal. Note that the M-sample contains only
N = 54 observations, and this makes E-estimation of the Bimodal challenging. The two
E-estimates, based on the H-sample and M-sample, indicate two pronounced modes, but the former more accurately shows the main mode and overall is closer to the Bimodal, a fact that is also stressed by the ISEs. We can also note that the Bimodal density is more difficult
for estimation than the Normal.
Let us complement the discussion of outcomes of the two simulations in Figure 4.1 by a
theoretical analysis of the Fourier estimator (4.1.6). The estimator is unbiased.
Figure 4.2 Testing the theoretical conclusion that using M-sample of size k, equal to rounded up
ratio n/w, allows us to estimate an underlying density with the same MISE as using H-sample of
size n. Simulations are the same as in Figure 4.1, only now extra k − n observations of X are
combined with H-sample and then are used to generate M-sample of size k. The first three diagrams
show results of particular simulations, and their structure is the same as in the second from the
bottom diagram in Figure 4.1. The bottom diagram shows the histogram of ratios ISEH/ISEM for
400 repeated simulations. Its title also shows the sample mean ratio and the sample median ratio.
{The argument nsim controls the number of simulations, and corn controls an underlying corner
density.} [n = 100, corn = 3, w = 0.7, nsim = 400, cJ0 = 3, cJ1 = 0.8, cTH = 4]
If w is unknown then the estimator θ̂j , defined in (4.1.8), is used. Recall that this is a
plug-in estimator with ŵ = N/n used in place of w. Relation (4.0.1), together with some
straightforward algebra, shows that the plug-in estimator performs similarly to the estimator (4.1.6) based on the known w.
Further, recall the notion of the coefficient of difficulty d = lim_{n,j→∞} nV(θ̂_j) introduced
in Chapter 2. For the case of an H-sample we have d = 1, and for the MCAR d = 1/w =
1/P(A = 1) ≥ 1 with the equality only if P(A = 1) = 1 (no missing). Hence, for an MCAR
sample of size k we need to have k ≥ n/w to compete, in terms of the MISE, with H-
samples of size n. This relation quantifies the effect of the MCAR on nonparametric density
estimation.
We are finishing the section by exploring an interesting question motivated by the above-
presented theory. Is it possible, under any scenario, to prefer an MCAR sampling to the
traditional sampling without missing? The question may be confusing because the missing
data literature always considers missing as a nuisance which should be avoided if possible.
On the other hand, let us consider the following situation. Assume that the price of a single
observation in an M-sample (a sample that allows missing observations) is PM while in
the corresponding H-sample (a sample without missing) the price is PH . As we know from
the asymptotic theory, to get the same MISE the sample size k of the M-sample should
be equal to n/w where n is the size of the H-sample. This yields that the M-sampling
becomes more cost efficient if PM < wPH . In other words, at least theoretically, if the price
of the MCAR sampling is low with respect to the price of sampling without missing (which
may require more diligent bookkeeping or collecting meteorological data even during bad
weather conditions), then the MCAR sampling can be more cost efficient. Of course, we
have k > n, and if the total time of sampling is important, then this issue should be taken
into account, but if only the price of sampling and accuracy (MISE) are important, then at
least theoretically MCAR missing may have the edge.
Can the above-presented asymptotic theory be applied to small samples? Figure 4.2
allows us to explore this question via an intensive numerical study. The underlying exper-
iment is similar to the one in Figure 4.1. Namely, first a direct H-sample of size k, equal
to the rounded up ratio n/w, is generated from a corner function, here it is the Bimodal.
Then an M-sample is obtained from the H-sample with w = 0.7, for which the E-estimate f̂_M^X is calculated. Also, based on the first n observations of the H-sample, the E-estimate f̂_H^X is calculated.
4.2 Nonparametric Regression with MAR Responses
The function w(x) is called the availability likelihood, and (4.2.3) implies that the missing is MAR (missing at random) because the probability of missing the response is defined by the always observed predictor.
To use our E-estimation methodology for the MAR sample, we need to propose a sample
mean estimator of Fourier coefficients of the regression function (4.2.1). To understand its
construction, we begin with the formula for the joint mixed density of the observed triplet
(X, AY, A). For x ∈ [0, 1], y ∈ (−∞, ∞) and a ∈ {0, 1} we can write down the joint mixed
density as
f^{X,AY,A}(x, ay, a) = P(A = a|X = x) f^{X,AY}(x, ay)
= [w(x)f^X(x)f^{Y|X}(y|x)]^a [(1 − w(x))f^X(x)]^{1−a}.   (4.2.4)

This formula allows us to write down Fourier coefficients of m(x), x ∈ [0, 1] as follows,

θ_j = ∫_0^1 m(x)ϕ_j(x)dx = E{AY ϕ_j(X)/[f^X(X)w(X)]},   j = 0, 1, . . .   (4.2.5)
Assume for a moment that functions w(x) and f X (x) are known. Then the sample mean
estimator of Fourier coefficients is

θ̄_j := n^{-1} Σ_{l=1}^n A_l Y_l ϕ_j(X_l)/[f^X(X_l)w(X_l)].   (4.2.6)
If f X (x) and w(x) are unknown, and this is a typical situation, then they can be es-
timated based on the M-sample. Indeed, the design density can be estimated using E-
estimator of Section 2.2 and the availability likelihood can be estimated by the Bernoulli
regression estimator of Section 2.4.
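As an aside, note that estimation of w(x) uses only the always available pairs (X_l, A_l). The book's tool is the Bernoulli regression E-estimator of Section 2.4; the sketch below uses an ordinary logistic fit merely as a simple stand-in, and the availability likelihood is an illustrative choice.

# Estimating the availability likelihood w(x) = P(A = 1 | X = x) from an M-sample.
set.seed(4)
n <- 200
X <- runif(n)
w <- function(x) pmin(0.9, pmax(0.5, 1 - 0.4 * x))   # illustrative availability likelihood
A <- rbinom(n, 1, w(X))                              # A_l is observed for every case
fit <- glm(A ~ X, family = binomial)                 # stand-in for the Bernoulli regression E-estimator
w.hat <- predict(fit, type = "response")             # estimates of w(X_l)
summary(w.hat)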
There is also another attractive possibility to deal with the case of unknown f X (x) and
w(x). First, we rewrite the estimate θ̄_j as

θ̄_j = [nP(A = 1)]^{-1} Σ_{l=1}^n A_l Y_l ϕ_j(X_l)/[f^X(X_l)w(X_l)/P(A = 1)].   (4.2.7)
Next, we note that the marginal density of the predictor in a complete case is f^{X|A}(x|1) = f^X(x)w(x)/P(A = 1), which is exactly the denominator in (4.2.7).
Further, since in practice we always deal with N comparable with n, we can replace c/ ln(n)
in (4.2.10) by c/ ln(N ). Then the Fourier estimator is based only on complete-case obser-
vations (when Al = 1), and even the sample size n is not used. In other words, this is a
complete-case Fourier estimator.
We can make the following conclusion. The E-estimator, proposed for the case of classical
regression with no missing data, can be used here for the subsample of complete cases
(Xl , Al Yl ) corresponding to Al = 1, l = 1, . . . , n. This approach, when only complete cases
are used in estimation and incomplete cases are ignored, is called a complete-case approach.
The asymptotic theory supports this approach and asserts that no other estimator may
outperform complete-case regression estimation for the case of MAR responses.
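In practice the complete-case approach is just a subsetting step before any regression estimator is applied. A minimal R sketch follows; the model, the availability likelihood, and the smoother loess (used here only as a stand-in for the regression E-estimator of Section 2.3) are all illustrative choices.

# Complete-case nonparametric regression with MAR responses.
set.seed(5)
n <- 100
X <- runif(n)
Y <- 2 + sin(2 * pi * X) + rnorm(n, sd = 0.5)    # hidden responses
w <- pmin(0.9, pmax(0.5, 1 - 0.4 * X))           # availability likelihood depends only on X (MAR)
A <- rbinom(n, 1, w)
Xc <- X[A == 1]; Yc <- Y[A == 1]                 # keep complete cases, ignore incomplete ones
fit <- loess(Yc ~ Xc)                            # any regression estimator applied to complete cases
predict(fit, data.frame(Xc = c(0.25, 0.50, 0.75)))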
Let us check how the proposed complete-case E-estimator performs for small sample
sizes. Two simulations are shown in the two columns of Figure 4.3. Top diagrams show
scattergrams of the underlying H-samples (unavailable samples from (X, Y )) overlaid by
the corresponding regression functions and their E-estimates. As we already know from
Section 2.3, the estimates may be good for the sample size n = 100, and this relatively
small sample size is chosen to better visualize data. In both cases the E-estimates are fair
despite the heteroscedasticity. The bottom diagrams show us MAR samples from (X, AY )
that are referred to as M-samples. Note that here complete pairs are shown by circles and
incomplete by crosses. In both simulations the same availability likelihood function w(x)
is used, and we can see from the M-samples that the function is decreasing in x. Let us
look more closely at the left column of diagrams with the Normal regression function. The
M-sample is a teachable example of what missing may do to the data. First, the number N
of complete pairs is only 77, that is almost a quarter of responses are lost. Second, while
for the H-sample the E-estimate overestimates the mode, for the M-sample it underestimates it,
and we can realize why from the scattergrams. It is also of interest to compare the ISEs
which describe how the MAR affects the quality of estimation. For the Bimodal regression,
the MAR affects the quality of estimation rather dramatically. The Bimodal is a difficult
regression to deal with, and here the heteroscedasticity makes its estimation even more
complicated by hiding the two modes. Nonetheless, the E-estimator does a very good job
for the H-sample. Magnitudes of the two modes are shown correctly, both modes are slightly
shifted to the left, but overall we get a good picture of a bimodal regression function.
The MAR, highlighted in the right-bottom diagram, modifies the H-sample in such a way
that while the E-estimate shows two modes, its left mode is higher than the right one. It
is a teachable moment to analyze the scattergram and, keeping in mind the underlying
availability likelihood, to figure out why such a dramatic change has occurred.
Now let us make several theoretical comments about the recommended estimator. First,
let us check that (4.2.6) is an unbiased estimator of the Fourier coefficients θ_j, that is
E{θ̄j } = θj . (4.2.11)
To prove this assertion, we first note that the assumption (4.2.3) implies independence
of A and Y given X, and in particular that E{AY |X} = E{A|X}E{Y |X}. This equality,
together with the rule of calculation of the expectation via conditional expectation, yields (4.2.11).
Figure 4.3 Performance of the complete-case E-estimator for regression with MAR responses. Re-
sults for two simulations with the Normal and the Bimodal regression functions are shown in the
two columns of diagrams. The underlying model is Y = m(X) + σS(X)ε where ε is standard normal
and independent of X. The top diagrams show underlying hidden samples (H-samples), the bottom
diagrams show the corresponding M-samples with missing responses. Observations in H-samples are
shown by circles. Available complete pairs are shown by circles while incomplete pairs (Xl , Al Yl )
with Al = 0 are shown by crosses. The titles show corresponding integrated squared errors (ISE).
An underlying regression function and its E-estimate are shown by the solid and dashed lines, re-
spectively. Because it is known that estimated regression functions are nonnegative, a projection on
nonnegative functions is used. {The arguments are: set.c controls two underlying regression func-
tions for the left and right columns; n controls the sample size; sigma controls the parameter σ; the
string scalefun defines a custom function S(x) which is truncated from below by the value dscale
and then it is rescaled to get a bona fide density supported on [0,1]; the string desden controls shape
of the design density f X (x) which is then truncated from below by the value dden and then it is
rescaled to get a bona fide density; the availability likelihood function w(x) is defined by the string
w and then truncated from below by dwL and from above by dwU .} [n = 100, set.c = c(2,3), sigma
= 1, scalefun = "3-(x-0.5)^2", desden = "1+0.5*x", dscale = 0, dden = 0.2, w = "1-0.4*x",
dwL = 0.5, dwU = 0.9, c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 4]
Using the formula

V(Z) = E{Z^2} − [E{Z}]^2,   (4.2.15)

we continue with calculation of the second moment,

E{[AY ϕ_j(X)/(f^X(X)w(X))]^2} = E{[Y ϕ_j(X)/(f^X(X)w(X))]^2 E{A^2|X, Y}}
= E{(m^2(X) + 2m(X)σ(X)ε + σ^2(X)ε^2)ϕ_j^2(X)w(X)/[f^X(X)w(X)]^2}
= E{E{(m^2(X) + 2m(X)σ(X)ε + σ^2(X)ε^2)ϕ_j^2(X)w(X)/[f^X(X)w(X)]^2 | X}}.   (4.2.16)

Using the assumed E{ε|X} = 0 and E{ε^2|X} = 1, we may continue (4.2.16),

E{[AY ϕ_j(X)/(f^X(X)w(X))]^2} = E{[m^2(X) + σ^2(X)]ϕ_j^2(X)/([f^X(X)]^2 w(X))}
= ∫_0^1 [m^2(x) + σ^2(x)]ϕ_j^2(x)/[f^X(x)w(x)] dx.   (4.2.17)
This, (4.2.12) and (4.2.15) prove (4.2.13).
Third, in Section 2.3 it was explained how to reduce the variance (4.2.13) by subtracting an appropriate partial sum

m̃_{−j}(X_l) := Σ_{s∈{0,1,...,b_n}\{j}} θ̄_s ϕ_s(X_l).   (4.2.18)
A direct calculation shows that, under a mild assumption on the smoothness of the functions m(x), f^X(x), σ(x) and w(x), this modification reduces the variance of the Fourier estimator.
4.3 Nonparametric Regression with MAR Predictors
Note that while in general X and A may be dependent random variables, according to (4.3.1) they are conditionally independent given the response Y.
The joint mixed density of the triplet (AX, Y, A) may be written down similarly to (4.2.4). The underlying regression model is Y = m(X) + σ(X)ε, where ε is a random variable which may depend on X, E{ε|X} = 0 and E{ε^2|X} = 1 almost surely, and σ(x) is the scale function.
The good news about the regression model with MAR predictors is that the availability
likelihood (4.3.1) depends only on the always observed value of the response. The bad news
is that in complete cases of M-sample the conditional density f Y |X,A (y|x, 1) of the response
given the predictor is biased with respect to the conditional density of interest f Y |X (y|x)
(discussion of biased densities may be found in Section 3.1). Let us prove the last assertion.
Write,
f^{Y|X,A}(y|x, 1) = w(y)f^{Y|X}(y|x)f^X(x) / [f^X(x) ∫_{−∞}^{∞} w(u)f^{Y|X}(u|x)du]
= [w(y) / ∫_{−∞}^{∞} w(u)f^{Y|X}(u|x)du] f^{Y|X}(y|x).   (4.3.5)
This is what was wished to prove, and note that w(y) is the biasing function.
The biased conditional density implies that in general, when only M-sample is available,
a complete-case approach cannot be used for estimation of the conditional density f Y |X (y|x)
and hence for estimation of the regression function m(x) := E{Y |X = x}. This is what sets
apart regressions with MAR responses and MAR predictors, because for the former the
complete-case approach is both consistent and optimal.
To propose a regression estimator, we are going to consider in turn three scenarios when
a consistent estimation is possible (furthermore, the asymptotic theory asserts that the
proposed solutions yield optimal rates of the MISE convergence). The first one is when the
design density f X (x) and the availability likelihood w(y) are known. Under this scenario
a complete-case approach yields consistent estimation. The second one is when these two
nuisance functions are unknown but the marginal density f Y (y) of the response is known.
Under the second scenario a complete-case approach is also consistent. Finally, if only M-
sample is available, a consistent E-estimator uses both complete and incomplete cases.
Scenario 1. Functions f X (x) and w(y) are known. To employ the E-estimation method-
ology, we need to suggest a sample mean estimator of Fourier coefficients
θ_j := ∫_0^1 m(x)ϕ_j(x)dx,   j = 0, 1, . . .   (4.3.6)
of the regression function of interest (4.3.3). To do this, we need to write down the coefficients
as an expectation. Using (4.3.2) and (4.3.3), we can write θ_j = E{AY ϕ_j(AX)/[f^X(AX)w(Y)]}, which yields the sample mean estimator θ̄_j := n^{-1} Σ_{l=1}^n A_l Y_l ϕ_j(A_l X_l)/[f^X(A_l X_l)w(Y_l)].
The Fourier estimator θ̄j yields the regression E-estimator of Section 2.3. Note that the
E-estimator is based only on complete cases, that is the regression estimate is a complete-
case estimator. Further, the asymptotic theory asserts that a complete-case approach is
optimal.
Scenario 2. Known marginal density of responses f Y (y). This is an interesting case
both on its own and because it will explain how to solve the problem of regression estimation
when only M-sample is available.
There are two steps in the proposed regression estimation. The first step is to use the
known density f Y (y) of responses for estimation of nuisance functions w(y) and f X (x). This
114 NONDESTRUCTIVE MISSING
step may be done using only complete cases. The second step is to utilize the E-estimator
proposed under the above-considered first scenario.
Let us show how the two nuisance functions can be estimated. We begin with estimation
of w(y). Recall that w(y) := P(A = 1|Y = y) = E{A|Y = y}. This implies that estimation
of w(y) is the Bernoulli regression for which E-estimator ŵ(y) was proposed in Section 2.4.
Also recall that only Yl in complete cases are needed whenever the density of Y is known.
With the estimator ŵ(y) at hand, we can estimate the density f^X(x). The idea of estimation is as follows. Suppose that w(y) is known, and denote by κ_j := ∫_0^1 f^X(x)ϕ_j(x)dx the jth Fourier coefficient of the density. Then, as we will show shortly, the sample mean estimator of κ_j is

κ̃_j := n^{-1} Σ_{l=1}^n A_l ϕ_j(A_l X_l)/w(Y_l).   (4.3.9)
Then we plug max(ŵ(Yl ), c/ ln(n)) in place of unknown w(Yl ). This Fourier estimator yields
the density E-estimator f̂^X(x), x ∈ [0, 1], of Section 2.2.
The teachable moment here is that if f Y (y) is known, then the complete-case approach
is still consistent.
To finish our discussion of Scenario 2, we need to prove that (4.3.9) is the sample mean
estimator. Using the rule of calculation of the expectation via a conditional expectation, we
can write,
E{κ̃_j} = E{Aϕ_j(AX)/w(Y)} = E{ϕ_j(X)E{A|X, Y}/w(Y)}
= E{ϕ_j(X)w(Y)/w(Y)} = E{ϕ_j(X)} = ∫_0^1 f^X(x)ϕ_j(x)dx = κ_j.   (4.3.10)
This is what was wished to show.
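The weighting in (4.3.9) is easy to sketch in R with the availability likelihood treated as known (in practice the truncated estimate max(ŵ(Y_l), c/ln(n)) is plugged in); the design density, the regression function and w(y) below are illustrative choices.

# Sample mean estimator (4.3.9) of the Fourier coefficients of the design density f^X
# in regression with MAR predictors.
set.seed(6)
n <- 300
X <- runif(n)                                        # hidden predictors, uniform on [0, 1]
Y <- 2 * X + rnorm(n)                                # responses, always observed
w <- function(y) pmin(0.9, pmax(0.2, 1 - 0.2 * y))   # availability likelihood w(y)
A <- rbinom(n, 1, w(Y))                              # a predictor is missed with probability 1 - w(Y_l)
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
kappa.tilde <- sapply(0:5, function(j) mean(A * phi(j, A * X) / w(Y)))
kappa.tilde     # for the uniform design, kappa_0 is close to 1 and kappa_j, j >= 1, are close to 0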
Scenario 3. Only M-sample is available. The available M-sample is (A1 X1 , Y1 , A1 ), . . . ,
(An Xn , Yn , An ) from the triplet (AX, Y, A). No other information is available.
The proposed solution is to convert this scenario into the previous one. To do this,
we note that all n observations of Y are available and hence the density f Y (y) may be
estimated by the density E-estimator fˆY (y) of Section 2.2. Then the estimator proposed for
the second scenario can be utilized with the density f Y replaced by its E-estimate. Let us
also stress that values of all nuisance functions are needed only at (Al Xl , Yl ) from complete
cases.
Let us comment on the proposed data-driven regression estimator from the point of view
of a complete-case approach. All observations, including responses in incomplete cases, are
needed only for estimation of the density f Y (y), furthermore, this density is needed only
at points Al Yl corresponding to Al = 1. As a result, if an extra sample of responses is
available to estimate f Y (Yl ), then only complete cases can be used. This remark is important
whenever the reliability of data in incomplete cases is in doubt.
Figure 4.4 allows us to explain the setting and each step in construction of the regression
estimator. The caption explains all diagrams.
We begin with the left-top diagram exhibiting the underlying H-sample (the hidden
scattergram). Pairs of observations are shown by circles, the solid and dashed lines show
the underlying regression and the E-estimate based on the H-sample, respectively. The
regression is Y = m(X) + σS(X)ε, where m(x) and S(x) are two custom-made functions,
σ is the parameter and ε is a standard normal random variable independent of X.
For the particular simulation Y = 2X + (3 − (X − 0.5)2 )ε. The distribution of X is also
custom-made, here it is uniform on [0, 1]. The title indicates the sample size n and the ISE
of the regression E-estimate. Let us note that the linear regression, as a function, is not
an element of the cosine basis, and it is a challenging nonparametric function. Overall, the
estimate correctly shows the underlying regression.
Figure 4.4 Nonparametric regression with MAR predictors. The underlying model is Y = m(X) +
σS(X)ε where ε is standard normal and independent of X. The left-top diagram shows the hidden
H-sample, regression m(x) (the solid line) and its E-estimate (the dashed line). The left-middle
diagram shows the corresponding M-sample, m(x) by the solid line and E-estimate, based on com-
plete cases, by the dashed line. The left-bottom diagram shows E-estimate of the marginal density
f Y (Yl ) at observed responses. The right-top diagram shows the scattergram of (Yl , Al ) by circles, the
underlying availability likelihood function w(Yl ) by triangles and its E-estimate ŵ(Yl ) by crosses.
The right-middle diagram shows by triangles f X (Al Xl ) for complete cases (when Al = 1), and by
crosses its E-estimate (here they coincide). The dashed line shows density E-estimate based only on
predictors in complete cases (available predictors). The axis label X[A == 1] indicates that f X (Xl )
and fˆX (Xl ) are shown only for Xl corresponding to Al = 1, that is only for complete cases. The
right-bottom diagram shows the underlying regression function m(x) (the solid line) and the regres-
sion E-estimate (the dashed line) proposed for the case of MAR predictors. {The arguments are:
mx defines a custom-made underlying regression function, n controls the size of H-sample, sigma
controls the parameter σ; scalefun defines shape of function S(x) which is truncated from below
by the value dscale and then rescaled into a bona fide density supported on [0, 1]; desden allows
to choose the shape of design density f X (x) which is then truncated from below by dden and then
rescaled into a bona fide density. The shape of the availability likelihood function w(y) is defined by
the argument-function w which is then truncated from below by dwL and from above by dwU .}
[mx = "2*x", n = 300, sigma = 1, scalefun = "3-(x-0.5)^2", desden = "1+0*x", dscale = 0, dden = 0.2, w = "1-1.2*y", dwL = 0.2, dwU = 0.9, c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 4]
The left-middle diagram shows us the observed missing data (M-sample). The missing is
created by the Bernoulli random variable A with an availability likelihood function P(A =
1|X = x, Y = y) = w(y), in this particular simulation w(y) = min(0.9, max(0.2, 1 − 1.2y));
the caption explains how to change this function. The diagram shows n realizations of (Y, AX), and its title also shows the number N := Σ_{l=1}^n A_l of complete pairs. As we see,
from n = 300 of hidden complete pairs, the missing mechanism allows us to have only
N = 141 complete pairs, that is, more than half of the predictors are missed. The naïve complete-case regression E-estimator is shown by the dashed line; its visualization and the ISE = 0.17 support the above-made theoretical conclusion about the inconsistency of the complete-case approach for the case of MAR predictors. The complete-case E-estimate indicates a smaller y-intercept and a changing slope. To understand this outcome, let us briefly look
at the right-top diagram. Here the circles show the scattergram of (Yl , Al ). As we see, all
smaller underlying responses belong to complete cases, while many larger responses belong
to incomplete cases. This is what explains the complete-case E-estimate in the left-middle
diagram. To finish the discussion of the diagram, let us comment on how the M-sample was
generated. First the random variable X (the predictor) is generated. Second, the response
Y is generated according to the custom-made regression formula. Then the Bernoulli(w(Y ))
random variable A is generated. This creates the M-sample. A particular sample is shown in
the diagram. Note that if in the H-sample the predictors are distributed uniformly, available
predictors in the M-sample are no longer uniform; it is clear that the distribution is skewed
to the left. Can you support this conclusion theoretically via a formal analysis of f X|A (x|1)?
We will return to this density shortly.
The first two diagrams described the data. All the following diagrams are devoted to
the above-outlined regression estimation based solely on the M-sample.
The left-bottom diagram illustrates the first step: estimation of the response density
f Y (y). Because all responses are available, we may use the density E-estimator of Section
2.2. The only issue to explain is how we are dealing with an unknown support of Y . The
density E-estimator of Section 2.2 assumes that the support is [0, 1]. This is why we use
the rescaling approach explained in Section 2.2. Namely, we do the following: (i) Rescale
all Y_l onto the unit interval by introducing Y′_l := (Y_l − Y_(1))/(Y_(n) − Y_(1)); (ii) Calculate the density E-estimate f̂^{Y′}(y′) of f^{Y′}(y′), y′ ∈ [0, 1]; (iii) Rescale back the calculated estimate and get the wished f̂^Y(Y_l) := [Y_(n) − Y_(1)]^{-1} f̂^{Y′}(Y′_l). The diagram shows that the calculated
E-estimate is unimodal with the mode at Y = 1. Another informative aspect of the plot is
that the majority of responses are located around the mode and just a few create the tails (as should be expected from the curve).
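The rescaling steps (i)-(iii) take only a few lines of R. The sketch below uses the base kernel estimator density() purely as a stand-in for the cosine-series E-estimator of Section 2.2, and the simulated responses are an illustrative choice.

# Rescale responses onto [0, 1], estimate the density there, and rescale the estimate back.
set.seed(7)
Y <- rnorm(300, mean = 1)                           # responses with an unknown support
a <- min(Y); b <- max(Y)                            # Y_(1) and Y_(n)
Yp <- (Y - a) / (b - a)                             # step (i): rescaled responses Y' in [0, 1]
est <- density(Yp, from = 0, to = 1)                # step (ii): density estimate on [0, 1] (stand-in)
fY.at.Y <- approx(est$x, est$y, xout = Yp)$y / (b - a)   # step (iii): back to the original scale
head(fY.at.Y)                                       # estimates of f^Y at the observed Y_l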
Let us stress that the left-bottom diagram illustrates the only step where all n obser-
vations of Y are used. Hence, if an additional sample from Y is available, we can estimate
f Y (y) and then use it in the complete-case regression E-estimator described below. This is
a useful remark because it explains what can be done if only complete cases in an M-sample
are available.
The right-top diagram shows us the scattergram of A versus Y (it is shown by circles). The regression of A on Y is the search for the availability likelihood function w(y) = P(A = 1|Y = y). This is a classical Bernoulli regression discussed in Section 2.4. Recall that
if f Y (y) is known then the regression E-estimator is based only on values of Yl in available
complete cases (when Al = 1). In the diagram triangles and crosses show the underlying
w(Yl ) and its E-estimate ŵ(Yl ) at all observed values of Y (note that Figure 4.4 allows us to
choose any custom-made availability likelihood function w(y)). The E-estimate is not perfect
due to the wrong right tail. At the same time, only a small proportion (from n = 300) of
responses belong to that tail. Furthermore, recall that w(y) is a nuisance function whose
values are needed only at values Yl corresponding to Al = 1. Hence, a bad estimation of the
right tail may not ruin the regression E-estimate.
The right-middle diagram illustrates estimation of the design density f^X(x) at the values
x = Xl of available predictors in the M-sample. The density is estimated with the help
of the estimates ŵ(Yl ) for l such that Al = 1. The diagram shows us by the triangles and
crosses the underlying and estimated densities (here they coincide). Note that the E-estimate
is perfect despite the imperfect estimate of w(y). The dashed line shows us the density E-
estimate fˇX (x) based on N = 141 available realizations of X. This estimate correctly shows
the biased (by the MAR) character of predictors in complete cases. The reader may look
theoretically at the density f X|A (x|1) and get a formula for the biasing function. This is an
interesting example of creating biased data where the underlying density can be restored
without knowing the biasing function.
Finally, with the help of the estimated w(y) and f X (x), we can estimate the regression
function m(x). In the right-bottom diagram, the solid and dashed lines show the underly-
ing regression and the proposed E-estimate, respectively. The E-estimate nicely shows the
monotone character of the regression. Further, if we compare its ISE = 0.0079 with the
ISE = 0.0095 of the E-estimate based on the H-sample, then the outcome is surprisingly
good especially keeping in mind the number N = 141 of complete cases and the imperfect
estimate of the availability likelihood. Of course, this is an atypical outcome. It is advisable
to repeat Figure 4.4 and learn more about this interesting statistical problem.
The asymptotic theory asserts that under a mild assumption the proposed methodology
is optimal. Further, we may conclude that the case of MAR predictors is dramatically more
complicated than the case of MAR responses where a complete-case approach may be used.
4.4 Conditional Density Estimation
The latter means that we always have a chance of observing (not missing) Y given X. The problem is to estimate the conditional density f^{Y|X}(y|x) using an M-sample.
It was shown in Section 4.2 that a complete-case approach is consistent for estimation
of the conditional expectation E{Y |X}. Can this approach also shine in estimation of the
conditional density? Let us explore this question theoretically. Consider the conditional
density of Y given X in complete cases of an M-sample. Using (4.4.3) we can write,
f^{Y|X,A}(y|x, 1) = f^{X,Y}(x, y)/f^X(x) = f^{Y|X}(y|x).   (4.4.4)
We conclude that the conditional density in the complete cases of an M-sample is the
same as the conditional density of interest f^{Y|X} in the underlying H-sample. Of course, the number N = Σ_{l=1}^n A_l of complete cases is a Binomial(n, P(A = 1)) random variable,
and for some M-samples N may be dramatically smaller than n. On the other hand, we
know from the inequality (4.0.1) that the likelihood of relatively small N is negligible for
large n. In other words, the more serious issue is that here we are dealing with estimation
of a bivariate function, recall the discussion in Section 2.5. As a result, similarly to our
discussion in Section 4.1, only M-samples with relatively large N > k should be considered,
and for a bivariate problem a reasonable k is in the hundreds.
Now we are in a position to propose an E-estimator of the conditional density (4.4.1).
First, let us explain how we can solve the problem when an H-sample of size n from (X, Y )
is available. Following our methodology of constructing an E-estimator, we need to propose
a sample mean estimator of Fourier coefficients of the conditional density. Suppose that a
pair (X, Y ) is supported on [0, 1]2 and recall the corresponding tensor-product basis with
elements ϕj1 j2 (x, y) := ϕj1 (x)ϕj2 (y) (recall Section 2.5). Using this basis, we may write
down Fourier coefficients of the conditional density,
Z
θj1 j2 = f Y |X (y|x)ϕj1 j2 (x, y)dxdy
[0,1]2
This Fourier estimator yields the conditional density E-estimator fˆY |X (y|x), (x, y) ∈
[0, 1]2 of Section 2.5.
In a general case of an unknown support of the pair (X, Y ), we use our standard proce-
dure of rescaling available observations on the unit square.
Let us check how the E-estimator performs for a sample with MAR responses and also
add an explanation why estimation of the conditional density may shed an important light
on relationship between two random variables which cannot be gained from the analysis of
regression functions.
Figure 4.5 helps us to understand the MAR setting and how the E-estimator performs.
We postpone for now explanation of the underlying simulation, and this will allow us to use
our imagination and guess about an underlying distribution. The left-top diagram shows a
scattergram of a hidden H-sample; for now do not look at the other diagrams. For us, after
the analysis of so many scattergrams in the previous sections, this one is not a difficult one.
It is clear that the underlying regression is a unimodal and symmetric around 0.5 function
which resembles the Normal function. And this is a reasonable answer. Keeping in mind
that we have n = 500 observations (see the title), it was not a difficult task to visualize a
Normal-like regression. At the same time, if you are still confused with the regression and
the scattergram, your feeling is correct because the scattergram is not as simple as it looks.
We return to this diagram shortly.
Now let us look at the left-bottom diagram. It shows the E-estimate of the conditional
density f^{Y|X}(y|x) based on the H-sample. Let us explain how to analyze the estimate. Take
a vertical slice at a constant x = x_0; then the cut along the shown surface exhibits
the estimate f^{Y|X}(y|x_0) as a function in y. The surface indicates two pronounced ridges
with a valley between, and this yields a conditional density which, as a function in y, has
two pronounced modes.
Now let us return to the scattergram for the H-sample. Look one more time at the
scattergram and please pay attention to the following detail. Do you see a pronounced gap
between two clusters of circles? Actually, it looks like we have two unimodal regressions
shown in the same diagram. This is what the E-estimator sees in the H-sample and shows
us via the conditional density. By the way, here we have 500 observations; does it look like you see that many circles? If the answer is “no,” then you are not alone.
To appreciate the conditional density estimate and get a better understanding of the
scattergram, let us explain how the H-sample is generated. The response is Y = u(X) + ση + σ_N ε, where X is a Uniform random variable, η is a random variable with the Strata distribution, ε is a Normal random variable, these three variables are mutually independent, and u(x) is the Normal function. The variable η creates the special shape of the data (the two ridges).
Note that this type of data also may be created by a mixture of two underlying regressions.
After this discussion it becomes clear that for the H-sample at hand the regression function is not a very informative tool because it does not explain the data. The conditional density sheds better light on the data. The downside is that a larger sample size is needed and careful attention to the arguments of the E-estimator is critical here. For instance, note that the argument cTH is increased to 10 with the aim of removing relatively small Fourier coefficients. A rule of thumb, often recommended, is to choose cTH close to 2 ln(n).
Figure 4.5 Estimation of conditional density f^{Y|X}(y|x) for H-sample and M-sample with MAR responses. The simulation is explained in the text. In the right-top diagram circles and crosses show complete and incomplete cases, respectively. {The argument corn controls function u(x), cS controls the distribution of η, sigmaN controls σ_N, the string w controls the availability likelihood function while dwL and dwU control its lower and upper bounds.} [n = 500, corn = 2, sigma = 4, sigmaN = 0.5, cS = 4, w = "1-0.4*x", dwL = 0.5, dwU = 0.9, c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 10]
For reference, when n increases from 100 to 500, this recommended value increases from about 9 to 12. Some
experience is also needed in choosing other arguments (they are the same as in the previous
figures). Recall that figures in the book allow the reader to change their arguments, and
then via simulations to learn how to choose better parameters of E-estimators for different
statistical models.
The right column of the diagrams in Figure 4.5 shows us an M-sample produced by
MAR responses with P (A = 1|Y = y, X = x) = w(x) = max(dwL , min(dwU , 1 − 0.4x)).
Note that only N = 393 complete pairs, out of the n = 500 hidden ones, are available (see the title); here we lost a bit more than one-fifth of the responses. Further, note that responses with larger predictors are more likely to be missed, and this creates a biased scattergram.
Nonetheless, in the scattergram of the M-sample we still can see the two pronounced ridges
and a valley between them. The reader is advised to compare the two scattergrams and
analyze the effect of the missing responses. The E-estimate for the M-sample, shown in
the right-bottom diagram, is very impressive. The two ridges and the valley between them
are clearly exhibited and, despite the heavy missing for larger values of the predictor, the
estimate for larger x is as good as the one in the left-bottom diagram for the H-sample.
We may conclude that for the considered problem of MAR responses, the proposed
complete-case approach has worked out very nicely for the particular simulation. This may
not be the case in another simulation, some adjustments in arguments for other classes of
conditional densities may be beneficial, etc. Figure 4.5 allows us to explore all these issues
and gain necessary experience in dealing with the complicated problem of estimating a
conditional density with MAR responses.
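For readers who wish to generate data of this type on their own, here is a minimal R sketch of an M-sample with MAR responses; the availability likelihood w(x) = max(dwL, min(dwU, 1 − 0.4x)) and the parameters n, sigma, sigmaN are taken from the caption of Figure 4.5, while the regression function u(x) and the two-point variable eta are simplified stand-ins (assumptions) rather than the corner functions used by the book's software.

  # Minimal sketch: H-sample and M-sample with MAR responses.
  # w(x) follows the caption of Figure 4.5; u(x) and eta are simplified stand-ins.
  set.seed(2)
  n <- 500; sigma <- 4; sigmaN <- 0.5; dwL <- 0.5; dwU <- 0.9
  u   <- function(x) dnorm(x, mean = 0.5, sd = 0.1)   # a Normal-like regression (stand-in)
  X   <- runif(n)                                     # Uniform predictors on [0, 1]
  eta <- sample(c(-0.5, 0.5), n, replace = TRUE)      # two-point eta creates two ridges (stand-in)
  Y   <- u(X) + sigma * eta + sigmaN * rnorm(n)       # hidden responses of the H-sample
  w   <- pmax(dwL, pmin(dwU, 1 - 0.4 * X))            # availability likelihood w(x)
  A   <- rbinom(n, 1, w)                              # availability indicators
  AY  <- A * Y                                        # observed responses: 0 when missed
  N   <- sum(A)                                       # number of complete cases
  N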
MAR predictors. Let us describe the problem of estimation of a conditional density of
response Y given predictor X based on data with missing predictors. There is a hidden
sample (X1 , Y1 , A1 ), . . . , (Xn , Yn , An ) from the triplet (X, Y, A) of dependent random vari-
ables where A is a Bernoulli random variable which controls availability (not missing) of X.
If A_l = 1 then the corresponding predictor X_l is available and the case (X_l, Y_l) is complete,
otherwise Xl is missed and the case is incomplete because only the response Yl is available.
Formally, we may say that an available M-sample of size n from the triplet (AX, Y, A) is
(A_1 X_1, Y_1, A_1), . . . , (A_n X_n, Y_n, A_n). The considered missing mechanism is MAR and the availability likelihood is

P(A = 1|X = x, Y = y) = P(A = 1|Y = y) =: w(y).   (4.4.8)
Two remarks are due about the model. First, while in general X and A are dependent
random variables, according to (4.4.8) they are conditionally independent given the response
Y . Second, the mechanism of generating the H-sample and M-sample is the same as in
Section 4.3 only here the problem is to estimate the conditional density f Y |X (y|x) of the
response given the predictor.
We begin our discussion of an appropriate E-estimator of the conditional density for
the case of continuous (X, Y ) supported on [0, 1]2 , and it is also assumed that the design
density f X (x) ≥ c∗ > 0.
The following formula is valid for the joint mixed density of the triplet (AX, Y, A),

f^{AX,Y,A}(ax, y, a) = [w(y) f^X(x) f^{Y|X}(y|x)]^a [(1 − w(y)) f^Y(y)]^{1−a}.   (4.4.9)
Recall that {ϕ_{j_1 j_2}(x, y)} is the cosine tensor-product basis on [0, 1]^2 defined in Section 2.5. If the marginal density f^X(x) and the availability likelihood function w(y) are known, then according to (4.4.9) the following sample mean estimator of Fourier coefficients may be recommended,
θ̄_{j_1 j_2} := n^{−1} ∑_{l=1}^n A_l ϕ_{j_1 j_2}(A_l X_l, Y_l) / [f^X(A_l X_l) w(Y_l)].   (4.4.11)
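To make formula (4.4.11) concrete, here is a minimal R sketch that computes θ̄_{j_1 j_2} when f^X and w are known; it assumes that the cosine basis of Section 2.5 is ϕ_0(x) = 1, ϕ_j(x) = √2 cos(πjx), and the simulated M-sample with MAR predictors is purely illustrative.

  # Sketch of the Fourier estimator (4.4.11) for known f^X and w.
  # Assumed cosine basis: phi_0(x) = 1, phi_j(x) = sqrt(2) * cos(pi * j * x).
  phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)

  set.seed(3)
  n  <- 1000
  X  <- runif(n)                                           # design density f^X is Uniform, so f^X(x) = 1
  Y  <- pmin(pmax(0.3 * X + 0.2 + 0.1 * rnorm(n), 0), 1)   # responses on [0, 1] (stand-in)
  w  <- function(y) pmax(0.5, pmin(0.9, 1 - 0.4 * y))      # availability likelihood of the predictor
  A  <- rbinom(n, 1, w(Y))                                 # A = 1 means the predictor is available
  AX <- A * X                                              # observed predictors: 0 when missed
  fX <- function(x) rep(1, length(x))                      # known design density (Uniform)

  theta.bar <- function(j1, j2)                            # formula (4.4.11)
    mean(A * phi(j1, AX) * phi(j2, Y) / (fX(AX) * w(Y)))
  theta.bar(1, 1); theta.bar(2, 3)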
Let us check that the Fourier estimator is unbiased. Using (4.4.9) we can write,

E{θ̄_{j_1 j_2}} = E{A ϕ_{j_1 j_2}(AX, Y) / [f^X(AX) w(Y)]} = ∫_{[0,1]^2} f^{Y|X}(y|x) ϕ_{j_1 j_2}(x, y) dx dy = θ_{j_1 j_2}.

Further, the asymptotic variance of the Fourier estimator satisfies

lim_{n, j_1, j_2 → ∞} [n V(θ̄_{j_1 j_2})] = ∫_{[0,1]^2} f^{Y|X}(y|x) / [f^X(x) w(y)] dx dy.   (4.4.13)
This formula shows how the availability likelihood (the missing mechanism) affects estima-
tion of the conditional density.
In general the functions f^X(x) and w(y) are unknown and should be estimated. The avail-
ability likelihood w(y) is estimated by the Bernoulli regression E-estimator ŵ(y) of Section
2.4 based on n realizations of (Y, A). To estimate the design density f X (x), we note that
its Fourier coefficients are
κ_j := ∫_0^1 f^X(x) ϕ_j(x) dx = E{A ϕ_j(AX) / w(Y)},   (4.4.14)
where the last equality holds due to (4.4.9). This yields the plug-in sample mean estimator
of κj ,
κ̂_j := n^{−1} ∑_{l=1}^n A_l ϕ_j(A_l X_l) / max(ŵ(Y_l), c/ln(n)).   (4.4.15)
This Fourier estimator yields the density E-estimator fˆX (x), x ∈ [0, 1] of Section 2.2.
Now we can plug the obtained estimators in (4.4.11) and get the plug-in sample mean
estimator of Fourier coefficients of the conditional density,
θ̂_{j_1 j_2} := n^{−1} ∑_{l=1}^n A_l ϕ_{j_1 j_2}(A_l X_l, Y_l) / {[max(f̂^X(A_l X_l), c/ln(n))] [max(ŵ(Y_l), c/ln(n))]}.   (4.4.16)
This Fourier estimator yields the conditional density E-estimator for the considered
model of MAR predictors.
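The plug-in steps (4.4.15) and (4.4.16) may be sketched in R as follows; here w.hat and fX.hat are hypothetical placeholders for the E-estimates of Sections 2.4 and 2.2 (which are not reproduced), so the example only illustrates how the formulas are assembled.

  # Sketch of the plug-in Fourier estimators (4.4.15) and (4.4.16) for MAR predictors.
  # w.hat and fX.hat are placeholders for the E-estimates of Sections 2.4 and 2.2.
  phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)

  kappa.hat <- function(j, A, AX, Y, w.hat, cc, n)                 # formula (4.4.15)
    mean(A * phi(j, AX) / pmax(w.hat(Y), cc / log(n)))
  theta.hat <- function(j1, j2, A, AX, Y, fX.hat, w.hat, cc, n)    # formula (4.4.16)
    mean(A * phi(j1, AX) * phi(j2, Y) /
         (pmax(fX.hat(AX), cc / log(n)) * pmax(w.hat(Y), cc / log(n))))

  # Example call with crude placeholders (illustration only):
  set.seed(4); n <- 1000
  X <- runif(n); Y <- runif(n)
  A <- rbinom(n, 1, pmax(0.5, pmin(0.9, 1 - 0.4 * Y)))
  AX <- A * X
  w.hat  <- function(y) pmax(0.5, pmin(0.9, 1 - 0.4 * y))   # pretend w(y) is estimated perfectly
  fX.hat <- function(x) rep(1, length(x))                   # pretend f^X is estimated perfectly
  kappa.hat(1, A, AX, Y, w.hat, cc = 1, n = n)
  theta.hat(1, 1, A, AX, Y, fX.hat, w.hat, cc = 1, n = n)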
If the support of (X, Y ) is unknown, then we use our traditional rescaling onto the unit
square.
Figure 4.6 helps us to understand the model and how the E-estimator performs. The
simulation of an H-sample is the same as in Figure 4.5, and a particular H-sample is shown
in the left-top diagram. Further, the right-top diagram shows the conditional density E-
estimate based on the H-sample. After our analysis of Figure 4.5, we can plainly recognize the
two ridges and the valley between them.
The left-middle diagram shows the M-sample generated by MAR predictors with the
availability likelihood function shown in the right-middle diagram by triangles. Note that
in the left-middle diagram incomplete cases are shown by crosses. The title shows that from
n hidden cases only N = 408 are complete in the M-sample.
These two diagrams show us the data. Now let us explain how the proposed E-estimator
performs. The right-middle diagram shows us by circles the Bernoulli scattergram of n
realizations of (Y, A), and then triangles and crosses show the underlying values of w(Yl )
and their E-estimates ŵ(Yl ), l = 1, 2, . . . , n. As we see, the E-estimate is almost perfect. The
left-bottom diagram shows us by triangles and crosses the design density f X (Xl ) and its
E-estimate fˆX (Xl ), respectively. The estimate is perfect. Finally, the right-bottom diagram
shows us the proposed E-estimate based on the M-sample. Overall, the E-estimate is on par
with (but not as good as) the reference E-estimate based on the H-sample and exhibited in
the right-top diagram.
We may conclude that: (i) A data-driven estimation of a conditional density for a model
with MAR predictors is a feasible task; (ii) E-estimate requires estimation of several nuisance
Figure 4.6 Estimation of conditional density f^{Y|X}(y|x) based on data with MAR predictors. Underlying simulation and the structure of diagrams exhibiting scattergrams and estimates of conditional densities are the same as in Figure 4.5, only here the missing is defined by the response. The left-bottom and right-middle diagrams show by triangles and crosses the underlying functions and their estimates, respectively. {The string w controls w(y) where y is the value of the response rescaled onto [0,1].} [n = 500, corn = 2, sigma = 4, sigmaN = 0.5, cS = 4, w = "1-0.4*y", dwL = 0.2, dwU = 0.9, c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 10]
functions (the design density and the availability likelihood) that may be of interest on their
own; (iii) Due to the curse of multidimensionality, a relatively large number N of complete
cases is required for a reliable estimation. In turn, this implies a large sample size n; (iv)
Arguments of E-estimator require an adjustment to take into account the bivariate nature
of a conditional density.
It is highly recommended to repeat Figures 4.5 and 4.6 with different parameters and
gain necessary experience in dealing with the complicated problem of estimation of the
conditional density.
4.5 Poisson Regression with MAR Data
So far in this chapter we have considered regression models with continuous responses.
These models are very popular in applied statistics. At the same time, there are many
applications where responses are discrete random variables. For instance, we considered a
Bernoulli regression in Section 2.4 when response takes only two values.
In this section we are considering a so-called Poisson regression when the conditional
distribution of the response given the predictor is Poisson. First, let us recall the notion of
a Poisson random variable. A random variable Y taking on one of the values 0, 1, 2, . . .
is said to be a Poisson random variable with parameter m > 0 if P(Y = k) = e^{−m} m^k / k!,
k = 0, 1, . . . . For the Poisson random variable we have E{Y} = V(Y) = m. The reader may
recall from a standard probability class several customary examples of random variables
that obey the Poisson probability law: The number of misprints on a page of a book; the
number of wrong telephone numbers that are dialed in a day; the number of customers
entering a shopping mall on a given day; the number of α-particles discharged in a fixed
period of time from some radioactive material; and the number of earthquakes occurring
during some fixed time span.
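A quick Monte Carlo check of the stated moments E{Y} = V(Y) = m can be done in R.

  # Check that a Poisson(m) random variable has mean and variance equal to m.
  set.seed(5)
  m <- 3.7
  y <- rpois(1e5, lambda = m)                 # a large Monte Carlo sample
  c(mean = mean(y), var = var(y), m = m)      # sample mean and variance are both close to m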
Classical probability theory considers m as a constant. At the same time, in all the above-mentioned examples the parameter m may depend on another given variable (predictor) X. In this case the function m(x) may be considered as a regression function because

P(Y = k|X = x) = e^{−m(x)} [m(x)]^k / k!,   k = 0, 1, . . .   (4.5.1)

In other words, given X = x the response Y has a Poisson distribution with parameter m(x). This definition immediately implies that

m(x) = E{Y|X = x}.   (4.5.2)
Before explaining the case of missing data, let us propose an E-estimator for the Poisson
regression based on a hidden H-sample (X1 , Y1 ), . . . , (Xn , Yn ) from the pair (X, Y ) where
X is the predictor which is a continuous random variable supported on [0, 1] and f X (x) ≥
c∗ > 0, x ∈ [0, 1], and given X = x the response Y is a Poisson random variable with the
mean m(x). The aim is to estimate the regression function m(x).
Using our methodology of construction of a regression E-estimator, we need to find a
(possibly plug-in) sample mean estimator of Fourier coefficients
θ_j := ∫_0^1 m(x) ϕ_j(x) dx.   (4.5.3)
Write,

θ_j = E{m(X) ϕ_j(X) / f^X(X)} = E{E{[Y ϕ_j(X)/f^X(X)] | X}} = E{Y ϕ_j(X) / f^X(X)}.   (4.5.4)
If the design density f X is known, then we immediately get the sample mean estimator of
θj ,
θ̄_j := n^{−1} ∑_{l=1}^n Y_l ϕ_j(X_l) / f^X(X_l).   (4.5.5)
If the design density f^X is unknown, then it may be estimated by the density E-estimator
fˆX of Section 2.2 based on observations X1 , . . . , Xn . This yields the plug-in sample mean
estimator of Fourier coefficient θj ,
θ̃_j := n^{−1} ∑_{l=1}^n Y_l ϕ_j(X_l) / max(f̂^X(X_l), c/ln(n)).   (4.5.6)
The Fourier estimator yields the regression E-estimator m̃(x), x ∈ [0, 1], for the case when an H-sample is available.
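The following R sketch assembles the plug-in estimator (4.5.6) for an H-sample; the cosine basis of Section 2.5 is assumed to be ϕ_0(x) = 1, ϕ_j(x) = √2 cos(πjx), and a kernel density estimate is used as a stand-in for the density E-estimator of Section 2.2.

  # Sketch of the plug-in Fourier estimator (4.5.6) for Poisson regression (H-sample).
  # A kernel estimate replaces the density E-estimator of Section 2.2 (a stand-in).
  phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)

  set.seed(6)
  n <- 200; cc <- 1
  m <- function(x) 2 + 2 * exp(-((x - 0.5) / 0.2)^2)   # a Normal-like regression (stand-in)
  X <- rbeta(n, 1.2, 1)                                 # an increasing design density (stand-in)
  Y <- rpois(n, lambda = m(X))                          # Poisson responses given X

  dens   <- density(X, from = 0, to = 1)                # kernel stand-in for f^X
  fX.hat <- approxfun(dens$x, dens$y)
  theta.tilde <- function(j)                            # formula (4.5.6)
    mean(Y * phi(j, X) / pmax(fX.hat(X), cc / log(n)))
  sapply(0:3, theta.tilde)                              # estimated Fourier coefficients of m(x)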
Now we are considering a setting with missing realizations of a Poisson random variable. The model is as follows. There is a hidden sample (X_1, Y_1, A_1), . . . , (X_n, Y_n, A_n) of size n from the triplet (X, Y, A) where A is a Bernoulli random variable, called the availability, and the availability likelihood is

P(A = 1|X = x, Y = y) = P(A = 1|X = x) =: w(x).   (4.5.7)
Then the observed M-sample is a sample (X1 , A1 Y1 , A1 ), . . . , (Xn , An Yn , An ) from the triplet
(X, AY, A).
Note that the Poisson Yl is not available if Al = 0 and we observe a complete case (Xl , Yl )
if Al = 1. Furthermore, (4.5.7) implies that the missing is MAR (missing at random) because
the probability of missing Y depends on the value of the always observed predictor X.
To propose a sample mean estimator of Fourier coefficients (4.5.3) of the regression
function of interest m(x), x ∈ [0, 1], we begin with a formula for the joint density of pair
(X, Y ) in a complete case of an M-sample. This joint density is the conditional density of
the pair given A = 1,
f^{X,Y|A}(x, y|1) = f^{X,Y,A}(x, y, 1) / P(A = 1) = f^{X,Y}(x, y) P(A = 1|X = x, Y = y) / P(A = 1) = f^{X,Y}(x, y) w(x) / P(A = 1).   (4.5.8)
Note that the density is biased with the biasing function being w(x).
Integrating (4.5.8) with respect to y we get a formula for the marginal density of X in
a complete case of M-sample,
f^{X|A}(x|1) = f^X(x) w(x) / P(A = 1).   (4.5.9)
This marginal density is also biased with the same biasing function w(x). However, combining (4.5.8) and (4.5.9) we get the following pivotal result,

f^{Y|X,A}(y|x, 1) = f^{X,Y|A}(x, y|1) / f^{X|A}(x|1) = f^{Y|X}(y|x).   (4.5.10)

We conclude that the conditional density of the response given the predictor in a complete case of an M-sample is equal to the underlying conditional density of the response given the predictor in the underlying H-sample.
This conclusion immediately implies that the above-defined regression estimator m̃(x),
developed for an H-sample, also may be used for consistent regression estimation for the
considered case of missing data whenever the estimator is based on complete cases in an
M-sample.
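In R, the complete-case approach amounts to restricting the above estimator to the cases with A_l = 1 and estimating the design density from the complete-case predictors; the sketch below assumes the same cosine basis and again uses a kernel density estimate as a stand-in for the E-estimator.

  # Sketch: complete-case Poisson regression with MAR responses.
  # The H-sample estimator (4.5.6) is applied to the subsample of complete cases.
  phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)

  set.seed(7)
  n <- 300; cc <- 1
  m <- function(x) 2 + 2 * exp(-((x - 0.5) / 0.2)^2)     # regression function (stand-in)
  X <- runif(n)
  Y <- rpois(n, lambda = m(X))
  w <- pmax(0.5, pmin(0.9, 1 - 0.4 * X))                 # availability likelihood, as in Figure 4.7
  A <- rbinom(n, 1, w)

  Xc <- X[A == 1]; Yc <- Y[A == 1]; N <- length(Xc)      # complete cases only
  dens    <- density(Xc, from = 0, to = 1)               # density of X among complete cases
  fXc.hat <- approxfun(dens$x, dens$y)
  theta.tilde <- function(j)
    mean(Yc * phi(j, Xc) / pmax(fXc.hat(Xc), cc / log(N)))
  sapply(0:3, theta.tilde)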
Figure 4.7 illustrates the setting of Poisson regression and how the E-estimator performs.
The left column of diagrams illustrates the case of the Normal being the regression function
m(x) = E{Y |X = x}. First of all, let us look at the scattergram for the H-sample. Note
how special the scattergram for the discrete Poisson response is. This is because a Poisson
Figure 4.7 Poisson regression with MAR responses. Two simulations with different regression functions are shown in the two columns of diagrams. Scattergrams are shown by circles, while for M-samples incomplete cases are shown by crosses. Underlying regressions and their E-estimates are shown by solid and dashed lines, respectively. Titles show the sample size n, the number N = ∑_{l=1}^n A_l of complete cases, and integrated squared errors (ISE). Because it is known that estimated regression functions are nonnegative (Poisson random variables are nonnegative), a projection on nonnegative functions is used. {The arguments are: set.c controls two underlying regression functions for the left and right columns; n controls the sample size; desden controls the shape of the design density f^X(x) which is then truncated from below by the value dden and rescaled to get a bona fide density; the availability likelihood function w(x) is chosen by the string w and then truncated from below by dwL and from above by dwU.} [n = 100, set.c = c(2,3), dden = 0.2, desden = "0.7+0.4*x", w = "1-0.4*x", dwL = 0.5, dwU = 0.9, c = 1, cJ0 = 3, cJ1 = 0.8, cTH = 4]
random variable takes on only nonnegative integer values. The E-estimate is relatively good
here. It correctly shows location of the mode and the symmetric shape of the regression.
Now, please look at the circles one more time and answer the following question. Should the estimate be shifted to the right to better fit the data? The answer is likely “yes,” but this is an illusion created by the increasing design density. Indeed, note that there are more observations in the right half of the diagram than in the left one. The
left-bottom diagram shows the M-sample and the corresponding E-estimate based only on
circles (complete cases). Note that the number of complete cases is N = 84. The estimate
is definitely worse, and the reader is advised to look at the 16 missed responses and explain
why they so dramatically affected the estimate.
The right column of the diagrams shows two similar diagrams for the case of the un-
derlying regression being the Bimodal. Here again it is of interest to compare the two
scattergrams and understand why the E-estimate for the M-sample is worse. The obvious possible reason is that the sample size decreased from 100 to 83, but this alone cannot cause the almost doubled ISE. The reason is the missing of several “strategic” responses, and the reader is asked to find them.
It is recommended to repeat Figure 4.7 with different arguments to get better under-
standing of the Poisson regression with MAR responses.
4.6 Estimation of the Scale Function with MAR Responses

Consider the heteroscedastic regression model Y = m(X) + σ(X)ε. Here ε is a zero mean and unit variance random variable (the regression error) which is independent of the predictor X. The predictor X is supported on [0, 1] and f^X(x) ≥ c_* > 0. The nonnegative function σ(x) is called the scale (spread or volatility) function. The problem of Section 3.6 was to estimate the scale function σ(x) based on a sample from (X, Y). Using our terminology for missing data, we can say that in Section 3.6 estimation for the case of an H-sample was considered.
Let us describe an M-sample with MAR responses. We observe an M-sample (X_1, A_1 Y_1, A_1), . . . , (X_n, A_n Y_n, A_n) of size n from the triplet (X, AY, A). Here A is a Bernoulli random variable which is the indicator that the response is available (not missed), and the availability likelihood is

P(A = 1|X = x, Y = y) = P(A = 1|X = x) =: w(x).
To propose an E-estimator of the scale function, we begin with several probability for-
mulas. The joint mixed density of the triplet (X, AY, A) is
f^{X,AY,A}(x, ay, a) = [w(x) f^X(x) f^{Y|X}(y|x)]^a [(1 − w(x)) f^X(x)]^{1−a}.   (4.6.3)
This is the pivotal result that will allow us to propose an estimator based exclusively
on complete cases in an M-sample. Indeed, we know from Section 4.2 that the regression
function m(x) and the density f X|A (x|1) may be estimated by corresponding E-estimators
based on complete cases. Further, the product nP(A = 1) can be estimated by the sample mean estimator N := ∑_{l=1}^n A_l, which is the number of complete cases, and recall our
discussion in Section 4.1 of the remedy for the event N = 0. Then we plug these estimators
in (4.6.9) and get the plug-in sample-mean Fourier estimator based on the subsample of
complete cases,
θ̂_j = N^{−1} ∑_{l: A_l = 1, l = 1,...,n} A_l (A_l Y_l − m̂(X_l))^2 ϕ_j(X_l) / max(f̂^{X|A}(X_l|1), c/ln(n)).   (4.6.10)
The Fourier estimator yields the E-estimator of the squared scale, and then taking the square root gives us the desired scale E-estimator σ̂(x).
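The complete-case Fourier estimator (4.6.10) may be sketched in R as follows; the regression estimate m.hat and the density estimate fXA.hat are passed in as functions and here are replaced by simple stand-ins (a known regression line and a kernel density estimate), since the actual E-estimators are not reproduced.

  # Sketch of the complete-case Fourier estimator (4.6.10) for the squared scale.
  # m.hat and fXA.hat stand for the regression and density E-estimators of Section 4.2.
  phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)

  theta.hat.scale <- function(j, X, AY, A, m.hat, fXA.hat, cc = 1) {
    n <- length(A); N <- sum(A); idx <- which(A == 1)     # complete cases only
    sum((AY[idx] - m.hat(X[idx]))^2 * phi(j, X[idx]) /
        pmax(fXA.hat(X[idx]), cc / log(n))) / N           # formula (4.6.10)
  }

  # Example with simple stand-ins (illustration only):
  set.seed(8); n <- 300
  X  <- runif(n)
  sc <- function(x) 1 + dnorm(x, 0.5, 0.15) / 3           # scale function (stand-in)
  Y  <- 2 * X + sc(X) * rnorm(n)
  A  <- rbinom(n, 1, pmax(0.5, pmin(0.9, 1 - 0.4 * X)))
  AY <- A * Y
  m.hat   <- function(x) 2 * x                            # pretend the regression is known
  dens    <- density(X[A == 1], from = 0, to = 1)
  fXA.hat <- approxfun(dens$x, dens$y)                    # kernel stand-in for f^{X|A}(x|1)
  theta.hat.scale(0, X, AY, A, m.hat, fXA.hat)            # roughly the integral of sigma^2(x)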
Figure 4.8 explains the setting, sheds light on performance of the complete-case scale
E-estimator, and its caption explains the diagrams. We begin with the top diagram. For
now, ignore titles and curves, and pay attention to the observations. Circles show N = 77
complete cases and crosses show n − N = 23 incomplete cases. Probably the first impression from the scattergram is that there is large volatility in the middle of the support of X and that, furthermore, many missed responses have predictors from this set. The large volatility makes
the problem of visualization of an underlying regression function complicated. Now let us
check how the E-estimator solved the problem. The diagram shows the underlying regression
and scale functions by the dotted and solid lines, respectively. The regression E-estimate
is shown by the dot-dashed line. The estimate is significantly lower in the middle of the
support (and this could be predicted from the data) but it does exhibit the unimodal and
symmetric character of the Normal. The scale estimate (the dashed line), despite the poor
regression estimate, is a surprisingly fair estimate with respect to its regression counterpart.
The bottom diagram exhibits a similar simulation, only here the more complicated Bimodal defines the scale function. Here we have large volatility within the interval [0.6, 0.8]. The regression E-estimate is good, and the scale estimate is also relatively good apart from the left tail, which reflects the observed volatility of the responses with the smallest predictors.
It is recommended to repeat Figure 4.8 with different parameters and learn more about
this interesting and practically important problem. Overall, repeated simulations conducted
by Figure 4.8 indicate that the proposed scale E-estimator for regression with MAR re-
sponses is feasible. The asymptotic theory also supports the proposed complete-case ap-
proach.
Figure 4.8 Estimation of the scale function for regression with MAR responses. Each diagram exhibits an M-sample generated by the Uniform design density, the Normal regression function, and the scale function equal to 1 plus a corner function, indicated in the title, and then the total is multiplied by the parameter σ. In a scattergram, incomplete cases are shown by crosses and complete cases by circles, respectively. Each diagram also shows the underlying regression function (the dotted line), the regression E-estimate (the dot-dashed line), the underlying scale function (the solid line), and the scale E-estimate (the dashed line). {Underlying regression functions are controlled by the argument set.corn, scale functions are controlled by arguments set.scalefun and sigma, the availability likelihood is max(min(w(x), dwU), dwL).} [n = 100, set.corn = c(2,2), sigma = 1, set.scalefun = c(2,3), w = "1-0.4*x", dwL = 0.5, dwU = 0.9, cJ0 = 4, cJ1 = 0.5, cTH = 4]
4.7 Bivariate Regression with MAR Responses

Consider the bivariate regression model

Y = m(X_1, X_2) + σ(X_1, X_2)ε.   (4.7.1)

Here Y is the response, σ(x_1, x_2) is the bivariate scale function, and ε is a zero mean and unit variance regression error that is independent of (X_1, X_2). The pair (X_1, X_2) is the vector-predictor (the two random variables may be dependent, and in what follows we may refer to them as covariates) with the design density f^{X_1 X_2}(x_1, x_2) supported on the unit square [0, 1]^2 and f^{X_1 X_2}(x_1, x_2) ≥ c_* > 0, (x_1, x_2) ∈ [0, 1]^2.
Then a classical regression problem is to estimate the bivariate regression function (surface) m(x_1, x_2) based on a sample (Y_1, X_{11}, X_{21}), . . . , (Y_n, X_{1n}, X_{2n}) from the triplet (Y, X_1, X_2).
Using the approach of Section 2.5 we need to propose a sample mean estimator of the Fourier coefficients

θ_{j_1 j_2} := ∫_{[0,1]^2} m(x_1, x_2) ϕ_{j_1 j_2}(x_1, x_2) dx_1 dx_2.   (4.7.2)
If the design density f X1 ,X2 (x1 , x2 ) is known, then the sample mean estimator of θj1 j2
is
θ̌_{j_1 j_2} := n^{−1} ∑_{l=1}^n Y_l ϕ_{j_1 j_2}(X_{1l}, X_{2l}) / f^{X_1 X_2}(X_{1l}, X_{2l}).   (4.7.4)
If the design density is unknown, then its E-estimator f̂^{X_1 X_2}(x_1, x_2), defined in Section 2.5, may be plugged in (4.7.4). This Fourier estimator yields the bivariate regression E-estimator for the case of an H-sample. Further, the case of k covariates is treated absolutely similarly, only a corresponding k-variate tensor-product basis should be used in place of the bivariate basis.
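To illustrate formula (4.7.4), the following R sketch computes θ̌_{j_1 j_2} for a simulated H-sample with a known Uniform design density on [0,1]^2; the cosine tensor-product basis ϕ_{j_1 j_2}(x_1, x_2) = ϕ_{j_1}(x_1)ϕ_{j_2}(x_2) is assumed and the regression surface is a stand-in.

  # Sketch of the Fourier estimator (4.7.4) for bivariate regression (H-sample).
  # The design density is known and Uniform on [0,1]^2, so f^{X1 X2} = 1.
  phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)

  set.seed(9)
  n  <- 400
  X1 <- runif(n); X2 <- runif(n)
  m  <- function(x1, x2) 4 * dnorm(x1, 0.5, 0.15) * dnorm(x2, 0.5, 0.15) / 7   # surface (stand-in)
  Y  <- m(X1, X2) + rnorm(n)
  fX1X2 <- function(x1, x2) rep(1, length(x1))             # known design density

  theta.check <- function(j1, j2)                          # formula (4.7.4)
    mean(Y * phi(j1, X1) * phi(j2, X2) / fX1X2(X1, X2))
  outer(0:2, 0:2, Vectorize(theta.check))                  # a small table of estimated coefficients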
Now we are in a position to describe a sampling with MAR responses. We observe
an M-sample (X11 , X21 , A1 Y1 , A1 ), . . . , (X1n , X2n , An Yn , An ) of size n from the quartet
(X1 , X2 , AY, A). The triplet (X1 , X2 , Y ) is the same as in the above-described regression
(4.7.1), A is a Bernoulli random variable which is the indicator that the response is available
(not missed), and the availability likelihood is
P(A = 1|X_1 = x_1, X_2 = x_2, Y = y) = P(A = 1|X_1 = x_1, X_2 = x_2) =: w(x_1, x_2).   (4.7.5)

Further, the density of the predictors in a complete case of the M-sample is

f^{X_1,X_2|A}(x_1, x_2|1) = f^{X_1,X_2,A}(x_1, x_2, 1) / P(A = 1).
Now consider the case of unknown design density and availability likelihood. Using
(4.7.7) we note that the denominator in (4.7.11) is
w(X_{1l}, X_{2l}) f^{X_1 X_2}(X_{1l}, X_{2l}) = P(A = 1) f^{X_1 X_2|A}(X_{1l}, X_{2l}|1).   (4.7.12)
Set N := ∑_{l=1}^n A_l and note that this is the total number of complete cases in the M-sample. Recall our discussion in Section 4.2 about the remedy for the event N = 0, and also
note that we must restrict our attention to samples with N > k and k being a relatively
large integer for a feasible estimation of a bivariate regression function. The sample mean
estimator of P(A = 1) is N/n and this, together with (4.7.11) and (4.7.12), yields the
following plug-in sample mean estimator of θj1 j2 ,
θ̂_{j_1 j_2} := N^{−1} ∑_{l=1}^n A_l Y_l ϕ_{j_1 j_2}(X_{1l}, X_{2l}) / max(f̂^{X_1 X_2|A}(X_{1l}, X_{2l}|1), c/ln(n)).   (4.7.13)
Here fˆX1 X2 |A (X1l , X2l |1) is the density E-estimator of Section 2.5 based on complete cases
in the M-sample.
Now, if we look at the estimator (4.7.13) one more time, then it is plain to realize that
it is based solely on complete cases, and even the sample size n does not need to be known
if we replace c/ ln(n) by c/ ln(N ).
The Fourier estimator (4.7.13) yields the regression E-estimator m̂(x1 , x2 ) for the case
of MAR responses.
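A complete-case version of this computation, following (4.7.13), is sketched below; fhat is a hypothetical placeholder for the bivariate density E-estimate of Section 2.5 based on complete cases, so the example only shows how the pieces of the formula fit together.

  # Sketch of the complete-case estimator (4.7.13) for bivariate regression with MAR responses.
  # fhat is a placeholder for the density E-estimate f^{X1 X2|A}(x1, x2|1) of Section 2.5.
  phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)

  theta.hat.biv <- function(j1, j2, X1, X2, AY, A, fhat, cc = 1) {
    n <- length(A); N <- sum(A)
    sum(A * AY * phi(j1, X1) * phi(j2, X2) /
        pmax(fhat(X1, X2), cc / log(n))) / N               # formula (4.7.13)
  }

  # Example with a crude placeholder density (illustration only):
  set.seed(10); n <- 400
  X1 <- runif(n); X2 <- runif(n)
  Y  <- 4 * dnorm(X1, 0.5, 0.15) * dnorm(X2, 0.5, 0.15) / 7 + rnorm(n)
  A  <- rbinom(n, 1, pmax(0.5, pmin(0.9, 1 - 0.6 * X1 * X2)))   # w as in Figure 4.9
  AY <- A * Y
  fhat <- function(x1, x2) rep(1, length(x1))              # crude placeholder for f^{X1 X2|A}
  theta.hat.biv(1, 1, X1, X2, AY, A, fhat)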
Let us check how the proposed complete-case regression E-estimator performs for a
particular simulation.
The top diagram in Figure 4.9 exhibits both the H-sample and M-sample. Dots and stars
show 100 realizations of the hidden triplet (X1 , X2 , Y ) with predictors shown as X1 and
X2. The sample size is small and this allows us to visualize all predictors and responses.
Due to the missing mechanism, some responses in the H-sample are missed, and those
observations are shown by stars. In other words, for realizations shown by stars only their
(X1, X2) coordinates are available in the M-sample. The title indicates that the M-sample
has N = 85 complete cases and 15 cases are incomplete. Now let us look more closely at the
scattergram. First, here we have n = 100 observations, but do you have a feeling that there
are 100 vertical lines? It is difficult to believe that we see so many vertical lines. Also, there
Figure 4.9 Bivariate regression with MAR responses. The top diagram is a 3-dimensional scatterplot that shows us the H- and M-samples. The dots show complete cases and the stars show hidden cases in which the responses are missed. The underlying bivariate regression function m(X1, X2) and its E-estimates based on the H-sample and M-sample are exhibited in the second row of diagrams. The H-sample is generated as Y = m(X1, X2) + σε where ε is standard normal, and X1 and X2 are the Uniform. {Underlying regression function m(x1, x2) is the product of the two corner functions whose choice is controlled by the argument set.corn, the availability likelihood is max(min(w(x1, x2), dwU), dwL) and w(x1, x2) is defined by the string w.} [n = 100, set.corn = c(2,2), sigma = 1, w = "1-0.6*x1*x2", dwL = 0.5, dwU = 0.9, cJ0 = 4, cJ1 = 0.5, cTH = 4]
are relatively large empty spaces on the square with no observations at all, and therefore no
information about an underlying regression surface for those spots is available. Further, for
instance in the subsquare [0.8, 1] × [0, 0.2] there is just one predictor (look at the vertical line with coordinates around (0.95, 0.2)). This is what makes the bivariate regression problem
so complicated. Note that the scattergram explains complications of a multivariate setting
better than words and theorems.
The underlying bivariate regression function is shown in the left-bottom diagram; it is
the Normal by the Normal. Its E-estimates, based on the H-sample and the M-sample, are
shown in the middle-bottom and right-bottom diagrams. We may notice that the estimates
are wider than the underlying regression surface, but overall they do show the symmetric
and bell-type shape of the underlying bivariate regression function.
It takes time and practice to get used to the multivariate regression problem, and Figure
4.9 is a good tool to get this experience. The advice is not to use large sample sizes because
then it will be difficult to analyze 3-dimensional scattergrams.
4.8 Additive Regression with MAR Responses

Consider the additive regression model

Y = β + ∑_{k=1}^d m_k(X_k) + σε,   (4.8.1)

where the vector-predictor Z := (X_1, . . . , X_d) has a design density f^Z(z) supported on [0, 1]^d, ε is a zero mean and unit variance error independent of Z, and the additive components satisfy

∫_0^1 m_k(x) dx = 0,   k = 1, . . . , d.   (4.8.2)

Introduce the Fourier coefficients of the components,

θ_{kj} := ∫_0^1 m_k(x) ϕ_j(x) dx.   (4.8.3)

Due to (4.8.2) we have θ_{k0} = 0, and this explains why we need to estimate only Fourier coefficients with j ≥ 1.
As usual, to propose a sample mean estimator for θ_{kj} we need to write down the Fourier coefficient as the expectation of a function of observed random variables. Let us look at the following expectation,

E{Y ϕ_j(X_k)/f^Z(Z)} = E{(β + ∑_{s=1}^d m_s(X_s) + σε) ϕ_j(X_k)/f^Z(Z)}
= ∫_{[0,1]^d} [β + ∑_{s=1}^d m_s(x_s)] ϕ_j(x_k) dx_1 ··· dx_d.   (4.8.4)
Recall that {ϕ_j(x)} is the basis on [0, 1], and hence for any j ≥ 1 and s ≠ k we have

∫_0^1 m_s(x_s) ϕ_j(x_k) dx_k = m_s(x_s) ∫_0^1 ϕ_j(x_k) dx_k = 0.   (4.8.5)

This is the place where the assumed j ≥ 1 is critical because ∫_0^1 ϕ_0(x) dx = 1. Using (4.8.5) we continue (4.8.4),

E{Y ϕ_j(X_k)/f^Z(Z)} = ∫_0^1 m_k(x_k) ϕ_j(x_k) dx_k = θ_{kj}.   (4.8.6)
We obtained the desired expression for θkj , and it immediately implies the sample mean
estimator
θ̄_{kj} = n^{−1} ∑_{l=1}^n Y_l ϕ_j(X_{kl}) / f^Z(Z_l).   (4.8.7)
Now we need to explain how to estimate the constant β in (4.8.1). Again we are thinking
about a sample mean estimator. Write,
β = ∫_{[0,1]^d} [β + ∑_{k=1}^d m_k(x_k)] dx_1 ··· dx_d = E{Y / f^Z(Z)}.   (4.8.8)
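The sample mean estimators (4.8.7) and (4.8.8) may be sketched in R for d = 2 and a known Uniform design density; the zero-integral additive components below are simple stand-ins and the cosine basis of Section 2.5 is assumed.

  # Sketch of the estimators (4.8.7) and (4.8.8) for additive regression (H-sample, d = 2).
  # Independent Uniform predictors are used, so the design density f^Z equals 1.
  phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)

  set.seed(11)
  n  <- 400; beta <- 1
  m1 <- function(x) cos(pi * x)                 # zero-integral component (stand-in)
  m2 <- function(x) 0.5 * cos(2 * pi * x)       # another zero-integral component (stand-in)
  X1 <- runif(n); X2 <- runif(n)
  Y  <- beta + m1(X1) + m2(X2) + rnorm(n)
  fZ <- rep(1, n)                               # known design density evaluated at Z_l

  beta.bar  <- mean(Y / fZ)                     # estimator of beta, formula (4.8.8)
  theta.bar <- function(k, j) {                 # estimator of theta_{kj}, formula (4.8.7)
    Xk <- if (k == 1) X1 else X2
    mean(Y * phi(j, Xk) / fZ)
  }
  c(beta.bar, theta.bar(1, 1), theta.bar(2, 2))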
For additive regression with MAR responses, the joint mixed density of the observed triplet (Z, AY, A) is

f^{Z,AY,A}(z, ay, a) = [w(z) f^Z(z) f^{Y|Z}(y|z)]^a [(1 − w(z)) f^Z(z)]^{1−a}.   (4.8.11)
The Fourier estimators yield E-estimators of the additive components in the regression.
Note that the proposed regression E-estimator is based on the complete-case approach.
Figure 4.10 sheds light on the setting and shows how the additive regression E-estimator
recovers additive components. Here d = 2 so we can visualize data via the 3-dimensional
scattergram. The main title shows that the sample size is n = 100 (a larger size would overcrowd the scattergram), the number of complete cases is N = ∑_{l=1}^n A_l = 76, and the underlying β = 1. The interesting feature of the design is the dependence between X_1 and X_2 explained in the caption, and note that in the diagrams we use X1 and X2 in place of X_1 and X_2, respectively. The underlying additive components are the Normal and the
Strata (note that 1 is always subtracted from these functions to satisfy (4.8.2)).
Let us look at the 3-dimensional scattergram. First, can you realize the underlying
additive components from this scattergram? It is apparently not a simple question even if
you know the components. Second, can you see that the predictors are dependent? To do
this, we need to compare the distributions of X2 for cases X1 < 0.5 and cases X1 ≥ 0.5. In
the first case the distribution of X2 is the Uniform, while in the second it is the Normal.
This is not a simple task for just n = 100 observations, but overall we can see that there
are fewer realizations of X2 near the boundaries for the second case than for the first one.
Now let us look at how the proposed E-estimator performs. The estimate of β for the
H-sample (denoted as H-beta.est) is shown in the subtitle of the left-bottom diagram and it
is 0.99. This is an excellent outcome because the underlying β = 1 (see the main title). For
the M-sample the estimate is 0.91; it is, of course, worse but not bad for a sample of size 76 with dependent covariates. Estimates of the first component are not perfect and worse than what we could get for the same sample sizes in univariate regressions. On the other hand, they correctly describe the bell-shaped component that is symmetric around 0.5. Further, can
you visualize this component in the scattergram? The answer is probably “no.” The second
component is estimated relatively well and we do see two strata.
It is highly recommended to repeat Figure 4.10 with different parameters and learn to
read scattergrams created by additive regressions. It is also a good exercise to compare
scattergrams produced by Figures 4.9 and 4.10.
Figure 4.10 Additive regression with MAR responses and dependent covariates. The top diagram is a 3-dimensional scattergram which, as explained in the caption of Figure 4.9, allows us to visualize the H- and M-samples. The underlying additive regression function is m(X1, X2) = β + m_1(X1) + m_2(X2) and Y = m(X1, X2) + σε where ε is standard normal. Function m_1(x) is the Normal minus 1, and function m_2(x) is the Strata minus 1. The marginal density of X1 is the Uniform while the conditional density of X2 is the mixture f^{X2|X1}(x_2|x_1) = I(x_1 < t) f^U(x_2) + I(x_1 ≥ t) f^N(x_2) where t ∈ [0, 1], U is the Uniform and N is the Normal. In the bottom diagrams, the solid, dashed and dotted lines show the underlying component and its E-estimates based on the H-sample and M-sample, respectively, and the corresponding estimates of β are shown in the subtitles as H-beta.est and M-beta.est. {Two additive components of the regression function are controlled by the argument set.corn, parameter β by the argument beta, argument t controls the conditional density of X2 given X1, all other arguments are the same as in Figure 4.9.} [n = 100, set.corn = c(2,4), t = 0.5, sigma = 1, w = "1-0.6*x1*x2", dwL = 0.5, dwU = 0.9, cJ0 = 4, cJ1 = 0.5, cTH = 4]
We may conclude that if an underlying regression model is additive, then its components
can be fairly well estimated even for the case of relatively small sample sizes. Also, even if
an underlying regression function is not additive, the additive model may shed light on the
data at hand. This explains why additive regression is often used by statisticians as a first
look at data.
4.9 Exercises
4.1.1 Explain heuristically (using words) the statistical model of a MCAR sample.
4.1.2 Describe the probability model of a MCAR sample via a hidden sample (H-sample)
and a corresponding sample with missing observations (M-sample).
4.1.3 Prove formula (4.1.1) and then comment on the underlying sampling procedure.
4.1.4 Suppose that the joint mixed density of pair (AX, A) is defined by (4.1.1) and the
parameter w depends on the value of x. Is this still the case of MCAR?
4.1.5∗ Consider the random variable N := ∑_{l=1}^n A_l and describe its statistical properties.
Develop inequalities that evaluate the probability of event N ≤ γn for γ ∈ [0, 1].
4.1.6 Suppose that your colleague sent you R-data and you discovered that it contains both
numbers and logical flags NA. What does the NA stand for?
4.1.7 Explain the meaning of the probability P(AX ≤ x|A = 1). Then prove that this
conditional probability is equal to P(X ≤ x).
4.1.8 Prove that the estimator θ̌_j, defined in (4.1.6), is an unbiased estimator of the Fourier coefficient
θj of an underlying density of interest f X (x).
4.1.9 Is it correct to say that the estimator (4.1.8) is based solely on complete cases? Explain
your answer.
4.1.10 Do we need to know the sample size n to calculate the estimate (4.1.8)?
4.1.11 Why is the availability likelihood w used in the denominator of (4.1.6)?
4.1.12∗ If w is unknown, what can be done?
4.1.13 How small can the number N of complete cases be? Further, what to do if N = 0?
Is it reasonable to consider M-samples with N < 10?
4.1.14 Explain the diagrams in Figure 4.1. Then repeat this figure 20 times and find the
sample mean and sample median of the ratios ISEH/ISEM. Does it match the theoretical
ratio?
4.1.15 Can one expect good estimation of the tails of the Normal for samples with size
n = 50?
4.1.16∗ Verify and explain all relations in (4.1.9).
4.1.17∗ Verify and explain all relations in (4.1.10).
4.1.18 Explain possible remedies for density estimation when N is small.
4.1.19 Discuss the possibility of a MCAR sampling being more cost efficient than a corre-
sponding direct sampling without missing.
4.1.20 Explain all diagrams in Figure 4.2.
4.1.21 Repeat Figure 4.2 with different parameters and write a report about obtained
results.
4.1.22∗ Let us consider the following two propositions, and the aim of the exercise is to check
their validity. First, show that formula (4.1.5) yields the following sample mean estimator
of θ_j,

θ̂_j := [∑_{l: A_l = 1, l ∈ {1,...,n}} A_l ϕ_j(A_l X_l)] / [∑_{l=1}^n A_l] = [∑_{l=1}^n A_l ϕ_j(A_l X_l)] / [∑_{l=1}^n A_l] = N^{−1} ∑_{l=1}^n A_l ϕ_j(A_l X_l).   (4.9.1)
Check correctness of (4.9.1) and (4.9.2), and then write down your comments. Hint: Pay
attention to the case N = 0.
4.1.23∗ Can collecting missing data be attractive? Assume that there is an extra cost for
avoiding missing data, and then propose a procedure of a controlled sampling that minimizes
the cost of sampling given a guaranteed value of the MISE.
4.2.1 Describe the statistical model of nonparametric regression with MAR responses. Give
definition of the MAR.
4.2.2 What is the difference between MCAR and MAR responses?
4.2.3 Assume that (4.2.3) holds. Are A and Y dependent? Given X, are A and Y dependent?
4.2.4 Which regression model is more general: model (4.2.1) or (4.2.2)? Explain your answer
and present examples.
4.2.5 Verify formula (4.2.4) for the joint (mixed) density.
4.2.6 Find the probability that the response is missed, that is find the probability P(A = 1).
4.2.7 Verify expression (4.2.5) for Fourier coefficients of the regression function.
4.2.8 Based on (4.2.5) and assuming that the design density is known, propose a sample
mean estimator of Fourier coefficients of the regression function. Then compare your answer
with (4.2.6).
4.2.9 Explain motivation behind the estimator (4.2.7).
4.2.10 Explain how the estimator (4.2.9) is obtained.
4.2.11 Is the estimator (4.2.9) based solely on complete cases? If the answer is “yes,” then
can incomplete cases be ignored? Further, do we need to know the sample size n?
4.2.12∗ Explain estimator (4.2.10). If N = 0, then what is the meaning of this estimator?
Further, will you use it for N = 10? Hint: Recall remedies discussed in Section 4.1.
4.2.13 Using Figure 4.3, for each corner function propose less and more favorable availability
likelihood functions w(x) that imply the same probability P(A = 1). Then verify your
recommendation via comparison of ISEs.
4.2.14 For each corner function, propose optimal parameters cJ0 and cJ1 and verify your
conclusion using Figure 4.3.
4.2.15 For each corner function, propose an optimal parameter cTH and verify your conclusion
using Figure 4.3.
4.2.16 Prove equality (4.2.11).
4.2.17 Explain how (4.2.13) is established.
4.2.18 Verify expression (4.2.14) for the variance of the Fourier estimator θ̄j .
4.2.19∗ Explain estimator (4.2.19) and prove (4.2.20).
4.2.20 Using expression (4.2.21) for the coefficient of difficulty, explain how the missing
mechanism and characteristics of the regression define the difficulty of regression problem
with MAR responses.
4.2.21 Suppose that we can choose the design density. Which one should we recommend?
4.2.22 The optimal design density (4.2.22) depends on the scale and availability likelihood.
These functions are typically unknown but may be estimated based on the MAR sample.
Propose a sequential estimation plan that allows to adapt the design density to the data at
hand.
4.2.23∗ Can working with missing data be attractive? Assume that there is an extra cost
for avoiding missing data, and then propose a procedure of a controlled sampling that
minimizes the cost of sampling given a guaranteed value of the MISE.
4.3.1 Explain the model of nonparametric regression with MAR predictors. Present several
examples.
4.3.2 Explain the model (4.3.1). Is it MAR? When does this model become MCAR?
4.3.3 Prove (4.3.2).
4.3.4 What is the difference, if any, between models (4.3.3) and (4.3.4)?
4.3.5 Write down the conditional density of the response given the predictor in a complete
case. Hint: Check with (4.3.5).
4.3.6 Can the regression problem with MAR predictors be solved using only complete cases?
4.3.7 Suppose that the design density and the availability likelihood functions are given.
Propose a regression E-estimator.
4.3.8 Suppose that the marginal density of responses is known. Propose a regression E-
estimator.
4.3.9 Propose a data-driven regression E-estimator.
4.3.10 Can a complete-case regression estimator be consistent?
4.3.11 Explain diagrams in Figure 4.4.
4.3.12 Find optimal parameters of the E-estimator using simulations produced by Figure
4.4.
4.3.13 Explore how the shape of the availability likelihood affects regression estimation
given P(A = 1) = 0.7.
4.3.14 How does the shape of the scale function affect the estimation? Verify your conclu-
sion: (i) Using Figure 4.4; (ii) Theoretically via analysis of the coefficient of difficulty.
4.3.15 Prove (4.3.5) and explain the assumptions.
4.3.16 Explain regression estimator under scenario 1.
4.3.17 Verify each equality in (4.3.7).
4.3.18∗ Find the mean and the variance of Fourier estimator (4.3.8). What is the corre-
sponding coefficient of difficulty?
4.3.19 Explain how to construct a regression estimator under scenario 2.
4.3.20 What is the motivation of the Fourier estimator (4.3.9)?
4.3.21∗ Find the mean and the variance of estimator (4.3.9).
4.3.22 Propose a regression estimator for scenario 3.
4.3.23∗ Evaluate the MISE of regression estimator based on an M-sample.
4.3.24 Consider the case N = 0. How can the E-estimator treat this case?
4.3.25∗ Suppose that we consider an M-sample only if N := ∑_{l=1}^n A_l ≥ k > 0. Calculate
the MISE of E-estimator.
4.3.26∗ Why is the regression with MAR responses less complicated than the regression
with MAR predictors?
4.3.27∗ Can collecting missing data be attractive? Assume that there is an extra cost for
avoiding missing data, and then propose a procedure of a controlled sampling that minimizes
the cost of sampling given a guaranteed value of the MISE.
4.4.1 What is the definition of a conditional density?
4.4.2 Verify (4.4.2).
4.4.3 Consider f Y |X (y|x). Given x, is this a regular density in y or not?
4.4.4 Consider a regression model with predictor X and response Y . What is the more
general description of the relationship between X and Y , the conditional density or the
regression function?
4.4.5 If the conditional density f Y |X (y|x) is a more general characteristic of relationship
between the predictor X and the response Y than the regression function m(x) = E{Y |X =
x}, then why do we study the regression function?
4.4.6 Describe a model of conditional density estimation with MAR responses.
4.4.7 Describe a model of conditional density estimation with MAR predictors.
4.4.8 Explain each equality in (4.4.4).
4.4.9 What conclusion can be made from (4.4.4)?
4.4.10 What are the difference and similarity between H-sample and M-sample?
4.4.11 Describe the distribution of the number N of complete cases in M-sample.
4.4.12∗ Explain the methodology of construction of the estimator (4.4.7). What should be
done if N = 0?
4.4.13 Repeat Figure 4.5 and explain all diagrams.
4.4.14 Will a regression model be useful for description of data in Figure 4.5?
4.4.15 Explain a model with MAR predictors.
4.4.16 Verify formula (4.4.9) for the joint density.
4.4.17 Can a complete-case estimator imply a consistent estimation for the case of MAR
predictors?
4.4.18 Is the estimator (4.4.11) of Fourier coefficients unbiased?
4.4.19∗ Verify formula (4.4.13).
4.4.20∗ Write down a formula for the coefficient of difficulty for the case of MAR predictors.
Then find the optimal design density.
4.4.21∗ Find the mean and the variance of Fourier estimator (4.4.16).
4.4.22 How can the design density f X be estimated? Hint: Note that the density of pre-
dictors in complete cases is biased.
4.4.23 How can the availability likelihood function w(y) be estimated? Hint: Pay attention
to the fact that the support of Y is unknown.
4.4.24 Repeat Figure 4.6 and explain all diagrams.
4.4.25 Use Figure 4.6 to understand which nuisance function has the least and largest effect
on the regression estimation.
4.4.26 Consider a setting when the information in incomplete cases cannot be considered reliable. In other words, consider the situation when cases (A_l X_l, Y_l, A_l) with A_l = 0 may have corrupted values of Y_l. What can be done in this case? Hint: Consider using an additional sample of responses (Y′_1, Y′_2, . . . , Y′_{n′}) from Y (without observing predictors).
4.4.27 Find the coefficient of difficulty of the proposed regression E-estimator.
4.4.28∗ Can collecting missing data be attractive? Assume that there is an extra cost for
avoiding missing data, and then propose a procedure of a controlled sampling that minimizes
the cost of sampling given a guaranteed value of the MISE.
4.5.1 What is the Poisson distribution?
4.5.2 What are the mean and the variance of a Poisson random variable?
4.5.3 Define a Poisson regression.
4.5.4 Explain how to construct an E-estimator for Poisson regression with no missing data.
4.5.5 Describe a Poisson regression model with MAR responses.
4.5.6 Verify (4.5.8).
4.5.7 What conclusion can be made from (4.5.10)?
4.5.8 Suggest an E-estimator for a Poisson regression with MAR responses.
4.5.9 Repeat Figure 4.7 and explain all diagrams.
4.5.10 What type of shape of the availability likelihood function makes the estimation
worse and better for the corner functions? Check your conclusion using Figure 4.7.
4.5.11 Using Figure 4.7, try to find optimal parameters of the E-estimator for each corner
function.
4.5.12 Find the coefficient of difficulty of the proposed regression E-estimator.
4.5.13∗ Suppose that the availability likelihood depends on Y . Propose a consistent regres-
sion estimator. Hint: It is possible to ask about additional information.
4.5.14∗ Can collecting missing data be attractive? Assume that there is an extra cost for
avoiding missing data, and then propose a procedure of a controlled sampling that minimizes
the cost of sampling given a guaranteed value of the MISE.
4.6.1 Explain the problem of scale estimation for the case of direct observations (H-sample).
4.6.2 Verify formula (4.6.3) for the joint density.
4.6.3 Propose the methodology of estimation of the scale function. Hint: Convert it into a
regression problem with several nuisance functions.
4.6.4 Is the estimator (4.6.7) unbiased? Hint: You may follow (4.6.8).
4.6.5 Explain the underlying idea of the estimator (4.6.10).
4.6.6 Repeat Figure 4.8 and explain diagrams.
4.6.7 Repeat Figure 4.8 and comment on the estimates.
4.6.8 How does the shape of regression function affect estimation of the scale?
4.6.9 How does the shape of availability likelihood function affect estimation of the scale?
4.6.10∗ Find the coefficient of difficulty of Fourier estimator (4.6.7).
4.6.11 Verify each equality in (4.6.8).
4.6.12∗ Evaluate the mean and variance of Fourier estimator (4.6.8).
4.6.13∗ Explain how the nuisance functions m̂(x) and fˆX|A (x|1), used in (4.6.10), are
constructed. Then evaluate their MISEs.
4.6.14∗ Explain how the parameter c of estimator (4.6.10) affects its coefficient of difficulty.
4.6.15∗ Can collecting missing data be attractive? Assume that there is an extra cost for
avoiding missing data, and then propose a procedure of a controlled sampling that minimizes
the cost of sampling given a guaranteed value of the MISE.
4.7.1 Describe the problem of bivariate regression.
4.7.2 Propose an E-estimator for bivariate regression based on H-sample.
4.7.3 Suppose that we are considering a bivariate regression with MAR responses. What is
the availability likelihood function in this case?
4.7.4 Explain how to construct a basis for bivariate functions with domain [0, 1]2 .
4.7.5 Write down and then prove the Parseval identity for a bivariate function.
4.7.6 Prove equality (4.7.3). What is the used assumption?
4.7.7∗ Find the mean and the variance of the estimator (4.7.4).
4.7.8 Explain the assumption (4.7.5).
4.7.9 Suppose that (4.7.5) holds. Are A and Y dependent given X1 ?
4.7.10 Verify (4.7.6).
4.7.11 Prove every equality in (4.7.7).
4.7.12 Is (4.7.8) correct? Prove or disprove it.
4.7.13 Verify (4.7.9). Formulate all necessary assumptions.
4.7.14 Prove (4.7.10).
4.7.15 Is the estimator (4.7.11) unbiased? Prove your assertion.
4.7.16 Verify (4.7.12). Formulate necessary assumptions.
4.7.17 Explain the estimator (4.7.13). What can be done when N = 0?
4.7.18∗ Find the mean and the variance of Fourier estimator (4.7.13).
4.7.19∗ Explain how plug-in estimators of nuisance functions, used in (4.7.13), are con-
structed.
4.7.20 Explain the simulation used in Figure 4.9.
4.7.21 Using Figure 4.9, suggest a minimal sample size which yields a reliable estimation
of bivariate regressions.
4.7.22 Explain diagrams in Figure 4.9.
4.7.23 Propose optimal parameters of the E-estimator for two different underlying regres-
sions. Hint: Use Figure 4.9.
4.7.24 Using Figure 4.9, explain how the availability likelihood affects the regression esti-
mation.
4.7.25∗ Consider a problem of bivariate regression with MAR predictors. Propose an E-
estimator.
4.7.26∗ Can collecting missing data be attractive? Assume that there is an extra cost for
avoiding missing data, and then propose a procedure of a controlled sampling that minimizes
the cost of sampling given a guaranteed value of the MISE.
4.8.1 Describe a regression model with additive components.
4.8.2 Consider the case of two predictors. What is the difference between bivariate and
additive regressions? Which one would you suggest to use?
4.8.3 Why do we need the restriction (4.8.2)?
4.8.4 Propose a sample mean estimator of Fourier coefficients (4.8.3).
4.8.5 Prove (4.8.4). What is the used assumption?
4.8.6 Why is (4.8.5) correct?
4.8.7 Prove (4.8.6).
4.8.8 Is (4.8.6) correct for j = 0?
4.8.9 Is the estimator (4.8.7) unbiased?
4.8.10 Explain how parameter β can be estimated.
4.8.11∗ Evaluate the variance of Fourier estimator (4.8.7).
4.8.12 Describe an additive regression model with MAR responses.
4.8.13 Explain the assumption (4.8.10). What happens if the availability likelihood depends on Y?
4.8.14 Verify (4.8.11).
4.8.15 Prove (4.8.12), and formulate the used assumption.
4.8.16 Verify (4.8.13).
4.8.17 Explain how formula (4.8.14) is established.
4.8.18∗ Is the estimator (4.8.15) unbiased? Find its variance.
4.8.19∗ Explain how estimators (4.8.16) and (4.8.17) were suggested. Then find their vari-
ances.
4.8.20 Explain all diagrams in Figure 4.10.
4.8.21 Use Figure 4.10 and explain the so-called “curse of multidimensionality.”
4.8.22 Repeat Figure 4.10 and try to explain E-estimates via analysis of the scattergram.
4.8.23 Suggest better parameters of the E-estimator used in Figure 4.10.
4.8.24∗ Can collecting missing data be attractive? Assume that there is an extra cost for
avoiding missing data, and then propose a procedure of a controlled sampling that minimizes
the cost of sampling given a guaranteed value of the MISE.
4.8.25∗ Consider additive regression with MAR predictors. Propose an E-estimator.
4.10 Notes
In many practical applications, and almost inevitably in those dealing with activities of
human beings, some entries in the data matrix may be missed. A number of interesting
practical examples and a thorough discussion of missing data can be found in books by
Rubin (1987), Allison (2002), Little and Rubin (2002), Groves et al. (2002), Tsiatis (2006),
Molenberghs and Kenward (2007), Daniels and Hogan (2008), Davey and Savla (2009),
Enders (2010), Graham (2012), Carpenter and Kenward (2013), Molenberghs et al. (2014),
and Raghunathan (2016). There is a strong opinion in the statistical community that re-
searchers often play down the presence of missing data in their studies. To change this
attitude, there has been an explosion of reviews/primers on missing data for different
sciences, see, for instance, Bodner (2006), Enders (2006, 2010), Baraldi and Enders (2010),
Honaker and King (2010), Young, Weckman and Holland (2011), Cheema (2014), Zhou et
al. (2014), Nakagawa (2015), Newgard and Lewis (2015), Lang and Little (2016), Little et
al. (2016), Efromovich (2017), Sullivan et al. (2017). The available literature is primarily
devoted to parametric models.
The MCAR, MAR and MNAR terminology is due to Rubin (1976).
A number of methods for dealing with missing data have been proposed in the literature. Among
the more popular are the following:
1. The most common approach is referred to as a complete-case approach (case-wise or list-wise deletion is another name often used in the literature). It involves eliminating all records with missing values on any variable. Popular statistical software packages, like SAS, S-PLUS, SPSS and R, by default ignore observations with missing values, and this tells us how popular and well accepted the complete-case methodology is. Why? Because this method is simple, intuitively appealing and optimal for some settings. A disadvantage of the approach is twofold. First, in some applications it is too wasteful. Second, in many
important settings a complete-case approach yields inconsistent estimation, and this fact
should be known and taken into account.
2. Imputation is a common alternative to the deletion. It is used to “fill in” a value for
a missing value using the other information in the database. A simple procedure for impu-
tation is to replace the missing value with the mean or median of that variable. Another
common procedure is to use simulation to replace the missing value with a value randomly
drawn from the records having values for the variable. It is important to stress that impu-
tation is only the first step in any estimation and inference procedures. The second step is
to propose an estimator and then complement it by statistical inference about properties
of the estimator, and these are not trivial steps because imputation creates dependence be-
tween observations. Warning: the fact that a complete dataset was created by imputation should always be clearly stated, because otherwise a wrong decision can be made by a data analyst who is not aware of this fact. As an example, suppose that we have a sample from a Normal(µ, σ^2) distribution of which some observations are missed. Suppose that an analyst
is interested in estimation of the mean µ, and an oracle helps by imputing the underlying µ
in place of the lost observations. This is an ideal imputation, and then the analyst correctly
uses the sample mean to estimate µ. However, if the analyst (or someone else) then uses this imputed (and hence complete) sample to estimate the variance σ^2, the classical sample variance estimator will yield a biased estimate (to see this, just imagine that all observations were missed and then calculate the sample variance estimator, which would
be equal to zero). An interesting discussion, examples and references can be found in Little
and Rubin (2002), Davidian, Tsiatis and Leon (2005), Enders (2010), Graham (2012) and
Molenberghs et al. (2014).
3. Multiple imputation is another popular method. It is based on repeated imputation–
estimation steps and then aggregation (for instance via averaging) of the estimates. Multiple
imputation is a flexible but complicated statistical procedure which requires a rigorous
statistical inference. Books by van Buuren (2012) and Berglund and Heeringa (2014) are
devoted to this method.
4. Maximum likelihood method, and its numerically friendly EM (expectation-
maximization) algorithm, is convenient for parametric models with missing data. Little
and Rubin (2002) is a classical reference.
5. A number of weighting methods, like the adjustment cell method, inverse probability
weighting, response propensity model, post-stratification, and survey weights, are proposed
in the literature. These methods are well documented and supported by a number of statis-
tical packages. See a discussion in Little and Rubin (2002), Molenberghs et al. (2014) and
Raghunathan (2016) where further references may be found.
6. Finally, let us mention that prevention of missing data, if possible, is probably the most powerful method of dealing with missing data; see McKnight et al. (2007). Prevention is discussed in practically all of the above-mentioned books. On the other hand, if collecting data subject to missing is significantly cheaper, then working with missing data may be beneficial; see Efromovich (2017). Efficiency and robustness are discussed in Cao, Tsiatis and Davidian
(2009).
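As flagged in the discussion of imputation in item 2 above, here is a minimal R sketch (not from the book's software; all names are illustrative) showing that even an ideal oracle imputation of the true mean µ biases the classical sample variance estimator:

# Minimal sketch: oracle imputation of the true mean mu for missed
# Normal(mu, sigma^2) observations biases the sample variance downward.
set.seed(1)
mu <- 2; sigma <- 1; n <- 1000
x <- rnorm(n, mean = mu, sd = sigma)     # hidden complete sample
a <- rbinom(n, 1, 0.5)                   # MCAR availability: about half of the observations are missed
x.imp <- ifelse(a == 1, x, mu)           # ideal (oracle) imputation of mu
mean(x.imp)                              # close to mu: the mean is estimated well
var(x.imp)                               # close to sigma^2 / 2: clearly biased downward
var(x[a == 1])                           # complete cases: close to sigma^2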
4.1 Asymptotic justification of the proposed E-estimation methodology is presented in
Efromovich (2013c). Sequential estimation is considered in Efromovich (2015). An inter-
esting possible extension is to propose a second-order efficient estimate of the cumulative
distribution function, see Efromovich (2001b).
4.2 The theory of efficient nonparametric regression with MAR responses may be found
in Efromovich (2011b). It is shown that the E-estimation methodology, based on the com-
plete case approach, yields efficient estimation of the regression function, and more discus-
sion can be found in Efromovich (2012a; 2014c,e; 2016c; 2017). Sequential estimation is
considered in Efromovich (2012b). Müller and Van Keilegom (2012) and Müller and Schick
(2017) present more cases when a complete-case approach is optimal and show that there is
no need to use imputation, work with inverse probability weights or use any other traditional
remedy for missing data. Estimation of functionals is considered in Müller (2009).
4.3 Asymptotic theory of nonparametric regression estimation with MAR predictor
is more complicated than its missing response counterpart. The theory and applications
are presented in Efromovich (2011a, 2016c, 2017). The theory supports optimality of the
proposed E-estimation methodology. See also Goldstein et al. (2014).
4.4 Conditional density estimation is a classical statistical problem, see books by Fan
and Gijbels (1996) and Efromovich (1999a), as well as more recent articles by Izbicki and Lee
(2016) and Bott and Kohler (2017). The asymptotic theory of E-estimation and applications
may be found in Efromovich (2007g, 2010b,d).
4.5 Poisson regression for direct data is discussed in Efromovich (1999a). A number of
interesting settings and applications are discussed in Ivanoff et al. (2016) where further
references may be found. Discussion of a Bayesian nonparametric approach may be found
in Ghosal and van der Vaart (2017).
4.6 Regression with the scale depending on the predictor is a classical topic in the
regression analysis, and the regression is called heteroscedastic. Statistical analysis and ef-
ficient estimation are discussed in Efromovich (1999a, 2013a,b). Sequential estimation is
discussed in Efromovich (2007d,e; 2008a). Further, not only the scale function but the dis-
tribution of regression error may be efficiently estimated by the E-estimation methodology,
see Efromovich (2004h, 2005b, 2007c).
4.7 E-estimation methodology and its optimality for estimation of multivariate regres-
sion is considered in Efromovich (1999a, 2000b, 2013a,b). A further discussion of this and
related topics can be found in Izenman (2008), Efromovich (2010d, 2011c), Harrell (2015)
and Raghunathan (2016).
4.8 Additive regression is a natural method for avoiding the curse of multidimensionality
that slows down the rate of the MISE convergence and requires a dramatic increase in the
sample size to get a feasible estimation. There are no such dramatic issues with the additive regression. It also may be argued that additive regression gives a first glance at the data, and it may be instrumental in finding a more appropriate model. A book-length treatment
of additive models may be found in the classical book by Hastie and Tibshirani (1990),
and see also Izenman (2008), Harrell (2015) and Wood (2017). The asymptotic theory of
E-estimation can be found in Efromovich (2005a).
Chapter 5
Destructive Missing
This chapter is devoted to the interesting and complicated topic of dealing with missing data that may prevent us from consistent estimation based solely on the available data. In other words, we are considering missing mechanisms that modify observations in such a way that no consistent estimation of an estimand, based solely on the data with missing observations (or, we may say, based on an M-sample defined in Section 1.3 or Chapter 4), is possible. This is the reason why
we refer to this type of missing as a destructive missing. MNAR (missing not at random),
when the likelihood of missing a variable depends on its value, typically implies a destruc-
tive missing. At the same time, MNAR is not necessarily destructive. For instance, we will
explore a number of settings where knowledge of the availability likelihood function makes
MNAR nondestructive. Another comment due is about our reference to consistent estima-
tion. For any data, and even without data, one may propose an estimator. The point is that
an estimator should be good and produce some useful information, and consistent estima-
tion is a traditional statistical criterion for both parametric and nonparametric models. Of
course, in nonparametric curve estimation we could use the criterion of the MISE vanishing
instead, and this would not change the conclusions.
In general, some extra information is needed to solve the problem of consistent estimation in the case of destructive missing, and a number of reasonable approaches to what can be done and what can be expected will be suggested and discussed.
Let us recall the terminology used. By E-estimator we understand the estimator defined
in Section 2.2. Recall that to construct an E-estimator we need to propose a sample mean
(or a plug-in sample mean) estimator of Fourier coefficients of an estimated function. An
underlying sample of observations of random variables of interest, which in general is un-
known to us due to a missing mechanism, is called an H-sample (hidden sample). Note that
H-sample is a sample of direct observations of random variables of interest. A corresponding
sample with missing observations is called an M-sample. Note that cases in an M-sample
may be incomplete while in a corresponding H-sample all cases are complete. If an extra
sample is available, then it is referred to as an E-sample (extra sample). The probability that a variable is available (not missing) in an M-sample, given all underlying observations in the corresponding H-sample, is called the availability likelihood (this function is always
denoted as w). Depending on this function, the missing mechanism may be either MCAR
(the availability likelihood is equal to a constant), or MAR (the availability likelihood is a
function in always observed variables), or MNAR (none of the above). In what follows we
will periodically remind the reader of the terminology.
The chapter begins with the topic of density estimation, which is pivotal for under-
standing destructive missing. This explains why each specific remedy for extracting useful
information from MNAR data is discussed in a separate section. Section 5.1 serves as an
introduction. It explains why MNAR may lead to destructive missing and why the problem
is related to the topic of biased data. Then it is explained that if the availability likelihood is
known, then MNAR does not imply a destructive missing. The latter is a pivotal conclusion
because one of the main tools to “unlock” information in an MNAR dataset is to estimate
the availability likelihood based on an extra sample. This is the approach discussed in Sec-
tion 5.2. Namely, a sample of direct (and hence more expensive) observations is available
and it is used to estimate the availability likelihood function. The main issue here is the dis-
cussion of the size of an extra sample (E-sample) that makes the approach feasible. Section
5.3 considers another remedy when there exists an auxiliary random variable which defines
the missing mechanism. Under this possible scenario, using that auxiliary variable converts
MNAR into MAR. Section 5.4 is devoted to regression with MNAR responses, that is the
case when likelihood of missing a response depends on its value. Section 5.5 considers a
problem of regression with MNAR predictors. Section 5.6 considers a regression where both
predictors and responses may be missed.
5.1 Density Estimation When the Availability Likelihood Is Known

The joint mixed density of an observation (AX, A) in the M-sample is

f^{AX,A}(ax, a) = [f^X(x)w(x)]^a [1 − ∫_0^1 f^X(u)w(u)du]^{1−a},  (x, a) ∈ [0, 1] × {0, 1},   (5.1.1)

where {0, 1} denotes the set consisting of the two numbers 0 and 1, and the availability likelihood function w(x) is defined as

w(x) := P(A = 1|X = x).
Recall that if w(x) is constant, then the missing mechanism is MCAR (missing completely
at random) and otherwise it is MNAR (missing not at random). In this section (and the
whole chapter) we are dealing exclusively with the MNAR.
Let us comment on (5.1.1). First of all, because X is continuous and A is discrete, we are
dealing with the mixed density. Second, it is natural to think about any missing process as
a two-step procedure: (i) First a realization of X occurs; (ii) Second, given the realization
of X, the Bernoulli random variable A is generated with P(A = 1|X) = w(X). If A = 1
then the realization is available and otherwise not. Hence, if A = 1 then

f^{AX,A}(x, 1) = f^X(x)P(A = 1|X = x) = f^X(x)w(x).   (5.1.3)

Relation (5.1.3) explains the first factor on the right-hand side of formula (5.1.1) which describes
the case a = 1. If A = 0, then an underlying realization of X is not available and in this
case the joint density is
f^{AX,A}(0, 0) = P(X ∈ [0, 1], A = 0) = ∫_0^1 f^X(u)P(A = 0|X = u)du = ∫_0^1 f^X(u)(1 − w(u))du = 1 − ∫_0^1 f^X(u)w(u)du.   (5.1.4)
Similarly to all previously considered settings, in what follows it is assumed that (5.1.5)
holds.
To shed additional light on the MNAR problem, let us consider a subsample of complete
cases, that is realizations (Al Xl , Al ) with Al = 1. A random number N of complete cases
in an M-sample is
N := Σ_{l=1}^n A_l,   (5.1.6)
and it has the Binomial(n, P(A = 1)) distribution with the mean E{N } = nP(A = 1) and
the variance V(N) = nP(A = 1)(1 − P(A = 1)). Further, using (5.1.1) we can write,

f^{AX|A}(x|1) = f^{AX,A}(x, 1)/P(A = 1) = f^X(x)w(x)/∫_0^1 f^X(u)w(u)du.   (5.1.7)

Relation (5.1.7) implies that the density f^{AX|A}(x|1) of observations in complete cases
is biased with respect to the estimated f X (x) and the biasing function is the availability
likelihood w(x) = P(A = 1|X = x) (recall Section 3.1). This is an important conclusion
which connects a destructive missing (and the MNAR in particular) with the topic of biased
data. Let us recall a conclusion of Section 3.1 that for biased data a consistent estimation is
impossible if the biasing function is unknown. As a result, it should not be a surprise that
the same conclusion is valid for MNAR with unknown availability likelihood function w(x).
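To make this conclusion concrete, here is a minimal R sketch (not from the book's software) in which two different pairs of an underlying density and an availability likelihood generate identically distributed complete cases, so no procedure based on the complete cases alone can distinguish the two underlying densities:

# Minimal sketch: two different (density, availability likelihood) pairs on [0, 1]
# whose complete cases have exactly the same distribution.
set.seed(2)
n <- 10^4
# Scenario 1: Uniform hidden density with MNAR availability w1(x) = 0.9 - 0.6*x.
x1 <- runif(n)
a1 <- rbinom(n, 1, 0.9 - 0.6 * x1)
# Scenario 2: decreasing hidden density f2(x) = 1.5 - x with MCAR availability w2 = 0.6.
x2 <- 1.5 - sqrt(2.25 - 2 * runif(n))    # inverse-CDF sampling from f2
a2 <- rbinom(n, 1, 0.6)
# In both scenarios the complete cases have density proportional to 0.9 - 0.6*x:
ks.test(x1[a1 == 1], x2[a2 == 1])        # the two complete-case samples are indistinguishable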
[Figure 5.1 diagrams: two H-sample histograms (top row) and two M-sample histograms titled “M-Sample, N = 67” and “M-Sample, N = 60” (bottom row); x-axes X and X[A==1], y-axis Density.]
Figure 5.1 Examples of destructive missing. Two columns of diagrams exhibit results of simulations
for the Uniform and the Bimodal underlying densities shown by the solid lines. Available observa-
tions in H- and M-samples are shown by histograms. Axis label X[A == 1] indicates that only
available observations in M-sample are shown (the label is motivated by R operator of extracting
elements of vector X corresponding to elements of vector A equal to 1). The availability likelihood
function w(x) is the same in the two simulations and is shown by the dashed line in the bottom
diagrams. The sample size n and the number N = Σ_{l=1}^n A_l of available observations in M-sample
are shown in the corresponding titles. {The arguments are: set.corn and n control the underlying
corner functions and the sample size of H-sample (recall that any custom-made corner function can
be used as explained in the caption of Figure 2.3), the string w controls a function w∗ (x) which is
bounded from above and below by constants dwU and dwL, respectively, and then the availability
likelihood is w(x) := max(dwL, min(dwU, w∗(x))).} [set.corn = c(1,3), n = 100, w = "1-0.7*x",
dwL = 0.3, dwU = 0.9]
Let us illustrate this conclusion via two simulated examples shown in Figure 5.1. The
left-top diagram shows us the histogram of a hidden sample of size n = 100, and the solid
line shows the underlying Uniform density. The sample is rather typical for the Uniform.
The corresponding available observations in an MNAR sample (M-sample) are shown by
the histogram in the left-bottom diagram, and the histogram is overlaid by the solid line
(the underlying density) and the dashed line (the availability likelihood function). The
non-constant availability likelihood yields the MNAR and the destructive missing, and the
histogram of complete cases in the M-sample clearly indicates why. Indeed, the histogram
is skewed to the left and its shape resembles w(x) and not the underlying density, and the
latter was predicted by (5.1.7). Note that the skewness becomes even more pronounced if we look one more time at the H-sample. Further, let us compare the numbers of available
observations in the two samples. The missing decreases the sample size n = 100 of the
H-sample to N = 67 in the M-sample. This is what makes the missing so complicated even
if the availability likelihood function (which is also the biasing function for the M-sample)
is available, recall our discussion in Section 3.1.
A similar conclusion can be made from the second column of diagrams exhibiting an
H-sample generated by the Bimodal density and an M-sample generated by the same avail-
ability likelihood function. The H-sample clearly indicates the Bimodal density, while the
missing dramatically changes the histogram of the M-sample where we also observe two
modes but their magnitudes are misleading and the left one is larger and more pronounced
than the right one. This is what the MNAR may do to an H-sample, and this is why this
missing mechanism is destructive.
We can conclude from these two simulations that an M-sample alone does not allow us
to estimate an underlying probability density. Similarly to the case of biased data, the only
chance to restore the hidden density is to know the availability likelihood function w(x).
In this section we are considering a pivotal case when w(x) is known, and other possible
scenarios are explored in the following two sections.
Suppose that the availability likelihood w(x) is known. This does not change the MNAR
nature of an M-sample but, as we will see shortly, the missing is no longer destructive and
consistent estimation of the density is possible. Let us stress one more time that MNAR
does not necessarily imply a destructive missing; everything depends on the possibility of complementing MNAR data with additional information. Further, as we will see shortly, for
the problem at hand the main remaining challenge is the smaller, with respect to n, number
N of available observations.
First of all, let us explain why consistent estimation of the density is possible. If w(x)
is known, then the biasing function for complete cases in an M-sample is also known and
the density E-estimator of Section 3.1 may be used to estimate f X . Further, according to
Section 3.1, this estimator does not need to know the sample size n of an underlying H-
sample. This explains why consistent estimation of the density, based on the complete-case
approach, is possible.
Second, recall that according to Section 2.2, to construct an E-estimator we only need
to propose a sample mean (and possibly a plug-in) estimator of Fourier coefficients θ_j := ∫_0^1 f^X(x)ϕ_j(x)dx, where {ϕ_j(x)} is the cosine basis on [0, 1]. In our case, a natural sample mean estimator of θ_j is

θ̂_j := n^{−1} Σ_{l=1}^n A_l ϕ_j(A_lX_l)/w(A_lX_l).   (5.1.8)
Let us show that this estimator is indeed a sample mean estimator. Using (5.1.3) we may
write,

θ_j = ∫_0^1 f^X(x)ϕ_j(x)dx = ∫_0^1 f^X(x)w(x)[ϕ_j(x)/w(x)]dx = E{Aϕ_j(AX)/w(AX)}.   (5.1.9)
This verifies that (5.1.8) is a sample mean estimator of θ_j as well as an unbiased estimator of θ_j.
The Fourier estimator (5.1.8) yields the density E-estimator fˆX (x), x ∈ [0, 1] of Section
2.2.
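A minimal R sketch may help to visualize this construction (this is not the book's E-estimator: the cutoff J is fixed by hand rather than chosen by the procedure of Section 2.2, and all names are illustrative; the setting mimics the Uniform case of Figure 5.2):

# Minimal sketch: the Fourier estimator (5.1.8) and a projection density
# estimate for MNAR data when the availability likelihood w(x) is known.
set.seed(3)
n <- 100
w <- function(x) pmax(0.3, pmin(0.9, 1 - 0.7 * x))    # availability likelihood
x <- runif(n)                                          # hidden H-sample (Uniform density)
a <- rbinom(n, 1, w(x))                                # availability indicators
ax <- a * x                                            # the M-sample is (ax, a)
phi <- function(j, t) if (j == 0) rep(1, length(t)) else sqrt(2) * cos(pi * j * t)
J <- 3                                                 # fixed cutoff (the E-estimator selects it adaptively)
theta.hat <- sapply(0:J, function(j) mean(a * phi(j, ax) / w(ax)))   # estimator (5.1.8)
tt <- seq(0, 1, 0.01)
Phi <- sapply(0:J, function(j) phi(j, tt))             # basis values on a grid
f.hat <- pmax(0, as.vector(Phi %*% theta.hat))         # projection density estimate, truncated at 0
plot(tt, f.hat, type = "l", xlab = "X", ylab = "Density")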
Figure 5.2 is a good tool to learn how the E-estimator performs and what can be expected
when the availability likelihood function w(x) is known. The structure of Figure 5.2 is
similar to Figure 5.1, only here we can simultaneously observe results of 4 simulations. The
missing is created by the same availability likelihood function as in Figure 5.1. The solid
and dashed lines show underlying densities f X and their E-estimates fˆX . Additionally,
the ISEs (integrated squared errors) of E-estimates are shown in subtitles, and recall that
ISE := ∫_0^1 (f^X(x) − f̂^X(x))²dx. Let us look at particular simulations. The left column of
diagrams explores the case of the Uniform density. We clearly see how the MNAR skews
Figure 5.2 Density estimation for MNAR data when the availability likelihood function w(x) is
known. Results of four simulations are shown in the four columns of diagrams whose structure is
identical to those in Figure 5.1. In each diagram the histogram is overlaid by an underlying density
(the solid line) and its E-estimate (the dashed line), and ISE is the integrated squared error of E-
estimate. The availability likelihood is the same as in Figure 5.1. {The argument set.corn controls
four corner functions and the availability likelihood w(x) is defined as in Figure 5.1.} [set.corn =
c(1,2,3,4), n = 100, w = "1-0.7*x", dwL = 0.3, dwU = 0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4]
available data to the left, and this is due to the decreasing w(x). Nonetheless, the E-
estimator correctly recovers the hidden density (note that the dashed line is hidden by the
solid one). Also look at the number N = 66 of complete cases (available observations) in the M-sample; this is a small number for any nonparametric problem and hence it is likely that
repeated simulations may produce worse estimates. The next column explores the case of
the Normal density. Here again the M-sample is skewed to the left, and the E-estimate does
correct the histogram and shows a symmetric bell-shaped curve around 0.5. The estimate
is far from being perfect, its tails are “wrong” but note that they do describe the data. The
number of available complete cases is N = 61, and this together with the MNAR increases
the ISE almost threefold with respect to the H-sample. The number of complete cases is
even smaller for the case of the Bimodal density (see the third column), it is N = 56. This
is due to the decreasing availability likelihood and the skewed to the right shape of the
Bimodal. Despite all these facts, it is fair to say that E-estimate for the M-sample indicates
a bimodal shape and its ISE is just a little bit larger than its H-sample’s counterpart. Of
course, such an outcome is not typical but the random nature of simulations may produce
atypical outcomes. For the Strata, the two E-estimates are not perfect and close to each
other in terms of their shapes.
Overall, repeated simulations indicate that the missing takes a heavy toll on our ability
to produce reasonable nonparametric estimates even if the availability likelihood function
is known; the reduced number of available observations and their biased nature complicate
the estimation. At the same time, typically the shape of an underlying density is recog-
nizable, and this is an encouraging outcome keeping in mind that without knowing w(x)
we are dealing with a destructive missing which precludes us from a consistent estimation.
Further, a practical recommendation is to consider, if possible, larger sample sizes to compensate for the MNAR.

5.2 Density Estimation with an Extra Sample
Figure 5.3 Restoration of information in an M-sample with the help of an E-sample. This figure
allows one to visualize the proposed methodology of the density E-estimation. Each row of diagrams
corresponds to an underlying corner density. Underlying density and its E-estimate are shown by
the solid and dashed lines, respectively. In a middle diagram pairs (Xl , Al ) with Al = 1 (complete
cases) are shown by circles, an underlying w(Xl ) and its E-estimate ŵ(Xl ) are shown by triangles
and crosses, respectively. {The arguments are: set.corn controls four corner functions, k controls
the sample size of E-sample, n defines the size of H-sample, w defines a function w∗ (x) and then the
availability likelihood is w(x) := max(dwL, min(dwU, w∗ (x))). Argument c controls the parameters c
used in (5.2.1). All other arguments control parameters of the E-estimator.} [set.corn = c(1,2,3,3),
k = 30, n = 200, w = "1-0.7*x", dwU = 0.9, dwL = 0.5, c = 0.3, cJ0 = 3, cJ1 = 0.8, cTH = 4]
uses this fact. Look at how nicely the E-estimate restores the bell-shaped Normal density
from the skewed histogram of the M-sample.
The outcome is much worse for the third experiment with the Bimodal density, see the
third row of diagrams. Here the poor density estimate, based on the E-sample, implies an
extremely poor estimate of w(x) which, in its turn, yields poor density E-estimate based
on M-sample. Note that here the density estimate based on the M-sample with N = 123,
according to its ISE is worse than the E-estimate based on the E-sample of size k = 30.
This is what may be expected from small samples. In the bottom row the simulation is the
same as in the third row, meaning that we observe two realizations of the same experiment.
The two density E-estimates indicate two modes but they are clearly poor. Further, the
estimate of w(x) is also bad.
Overall, apart from the Uniform case and to some degree the Normal case, the density
estimates are poor (just compare ISEs of the density estimates based on very small E-
samples and much larger M-samples). There are several teachable moments here. The first
one is that the asymptotic theory does matter; even for these small extra samples we may
get a reasonable outcome. Second, the E-estimator is robust to imperfect estimates of w(x).
Finally, we need to understand why estimates of w(x) may be so bad and what, if anything, can be done to improve them.
Let us explore the above-observed bad estimation of the availability likelihood w(x). If
we look at histograms of E-samples in rows 2-4 of Figure 5.3, then it is striking how small the
empirical ranges of observations in the E-samples are. For instance, for the Normal case (the
left diagram in the second row) there are no observations smaller than 0.15 or larger than 0.75.
In other words, from the E-sample nothing can be concluded for 40% of the range of X. This
is what can be expected from a small sample when an underlying density is not separated
from zero. Recall that the asymptotic theory assumes that in a Bernoulli regression the
design density is separated from zero, and this assumption is obviously violated here. Similar
poor outcomes may be observed for the two experiments with the Bimodal density (see the
left diagrams in the two bottom rows). At the same time, the outcome is much better for
the Uniform which is separated from zero. What can be done for densities that are not
separated from zero? There are really only two options. The first is to increase the sample size k of the E-sample, and this may not be possible in some applications. The second is to assume that w(x) cannot be too wiggly and then decrease the number of estimated
Fourier coefficients used by the Bernoulli regression E-estimator ŵ(x). But overall we are
dealing with an extremely complicated problem, and the reader is advised to spend some
time exploring simulations created by Figure 5.3.
Figure 5.4 allows us to explore cases with densities separated from zero. Here the following mixture densities are used,

f^X(x) := v f_U^X(x) + (1 − v)f_i^X(x), x ∈ [0, 1].   (5.2.2)

In (5.2.2) f_U^X(x) is the Uniform density and f_i^X(x) is the ith corner density. Further, Figure
5.4 allows us to control parameters cJ0 and cJ1 of the regression E-estimator used to
estimate the availability likelihood w(x). The default parameters, used in Figure 5.4, imply
that the E-estimator uses no more than three first Fourier coefficients. All other features
are identical to Figure 5.3.
The first two experiments, shown in the top two rows of Figure 5.4, correspond to the
mixture with the Normal, and the last two to the mixture with the Bimodal. As we see, the
mixtures are separated from zero, and this increases the empirical range of E-samples shown
by the histograms. With the exception of the first experiment (the top row of diagrams), the
density E-estimates for E-samples are poor but do reflect the histograms. At the same time,
recall that we need these estimates only to estimate the availability likelihood w(x). The
corresponding estimates ŵ(x) are shown in the middle column. With the exception of the
third experiment, the estimates of the availability likelihood are fair, and this is reflected
in the final density E-estimates for M-samples. Even for the third experiment the final
estimate is better than the one based on the E-sample, and in all other cases the ISEs are
significantly smaller. Further, note how the density E-estimator corrects biased M-samples,
and recall that without extra samples no consistent estimation of the density, based on an
M-sample, is possible. Overall, we may conclude that the asymptotic theory does help us
[Figure 5.4 diagrams, three per row with titles: “E-Sample, k = 30, ISE = 0.018”, “E-Estimate of w(X)”, “M-Sample, N = 137, ISE = 0.0095”; “E-Sample, k = 30, ISE = 0.33”, “E-Estimate of w(X)”, “M-Sample, N = 148, ISE = 0.015”; “E-Sample, k = 30, ISE = 0.1”, “E-Estimate of w(X)”, “M-Sample, N = 141, ISE = 0.059”; “E-Sample, k = 30, ISE = 0.12”, “E-Estimate of w(X)”, “M-Sample, N = 149, ISE = 0.039”; x-axes X and X[A==1], y-axis Density.]
Figure 5.4 E-estimation for a MNAR sample with extra sample. This figure is similar to Figure
5.3 but it allows one to choose underlying densities separated from zero and different parameters
for the E-estimator of w(x). The underlying density is the mixture of the Uniform (with weight v)
and a corner density (with weight 1 − v). {New arguments are: v controls the weight in the mixture
(5.2.2), setw.cJ0 and setw.cJ1 control parameters of the regression E-estimator of w(x) in the 4
experiments.} [set.corn = c(2,2,3,3), n = 200, k = 30, v = 0.6, setw.cJ0 = c(2,2,2,2), setw.cJ1 =
c(0,0,0,0), w = "1.1-0.7*x", dwU = 0.9, dwL = 0.3, cJ0 = 3, cJ1 = 0.8, cTH = 4]
to understand performance of the E-estimator and why its assumptions are important. The
reader is encouraged to repeat Figure 5.4, get used to the setting and the proposed solution,
and try to use different parameters of the regression and density E-estimators.
Our final remark is about a possible aggregation of estimators based on E- and M-
samples. Consider a mixture of two estimators of Fourier coefficients based on E- and
M-samples,
θ̃_j = λθ̂_{Ej} + (1 − λ)θ̂_{Mj},  λ ∈ [0, 1].   (5.2.3)
Choice of the mixture coefficient λ is based on minimization of the variance of the estimator
θ̃j . Then the aggregated estimator of Fourier coefficients can be used by the E-estimator.
We will return to the aggregation later in Section 10.4.
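If θ̂_{Ej} and θ̂_{Mj} are unbiased and computed from independent samples (an assumption made here only for a quick illustration), then V(θ̃_j) = λ²V(θ̂_{Ej}) + (1 − λ)²V(θ̂_{Mj}), which is minimized by λ* = V(θ̂_{Mj})/[V(θ̂_{Ej}) + V(θ̂_{Mj})] and yields V(θ̃_j) = V(θ̂_{Ej})V(θ̂_{Mj})/[V(θ̂_{Ej}) + V(θ̂_{Mj})]; in practice the two variances are replaced by their estimates.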
5.3 Density Estimation with Auxiliary Variable

Suppose that, in addition to the M-sample from (AX, A), we always observe an auxiliary random variable Y which defines the missing mechanism in the sense that

P(A = 1|X = x, Y = y) = w(y).   (5.3.1)

Note that given (5.3.1) the only interesting case is when X and Y are dependent because otherwise we are dealing with the MCAR. This is an important remark, and let us formally prove it. Suppose that X and Y are independent, (5.3.1) holds, and f^X(x) is supported on [0, 1]. Then we can write for x ∈ [0, 1],
P(A = 1|X = x) = [∫_{−∞}^{∞} f^{AX,A,Y}(x, 1, y)dy]/f^X(x) = [∫_{−∞}^{∞} w(y)f^{X,Y}(x, y)dy]/f^X(x)
= [f^X(x)∫_{−∞}^{∞} w(y)f^Y(y)dy]/f^X(x) = ∫_{−∞}^{∞} w(y)f^Y(y)dy.   (5.3.2)
The integral on the right side of (5.3.2) does not depend on x and hence the missing is
MCAR. We conclude that the only interesting case is when X and Y are dependent.
Now, following the E-estimation methodology, let us explain how we can estimate Fourier coefficients θ_j := ∫_0^1 f^X(x)ϕ_j(x)dx. First, we note that the joint mixed density of the triplet (AX, A, Y) is
To verify (5.3.6) we recall that V(Z) = E{Z²} − [E{Z}]², the variance of a sum of independent random variables is equal to the sum of the variances of the variables, and that ϕ_j²(x) = 1 + 2^{−1/2}ϕ_{2j}(x). This allows us to write (recall the notation w^{−k}(y) := 1/[w(y)]^k),
Now we are in a position to test the proposed density E-estimator. We are going to
consider two models for the auxiliary variable when Y := β0 + β1 X + σε and Y = σXε with
ε being standard normal and independent of X. In the first case we have a classical linear
relation, and in the second X defines the standard deviation of Y .
Figure 5.5 exhibits in its rows four simulations with Y := β0 + β1 X + σε, and its caption
explains the diagrams. A diagram in the left column shows the scattergram of observations
of (Y, A), the underlying likelihood function and its E-estimate by circles, triangles and
crosses, respectively. Two top rows show independent simulations for the Normal density
and two bottom ones for the Bimodal density. Due to the increasing availability likelihood
function, we may expect an M-sample to be skewed to the right. And indeed, we see this
Figure 5.5 Performance of the density E-estimator fˆX (x) based on M-sample from (AX, A) and
an extra sample from the auxiliary variable Y = β0 + β1 X + σε where ε is standard normal. The
variable Y defines the missing mechanism according to (5.3.1). Rows of diagrams show performance
of the estimator for 4 different simulations. Circles show a scattergram of pairs (Yl , Al ), l = 1, . . . , n
while triangles and crosses show values of w(Yl ) and ŵ(Yl ) at points Yl corresponding to Al = 1.
Complete cases in an M-sample are shown by the histogram while the solid and dashed lines show
the underlying density f X and its E-estimate, respectively. {Arguments set.beta = c(β0 , β1 ) and
sigma define Y, all other arguments are the same as in Figure 5.4.} [set.c = c(2,2,3,3), n = 200,
c = 1, set.beta = c(0,0.3), sigma = 2, setw.cJ0 = c(3,3,3,3), setw.cJ1 = c(0.3,0.3,0.3,0.3), w =
"0.3+0.5*exp(1+y)/(1+exp(1+y))", dwL = 0.3, dwU = 0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4]
in the first three simulations but not in the last one (look at the histogram in the right-
bottom diagram). This is an interesting observation but not a surprise for the reader who
did recommended simulations in Chapter 2. Indeed, the numbers of complete cases in the
M-samples are close to 125, and we know that these sizes may produce peculiar samples
that are far from expected ones.
After these remarks, let us look at particular outcomes beginning with the experiment
shown in the top row. The E-estimate of the availability likelihood is pretty good. In its
turn, it implies a fairly good, keeping in mind the size N = 124 of complete cases, E-
Figure 5.6 Performance of the density E-estimator fˆX (x) based on M-sample from (AX, A) and
an extra sample from the auxiliary variable Y = σXε, where ε is standard normal. Everything else
is identical to Figure 5.5. [set.c = c(2,2,3,3), n = 200, c = 1, setw.cJ0 = c(3,3,3,3), setw.cJ1 =
c(0.3,0.3,0.3,0.3), sigma = 2, w = "0.3+0.5*exp(1+y)/(1+exp(1+y))", dwL = 0.3, dwU = 0.9, cJ0
= 3, cJ1 = 0.8, cTH = 4]
estimate of the Normal density. Note how the estimate corrects the biased histogram and
indicates a symmetric and unimodal bell-shaped density. The next and absolutely similar
simulation, shown in the second row, implies a poor estimate of w(x), but is this the fault
of the E-estimator? If we carefully look at the scattergram, shown by circles, we may note
that the E-estimate does reflect the data at hand. Nonetheless, the poor estimate ŵ(y)
does not dramatically affect E-estimate of the Normal. Yes, it is worse than in the first
simulation (compare the ISEs), but still it is symmetric and bell-shaped. The explanation
of this robustness is that the particular ŵ correctly shows the monotonicity of the underlying
availability likelihood for a majority of points Yl corresponding to Al = 1.
In the last two experiments of Figure 5.5 with the Bimodal density, estimates of the
availability likelihood are good, but the density estimates are different in their quality.
The first one (it is the second from the bottom) is relatively good for N = 122 (compare
with outcomes in Section 2.2 and conduct more simulations). At the same time, in the last
experiment modes of the estimate have practically the same magnitudes and the estimate
itself is worse (compare the ISEs). Is this the fault of the E-estimator? The answer is “no”
because the estimator does exactly what it should do. Due to the biased data, the E-
estimator increases the left mode and decreases the right one. In this particular M-sample,
the right mode in the histogram is a bit taller than the left one, and this implies the E-
estimate (the dashed line) with about the same magnitudes of the just barely pronounced
modes.
Figure 5.6 allows us to explore the case Y = 2Xε where ε is standard normal and
independent of X, otherwise this figure and simulations are identical to Figure 5.5. The two
top rows show us interesting outcomes for the Normal density. The first experiment yields
a poor estimate of the availability likelihood (which is supported by the scattergram and
note that the estimator knows only data), and nonetheless the density E-estimate is good
for the case when only N = 122 observations are available in the MNAR sample. In the
second experiment the situation is reversed. The availability likelihood estimate is good but
the density estimate is poor (compare the ISEs). The reason for the latter is the M-sample
exhibited by the histogram. We may conclude that the density E-estimator is robust to imperfect estimates of the availability likelihood and, as we know, it always follows the data in an M-sample. The same
conclusion may be obtained from visualization of the two experiments for the Bimodal. We
again see the importance of a “good” M-sample.
Overall, the conclusion is that in the case of a destructive missing it is prudent to search
for an auxiliary variable which can explain the missing mechanism.
5.4 Regression with MNAR Responses

It is assumed that w(x, y) is not constant in y, and hence we are dealing with the MNAR (missing not at random) case. The problem is to estimate the regression function m(x) := E{Y|X = x}, x ∈ [0, 1].
The following example explains the setting. Suppose that we would like to predict the
current salary of college graduates who graduated 5 years ago based on their GPA (grade
point average). We may try to get data for n graduates, but it is likely that salaries of some
graduates will not be known. What can be done in this case?
As we will see shortly, based solely on an M-sample it is impossible to suggest a consistent
regression estimator, and the MNAR implies a destructive missing. Hence, additional information is needed to restore the information about the regression function contained in an M-sample. Note a dramatic difference with the case of MAR responses, discussed in Section
4.2, when a complete-case approach yields optimal estimation of the regression function.
In this section a number of topics and issues are discussed, and it is convenient to consider
them in corresponding subsections. We begin with the case w(x, y) = w(y), explain why
this implies the destructive missing, and then consider several possible sources of additional
information that will allow us to consistently estimate the regression function. Then the
general case of w(x, y) is considered.
The case w(x, y) = w(y). We are considering the setting when the likelihood of missing
the response depends solely on its value. This is a classical MNAR, and let us show that
no consistent estimation is possible in this case. The joint mixed density of the triplet
(X, AY, A) is
f^{X,AY,A}(x, ay, a) = [f^{X,Y}(x, y)w(y)]^a [f^{X,A}(x, 0)]^{1−a}
= [f^X(x)f^{Y|X}(y|x)w(y)]^a [∫_{−∞}^{∞} f^{X,Y,A}(x, y, 0)dy]^{1−a}
= [f^X(x)f^{Y|X}(y|x)w(y)]^a [f^X(x)(1 − ∫_{−∞}^{∞} f^{Y|X}(y|x)w(y)dy)]^{1−a}.   (5.4.3)
Here x ∈ [0, 1], y ∈ (−∞, ∞), and a ∈ {0, 1}. As we see, the joint density depends on the
product f Y |X (y|x)w(y), and hence only this product or its functionals can be estimated
based on an M-sample.
Inconsistency of a complete-case approach. As we know from Chapter 4, for the case
of MAR responses the complete-case methodology implies a consistent and even optimal
estimation. Let us check what may be expected if we use a complete-case approach for
regression estimation in the MNAR case. Using (5.4.3) we can write,
E{AY|X, A = 1} = ∫_{−∞}^{∞} yf^{AY|X,A}(y|x, 1)dy = [∫_{−∞}^{∞} yf^{X,AY,A}(x, y, 1)dy]/f^{X,A}(x, 1)
= [∫_{−∞}^{∞} yf^X(x)f^{Y|X}(y|x)w(y)dy]/[f^X(x)∫_{−∞}^{∞} f^{Y|X}(y|x)w(y)dy] = [∫_{−∞}^{∞} yf^{Y|X}(y|x)w(y)dy]/[∫_{−∞}^{∞} f^{Y|X}(y|x)w(y)dy].   (5.4.4)
We conclude that the complete-case approach yields estimation of a regression function
corresponding to the conditional density
f_*^{Y|X}(y|x) := f^{Y|X}(y|x)w(y)/∫_{−∞}^{∞} f^{Y|X}(u|x)w(u)du,   (5.4.5)

rather than to the underlying f^{Y|X}(y|x). Note that, as could be expected, f_*^{Y|X} is the biased conditional density.
Figure 5.7 complements this theoretical discussion by allowing us to look at simulations
and appreciate performance of the complete-case approach. The two columns of diagrams
correspond to two underlying regression functions, here the Normal and the Strata. A top
diagram shows a hidden H-sample for a classical regression

Y_l = m(X_l) + σε_l,  l = 1, . . . , n,   (5.4.6)

where X_l are independent uniform random variables and ε_l are independent standard nor-
mal. The H-scattergram is shown by circles. The solid line shows the underlying regression
and the dashed line shows the E-estimate. For the Normal regression the E-estimate is
reasonable, it is skewed a bit and its mode is not as large as desired, but this is what the
scattergram indicates and the E-estimate just follows the data. Overall the estimate is good
and this conclusion is supported by the small ISE. For the Strata the E-estimate is not per-
fect but, as we know from previous simulations for the Strata, it is not bad either. Further,
the relative magnitudes of the two modes are shown correctly.
Now let us look at the bottom diagrams in Figure 5.7 that show us MNAR samples. The
MNAR is created by the availability likelihood w(y) = max(0.3, min(0.9, 1−0.3y)). Hence we
are more likely to preserve smaller responses and to miss larger ones. The bottom diagrams
show us complete pairs by circles and incomplete by crosses. Let us look at the left-bottom
Figure 5.7 Complete-case approach for regression with MNAR responses. The underlying regres-
sion is Y = m(X) + σε where X is uniform and ε is independent standard normal. E-estimate for
M-sample is based on complete cases that are shown by circles, while incomplete cases are shown
by crosses. Underlying regression function and its E-estimate are shown by the solid and dashed
lines, respectively. N is the number of complete cases in an M-sample. {The choice of underly-
ing regression functions is controlled by the argument set.c. The availability likelihood is equal to
max(dwL, min(dwU, w(y))) and w(y) is controlled by the string w.} [n = 100, set.c = c(2,4), sigma
= 2, w = "1-0.3*y", dwL = 0.3, dwU = 0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4]
This estimator is unbiased. Let us also show that for the considered model (5.4.6) the
variance of θ̄j is
V(θ̄_j) = n^{−1} ∫_{−∞}^{∞} ∫_0^1 [(m²(x) + σ²)f^{Y|X}(y|x)]/[f^X(x)w(y)] dxdy [1 + o_j(1)].   (5.4.9)
To verify (5.4.9) we recall that V(Z) = E{Z²} − [E{Z}]², the variance of a sum of independent random variables is equal to the sum of the variances of the variables, and that ϕ_j²(x) = 1 + 2^{−1/2}ϕ_{2j}(x). This allows us to write,
Figure 5.8 Regression with MNAR responses when the availability likelihood w(y) is known. The
underlying experiment and the structure of diagrams are the same as in Figure 5.7. [n = 100, set.c
= c(2,4), sigma = 2, w = "1-0.3*y", dwL = 0.3, dwU = 0.9, c = 1, cJ0 = 3, cJ1 = 0.8, cTH =
4]
The final remark is about the case of an unknown design density. Because all predictors
are available, we can calculate the density E-estimator fˆX (x) of Section 2.2, and then plug
its truncated version max(fˆX (x), c/ ln(n)) in (5.4.8).
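To visualize the construction, here is a minimal R sketch; it assumes, by analogy with (5.4.14) and (5.4.17), that the sample mean Fourier estimator (5.4.8) has the inverse-weighted form θ̄_j = n^{−1}Σ_{l=1}^n A_lY_lϕ_j(X_l)/[f^X(X_l)w(A_lY_l)], uses a fixed cutoff J instead of the adaptive E-estimator, and loosely follows the simulation of Figure 5.8 (all names are illustrative):

# Minimal sketch: regression with MNAR responses when w(y) is known.
# Uniform design, so f^X(x) = 1; otherwise plug in a truncated density estimate.
set.seed(4)
n <- 100; sigma <- 2
m <- function(x) 1 + dnorm(x, 0.5, 0.15) / 3           # an illustrative bell-shaped regression
w <- function(y) pmax(0.3, pmin(0.9, 1 - 0.3 * y))     # availability likelihood in the response
x <- runif(n)
y <- m(x) + sigma * rnorm(n)
a <- rbinom(n, 1, w(y))                                 # larger responses are missed more often
ay <- a * y                                             # the M-sample is (x, ay, a)
phi <- function(j, t) if (j == 0) rep(1, length(t)) else sqrt(2) * cos(pi * j * t)
J <- 3
theta.bar <- sapply(0:J, function(j) mean(a * ay * phi(j, x) / w(ay)))   # assumed form of (5.4.8)
tt <- seq(0, 1, 0.01)
m.hat <- as.vector(sapply(0:J, function(j) phi(j, tt)) %*% theta.bar)    # projection regression estimate
plot(x[a == 1], y[a == 1], xlab = "X", ylab = "AY"); lines(tt, m.hat)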
Figure 5.8 shows how the proposed E-estimator performs. The simulation and the struc-
ture of diagrams is the same as in Figure 5.7. The only difference is that now we are using
the E-estimator based on both M-sample and the known availability likelihood w(y). As
we see, despite the significantly smaller sizes N of available responses in M-samples, the
E-estimates are relatively good. In the case of the Normal regression, the magnitude of
the mode is shown correctly. Further, we know from Figure 5.7 that the case of the Strata
regression is extremely complicated for the considered MNAR, and typically the MNAR
implies that the left mode is significantly smaller than the right one. Here this is not the
case, and actually the relative heights of the modes are shown better for the M-sample than
for the H-sample.
The reader is advised to repeat Figure 5.8, use different parameters and realize that
while the missing is MNAR, knowing the availability likelihood function allows us to re-
store information about a regression function contained in M-sample. Another important
conclusion is that if we are able to estimate the availability likelihood function, then the
MNAR is no longer destructive.
Estimation of availability likelihood w(y). As we already know, if availability likelihood
w(y) is unknown then no consistent regression estimation is possible based solely on an M-
sample. One of the approaches to restore information about regression function from an
M-sample is to use an extra sample that will allow us to estimate the availability likelihood
and then convert the problem into the above-considered one. We discussed in the previous
sections several possibilities that also may be used here.
One possibility is to utilize a more expensive sample from (Y, A) without missing. Sup-
pose that we may get an extra sample (E-sample) (Yn+1 , An+1 ), . . . , (Yn+k , An+k ) of size k
which is smaller than n. Note that the E-sample alone cannot help us to estimate the re-
gression because we do not know the predictors. On the other hand, we can use a Bernoulli
regression to estimate the availability likelihood w(y) and then plug the E-estimator ŵ(y)
in (5.4.8).
Another viable possibility, which is more feasible than the previous one, is to get an
extra sample Yn+1 , . . . , Yn+k from the response Y . Note that this E-sample cannot help us
to estimate the regression function per se because the corresponding predictors are unknown,
but it may allow us to estimate the underlying availability likelihood. Let us explain the
proposed approach. First, the density of responses f Y (y) is estimated. Second, Bernoulli
regression of Al on Al Yl , based on complete cases in the M-sample, is considered. As we know
from Sections 2.4 and 3.7, if the design density f Y (y) of predictors in a Bernoulli regression
is known, then only “successes” may be used to estimate an underlying regression function.
Finally, the estimated availability likelihood is used in (5.4.8).
We may conclude that whatever an extra opportunity exists to estimate w(y), it should
be explored because otherwise the M-sample is a pure loss due to the destructive MNAR.
Auxiliary variable defines the missing. Suppose that we can find an auxiliary and always observed variable Z which defines the missing, and let Z and Y be dependent. Further, the available M-sample is from the quartet (X, AY, A, Z). Then the missing model changes and it becomes MAR with the availability likelihood

P(A = 1|X = x, Y = y, Z = z) = w(z).

Let us explore regression estimation for this model assuming that Z, similarly to X and
Y , is a continuous random variable. The joint mixed density of the quartet for a complete
case (when A = 1) may be written as
Figure 5.9 Regression with missing responses when an auxiliary variable Z, which defines the
missing, is available. The regression is Y = m(X) + σε where X is uniform and ε is stan-
dard normal. The auxiliary variable is Z = e^{1+a(Y−b)}/[1 + e^{1+a(Y−b)}]. The availability likelihood
P(A = 1|X = x, Y = y, Z = z) = max(dwL , min(dwU , w(z))) and w(z) = 1 − z. The structure
of diagrams in the top and bottom rows is identical to those in Figure 5.8. In a middle diagram,
circles show pairs (Zl , Al ), l = 1, 2, . . . , n while triangles and crosses show values of the underlying
availability likelihood and its E-estimate at points Zl , l = 1, 2, . . . , n. {The choice of underlying re-
gression functions is controlled by set.c, a string w controls the choice of the availability likelihood,
parameters a and b control the choice of Z, and c is used in the lower bound c/ ln(n) for the density
and availability likelihood E-estimates (recall that these estimates are plugged in the denominator).}
[n = 100, set.c = c(2,4), sigma = 0.5, a = 3, b = 1, c = 1, w = "1-z", dwL = 0.4, dwU = 0.9,
cJ0 = 3, cJ1 = 0.8, cTH = 4]
Suppose that the design density f X (x) and the availability likelihood w(z) are known,
and f X (x) ≥ c∗ > 0, x ∈ [0, 1]. Then (5.4.13) implies the following sample mean Fourier
estimator,
θ̄_j := n^{−1} Σ_{l=1}^n A_lY_lϕ_j(X_l)/[f^X(X_l)w(Z_l)].   (5.4.14)
Because all observations of X are available, we can estimate the design density and plug
its E-estimate in (5.4.14). Similarly, because all observations of the pair (Z, A) are available,
the availability likelihood can be estimated via Bernoulli regression and also plugged in
(5.4.14). Note that these estimates are used in the denominator, so we truncate them from
below by c/ ln(n). The Fourier E-estimator is constructed, and it yields the regression E-
estimator m̂(x), x ∈ [0, 1].
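A minimal R sketch of this pipeline follows (a fixed cutoff J and a parametric logistic fit of w(z) are used purely for illustration; the book's procedure uses the nonparametric Bernoulli regression E-estimator, and the setup only loosely follows Figure 5.9):

# Minimal sketch: MNAR responses with an always observed auxiliary variable Z
# that defines the missing; w(z) is estimated by a logistic fit (a parametric
# stand-in for the Bernoulli regression E-estimator) and plugged into (5.4.14).
set.seed(5)
n <- 200; sigma <- 0.5
m <- function(x) 1 + dnorm(x, 0.5, 0.15) / 3              # illustrative regression function
x <- runif(n)                                             # Uniform design, f^X(x) = 1
y <- m(x) + sigma * rnorm(n)
z <- exp(1 + 3 * (y - 1)) / (1 + exp(1 + 3 * (y - 1)))    # auxiliary variable, as in Figure 5.9
w <- function(z) pmax(0.4, pmin(0.9, 1 - z))              # availability likelihood w(z)
a <- rbinom(n, 1, w(z))                                   # X, Z and A are always observed
w.hat <- pmax(fitted(glm(a ~ z, family = binomial)), 1 / log(n))   # estimate of w(Z_l), truncated from below
phi <- function(j, t) if (j == 0) rep(1, length(t)) else sqrt(2) * cos(pi * j * t)
J <- 3
theta.bar <- sapply(0:J, function(j) mean(a * y * phi(j, x) / w.hat))   # plug-in version of (5.4.14)
tt <- seq(0, 1, 0.01)
m.hat <- as.vector(sapply(0:J, function(j) phi(j, tt)) %*% theta.bar)
plot(x[a == 1], y[a == 1], xlab = "X", ylab = "AY"); lines(tt, m.hat)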
Figure 5.9 illustrates the setting. Here, similar to Figures 5.7 and 5.8, the top diagrams
show regression for underlying H-samples. The middle diagrams show estimation of the
availability likelihood w(z) which is the regression function E{A|Z = z}. Note that in both
simulations the E-estimates ŵ(z) are not good, but they do reflect the data. For instance
in the right-middle diagram the E-estimate, shown by crosses, is well below the underlying
availability likelihood, shown by triangles, in the middle of the unit interval. However, this
behavior of the E-estimate is supported by the data. The bottom diagrams show us recovered
regressions based on M-samples. Keeping in mind the small numbers N of available complete
cases, the recovered regressions are good and comparable with E-estimates for H-samples.
Availability likelihood w(x, y). Let us consider the general case (5.4.1) when the avail-
ability likelihood depends on both X and Y . In other words, missing of the response depends
on values of the predictor and the response. Of course, as we already know, in this case the
missing is MNAR and M-sample alone does not allow us to suggest a consistent regression
estimator. The open question is as follows. Suppose that the availability likelihood is known.
Can the regression be estimated in this case?
To answer this question, consider the joint density
f X,AY,A (x, y, 1) = f X,Y (x, y)P(A = 1|X = x, Y = y) = f X,Y (x, y)w(x, y). (5.4.15)
Then a Fourier coefficient θ_j := ∫_0^1 m(x)ϕ_j(x)dx of the regression function m(x) may be written as

θ_j = ∫_0^1 E{Y|X = x}ϕ_j(x)dx = E{AYϕ_j(X)/[f^X(X)w(X, Y)]}.   (5.4.16)
Assume for a moment that the design density f X (x) is known. Then we can suggest the
following sample mean estimator,
θ̄_j := n^{−1} Σ_{l=1}^n A_lY_lϕ_j(X_l)/[f^X(X_l)w(X_l, A_lY_l)].   (5.4.17)
This estimator is unbiased, and for the model (5.4.6) its variance satisfies
V(θ̄_j) = n^{−1} ∫_{−∞}^{∞} ∫_0^1 [(m²(x) + σ²)f^{Y|X}(y|x)]/[f^X(x)w(x, y)] dxdy (1 + o_j(1)).   (5.4.18)
To verify (5.4.18) we recall that V(Z) = E{Z²} − [E{Z}]², the variance of a sum of independent random variables is equal to the sum of the variances of the variables, and that ϕ_j²(x) = 1 + 2^{−1/2}ϕ_{2j}(x). This allows us to write,
Figure 5.10 Regression with MNAR responses when the availability likelihood function w(x, y) is
known. The underlying experiment and the structure of diagrams are the same as in Figure 5.7.
{The availability likelihood is defined by a string w whose values are truncated from below by dwL
and from above by dwU.} [n = 100, set.c = c(2,4), sigma = 2, w = "1-0.3*x*y", dwL = 0.3, dwU
= 0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4]
If the design density f X (x) is unknown, it is estimated by the E-estimator and then
plugged in (5.4.17).
Figure 5.10 illustrates the setting, here we use the same underlying model as in Figure 5.7
only with the availability likelihood w(x, y) = max(0.3, min(0.9, 1 − 0.3xy)). Let us look at
the left column of diagrams corresponding to the Normal regression function. The underlying
availability likelihood increases chances of missing responses corresponding to larger values
of the product XY , and this pattern is clearly seen in the bottom diagram of the M-sample.
Further, note that 20% of responses are missed. Nonetheless, the E-estimator does a good
job in recovering the Normal regression function from the MNAR data. A similar pattern
can be observed in the right column for the Strata regression. Here again the E-estimator
does a good job in recovering the underlying regression.
The main challenge of the setting is that in general the availability likelihood w(x, y)
is unknown and some extra information should be provided, and this is where the main
statistical issue arises. We need to estimate a bivariate function w(x, y) and then use it
for estimation of a univariate function m(x). As we know from Section 2.5, estimation of a
multivariate function is complicated by the curse of multidimensionality. In our setting this
is a serious problem because the nuisance function w(x, y) is bivariate while the function
of interest m(x) is univariate. As a result, if an extra sample of size k may be collected for
estimation of w(x, y), it is no longer possible to guarantee that k can be smaller in order
than n.
5.5 Regression with MNAR Predictors

It is explicitly assumed that w(x, y) depends on x and hence the missing mechanism is MNAR.
The problem is to estimate the regression function
where E{ε|X} = 0 and E{ε²|X} = 1 almost surely, and the regression error ε and the predictor X may be dependent.
To shed light on the problem, note that the joint mixed density of the triplet (AX, Y, A)
is

f^{AX,Y,A}(ax, y, a) = [f^{X,Y}(x, y)w(x, y)]^a [∫_0^1 (1 − w(u, y))f^{X,Y}(u, y)du]^{1−a},   (5.5.4)
This Fourier estimator yields the regression E-estimator m̂(x). Note that only complete
cases are used by the E-estimator.
Figure 5.11 Regression with MNAR predictors when the design density and the availability likelihood
w(x, y) are known. The underlying experiment and the structure of diagrams are the same as in
Figure 5.10, only here predictors (and not responses) are missed with the same availability likelihood
function. [n = 100, set.c = c(2,4), sigma = 2, w = "1-0.3*x*y", dwL = 0.3, dwU = 0.9, cJ0 =
3, cJ1 = 0.8, cTH = 4]
Figure 5.11 illustrates the model and also shows how the E-estimator performs. Note
that the underlying regression model is the same as in Figure 5.10, and the only difference
is that here the predictors (and not the responses as in Figure 5.10) are missed. The latter
allows us to understand and appreciate differences between the two settings. First of all,
let us look at incomplete cases. It was relatively simple to recognize the pattern of w(x, y)
for the case of missing responses in Figure 5.10. Here, for the case of missing predictors,
this is not as simple. For the case of the Normal regression function (the left column), only
via comparison between the top and bottom diagrams, it is possible to recognize that the
availability likelihood is decreasing in xy, and this explains the rather complicated structure
of missing predictors. For the Strata (the right column of diagrams) this recognition may be
simpler but again only if we compare M-sample with H-sample. In short, such an analysis
is a teachable moment in recognizing the destructive nature of the MNAR. The proposed
E-estimator performs relatively well.
Figure 5.12 Regression with MNAR predictors when the availability likelihood w(x, y) is known.
The underlying experiment and the structure of the top and bottom diagrams are the same as in
Figure 5.11. The middle diagrams show histograms of available predictors and E-estimates of the
design density. [n = 100, set.c = c(2,4), sigma = 2, w = "1-0.3*x*y", c = 1, dwL = 0.3, dwU =
0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4]
Of course, the above-considered estimator used the underlying design density f X , which
in general is unknown. Is it possible to estimate it based on M-sample and an underlying
availability likelihood? Let us explore this question.
To use our density E-estimator, we need to understand how to estimate the Fourier coefficients

κ_j := ∫_0^1 f^X(x)ϕ_j(x)dx   (5.5.8)
of the design density f X (x), x ∈ [0, 1]. Using (5.5.4) we may write that
Z 1 Z ∞
κj = [ f X,Y (x, y)ϕj (x)dy]dx
0 −∞
172 DESTRUCTIVE MISSING
1 ∞
f AX,Y,A (x, y, 1)ϕj (x)
Z Z n ϕj (AX) o
= dydx = E A . (5.5.9)
0 −∞ w(x, y) w(AX, Y )
This allows us to define the sample mean estimator

κ̂_j := n^{−1} \sum_{l=1}^{n} A_l \frac{ϕ_j(A_l X_l)}{w(A_l X_l, Y_l)}.   (5.5.10)
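The following minimal sketch in base R (it is not the book's software; the vector names AX, Y, A and the function w are illustrative) shows how the sample mean estimator (5.5.10) may be computed when the availability likelihood w(x, y) is known.

# Minimal sketch of the Fourier estimator (5.5.10) for the design density.
# AX = A*X (zero for missing predictors), Y and A are numeric vectors of length n;
# w is the known availability likelihood, assumed bounded away from zero.
kappa.hat <- function(j, AX, Y, A, w) {
  # cosine basis element on [0, 1]
  phi.j <- if (j == 0) function(x) rep(1, length(x)) else function(x) sqrt(2) * cos(pi * j * x)
  mean(A * phi.j(AX) / w(AX, Y))   # n^{-1} sum of A_l * phi_j(A_l X_l) / w(A_l X_l, Y_l)
}

# Example with the availability likelihood used in Figure 5.11:
# kappa.hat(1, AX, Y, A, function(x, y) 1 - 0.3 * x * y)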
Denote the corresponding density E-estimator by f̃^X(x); because it is used in the denominator of (5.5.7), we plug into (5.5.7) its truncated version
Note that unless the availability likelihood is constant, the missing mechanism is MNAR.
The problem is to estimate the regression function
To propose a solution, we begin with analysis of the joint mixed density of the triplet
(AX, AY, A). Write,
f^{AX,AY,A}(ax, ay, a) = [f^{X,Y}(x, y) w(x, y)]^a [1 − E\{w(X, Y)\}]^{1−a},   (5.6.3)
where (x, y, a) ∈ [0, 1]×(−∞, ∞)×{0, 1}. We may conclude that observations in a complete
case, when A = 1, are biased and the biasing function is w(x, y). As a result, while it may
be tempting to simply ignore missing cases, this may lead to inconsistent estimation.
Let us assume that the availability likelihood w(x, y) and the design density f X (x)
are known. Following the E-estimation methodology, we need to propose a sample mean
estimator of Fourier coefficients
θ_j := \int_0^1 m(x) ϕ_j(x)\,dx   (5.6.4)
This Fourier estimator yields the regression E-estimator m̂(x), x ∈ [0, 1].
In general the design density is unknown and should be estimated. Here we again use our
traditional approach of construction of a density E-estimator. Note that Fourier coefficients
of f X (x) can be written as
κ_j := \int_0^1 f^X(x) ϕ_j(x)\,dx = E\Big\{A \frac{ϕ_j(AX)}{w(AX, AY)}\Big\}.   (5.6.7)
Denote the corresponding density E-estimator by f̃^X(x); because it is used in the denominator, we plug into (5.6.6) its truncated version
Figure 5.13 Regression with MNAR cases when the availability likelihood w(x, y) is known. The underlying regression experiment and the structure of diagrams are the same as in Figure 5.12. [n = 100, set.c = c(2,4), sigma = 2, w = "1-0.3*x*y", c = 1, dwL = 0.3, dwU = 0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4]
Our final remark is about the case when the bivariate availability likelihood w(x, y) is
unknown. This is a serious statistical problem with two possible approaches. The first is to conclude that a consistent estimation of the regression is impossible. The second is to begin, if possible, collecting extra information that will allow us to estimate the availability likelihood. This option was discussed in the previous section. The best-case scenario is the presence of an auxiliary variable that defines the missing. In this case the MNAR is converted into MAR. A more complicated remedy is to obtain an extra sample that
allows us to estimate w(x, y). The challenge of this remedy is that we need to estimate a
bivariate function, and due to the curse of multidimensionality, that estimation may require
a relatively large, with respect to the sample size n, number of additional observations.
5.7 Exercises
5.1.1 Explain MCAR, MAR and MNAR missing mechanisms. Give corresponding exam-
ples.
5.1.2 Consider a continuous random variable X and a Bernoulli availability A. Explain why observing the pair (AX, A) is equivalent to observing the single variable AX. Also, explain the meaning of the relation A = I(AX ≠ 0).
5.1.3 What is the availability likelihood?
5.1.4 Verify relation (5.1.3).
5.1.5 Consider the case of a nonconstant availability likelihood w(x). Explain why knowing
only an M-sample implies a destructive missing.
5.1.6∗ Consider the number N of available observations in an M-sample. Describe statistical
characteristics of N including its mean, variance and the probability of large deviations.
5.1.7 Verify each relation in (5.1.7).
5.1.8 Explain how MNAR is related to biased data.
5.1.9 Repeat Figure 5.1 with different availability likelihoods that imply M-samples skewed
to the left and to the right with respect to underlying H-samples. Explain histograms.
5.1.10 Explain the E-estimator used in Figure 5.2. Why is consistent estimation possible
for the MNAR?
5.1.11∗ Find the variance of estimator (5.1.8).
5.1.12∗ Find the MISE of the density E-estimator based on Fourier estimator (5.1.8).
5.1.13 Using Figure 5.2, explore how different shapes of the availability likelihood affect
density estimation.
5.1.14 Explain how w(x) affects the number N of available observations.
5.1.15 Using Figure 5.2, propose better parameters of the E-estimator.
5.1.16 Explain why knowing availability likelihood makes the MNAR nondestructive.
5.1.17 What is the shape of w(x) that is more damaging for estimation of the Strata
density?
5.2.1 Explain why an extra sample may be needed for recovery of the density from MNAR data.
5.2.2 What is the idea of using an extra sample? Is it simpler to repeat sampling without
missing observations? Hint: Pay attention to the fact that the size of the extra sample may be smaller in order than the size n of an M-sample.
5.2.3∗ Why can the size of the extra sample be smaller in order than n?
5.2.4 Present several examples where, for an additional price, it is possible to get an extra
sample with no missing observations.
5.2.5 Explain the methodology of E-estimation for the case of an extra-sample.
5.2.6 Explain the estimator (5.2.1).
5.2.7∗ What is the variance of estimator (5.2.1)?
5.2.8 What is the role of parameter c in estimator (5.2.1)?
5.2.9∗ Evaluate the MISE of the density E-estimator based on the Fourier estimator (5.2.1).
5.2.10 Explain the simulation used in Figure 5.3.
5.2.11 E-estimates of w(x), exhibited in Figure 5.3, are not satisfactory. Explain why and
what can be done to improve them.
5.2.12 Using Figure 5.3, suggest better parameters of E-estimator.
5.2.13 Explain the underlying simulation used in Figure 5.4.
5.2.14 What is the difference, if any, between Figure 5.3 and Figure 5.4?
5.2.15∗ Explore variance of the aggregated estimator (5.2.3).
5.2.16∗ Suggest parameter λ which minimizes variance of estimator (5.2.3).
5.3.1 MNAR typically implies a destructive missing. Explain how an auxiliary variable can
help in restoring information contained in M-sample.
5.3.2 Explain the relation (5.3.1) and its meaning.
5.3.3 Consider the case of MNAR, and suppose that (5.3.1) is valid. Can X and Y be
independent?
5.3.4 Verify each relation in (5.3.2).
5.3.5 Verify (5.3.3).
5.3.6 Consider the case of a known w(y) and propose a density E-estimator.
5.3.7 Explain the underlying idea of the estimator (5.3.4).
5.3.8 Show that (5.3.4) is an unbiased estimator.
5.3.9∗ Find the variance of estimator (5.3.4).
5.3.10 Verify every equality in (5.3.5).
5.3.11 Verify every relation in (5.3.7).
5.3.12 Prove inequality \int_{−∞}^{∞} f^{Y|X}(y|x) w^{−1}(y)\,dy < ∞.
5.3.13∗ Explain how w(y) may be estimated.
5.3.14∗ Evaluate the mean squared error of estimator ŵ(y0 ) defined in (5.3.8).
5.3.15 Explain the underlying idea of Fourier estimator (5.3.9).
5.3.16∗ Evaluate the mean of estimator (5.3.9).
5.3.17∗ Evaluate the variance of the estimator (5.3.9). Hint: Begin with the case of given
w(y).
5.3.18 Explain and comment on two models of the auxiliary variable Y as a function in X
used in the simulations.
5.3.19 Explain the underlying simulation used in Figure 5.5.
5.3.20 How does sample size n affect estimation of the density?
5.3.21 Using Figure 5.5, explain how parameters of the linear model for Y affect the esti-
mation.
5.3.22 Using Figure 5.5, consider all four corner functions and comment on complexity of
their estimation.
5.3.23 Using Figure 5.5, propose better parameters of the E-estimator.
5.3.24 Explain the underlying simulation used in Figure 5.6.
5.3.25 Using Figure 5.6, explore how the sample size n affects the estimation.
5.3.26 Explore how parameter σ of Figure 5.6 affects estimation, and complement a nu-
merical study by a theoretical argument. Hint: Use empirical ISE and theoretical MISE.
5.3.27 Consider all four corner functions and comment on their estimation using Figure
5.6.
5.3.28 Find better parameters of E-estimator for Figure 5.6.
5.3.29 Compare performance of E-estimator for models used in Figures 5.5 and 5.6.
5.4.1 Explain the aim of a nonparametric regression.
5.4.2 What is the definition of a nonparametric regression?
5.4.3 Formulate the model of nonparametric regression with MNAR responses.
5.4.4 What is the availability likelihood for MAR regression with missing responses?
5.4.5 What is the availability likelihood for MNAR regression with missing responses?
5.4.6∗ Explain why MNAR may imply destructive missing but MAR may not.
5.4.7∗ Describe several possible scenarios when MNAR does not imply destructive missing
and a consistent estimation is possible. Explain why.
5.4.8 Explain a possibility of consistent estimation when w(x, y) = w(y). Further, is the
case w(x, y) = w(x) of interest here?
5.4.9 Explain every equality in (5.4.3).
5.4.10 Consider the case w(x, y) = w(y), and explain what functions can and cannot be
consistently estimated using M-sample.
5.4.11∗ For the case w(x, y) = w(y), propose a consistent estimator of the product
f Y |X (y|x)w(y).
5.4.12 Explore using a complete-case approach for the MNAR with w(x, y) = w(y).
5.4.13 Verify relations in (5.4.4).
5.4.14 Explain why (5.4.5) is the pivot for understanding a complete-case approach.
5.4.15 What is the underlying simulation in Figure 5.7?
5.4.16 What can be concluded from E-estimates exhibited in Figure 5.7?
5.4.17 Using Figure 5.7, explain how the availability likelihood affects M-sample, then
complement your conclusion by a theoretical analysis.
5.4.18 Can an adjustment of parameters of the E-estimator improve E-estimation based
on complete cases?
5.4.19 Explain all arguments of Figure 5.7.
5.4.20∗ Assume that the availability likelihood w(y) is known. Propose density E-estimators
for f X (x), f Y (y), f Y |X (y|x) and f X,Y (x, y).
5.4.21 Prove (5.4.7).
5.4.22 Explain the idea of Fourier estimator (5.4.8). Describe assumptions.
5.4.23 Show that (5.4.8) is an unbiased estimator.
5.4.24∗ Evaluate the variance of Fourier estimator (5.4.8). Then explain how w(y) affects
the mean squared error of the estimator.
5.4.25∗ It was explained in Section 2.3 how to decrease the variance of a sample mean
estimator when the variance depends on an underlying regression function m(x). Use that
approach and propose a modification of (5.4.8) with a smaller variance.
5.4.26 Verify each equality in (5.4.10).
5.4.27 Explain the underlying simulation used in Figure 5.8.
5.4.28 What are the assumptions used by the E-estimator in Figure 5.8?
5.4.29 Repeat Figure 5.8 for several sample sizes and comment on the effect of the sample
size on quality of estimation.
5.4.30 Using Figure 5.8, explore the effect of availability likelihood on estimation of all four
corner regression functions.
5.4.31 For each corner regression function, suggest better parameters of the E-estimator
used in Figure 5.8. Then compare your suggestions and comment on your findings.
5.4.32 Explain the underlying idea of using an auxiliary variable to overcome destructive
missing caused by the MNAR.
5.4.33 Explain (5.4.11).
5.4.34 Under the MNAR, is it reasonable to assume that the auxiliary Z and the response
Y are independent?
5.4.35 Verify (5.4.12), and explain the used assumptions.
5.4.36 Prove every equality in (5.4.13) and explain the necessity of used assumptions.
5.4.37 Explain the underlying idea of Fourier estimator (5.4.14).
5.4.38∗ Evaluate the mean and variance of estimator (5.4.14).
5.4.39 Explain the simulation used in Figure 5.9.
5.4.40 Does the parameter σ affect the missing mechanism?
5.4.41∗ Explain each step of E-estimation used in Figure 5.9.
5.4.42 Using Figure 5.9, comment on how w(z) affects estimation of regression functions.
5.4.43 Explain all parameters of Figure 5.9.
5.4.44 Present several examples of MNAR missing when the availability likelihood depends
on both the predictor and the response.
5.4.45 Verify (5.4.15).
5.4.46 Prove (5.4.16).
5.4.47 Explain the underlying idea of Fourier estimator (5.4.17).
5.4.48 Prove, or disprove, that (5.4.17) is an unbiased estimator. Mention used assumptions.
5.4.49∗ Evaluate the variance of estimator (5.4.17). Explain used assumptions.
5.4.50∗ Variance of estimator (5.4.17) depends on regression function. Propose a modifica-
tion of the estimator which removes the dependence and makes the variance smaller. Hint:
Recall Section 2.3.
5.4.51∗ Suppose that the assumption f X (x) ≥ c∗ > 0, x ∈ [0, 1] is violated. Can the
estimator (5.4.17) be recommended in this case?
5.4.52 Verify every equality in (5.4.19).
5.4.53 Explain how E-estimator of the design density is constructed.
5.4.54 Explain the simulation used in Figure 5.10.
5.4.55 Repeat Figure 5.10 and analyze simulated data and estimates.
5.4.56 Suggest better parameters of the E-estimator used in Figure 5.10.
5.4.57 Using bottom diagrams in Figure 5.10, try to figure out an underlying availability
likelihood function w(x, y).
5.4.58∗ Consider the case of unknown w(x, y) and propose a reasonable scenario when its
estimation is possible.
5.5.1 Explain models of nonparametric regression with MAR and MNAR predictors. Hint:
Think about appropriate availability likelihoods.
5.5.2 Explain MNAR model (5.5.1). Give several possible examples.
5.5.3 Explain a connection between models (5.5.2) and (5.5.3).
5.5.4 Verify (5.5.4).
5.5.5 Can a complete-case approach imply a consistent estimation?
5.5.6 Prove every equality in (5.5.6).
5.5.7 What is the underlying idea of estimator (5.5.7)?
5.5.8 Is estimator (5.5.7) unbiased? Formulate assumptions needed for validity of your
assertion.
5.5.9∗ Evaluate variance of estimator (5.5.7).
5.5.10∗ Variance of estimator (5.5.7) depends on an underlying regression function. Suggest
a modified estimator that asymptotically has no such dependence and also has a smaller
variance.
5.5.11∗ Explore a design density that minimizes variance of estimator (5.5.7).
5.5.12 Explain the underlying experiment used in Figure 5.11.
5.5.13 Propose better values for parameters of the E-estimator used in Figure 5.11.
5.5.14 Repeat Figure 5.11 for other corner functions and analyze diagrams and estimators.
5.5.15 Explain the motivation of estimator (5.5.10).
5.5.16∗ Find the mean and variance of estimator (5.5.10).
5.5.17 Explain how E-estimator of the design density is constructed.
5.5.18 What is the simulation used in Figure 5.12?
5.5.19 Explain the difference between Figures 5.11 and 5.12.
5.5.20 Using available ISEs, explore how accuracy in estimation of the design density affects
the regression estimation. Hint: Use Figures 5.11 and 5.12.
5.5.21 Explain all parameters of E-estimator used in Figure 5.12.
5.6.1 Present examples of regression data where cases are missed. In other words, each case is either complete or contains no data.
5.6.2 Describe observations in a regression model with missed cases.
5.6.3 Why is it easier to ignore complete-case missing than missing responses?
5.6.4 Can the studied missing mechanism be MAR?
5.6.5 Prove relation (5.6.3).
5.6.6 Verify (5.6.5).
5.6.7 Explain the motivation of estimator (5.6.6).
5.6.8∗ Find the mean and variance of estimator (5.6.6).
5.6.9∗ Suggest a modification of estimator (5.6.6) with smaller variance. Hint: Recall Section
2.3.
5.6.10∗ Explain how the design density may be estimated.
5.6.11∗ Find the mean and variance of estimator (5.6.8). Formulate used assumptions.
5.6.12∗ Propose several possible scenarios when w(x, y) may be estimated.
5.6.13 What is the difference between simulations used in Figures 5.12 and 5.13?
5.6.14∗ Explain the difference, if any, between E-estimators used in Figures 5.12 and 5.13.
5.6.15 Repeat Figure 5.13 and, using available ISEs, explore how accuracy in estimation
of the design density affects the regression estimation.
5.6.16 Suggest better parameters of the E-estimator used in Figure 5.13.
5.6.17 Explain all parameters used in Figure 5.13.
5.6.18 Explore, both theoretically and using Figure 5.13, the effect of the availability like-
lihood on estimation of the Normal and the Bimodal regression functions.
5.8 Notes
Missing mechanisms, implying MNAR, are considered in many books including Little and
Rubin (2002), Tsiatis (2006), Molenberghs and Kenward (2007), Enders (2010), Molen-
berghs et al. (2014) and Raghunathan (2016).
5.1 Efficiency of the E-estimation for nonparametric density estimation is well known
and the first results are due to Efromovich and Pinsker (1982) and Efromovich (1985;
2009a; 2010a,c). Sequential estimation is discussed in Efromovich (1989, 1995b, 2015). Mul-
tivariate E-estimation methodology, including the case of anisotropic densities, is discussed
in Efromovich (1994b, 1999a, 2000b, 2002, 2011c).
5.2 Estimation of the density based on indirect observations and optimality of the E-
estimation is discussed in Efromovich (1994c). It is known from Efromovich (2001b) that
a plug-in E-estimation procedure improves the classical empirical cumulative distribution
function. Further, according to Efromovich (2004c), a similar result holds for the case of
biased data. These asymptotic results indicate that a similar assertion may be proved for
the case of destructive missing with an extra sample.
5.3 Thresholding as an adaptive method, used by the E-estimator, is discussed in Efro-
movich (1995a, 1996b, 2000a). An interesting extension of the considered setting is, following
Efromovich and Low (1994; 1996a,b), to consider estimation of functionals of the density.
Overall, it may be expected that a plug-in procedure will yield optimal estimation. Another
interesting area of research is the effect of measurement errors, see Efromovich (1997a) as
well as the related Efromovich (1997b) and Efromovich and Ganzburg (1999). Is the pro-
posed density E-estimation simultaneously optimal for the density derivatives? The result
of Efromovich (1998c) points in this direction. The expansion to multivariate densities,
following Efromovich (2000b, 2011c), is another interesting problem to consider.
5.4 For regression, a possibility of reducing the variance of a sample mean Fourier
estimator is discussed in Efromovich (1996a, 1999a, 2005a, 2007d, 2013a), Efromovich and
Pinsker (1996) and Efromovich and Samarov (2000). Two main approaches are subtracting
a consistent regression estimate from the response and mimicking a numerical integration.
These approaches yield efficient estimation of Fourier coefficients of a regression function.
Following Efromovich (1996a), it is possible to extend the considered setting to a larger
class of regression settings and prove asymptotic efficiency of the proposed E-estimation
methodology.
5.5 Sequential estimation, for the considered model of MNAR predictors, is an interest-
ing opportunity. Here results of Efromovich (2007d,e,i; 2008a,c; 2009c; 2012b) shed light on
possible asymptotic results. Another interesting extension of the considered setting is, fol-
lowing Efromovich and Low (1994; 1996a,b), to consider estimation of functionals. Overall,
it may be expected that a plug-in procedure will yield optimal estimation. It is a special and
interesting new topic to explore the case of dependent observations, and then get results
similar to Efromovich (1999c).
5.6 One of the interesting and new topics is developing the asymptotic theory of equiv-
alence between the regression with missing data and the model of filtering a signal from
the white Gaussian noise, see a discussion in Efromovich (1999a). Even more attractive
will be results on the limits of the equivalence and how the missing affects those limits,
see Efromovich and Samarov (1996), Efromovich and Low (1996a) and Efromovich (2003a).
The interesting topic of uncertainty analysis is discussed in Shaw (2017). Multivariate regres-
sion with different smoothness of the regression function in covariates, called anisotropic
regression, is another interesting setting to consider. While the problem is technically chal-
lenging, following Efromovich (2000b, 2002, 2005a) it is reasonable to conjecture that the
E-estimation methodology will yield efficient estimation. See also an interesting discussion
in Harrell (2015) where further references can be found. In a number of applications the scale
function depends on auxiliary covariates, and in this case a traditional regression estimator
may be improved as shown in Efromovich (2013a,b).
Chapter 6
Survival Analysis
Survival analysis traditionally focuses on the analysis of time duration until one or more
events happen and, more generally, positive-valued random variables. Classical examples are
the time to death in biological organisms, the time from diagnosis of a disease until death,
the time between administration of a vaccine and development of an infection, the time
from the start of treatment of a symptomatic disease and the suppression of symptoms, the
time to failure in mechanical systems, the length of stay in a hospital, duration of a strike,
the total amount paid by a health insurance policy, and the time to getting a high school diploma. This
topic may be called reliability theory or reliability analysis in engineering, duration analysis
or duration modeling in economics, and event history analysis in sociology. Survival analysis
attempts to answer questions such as: what is the proportion of a population which will
survive past a certain time? Of those that survive, at what rate will they die or fail? Can
multiple causes of death or failure be taken into account? How do particular circumstances
or characteristics increase or decrease the probability of survival?
To answer such questions, it is necessary to define the notion of “lifetime”. In the case of
biological survival, death is unambiguous, but for mechanical reliability, failure may not be
well defined, for there may well be mechanical systems in which failure is partial, a matter
of degree, or not otherwise localized in time. Even in biological problems, some events (for
example, heart attack or other organ failure) may have the same ambiguity. The theory
outlined below assumes well-defined events at specific times; other cases may be better
treated by models which explicitly account for ambiguous events.
While we are still dealing with a random variable X, that may be characterized by its
cumulative distribution function F X (x) (also often referred to as the lifetime distribution
function), because survival analysis is primarily interested in the time until one or more
events, the random variable is assumed to be nonnegative (it is supported on [0, ∞)) and
it is traditionally characterized by the survival function GX (x) := P(X > x) = 1 − F X (x).
That is, the survival function is the probability that the time of death is later than some
specified time x. The survival function is also called the survivor function or survivorship
function in problems of biological survival, and the reliability function in mechanical survival
problems. Usually one assumes G(0) = 1, although it could be less than 1 if there is the
possibility of immediate death or failure. The survival function must be nonincreasing:
GX (u) ≥ GX (t) if u ≥ t. This reflects the notion that survival to a later age is only possible
if all younger ages are attained. The survival function is usually assumed to approach zero
as age increases without bound, i.e., G(x) → 0 as x → ∞, although the limit could be
greater than zero if eternal life is possible. For instance, we could apply survival analysis to
a mixture of stable and unstable carbon isotopes; unstable isotopes would decay sooner or
later, but the stable isotopes would last indefinitely.
Typically, survival data are not fully and/or directly observed, but rather censored and
the most commonly encountered form is right censoring. For instance, suppose patients
are followed in a study for 12 weeks. A patient who does not experience the event of
interest for the duration of the study is said to be right censored. The survival time for this
person is considered to be at least as long as the duration of the study. Another example
of right censoring is when a person drops out of the study before the end of the study
observation time and did not experience the event. This person’s survival time is said to be
right censored, since we know that the event of interest did not happen while this person
was under observation. Censoring is an important issue in survival analysis, representing a
particular type of modified data.
Another important modification of survival data is truncation. For instance, suppose
that there is an ordinary deductible D in an insurance policy. The latter means that if a
loss occurs, then the amount paid by an insurance company is the loss minus the deductible.
Then a loss less than D may not be reported, and as a result that loss is not observable
(it is truncated). Let us stress that truncation is an illusive modification because there is
nothing in the truncated data that points to the truncation, and only our experience in understanding how the data were collected may inform us about truncation. Further, to
solve a statistical problem based on truncated observations, we need to know or request
corresponding observations of the truncating variable.
It also should be stressed that censoring and/or truncation may preclude us from consis-
tent estimation of the distribution of a random variable over its support, and then a feasible
choice of interval of estimation becomes a pivotal part of statistical methodology. In other
words, censoring or truncation may yield a destructive modification of data.
As it will be explained shortly, survival analysis of truncated and censored data re-
quires using new statistical approaches. The main and pivotal one is to begin analysis of
an underlying distribution not with the help of empirical cumulative distribution function
or empirical density estimate but with estimation of a hazard rate which plays a pivotal
role in survival analysis. For a continuous lifetime X, its hazard rate (also referred to as the
failure rate or the force of mortality) is defined as hX (x) := f X (x)/GX (x). Similarly to the
density or the cumulative distribution function, the hazard rate characterizes the random
variable. In other words, knowing hazard rate implies knowing the distribution. As we will
see shortly, using a hazard rate estimator as the first building block helps us in solving a
number of complicated problems of survival analysis including dealing with censored and
truncated data.
Estimation of the hazard rate has a number of its own challenges, and this explains
why it is hardly ever explored for the case of direct observations. The main one is that
the hazard rate is not integrable over its support (the integral is always infinite), and then
there is an issue of choosing an appropriate interval of estimation. Nonetheless, estimation
of the hazard rate (in place of more traditional characteristics like density or cumulative
distribution function) becomes more attractive for truncated and/or censored data. Further,
as it was mentioned earlier, truncation and/or censoring may preclude us from estimation
of the distribution over its support, and then the problem of choosing a feasible interval
of estimation becomes bona fide. Further, the hazard rate approach allows us to avoid
using product-limit estimators, like the renowned Kaplan–Meier estimator, which are the more
familiar alternative to the hazard rate approach. We are not using a product-limit approach
because it is special in its nature, not simple for statistical analysis, and cannot be easily
framed into our E-estimation methodology. At the same time, estimation of the hazard rate
is based on the sample mean methodology of our E-estimation.
The above-presented comments explain why the first four sections of the chapter are
devoted to estimation of the hazard rate for different types of modified data. We begin with
a classical case of direct observations considered in Section 6.1. It introduces the notion
of the hazard rate, explains that the hazard rate, similarly to the density or the cumula-
tive distribution function, characterizes a random variable, and explains how to construct
hazard rate E-estimator. Right censored (RC), left truncated (LT), and left truncated and
right censored (LTRC) data are considered in Sections 6.2-6.4, respectively. Sections 6.5-6.7
discuss estimation of the survival function and the density. Sections 6.8 and 6.9 are devoted
to nonparametric regression with censored data.
In what follows αX and βX denote the lower and upper bounds of the support of a
random variable X.
6.1 Hazard Rate Estimation for Direct Observations
The hazard rate of a lifetime X is defined as

h^X(x) := \frac{f^X(x)}{G^X(x)},   (6.1.1)

where we use our traditional notation f^X(x) for the probability density of X and G^X(x) := P(X > x) = \int_x^{∞} f^X(u)\,du = 1 − F^X(x) for the survival (survivor) function, and F^X(x) is
the cumulative distribution function of X. If one thinks about X as a time to an event-
of-interest, then hX (x)dx represents the instantaneous likelihood that the event occurs
within the interval (x, x + dx) given that the event has not occurred at time x. The hazard
rate quantifies the trajectory of imminent risk, and it may be referred to by other names
in different sciences, for instance as the failure rate in reliability theory and the force of
mortality in actuarial science and sociology.
Let us consider classical properties and examples of the hazard rate. The hazard rate,
similarly to the probability density or the survival function, characterizes the random vari-
able X. Namely, if the hazard rate is known, then the corresponding probability density
is

f^X(x) = h^X(x) e^{−\int_0^x h^X(v)\,dv} =: h^X(x) e^{−H^X(x)},   (6.1.2)

where H^X(x) := \int_0^x h^X(v)\,dv is the cumulative hazard function, and the survival function is

G^X(x) = e^{−\int_0^x h^X(v)\,dv} = e^{−H^X(x)}.   (6.1.3)
The preceding identity follows from integrating both sides of the equality

\frac{d \ln G^X(x)}{dx} = −h^X(x),   (6.1.4)

and then using G^X(0) = 1. Relation (6.1.2) follows from (6.1.1) and the verified (6.1.3).
A corollary from (6.1.3) is that for a random variable X supported on [0, b], b < ∞ we
get limx→b hX (x) = ∞. This property of the hazard rate for a bounded lifetime plays a
critical role in its estimation.
An important property of the hazard rate is that if V and U are independent lifetimes,
then the hazard rate of their minimum is the sum of the hazard rates, that is, hmin(U,V ) (x) =
hU (x) + hV (x). Indeed, we have
Gmin(U,V ) (x) = P(min(U, V ) > x) = P(U > x)P(V > x) = GU (x)GV (x), (6.1.5)
and this, together with (6.1.4), yield the assertion. This property allows us to create a wide
variety of shapes for hazard rates. Another important property, following from (6.1.3) and
G^X(∞) = 0, is that the hazard rate is not integrable on its support, that is, the hazard rate must satisfy \int_0^{∞} h^X(x)\,dx = ∞. This is the reason why hazard rate estimates are
constructed for a finite interval [a, a + b] ⊂ [0, ∞) with a = 0 being the most popular choice.
Further, similarly to the probability density, the hazard rate is nonnegative and has the same
smoothness as the corresponding density because the survival function is always smoother
184 SURVIVAL ANALYSIS
than the density. The last but not least remark is about the scale-location transformation
Z = (X − a)/b of the lifetime X. This transformation allows us to study Z on the standard
unit interval [0, 1] instead of exploring X over [a, a + b]. Then the following formulae become
useful,
GZ (z) = GX (a + bz), f Z (z) = bf X (a + bz), (6.1.6)
and
hZ (z) = bhX (a + bz), hX (x) = b−1 hZ ((x − a)/b). (6.1.7)
Among examples of hazard rates for variables supported on [0, ∞), the most “famous”
is the constant hazard rate of an exponential random variable X with the mean E{X} = λ.
Then the hazard rate is hX (x) = λ−1 I(x ≥ 0) and the cumulative hazard is H X (x) =
(x/λ)I(x ≥ 0). Indeed, the density is f X (x) = λ−1 e−x/λ I(x ≥ 0), the survival function is
GX (x) = e−x/λ , and this yields the constant hazard rate. The converse is also valid and
a constant hazard rate implies exponential distribution, the latter is not a major surprise
keeping in mind that the hazard rate characterizes a random variable. A constant hazard rate
has coined the name memoryless for exponential distribution. Another interesting example
is the Weibull distribution whose density is f^X(x; k, λ) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k} I(x ≥ 0), where k > 0 is the shape parameter and λ > 0 is the scale parameter. The mean is λΓ(1 + 1/k) with Γ(z) being the Gamma function, the survivor function is G^X(x; k, λ) = e^{−(x/λ)^k} I(x ≥ 0), the hazard rate function is h^X(x; k, λ) = (k/λ)(x/λ)^{k−1} I(x ≥ 0), and the
cumulative hazard is H X (x; k, λ) = (x/λ)k I(x ≥ 0). Note that if k < 1 then the hazard
rate is decreasing (it is often used to model “infant mortality”), if k > 1 then the hazard
rate is increasing (it is often used to model “aging” process), and if k = 1 then the Weibull
distribution becomes exponential (memoryless) with a constant hazard rate.
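The following minimal sketch in base R (not the book's software) illustrates these formulas by plotting the Weibull hazard rate h^X(x; k, λ) = (k/λ)(x/λ)^{k−1} in its decreasing (k < 1), constant (k = 1, the exponential case) and increasing (k > 1) regimes.

# Weibull hazard rates for three shape parameters; lambda is the scale parameter.
hazard.weibull <- function(x, k, lambda) (k / lambda) * (x / lambda)^(k - 1)

x <- seq(0.01, 3, by = 0.01)
plot(x, hazard.weibull(x, k = 0.5, lambda = 1), type = "l", ylim = c(0, 3),
     xlab = "x", ylab = "hazard rate")                   # decreasing: "infant mortality"
lines(x, hazard.weibull(x, k = 1, lambda = 1), lty = 2)  # constant: exponential (memoryless)
lines(x, hazard.weibull(x, k = 2, lambda = 1), lty = 3)  # increasing: "aging"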
Now we are in a position to formulate the aim of this section. Based on a sample
X1 , X2 , . . . , Xn of size n from the random variable of interest (lifetime) X, we would like to
estimate its hazard rate h^X(x) over an interval [a, a + b], a ≥ 0, b > 0. Because the hazard rate
is the density divided by the survival function, and the survival function is always smoother
than the density, the hazard rate can be estimated with the same rate as the corresponding
density. Furthermore, a natural approach is to use (6.1.1) and to estimate the hazard rate
by a ratio between estimates of the density and survival function. We will check the ratio-
estimate shortly in Figure 6.1, and now simply note that the aim is to understand how a
hazard rate may be estimated using our E-estimation methodology because for censored and
truncated data, direct estimation of the density or survival function becomes a challenging
problem.
To construct an E-estimator of the hazard rate, according to Section 2.2 we need to
suggest a sample mean or a plug-in sample mean estimator of Fourier coefficients of the hazard rate. Remember that on [0, 1] the cosine basis is {ϕ_0(x) = 1, ϕ_j(x) = \sqrt{2}\cos(\pi j x), j = 1, 2, . . .}. Similarly, on [a, a + b] the cosine basis is {ψ_j(x) := b^{−1/2} ϕ_j((x − a)/b), j = 0, 1, . . .}.
Note that the cosine basis on [a, a + b] “automatically” performs the above-discussed trans-
formation Z := (X − a)/b of X. As a result, we can either work with the transformed Z and
the cosine basis on [0, 1] or directly with X and the corresponding cosine basis on [a, a + b].
Here, to master our skills in using different bases, we are using the latter approach. Suppose
that G^X(a + b) > 0 and write for the jth Fourier coefficient of h^X(x), x ∈ [a, a + b],

θ_j := \int_a^{a+b} h^X(x) ψ_j(x)\,dx = \int_a^{a+b} \frac{f^X(x)}{G^X(x)} ψ_j(x)\,dx = E\{I(X ∈ [a, a + b])[G^X(X)]^{−1} ψ_j(X)\}.   (6.1.8)
Also, to shed light on the effect of rescaling a random variable, note that if κ_j := \int_0^1 h^Z(z) ϕ_j(z)\,dz is the jth Fourier coefficient of h^Z(z), z ∈ [0, 1], then

θ_j = b^{−1/2} κ_j.   (6.1.9)
Assume for a moment that the survival function GX (x) is known, then according to
(6.1.8) we may estimate θj by the sample mean estimator
θ̃_j := n^{−1} \sum_{l=1}^{n} \frac{ψ_j(X_l) I(X_l ∈ [a, a + b])}{G^X(X_l)}.   (6.1.10)
Further, using (1.3.4) and the assumed GX (a + b) > 0 we may conclude that the corre-
sponding coefficient of difficulty is
d(a, a + b) := \lim_{n→∞} \lim_{j→∞} n V(θ̃_j) = b^{−1} \int_a^{a+b} h^X(x)[G^X(x)]^{−1}\,dx.   (6.1.12)
The coefficient of difficulty explicitly shows how the interval of estimation, the hazard rate
and the survival function affect estimation of the hazard rate. As we will see shortly, the
coefficient of difficulty may point upon a feasible interval of estimation.
Of course θ̃j is an oracle-estimator which is based on an unknown survival function (note
that if we know GX , then we also know the hazard rate hX ). The purpose of introducing
an oracle estimator is two-fold. First to create a benchmark to compare with, and second to
be an inspiration for its mimicking by a data-driven estimator, which is typically a plug-in
oracle estimator. Further, in some cases the mimicking may be so good that asymptotic
variances of the estimator and oracle estimator coincide.
Let us suggest a good estimator of the survival function that may be plugged in the denominator of (6.1.10). Because X is a continuous random variable, the survival function can be written as the expectation G^X(x) = E\{I(X ≥ x)\}, which yields the sample mean estimator

Ĝ^X(x) := n^{−1} \sum_{l=1}^{n} I(X_l ≥ x).   (6.1.14)

Note that \min_{k∈\{1,...,n\}} Ĝ^X(X_k) = n^{−1} > 0 and hence we can use the reciprocal of Ĝ^X(X_l).
The sample mean estimator (6.1.14) may be referred to as an empirical survival function.
We may plug (6.1.14) in (6.1.10) and get the following data-driven estimator of Fourier
coefficients θj of the hazard rate hX (x), x ∈ [a, a + b],
θ̂_j := n^{−1} \sum_{l=1}^{n} \frac{ψ_j(X_l) I(X_l ∈ [a, a + b])}{Ĝ^X(X_l)}.   (6.1.15)
This is the proposed Fourier estimator, and it is possible to show that its coefficient of
difficulty is identical to (6.1.12). We may conclude that the empirical survival function is a
perfect estimator for our purpose to mimic oracle-estimator (6.1.10). Further, the asymptotic
theory shows that no other Fourier estimator has a smaller coefficient of difficulty, and hence
the proposed Fourier estimator (6.1.15) is efficient. In its turn this result yields asymptotic
efficiency of a corresponding hazard rate E-estimator ĥX (x).
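The following minimal sketch in base R (not the book's E-estimator; names are illustrative) computes the empirical survival function (6.1.14) and the plug-in Fourier estimator (6.1.15) from a sample of direct observations.

# Empirical survival function (6.1.14): for each point v, the proportion of X_l >= v.
G.hat <- function(v, X) sapply(v, function(t) mean(X >= t))

# Plug-in Fourier estimator (6.1.15) of the jth coefficient of the hazard rate on [a, a + b].
theta.hat <- function(j, X, a, b) {
  psi.j <- function(x) {                         # cosine basis psi_j on [a, a + b]
    z <- (x - a) / b
    if (j == 0) rep(1 / sqrt(b), length(x)) else sqrt(2 / b) * cos(pi * j * z)
  }
  mean(psi.j(X) * (X >= a & X <= a + b) / G.hat(X, X))
}

# Example with simulated (illustrative) data:
# X <- rbeta(400, 3, 2); theta.hat(1, X, a = 0, b = 0.6)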
Let us present an example where the coefficient of difficulty is easily calculated. Consider
X with exponential distribution and E{X} = λ. Then hX (x) = 1/λ, GX (x) = e−x/λ , and
hence

d(a, a + b) = b^{−1} \int_a^{a+b} λ^{−1} e^{x/λ}\,dx = b^{−1} e^{a/λ}[e^{b/λ} − 1].   (6.1.16)
Note that the coefficient of difficulty increases to infinity exponentially in b. This is what
makes estimation of the hazard rate and choosing a feasible interval of estimation so chal-
lenging. On the other hand, it is not difficult to suggest a plug-in sample mean estimator
of the coefficient of difficulty,
d̂(a, a + b) := n^{−1} b^{−1} \sum_{l=1}^{n} I(X_l ∈ [a, a + b])[Ĝ^X(X_l)]^{−2}.   (6.1.17)
To realize that this is indeed a plug-in sample mean estimator, note that (6.1.12) can be
rewritten as
d(a, a + b) = b^{−1} \int_a^{a+b} f^X(x)[G^X(x)]^{−2}\,dx = b^{−1} E\{I(X ∈ [a, a + b])[G^X(X)]^{−2}\}.   (6.1.18)
This is what we wished to show. The estimator (6.1.17) may be used for choosing a feasible
interval of estimation.
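The following minimal sketch in base R (names are illustrative) computes the plug-in estimator (6.1.17); comparing its values over candidate intervals is one practical way to choose a feasible interval of estimation.

# Plug-in estimator (6.1.17) of the coefficient of difficulty on [a, a + b].
d.hat <- function(X, a, b) {
  G.hat <- sapply(X, function(t) mean(X >= t))   # empirical survival at each X_l
  mean((X >= a & X <= a + b) / G.hat^2) / b
}

# Example: compare the candidate right endpoints 0.6 and 0.7 as in Figure 6.1
# X <- rbeta(400, 3, 2); c(d.hat(X, 0, 0.6), d.hat(X, 0, 0.7))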
Figure 6.1 helps us to understand the problem of estimation of the hazard rate via
analysis of a simulated sample of size n = 400 from the Bimodal distribution. The cap-
tion of Figure 6.1 explains its diagrams. The left-top diagram exhibits the reciprocal of the
survival function GX (x) by the solid line and its estimate 1/ĜX (Xl ) by crosses. Note that
x-coordinates of crosses indicate observed realizations of X, and we may note that while
the support is [0, 1], just a few of the observations are larger than 0.85. We also observe a
sharply increasing right tail of 1/Ĝ(x) for x > 0.7, and this indicates a reasonable upper
bound for intervals of estimation of the hazard rate.
Our next step is to look at two density E-estimates exhibited in the right-top diagram.
The solid line is the underlying Bimodal density. The dotted line is the E-estimate con-
structed for interval [a, a + b] = [0, 0.6] and it is based on N = 211 observations from this
interval. The dashed line is the E-estimate based on all observations and it is constructed
for the interval [0, 1] (the support). The subtitle shows ISEs of the two estimates. Because
the sample size is relatively large, both estimates are good and have relatively small ISEs.
These estimates will be used by the ratio-estimator of the hazard rate.
Diagrams in the second row show us the same estimate d̂(a, x) of the coefficients of difficulty for different intervals. This is done because the estimate has a sharply increasing right tail. In the left diagram we observe the logarithm of d̂(0, X_l), l = 1, . . . , n, while
the right diagram allows us to zoom in on the coefficient of difficulty by considering only
Xl ∈ [a1, a1+b1]. Similarly to the left-top diagram, we conclude that interval [0, 0.6] may be
a good choice for estimation of the hazard rate, and we may try [0, 0.7] as a more challenging
one.
Diagrams in the third (from the top) row exhibit performance of the ratio-estimator
ȟ^X(x) = \frac{f̂^X(x)}{Ĝ^X(x)},   (6.1.19)
where fˆX is an E-estimate of the density. The ratio-estimate is a natural plug-in estimate
that should perform well as long as the empirical survival function is not too small. The
left diagram shows us the ratio-estimate based on all observations, that is, it is the ratio of
the dashed line in the right-top diagram over the estimated survival function shown in the
[Figure 6.1 appears here. Diagram subtitles include "ISEab = 0.009, ISE01 = 0.016", "log(d̂(0, x))", and "d̂(a1, x) on [a1, a1 + b1], a1 = 0, b1 = 0.7".]
Figure 6.1 Estimation of the hazard rate based on direct observations. The left-top diagram exhibits
reciprocals of the underlying survival function (the solid line) and the empirical one (the crosses
at n observed values). The right-top diagram shows the underlying density (the solid line), the E-
estimate over [a, a+b] based on observations from this interval (the dotted line), and the E-estimate
over the support based on all observations (the dashed line). The second row of diagrams exhibits estimates d̂(a, X_l) over different intervals. In the third row of diagrams, the left diagram shows the
hazard rate (the solid line) and its ratio-estimate on [a, a + B], based on all observations, by the
dashed line, while the right diagram shows ratio-estimates based on the density estimates shown in
the right-top diagram. The left-bottom diagram shows the underlying hazard rate (the solid line)
and the E-estimate on [a, a + B] (the dashed line); it also shows the pointwise (the dotted lines)
and simultaneous (the dot-dashed lines) confidence bands with confidence level 1 − α, α = 0.05.
The right-bottom diagram is similar to the left one, only here estimation over interval [a, a + b] is
considered. [n = 400, corn = 3, a = 0, b = 0.6, B = 0.7, a1= 0, b1 = 0.7, alpha = 0.05, cJ0 = 4,
cJ1 = 0.5, cTH = 4]
left-top diagram. The estimate is shown only over the interval [a, a + B] = [0, 0.7] because
the reciprocal of the estimated survival function is too large beyond this interval. The solid
line shows the underlying hazard rate (it increases extremely fast beyond the point 0.7).
For this particular simulation the ISE=0.022 is truly impressive. The right diagram shows
us two ratio-estimates for the smaller interval [a, a + b] = [0, 0.6]; the estimates correspond
188 SURVIVAL ANALYSIS
to the two density estimates shown in the top-right diagram. Note that the ratio-estimate
based on the density estimate for the interval [a, a + b] (the dotted line) is worse than
the ratio-estimate based on the density which uses all observations (the dashed line); also
compare the corresponding ISEs in the subtitle. This is clearly due to the boundary effect;
on the other hand, the dotted curve better fits the solid line on the inner interval [0, 0.45].
It will be explained in Notes at the end of the chapter how to deal with severe boundary
effects.
The bottom row shows E-estimates of the underlying hazard rate for different intervals
of estimation. The estimates are complemented by pointwise and simultaneous confidence
variance-bands introduced in Section 2.6. The E-estimate over the larger interval [a, a+B] =
[0, 0.7] is bad. We have predicted the possibility of such an outcome based on the analysis of
the reciprocal of the survival function and the coefficient of difficulty. Estimation over the
smaller interval [a, a + b] = [0, 0.6], shown in the right-bottom diagram, is much better.
Further, the E-estimate is better than the corresponding ratio-estimate (the dotted line
in the right diagram of the third row) based on the same N = 211 observations from this
interval. On the other hand, due to the boundary effect, the E-estimate performs worse than
the ratio-estimate based on all observations. Further, note that confidence bands present
another possibility to choose a feasible interval of estimation.
It is highly advisable to repeat Figure 6.1 with different corner distributions, sample
sizes and intervals to get used to the notion of hazard rate and its estimation. Hazard
rate is rarely studied in standard probability and statistical courses, but it is a pivotal
characteristic in understanding nonparametric E-estimation in survival analysis.
Differentiation of (6.2.3) with respect to v yields the following formula for the joint mixed
density of the observed pair,
f^{V,∆}(v, δ) = [f^X(v) G^C(v)]^δ [f^C(v) G^X(v)]^{1−δ} I(v ≥ 0, δ ∈ {0, 1}).   (6.2.4)
The formula exhibits a remarkable symmetry with respect to δ which reflects the fact
that while C censors X on the right, we may also say that the random variable X also
censors C on the right whenever C is the lifetime of interest. In other words, the problem
of right censoring is symmetric with respect to the two underlying random variables X and
C. This is an important observation because if a data-driven estimator for distribution of
X is proposed, it can also be used for estimation of the distribution of C by changing ∆ to 1 − ∆.
Formula (6.2.4) implies that available observations of X (when ∆ = 1) are biased with
the biasing function being the survival function G^C(x) of the censoring random variable
(recall definitions of biased data and a biasing function in Section 3.1). Note that the
biasing function is decreasing in v because larger values of X are more likely to be censored.
As we know, in general the biasing function should be known for a consistent estimation of
an underlying distribution. As we will see shortly, because here we observe a sample from
a pair (V, ∆) of random variables, we can estimate the biasing function G^C(x) and hence estimate the distribution of X. Recall that knowing a distribution means knowing any
characteristic of a random variable like its cumulative distribution function, density, hazard
rate, survival function, etc. We will see shortly in Section 6.5 that estimation of the density
of X is a two-step procedure. The reason for that is that censored data are biased and hence
we first estimate the biasing function GC , which is a complicated problem on its own, and
only then may we proceed to estimation of the density of X.
Surprisingly, there is no need to estimate the biasing function if the hazard rate is the
function of interest (the estimand). This is a pivotal statistical fact to know about survival
data where the cumulative distribution function and/or probability density are no longer
natural characteristics of a distribution to begin estimation with. The latter is an interesting
consequence of data modification caused by censoring.
Let us explain why the hazard rate h^X(x) is a natural estimand in survival analysis. Using (6.2.2) and (6.2.4) we may write,

h^X(x) = \frac{f^X(x)}{G^X(x)} = \frac{f^X(x) G^C(x)}{G^X(x) G^C(x)} = \frac{f^{V,∆}(x, 1)}{G^V(x)}.   (6.2.5)
This is a pivotal formula because (6.2.5) expresses the hazard rate of right censored X via
the density and survival function of directly observed variables.
Suppose that we observe a sample of size n from (V, ∆). Denote by θ_j := \int_a^{a+b} h^X(x) ψ_j(x)\,dx the jth Fourier coefficient of the hazard rate h^X(x), x ∈ [a, a + b].
Here and in what follows, similarly to Section 6.1, the cosine basis {ψj (v)} on an interval
[a, a + b] is used and recall our discussion of why a hazard rate is estimated over an interval.
Assume that a + b < βV (recall that βV denotes the upper bound of the support of V ),
and note that using (6.2.5) we can write down a Fourier coefficient as an expectation of a
function in V and ∆,
θ_j = E\Big\{\frac{∆ I(V ∈ [a, a + b]) ψ_j(V)}{G^V(V)}\Big\}.   (6.2.6)
This immediately yields the following plug-in sample mean Fourier estimator,
θ̂_j := n^{−1} \sum_{l=1}^{n} \frac{∆_l ψ_j(V_l) I(V_l ∈ [a, a + b])}{Ĝ^V(V_l)},   (6.2.7)

where

Ĝ^V(v) := n^{−1} \sum_{l=1}^{n} I(V_l ≥ v)   (6.2.8)
is the empirical survival function of V. The corresponding coefficient of difficulty is

d(a, a + b) = b^{−1} \int_a^{a+b} \frac{f^{V,∆=1}(v)}{[G^V(v)]^2}\,dv = b^{−1} E\Big\{\frac{∆ I(V ∈ [a, a + b])}{[G^V(V)]^2}\Big\}   (6.2.9)

= b^{−1} \int_a^{a+b} \frac{h^X(v)}{G^V(v)}\,dv = b^{−1} \int_a^{a+b} \frac{h^X(v)}{G^X(v) G^C(v)}\,dv.   (6.2.10)
Formula (6.2.10) clearly shows how an underlying hazard rate and a censoring variable affect
accuracy of estimation. Note that what is new here, with respect to formula (6.1.12) for the
case of direct observations, is an extra survival function GC (v) in the denominator. This
is a mathematical description of the negative effect of censoring on estimation. Because
GC (v) ≤ 1 and the survival function decreases in v, that effect may be dramatic. Further, if
βC < βX then the censoring implies a destructive modification of data when no consistent
estimation of the distribution of X is possible.
Formula (6.2.9) implies that the coefficient of difficulty may be estimated by a plug-in
sample mean estimator
d̂(a, a + b) := n^{−1} b^{−1} \sum_{l=1}^{n} ∆_l I(V_l ∈ [a, a + b])[Ĝ^V(V_l)]^{−2}.   (6.2.11)
[Figure 6.2 appears here. Diagram subtitles include "ISE = 0.43" and "ISE = 0.012".]
Figure 6.2 Estimation of the hazard rate based on right censored observations. The top diagram shows n = 300 observations of (V, ∆), among those N := \sum_{l=1}^{n} ∆_l = 192 uncensored ones. The second from the top diagram shows by crosses values of 1/Ĝ(V_l) and by circles values of d̂(a1, V_l) for uncensored V_l ∈ [a1, a1 + b1]; the corresponding scales are on the left and right vertical axes. Note how fast the right tails increase. The third from the top diagram shows the E-estimate of the hazard rate based on observations V_l ∈ [a, a + B], while in the bottom diagram the E-estimate is based on V_l ∈ [a, a + b]. A corresponding N shows the number of uncensored observations within an interval. The underlying hazard rate and its E-estimate are shown by the solid and dashed lines, the pointwise and simultaneous 1 − α confidence bands are shown by dotted and dot-dashed lines. {Censoring distribution is either the default Uniform(0, uC) with uC = 1.5, or Exponential(λC) with the default λC = 1.5 where λC is the mean. To choose an exponential censoring, set cens = "Expon". Notation for arguments controlling the above-mentioned parameters is evident.} [n = 300, corn = 3, cens = "Unif", uC = 1.5, lambdaC = 1.5, a = 0, b = 0.55, B = 0.75, a1 = 0, b1 = 0.75, alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
Note that what is new in the estimator, with respect to the case of direct observations of Section
6.1, is that ĜX is replaced by ĜV . Because GV = GX for the case of direct observations,
this change is understandable.
The proposed estimator of Fourier coefficients allows us to construct the E-estimator of
the hazard rate.
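The following minimal sketch in base R (not the book's E-estimator; names are illustrative) computes the Fourier estimator (6.2.7) for right censored observations using the empirical survival function (6.2.8) of the observed V.

# Fourier estimator (6.2.7) for right censored data (V, Delta),
# where V = min(X, C) and Delta = I(X <= C).
theta.hat.rc <- function(j, V, Delta, a, b) {
  psi.j <- function(x) {                       # cosine basis on [a, a + b]
    z <- (x - a) / b
    if (j == 0) rep(1 / sqrt(b), length(x)) else sqrt(2 / b) * cos(pi * j * z)
  }
  G.V <- sapply(V, function(t) mean(V >= t))   # empirical survival (6.2.8), always >= 1/n
  mean(Delta * psi.j(V) * (V >= a & V <= a + b) / G.V)
}

# Example with simulated data and the Uniform(0, 1.5) censoring of Figure 6.2
# (the lifetime below is only an illustrative choice):
# X <- rbeta(300, 3, 2); C <- runif(300, 0, 1.5)
# V <- pmin(X, C); Delta <- as.numeric(X <= C)
# theta.hat.rc(1, V, Delta, a = 0, b = 0.55)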
Figure 6.2 sheds light on right censoring, performance of the proposed E-estimator, and
the methodology of choosing a feasible interval of estimation explained in Section 6.1. In
the particular simulation X is the Bimodal and C is the Uniform(0,1.5). The top diagram
shows the sample from (V, ∆) = (min(X, C), I(X ≤ C)). The sample size is n = 300 and N := \sum_{l=1}^{n} ∆_l = 192 is the number of uncensored realizations of X. The latter indicates
a severe censoring. The second from the top diagram allows us to evaluate complexity of
the problem at hand and to choose a reasonable interval of estimation. Crosses (and the
corresponding scale is shown on the left vertical axis) show values of 1/ĜV (Vl ) for N = 192
uncensored observations over interval [a1, a1 + b1] = [0, 0.75], while circles show values
of the estimated coefficient of difficulty (its scale is on the right-vertical axis). Here we
have an interesting outcome that the two functions look similar, and this is due to using
different scales and rapidly increasing right tails. The interested reader may decrease b1
and get a better visualization of the functions for moderate values of v. Analysis of the
two estimates indicates that it is better to avoid estimation of the hazard rate beyond
v = 0.75, and probably a conservative approach is to consider intervals with the upper
bound smaller than 0.6. Let us check this conclusion with the help of corresponding hazard
rate E-estimates and confidence bands. The two bottom diagrams allow us to do this, here
the chosen intervals are [a, a + B] = [0, 0.75] and [a, a + b] = [0, 0.55], respectively. At
first glance, the E-estimate for the larger interval (the third from the top diagram) and
its confidence bands may look attractive, but this conclusion is misleading. Indeed, please
look at the scale of the bands and realize that the bands are huge! Further, the large bands
“hide” the large deviation of the estimate (the dashed line) from the underlying hazard rate
(the solid line). Further, the ISE = 0.43, shown in the subtitle, tells us that the estimate is
far from the underlying hazard rate. The reason for this is the large right tails of 1/Ĝ^V(v) and d̂(0, v) that may be observed in the second from the top diagram. The outcome is much
better for the smaller interval of estimation considered in the bottom diagram, and this is
despite the smaller number N = 106 of available uncensored observations. Also note that
the ISE is dramatically smaller.
It is fair to conclude that: (i) the methodology of choosing a feasible interval of estimation, proposed in Section 6.1 for the case of direct observations, is robust and can also be recommended for censored data; (ii) the E-estimator performs well even for the case of severe censoring.
Let us finish this section by a remark about left censoring. Under a left censoring,
the variable of interest X is censored on the left by a censoring variable C if available
observations are from the pair (V, ∆) := (max(X, C), I(X ≥ C)). For instance, when a
physician asks a patient about the onset of a particular disease, the answer may be either a
specific date or that the onset occurred prior to some specific date. In this case the variable
of interest is left censored. Left censoring may be “translated” into a right censoring. To
do this, choose a value A that is not less than all available left-censored observations and
then consider new observations that are A minus left-censored observations. Then the new
observations become right-censored. The latter is the reason why it is sufficient to learn
about estimation for right-censored data.
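The following minimal sketch in base R illustrates this translation; the constant A and the function name are illustrative.

# Translate left censored data (V, Delta) = (max(X, C), I(X >= C)) into right
# censored data by subtracting the observations from a constant A that is not
# smaller than any of them.
left.to.right <- function(V, Delta, A = max(V)) {
  list(V = A - V,        # A - V = min(A - X, A - C), i.e., right censoring
       Delta = Delta,    # A - X <= A - C exactly when X >= C
       A = A)
}
# An estimate obtained for the distribution of A - X is then mapped back to the
# distribution of X by reversing the transformation.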
= p^{−1} P(T^* ≤ t, T^* ≤ X^* ≤ x) = p^{−1} \int_0^t f^{T^*}(v) \Big[\int_v^x f^{X^*}(u)\,du\Big] dv,   0 ≤ t ≤ x.   (6.3.3)

Here p is defined in (6.3.1). Then, taking partial derivatives with respect to x and t we get the following expression for the bivariate density,

f^{T,X}(t, x) = p^{−1} f^{T^*}(t) f^{X^*}(x) I(0 ≤ t ≤ x < ∞).   (6.3.4)

This allows us to obtain, via integration, a formula for the marginal density of X,

f^X(x) = f^{X^*}(x)[p^{−1} F^{T^*}(x)].   (6.3.5)

In its turn, for values of x such that F^{T^*}(x) > 0, (6.3.5) yields a formula for the density of the random variable of interest X^*,

f^{X^*}(x) = \frac{f^X(x)}{p^{−1} F^{T^*}(x)}   whenever F^{T^*}(x) > 0.   (6.3.6)
Note that for values x such that F^{T^*}(x) = 0 we cannot restore the density f^{X^*}(x)
because all observations of X ∗ with such values are truncated. In other words, using our
notation αZ for a lower bound of the support of a continuous random variable Z, for
consistent estimation of the distribution of X ∗ we need to assume that
αX ∗ ≥ αT ∗ . (6.3.7)
Another useful remark is that formula (6.3.5) mathematically describes the biasing mechanism caused by the left truncation, and according to (3.1.2), the biasing function is F^{T^*}(x).
Note that the biasing function is increasing in x.
Let us also introduce a function, which is a probability, that plays a pivotal role in the
analysis of truncated data,
g(x) := P(T ≤ x ≤ X) = P(T^* ≤ x ≤ X^* | T^* ≤ X^*) = p^{−1} F^{T^*}(x) G^{X^*}(x).   (6.3.8)
In the last equality we used the assumed independence between T ∗ and X ∗ . Note that g(x)
is a functional of the distributions of available observations and hence can be estimated.
The latter will be used shortly.
Now we have all necessary formulas and notations to explain the method of constructing
an E-estimator of the hazard rate of X ∗ .
Using (6.3.6) and (6.3.8), we conclude that the hazard rate of X ∗ can be written as
$$h^{X^*}(x) := \frac{f^{X^*}(x)}{G^{X^*}(x)} = \frac{f^X(x)}{P(T \le x \le X)} = \frac{f^X(x)}{g(x)} \quad \text{whenever } F^{T^*}(x)G^{X^*}(x) > 0. \qquad (6.3.9)$$
Note that $h^{X^*}(x)$ is expressed via distributions of available random variables (X, T) and hence can be estimated. Further, note that the restriction $F^{T^*}(x)G^{X^*}(x) > 0$ is equivalent to $g(x) > 0$.
Formula (6.3.9) for the hazard rate is the key for its estimation. Indeed, consider esti-
mation of the hazard rate over an interval [a, a + b] such that g(x) > 0 over this interval.
Similarly to the previous sections, $\{\psi_j(x)\}$ is the cosine basis on [a, a + b]. The proposed sample mean estimator of the Fourier coefficients $\theta_j := \int_a^{a+b} \psi_j(x)h^{X^*}(x)dx$ is
$$\hat\theta_j := n^{-1}\sum_{l=1}^n \frac{\psi_j(X_l)\,I(X_l \in [a, a+b])}{\hat g(X_l)}, \qquad (6.3.10)$$
where
$$\hat g(x) := n^{-1}\sum_{l=1}^n I(T_l \le x \le X_l) \qquad (6.3.11)$$
is the sample mean estimator of function g(x) defined in (6.3.8). Note that ĝ(Xl ) ≥ n−1
and hence this estimator can be used in the denominator of (6.3.10).
The Fourier estimator (6.3.10) yields the corresponding E-estimator of the hazard rate.
Further, the corresponding coefficient of difficulty is
$$d(a, b) = b^{-1}\int_a^{a+b} \frac{h^{X^*}(x)}{g(x)}\,dx = b^{-1}\,E\{I(a \le X \le a+b)/g^2(X)\}. \qquad (6.3.12)$$
Further, the plug-in sample mean estimator of the coefficient of difficulty is
$$\hat d(a, b) := n^{-1} b^{-1}\sum_{l=1}^n I(a \le X_l \le a+b)/\hat g^2(X_l). \qquad (6.3.13)$$
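A minimal R sketch of the estimators (6.3.10), (6.3.11) and (6.3.13) may look as follows; the function name, the particular form of the cosine basis, and the choice of the cutoff J are illustrative assumptions rather than the book's software:

# Sample mean Fourier estimators for the hazard rate of X* under left truncation.
# Inputs: vectors T and X of observed (T_l, X_l), interval [a, a+b], number of terms J.
lt.hazard.fourier <- function(T, X, a, b, J) {
  g.hat <- sapply(X, function(x) mean(T <= x & x <= X))     # (6.3.11) at the points X_l
  psi <- function(j, x)                                     # cosine basis on [a, a+b]
    if (j == 1) rep(1/sqrt(b), length(x)) else sqrt(2/b)*cos(pi*(j - 1)*(x - a)/b)
  in.int <- (X >= a & X <= a + b)
  theta.hat <- sapply(1:J, function(j) mean(psi(j, X)*in.int/g.hat))   # (6.3.10)
  d.hat <- mean(in.int/g.hat^2)/b                                      # (6.3.13)
  list(theta = theta.hat, d = d.hat)
}

The vector of estimated Fourier coefficients is then plugged into the E-estimator of Section 6.1.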
[Figure 6.3: panel subtitles report ISE = 0.86 and ISE = 0.1.]
Figure 6.3 Estimation of the hazard rate based on left truncated observations. The top diagram
shows a sample of left truncated observations, the sample size n = 300. In the simulation X ∗ is
the Bimodal and T ∗ is Uniform(0, 0.5). Second from the top diagram shows by crosses the estimate $1/\hat g(X_l)$ and by circles the estimate $\hat d(a1, X_l)$, $X_l \in [a1, a1 + b1]$. Note the different scales used for
these two estimates that are shown correspondingly on the left and right vertical axes. The third
from the top diagram shows E-estimate of the hazard rate on interval [A, A + B], while in the
bottom diagram the E-estimate is for interval [a, a + b]. N shows the number of observations within
a considered interval. The underlying hazard rate and its E-estimate are shown by the solid and
dashed lines, the pointwise and simultaneous 1 − α confidence bands are shown by dotted and dot-
dashed lines, respectively. {Distribution of T is either the default Uniform(0, uT ) with uT = 0.5,
or Exponential(λT ) with the default λT = 0.3 where λT is the mean. Set trunc = "Expon" to choose the exponential truncation. Parameter α is controlled by alpha.} [n = 300, corn = 3, trunc = "Unif", uT = 0.5, lambdaT = 0.3, a = 0, b = 0.55, A = 0.2, B = 0.75, a1 = 0.2, b1 = 0.45,
alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
Further, while we do not know the total number of hidden “failures”, the negative binomial
distribution sheds some light on the hidden number Nf of “failures”, and in particular the
mean and variance of Nf are calculated as E(Nf ) = n(1−p)p−1 and Var(Nf ) = n(1−p)p−2 .
Now our aim is to understand how the distribution of LTRC observations is related to the
distribution of the hidden realizations of the triplet (T ∗ , X ∗ , C ∗ ). Set V ∗ := min(X ∗ , C ∗ ).
Suppose that a hidden realization of the triplet is observed, meaning that it is given that
T ∗ ≤ V ∗ and we observe (T, V, ∆) where T = T ∗ , V = V ∗ and ∆ := ∆∗ := I(X ∗ ≤ C ∗ ).
Recall that T ∗ is the underlying truncation random variable, X ∗ is the lifetime (random
variable) of interest, C ∗ is the underlying censoring random variable, and p, defined in
(6.4.1), is the probability of observing the hidden triplet (T ∗ , V ∗ , ∆∗ ). For the joint mixed
distribution function of the observed triplet of random variables we can write,
$$F^{T,V,\Delta}(t, v, \delta) := P(T \le t, V \le v, \Delta \le \delta) = F^{T^*,V^*,\Delta^* \mid T^* \le V^*}(t, v, \delta). \qquad (6.4.2)$$
Note that $p = \int_0^\infty f^{T^*}(t)G^{X^*}(t)G^{C^*}(t)dt$, and using (6.4.2) we can write for any $0 \le t \le v < \infty$ and $\delta \in \{0, 1\}$,
$$P(T \le t, V \le v, \Delta = \delta) = p^{-1}\int_0^t f^{T^*}(\tau)\Big[\int_\tau^v f^{X^*}(x)G^{C^*}(x)dx\Big]^{\delta}\Big[\int_\tau^v f^{C^*}(x)G^{X^*}(x)dx\Big]^{1-\delta}d\tau. \qquad (6.4.3)$$
Taking partial derivatives of both sides in (6.4.3) with respect to t and v yields the following
mixed joint probability density,
$$f^{T,V,\Delta}(t, v, \delta) = p^{-1} f^{T^*}(t) I(t \le v)\big[f^{X^*}(v)G^{C^*}(v)\big]^{\delta}\big[f^{C^*}(v)G^{X^*}(v)\big]^{1-\delta}. \qquad (6.4.4)$$
Note that the density is "symmetric" with respect to $X^*$ and $C^*$ whenever δ is replaced by 1 − δ; we have already observed this fact for censored data.
Formula (6.4.4) yields the following marginal joint density
$$f^{V,\Delta}(v, 1) = p^{-1} f^{X^*}(v)G^{C^*}(v)F^{T^*}(v) = h^{X^*}(v)\big[p^{-1} G^{C^*}(v)F^{T^*}(v)G^{X^*}(v)\big]. \qquad (6.4.5)$$
In the last equality we used the definition of the hazard rate $h^{X^*}(x) := f^{X^*}(x)/G^{X^*}(x)$.
The first equality in (6.4.5) yields a nice formula for the density of the random variable
of interest,
$$f^{X^*}(x) = \frac{f^{V,\Delta}(x, 1)}{p^{-1}G^{C^*}(x)F^{T^*}(x)} \quad \text{whenever } G^{C^*}(x)F^{T^*}(x) > 0. \qquad (6.4.6)$$
Similarly to (6.3.8), introduce the probability
$$g(x) := P(T \le x \le V) = P(T^* \le x \le V^* \mid T^* \le V^*) = \big[p^{-1}G^{C^*}(x)F^{T^*}(x)\big]G^{X^*}(x), \quad x \in [0, \infty). \qquad (6.4.7)$$
Note that the right side of (6.4.7) contains in the square brackets the denominator of
the ratio in (6.4.6). This fact, together with (6.4.6), yields two important formulae. The
first one is that the underlying density of X ∗ may be written as
$$f^{X^*}(x) = \frac{f^{V,\Delta}(x, 1)\,G^{X^*}(x)}{P(T \le x \le V)} \quad \text{whenever } G^{C^*}(x)F^{T^*}(x) > 0. \qquad (6.4.8)$$
Next, if we divide both sides of the last equality by the survival function $G^{X^*}(x)$, then we get the following expression for the hazard rate,
$$h^{X^*}(x) = \frac{f^{V,\Delta}(x, 1)}{P(T \le x \le V)} \quad \text{whenever } G^{C^*}(x)F^{T^*}(x)G^{X^*}(x) > 0, \qquad (6.4.9)$$
or equivalently the formula holds whenever g(x) > 0 where g(x) is defined in (6.4.7).
The right side of equality (6.4.9) includes characteristics of observed (and not hidden)
variables that may be estimated, and this is why we can estimate the hazard rate for values
of x satisfying the inequality in (6.4.9). On the other hand, in (6.4.8) the right side of the equality depends on the survival function $G^{X^*}$ of an underlying lifetime, and this is why the problem of density estimation for LTRC data is more involved and will be considered later in Section 6.7.
Now we are ready to propose an E-estimator of the hazard rate based on LTRC data. Let
[a, a+b] be an interval of estimation where a and b are positive and finite constants. As usual,
we begin with a sample mean estimator for the Fourier coefficients $\theta_j := \int_a^{a+b} h^{X^*}(x)\psi_j(x)dx$ of the hazard rate $h^{X^*}(x)$, $x \in [a, a+b]$. Recall that $\{\psi_j(x)\}$ is the cosine basis on [a, a + b] introduced in Section 6.1. Using (6.4.9), together with notation (6.4.7), we may propose a plug-in sample mean estimator of the Fourier coefficients,
$$\hat\theta_j := n^{-1}\sum_{l=1}^n \frac{\psi_j(V_l)}{\hat g(V_l)}\,\Delta_l I(V_l \in [a, a+b]), \qquad (6.4.10)$$
where
$$\hat g(v) := n^{-1}\sum_{l=1}^n I(T_l \le v \le V_l). \qquad (6.4.11)$$
Statistic ĝ(v) is the sample mean estimator of g(v) (see (6.4.7)) and ĝ(Vl ) ≥ 1/n.
The Fourier estimator (6.4.10) allows us to construct a hazard rate E-estimator $\hat h^{X^*}(x)$, and
recall that its construction is based on the assumption that hidden random variables T ∗ , X ∗
and C ∗ are continuous and mutually independent. The corresponding coefficient of difficulty
is
$$d(a, b) := b^{-1}\int_a^{a+b} h^{X^*}(x)g^{-1}(x)dx, \qquad (6.4.12)$$
and its plug-in sample mean estimator is
$$\hat d(a, b) := n^{-1} b^{-1}\sum_{l=1}^n [\hat g(V_l)]^{-2}\,\Delta_l I(V_l \in [a, a+b]). \qquad (6.4.13)$$
Figure 6.4 allows us to gain experience in understanding LTRC data, choosing a feasible
interval of estimation, and E-estimation. The top diagram shows by circles and triangles
the scattergram of LTRC realizations of (T, V ). Circles show uncensored observations, cor-
responding to ∆ = 1, and triangles show censored observations corresponding to ∆ = 0.
Note that all observations are below the solid line T = V ; this is what we expect from
left-truncated observations. The second from the top diagram allows us to evaluate the complexity of the problem at hand. Crosses (and the corresponding scale is shown on the left vertical axis) show values of $1/\hat g(V_l)$ for uncensored ($\Delta_l = 1$) observations that are used in estimation of Fourier coefficients; see (6.4.10). The estimated coefficient of difficulty is
exhibited by circles and its scale is shown on the right vertical axis. Both estimates are
shown for Vl ∈ [a1, a1 + b1], and Figure 6.4 allows one to change the interval. For the data
at hand, function 1/ĝ(v) has sharply increasing tails, and the same can be said about right
tail of the coefficient of difficulty. These two estimates can be used for choosing a feasible
interval of estimation for the E-estimator of the hazard rate. Two bottom diagrams show
us hazard rate E-estimates for different intervals of estimation. Note that the larger interval
includes areas with very large values of 1/g(v) and this dramatically increases the coefficient
of difficulty, the confidence bands, and the ISE. The "effective" number N of uncensored $V_l$ falling within an interval of estimation is indicated in a corresponding title. In the bottom diagram N is almost half of that in the third from the top diagram, and nonetheless the estimation is dramatically better. This sheds another light on the effect of the interval of estimation and the complexity of estimating tails.
So far we have considered the case of continuous and independent underlying hidden random variables. This may not be the case in some applications. For instance, in a clinical trial we may have $C^* := T^* + \min(u, U^*)$ where $P(U^* \ge 0) = 1$ and u is a positive constant that defines the length of the trial. Let us consider an LTRC model where a continuous lifetime $X^*$ is independent of $(T^*, C^*)$ while $T^*$ and $C^*$ may be dependent and have a mixed (continuous and discrete) joint distribution.
We begin with formulas for the involved distributions. Write,
Differentiation of (6.4.14) with respect to v yields a formula for the mixed density,
$$f^{V,\Delta}(v, 1) = p^{-1} f^{X^*}(v)P(T^* \le v \le C^*) = h^{X^*}(v)\big[p^{-1} G^{X^*}(v)P(T^* \le v \le C^*)\big]. \qquad (6.4.16)$$
$$P(T \le x \le V) = P(T^* \le x \le V^* \mid T^* \le V^*) = p^{-1}G^{X^*}(x)P(T^* \le x \le C^*).$$
[Figure 6.4: panel subtitles report ISE = 0.79 and ISE = 0.056.]
Figure 6.4 Estimation of the hazard rate based on LTRC observations generated by independent
and continuous hidden variables. The top diagram shows a sample of size n = 300 from (T, V, ∆). Uncensored (their number is $N := \sum_{l=1}^n \Delta_l = 202$) and censored observations are shown by circles
and triangles, respectively. The solid line is T = V . In the underlying hidden model, the random
variable of interest X ∗ is the Bimodal, the truncating variable T ∗ is Uniform(0, 0.5), and the cen-
soring variable C ∗ is Uniform(0, 1.5). Second from the top diagram shows us by crosses and circles estimates $1/\hat g(V_l)$ and $\hat d(a1, V_l)$ for uncensored $V_l \in [a1, a1 + b1]$. Note the different scales for the
two estimates shown on the left and right vertical axes, respectively. Two bottom diagrams show
E-estimate (the dashed line), underlying hazard rate (the solid line) and pointwise and simultaneous
1 − α = 0.95 confidence bands by dotted and dot-dashed lines. The interval of estimation and the
number N of observations fallen within the interval are shown in the title, the subtitle shows the
ISE of E-estimate over the interval. {Distribution of T ∗ is either the default Uniform(0, uT ) with
uT = 0.5, or Exponential(λT ) with the default λT =0.3 where λT is the mean. Censoring distribu-
tion is either the default Uniform(0, uC ) with uC = 1.5 or Exponential(λC ) with the default λC =
1.5. For instance, to choose exponential truncation and censoring, set trunc = "Expon" and cens = "Expon".} [n = 300, corn = 3, trunc = "Unif", uT = 0.5, lambdaT = 0.3, cens = "Unif",
uC = 1.5, lambdaC = 1.5, a = 0.1, b = 0.5, A = 0.05, B = 0.7, a1 = 0.05, b1 = 0.6, alpha =
0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
[Figure 6.5: panel subtitles report ISE = 0.49 and ISE = 0.053.]
Figure 6.5 Estimation of the hazard rate for LTRC data with C ∗ = T ∗ + min(u, U ∗ ) and X ∗ being
independent of (T ∗ , C ∗ ). The default U ∗ is Uniform(0, uC ) and independent of T ∗ , and u = 0.5.
The structure of diagrams is identical to Figure 6.4, only in the top diagram the second solid line
is added to indicate largest possible censored observations. {Distribution of T ∗ is either the default
Uniform(0, uT ) with uT = 0.5, or Exponential(λT ) with the default λT =0.3 where λT is the mean.
Distribution of U ∗ is either the default Uniform(0, uC ) with uC = 1.5 or Exponential(λC ) with the
default λC = 1.5. To choose the exponential truncation and censoring, set trunc = "Expon" and cens = "Expon".} [n = 300, corn = 3, trunc = "Unif", uT = 0.5, lambdaT = 0.3, cens = "Unif",
u = 0.5, uC = 1.5, lambdaC = 1.5, a = 0, b = 0.55, B = 0.75, a1 = 0, b1 = 0.75, alpha = 0.05,
cJ0 = 4, cJ1 = 0.5, cTH = 4]
$$h^{X^*}(x) = \frac{f^{V,\Delta}(x, 1)}{P(T \le x \le V)} \quad \text{whenever } P(T^* \le x \le C^*)G^{X^*}(x) > 0. \qquad (6.4.18)$$
$$G^X(x) := P(X > x) = P\big(\text{survive in } [0, V_{l_1})\big)\,P\big(\text{survive in } [V_{l_1}, V_{l_2}) \mid \text{survive in } [0, V_{l_1})\big)\times P\big(\text{survive in } [V_{l_2}, x] \mid \text{survive in } [0, V_{l_2})\big). \qquad (6.5.4)$$
For the first probability on the right side of (6.5.4) a natural estimate is 1 because no deaths have been recorded prior to the moment $V_{l_1}$. For the second probability a natural estimate is $(n - l_1)/(n - l_1 + 1)$ because $n - l_1 + 1$ is the number of individuals remaining in the study before time $V_{l_1}$ and then one individual died prior to the moment $V_{l_2}$. Similarly, for the third probability a natural estimate is $(n - l_2)/(n - l_2 + 1)$. If we plug these estimators into (6.5.4), then we get the Kaplan–Meier estimator
This explanation also sheds light on the notion of product-limit estimation often applied to the Kaplan–Meier estimator.
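For practice with real data, the product-limit estimate can be computed with the standard R package survival; here is a brief sketch in which the data vectors are hypothetical:

# Kaplan-Meier (product-limit) estimate of the survival function based on
# right-censored data (V, Delta) with V = min(X, C) and Delta = I(X <= C).
library(survival)
V <- c(0.3, 0.7, 0.2, 1.1, 0.5)    # hypothetical observed values
Delta <- c(1, 0, 1, 1, 0)          # hypothetical censoring indicators
km <- survfit(Surv(V, Delta) ~ 1)
summary(km)                        # survival estimates at the uncensored times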
The Kaplan–Meier estimator is the most popular estimator of the survival function for RC data. At the same time, it is not as simple as the classical empirical (sample mean) cumulative distribution function estimator $\hat F^X(x) := n^{-1}\sum_{l=1}^n I(X_l \le x)$ used for the case of direct observations. Can a sample mean method be used for estimation of the survival function with RC data? The answer is "yes," and below it is explained how this estimator can be constructed.
Let us explain the proposed method of estimation of the survival function and density
of X based on RC data. As usual, we begin with formulas for the joint density of observed
variables,
$$f^{V,\Delta}(x, 1) = f^X(x)G^C(x) = [f^X(x)/G^X(x)]\,g(x) = h^X(x)g(x), \qquad (6.5.6)$$
and
$$f^{V,\Delta}(x, 0) = f^C(x)G^X(x) = [f^C(x)/G^C(x)]\,g(x) = h^C(x)g(x), \qquad (6.5.7)$$
where g(x) is defined in (6.5.1), and $h^X(x)$ and $h^C(x)$ are the hazard rates of X and C. Note
that (6.5.6) and (6.5.7) contain several useful expressions for the joint density that shed
extra light on RC. Further, please look again at the formulas and note the symmetry of the
RC with respect to the lifetime of interest and censoring variable.
We begin with the explanation of how the sample mean methodology can be used for
estimation of the survival function GC (x) of the censoring random variable. Recall that it
can be estimated only for x ∈ [0, β] where β ≤ βV := min(βX , βC ). The idea of estimation is
based on a formula which expresses the survival function via the corresponding cumulative
hazard $H^C(x)$,
$$G^C(x) = \exp\{-H^C(x)\}. \qquad (6.5.8)$$
Using (6.5.1) and (6.5.7), the cumulative hazard may be written as
$$H^C(x) := \int_0^x h^C(u)du = \int_0^x [f^C(u)/G^C(u)]du = \int_0^x [f^{V,\Delta}(u, 0)/g(u)]du = E\{(1 - \Delta)I(V \in [0, x])/g(V)\}. \qquad (6.5.9)$$
As a result, we can use a plug-in sample mean estimator for the cumulative hazard, then
plug it in (6.5.8) and get the following estimator of the survival function,
$$\hat G^C(x) := \exp\Big\{-n^{-1}\sum_{l=1}^n (1 - \Delta_l)I(V_l \le x)/\hat g(V_l)\Big\}. \qquad (6.5.10)$$
Here
$$\hat g(x) := n^{-1}\sum_{l=1}^n I(V_l \ge x) \qquad (6.5.11)$$
is the sample mean estimate of g(x) defined in (6.5.1). Note that ĝ(Vl ) ≥ n−1 and hence
the estimator (6.5.10) is well defined.
The appealing feature of estimator (6.5.10) is its simple interpretation because we esti-
mate the logarithm of the survival function by a sample mean estimator. This is why we
may refer to the estimator as a sample mean estimator (it is also explained in the Notes
that this estimator, written as a product-limit, becomes a Nelson–Aalen–Breslow estimator
which is another canonical estimator in the survival analysis). Another important remark
is that GX (x) also may be estimated by (6.5.10) with 1 − ∆l being replaced by ∆l .
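A minimal R sketch of (6.5.10) and (6.5.11), including this modification, may look as follows; the function name is illustrative:

# Sample mean (exponentiated cumulative hazard) survival estimators for RC data.
rc.survival <- function(V, Delta, x) {
  g.hat <- sapply(V, function(v) mean(V >= v))                          # (6.5.11) at V_l
  G.C <- sapply(x, function(t) exp(-mean((1 - Delta)*(V <= t)/g.hat)))  # (6.5.10)
  G.X <- sapply(x, function(t) exp(-mean(Delta*(V <= t)/g.hat)))        # same with Delta_l
  list(G.C = G.C, G.X = G.X)
}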
Now let us consider estimation of the probability density f X (x), x ∈ [0, β]. We are using
notation {ψj (x)} for the cosine basis on [0, β]. Fourier coefficients of f X (x) can be written,
using (6.5.6), as
$$\theta_j := \int_0^\beta f^X(x)\psi_j(x)dx = \int_0^\beta [f^{V,\Delta}(x, 1)\psi_j(x)/G^C(x)]dx$$
In its turn, the Fourier estimator yields a corresponding density E-estimator fˆX (x),
x ∈ [0, β]. Further, if β < βC then the corresponding coefficient of difficulty is
$$d(0, \beta) := \beta^{-1}E\{I(V \le \beta)\Delta[G^C(V)]^{-2}\} = \beta^{-1}\int_0^\beta \frac{f^X(x)}{G^C(x)}dx. \qquad (6.5.14)$$
Let us check how the estimator performs. Figure 6.6 exhibits a particular RC sample and
the suggested estimates. In the simulation βC = 1.5 > βX = 1, and hence the distribution
of X can be consistently estimated. Keeping this remark in mind, let us look at the data.
[Figure 6.6: a panel subtitle reports ISE = 0.025.]
Figure 6.6 Right-censored data with βC > βX and estimation of distributions. The top diagram
shows simulated data. The distribution of the censoring random variable C is uniform on [0, uC ]
and the lifetime of interest X is the Bimodal. The second from the top diagram shows by the
solid line the underlying GC , dashed and dotted lines show the sample mean and Kaplan–Meier
estimates, respectively. The third from the top diagram shows the underlying density (the solid
line), the E-estimate (the dashed line), and (1 − α) pointwise (the dotted lines) and simultaneous
(the dot-dashed lines) confidence bands. The estimate is for the interval [0,1]. The bottom diagram
is similar, only here the estimate is for the interval [0, V(n) ] where V(n) := max(V1 , . . . , Vn ). {For
exponential censoring set cens = "Expon", and the mean of the exponential censoring variable is controlled by argument lambdaC}. [n = 300, corn = 3, cens = "Unif", uC = 1.5, lambdaC = 1.5,
alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
[Figure 6.7: a panel subtitle reports ISE = 0.9.]
Figure 6.7 Right-censored data with βC < βX and estimation of distributions. This figure is created
by Figure 6.6 using arguments uC = 0.7 and corn = 4. [n = 300, corn = 4, cens = "Unif", uC =
0.7, lambdaC = 1.5, alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
The second from the top diagram shows the underlying $G^C$, the proposed sample-mean and Kaplan–Meier estimates (consult the caption about the corresponding curves). The estimates are close to
each other. Let us also stress that there is no chance to consistently estimate the right tail of
the distribution of C due to the fact that βX < βC . The two bottom diagrams show how the
density E-estimator, constructed for intervals [0, 1] and [0, V(n) ], performs. The first interval
is of special interest because [0, 1] is the true support of the density $f^X$, and the second one
is a data-driven interval of estimation. Keeping in mind that only N = 185 uncensored
observations are available and the Bimodal is a difficult density for estimation even for the
case of direct observations, the particular density E-estimates are very good. Also look at
and compare the corresponding confidence bands and ISEs.
What will happen if βC < βX ? Theoretically this precludes us from consistent estimation of the right tail of $f^X(x)$, but can we still estimate the density over an interval [0, β] with β < βV ? The above-presented theory indicates that this is possible, and Figure 6.7 illustrates the case. This figure is generated by Figure 6.6 using uC = 0.7; also note that the underlying
density is the Strata. The title for the top diagram shows that only N = 101 uncensored
observations of the lifetime X are available, and hence we are dealing with a severe censoring.
Further, note that while the largest observation of C is close to βV = βC = 0.7, all but one of the observations of X are smaller than 0.32. The second from the top diagram exhibits
very good estimates of the survival function GC (x) over the interval [0, 0.7], and this is not
a surprise due to the large number n − N = 199 of observed realizations of the censored
variable. Now let us look at the two bottom diagrams. As could be predicted, the density
estimate in the second from the bottom diagram is not satisfactory because there are simply
no observations to estimate the right part of the density. The E-estimate for the interval
[0, V(n) ], shown in the bottom diagram, is better but still its right tail is poor because we
have just one observation of X which is larger than 0.32 and hence there is no way an
estimator may indicate the large underlying right stratum.
It is recommended to repeat Figures 6.6 and 6.7 with different corner functions, censoring distributions and parameters to gain experience in dealing with RC data.
and
$$f^X(x) = p^{-1}f^{X^*}(x)F^{T^*}(x) = \frac{f^{X^*}(x)\,g(x)}{G^{X^*}(x)}, \qquad (6.6.7)$$
where
$$g(u) := P(T \le u \le X) = p^{-1}F^{T^*}(u)G^{X^*}(u). \qquad (6.6.8)$$
Now we are in a position to explain the proposed estimators. We begin with estimators for
the boundary points of the supports,
where
$$\hat g(x) := n^{-1}\sum_{l=1}^n I(T_l \le x \le X_l), \qquad (6.6.12)$$
and this is a sample mean estimator of $g(x) = E\{I(T \le x \le X)\}$. Note that estimator (6.6.11) is zero for $x \le X_{(1)}$ and it is equal to $\hat H^{X^*}(X_{(n)})$ for all $x \ge X_{(n)}$. In what follows we may refer to estimator (6.6.11) as an empirical cumulative hazard.
Now we can explain how to estimate the survival function
$$G^{X^*}(x) := P(X^* > x) = e^{-\int_0^x h^{X^*}(u)du} = e^{-H^{X^*}(x)}. \qquad (6.6.13)$$
We plug the empirical cumulative hazard (6.6.11) into the right side of (6.6.13) and get
$$\hat G^{X^*}(x) := e^{-\hat H^{X^*}(x)}. \qquad (6.6.14)$$
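A minimal R sketch of (6.6.12) and (6.6.14) may look as follows; it assumes that the estimator (6.6.11) is the sample mean estimator of the expectation $E\{I(X \le x)/g(X)\}$, which is consistent with formula (6.6.32) below:

# Empirical cumulative hazard and survival estimator for LT data (T_l, X_l).
lt.survival <- function(T, X, x) {
  g.hat <- sapply(X, function(v) mean(T <= v & v <= X))       # (6.6.12) at X_l
  H.hat <- sapply(x, function(t) mean((X <= t)/g.hat))        # empirical cumulative hazard
  exp(-H.hat)                                                 # (6.6.14)
}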
The attractive feature of the estimator (6.6.14) is that its construction is simple and it is easy to analyze statistically. For instance, suppose that its asymptotic (as the sample size
increases) variance is of interest. Then the asymptotic variance of the cumulative hazard is
calculated straightforwardly because it is a sample mean estimator, and then the variance
of the survival function is evaluated by the delta method. Let us follow these two steps and
calculate the asymptotic variance. First, under the above-made assumptions the asymptotic
variance of the empirical cumulative hazard is
$$\lim_{n\to\infty}[nV(\hat H^{X^*}(x))] = E\{I(X \le x)/g^2(X)\} - [H^{X^*}(x)]^2. \qquad (6.6.15)$$
Then the delta method yields
$$\lim_{n\to\infty}[nV(\hat G^{X^*}(x))] = [G^{X^*}(x)]^2\big(E\{I(X \le x)/g^2(X)\} - [H^{X^*}(x)]^2\big)$$
$$= [G^{X^*}(x)]^2\Big(p\int_0^x \frac{f^{X^*}(u)}{F^{T^*}(u)[G^{X^*}(u)]^2}du - [H^{X^*}(x)]^2\Big). \qquad (6.6.16)$$
There are two lines in (6.6.16) and both are of interest to us. The top line explains how
the variance may be estimated via a combination of plug-in and sample mean techniques.
Furthermore, for a sample mean estimator, under a mild assumption, the central limit
theorem yields asymptotic normality, then the delta method tells us that the asymptotic
normality is preserved by the exponential transformation (6.6.14), and hence we can use
this approach for obtaining confidence bands for the estimator (6.6.14). The bottom line
in (6.6.16) is important for understanding the conditions that imply consistent estimation of the survival function $G^{X^*}(x)$. For instance, if $\alpha_{T^*} < \alpha_{X^*}$, then the integral in (6.6.16) is
finite. To see this, note that 0/0 = 0 and hence the integral over $u \in [0, x]$, $x > \alpha_{X^*}$, is the same as the integral over $u \in [\alpha_{X^*}, x]$ where $F^{T^*}(u) \ge F^{T^*}(\alpha_{X^*}) > 0$ due to the assumed $\alpha_{T^*} < \alpha_{X^*}$. The situation changes if $\alpha_{T^*} = \alpha_{X^*}$, and this is the case in many applications.
Depending on the ratio $f^{X^*}(u)/F^{T^*}(u)$ for u near $\alpha_{T^*}$, the integral in (6.6.16) is either finite or infinite. This is what makes the effect of the LT on estimation so unpredictable because it may preclude us from consistent estimation of $G^{X^*}$ even if $\alpha_{T^*} = \alpha_{X^*}$. Recall that we made a similar conclusion in Section 6.3 for estimation of the hazard rate. Furthermore, if $\alpha_{T^*} > \alpha_{X^*}$ then no consistent estimation of the distribution of $X^*$ is possible, and it will be explained shortly what may be estimated in this case.
Now we develop an E-estimator of the probability density $f^{X^*}(x)$. According to (6.6.7) we have the following relation,
$$f^{X^*}(x) = f^X(x)G^{X^*}(x)/g(x) \quad \text{whenever } g(x) > 0. \qquad (6.6.17)$$
Suppose that we are interested in estimation of the density over an interval [a, a + b] such
that g(x) is positive on the interval. Denote by {ψj (x)} the cosine basis on [a, a + b]. Then
(6.6.17) allows us to write for a Fourier coefficient,
$$\theta_j := \int_a^{a+b} f^{X^*}(x)\psi_j(x)dx = E\{I(X \in [a, a+b])\psi_j(X)G^{X^*}(X)/g(X)\}. \qquad (6.6.18)$$
[Figure 6.8: the middle panel subtitle reports p = 0.63 and p̂ = 0.64.]
Figure 6.8 Estimation of the distribution of the lifetime of interest X ∗ in an LT sample. In the
hidden model X ∗ is distributed according to the Strata (the choice is controlled by parameter corn)
and T ∗ is Uniform([0, uT ]). The top diagram shows by circles and crosses g(Xl ) and ĝ(Xl ), and
x-coordinates of triangles show observations of the truncating variable T . In the middle diagram, the
wide solid, dotted and dashed lines show the underlying survival function, the Kaplan–Meier estimate and the sample mean estimate (6.6.14). The subtitle shows the underlying probability (6.6.4) and its
estimate (6.6.22). The bottom diagram shows the underlying density (the solid line), the E-estimate
(the dashed line), and (1 − α) pointwise (the dotted lines) and simultaneous (the dot-dashed lines)
confidence bands. [n = 200, corn = 4, uT = 0.7, alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
There is one more interesting unknown parameter that may be estimated. Recall that
p := P(T ∗ ≤ X ∗ ) defines the probability of a hidden pair (T ∗ , X ∗ ) to be observed, or in
other words p defines the likelihood that $X^*$ is not truncated. Of course it is of interest to know this parameter of the LT. Equality (6.6.8), together with $F^{T^*}(\beta_{T^*}) = 1$, implies that $g(\beta_{T^*}) = p^{-1}G^{X^*}(\beta_{T^*})$. This motivates the following estimator,
$$\hat p := \hat G^{X^*}(T_{(n)})/\hat g(T_{(n)}). \qquad (6.6.22)$$
Note that any $t \ge T_{(n)}$ can also be used in (6.6.22) in place of $T_{(n)}$.
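In R the estimator (6.6.22) is a short consequence of the previous sketch; the names are illustrative:

# Estimator (6.6.22) of p = P(T* <= X*); the survival estimate is (6.6.14), g.hat is (6.6.12).
p.hat <- function(T, X) {
  t.max <- max(T)                                  # T_(n)
  g.T <- mean(T <= t.max & t.max <= X)             # g.hat at T_(n)
  G.T <- lt.survival(T, X, t.max)                  # G.hat at T_(n), see the sketch above
  G.T / g.T
}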
Figure 6.8 allows us to look at a simulated LT sample and evaluate the performance of the proposed estimators; the explanation of its three diagrams can be found in the caption.
The top diagram allows us to visualize realizations of X and T . Here we also can compare
the underlying function g(x) and its sample mean estimate (6.6.12). Note how fast the
right tail of g(x) vanishes. Nonetheless, as we shall see shortly, this does not preclude us
from estimation of the distribution of X ∗ . The middle diagram indicates that the sample
mean and the Kaplan–Meier estimators perform similarly. Its subtitle shows the underlying
parameter p and its estimate (6.6.22). As we see, we can get a relatively fair idea about
a hidden sampling which is governed by a negative binomial distribution with parameters
(n, p). Finally, the bottom diagram shows us the E-estimate of the density of X ∗ . As we
know, the Strata is a difficult density to estimate even for the case of direct observations.
Here we are dealing with LT observations that are, as we know, biased. Overall the particular
outcome is good because we clearly observe the two strata. The confidence bands are also
reasonable, and the ISE, shown in the subtitle, is relatively small. Further, the subtitle
shows us the estimated coefficient of difficulty (6.6.21).
Now let us return to the middle diagram in Figure 6.8 where the survival function and
its estimates are shown. In many applications the right tail of the survival function is of special interest. Let us explore a new idea for tail estimation. Note that $F^{T^*}(x) = 1$ whenever $x \ge \beta_{T^*}$. This allows us to write for $x \ge \beta_{T^*}$,
Figure 6.9 allows us to look at the zoomed-in tails of estimates produced by estimators
(6.6.14) and (6.6.24). The two diagrams show results of different simulations for the same
LT model. The solid, dashed and dotted lines are the underlying survival function and
estimates (6.6.14) and (6.6.24). As we see, the top diagram depicts an outcome where the new estimate is better; this is also stressed by the ratio between the empirical integrated squared error (ISE) of the estimate (6.6.14) and the integrated squared error (ISEN) of the new estimate (6.6.24). Note that there are just a few large observations, and this is a
rather typical situation with tails. The sample size n = 75 is relatively small but it is chosen
for better visualization of the estimates. The bottom diagram exhibits results for another
simulation, and here the estimate (6.6.14) is better. This is a tie, and we may conclude that
in general two simulations are not enough to compare estimators. To resolve the issue, we
repeat the simulation 300 times and then statistically analyze ratios of the ISEs. Sample
mean and sample median of the ratios are shown in the subtitle, and they indicate better
performance of the estimator (6.6.24). Of course, the sample size is too small for estimation
of the tail, but the method of choosing between two estimators is statistically sound. The
reader is advised to repeat Figure 6.9 with different parameters, compare performance of
the two estimators, and gain experience in choosing between several estimators.
[Figure 6.9: a panel subtitle reports ISE/ISEN = 1.42.]
Figure 6.9 Estimation of the tail of the survival function $G^{X^*}(x)$. The underlying simulation is the
same as in Figure 6.8. The solid, dashed and dotted lines are the underlying survival function and
estimates (6.6.14) and (6.6.24), respectively. {Argument nsim controls the number of simulations.}
[n = 75, corn = 2, uT = 0.7, nsim = 300]
So far we have discussed the problem of estimation of the distribution of a hidden lifetime
of interest X ∗ . It is also of interest to estimate the distribution of a hidden truncating
random variable T ∗ . We again begin with probability formulas. Using (6.6.6) we can write,
$$q^{T^*}(t) := \frac{f^{T^*}(t)}{F^{T^*}(t)} = \frac{f^T(t)}{g(t)} \quad \text{whenever } g(t) > 0. \qquad (6.6.25)$$
Function $q^{T^*}(t)$ can be estimated using the second equality in (6.6.25), and hence we need to understand how the distribution of interest can be expressed via the function q(t). This function resembles the hazard rate, only now the denominator is the cumulative distribution function instead of the survival function. Straightforward algebra implies that
$$F^{T^*}(t) = \exp\Big\{-\int_t^{\beta_{T^*}} q^{T^*}(u)du\Big\} =: \exp\{-Q^{T^*}(t)\} = \exp\big\{-E\{I(T > t)/g(T)\}\big\}, \quad t \in [\alpha_{T^*}, \beta_{T^*}]. \qquad (6.6.26)$$
The expectation in (6.6.26) allows us to propose the following plug-in sample mean
estimator of the cumulative distribution function (compare with (6.6.14))
$$\hat F^{T^*}(t) := \exp\{-\hat Q^{T^*}(t)\} := \exp\Big\{-n^{-1}\sum_{l=1}^n I(T_l > t)/\hat g(T_l)\Big\}, \qquad (6.6.27)$$
[Figure 6.10: a panel subtitle reports ISE = 7.8e-05.]
Figure 6.10 Estimation of the distribution of T ∗ for LT data. Underlying simulation is the same
as in Figure 6.8. Observations are shown in the top diagram. In the middle diagram the solid and
dashed lines are the cumulative distribution function $F^{T^*}$ and its estimate. The dotted lines show
the (1 − α) pointwise confidence band. Curves in the bottom diagram are the same as in the bottom
diagram of Figure 6.8. [n = 100, corn = 4, uT = 0.7, alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
[Figure 6.11: the middle panel subtitle reports p = 0.51 and p̂ = 0.69.]
Figure 6.11 Estimation of the distribution of X ∗ for the case αX ∗ < αT ∗ . This figure is similar
to Figure 6.8 apart from two modifications. First, here T ∗ is uniform on the interval [ut , uT ] = [0.2, 0.7]. Second, the narrow solid lines in the middle and bottom diagrams show the underlying conditional survival function $G^{X^*|X^*>u_t}(x)$ and the underlying conditional density $f^{X^*|X^*>u_t}(x)$, respectively.
[n = 200, corn = 4, ut = 0.2, uT = 0.7, alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
Further, typically $[T_{(1)}, T_{(n)}]$ may be recommended as the interval of estimation, and note that $P(T_{(n)} \le X_{(n)}) = 1$.
Figure 6.10 shows us how the proposed estimators of the cumulative distribution function
and the density of the hidden truncation variable T ∗ perform. For the particular simulation,
the estimates are good. Confidence bands for the density may look too wide, but this is only
because the E-estimate is good. The reader is advised to repeat Figure 6.10 with different
parameters and test performance of the estimator and confidence bands.
Note that we have developed estimators for T ∗ from scratch. It was explained in the
previous sections that for RC data there is a symmetry between estimation of distributions
of the lifetime of interest X ∗ and censoring variable C ∗ , and if an estimator for X ∗ is
proposed, then it also can be used for C ∗ . Is there any type of a similar symmetry for LT
data? The answer is “yes.” Let γ be a positive constant such that γ ≥ max(βX ∗ , βT ∗ ), and
introduce two new random variables $X' := \gamma - T^*$ and $T' := \gamma - X^*$. Then $(X', T')$ can be considered as the underlying hidden variables for corresponding LT data with $X'$ being the lifetime of interest and $T'$ being the truncating variable. This is a type of symmetry that allows us to estimate distributions of $T^*$ using estimators developed for $X^*$.
Finally, let us explain what may be expected when assumptions (6.6.1) and (6.6.2) are violated. If $\alpha_{X^*} < \alpha_{T^*}$ then the LT hides the left tail of the distribution of $X^*$, and we cannot restore it. Violation of (6.6.2) hides the right tail of the distribution of $T^*$.
Figure 6.11 illustrates the case when αX ∗ < αT ∗ (note that the diagrams are similar to
the ones in Figure 6.8 whose caption explains the diagrams). First of all, let us look at the
top diagram which shows realizations of X and T . If we compare them with those in Figure
6.8, then we may conclude that visualization of observations is unlikely to point to the
violation of assumption (6.6.1). The reader is advised to repeat Figures 6.8 and 6.11 and
gain experience in assessing LT data. Further, it is clear from the two bottom diagrams that
the estimators are inconsistent, and this is what was predicted. Nonetheless, it looks like
the estimates mimic the underlying ones. Let us explore this issue and shed light on what
the estimators do when αX ∗ < αT ∗ .
We begin with the estimator (6.6.14) of the survival function. Recall that we are con-
sidering the case αX ∗ < αT ∗ . Note that then P(X ≤ αT ∗ ) = 0, and using (6.6.11) we can
write for x > αT ∗ ,
$$E\{\hat H^{X^*}(x)\} = E\Big\{\frac{I(X < x)I(X > \alpha_{T^*})}{\hat g(X)}\Big\} = E\Big\{\frac{I(\alpha_{T^*} < X < x)}{g(X)}\Big\} + E\Big\{I(\alpha_{T^*} < X < x)\Big[\frac{1}{\hat g(X)} - \frac{1}{g(X)}\Big]\Big\}. \qquad (6.6.32)$$
The first term in (6.6.32) is what we are interested in because the second one, under a mild
assumption, vanishes as n increases. Using (6.6.7) we can express the first term via the
hazard rate of X ∗ ,
$$E\Big\{\frac{I(\alpha_{T^*} < X < x)}{g(X)}\Big\} = \int_{\alpha_{T^*}}^x h^{X^*}(u)du. \qquad (6.6.33)$$
[Figure 6.12: the middle panel subtitle reports p = 0.34 and p̂ = 0.7.]
Figure 6.12 Estimation of the distribution of X ∗ for the case βT ∗ > βX ∗ . The figure is created by
Figure 6.8 by setting uT = 1.5 and corn = 2.
Consider the expectation of Fourier estimator (6.6.19) given αX ∗ < αT ∗ . Let [a, a + b]
be an interval such that g(x) > 0 for x ∈ [a, a + b]. We may write using (6.6.7),
$$E\{\hat\theta_j\} = E\{I(X \in [a, a+b])\psi_j(X)\hat G^{X^*}(X)/\hat g(X)\}$$
$$= E\{I(X \in [a, a+b])\psi_j(X)G^{X^*|X^*>\alpha_{T^*}}(X)/g(X)\}$$
$$+\, E\Big\{I(X \in [a, a+b])\psi_j(X)\Big[\frac{\hat G^{X^*}(X)}{\hat g(X)} - \frac{G^{X^*|X^*>\alpha_{T^*}}(X)}{g(X)}\Big]\Big\}. \qquad (6.6.35)$$
The first expectation on the right side of (6.6.35) is the term of interest because the second
one, under a mild assumption, vanishes as n increases. Using (6.6.17) and (6.6.34) we may
write,
$$E\{I(X \in [a, a+b])\psi_j(X)G^{X^*|X^*>\alpha_{T^*}}(X)/g(X)\} = \int_a^{a+b} f^X(x)\psi_j(x)G^{X^*|X^*>\alpha_{T^*}}(x)[g(x)]^{-1}dx = \int_a^{a+b}\frac{f^{X^*}(x)\psi_j(x)}{G^{X^*}(\alpha_{T^*})}dx. \qquad (6.6.36)$$
Introduce the conditional density,
$$f^{X^*|X^*>\alpha_{T^*}}(x) := \frac{f^{X^*}(x)}{G^{X^*}(\alpha_{T^*})}\,I(x > \alpha_{T^*}). \qquad (6.6.37)$$
[Figure 6.13: a panel subtitle reports ISE = 0.3.]
Figure 6.13 Estimation of the distribution of T ∗ for the case βT ∗ > βX ∗ . The figure is created by
Figure 6.10 via setting uT = 1.5.
If we now return to the bottom diagram in Figure 6.11, the narrow solid line shows
us the underlying conditional density. The E-estimate (the dashed line) is far from being
perfect but its shape correctly indicates the two strata in the underlying conditional density.
Now let us consider the case βT ∗ > βX ∗ . Figures 6.12 and 6.13 shed light on this case.
Figure 6.12 indicates that this case does not preclude us from consistent estimation of the
distribution of X ∗ . In the top diagram x-coordinates of circles and triangles show us the
observed values of X and T , respectively. Our aim is to explore the possibility to realize
that βT ∗ > βX ∗ . As we discussed earlier, theoretically T(n) should approach βT = βX ∗ = 1.
Because X(n) converges in probability to βX = 1, it may be expected that T(n) and X(n) are
close to each other. Unfortunately, even for the used relatively large sample size n = 200,
T(n) is significantly smaller than X(n) and the outcome resembles what we observed in
Figure 6.8. The explanation of this observation is based on formula (6.6.6) for the density
$f^T(t)$. It indicates that the density is proportional to the survival function $G^{X^*}(t)$ which
vanishes (and in our case relatively fast) as t → βX = 1. This is what significantly slows
down the convergence of T(n) to X(n) . The lesson learned is that data alone may not allow
us to verify validity of (6.6.2). Further, we cannot consistently estimate p. At the same time,
as it could be expected, consistent estimation of the distribution of X ∗ is possible and the
exhibited results support this conclusion.
The situation clearly changes when the aim is to estimate the distribution of T ∗ . Figure
6.13 shows a particular outcome, and it clearly indicates our inability to estimate the distribution of $T^*$. At the same time, as follows from the formulas, the shape of the density $f^{T^*}(t)$ over the interval $[T_{(1)}, T_{(n)}]$ may be visualized, and the bottom diagram sheds light on this
conclusion. The interested reader may theoretically establish what the proposed estimator
estimates for the considered case βT ∗ > βX ∗ .
The reader is advised to repeat Figures 6.8–6.13 with different parameters, pay special attention to feasible intervals of estimation, and get used to statistical analysis of LT data.
$$f^{T,V,\Delta}(t, v, \delta) = p^{-1} f^{T^*}(t) I(t \le v)\big[f^{X^*}(v)G^{C^*}(v)\big]^{\delta}\big[f^{C^*}(v)G^{X^*}(v)\big]^{1-\delta}. \qquad (6.7.3)$$
$$f^{X^*}(x) = \frac{f^{V,\Delta}(x, 1)}{p^{-1}G^{C^*}(x)F^{T^*}(x)} \quad \text{whenever } G^{C^*}(x)F^{T^*}(x) > 0. \qquad (6.7.6)$$
Finally, we introduce a probability that plays a key role in the analysis of LTRC data,
$$g(x) := P(T \le x \le V) = p^{-1}F^{T^*}(x)G^{X^*}(x)G^{C^*}(x). \qquad (6.7.7)$$
Formula (6.7.9) allows us to obtain a simple formula for the cumulative hazard of X ∗ ,
$$H^{X^*}(x) := \int_0^x [f^{X^*}(u)/G^{X^*}(u)]du = \int_0^x [f^{V,\Delta}(u, 1)/g(u)]du = E\{\Delta I(V \le x)g^{-1}(V)\}, \quad x \in [0, \beta_{X^*}). \qquad (6.7.10)$$
Recall that $G^{X^*}(x) = \exp\{-H^{X^*}(x)\}$, and then the expectation on the right side of (6.7.10) implies the following plug-in sample mean estimator of the survival function,
$$\hat G^{X^*}(x) := \exp\Big\{-n^{-1}\sum_{l=1}^n \frac{\Delta_l I(V_l \le x)}{\hat g(V_l)}\Big\}. \qquad (6.7.11)$$
Here
$$\hat g(x) := n^{-1}\sum_{l=1}^n I(T_l \le x \le V_l) \qquad (6.7.12)$$
is the sample mean estimator of the probability g(x) defined in (6.7.7). Further, it is a
straightforward calculation to find an asymptotic expression for the variance of empirical
survival function (6.7.11),
$$\lim_{n\to\infty} nV(\hat G^{X^*}(x)) = [G^{X^*}(x)]^2\big[E\{\Delta I(V \le x)[g(V)]^{-2}\} - (H^{X^*}(x))^2\big]. \qquad (6.7.13)$$
This result, together with the central limit theorem and delta method, allows us to get a
pointwise confidence band for the empirical survival function.
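A minimal R sketch of (6.7.11) and (6.7.12), together with such a normal-approximation pointwise band based on a plug-in version of (6.7.13), may look as follows; the function name and the particular plug-in are illustrative:

# Survival estimator and pointwise confidence band for LTRC data (T_l, V_l, Delta_l).
ltrc.survival <- function(T, V, Delta, x, alpha = 0.05) {
  n <- length(V)
  g.hat <- sapply(V, function(v) mean(T <= v & v <= V))            # (6.7.12) at V_l
  G.hat <- sapply(x, function(t) exp(-mean(Delta*(V <= t)/g.hat))) # (6.7.11)
  H.hat <- -log(G.hat)
  v.hat <- sapply(x, function(t) mean(Delta*(V <= t)/g.hat^2)) - H.hat^2  # plug-in for (6.7.13)
  half <- qnorm(1 - alpha/2)*G.hat*sqrt(pmax(v.hat, 0)/n)
  data.frame(x = x, G = G.hat, lower = G.hat - half, upper = G.hat + half)
}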
To use our E-estimation methodology for estimation of the density $f^{X^*}(x)$ over an
interval [a, a+b] ⊂ [αX ∗ , βX ∗ ), we need to understand how to express its Fourier coefficients
as expectations. Recall our notation {ψj (x)} for the cosine basis on [a, a + b]. We can write
with the help of (6.7.6) and (6.7.7) that
$$\theta_j := \int_a^{a+b}\psi_j(x)f^{X^*}(x)dx = \int_a^{a+b}\frac{\psi_j(x)f^{V,\Delta}(x, 1)G^{X^*}(x)}{g(x)}dx = E\Big\{\frac{\Delta I(V \in [a, a+b])\psi_j(V)G^{X^*}(V)}{g(V)}\Big\}. \qquad (6.7.14)$$
The expectation in (6.7.14) yields the following plug-in sample mean Fourier estimator,
$$\hat\theta_j := n^{-1}\sum_{l=1}^n \Delta_l I(V_l \in [a, a+b])\psi_j(V_l)\hat G^{X^*}(V_l)/\hat g(V_l). \qquad (6.7.15)$$
Fourier estimator (6.7.15) yields a density E-estimator $\hat f^{X^*}(x)$ with the coefficient of difficulty
$$d(a, a+b) = b^{-1}E\{\Delta I(V \in [a, a+b])[G^{X^*}(V)/g(V)]^2\} = p\,b^{-1}\int_a^{a+b}\frac{f^{X^*}(x)}{F^{T^*}(x)G^{C^*}(x)}dx. \qquad (6.7.16)$$
The coefficient of difficulty is of special interest to us because it sheds light on a feasible interval of estimation. Namely, we know that $F^{T^*}(x)$ vanishes as $x \to \alpha_{T^*}$ and $G^{C^*}(x)$ vanishes as $x \to \beta_{C^*}$, and this is what may make the integral in (6.7.16) large. Assumption (6.7.8) allows us to avoid the case of an infinite coefficient of difficulty by choosing an interval of estimation satisfying $[a, a+b] \subset [V_{(1)}, V_{(n)}]$, but still the coefficient of difficulty may be prohibitively large. At the same time, vanishing tails of an underlying density $f^{X^*}$ may help in keeping the coefficient of difficulty reasonable, while an increasing tail makes estimation more complicated.
Now let us stress that if $\alpha_{T^*} > \alpha_{X^*}$ then, as was explained in Section 6.6, the proposed estimators estimate the conditional survival function $G^{X^*|X^*>\alpha_{T^*}}(x)$ and the conditional density $f^{X^*|X^*>\alpha_{T^*}}(x)$. Of course, these conditional characteristics coincide with $G^{X^*}(x)$ and $f^{X^*}(x)$ whenever $\alpha_{T^*} \le \alpha_{X^*}$, and hence we may say that in general we estimate those conditional characteristics. Statistical analysis of this setting is left as an exercise.
Figure 6.14 illustrates performance of the proposed estimators of the conditional survival
function and the conditional density of the lifetime of interest. Its caption explains the sim-
ulation and diagrams. Note that the assumption (6.7.8) holds for the particular simulation,
and choosing parameter ut > 0 allows us to consider the case αT ∗ > αX ∗ and then test
estimation of the conditional characteristics.
The top diagram in Figure 6.14 shows us simulated LTRC data generated by the Normal
lifetime and uniformly distributed truncating and censoring random variables. These three
variables are mutually independent. The good news here is that the number N = 230 of
available uncensored observations of the lifetime of interest is relatively large with respect
to the sample size n = 300. If we look at the smallest observations, we may observe that
αT ∗ is close to zero, and this implies a chance for a good estimation of the left tail of the
distribution of X ∗ . We may also conclude that it is likely that βX ∗ < βC ∗ because a relatively
large number of uncensored observations are available to the right of the largest observation
of the censoring variable. The middle diagram shows us the proposed plug-in sample mean
estimate (6.7.11) of $G^{X^*}(x)$, the Kaplan–Meier estimate, and the above-explained 95%
confidence band for the plug-in sample mean estimate. The two estimates are practically
the same. Despite a relatively large sample size n = 300, we see a pronounced deviation of
Figure 6.14 Estimation of the conditional survival function $G^{X^*|X^*>\alpha_{T^*}}(x)$ and the conditional density $f^{X^*|X^*>\alpha_{T^*}}(x)$ for the case of LTRC observations generated by independent and continuous hidden variables. In the used simulation T ∗ is Uniform(ut , uT ), X ∗ is the Normal and C ∗ is Uniform(uc , uC ); the used parameters are shown in the main title. The top diagram shows a sample of size n = 300 from (T, V, ∆). Observations of (V, ∆) are shown by circles, $N := \sum_{l=1}^n \Delta_l = 230$ is the number of uncensored observations. Observations of the truncation variable T are shown via horizontal coordinates of crosses. In the middle diagram, the solid line is the underlying conditional survival function $G^{X^*|X^*>\alpha_{T^*}}(x)$ shown for $x \in [V_{(1)}, V_{(n)}]$, the dashed and dot-dashed lines are the sample-mean and Kaplan–Meier estimates (they are close to each other), and the dotted lines show the (1 − α) pointwise confidence band. The bottom diagram shows the underlying conditional density (the solid line), its E-estimate (the dashed line), and the pointwise (dotted lines) and simultaneous (dot-dashed lines) 1 − α confidence bands. The E-estimate is for interval [a, a + b] with default values $a = V_{(1)}$ and $a + b = V_{(n)}$ shown in the subtitle. {Distribution of T ∗ is either the Uniform(ut , uT ) or Exponential(λT ) where λT is the mean. Censoring distribution is either Uniform(uc , uC ) or Exponential(λC ). Parameters of underlying distributions are shown in the title of the top diagram. To choose, for instance, exponential truncation and censoring, set trunc = "Expon", cens = "Expon" and then either use default parameters or assign wished ones. To choose a manual interval [a, a + b], assign wished values to arguments a and b.} [n = 300, corn = 2, trunc = "Unif", ut = 0, uT = 0.5, lambdaT = 0.3, cens = "Unif", uc = 0, uC = 1.5, lambdaC = 1.5, a = NA, b = NA, alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
the estimates from the underlying survival function, and we can also see the relatively large
width of the band which predicts such a possibility. In the bottom diagram, despite the
skewed LTRC observations, the density E-estimate correctly shows the unimodal shape of
the underlying Normal density. The estimated coefficient of difficulty, shown in the subtitle,
is equal to 1.7. The latter implies that, with respect to the case of direct observations of
X ∗ , we need 70% more LTRC observations to get the same MISE. Figure 6.14 is a good
learning tool to explore LTRC data and estimators. It also allows a user to manually choose
an interval of estimation [a, a + b], and this will be a valuable lesson on its own.
What can be said about estimation of the distribution of C ∗ and T ∗ as well as the
parameter p? Recall our discussion in Section 6.5 that X ∗ and C ∗ are “symmetric” random
variables in the sense that if we consider 1 − ∆ instead of ∆, then formally C ∗ becomes the
random variable of interest and X ∗ becomes the censoring random variable. As a result,
we can use the proposed estimators for estimation of the distribution of C ∗ , only by doing
so we need to keep in mind the assumptions and then correspondingly choose a feasible
interval of estimation. The problem of estimation of the distribution of T ∗ and parameter
p is not new for us because we can consider (T, V ) as an LT realization of an underlying
pair (T ∗ , V ∗ ). Then Section 6.6 explains how the distribution of T ∗ and parameter p can
be estimated.
We finish the section by considering a case where the main assumption about mutual
independence and continuity of the triplet of hidden random variables is no longer valid. In
some applications it is known that the censoring random variable is not smaller than the
truncating variable, and the censoring random variable may have a mixed distribution. As
an example, we may consider the model of a clinical trial where
$$C^* := T^* + U^* := T^* + [u_C B^* + (1 - B^*)U'], \qquad (6.7.17)$$
$$g(x) := P(T \le x \le V) = \frac{P(T^* \le x \le V^*, T^* \le V^*)}{P(T^* \le V^*)} = \frac{P(T^* \le x \le V^*)}{p}$$
$$= p^{-1}P(T^* \le x, X^* \ge x, C^* \ge x) = p^{-1}G^{X^*}(x)P(T^* \le x \le C^*). \qquad (6.7.18)$$
In the last equality the independence of X ∗ and (T ∗ , C ∗ ) was used. Next, we can write,
$$P(V \le x, \Delta = 1) = P(X^* \le x, X^* \le C^* \mid T^* \le V^*) = p^{-1}P(X^* \le x, T^* \le X^* \le C^*) = p^{-1}\int_0^x f^{X^*}(u)P(T^* \le u \le C^*)du. \qquad (6.7.19)$$
Differentiation of (6.7.19) with respect to x and then using (6.7.18) allows us to write,
∗
∗ f X (x)g(x)
f V,∆ (x, 1) = p−1 f X (x)P(T ∗ ≤ x ≤ C ∗ ) = . (6.7.20)
GX ∗ (x)
In its turn, (6.7.20) implies that the density of interest can be written as
$$f^{X^*}(x) = \frac{f^{V,\Delta}(x, 1)\,G^{X^*}(x)}{g(x)}, \quad 0 \le x < \beta_{X^*}. \qquad (6.7.21)$$
Figure 6.15 Estimation of the conditional survival function and the conditional density of a lifetime
of interest for LTRC data when C ∗ := T ∗ + U ∗ as in (6.7.17). Variables in the triplet (X ∗ , T ∗ , U ∗ )
are mutually independent. Variable U ∗ is the mixture of a Uniform(uc , uC ) random variable with the
constant uC , and the probability P(U ∗ = uC ) is controlled by the argument censp. Otherwise the
underlying simulation and the structure of diagrams are as in Figure 6.14. [n = 300, corn = 2,
trunc = "Unif", ut = 0.2, uT = 0.7, lambdaT = 0.3, cens = "Unif", uc = 0, uC = 0.6, lambdaC
= 1.5, censp = 0.2, a = NA, b = NA, alpha = 0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
This is the same formula as (6.7.9) which was established for the case of independent and
continuous hidden random variables. Hence an estimator, motivated by that formula, can
be used for the studied case as well. Further, if $\alpha_{T^*} > \alpha_{X^*}$ then we estimate the conditional characteristics $G^{X^*|X^*>\alpha_{T^*}}(x)$ and $f^{X^*|X^*>\alpha_{T^*}}(x)$.
Figure 6.15 allows us to test this conclusion for the model (6.7.17). Its structure
is similar to Figure 6.14 and the caption explains the underlying LTRC mechanism where
the parameter censp controls the choice of P(U ∗ = uC ). In the considered simulation this
parameter is equal to 0.2, meaning that in the example of a clinical trial 20% of participants
are right censored by the end of the trial. The used LTRC mechanism creates challenges
for the estimators because αT∗ = ut = 0.2 > αX ∗ = 0. As a result, theoretically we may
estimate only the underlying conditional survival function and the conditional density given
X ∗ > ut = 0.2. The top diagram indicates that the truncation variable is separated from
zero and it is likely that αT ∗ is close to 0.2. Further, note that there are just a few observations
in the tails. The middle diagram shows that despite these challenges, the conditional survival
function is estimated relatively well. The conditional density E-estimate is very good and
correctly indicates the underlying conditional density $f^{X^*|X^*>\alpha_{T^*}}(x)$ despite the heavily
skewed LTRC observations. The estimated coefficient of difficulty is equal to 2.1, and this,
together with the large confidence bands, sheds light on the complexity of the problem and
the possibility of poor estimates in other simulations. Nonetheless, it is fair to conclude that
the estimators are robust to the above-discussed deviations from the basic LTRC model.
LTRC is one of the most complicated modifications of data, and it is highly recommended to use Figures 6.14 and 6.15 to learn more about this important statistical problem. Exploring different underlying distributions and parameters will help to gain the necessary experience in dealing with LTRC data.
We also know from Section 6.5 that the distribution of Y cannot be estimated beyond the
value βV = min(βY , βC ) (recall our notation βZ for the upper bound of the support of a
random variable Z). Hence, if
βY < βC , (6.8.3)
then the distribution of Y and the regression function (6.8.1) may be consistently estimated,
otherwise the distribution of Y may be recovered only up to value βV = βC . As a result,
let us introduce a censored (or we may say trimmed) regression function
where
$$\hat g(v) := n^{-1}\sum_{l=1}^n I(V_l \ge v). \qquad (6.8.7)$$
In its turn, this Fourier estimator allows us to construct the regression E-estimator
m̂(x, βV ). Further, given (6.8.3) this estimator consistently estimates m(x).
Figure 6.16 allows us to understand the model, observations, and a possible estimation
of the regression function. The underlying simulation and diagrams are explained in the
caption. The top diagram shows us the available sample from (X, V, ∆) where uncensored
cases (when ∆ = 1) are shown by circles and censored cases by crosses. The underlying
Figure 6.16 Regression with heavily censored responses. The responses are independent exponential
random variables with mean a + f (X) where f is a corner function, here it is the Normal. The
predictor is the Uniform. The censoring variable is exponential with the mean λC . The top diagram
shows by circles observations with uncensored responses and by crosses observations with censored ones, $N := \sum_{l=1}^n \Delta_l$. Underlying scattergram of the hidden sample from (X, Y ) is shown in the
middle diagram, and it is overlaid by the underlying regression function (the solid line) and its
E-estimate based on this sample (the dot-dashed line). The bottom diagram shows the underlying
regression (the solid line), the E-estimate based on RC data shown in the top diagram (the dashed
line), the E-estimate based solely on cases with uncensored responses shown by circles in the top
diagram (the dotted line), and the dot-dashed line is the estimate shown in the middle diagram.
{Parameter λC is controlled by argument lambdaC, function f is chosen by argument corn.} [n =
300, corn = 2, a = 0.3, lambdaC = 2, cJ0 = 4, cJ1 = 0.5, cTH = 4, c = 1.]
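A minimal R sketch of the simulation described in this caption may look as follows; the particular shape of f.corner is only a stand-in for the book's Normal corner function:

# Simulated regression data with right-censored responses (Figure 6.16 setting).
set.seed(1)
n <- 300; a <- 0.3; lambdaC <- 2
X <- runif(n)                                  # Uniform predictor
f.corner <- function(x) dnorm(x, 0.5, 0.15)    # stand-in for the Normal corner function
Y <- rexp(n, rate = 1/(a + f.corner(X)))       # response with mean a + f(X)
C <- rexp(n, rate = 1/lambdaC)                 # exponential censoring with mean lambdaC
V <- pmin(Y, C)                                # observed censored response
Delta <- as.numeric(Y <= C)                    # indicator of an uncensored response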
sample from (X, Y ) is shown in the middle diagram. It is generated by the Uniform predictor
X and Y being an exponential variable with the mean equal to 0.3 + f (X) where f (x) is the
Normal corner function. Note that the mean is the regression function and it is shown by
the solid line. Pay attention to the large volatility of the exponential regression and the
range of underlying responses. The dot-dashed line is the E-estimate based on the hidden
sample, and it may be better visualized in the bottom diagram. The E-estimate is good,
and also the scattergram corresponds to a unimodal and symmetric regression function.
Now let us return to the top diagram which shows available data and compare them with
the underlying ones shown in the middle diagram. First of all, let us compare the scales.
Figure 6.17 Regression with mildly censored responses. The simulation and diagrams are the same
as in Figure 6.16, only here the smaller sample size n = 200 and larger mean λC = 5 of the
exponential censoring variable are used. [n = 200, corn = 2, a = 0.3, lambdaC = 5, cJ0 = 4, cJ1
= 0.5, cTH = 4, c = 1.]
The value of the largest observed V(n) is close to 6 while the value of the largest Y(n) is close to 14. This is a sign of severe censoring. Further, practically all larger underlying responses are censored, with just a few uncensored responses being larger than 3. Further, among n = 300 underlying responses, only N = 199 are uncensored, that is, every third response is censored.
If we return to our discussion of the assumption (6.8.3) and the censored regression, we may
expect here that a regression estimate, based on the censored data, may be significantly
smaller than the underlying regression. The bottom diagram supports this prediction. The
dashed line is the E-estimate, and it indeed significantly underestimates the underlying
regression function shown by the solid line. Nonetheless, the proposed estimator does a
better job than a complete-case approach which yields an estimate shown by the dotted
line. Recall that a complete-case approach simply ignores cases with censored responses.
What we see is that despite a relatively large sample size, the observed V(n) is too small
and this explains the underperformance of the E-estimator.
Of course, the simulation of Figure 6.16 implies severe censoring. Let us relax it a bit by considering a censoring variable with an exponential distribution, only now with a larger mean λC = 5. This should produce more uncensored observations and a better regression
estimation. A particular outcome is shown in Figure 6.17 where a reduced sample size n = 200 is also chosen for better visualization of scattergrams. Otherwise, the simulation and
diagrams are the same as in Figure 6.16.
Let us begin with comparison of observations in the top and middle diagrams. First of
all, note that the scales are about the same, and V(n) is close to Y(n) . Further, now less than
a quarter of responses is censored. This is still a significant proportion but dramatically
smaller than in Figure 6.16. Now let us look at the estimates. The E-estimates based on
the underlying and censored data are close to each other (compare dot-dashed and dashed
curves), and they are much better than the E-estimate based on complete cases (the dotted
line).
It is highly recommended to repeat these two figures with different parameters and gain
experience in dealing with censored responses. It is useful to manually analyze scattergrams
and try to draw a reasonable regression which takes into account the nature of the data. Further,
use different regression functions and check whether the E-estimator allows us to
make correct conclusions about modes, namely their number, locations and relative
magnitudes.
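The spirit of these simulations is easy to reproduce in a few lines of R. The sketch below is only an illustration under assumed ingredients: a hypothetical unimodal function m(x) stands in for the Normal corner function, the responses are generated as exponential with mean m(X) to mimic the exponential regression discussed above, and loess is used in place of the E-estimator.

# A sketch of a regression simulation with right censored responses.
set.seed(1)
n <- 300                                      # sample size
lambdaC <- 3                                  # mean of the exponential censoring variable
m <- function(x) 1 + 2 * dnorm(x, 0.5, 0.15)  # hypothetical unimodal regression function
X <- runif(n)                                 # predictors
Y <- rexp(n, rate = 1 / m(X))                 # hidden responses with E{Y|X} = m(X)
C <- rexp(n, rate = 1 / lambdaC)              # censoring variables
V <- pmin(Y, C)                               # observed censored responses
Delta <- as.numeric(Y <= C)                   # indicators of uncensored responses
c(max.V = max(V), max.Y = max(Y), N = sum(Delta))   # compare the scales and count uncensored cases
fit.cc <- loess(V[Delta == 1] ~ X[Delta == 1])      # complete-case fit (stand-in for the E-estimator)
fit.hidden <- loess(Y ~ X)                          # fit based on the hidden responses

Varying lambdaC in this sketch shows how the censoring strength changes the proportion of censored responses and the gap between the complete-case fit and the hidden-data fit.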
This formula allows us to get the joint density of (U, Y ) given ∆ = 1, that is the joint
density of observations for uncensored (complete) cases. Write,
Figure 6.18 panels: two scattergrams (top vertical axis Y) and a bottom panel Uncensored-Case Regression (vertical axis m(x)).
Figure 6.18 Regression with censored predictors. The underlying predictors have the Uniform dis-
tribution, the responses are m(X) + σε where m(x) is the Strata and ε is standard normal, and the
censoring variable is Uniform(0, uC ). The top diagram shows by circles observations with uncen-
sored predictors and by crosses observations with censored predictors, N := Σ_{l=1}^n ∆_l is the number
of uncensored predictors. Underlying scattergram of the hidden sample from (X, Y ) is shown in the
middle diagram, and it is overlaid by the underlying regression function (the solid line) and its
E-estimate based on this sample (the dotted line). The bottom diagram shows data and curves over
an interval [0, U(n) ]. Circles show observations with uncensored predictors (they are identical to
circles in the top diagram). These observations are overlaid by the underlying regression (the solid
line), the E-estimate based on uncensored-case observations (the dashed line), and the E-estimate
based on underlying observations (the dotted line) and it is the same as in the middle diagram.
{The regression function is chosen by argument corn, parameter uC is controlled by the argument
uC.} [n = 300, corn = 4, sigma = 1, uC = 1.1, cJ0 = 4, cJ1 = 0.5, cTH = 4]
familiar curse of severe censoring. On the other hand, we may always consistently estimate
the left tail of the regression.
Let us compare these conclusions with those for regressions with MAR data (recall that the latter
were discussed in Chapter 4). For the setting of MAR responses a complete-case approach
is optimal, and for the setting of MAR predictors a special estimator, which uses all obser-
Figure 6.19 panels: two scattergrams (top vertical axis Y) and a bottom panel Uncensored-Case Regression (vertical axis m(x)).
Figure 6.19 Regression with heavily censored predictors. The simulation and the diagrams are the
same as in Figure 6.18 only here uC = 0.9. [n = 300, corn = 4, sigma = 1, uC = 0.9, cJ0 = 4,
cJ1 = 0.5, cTH = 4]
vations, is needed. As we now know, for regression with RC data the outcome is different.
Here a complete-case approach is optimal for the setting of censored predictors and a special
estimator, based on all observations, is needed for the setting of censored responses. This is
a teachable moment because each modification of data has its own specifics and requires a
careful statistical analysis.
Figure 6.18 allows us to understand the discussed regression with right censored predictor
and the proposed solution. Its caption explains the simulation and the diagrams. The top
diagram shows the scattergram of censored observations with circles indicating cases with
uncensored predictors and crosses indicating cases with censored predictors. Note that the
largest available predictor is near 1, and this hints that the censoring variable may take
even larger values. And indeed, the censoring variable is supported on [0, uC ] with uC = 1.1.
Further, while the underlying predictor X has the Uniform distribution, the observed U is
skewed toward smaller values. We see this in the diagram, and it also follows from the relation GU (x) =
GX (x)GC (x). The latter allows us to conclude that RC predictors may create problems for
estimation of the right tail of the regression. Further, note that 45% of the predictors are censored,
which is a significant loss of data.
The middle diagram shows us the underlying (hidden) scattergram, the regression func-
tion and its E-estimate. The estimate is not perfect, but note the large standard deviation
σ = 1 of the regression error (controlled by the argument sigma) and recall that the Strata
is a difficult function to estimate well. The bottom diagram sheds light on the pro-
posed uncensored-case approach. If we look at the exhibited uncensored cases, shown by
circles, then we may visualize the underlying Strata regression. Of course, just a few obser-
vations in the right tail complicate the visualization, but the E-estimator does a good job
and the E-estimate (the dashed line) is comparable with the estimate based on all hidden
observations (the dotted line).
What will happen if we use more severe censoring, for instance with uC = 0.9? A
corresponding outcome is shown in Figure 6.19. The top diagram shows that the largest
uncensored predictor is about 0.85, while the largest observation of the censoring variable
is near 0.9. This tells us that it is likely that the support of U is defined by the censoring
variable C, and this is indeed the case here. Further, note that we have just a few uncensored
predictors with values larger than 0.75. Moreover, only N = 105 of the underlying n = 300
predictors are uncensored, and their values are heavily skewed toward smaller ones. What we see is an
example of a severe censoring.
The middle diagram shows that the sample of underlying hidden observations is rea-
sonable and, keeping in mind the large regression noise (it is possible to reduce or increase
it using the argument sigma), the E-estimate is fair. The bottom diagram shows us the
E-estimate based on uncensored cases (the dashed line) which can be compared with the
underlying regression (the solid line) and the E-estimate of the middle diagram (the dotted
line). As we see, despite the small sample size N = 105 of uncensored cases and the large
regression noise, the uncensored-case E-estimate is relatively good. Let us stress that it is
calculated for the interval [0, U(n) ], and that we have no information about the underlying
regression beyond this interval.
Repeated simulations of Figure 6.18, using different parameters and underlying regres-
sions, may help to shed new light on the interesting and important statistical problem of
regression with a censored predictor.
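The censored-predictor setting of Figures 6.18 and 6.19 can be mimicked in the same way. In the sketch below a hypothetical m(x) replaces the Strata corner function, and loess again stands in for the E-estimator; the uncensored-case fit is computed only over the observed range of U, in line with the discussion above.

# A sketch of a regression simulation with right censored predictors.
set.seed(2)
n <- 300; sigma <- 1; uC <- 1.1
m <- function(x) 2 * sin(4 * pi * x)          # hypothetical regression function (not the Strata)
X <- runif(n)                                 # hidden Uniform predictors
Y <- m(X) + sigma * rnorm(n)                  # responses (always observed)
C <- runif(n, 0, uC)                          # Uniform(0, uC) censoring variables
U <- pmin(X, C)                               # observed censored predictors
Delta <- as.numeric(X <= C)                   # indicators of uncensored predictors
N <- sum(Delta)                               # number of uncensored predictors
fit.uc <- loess(Y[Delta == 1] ~ U[Delta == 1])   # uncensored-case fit over [0, max(U)]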
6.10 Exercises
6.1.1 Verify (6.1.4).
6.1.2 Prove that if the hazard rate is known, then the survival function can be calculated
according to formula (6.1.3).
6.1.3 Verify (6.1.5).
6.1.4 Consider the Weibull distribution with the shape parameter k defined below line
(6.1.7). Prove that if k < 1 then the hazard rate is decreasing and if k > 1 then it is
increasing.
6.1.5∗ Suppose that X and Y are two independent lifetimes with known hazard rates and
Z := min(X, Y ). Find the hazard rate of Z.
6.1.6 Propose a hazard rate whose shape resembles a bathtub.
6.1.7∗ Prove that for any hazard rate h^Y the relation ∫_0^∞ h^Y(y)dy = ∞ holds. Furthermore, if S is the support of Y then ∫_S h^Y(y)dy = ∞.
6.1.8 Prove that G^Y(y) = G^Y(a) e^{−∫_a^y h^Y(u)du} for any a ∈ [0, y].
6.1.9 Verify (6.1.6) and (6.1.7). Hint: Here the scale-location transformation is considered.
Begin with the cumulative distribution function and then take the derivative to get the
corresponding density.
6.1.10∗ Propose a series estimator of the hazard rate which uses the idea of transformed
Z = (X − a)/b.
6.1.11 Verify expression (6.1.8) for Fourier coefficients of the hazard rate.
6.1.12 Prove that the oracle-estimator θ̃j∗ , defined in (6.1.10), is unbiased estimator of θj .
6.1.13 Find variance of the estimator (6.1.10). Hint: Note that the oracle is a classical
sample mean estimate, and then use formula for the variance of the sum of independent
random variables.
6.1.14 Find the mean and variance of the empirical survival function (6.1.14). Hint: Note
that this estimate is the sample mean of independent and identically distributed Bernoulli
random variables. Furthermore, the sum has a Binomial distribution.
6.1.15∗ Use Hoeffding’s inequality for the analysis of large deviations of the empirical
survival function (6.1.14).
6.1.16∗ Evaluate the mean and variance of the Fourier estimator (6.1.15). Hint: Begin with
the case of known nuisance functions, consider the asymptotic as n and j increase, and then
show that the plug-in methodology is valid.
6.1.17∗ Explain why (6.1.17) is a reasonable estimator of the coefficient of difficulty. Hint:
Replace the estimated survival function by an underlying survival function GY , and then
calculate the mean and variance of this oracle-estimator.
6.1.18∗ Explain why the ratio-estimator (6.1.19) may be a good estimator of the hazard
rate. What are possible drawbacks of the estimator?
6.1.19 Repeat Figure 6.1 for sample sizes n =100, 200, 300, 400, 500 and make your con-
clusion about corresponding feasible intervals of estimation. Comment about your method
of choosing an interval. Does a sample size affect the choice of interval?
6.1.20 Repeat Figure 6.1 twenty times, write down ISEs for the estimates, and then rank
the estimates based on results of this numerical study.
6.1.21 Repeat Figure 6.1 for different underlying distributions. Choose largest feasible
intervals of estimation for each distribution.
6.1.22 What are the optimal parameters of the E-estimator for the experiment considered
in Figure 6.1?
6.1.23 Explain underlying simulations and all histograms in Figure 6.1.
6.1.24∗ Write down the hazard rate E-estimator and explain parameters and statistics used.
6.2.1 Give several examples of right censoring. Further, present an example of left censoring
and explain the difference.
6.2.2 Explain the mechanism of right censoring. How are available observations in right-
censored data related to underlying random variables? Hint: Check (6.2.1).
6.2.3 Are random variables V and ∆ dependent, independent or is there not enough infor-
mation to answer?
6.2.4 Explain formula (6.2.2). Why is its last equality valid?
6.2.5 Prove (6.2.3). Formulate assumptions needed for its validity.
6.2.6 Formulate assumptions and verify (6.2.4). Describe the support of the density. Is this
a mixed density? What is the definition of a mixed density?
6.2.7 Are censored observations biased? If the answer is “yes,” then what is the biasing
function?
6.2.8 Explain formula (6.2.5). Does it express the hazard rate via functions that can be
estimated?
6.2.9 What is the motivation behind expression (6.2.6) for Fourier coefficients?
6.2.10∗ Explain the Fourier estimator (6.2.7) and then describe its statistical properties.
Hint: Find the mean, the variance and the probability of large deviations using the Hoeffding
inequality.
6.2.11∗ What is the coefficient of difficulty? What is its role in nonparametric estimation?
6.2.12∗ Verify (6.2.9) and (6.2.10).
6.2.13 Explain the simulation used to create Figure 6.2.
6.2.14 Repeat Figure 6.2 and write down analysis of its diagrams.
6.2.15 Describe all parameters/arguments of Figure 6.2. What do they affect?
6.2.16 Repeat Figure 6.2 about 20 times and make your own conclusion about the effect
of estimator ĜV on the hazard rate E-estimator.
6.2.17 Explain how the interval of estimation affects the hazard estimator. Support your
conclusion using Figure 6.2.
6.2.18 Consider different distributions of the censoring variable. Then, using Figure 6.2,
present a report on how those distributions affect a reasonable interval of estimation and
the quality of E-estimator.
6.2.19∗ Explain formula (6.2.11) for the empirical coefficient of difficulty. Then calculate
its mean and variance.
6.2.20∗ Propose a hazard rate E-estimator for left censored data.
6.3.1 Give an example of left truncated data.
6.3.2 Describe an underlying stochastic mechanism of left truncation. Is it a missing mech-
anism?
6.3.3 Formulate the probabilistic model of truncation.
6.3.4 Assume that the sample size of a hidden sample is n. Is the sample size of a corre-
sponding truncated sample deterministic or stochastic? Explain your answer.
6.3.5 What are the mean and variance of the sample size of a truncated sample given that
n is the size of an underlying hidden sample?
6.3.6 Explain and verify (6.3.3). What are the used assumptions?
6.3.7 Explain and verify (6.3.4). Hint: Pay attention to the support of this bivariate density.
6.3.8∗ Assume that T is a discrete random variable. How will (6.3.3) and (6.3.4) change?
6.3.9 Verify (6.3.5).
6.3.10∗ Explain and verify (6.3.6). Pay attention to the conditions when the equality holds.
Will the equality be valid for points x ≤ αT ∗ ?
6.3.11 Does truncation imply biasing? If the answer is “yes,” then what is the biasing
function?
6.3.12 Does left truncation skew a hidden sample of interest to the left or the right?
6.3.13∗ Explain the motivation of introducing the probability g(x) in (6.3.8). Hint: Can
this function be estimated based on a truncated sample? Then look at (6.3.9).
6.3.14∗ Verify relations in (6.3.8).
6.3.15 Verify (6.3.9).
6.3.16∗ How can (6.3.9) be used for estimation of the hazard rate of interest?
6.3.17 Explain the underlying idea of Fourier estimator (6.3.10). Hint: Replace the estimate
ĝ by g and show that this is a sample mean estimator.
6.3.18∗ What is the mean of the estimator (6.3.10)? Is it unbiased or asymptotically unbi-
ased? Calculate the variance.
6.3.19 Explain the motivation behind the estimator (6.3.11).
6.3.20 What are the mean and variance of the estimator (6.3.11)?
6.3.21∗ Use Hoeffding’s inequality for the analysis of the estimator (6.3.11). Then explain
why this estimator may be used in the denominator of (6.3.10).
6.3.22∗ Verify (6.3.12).
6.3.23 Explain the motivation behind the estimator (6.3.13).
6.3.24∗ What are the mean and the variance of the estimator (6.3.13)?
6.3.25∗ Explain how an E-estimator of the hazard rate is constructed.
6.3.26 Describe the underlying simulation that creates data in Figure 6.3.
6.3.27 Repeat Figure 6.3 a number of times and recommend a feasible interval of estimation.
6.3.28 Using Figure 6.3, propose “good” values for parameters of the E-estimator.
6.3.29 Using Figure 6.3, explore the issue of how truncating variables affect quality of
estimation of an underlying hazard function.
6.3.30 Can Figure 6.3 be used for statistical analysis of the used confidence bands? Test
your suggestion.
6.3.31 Rank corner distributions according to difficulty in estimation of their hazard rates.
Then check your conclusion using Figure 6.3.
6.3.32 Confidence bands may take on negative values. How can they be modified to take
into account that a hazard rate is nonnegative?
6.3.33∗ Consider the three presented examples of left truncation (actuarial, startups and clinical
trials), and explain how each may be “translated” into another.
6.3.34∗ Write down the hazard rate E-estimator and explain parameters and statistics used.
6.4.1 Give several examples of LTRC data.
6.4.2 Explain an underlying stochastic mechanism of creating LTRC data.
6.4.3 Present examples when the LT is followed by the RC and vice versa.
6.4.4 Explain formula (6.4.1). Can the probability be estimated?
6.4.5 Suppose that the sample size of an underlying hidden sample of interest is n. What
is the distribution of the sample size of a LTRC sample? What are its mean and variance?
6.4.6 Present examples when truncating and censoring variables are dependent and inde-
pendent.
6.4.7∗ Explain formula (6.4.2) for the cumulative distribution function of the triplet of
random variables observed in a LTRC sample. Comment on the support. What does the
formula tell us about a possibility of consistent estimation of the distributions of X ∗ , T ∗
and C ∗ ?
6.4.8 Verify formula (6.4.3) and explain assumptions under which it is correct.
6.4.9 Prove validity of (6.4.4).
6.4.10 What is the meaning of the joint density (6.4.4)? Hint: note that one of the variables
is discrete.
6.4.11 Verify expression (6.4.5) for the marginal mixed density of (V, ∆).
6.4.12∗ Verify validity of formula (6.4.6) for the density of the variable of interest X ∗ .
Explain the assumptions when it is valid. Can this formula be used for construction of a
consistent E-estimator and what are the assumptions?
6.4.13∗ Why is probability (6.4.7) a pivotal step in constructing an E-estimator?
6.4.14 Explain formula (6.4.7).
6.4.15 Can the function g(x) be estimated based on LTRC data?
6.4.16 Verify formulas (6.4.8) and (6.4.9). Explain the assumptions.
6.4.17∗ Suggest an estimator for h^{X∗}(x). Hint: Use (6.4.9).
6.4.18 Is the Fourier estimator (6.4.10) a sample mean estimator?
6.4.19 Why can ĝ(x), defined in (6.4.11), be used in the denominator of (6.4.10)?
6.4.20∗ Present a theoretical statistical analysis of estimators (6.4.10) and (6.4.11). Hint:
Begin with the distribution and then write down the mean, the variance, and for estimator
(6.4.11) use the Hoeffding inequality to describe large deviations.
6.4.21 Verify (6.4.12).
6.4.22 Explain the estimator (6.4.13). Is it asymptotically unbiased?
6.4.23 Conduct a series of simulations, using Figure 6.4, and explore the effect of estimate
ĝ on the E-estimate of the hazard rate.
6.4.24 Choose a set of sample sizes, truncated and censoring distributions, and then try to
determine a feasible interval of estimation for each underlying model. Explain your choice.
6.4.25 Suggest better values of parameters of the E-estimator. Does your recommendation
depend on distributions of truncating and censoring variables?
6.4.26 What is the main difference between simulations in Figures 6.4 and 6.5?
6.4.27 Present several examples where T ∗ and C ∗ are related.
6.4.28 Present several examples where P(C ∗ ≥ T ∗ ) = 1 and C ∗ has a mixed distribution.
6.4.29∗ Verify each equality in (6.4.14). Explain the used assumptions.
6.4.30 Establish validity of (6.4.16). What are the used assumptions?
6.4.31 Prove (6.4.17).
6.4.32 Explain (6.4.18) and the underlying assumptions.
6.4.33 Describe the underlying simulation of Figure 6.5.
6.4.34 Explain diagrams in Figure 6.5.
6.4.35 For estimation of the hazard rate, is the model of Figure 6.5 more challenging than
of Figure 6.4?
6.4.36 Using Figure 6.5, suggest better parameters of the E-estimator.
6.4.37 Using Figure 6.5, infer about performance of the confidence bands.
6.4.38∗ Consider a setting where the made assumptions are violated. Then propose a con-
sistent hazard rate estimator or explain why this is impossible.
6.5.1 Present several examples of RC observations.
6.5.2 Is there something in common between RC and MNAR? If the answer is “yes,” then
why does MNAR typically preclude consistent estimation while RC does not?
6.5.3 Explain validity of (6.5.1) and the underlying assumption.
6.5.4 Prove (6.5.2).
6.5.5 Under RC, over what interval may the distribution of interest be estimated?
6.5.6 Describe assumptions that are sufficient for consistent estimation of the distribution
of X.
6.5.7 Explain how the Kaplan–Meier estimator is constructed.
6.5.8∗ What is the underlying motivation of the Kaplan–Meier estimator? Why is it called
a product limit estimator? Evaluate its mean and variance.
6.5.9 Explain formulae (6.5.6) and (6.5.7). Under what assumptions are they valid?
6.5.10 Explain how to estimate the survival function of the censoring random variable C.
6.5.11 Verify (6.5.9).
6.5.12∗ Find the mean and variance of the estimator (6.5.10).
6.5.13 Explain and then verify (6.5.12).
6.5.14 What is the motivation behind the Fourier estimator (6.5.13)?
6.5.15∗ Find the mean and variance of the estimator (6.5.13).
6.5.16 Using (6.5.14) explain how the estimator (6.5.15) is constructed.
6.5.17 Explain the simulation of Figure 6.6.
6.5.18 Explain the simulation of Figure 6.7.
6.5.19 What is the difference between simulations in Figures 6.6 and 6.7?
6.5.20 Propose better values of parameters for E-estimators used in Figures 6.6 and 6.7.
Are they different? Explain your findings.
6.5.21∗ Consider a setting where some of the made assumptions are no longer valid. Then
explore a possibility of consistent estimation.
6.6.1 Describe the model of LT.
6.6.2 Present several examples of LT observations. Based solely on LT observations, can
one conclude that the observations are LT?
6.6.3 Is LT based on a missing mechanism? Is the missing MNAR? Typically MNAR precludes
consistent estimation. Is the latter also the case for LT?
6.6.4 Explain the assumption (6.6.1) and its importance.
6.6.5 Why is the assumption (6.6.2) important?
6.6.6 Verify each equality in (6.6.3) and explain used assumptions.
6.6.7 Verify (6.6.5) and explain the used assumption.
6.6.8 Establish (6.6.6) and (6.6.7).
6.6.9∗ Explain the underlying idea of estimators defined in (6.6.9). Find their expectations.
Can these estimators be improved?
6.6.10 Verify all relations in (6.6.10).
6.6.11∗ Explain construction of the estimator (6.6.11). Find its mean and variance.
6.6.12∗ Conduct a statistical analysis of the estimator (6.6.12). Hint: Describe the distri-
bution and its properties.
6.6.13∗ Explain the motivation behind the estimator (6.6.14). Evaluate its mean and vari-
ance.
6.6.14 Verify (6.6.16).
6.6.15∗ Suggest an E-estimator of the density f^{X∗}.
6.6.16∗ Find the mean and variance of Fourier estimator (6.6.19).
6.6.17 What is the motivation behind the estimator (6.6.21)? Can you propose another
feasible estimator?
6.6.18 Explain the simulation used by Figure 6.8.
6.6.19 Repeat Figure 6.8 and analyze diagrams.
6.6.20 Use Figure 6.8 and compare performance of the Kaplan–Meier estimator with the
sample mean estimator.
6.6.21 Explain the underlying idea of estimator (6.6.22).
6.6.22 Use Figure 6.9 to compare performance of the two estimators.
6.6.23 Explain all relations in (6.6.23).
6.6.24 Explain diagrams in Figure 6.10. Then use it for statistical analysis of the proposed
density estimator.
6.6.25 Explain the underlying simulation in Figure 6.11.
6.6.26 Repeat Figure 6.11 for different sample sizes. Write a report about your findings.
6.6.27 Suggest better values for parameters of the E-estimator used in Figure 6.11. Is your
recommendation robust to changes in other arguments of the figure?
6.6.28 Explain how the cumulative distribution function of T ∗ can be estimated.
6.6.29 Explain how the probability density of T ∗ can be estimated.
6.6.30 Describe the underlying simulation in Figure 6.12.
6.6.31 Explain diagrams in Figure 6.12.
6.6.32 Describe E-estimators used in Figure 6.12.
6.6.33 Explain formula (6.6.29).
6.6.34∗ Evaluate the mean and variance of Fourier estimator (6.6.30).
6.6.35 Prove (6.6.31).
6.6.36 Verify (6.6.32), and then explain why we are interested in the analysis of E{Ĥ^{X∗}(x)}.
6.6.37∗ Show that the second expectation in the right side of (6.6.32) vanishes as n increases.
Hint: Propose any needed assumptions.
6.6.38 Verify (6.6.33).
6.6.39 Prove (6.6.34). Then explain the meaning of the conditional survival function.
6.6.40 Given αX ∗ < αT∗ , explain why the distribution of X ∗ cannot be consistently esti-
mated, and then explain what may be estimated.
6.6.41 Explain all curves in the middle diagram of Figure 6.11. Then repeat it and analyze
the results. Is estimation of the conditional survival function robust?
6.6.42 What does p̂ estimate in Figure 6.11?
6.6.43 Explain each relation in (6.6.35).
6.6.44∗ Show that the second expectation on the right side of (6.6.35) vanishes as n in-
creases. Hint: Make a reasonable assumption.
6.6.45 Verify (6.6.36).
6.6.46 Explain the definition of conditional density (6.6.37). Then comment on what can
and cannot be estimated given αT ∗ > αX ∗ .
6.6.47 Consider the bottom diagram in Figure 6.11 and explain the curves. Then repeat
Figure 6.11 with different parameters and explore robustness of the proposed estimators.
6.6.48∗ Explain all steps in construction of the E-estimator f̂^{T∗}(t). Formulate necessary assumptions.
6.6.49 Use Figure 6.13 and analyze statistical properties of the confidence bands.
6.6.50 Is the relation P(T(n) ≤ X(n) ) = 1 valid?
6.6.51 Is there any relationship between X(1) and T(1) ?
6.6.52∗ Relax one of the used assumptions and propose a consistent estimator of f^{X∗}(x) or prove that the latter is impossible.
6.7.1 Present examples of left truncated, right censored and LTRC data.
6.7.2 What is the difference (if any) between truncated and censored data?
6.7.3 Explain how LTRC data may be generated.
6.7.4 Find a formula for the probability of an observation in a LTRC simulation.
6.7.5 Explain each equality in (6.7.2).
6.7.6∗ Using (6.7.3), obtain formulas for corresponding marginal densities.
6.7.7 Can the probability (6.7.7) be estimated based on LTRC data?
6.7.8∗ Explain assumptions (6.7.8). What happens if they do not hold?
6.7.9 Why is formula (6.7.9) critical for suggesting a density E-estimator?
6.7.10 Explain how formula (6.7.10) is obtained.
6.7.11∗ What is the motivation behind the estimator (6.7.11)? Find its mean and variance.
6.7.12∗ Present statistical analysis of the estimator ĝ. Hint: Think about its distribution.
6.7.13 Verify (6.7.13).
6.7.14 Explain all relations in (6.7.14).
6.7.15∗ Is (6.7.15) a sample mean Fourier estimator? Find its mean and variance.
6.7.16∗ Verify (6.7.16).
6.7.17 Explain how underlying distributions of the truncating and censoring random vari-
ables affect the coefficient of difficulty of the density E-estimator.
6.7.18 Explain the underlying simulation used in Figure 6.14.
6.7.19 Explain diagrams in Figure 6.14.
6.7.20 Using Figure 6.14, present statistical analysis of the E-estimator.
6.7.21 How well do the confidence bands perform? Hint: Use repeated simulations of Figure
6.14.
6.7.22 Explain every argument of Figure 6.14.
6.7.23∗ Write down a report about the effect of the distributions of the truncating and censoring
variables on the quality of the E-estimate. Hint: Begin with the theory based on the coefficient of
difficulty and then complement your conclusion by empirical evidence created with the help
of Figure 6.14.
6.7.24 What parameters of the E-estimator, used in Figure 6.14, would you recommend for
sample sizes n = 100 and n = 300?
6.7.25 Use Figure 6.15 and explain how the dependence between truncated and censored
variables affects the estimation.
6.7.26 Explain the motivation behind the model (6.7.17). Present several corresponding
examples.
6.7.27 Explain and verify each equality in (6.7.18).
6.7.28 Verify every equality in (6.7.19).
6.7.29 Explain how formula (6.7.21) for the density of interest is obtained.
6.7.30∗ Using (6.7.21), suggest an E-estimator of the density. Hint: Describe all steps and
assumptions.
6.7.31∗ Consider the case α_{X∗} < α_{T∗} and develop the theory of estimation of the conditional survival function G^{X∗|X∗>α_{T∗}}(x).
6.7.32∗ Consider the case α_{X∗} < α_{T∗} and develop the theory of estimation of the conditional density f^{X∗|X∗>α_{T∗}}(x).
6.7.33 Using Figure 6.14, explore the proposed E-estimators for the case αX ∗ < αT ∗ .
6.8.1 Present an example of a regression problem with direct observations. Then describe
a situation when response may be censored.
6.8.2 Explain (6.8.2).
6.8.3∗ What is the implication, if any, of assumption (6.8.3)? What can be done, if anything, if
(6.8.3) does not hold?
6.8.4 What is the meaning of the censored regression function (6.8.4)? Verify each equality
in (6.8.4).
6.8.5∗ Explain the underlying idea of consistent estimation of the regression.
6.8.6 Verify relations in (6.8.5).
6.8.7∗ Explain the estimator (6.8.6). Evaluate its mean and variance.
6.8.8 What is the distribution of estimator (6.8.7)?
6.8.9 Consider RC data with Y being the random variable of interest and C being the
censoring random variable. If ĜY (y) is an estimator of the survival function of Y , how can
this estimator be used for estimation of GC (z)?
6.8.10 Explain the estimator (6.8.8).
6.8.11∗ What are the mean and variance of the Fourier estimator (6.8.8)?
6.8.12∗ Consider a setting where the made assumptions do not hold. Then explore a pos-
sibility of consistent regression estimation.
6.8.13 Consider diagrams in Figure 6.16 and explain the underlying simulation.
6.8.14 What are the four curves in the bottom diagram of Figure 6.16? Why are they all
below the underlying regression? Is this always the case?
6.8.15∗ Explain theoretically how the parameter λC affects the regression estimation, and
then compare your conclusion with empirical results using Figure 6.16.
6.8.16 Suggest better parameters of the E-estimator for Figure 6.16.
6.8.17 Conduct several simulations similar to Figures 6.16 and 6.17, and then explain the
results.
6.8.18 Do you believe that values of parameters of the E-estimator should be different for
simulations shown in Figures 6.16 and 6.17? If the answer is “yes,” then develop a general
recommendation for choosing better values of the parameters.
6.9.1 Explain the mechanism of RC modification. Does this modification involve a missing
mechanism? If “yes,” then is it MAR or MNAR?
6.9.2 Present several examples of a regression with RC predictor.
6.9.3 What complications in regression estimation may be expected from RC predictor?
6.9.4∗ Write down probability formulas for all random variables involved in regression with
RC predictor.
6.9.5 Can RC predictor imply a destructive modification when a consistent regression es-
timation is impossible?
6.9.6 For the case of RC response, the notion of a censored regression was introduced. Is
there a need to use this notion for the case of RC predictor?
6.9.7 Explain a difference (if any) between regressions with censored predictor and response.
6.9.8 Is expression (6.9.2) correct? Do you need any assumptions? Prove your assertion.
6.9.9 Verify every equality in (6.9.3). Do you need any assumptions for its validity?
6.9.10∗ Explain why the relation (6.9.3) is the key in regression estimation.
6.9.11∗ Describe the random variable Z defined in (6.9.3). Propose an estimator of its
density.
6.9.12∗ Propose an E-estimator of the conditional density f Y |X (y|x).
6.9.13 What are the assumptions for consistent estimation of the regression?
6.9.14 Explain the underlying simulation used in Figure 6.18.
6.9.15∗ Explain, step by step, how the regression E-estimator, used in Figure 6.18, is
constructed.
6.9.16∗ Explore theoretically and empirically, using Figure 6.18, the effect of parameter uC
on estimation.
6.9.17 In your opinion, which of the corner functions are easier and which are more difficult to
estimate? Hint: Use Figure 6.18.
6.9.18 Repeat Figure 6.19. Comment on scattergrams and estimates.
6.9.19 Propose better values for parameters of the estimators used in Figures 6.18 and 6.19.
Explain your recommendation. Is it robust toward different regression functions?
6.9.20∗ Propose E-estimator for regression of Y on C. Explain its motivation, used proba-
bility formulas and assumptions.
6.11 Notes
Survival analysis is concerned with the inference about lifetimes, that is times to an event.
The corresponding problems occur in practically all applied fields ranging from medicine,
biology and public health to actuarial science, engineering and economics. A common feature
of available data is that observations are modified by either censoring, or truncation, or both.
There is a vast array of books devoted to this topic, ranging from those using a math-
ematically nonrigorous approach to mathematically rigorous books using a wide range of
theories including empirical processes, martingales in continuous time and stochastic inte-
gration among others. The literature is primarily devoted to parametric and semiparametric
inference as well as nonparametric estimation of the survival function, and the interested
reader can find many interesting examples, ad hoc procedures, advanced theoretical results
and a discussion of using different software packages in the following books: Kalbfleisch and
Prentice (2002), Klein and Moeschberger (2003), Martinussen and Scheike (2006), Aalen,
Borgan and Gjessing (2008), Hosmer et al. (2008), Kosorok (2008), Allison (2010, 2014),
Guo (2010), Fleming and Harrington (2011), Mills (2011), Royston and Lambert (2011),
van Houwelingen and Putter (2011), Wienke (2011), Chen, Sun and Peace (2012), Crowder
(2012), Kleinbaum and Klein (2012), Klugman, Panjer and Willmot (2012), Liu (2012), Lee
and Wang (2013), Li and Ma (2013), Allison (2014), Collett (2014), Klein et al. (2014),
Harrell (2015), Zhou (2015), Moore (2016), Tutz and Schmid (2016), and Ghosal and van
der Vaart (2017).
6.1 Nonparametric estimation of the hazard rate is a familiar topic in the literature.
Different types of estimators, including kernel, spline, classical orthogonal series and modern
wavelet methods, have been proposed. A number of procedures that adapt to the smoothness of an
underlying hazard rate function, motivated by known ones for the probability density, have
been developed. A relevant discussion and thorough reviews may be found in a number of
classical and more recent publications including Prakasa Rao (1983), Cox and Oakes (1984),
Silverman (1986), Patil (1997), Wu and Wells (2003), Wang (2005), Gill (2006), Müller and
Wang (2007), Fleming and Harrington (2011), Patil and Bagkavos (2012), Lu and Min
(2014) and Daepp et al. (2015) where further references may be found. Interesting results,
including both estimation and testing, have been obtained for the case of known restrictions
on the shape of hazard rate, see a discussion in Jankowski and Wellner (2009). The plug-in
estimation approach goes back to Watson and Leadbetter (1964), and see a discussion in
Bickel and Doksum (2007). Boundary effects are a serious problem in nonparametric curve
estimation. Complementing a trigonometric basis with polynomial functions is a standard
method of dealing with boundary effects, and it is discussed in Efromovich (1999a, 2001a,
2018a,b).
The estimator Ĝ^X(x), defined in (6.1.14), is not equal to 1 − F̂^X(x), where F̂^X(x) :=
n^{-1} Σ_{l=1}^n I(X_l ≤ x) is the classical empirical cumulative distribution function. The reason
for this is that we use the reciprocal of Ĝ^X(X_l).
Efficient estimation of the hazard rate and the effect of the interval of estimation on
the MISE is discussed in Efromovich (2016a, 2017). It is proved that the E-estimation
methodology yields asymptotically sharp minimax estimation.
6.2–6.4 The topic of estimation of the hazard rate from indirect observations, created
by left truncation and right censoring (LTRC), as well as estimation of the distribution,
have received a great deal of attention in the statistical literature with the main emphasis
on parametric models and the case of RC. See, for example, books by Cox and Oakes (1984),
Cohen (1991), Anderson et al. (1993), Efromovich (1999a), Klein and Moeschberger (2003),
Fleming and Harrington (2011), Lee and Wang (2013), Collet (2014), Harrell (2015), as well
as papers by Uzunogullari and Wang (1992), Cao, Janssen and Veraverbeke (2005), Brunel
and Comte (2008), Qian and Betensky (2014), Hagar and Dukic (2015), Shi, Chen and Zhou
(2015), Bremhorsta and Lamberta (2016), Dai, Restaino and Wang (2016), Talamakrouni,
Van Keilegom and El Ghouch (2016), and Wang et al. (2017), where further references may
be found. Estimation of the change point is an interesting and related problem in survival
analysis, see a review and discussion in Rabhi and Asgharian (2017). Bayesian approach is
discussed in Ghosal and van der Vaart (2017) where further references may be found.
Efficiency of the proposed E-estimation methodology is established in Efromovich and
Chu (2018a,b) where a numerical study and practical examples of the analysis of cancer
data and the longevity in a retirement community may be found.
6.5 Nelson–Aalen estimator of the cumulative hazard is a popular choice in survival
analysis. The estimator is defined as follows. Suppose that we observe right censored survival
times of n patients meaning that for some patients we only know that their true survival
times exceed certain censoring times. Let us denote by X1 < X2 < . . . the times when
deaths are observed. Then the Nelson–Aalen estimator for the cumulative hazard of X is
Ȟ^X(x) := Σ_{l: X_l ≤ x} (1/R_l),   (6.11.1)
where Rl is the number of patients at risk of death (that is alive and not censored) just
prior to time Xl . Note that the estimator is nonparametric.
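For illustration, (6.11.1) can be computed in a few lines of R. The sketch below assumes continuous lifetimes, so that ties among the observed death times may be ignored, and the function name is hypothetical.

nelson.aalen <- function(V, Delta, x) {
  # V: right censored survival times, Delta: indicators of observed deaths, x: evaluation points
  death.times <- sort(V[Delta == 1])                       # times when deaths are observed
  at.risk <- sapply(death.times, function(t) sum(V >= t))  # R_l: number at risk just prior to each death
  sapply(x, function(xx) sum(1 / at.risk[death.times <= xx]))
}
set.seed(3)
X <- rexp(100); C <- runif(100, 0, 2)          # lifetimes and censoring variables
V <- pmin(X, C); Delta <- as.numeric(X <= C)
nelson.aalen(V, Delta, c(0.5, 1))              # Nelson-Aalen estimates of the cumulative hazard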
If we plug the Nelson–Aalen estimator into the formula G^X(x) = exp(−H^X(x)) for the
survival function, then the obtained estimator is referred to as the Nelson–Aalen–Breslow es-
timator. The interested reader may compare this estimator with the Kaplan–Meier estimator
(6.5.3) and realize their similarity; the latter is also supported by asymptotic results. Fur-
ther, estimator (6.5.10) is formally identical to the Nelson–Aalen–Breslow estimator while
the underlying idea of its construction is based on the sample mean methodology. This
remark sheds additional light on similar performance in the simulations of the sample mean
and Kaplan–Meier estimators of survival function. Of course, there is a vast variety of dif-
ferent ideas and methods proposed in the literature, see the above-cited books as well as
Woodroofe (1985), Dabrowska (1989), Antoniadis, Gregoire and Nason (1999), Efromovich
(1999a, 2001a), De Una-Álvarez (2004), Wang (2005), Brunel and Comte (2008) and Wang
et al. (2017).
Estimation under shape restrictions is an important part of survival analysis and special
procedures are suggested for taking the restrictions into consideration, see a discussion
in Groeneboom and Jongbloed (2014). See also Srivastava and Klassen (2016). For the
E-estimation, there is no need to make any adjustment; instead, after calculating an E-
estimate, it is sufficient to take a projection onto the class of assumed functions and make the
estimate bona fide. A theoretical justification of this approach can be found in Efromovich
(2001a).
6.6 Numerical simulation is a popular statistical tool for a simultaneous analysis of sev-
eral estimators, see a discussion in Efromovich (2001a) and Efromovich and Chu (2018a,b).
A discussion of dependent data and further references may be found in El Ghouch and
Van Keilegom (2008, 2009), Liang and de Una-Álvarez (2011) and De Una-Álvarez and
Veraverbeke (2017).
6.7 An interesting extension of the discussed topic is to develop oracle inequalities
for estimators under different loss functions. Here the approaches of Efromovich (2004e,f;
2007a,b) may be instrumental. Sequential estimation is another interesting topic, see Efro-
movich (2004d) where the case of direct observations is considered. See also Su and Wang
(2012).
It is an interesting and open problem to develop a second-order efficient estimator of
the survival function. Here the approach of Efromovich (2001b, 2004c) may be instrumen-
tal. Another developing area is efficient multivariate estimation; see a discussion in
Harrell (2015) as well as corresponding results for direct observations in Efromovich (1999a,
2000b, 2010c). Specifically, an interesting approach would be to exploit the possibility of
different smoothness of the density in variables. The latter is referred to as the case of an
anisotropic distribution. Further, some of the variables may be discrete (the case of a mixed
distribution), and this also may attenuate the curse of multidimensionality, see a discussion
of the corresponding E-estimation methodology in Efromovich (2011c).
Bayesian approach may be also useful in the analysis of multivariate distributions, see
Ghosal and van der Vaart (2017).
6.8–6.9 There are a number of interesting and practically important extensions of the
considered problem. The first and natural one is to consider LTRC modifications. Sequential
estimation with assigned risk is another natural setting; here, similarly to Efromovich
(2007d,e; 2008a,c), it is possible to consider estimation with assigned risk. A specific feature of the
problem is that now censoring affects the risk, and the latter should be taken into account.
Estimation of the conditional density is another important topic, here results of Efromovich
(2007g, 2010b) can be helpful. See also Wang (1996), Delecroix, Lopez and Patilea (2008),
Zhang and Zhou (2013), and Wang and Chan (2017).
Wavelet estimation is a popular choice for the nonparametric estimation, see a discus-
sion in Wang (1995), Härdle et al. (1998), Mallat (1998), Efromovich (1999a), Vidakovic
(1999), Nason (2008), Addison (2017), as well as an introductory text on wavelets by Nick-
olas (2017). The E-estimation methodology for wavelets and multiwavelets is justified in
Efromovich (1999b, 2000a, 2001c, 2004e, 2007b), Tymes, Pereyra and Efromovich (2000),
and Efromovich and Smirnova (2014a). Practical applications of the wavelet E-estimation
are discussed in Efromovich et al. (2004), Efromovich et al. (2008), Efromovich (2009b),
Efromovich and Smirnova (2014b).
Quantile regression is another traditionally studied statistical problem, see a discussion
in Efromovich (1999a). Frumento and Bottai (2017) consider the problem of quantile re-
gression under the LTRC where further references may be found. A tutorial on regression
models for the analysis of multilevel survival data can be found in Austin (2017). Another
related topic is the regression models for the restricted residual mean life, see a discussion
in Cortese, Holmboe and Scheike (2017).
Chapter 7
Missing Data in Survival Analysis
This chapter continues exploring topics in survival analysis when, in addition to data mod-
ification caused by censoring and truncation, some observations may be missed. We already
know that censoring, truncation and missing may imply biased data, and hence it will be
of interest to understand how these modifications act together and how a consistent es-
timator should take into account the effect of these modifications. Further, we should be
aware that some modifications may be destructive. Moreover, missing always decreases the
number of available observations. All these and other related issues are discussed in this chapter.
Sections 7.1-7.2 are devoted to estimation of distributions, while the remaining sections explore
regression problems.
= w(v)[f^X(v)G^C(v)]^δ [f^C(v)G^X(v)]^{1−δ} I(v ∈ [0, ∞), δ ∈ {0, 1}).   (7.1.2)
The formula is symmetric with respect to X and C in the sense that if ∆ is replaced by
1−∆ then formally C is censored by X. This tells us that it is sufficient to explore estimation
of the distribution of X and then use the symmetry for estimation of the distribution of C.
Now let us recall some useful notation. The first one is a probability which has several useful
representations,

g(v) := P(V ≥ v) = E{I(V ≥ v)} = G^V(v) = G^X(v)G^C(v).   (7.1.3)

Note that continuity of the random variable V was used in establishing (7.1.3). The second
one is our notation βZ for the upper bound of the support of a random variable Z. Using
this notation we can write that
βV = min(βX , βC ). (7.1.4)
Using (7.1.2)-(7.1.4) we may get the following formulas for the density f X (x) of the
lifetime of interest,
There are two important conclusions from this relation. The first is that we cannot
consistently estimate the density f X (x) for points x > βV . This conclusion is also supported
by the fact that we do not have observations beyond βV . On the other hand, if βX ≤ βC
then consistent estimation of the distribution of X is possible. Of course, the opposite is
true for consistent estimation of the distribution of C. This yields that unless βX = βC the
distribution of one of the two variables cannot be consistently estimated. Because we are
primarily interested in estimation of the distribution of X, we assume that
βX ≤ βC . (7.1.6)
and recall that the survival function GX (x) := P(X > x) can be expressed as

G^X(x) = exp(−H^X(x)),   (7.1.8)

where H^X(x) denotes the cumulative hazard of X.
Estimation of the cumulative hazard involves several steps. First, using (7.1.5) we can
write down the cumulative hazard as the expectation of a function of observed variables,
H^X(x) = E{ A∆I(V ≤ x) / [g(V)w(V)] },   x ∈ [0, βV).   (7.1.9)
To use (7.1.9) for constructing a plug-in sample mean estimator, we need to propose
estimators for functions g(v) and w(v). As a result, our second step is to estimate function
g(v). According to (7.1.3), g(v) may be written as the expectation of indicator I(V ≥ v),
and this yields the sample mean estimator
ĝ(v) := n^{-1} Σ_{l=1}^n I(V_l ≥ v).   (7.1.10)
Note that ĝ(Vl ) ≥ n−1 for l = 1, . . . , n and hence ĝ(Vl ) may be used in a denominator.
Our third step is to propose an estimator of w(v). According to (7.1.1), the availability
likelihood may be written as
w(v) = E{A|V = v}. (7.1.11)
We conclude that w(v) is a nonparametric Bernoulli regression of A on V . Using the avail-
able sample (V1 , A1 ), . . . , (Vn , An ), we construct a regression E-estimator ŵ(v) proposed in
Section 2.4.
Now we are ready to define a plug-in sample mean estimator of the cumulative hazard,
Ĥ^X(x) := n^{-1} Σ_{l=1}^n A_l ∆_l I(V_l ≤ x) / [ĝ(V_l) max(ŵ(V_l), c/ln(n))],   x ∈ [0, V_(n)].   (7.1.12)
Here c > 0 is the parameter of the estimator, and recall that V(n) denotes the largest among
the n observations V1 , . . . , Vn ; it is used as an estimator of βV , and note that P(V(n) < βV ) = 1.
In its turn, the estimator (7.1.12) and formula (7.1.8) yield a plug-in sample mean
estimator of the survival function (the standard notation exp(z) := ez is used),
Ĝ^X(x) := exp( − n^{-1} Σ_{l=1}^n A_l ∆_l I(V_l ≤ x) / [ĝ(V_l) max(ŵ(V_l), c/ln(n))] ),   x ∈ [0, V_(n)].   (7.1.13)
Then, using the delta method (or straightforwardly using a Taylor expansion of the function
exp(−x)) we evaluate the variance of the empirical survival function,
Formulas (7.1.14) and (7.1.15) show us how distributions of X and C, together with the
missing mechanism, affect accuracy of estimation. What we see is an interesting compounded
effect of the two data modifications.
The survival function GC (x) of the censoring variable C may also be estimated by
(7.1.13) with ∆l replaced by 1 − ∆l ; recall the above-discussed symmetry between
X and C. In what follows we denote this estimator as ĜC (x) and it will be used only for
x ≤ V(n) .
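A minimal R sketch of the estimators (7.1.10)-(7.1.13) may look as follows. A logistic regression is used as a stand-in for the Bernoulli regression E-estimator ŵ(v) of Section 2.4, and the function names are hypothetical.

cum.hazard.MAR <- function(V, Delta, A, x, c0 = 1) {
  # Plug-in estimator (7.1.12) of the cumulative hazard with MAR indicators of censoring.
  n <- length(V)
  g.hat <- sapply(V, function(v) mean(V >= v))    # estimator (7.1.10) evaluated at V_1, ..., V_n
  w.fit <- glm(A ~ V, family = binomial)          # logistic stand-in for the E-estimator of w(v)
  w.hat <- predict(w.fit, type = "response")      # fitted availability likelihood at V_1, ..., V_n
  denom <- g.hat * pmax(w.hat, c0 / log(n))
  sapply(x, function(xx) mean(A * Delta * (V <= xx) / denom))
}
surv.MAR <- function(V, Delta, A, x, c0 = 1)      # estimator (7.1.13) of the survival function
  exp(-cum.hazard.MAR(V, Delta, A, x, c0))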
Now let us explain how the hazard rate of X may be estimated. According to (7.1.5),
the hazard rate may be written as follows,
Recall our discussion of estimation of the hazard rate in Section 6.1 and that it may be
estimated only over a finite subinterval of the support of X. Suppose that we are interested in
estimation of hX (x) over an interval [a, a+b] ⊂ [0, βV ). Then, using our traditional notation
ψj (x) for elements of the cosine basis on [a, a + b], we can write for Fourier coefficients of
the hazard rate,
κ_j := ∫_a^{a+b} h^X(x)ψ_j(x)dx = E{ A∆I(V ∈ [a, a + b])ψ_j(V) / [g(V)w(V)] }.   (7.1.17)
The expectation in the right side of (7.1.17) yields the following plug-in sample mean Fourier
estimator,
κ̂_j := n^{-1} Σ_{l=1}^n A_l ∆_l I(V_l ∈ [a, a + b])ψ_j(V_l) / [g(V_l)w(V_l)].   (7.1.18)
In its turn, the Fourier estimator implies the hazard rate E-estimator ĥX (x).
Our last estimand to consider is the probability density f X (x). Similarly to the hazard
rate, we estimate it over an interval [a, a + b] ⊂ [0, βV ). Using (7.1.5), a Fourier coefficient
of the density f X (x) can be written as the following expectation,
θ_j := ∫_a^{a+b} f^X(x)ψ_j(x)dx
= ∫_a^{a+b} [f^{V,A∆,A}(x, 1, 1)ψ_j(x)/(w(x)G^C(x))]dx
= E{ A∆I(V ∈ [a, a + b])ψ_j(V) / [w(V)G^C(V)] }.   (7.1.19)
This expression immediately yields a plug-in sample-mean Fourier estimator,
θ̂_j := n^{-1} Σ_{l=1}^n A_l ∆_l I(V_l ∈ [a, a + b])ψ_j(V_l) / max(ŵ(V_l)Ĝ^C(V_l), c/ln(n)).   (7.1.20)
In (7.1.20), similarly to (7.1.12), we bound the denominator from below, and the
constant c is a parameter that may be manually chosen in a corresponding figure. The
latter allows us to have greater control over the E-estimator.
Fourier estimator (7.1.20) yields a corresponding density E-estimator fˆX (x) defined in
Section 2.2.
The coefficient of difficulty of the proposed Fourier estimator is
d := ∫_a^{a+b} f^X(x) / [b w(x)G^C(x)] dx,   (7.1.21)
and it can be estimated by the following plug-in sample mean estimator,
d̂ := n^{-1} b^{-1} Σ_{l=1}^n A_l ∆_l I(V_l ∈ [a, a + b]) [max(ŵ(V_l)Ĝ^C(V_l), c/ln(n))]^{-2}.   (7.1.22)
Formula (7.1.21) allows us to conclude that the integral ∫_a^{a+b} f^X(x)[b w(x)G^C(x)]^{-1} dx
should be finite for the E-estimator to be consistent. Of course, this integral is finite when-
ever a + b < βV . Further, recall that the E-estimation inference, developed in Section 2.6,
allows us to calculate confidence bands for the density E-estimator.
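Along the same lines, the Fourier estimator (7.1.20) and the empirical coefficient of difficulty (7.1.22) can be sketched in R. The cosine basis on [a, a+b] is written out in its standard form, and w.hat and GC.hat are assumed to be functions returning the estimates ŵ(v) and ĜC(v) constructed above; the function names are hypothetical.

cosine.basis <- function(j, x, a, b) {            # elements of the cosine basis on [a, a+b]
  if (j == 0) rep(1 / sqrt(b), length(x)) else sqrt(2 / b) * cos(pi * j * (x - a) / b)
}
fourier.density.MAR <- function(V, Delta, A, w.hat, GC.hat, a, b, J, c0 = 1) {
  n <- length(V)
  denom <- pmax(w.hat(V) * GC.hat(V), c0 / log(n))
  inside <- as.numeric(V >= a & V <= a + b)
  theta <- sapply(0:J, function(j)                # Fourier estimators (7.1.20) for j = 0, ..., J
    mean(A * Delta * inside * cosine.basis(j, V, a, b) / denom))
  d.hat <- mean(A * Delta * inside / denom^2) / b # empirical coefficient of difficulty (7.1.22)
  list(theta = theta, d.hat = d.hat)
}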
Let us check how the estimators, proposed for estimation of the survival function and the
density of a lifetime of interest, perform in simulated examples. We begin with Figure 7.1.
Figure 7.1 Right censored data with MAR indicators of censoring, the case βX < βC . The RC
mechanism is identical to the one described in Figure 6.6. The top diagram shows available data.
Circles show available (not missed) realizations of (V, ∆) while crossed circles show hidden realiza-
tions whose values of ∆ are missed. The latter allows us to visualize the hidden RC sample shown
by circles. The title shows N := Σ_{l=1}^n ∆_l, which is the number of uncensored lifetimes in the hid-
den RC sample, and M := Σ_{l=1}^n A_l ∆_l, which is the number of available uncensored lifetimes in
complete cases. The second from the top diagram shows by circles the Bernoulli scattergram of n re-
alizations of (V, A) as well as the underlying availability likelihood (the solid line) and its regression
E-estimate (the dashed line). The third from the top diagram shows the survival function GX and
its estimate ĜX by the solid and dashed lines, respectively. It also exhibits the survival function of
GC and its estimate ĜC by the dotted and dot-dashed lines, respectively. The bottom diagram shows
the underlying density (the solid line) and its E-estimate (the dashed line) over interval [0, V(n) ]. It
also exhibits (1 − α) pointwise (the dotted lines) and simultaneous (the dot-dashed lines) confidence
bands. {Use cens = "Expon" for exponential censoring variable whose mean is controlled by argu-
ment lambdaC. Parameter uC is controlled by the argument uC. Availability likelihood is defined
by the string w.} [n = 300, corn = 3, w = "0.1+0.8*exp(1+6*(v-0.5))/(1+exp(1+6*(v-0.5)))",
cens = "Unif", uC = 1.5, lambdaC = 1.5, alpha = 0.05, c = 1, cJ0 = 4, cJ1 = 0.5, cTH = 4]
Because the missing mechanism acts after right censoring, it is convenient to use a familiar
RC simulation of Figure 6.6. Recall that in Figure 6.6 the lifetime of interest X is generated
by a corner density and C is either uniform or exponential. Then we introduce a missing
mechanism for the indicator of censoring with a logistic availability likelihood function
defined in the caption of Figure 7.1. In Figure 7.1 the used censoring is Uniform(0, uC ), and
Figure 7.2 Right censored data with MAR indicators of censoring, the case βC < βX . This figure is
created by Figure 7.1 using uC = 0.7 and corn = 4. [n = 300, corn = 4, w = "0.1+0.8*exp(1+6*(v-
0.5))/(1+exp(1+6*(v-0.5)))", cens = "Unif", uC = 0.7, lambdaC = 1.5, alpha = 0.05, c = 1,
cJ0 = 4, cJ1 = 0.5, cTH = 4]
more information about the simulation and diagrams may be found in the caption of Figure
7.1.
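The simulation itself is straightforward to mimic in R. In the sketch below a Beta density on [0,1] is a hypothetical stand-in for the corner density of X, while the availability likelihood w and the Uniform(0, uC) censoring follow the caption of Figure 7.1.

set.seed(4)
n <- 300; uC <- 1.5
w <- function(v) 0.1 + 0.8 * exp(1 + 6 * (v - 0.5)) / (1 + exp(1 + 6 * (v - 0.5)))
X <- rbeta(n, 2, 2)                   # hypothetical lifetimes of interest supported on [0, 1]
C <- runif(n, 0, uC)                  # Uniform(0, uC) censoring variables
V <- pmin(X, C)                       # observed times
Delta <- as.numeric(X <= C)           # hidden indicators of censoring
A <- rbinom(n, 1, w(V))               # availability of the indicator (MAR: depends on the observed V)
N <- sum(Delta)                       # uncensored lifetimes in the hidden RC sample
M <- sum(A * Delta)                   # available uncensored lifetimes among complete cases
c(N = N, M = M)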
The top diagram shows us the underlying RC data by circles. In the simulation βC =
1.5 > βX = 1, and hence the distribution of X can be consistently estimated based on the
underlying RC data, while the distribution of C may be estimated up to the point βV = 1.
As we can see from the main title, in the particular simulation only N := Σ_{l=1}^n ∆_l = 188
of underlying n = 300 observations of X are uncensored by the Uniform(0,1.5) censoring
random variable C. Further, among those 188 hidden uncensored observations of the lifetime
of interest, only M := Σ_{l=1}^n A_l ∆_l = 130 belong to complete cases. This is a dramatic
decrease in the available information due to censoring and missing. Of course, it is important
to stress that the information about N , the supports, and the underlying distributions is
known only because the data are simulated. Otherwise we would know only the MAR RC
data, n and M .
The second from the top diagram explains estimation of the availability likelihood func-
tion w(v). This is a straightforward Bernoulli regression because all n = 300 observations
of the pair (V, A) are available, and the E-estimate is good.
The third from the top diagram exhibits underlying survival functions GX (x), GC (x) and
their estimates. Note that GC (x) is estimated only for x ≤ V(n) . Keeping in mind complexity
of the setting and the relatively small size of available observations, the estimates are good.
The bottom diagram allows us to visualize the underlying Bimodal density (the solid
line), its E-estimate (the dashed line) and the confidence bands. The density E-estimator
uses the interval [a, a + b] = [0, V(n) ]. The subtitle of the diagram shows us the estimated
coefficient of difficulty dˆ = 3.4. This coefficient of difficulty informs us that for the developed
density E-estimator to get the same MISE as for the case of 100 direct observations of X,
the sample size n should be 3.4 times larger, that is n = 340. Of course, here we are
dealing with the estimated coefficient of difficulty, but the example explains the purpose of
estimating the coefficient of difficulty. Further, recall that for Figure 6.6, with the same RC
mechanism and no missing, the estimated coefficient of difficulty is 2. These numbers shed
new light on the complexity of the considered model with MAR indicators of censoring. Overall,
the particular density E-estimate is good, and the reader is advised to repeat Figure 7.1
and learn more about the model and proposed estimators.
The simulation in Figure 7.1 favors estimation of the distribution of X because it uses
βC > βX . What will happen if this condition is violated? Figure 7.2 allows us to explore such
a situation. Here the underlying simulation is the same as in Figure 7.1 only the censoring
variable is Uniform(0, 0.7), and hence we have βC < βX . We also use the Strata as the
density of X. Diagrams of Figure 7.2 are similar to diagrams of Figure 7.1.
First of all, let us look at the available data and statistics presented in the top diagram. As
could be expected, all available observations of V are smaller than, and close to, 0.7. Further,
note that there are just a few uncensored observations with values larger than 0.3. Moreover,
the censoring is so severe that in the hidden RC sample there are only N = 122 uncensored
lifetimes, and in the available sample with missing indicators there are only 43 uncensored
lifetimes that we are aware of. This is an absolutely devastating loss of information contained
in 300 underlying realizations of the lifetime X.
The second from the top diagram shows us the Bernoulli regression and estimation of
the availability likelihood. The third from the top diagram shows us underlying survival
functions and their estimates. Under the above-explained circumstances, the estimates are
reasonable. Note that we can no longer consistently estimate the distribution of X, but
we can estimate its survival function up to V(n) . Finally, the bottom diagram shows us
the underlying Strata density, its E-estimate and confidence bands for x ∈ [0, V(n) ]. The
E-estimate is far from being good, but at the same time note that it correctly indicates the
main features of the Strata.
It is fair to conclude that the settings considered in Figures 7.1 and 7.2 are complicated.
Repeated simulations with different parameters will be helpful in understanding MAR RC
data.
In (7.2.1) it is understood that the equality holds for all nonnegative v = x if δ = 1 and for
all positive x and 0 ≤ v < x if δ = 0. Further, let us recall that the lifetimes X and C are
nonnegative continuous random variables while A and ∆ are Bernoulli.
Let us comment on this complicated missing mechanism which involves a hidden variable.
The missing is not MAR because P(A = 1|V = v, ∆ = 0) 6= P(A = 1|V = v). Hence, here
we are dealing with the MNAR. We know from Chapter 5 that typically MNAR implies
a destructive missing under which consistent estimation is impossible. Is this the case here or are
we dealing with a case where MNAR is nondestructive? We know from Chapter 5 that
consistent estimation, based on MNAR data, may be possible if the MNAR is converted
into MAR by an always observed auxiliary variable that defines the missing. Here we do
not have an auxiliary variable. On the other hand, we know that V = X given ∆ = 1. The
latter is our only chance to convert the MNAR into MAR via considering a corresponding
subsample of uncensored lifetimes, and in what follows we are exploring this opportunity
to the fullest.
We begin with several probability formulas. First, let us note that events {V = x, ∆ = 1}
and {V = x, ∆ = 1, X = x} coincide due to the definition of the indicator of censoring ∆. This,
together with (7.2.1), implies that
Second, similarly to Section 7.1, let us assume that X and C are continuous and inde-
pendent random variables. Then, using (7.2.1) we can write for the joint mixed density of
observations,

f^{V,A∆,A}(x, 1, 1) = P(A = 1|V = x, ∆ = 1) f^{V,∆}(x, 1)
= P(A = 1|X = x, ∆ = 1) f^X(x) G^C(x) = w(x) f^X(x) G^C(x).   (7.2.4)

Similarly,

f^{V,A∆,A}(x, 0, 1) = P(A = 1|V = x, ∆ = 0) f^{V,∆}(x, 0)
= P(A = 1|C = x, X > x) f^C(x) G^X(x)
= [P(A = 1, X > x|C = x) / P(X > x|C = x)] f^C(x) G^X(x)
= [∫_x^∞ w(u) f^X(u) du] f^C(x).   (7.2.5)
In the last line we used independence of X and C.
Formulas (7.2.4) and (7.2.5) indicate that, unless w(x) is constant and the missing is
MCAR, the symmetry between X and C, known for directly observed censored data, is no
longer present.
Now we are in a position to explain how we can estimate characteristics of the distri-
bution of X, and here we are considering estimation of the cumulative hazard, survival
function and the density.
Our first step is to estimate the availability likelihood w(x). According to (7.2.3), for
the MNAR missing w(x) is the Bernoulli regression of A on V in a subsample of uncensored
lifetimes, that is when ∆ = 1. The Bernoulli regression E-estimator ŵ(x), introduced in
Section 2.4, can be used to estimate the availability likelihood.
Our second step is to estimate the cumulative hazard of X. Using (7.2.4) we write,

H^X(x) := ∫_0^x [f^X(u)/G^X(u)] du = E{ A∆ I(V ≤ x) / [g(V) w(V)] },   x ∈ [0, βV).   (7.2.6)
Here
g(v) := P(V ≥ v) = E{I(V ≥ v)} = GV (v) = GX (v)GC (v), (7.2.7)
where we used the assumption that X and C are continuous and independent random
variables.
The probability g(v) may be estimated by a sample mean estimator,

ĝ(v) := n^{−1} Σ_{l=1}^n I(V_l ≥ v).   (7.2.8)
Combining the obtained results, we get a plug-in sample mean estimator of the cumu-
lative hazard,

Ĥ^X(x) := n^{−1} Σ_{l=1}^n A_l∆_l I(V_l ≤ x) / [ĝ(V_l) max(ŵ(V_l), c/ln(n))],   x ∈ [0, V(n)].   (7.2.9)
Here, as usual, c > 0 is a parameter that may be chosen for a corresponding figure.
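For readers who would like to experiment with these estimators outside the packaged figures, here is a minimal R sketch of (7.2.8) and (7.2.9). The names V, ADelta (the observed products A_l∆_l) and w.hat (any estimate of the availability likelihood) are illustrative assumptions and not part of the book's software.

# A minimal sketch of (7.2.8) and (7.2.9); V is the vector of observed V_l,
# ADelta the vector of observed products A_l*Delta_l, and w.hat any estimate
# of the availability likelihood w(v) (all names are illustrative).
hazard.hat <- function(x, V, ADelta, w.hat, c = 1) {
  n <- length(V)
  g.hat <- sapply(V, function(v) mean(V >= v))    # estimator (7.2.8) evaluated at V_l
  denom <- g.hat * pmax(w.hat(V), c / log(n))     # bounded-from-below denominator
  sum(ADelta * (V <= x) / denom) / n              # estimator (7.2.9)
}
# The survival function estimate (7.2.10) is then exp(-hazard.hat(x, ...)).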
In its turn, (7.2.9) allows us to estimate the survival function G^X(x). Recall that
G^X(x) = exp(−H^X(x)), and this, together with (7.2.9), yields the plug-in sample mean estimator

Ĝ^X(x) := exp(−Ĥ^X(x)),   x ∈ [0, V(n)].   (7.2.10)
Estimation of the density f^X(x) over an interval [a, a + b] ⊂ [0, βV) is based on using
formula (7.2.4). Denote by ψj(x) elements of the cosine basis on [a, a + b] and write a Fourier
coefficient of the density f^X(x) as

θ_j := ∫_a^{a+b} f^X(x) ψ_j(x) dx = E{ A∆ I(V ∈ [a, a + b]) ψ_j(V) / [w(V) G^C(V)] }.   (7.2.11)
To use the sample mean approach for estimation of the expectation in the right side
of (7.2.11) we need to know the functions w(v) and G^C(v). The former is estimated by the
above-defined Bernoulli regression E-estimator ŵ(x). There are several possibilities to
estimate G^C(v), and here we are using a plug-in estimator motivated by (7.2.7), namely we
set
ĜC (v) := ĝ(v)/ĜX (v), (7.2.12)
where ĝ(v) is defined in (7.2.8) and ĜX (x) in (7.2.10). These estimators allow us to define
a plug-in sample mean Fourier estimator,

θ̂_j := n^{−1} Σ_{l=1}^n A_l∆_l I(V_l ∈ [a, a + b]) ψ_j(V_l) / max(ŵ(V_l)Ĝ^C(V_l), c/ln(n)).   (7.2.13)
Note that both censoring and missing create bias in the observed data, and the random
factor A∆[ŵ(V)Ĝ^C(V)]^{−1}, used in (7.2.13), corrects the compounded bias.
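A short R sketch of the Fourier estimator (7.2.13) may clarify the construction; the functions w.hat and G.hat.C below stand for estimates of w(v) and G^C(v) (for instance, the Bernoulli regression E-estimator and (7.2.12)) and are assumptions of the sketch, not names from the book's software.

# Sketch of (7.2.13) with the cosine basis on [a, a+b]; w.hat and G.hat.C are
# assumed estimates of w(v) and G^C(v).
theta.hat.density <- function(j, V, ADelta, w.hat, G.hat.C, a, b, c = 1) {
  n <- length(V)
  psi <- function(x)                              # cosine basis element on [a, a+b]
    if (j == 0) rep(1 / sqrt(b), length(x)) else sqrt(2 / b) * cos(pi * j * (x - a) / b)
  inside <- (V >= a) & (V <= a + b)
  sum(ADelta * inside * psi(V) / pmax(w.hat(V) * G.hat.C(V), c / log(n))) / n
}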
Figure 7.3 Right censored data with MNAR indicators of censoring. The MNAR is defined by the
lifetime of interest X according to the availability likelihood w indicated below. All other components
of the simulation and the diagrams are the same as in Figure 7.1. [n = 300, corn = 3, w =
"0.1+0.8*exp(1+6*(x-0.5))/(1+exp(1+6*(x-0.5)))", cens = "Unif", uC = 1.5, c = 1, alpha =
0.05, cJ0 = 4, cJ1 = 0.5, cTH = 4]
The Fourier estimator yields the density E-estimator fˆ(x), x ∈ [a, a + b].
Figure 7.3 illustrates the setting and the proposed solution, and its diagrams are similar
to those in Figure 7.1. The top diagram shows us the available data. Note that the hidden
sample from X contains n = 300 observations, only N := Σ_{l=1}^n ∆_l = 174 of them are
uncensored, and of those only M := Σ_{l=1}^n A_l∆_l = 129 contain indicators of censoring. In
other words, we have only 129 observations of X to work with, and they are biased by both
censoring and MNAR. These observations are shown by horizontal coordinates of circles
with y-coordinates equal to 1. They are also shown by horizontal coordinates of circles,
corresponding to A = 1, in the second diagram.
The second from the top diagram shows us the scatterplot of pairs (Xl , Al ) with corre-
sponding indicators ∆l = 1. Note that if ∆l = 1, then Vl = Xl and hence we know Xl . The
underlying regression function (the solid line) is the availability likelihood. Its regression
E-estimate is shown by the dashed line, and it is based on M = 129 observations. Note that
among the 129 observations of X only a few have values smaller than 0.2 or larger than 0.8. This
is a serious complication, and for this particular simulation the E-estimator does a good
job. At the same time, let us recall that according to (7.2.9) and (7.2.13), we need to know
values of w(v) only for points Vl corresponding to cases with Al = ∆l = 1.
The third diagram shows us the underlying survival function of the Bimodal distribution
and the sample mean estimate (7.2.10). The estimate is good. Finally, the density estimate
is shown in the bottom diagram together with its confidence bands. It is worthwhile to pay
attention to a relatively large coefficient of difficulty shown in the subtitle. It raises a red flag
about complexity of the considered estimation problem. The reader is advised to repeat the
simulation, use different parameters, and appreciate complexity of the problem of estimation
of the density based on censored observations with MNAR indicators of censoring.
This is a classical regression problem, and if this sample were known, then we could use
the regression E-estimator of Section 2.3.
The second layer of hidden observations is created by right censoring of the response
Y by a censoring variable C. The hidden sample is (X1 , V1 , ∆1 ), . . . , (Xn , Vn , ∆n ) where
Vl := min(Yl , Cl ) and ∆l := I(Yl ≤ Cl ). Note that this sample is from (X, V, ∆) where
V := min(Y, C) and ∆ := I(Y ≤ C). If this sample was known, we could use the regression
E-estimator of Section 6.8.
Finally, censored responses are subject to MAR (missing at random) with the likelihood
of missing defined by the always observed predictor. Namely, what we observe is a sample
(X1 , A1 V1 , ∆1 , A1 ), . . . , (Xn , An Vn , ∆n , An ) from (X, AV, ∆, A) where the availability A is
a Bernoulli random variable and
P(A = 1|X = x, Y = y, V = v, ∆ = δ, C = u) = P(A = 1|X = x) =: w(x).
Further, similarly to Section 6.8, the underlying regression function can be consistently estimated, in general, only if
βY ≤ βC .   (7.3.4)
f^{X,AV,∆,A}(x, v, 0, 1) = w(x) f^X(x) f^C(v) G^{Y|X}(v|x) I(x ∈ [0, 1], v ∈ [0, βV]).   (7.3.8)
This formula allows us to write (recall that (X, Y) and C are independent)

f^C(v)/G^C(v) = f^{X,AV,∆,A}(x, v, 0, 1) / [w(x) f^X(x) G^{V|X}(v|x)],   x ∈ [0, 1], v ∈ [0, βV).   (7.3.9)
In its turn, formula (7.3.9) allows us to get a useful formula for the cumulative hazard
of C,

H^C(v) := ∫_0^v [f^C(y)/G^C(y)] dy = ∫_0^v ∫_0^1 [f^C(y)/G^C(y)] dx dy
= ∫_0^v ∫_0^1 f^{X,AV,∆,A}(x, y, 0, 1) / [w(x) f^X(x) G^{V|X}(y|x)] dx dy
= E{ (1 − ∆)A I(AV ≤ v) / ([w(X) f^X(X)] G^{V|X}(AV|X)) },   v ∈ [0, βV).   (7.3.10)
This result shows that we can express the cumulative hazard of C as an expectation of
observed variables. In that expectation we know how to estimate the product w(x)f X (x),
and we need to explain how to estimate the conditional survival function GV |X (v|x) :=
P(V > v|X = x). This is a new problem for us and it is of interest on its own.
Using the assumed continuity of the random variable V we may write,
GV |X (v|x) := P(V > v|X = x) = P(V ≥ v|X = x) = E{I(V ≥ v)|X = x}. (7.3.11)
The right side of (7.3.11) reveals that for a given v the conditional survival function
GV |X (v|x) is a Bernoulli regression of the response I(V ≥ v) on the predictor X. The only
complication here is that some realizations of the response I(V ≥ v) are missed. The good
news is that the missing is MAR and it is defined by the predictor. Hence, according to
Section 4.2, we may use a complete-case approach together with our regression E-estimator.
This gives us the regression E-estimator ĜV |X (v|x).
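The following R lines sketch this complete-case approach. The book uses the regression E-estimator of Section 2.4; here, purely for illustration, an ordinary logistic fit on complete cases plays its role, so the function below is a stand-in rather than the book's estimator.

# Illustrative complete-case estimate of G^{V|X}(v|x) = E{I(V >= v)|X = x};
# a logistic fit is used as a stand-in for the Bernoulli regression E-estimator.
G.hat.VX <- function(v, x.new, X, V, A) {
  cc <- which(A == 1)                             # complete cases (responses not missed)
  d <- data.frame(x = X[cc], y = as.numeric(V[cc] >= v))
  fit <- glm(y ~ x, family = binomial, data = d)
  as.numeric(predict(fit, newdata = data.frame(x = x.new), type = "response"))
}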
Now we are in a position to finish our explanation of how to estimate the survival
function GC (v). Recall that this function can be expressed via the cumulative hazard as
GC (v) = exp(−H C (v)), and hence, according to (7.3.7) and (7.3.10), the plug-in sample
mean estimator of the survival function is

Ĝ^C(v) := exp{ − Σ_{l=1}^n (1 − ∆_l)A_l I(A_lV_l ≤ v) / [ p̂^X(X_l) (Σ_{k=1}^n A_k) Ĝ^{V|X}(A_lV_l|X_l) ] },   v ∈ [0, (AV)(n)].   (7.3.12)
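A compact R sketch of (7.3.12) is given below; it assumes that p.hat.X(x) estimates the density of the predictor among complete cases (so that, multiplied by the proportion of complete cases, it mimics w(x)f^X(x)) and that G.hat.VX(v, x) returns the estimate of G^{V|X}(v|x), for example a wrapper around the previous sketch. Both names are assumptions made for illustration.

# Sketch of the survival function estimator (7.3.12).
G.hat.C <- function(v, X, V, A, Delta, p.hat.X, G.hat.VX) {
  use <- which(A == 1 & Delta == 0 & V <= v)      # the only terms that are nonzero
  if (length(use) == 0) return(1)
  terms <- sapply(use, function(l)
    1 / (p.hat.X(X[l]) * sum(A) * G.hat.VX(V[l], X[l])))
  exp(-sum(terms))
}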
All estimators of nuisance functions, needed for mimicking (7.3.6), are constructed, and
this allows us to propose the following plug-in sample mean Fourier estimator,

θ̂_j := Σ_{l=1}^n ∆_lA_lV_l ϕ_j(X_l) / ( [Σ_{k=1}^n A_k] max( p̂^X(X_l)Ĝ^C(A_lV_l), c/ln(n) ) ).   (7.3.13)
Figure 7.4 Regression with MAR right censored response. The simulation and the diagrams are
explained in the text. [n = 300, corn = 4, a = 0.3, w = "0.1+0.8*exp(1+6*(x-0.5))/(1+exp(1+6*(x-
0.5)))", lambdaC = 3, c = 1, cJ0 = 4, cJ1 = 0.5, cTH = 4]
In its turn, the Fourier estimator yields the regression E-estimator m̂(x, βV ). Note that
we do not explicitly estimate the parameter βV because this is not needed; if an estimate
were required, (AV)(n) would be a natural choice.
Figure 7.4 sheds additional light on the problem and the proposed solution. We begin
with explanation of the diagrams and the simulation. The top diagram shows an under-
MAR CENSORED RESPONSES 257
lying hidden scattergram created by the Uniform predictor X and the response Y having
exponential distribution with the mean a + f (X) where a is a positive constant and f is
a corner function, here a = 0.3 and f is the Strata. As usual, the solid and dashed lines
show us the underlying regression and its E-estimate. The sample size n = 300 and the
integrated squared error (ISE = 0.15) of the E-estimate are shown in the title. The second
(from the top) diagram shows us the scattergram where the responses are right censored
by an exponential censoring variable C with the mean λC . Circles show observations with
uncensored responses and crosses show censored ones. Again, the solid and dashed lines are
the underlying regression and its E-estimate of Section 6.8 based on the censored data. The
title indicates that the number of uncensored responses is N := Σ_{l=1}^n ∆_l = 225 and the
mean of the censoring variable is 3. The third diagram shows us the available data when
the censored responses are MAR according to the availability likelihood w(x) defined in the
caption. Incomplete cases are shown by triangles and in the title M := Σ_{l=1}^n A_l∆_l = 147
shows the number of available uncensored responses. The fourth diagram shows us by cir-
cles and squares the underlying survival function GC (Vl ) and its estimate ĜC (Vl ) at points
corresponding to Al ∆l = 1. The bottom diagram shows the underlying regression and its
E-estimate based on data exhibited in the third diagram.
Now we are ready to analyze the simulated data and estimates presented in Figure
7.4. The top two diagrams show us the sequentially simulated layers of hidden data. The
underlying scattergram of direct observations from (X, Y ) is shown in the top diagram by
circles. Note that we are dealing with the response that is exponentially distributed and
the underlying regression is the vertically shifted Strata. Visual analysis of the scattergram
supports the E-estimate. The second diagram exhibits a scattergram where responses, shown
in the top diagram, are right censored by an exponential random variable. The E-estimate
(the dashed line), introduced in Section 6.8, is worse, both visually and in terms of the ISE,
than the one in the top diagram. The latter should be expected; note that the number
of uncensored responses is reduced from 300 to 225. Further, note that larger responses
have a larger likelihood to be censored, and this dramatically affects visualization of the
left stratum. The latter is the primary reason for the poor estimation. Moreover, even
these censored observations are not available to us. Instead, we get observations shown
in the third diagram where some responses are missed according to a missing at random
(MAR) mechanism with the availability likelihood w(x) defined in the caption. As a result,
in the available dataset we have only 147 uncensored responses. Further, the availability
function is increasing in x, and this additionally complicates estimation of the left tail of the
underlying regression. The two bottom diagrams are devoted to the proposed E-estimation.
The fourth diagram shows us by circles values of the underlying survival function GC (v) and
the squares show us its estimate (7.3.12). Note that we need to know the survival function
only at points Vl corresponding to Al ∆l = 1, and these are the points at which the diagram
shows us values of the survival function and its estimate. This diagram highlights estimation
of a more challenging function among nuisance ones, and it also sheds light on the hidden
censoring mechanism. Finally, the bottom diagram exhibits the underlying regression and
its E-estimate based on the MAR RC data shown in the third diagram. As could be
expected, the estimate is worse than the two E-estimates based on the hidden samples, but
keeping in mind the complexity of the problem and that only 147 uncensored and complete cases
are available, the outcome is quite respectable. Indeed, we do see the two strata, the right one
is exhibited almost perfectly, the left one is shifted to the right but its magnitude is shown
well. It is a good idea now to look one more time at the scattergram in the third diagram
and try to recognize the underlying regression based on the data. The latter is definitely a
complicated task, and it may be instructive to look again at how the E-estimator solved it.
It is highly recommended to repeat Figure 7.4 with different parameters and gain expe-
rience in analyzing the scattergrams and understanding how the E-estimator performs.
7.4 Censored Responses and MAR Predictors
We are considering a regression problem when first responses are right censored and then
predictors may be missed. For this setting we are dealing with two sequentially created
layers of hidden observations. The first layer of hidden observations is created by a sample
(X1 , Y1 ), . . . , (Xn , Yn ) of size n from a pair of continuous random variables (X, Y ). Here X
is the predictor, Y is the response, and the aim is to estimate the regression function
m(x) := E{Y |X = x},   x ∈ [0, 1].   (7.4.1)
The second layer of hidden observations is created by right censoring of the response Y by
a censoring variable C. This creates a sample (X1 , V1 , ∆1 ), . . . , (Xn , Vn , ∆n ) where Vl :=
min(Yl , Cl ) and ∆l := I(Yl ≤ Cl ). This sample is from (X, V, ∆) where V := min(Y, C)
and ∆ := I(Y ≤ C). Finally, predictors are subject to MAR (missing at random) with the
likelihood of missing defined by the always observed variable V . Namely, we observe a sample
(A1 X1 , V1 , ∆1 , A1 ), . . . , (An Xn , Vn , ∆n , An ) from (AX, V, ∆, A) where the availability A is
Bernoulli and
P(A = 1|X = x, Y = y, V = v, ∆ = δ, C = u) = P(A = 1|V = v) =: w(v).   (7.4.2)
Based on this twice modified original sample from (X, Y ), first by RC of the response and
then by MAR of the predictor, the aim is to estimate the underlying regression function
(7.4.1). In what follows it is assumed that X, Y and C are continuous variables, the pair
(X, Y ) is independent of C, the predictor X has a continuous and positive design density
supported on [0, 1], and both Y and C are nonnegative. Also recall our notation βZ for the
right boundary point of the support of a continuous random variable Z.
To explain the proposed solution, we begin with a formula for the joint mixed density
of (AX, V, ∆, A),

f^{AX,V,∆,A}(x, v, 1, 1) = w(v) f^X(x) f^{Y|X}(v|x) G^C(v) I(x ∈ [0, 1], v ∈ [0, βV]).   (7.4.3)

Further, in general the regression function m(x) can be consistently estimated only if
βY ≤ βC .   (7.4.4)
Hence, as it was explained in Sections 6.8 and 7.3, a reasonable estimand is a censored
regression function
m(x, βV ) := E{Y I(Y ≤ βV )|X = x}, (7.4.5)
which is equal to m(x) whenever (7.4.4) holds.
To use our methodology of E-estimation, we need to understand how to construct a
sample mean estimator of Fourier coefficients of m(x, βV ). Using our traditional notation
ϕj (x) for elements of the cosine basis on [0, 1] and with the help of (7.4.3) we can write,
θ_j := ∫_0^1 m(x, βV) ϕ_j(x) dx = ∫_0^1 [∫_0^{βV} v f^{Y|X}(v|x) dv] ϕ_j(x) dx
= ∫_0^1 [∫_0^{βV} v f^{AX,V,∆,A}(x, v, 1, 1) ϕ_j(x) / (w(v) f^X(x) G^C(v)) dv] dx
= E{ A∆V ϕ_j(AX) / [w(V) f^X(AX) G^C(V)] }.   (7.4.6)
To use (7.4.6) for constructing a plug-in sample mean estimator, we need to estimate
functions w(v), f X (x) and GC (v). We begin with estimation of the availability likelihood
w(v). Using (7.4.2) we note that
w(v) = P(A = 1|V = v) = E{A|V = v}.
This implies that w(v) is a regression function. Because all n realizations of the pair (V, A) are
available, we can use our Bernoulli regression E-estimator ŵ(v) of Section 2.4 for estimation
of the availability likelihood.
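Because all n pairs (V_l, A_l) are observed, the availability likelihood can be explored with any Bernoulli regression tool; the sketch below uses an ordinary logistic fit as a quick stand-in for the E-estimator of Section 2.4 and returns a function that can be plugged into the estimators of this section.

# Illustrative stand-in for the Bernoulli regression estimate of w(v).
fit.w.hat <- function(V, A) {
  fit <- glm(A ~ V, family = binomial, data = data.frame(V = V, A = A))
  function(v) as.numeric(predict(fit, newdata = data.frame(V = v), type = "response"))
}
# Usage: w.hat <- fit.w.hat(V, A); then w.hat(v) estimates the availability likelihood.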
Our second step is to estimate the design density f X (x). To apply our density E-
estimator, we need to understand how to express Fourier coefficients of the density as
expectations. Recall that observations of the predictor are MAR, and we can write that
f AX,V,A (x, v, 1) = w(v)f X (x)f V |X (v|x)I(x ∈ [0, 1], v ∈ [0, βV ]). (7.4.8)
This formula allows us to write down a Fourier coefficient of the design density as follows,

κ_j := ∫_0^1 f^X(x) ϕ_j(x) dx = ∫_0^1 ∫_0^{βV} f^X(x) ϕ_j(x) f^{V|X}(v|x) dv dx
= ∫_0^1 ∫_0^{βV} [f^{AX,V,A}(x, v, 1) ϕ_j(x) / w(v)] dv dx = E{ Aϕ_j(AX) / w(V) }.   (7.4.9)
The expectation in (7.4.9) implies the following plug-in sample mean estimator of Fourier
coefficients of the design density,

κ̂_j := n^{−1} Σ_{l=1}^n A_l ϕ_j(A_lX_l) / max(ŵ(V_l), c/ln(n)).   (7.4.10)
This Fourier estimator yields the density E-estimator fˆX (x) defined in Section 2.2.
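A short R sketch of (7.4.10) follows; w.hat may be, for instance, the stand-in fit produced above, and the cosine basis on [0, 1] is used.

# Sketch of the Fourier estimator (7.4.10) of the design density f^X.
kappa.hat <- function(j, X, V, A, w.hat, c = 1) {
  n <- length(V)
  phi <- function(x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
  cc <- which(A == 1)                             # terms with A_l = 0 vanish
  sum(phi(X[cc]) / pmax(w.hat(V[cc]), c / log(n))) / n
}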
Our third step is to estimate the survival function GC (v) of the censoring variable C. The
estimation is based on the relation GC (v) = exp{−H C (v)} between the survival function
and the cumulative hazard H C (v). Let us explain how H C (v) may be estimated. The joint
mixed density of (AX, V ) and the events ∆ = 0 and A = 1 can be written as
f AX,V,∆,A (x, v, 0, 1) = w(v)f X (x)f C (v)GY |X (v|x)I(x ∈ [0, 1], v ∈ [0, βV ]). (7.4.11)
This formula allows us to write (recall that (X, Y) and C are independent)

f^C(v)/G^C(v) = f^{AX,V,∆,A}(x, v, 0, 1) / [w(v) f^X(x) G^{V|X}(v|x)],   x ∈ [0, 1], v ∈ [0, βV).   (7.4.12)
In its turn, (7.4.12) allows us to write for the cumulative hazard of C,

H^C(v) := ∫_0^v [f^C(y)/G^C(y)] dy = ∫_0^v ∫_0^1 [f^C(y)/G^C(y)] dx dy
= ∫_0^v ∫_0^1 f^{AX,V,∆,A}(x, y, 0, 1) / [w(y) f^X(x) G^{V|X}(y|x)] dx dy
= E{ (1 − ∆)A I(V ≤ v) / [w(V) f^X(AX) G^{V|X}(V|AX)] },   v ∈ [0, βV).   (7.4.13)
To mimic (7.4.13) by a plug-in sample mean estimator, we need to make an extra step
and estimate the conditional survival function GV |X (v|x). Recall that V is a continuous
random variable and write,
GV |X (v|x) := P(V > v|X = x) = P(V ≥ v|X = x) = E{I(V ≥ v)|X = x}. (7.4.14)
Note that for a fixed v we converted the problem of estimation of the conditional survival
function GV |X (v|x) into a regression problem based on n observations from the triplet
(AX, I(V ≥ v), A). This is a regression problem with MAR predictor considered in Section
4.3. Let us recall its solution for the data at hand. For a fixed v, using (7.4.8) allows us to
write down a Fourier coefficient of GV |X (v|x) as
ν_j := ∫_0^1 G^{V|X}(v|x) ϕ_j(x) dx = ∫_0^1 ∫_v^∞ f^{V|X}(u|x) ϕ_j(x) du dx.
This allows us to construct the regression E-estimator ĜV |X (v|x) for a given v, and note
that we need to calculate ĜV |X (Vl |Al Xl ) only for cases with Al (1 − ∆l ) = 1.
We have estimated all nuisance functions used in (7.4.13). These estimators, together
with formula GC (v) = exp(−H C (v)) and (7.4.13), yield the following plug-in sample mean
estimator of the survival function of censoring variable C,
Ĝ^C(v) := exp{ −n^{−1} Σ_{l=1}^n (1 − ∆_l)A_l I(V_l ≤ v) / [ŵ(V_l) f̂^X(A_lX_l) Ĝ^{V|X}(V_l|A_lX_l)] },   v ∈ [0, V(n)].   (7.4.17)
Figure 7.5 Regression with right censored responses and MAR predictors. The simulation of
(X, Y, C) is the same as in Figure 7.4, and the availability likelihood w(v) is defined below. The
solid line and the dashed line show the underlying regression and its estimate based on data shown
in a diagram. In the top diagram circles show the hidden simulation from (X, Y). In the middle
diagram circles show uncensored cases and crosses show censored cases. N := Σ_{l=1}^n ∆_l is shown in
the title. The bottom diagram shows by circles uncensored and complete cases, by crosses censored
and complete cases, and by triangles incomplete cases. M := Σ_{l=1}^n A_l∆_l is shown in the title. [n =
300, corn = 4, a = 0.3, lambdaC = 3, w = "0.3+0.5*exp(1+2*(v-2))/(1+exp(1+2*(v-2)))", c =
1, cJ0 = 4, cJ1 = 0.5, cTH = 4]
Note that the exponential regression considered here has a large variance, which complicates the
estimation.
The middle diagram shows data with right censored responses. It is a valuable learning
moment to compare circles in this diagram with circles in the top diagram, and then to
realize how damaging the censoring is for the regression. Note that we have only N =
219 uncensored responses. As a result, the poor visual appeal of the E-estimate is well
understood. On the other hand, in terms of the ISE, this estimate is on par with the
estimate in Figure 7.4. Let us stress that so far we have been dealing with just another
simulation of Figure 7.4. The difference between the underlying simulations is in the missing
mechanism.
Recall that the data shown in the two top diagrams are hidden and not available to us. We
may visualize them only due to the underlying simulation, and the diagrams allow us to
appreciate complexity of the underlying simulation. Available observations are shown in the
bottom diagram. Here the underlying hidden predictors, shown in the middle diagram, are
missing at random. The underlying availability likelihood is indicated in the caption, it is
increasing in v, and the latter favors availability of X for larger values of V . Keeping
this remark in mind, it is a valuable lesson to look at the data and realize how the missing
has modified them. Further, note that we are left only with M = 76 complete and uncensored
pairs, and this is just a quarter of the initial n = 300 pairs. The E-estimate (the dashed
line) is clearly far from being good, but it is close to the one in the middle diagram and
does indicate two strata.
The setting considered here is one of the most complicated ones explored so far. It is highly
advisable to repeat Figure 7.5 with different parameters and get used to the setting and the
proposed solution.
The second layer of hidden observations is created by right censoring of the predictor X by a
censoring variable C. The hidden observations are (U1 , Y1 , ∆1 ), . . . , (Un , Yn , ∆n ) where Ul :=
min(Xl , Cl ) and ∆l := I(Xl ≤ Cl ). This sample is from (U, Y, ∆) where U := min(X, C)
and ∆ := I(X ≤ C). Finally, the responses are missing at random and we observe a sample
(U1 , A1 Y1 , ∆1 , A1 ), . . . , (Un , An Yn , ∆n , An ) from (U, AY, ∆, A). Here A is the availability
which is a Bernoulli random variable and
P(A = 1|X = x, Y = y, U = u, ∆ = δ, C = c) = P(A = 1|U = u) =: w(u).   (7.5.2)
Based on this twofold modified sample, first by RC of the predictor and then by MAR
of the response, the aim is to estimate the underlying regression function (7.5.1). In what
follows it is assumed that X, Y and C are continuous variables, the pair (X, Y ) is indepen-
dent of nonnegative C, the predictor X is supported on [0, 1] according to a continuous and
positive design density, and (recall notation βZ for the right boundary point of the support
of a continuous random variable Z)
βX ≤ βC . (7.5.3)
We begin the explanation of how to construct a regression estimator with a formula for
the joint mixed density of (U, AY, ∆, A) for the case ∆ = 1 and A = 1 (or equivalently
A∆ = 1), that is, when neither censoring of the predictor nor missing of the response occurs.
Write,
f U,AY,∆,A (u, y, 1, 1) = [w(u)f X (u)GC (u)]f Y |X (y|u)I(u ∈ [0, 1], y ∈ (−∞, ∞)). (7.5.4)
Now let us make a remark that sheds light on the formula. Consider the probability
density of U given A∆ = 1. Using (7.5.4), that conditional density may be written as

f^{U|A∆}(u|1) = w(u) f^X(u) G^C(u) / P(A∆ = 1),   u ∈ [0, 1].   (7.5.5)
Figure 7.6 Regression with right censored predictors and missing at random responses. The top dia-
gram shows an underlying hidden scattergram created by the Uniform predictor X and the response
Y having exponential distribution with the mean a + f (X) where f is a corner function, here it
is the Strata. The solid and dashed lines show the underlying regression and its E-estimate. The
sample size n and the integrated squared error (ISE) of the E-estimate are shown in the title. The
middle diagram shows the available data where predictors are censored by a Uniform(0, uC) variable
C, and then responses are missed according to the availability function w(u) defined below. Uncen-
sored and complete cases are shown by circles, censored and complete cases are shown by crosses,
incomplete cases are shown by triangles. The bottom diagram shows us by circles the scattergram
of uncensored and complete cases, and the underlying regression and its E-estimate, based on these
observations, are shown by the solid and dashed lines, respectively. In the title M := Σ_{l=1}^n A_l∆_l
is the number of uncensored and complete cases. [n = 300, corn = 4, a = 0.3, uC = 1.5, w =
"0.1+0.8*exp(1+6*(u-0.5))/(1+exp(1+6*(u-0.5)))", c = 1, cJ0 = 4, cJ1 = 0.5, cTH = 4]
Note that this density, up to the denominator P(A∆ = 1), is the factor in the square brackets
in (7.5.4). This remark sheds a new light on the joint distribution of (U, AY ), and it also
highlights a possibility of using an uncensored-complete-case approach for the regression
E-estimation. Let us check that this approach is applicable here.
To construct a regression E-estimator, we need to propose a sample mean estimator of
the Fourier coefficients of the regression m(x),

θ_j := ∫_0^1 m(x) ϕ_j(x) dx = ∫_0^1 ∫_{−∞}^∞ y f^{Y|X}(y|x) ϕ_j(x) dy dx.   (7.5.6)
Here ϕj (x), j = 0, 1, . . . are elements of the cosine basis on [0, 1].
Using (7.5.4) and (7.5.5) we can continue (7.5.6) and write,

θ_j = E{ A∆Y ϕ_j(U) / [P(A∆ = 1) f^{U|A∆}(U|1)] }.   (7.5.7)
Set M := Σ_{l=1}^n A_l∆_l, note that the sample mean estimator of P(A∆ = 1) is M/n,
and denote by f̂^{U|A∆}(u|1) the density E-estimator of Section 2.2 based on uncensored and
complete cases. Then the expectation in the right side of (7.5.7) yields the plug-in sample
mean Fourier estimator,

θ̂_j := M^{−1} Σ_{l=1}^n A_l∆_lY_l ϕ_j(U_l) / max(f̂^{U|A∆}(U_l|1), c/ln(n)).   (7.5.8)
Note that this estimator is based solely on uncensored and complete cases.
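The construction can be sketched in a few R lines. Here f.hat.U1 stands for an estimate of the density of U in the uncensored-complete-case subsample (the book's E-estimator of Section 2.2; any density estimate may be substituted while experimenting), and AY_l is assumed available whenever A_l = 1.

# Sketch of the Fourier estimator (7.5.8) based on uncensored and complete cases.
theta.hat.ucc <- function(j, U, AY, A, Delta, f.hat.U1, c = 1) {
  n <- length(U)
  phi <- function(x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
  ucc <- which(A == 1 & Delta == 1)               # uncensored and complete cases
  M <- length(ucc)
  sum(AY[ucc] * phi(U[ucc]) / pmax(f.hat.U1(U[ucc]), c / log(n))) / M
}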
In its turn, this Fourier estimator allows us to construct the regression E-estimator m̂(x).
Further, this is the same estimator that one would get by using the E-estimator of Section 2.3
based on uncensored and complete cases. The latter proves the above-made conjecture about
the applicability here of the uncensored-complete-case approach. This is welcome news
for the considered regression setting dealing with twice modified data.
Another way to look at the proposed solution is as follows. Formula (7.5.4) implies that
f^{AY|U,A∆}(y|u, 1) = f^{Y|X}(y|u),   u ∈ [0, 1],   y ∈ (−∞, ∞).
In words, the underlying conditional distribution of Y given X is the same as the conditional
distribution of AY given U in the subsample with uncensored and complete cases. This
is an interesting and important conclusion on its own because it simplifies estimation of
the conditional distribution. It also immediately implies that, as we already know, the
corresponding regressions also coincide.
Figure 7.6 illustrates the setting and E-estimation based on uncensored and complete
cases. The top diagram exhibits the underlying scattergram, the regression function and
its E-estimate. Note the high volatility of data. The middle diagram shows us the available
data. The bottom diagram shows us the scattergram of uncensored and complete cases where
A∆ = 1, and the proposed E-estimate. We may see the Strata pattern in the scattergram,
and this supports the theoretical conclusion about feasibility of the uncensored-complete-
case approach. The main statistical complication here is the dramatic reduction in the size
of available cases from n = 300 to M = 113. Keeping this in mind, the E-estimate does a
good job in recovering the underlying regression.
Keeping this example in mind, let us formulate the considered regression problem. There
is a hidden sequential sampling (X1∗ , Y1∗ , T1∗ ), (X2∗ , Y2∗ , T2∗ ), . . . from a triplet (X ∗ , Y ∗ , T ∗ ).
Here X ∗ is the predictor, Y ∗ is the response, and T ∗ is the truncation variable. The problem
is to estimate the regression function
m(x) := E{Y ∗ |X ∗ = x},   x ∈ [0, 1].   (7.6.1)
f^{X,AY,T,A}(x, y, t, 1) = w(x) f^{T∗}(t) f^{X∗}(x) f^{Y∗|X∗}(y|x) [P(T∗ ≤ X∗)]^{−1} I(0 ≤ t ≤ x ≤ 1) I(y ≥ 0).   (7.6.3)
We can integrate both sides of (7.6.3) with respect to t and get

f^{X,AY,A}(x, y, 1) = w(x) f^{X∗}(x) f^{Y∗|X∗}(y|x) F^{T∗}(x) [P(T∗ ≤ X∗)]^{−1} I(0 ≤ x ≤ 1, y ≥ 0).   (7.6.4)
Further, we integrate both sides of (7.6.4) with respect to y and get

f^{X,A}(x, 1) = w(x) f^{X∗}(x) F^{T∗}(x) [P(T∗ ≤ X∗)]^{−1} I(0 ≤ x ≤ 1).   (7.6.5)
Now we are ready to explain a proposed solution. Recall our notation αZ for the left
boundary point of the support of a continuous variable Z. It is clear from (7.6.3), as well
as from the definition of the left truncation, that a consistent estimation of the regression
is possible only if
αT ∗ ≤ αX ∗ . (7.6.6)
Otherwise, we may estimate m(x) only for x ≥ max(αT ∗ , αX ∗ ).
Assume that (7.6.6) holds. To construct a regression E-estimator, we need to under-
stand how to express Fourier coefficients of m(x) via an expectation. Recall our traditional
notation ϕj (x) for elements of the cosine basis on [0, 1] and write,
θ_j := ∫_0^1 m(x) ϕ_j(x) dx = ∫_0^1 [∫_0^∞ y f^{Y∗|X∗}(y|x) dy] ϕ_j(x) dx.   (7.6.7)
Using (7.6.4) we continue,

θ_j = ∫_0^1 ∫_0^∞ y f^{X,AY,A}(x, y, 1) ϕ_j(x) / ( w(x) f^{X∗}(x) F^{T∗}(x) [P(X∗ ≥ T∗)]^{−1} ) dy dx.   (7.6.8)
Now note that, according to (7.6.5), the denominator in (7.6.8) is equal to f X,A (x, 1).
This allows us to continue (7.6.8) and write,
θ_j = E{ AY ϕ_j(X) / f^{X,A}(X, 1) }.   (7.6.9)
This is the key formula because Fourier coefficients are expressed as expectations. The
mixed density f^{X,A}(x, 1) = P(A = 1) f^{X|A}(x|1) can be estimated by (N/n) f̂^{X|A}(x|1) where
N := Σ_{l=1}^n A_l is the number of complete cases and f̂^{X|A}(x|1) is the density E-estimator of
Section 2.2 based on predictors in complete cases. This and (7.6.9) yield the plug-in sample
mean Fourier estimator

θ̂_j := N^{−1} Σ_{l=1}^n A_lY_l ϕ_j(X_l) / max(f̂^{X|A}(X_l|1), c/ln(n)).   (7.6.10)
In its turn, the Fourier estimator yields the regression E-estimator m̂(x).
If we look at formula (7.6.10) one more time, then it is not difficult to realize that the
regression E-estimator is based on complete cases. In other words, a complete-case approach
is consistent for the considered regression with truncated predictors and MAR responses.
This is a welcome conclusion for the otherwise complicated statistical setting.
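A minimal R sketch of the complete-case estimator (7.6.10) is shown below; f.hat.XA stands for a density estimate based on predictors in complete cases (the E-estimator of Section 2.2 in the book, but any density estimator may serve as a stand-in), and AY_l is assumed available whenever A_l = 1.

# Sketch of the complete-case Fourier estimator (7.6.10).
theta.hat.cc <- function(j, X, AY, A, f.hat.XA, c = 1) {
  n <- length(X)
  phi <- function(x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
  cc <- which(A == 1)                             # complete cases
  N <- length(cc)
  sum(AY[cc] * phi(X[cc]) / pmax(f.hat.XA(X[cc]), c / log(n))) / N
}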
Figure 7.7 illustrates the setting and the proposed estimation, and its caption explains
the simulation and the diagrams. The hidden left truncated data is shown in the top di-
agram. Note that the underlying predictor X ∗ is uniform on [0, 1], and despite this and
the large sample size of 300 observations, only a few LT predictors are less than 0.2. Note
that without LT there should be on average about 60 predictors smaller than 0.2. This
indicates very strong truncation, which makes a reasonable estimation of the left tail of a
regression function practically impossible. And indeed, the E-estimator does a very good
job in restoring the underlying regression apart from the left tail. The middle diagram shows
us the same data only with MAR responses. Cases with missed responses are highlighted
by crossed circles. As we see, the missing dramatically reduced the number of complete
cases from 300 to 213. Further, we may observe that now just a few complete cases have
predictors smaller than 0.3. To understand why, it is worthwhile to look at the availability
likelihood whose E-estimate and the corresponding scattergram are shown in the bottom
diagram. As we see, the availability likelihood is an increasing function with a small left
tail. This availability likelihood compounds the already complicated left-tail issue.
Note that estimates of nuisance functions, here the availability likelihood, may be a
useful statistical tool in analysis of modified data. Confidence bands are another useful
statistical tool, they are not shown here to avoid overcrowding the diagrams.
Now let us look at the top diagram in Figure 7.8 where the case αT ∗ > αX ∗ is considered.
Here we observe a simulation which is similar to the one in Figure 7.7, again the left tail of
the E-estimate is bad, and we know the root of the problem. The middle diagram exhibits
the availability likelihood (the solid line) and its E-estimate (the dashed line). The used
availability likelihood again compounds difficulties in estimation of the left tail. Nonetheless,
while the diagrams in Figures 7.7 and 7.8 look similar, in Figure 7.8 we are dealing with
destructive left truncation because T is uniform on [0.1, 0.8], and hence even if n → ∞ we
cannot recover the regression m(x) for x ∈ [0, 0.1). Again, compare the diagrams in Figures
7.7 and 7.8 and think about a possibility to recognize the fact that in the former simulation
the LT is nondestructive, while in the latter it is destructive.
Figure 7.7 Regression with left truncated predictors and then missing at random responses. The
top diagram shows the hidden left truncated scattergram created by the Uniform predictor X∗, the
response Y∗ having exponential distribution with the mean a + f (X∗) where f is a corner function,
here it is the Bimodal, and the truncation variable T∗ which is Uniform(0, uT). The solid and dashed
lines show the underlying regression and its E-estimate. The middle diagram shows us the available
data when the responses are missing according to the availability function w(x) defined below.
The incomplete cases are highlighted by crossed circles, N := Σ_{l=1}^n A_l is the number of complete
cases. The solid and dashed lines show the underlying regression and its E-estimate. The bottom
diagram shows the regression E-estimate of the availability likelihood based on the scattergram of
pairs (Xl , Al ), l = 1, . . . , n. [n = 300, corn = 3, a = 0.3, uT = 0.5, w = "0.1+0.8*exp(1+6*(x-
0.5))/(1+exp(1+6*(x-0.5)))", c = 1, cJ0 = 4, cJ1 = 0.5, cTH = 4]
We may conclude that in general it may be prudent to estimate the regression m(x) over
a data-driven interval. For instance, it may be the interval between the smallest and largest
available predictors. In our particular case, only the choice of the left boundary of the interval
of estimation is of interest, and Figure 7.8 allows us to choose the interval of estimation
[amanual , 1] manually. Namely, after the two top diagrams are drawn, the program stops
and we can visualize the data and the availability likelihood. Then, at the R prompt >, enter
a wished value for amanual , then press "enter", after the next R prompt > press "enter" and
the bottom diagram will appear. For the particular simulation in Figure 7.8, the value
amanual = 0.2 was chosen. The bottom diagram clearly shows the improvement in the left
tail of the E-estimate. Note that M = 232 is the number of complete cases with predictors
[Figure 7.8 appears here; its middle diagram is titled "Availability Likelihood".]
Figure 7.8 Left truncated predictors and MAR responses, the case αT∗ > αX∗. The simulation
and the diagrams, apart from their order, are the same as in Figure 7.7 with only two differences. The
former is that the truncation variable is now Uniform(ut , uT). The latter is that after exhibiting
the two top diagrams the program stops and asks for input of amanual that is used by the E-estimator
to calculate the E-estimate over the interval [amanual , 1]. The title shows the number N of complete
cases and the number M of predictors in complete cases belonging to the interval of estimation. {To
enter a wished value for amanual , which must be between 0 and 1, after the R prompt > type a wished
number, press "return", and after the next R prompt > press "return".} [n = 300, corn = 3, a = 0.3,
ut = 0.1, uT = 0.8, w = "0.1+0.8*exp(1+6*(x-0.5))/(1+exp(1+6*(x-0.5)))", c = 1, cJ0 = 4, cJ1
= 0.5, cTH = 4]
not smaller than amanual , in other words this is the number of complete cases used by the
E-estimator.
Figure 7.8 is a useful tool to learn about effects of the LT predictor and the MAR
response on regression estimation and how to choose a feasible interval of estimation. The
reader is advised to repeat it with different parameters and gain a necessary experience in
dealing with this complex problem.
Finally, let us again stress that the exponential regression considered in this and the previous
sections is a complicated regression model due to possibly large regression errors. Indeed,
an exponential variable with mean λ has variance equal to λ². For exponential regression
with regression function m(x) this implies that the conditional variance of the response is
Var(Y |X = x) = m²(x).
Note that here we are dealing with a rather intricate modification of the underlying sample
from (X ∗ , Y ∗ ).
To understand a possible solution of the regression problem, we begin with probability
formulas that shed light on the problem. Write,

P(U ≤ u, ∆ = 1, AY ≤ y, A = 1) = P(X∗ ≤ u, C∗ ≥ X∗, Y∗ ≤ y, A = 1, X∗ ≥ T∗) / P(U∗ ≥ T∗).   (7.7.3)
Suppose that the pairs (X∗, Y∗) and (C∗, T∗) are independent, the predictor X∗ and the response
Y∗ are continuous random variables, and X∗ has a continuous and positive density on its
support [0, 1]. Then we can continue (7.7.3) and write,

P(U ≤ u, ∆ = 1, AY ≤ y, A = 1)
= ∫_{−∞}^y [∫_0^u w(x) f^{X∗}(x) f^{Y∗|X∗}(v|x) P(T∗ ≤ x ≤ C∗) dx] dv / P(U∗ ≥ T∗).   (7.7.4)
By taking partial derivatives with respect to u and y, we get the corresponding joint
mixed density,

f^{U,∆,AY,A}(u, 1, y, 1) = [ w(u) f^{X∗}(u) P(T∗ ≤ u ≤ C∗) / P(U∗ ≥ T∗) ] f^{Y∗|X∗}(y|u).   (7.7.5)
In its turn, via integration with respect to y, formula (7.7.5) yields the marginal mixed
density

f^{U,∆,A}(u, 1, 1) = w(u) f^{X∗}(u) P(T∗ ≤ u ≤ C∗) / P(U∗ ≥ T∗).   (7.7.6)
Using (7.7.5) and (7.7.6) we conclude that for values of u such that

w(u) f^{X∗}(u) P(T∗ ≤ u ≤ C∗) > 0,   (7.7.7)

the ratio of (7.7.5) to (7.7.6) yields
f^{AY|U,∆,A}(y|u, 1, 1) = f^{Y∗|X∗}(y|u).
We conclude that, given (7.7.7) and the assumed independence between the pairs (X∗, Y∗) and
(C∗, T∗), the conditional distribution of the underlying response Y∗ given the underlying
predictor X∗ is the same as the conditional distribution of the observed AY given the ob-
served U in uncensored and complete cases. This is fantastic news for the considered regression
problem because the complicated problem of estimating a regression in the bottom layer of
hidden observations, based on threefold modified data, is converted into a standard regres-
sion for the above-described subsample of available data. As a result, the main complication
here is the smaller sample size of the subsample.
Let us check validity of the proposed solution via simulations. We begin with the case
of independent and continuous truncation and censoring variables. Figure 7.9 illustrates
the setting and the proposed estimation, and its caption explains the simulation and the
diagrams. The top diagram illustrates hidden LTRC data where only cases with uncen-
sored predictors are shown. Note that these cases may be used for consistent regression
E-estimation. The censoring reduced the original sample size n = 300 to N = 204. Fur-
ther, the underlying predictor X ∗ is uniform, and note how light the tails of the observed
predictors are after the LTRC modification. This is a glum reality of dealing with LTRC
predictors.
Because there may be no observations in the left and right tail that point upon an
underlying relationship between the predictor and the response, it is prudent to estimate
the regression only over the range of available uncensored predictors. The corresponding
E-estimate is shown by the dashed line, while the underlying regression is shown by the
solid line over the whole interval [0, 1]. The E-estimate is good, and this is also reflected
by its ISE shown in the title. Of course, we know the hidden observations only due to the
simulation; on the other hand, this regression is of interest on its own whenever no
missing of responses occurs.
The middle diagram sheds light on the MAR. It exhibits Bernoulli regression for the
problem of estimation of the underlying availability likelihood. The estimate is not used by
the regression E-estimator but it is an important tool to understand an underlying missing
mechanism. Note that the E-estimate is far from being perfect but it correctly describes
the data at hand. In other words, while we do know the underlying Bernoulli regression
(the solid line) thanks to the simulation, the E-estimator knows only the data and describes
them. It is important to note that here we are dealing with n = 300 pairs of observations.
Further, the diagram helps us to visualize LTRC predictors in complete cases.
[Figure 7.9 appears here; its middle diagram is titled "Availability Likelihood" and its bottom diagram "Data with MAR Response and LTRC Predictor, N = 202, M = 140, ISE = 0.12".]
Figure 7.9 Regression with left truncated and right censored predictors and MAR responses. Solid
and dashed lines in the diagrams are an underlying curve and its E-estimate, respectively. The top di-
agram shows by circles a subsample with uncensored predictors whose size N := Σ_{l=1}^n ∆_l is shown
in the title. In the simulation T∗ is Uniform(ut , uT), X∗ is the Uniform, C∗ is Uniform(uc , uC),
and Y∗ is generated as in Figure 7.7; see the parameters in the title. The E-estimate is calcu-
lated over the range of uncensored predictors, while the underlying regression (the solid line) is
shown over [0, 1]. The middle diagram shows the scattergram of (U, A) overlaid by the underly-
ing availability likelihood, controlled by the argument w, and its E-estimate. The bottom diagram
shows the scattergram from (U, AY, ∆ = 1), the regression function and its E-estimate. In the ti-
tle N := Σ_{l=1}^n A_l = 202 is the total number of available responses and M := Σ_{l=1}^n A_l∆_l = 140
is the number of complete pairs with uncensored predictor. Note that only 140 pairs are used to
calculate the uncensored-complete-case E-estimate, and these pairs are shown by circles. The E-
estimate is calculated over an interval defined by the range of uncensored predictors in complete
cases. {Distribution of T∗ is either the Uniform(ut , uT) or Exponential(λT) where λT is the mean.
Censoring distribution is either Uniform(uc , uC) or Exponential(λC). For instance, to choose expo-
nential truncation and censoring, set trunc = "Expon", cens = "Expon" and then choose wished
parameters. The parameters will be shown in the title and point upon underlying distributions.} [n
= 300, a = 0.3, corn = 3, trunc = "Unif", ut = 0, uT = 0.5, lambdaT = 0.3, uc = 0, uC = 1.5,
cens = "Unif", lambdaC = 1.5, w = "0.1+0.8*exp(1+6*(u-0.5))/(1+exp(1+6*(u-0.5)))", c=1,
cJ0 = 4, cJ1 = 0.5, cTH = 4]
The exhibited Bernoulli regression E-estimate indicates that the MAR of responses may
dramatically aggravate complexity of estimation of the left tail because the probability of
an incomplete pair increases for smaller values of the LTRC predictor. And indeed, let us
look at available MAR data exhibited in the bottom diagram. The E-estimator is based
solely on pairs shown by circles, and note that there are only 8 predictors with values less
than 0.2. Further, the title shows that there is a total of N = 202 complete pairs from
underlying n = 300, and among those only M = 140 with uncensored predictors.
Recall that the above-presented theory asserts that the conditional distribution of AY
given U in complete pairs with uncensored predictors coincides with the underlying con-
ditional distribution of Y ∗ given X ∗ . Hence, the regression functions are also the same.
This phenomenon is used by the proposed regression E-estimator based on the uncensored-
complete-case approach. Of course, the design densities of the underlying uniform X ∗ and
observed U , given A∆ = 1, are different, and we can clearly see this in the bottom diagram
where we have just a few observations in the left tail.
The regression E-estimator is constructed over the range of complete and uncensored
pairs (see the circles), it is fairly good and nicely indicates the two modes. Of course, it
is a challenge to estimate the left tail of the regression function due to the LTRC of the
predictor and the special shape of the availability likelihood.
The reader is advised to repeat this simulation with different parameters to gain expe-
rience in dealing with this complicated modification of regression data.
So far we have considered the case of independent and continuous censoring and trunca-
tion variables. The above-presented theory asserts that the proposed uncensored-complete-
case approach is valid also for a general case where these variables may be dependent
and have a mixed distribution. This setting is examined in Figure 7.10 where the model
C ∗ := T ∗ + W ∗ is considered. Here W ∗ := (1 − B)Z ∗ + BuC where Z ∗ is Uniform(uc , uC )
and B is an independent Bernoulli(pc ). Note that parameter pc controls the frequency with
which W ∗ is equal to the largest possible value of Z ∗ . A typical example is when censoring
may occur only after truncation, T ∗ is the baseline for a study of a lifetime X ∗ , and uC is
the length of the study when the lifetime is necessarily censored.
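The described censoring mechanism is easy to simulate; the following R lines, using the parameter names of the caption of Figure 7.10, are a minimal sketch of how T∗, W∗ and C∗ may be generated.

# Sketch of the dependent truncation/censoring model C* = T* + W* of Figure 7.10.
n <- 300; lambdaT <- 0.3; uc <- 0; uC <- 0.6; censp <- 0.2
Tstar <- rexp(n, rate = 1 / lambdaT)              # exponential truncation, mean lambdaT
B <- rbinom(n, 1, censp)                          # B = 1 forces W* to its largest value uC
Wstar <- (1 - B) * runif(n, min = uc, max = uC) + B * uC
Cstar <- Tstar + Wstar                            # censoring may occur only after truncation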
Let us look at the diagrams. The top diagram shows us the hidden simulated LTRC
data. Here we use T ∗ with exponential distribution, and this fact is highlighted by showing
its mean λT = 0.3 in the title of the top diagram. Note that in Figure 7.9, where uniform
truncation is used, its parameters ut and uT are shown. Compared with Figure 7.9, here the
exponential truncation may affect larger values of the predictor; note that in Figure 7.9
no values of X∗ larger than 0.5 are truncated. Further, here the censoring and truncation
variables are dependent, and the chosen joint distribution dramatically affects the number of
censored predictors. The title shows that only N = 150 of the underlying n = 300 predictors
are uncensored. This is indeed a dramatic decrease from what we have seen in Figure 7.9.
Also, please look at the tails with just a few predictors.
All these factors definitely affect the shown regression E-estimate. At the same time, the
E-estimate correctly indicates two modes of the underlying Bimodal regression, and even
their magnitudes are exhibited relatively well.
The middle diagram shows us the Bernoulli regression E-estimate of the availability
likelihood w(u). Note that here we are dealing with direct observations and a large sample
size n = 300. Nonetheless, the estimate looks bad. Why does it look bad? Because we
compare it with the underlying availability likelihood which generated the data. On the
other hand, the estimate clearly follows the data, and its left tail is well justified. Here
we again are dealing with a small number of predictors in the tails despite the Uniform
distribution of X ∗ . Why is this the case? Because U is the LTRC version of X ∗ , and as a
result U has lighter tails of its density. We may conclude that even the sample size n = 300
does not preclude us from anomalies that we see in the data.
The proposed E-estimator of the underlying regression (7.7.2) does not use the availabil-
ity likelihood and is based on the uncensored-complete-case approach. The bottom diagram
shows us a relatively fair estimate which indicates a regression with two modes. Note that
[Figure 7.10 appears here; its diagrams are titled "Hidden LTRC Data, n = 300, N = 150, lambdaT = 0.3, uc = 0, uC = 0.6, censp = 0.2, ISE = 0.15", "Availability Likelihood" and "Data with MAR Response and LTRC Predictor, N = 189, M = 98, ISE = 0.35".]
Figure 7.10 Regression with LTRC predictors and MAR responses when the censoring and truncat-
ing variables are dependent and C∗ = T∗ + W∗. In the simulation the variables X∗, T∗ and W∗ are
mutually independent. Variable W∗ is a mixture of Uniform(uc , uC) with a constant uC , namely
P(W∗ = uC) = pc and otherwise W∗ has the uniform distribution. Parameter pc is controlled
by the argument censp. The used truncation is exponential. Otherwise the simulation and the
structure of diagrams are identical to Figure 7.9. {Distribution of T∗ is either Uniform(ut , uT)
or Exponential(λT) where the parameter λT is the mean. To choose uniform truncation set trunc =
"Unif".} [n = 300, a = 0.3, corn = 3, trunc = "Expon", ut = 0, uT = 0.5, lambdaT = 0.3, uc
= 0, uC = 0.6, censp = 0.2, w = "0.1+0.8*exp(1+6*(u-0.5))/(1+exp(1+6*(u-0.5)))", c=1, cJ0
= 4, cJ1 = 0.5, cTH = 4]
there are only M = 98 pairs that are neither censored nor incomplete, and this is a relatively
small sample size for estimation of the Bimodal regression function. On top of the small
sample size, we are dealing with a small number of available predictors in the tails, and the
latter is caused by the LTRC predictors and MAR responses.
It is highly advisable to repeat Figure 7.10 with different parameters and gain first-hand
experience in dealing with this complicated modification of regression data.
7.8 Exercises
7.1.1 Explain a right censoring mechanism and present several examples.
7.1.2 Under the considered RC of a lifetime of interest X, is the censored data biased?
Prove your assertion analytically and define, if data is biased, the biasing function.
7.1.3 Explain the assumption (7.1.1). Hint: Think about possible MAR and MNAR mech-
anisms.
7.1.4 What happens if in (7.1.1) the probability on the left side depends on δ? Can a consistent
estimator be constructed in this case?
7.1.5 Explain each equality in (7.1.2). Then prove them.
7.1.6 There is some type of symmetry in (7.1.2) with respect to the lifetime of interest X
and the censoring variable C. Explain it. Further, explain how the symmetry may be used.
7.1.7 Is the function g(v), defined in (7.1.3), equal to the survival function of V ? If the
answer is “no” then explain the motivation behind its definition.
7.1.8 Explain each equality in (7.1.3). What are the used assumptions?
7.1.9 Verify equality (7.1.4). Is it valid if X and C are dependent?
7.1.10 Explain each equality in (7.1.5). Pay attention to assumptions.
7.1.11 Why do we have a restriction on x in (7.1.5)? Explain both mathematically and via
the underlying probability model.
7.1.12 Why do we assume (7.1.6)?
7.1.13 Is it possible to relax (7.1.6) if we want to estimate the distribution of X?
7.1.14 Explain the definition of the cumulative hazard. Hint: Use (7.1.7).
7.1.15 Write down (7.1.7) using the hazard rate function. What is the relationship between
the cumulative hazard and the hazard rate?
7.1.16 Can a hazard rate function take on negative values? Is it, similarly to a probability
density, integrated to 1?
7.1.17 Prove formula (7.1.8) which expresses the survival function via the cumulative haz-
ard. Is an assumption required?
7.1.18 Explain why formula (7.1.9) is valid and formulate assumptions.
7.1.19∗ Is the estimator (7.1.10) unbiased? What is its variance? What is the probability of
a large deviation?
7.1.20 Why is estimator (7.1.10) convenient for plugging in a denominator?
7.1.21 Explain how the availability likelihood affects distribution estimation.
7.1.22 Explain validity of formula (7.1.11). Why is this representation of the availability
likelihood important?
7.1.23 What is the motivation of the estimator (7.1.12)? What are the used assumptions?
7.1.24∗ Find asymptotic expressions for the mean and the variance of estimator (7.1.12).
Hint: Begin with known plug-in functions. Use assumptions.
7.1.25∗ Find asymptotic expressions for the mean and the variance of estimator (7.1.13).
Hint: Recall the delta method and begin with known plug-in functions.
7.1.26 Verify (7.1.14).
7.1.27 Verify (7.1.15).
7.1.28 Explain how the survival function GC may be estimated by a given estimator of
GX .
7.1.29∗ Verify relations in (7.1.16) and (7.1.17).
7.1.30∗ Calculate the mean and the variance of Fourier estimator (7.1.18).
7.1.31 Verify relations in (7.1.19). What are the used assumptions?
7.1.32∗ Explain how the Fourier estimator (7.1.20) is constructed. Evaluate its mean and
variance.
7.1.33 Is there a necessity to bound from below the denominator in (7.1.20)? Can a different
approach be proposed?
EXERCISES 275
7.1.34 Explain definition of the coefficient of difficulty (7.1.21), its role in nonparametric
estimation, and then verify relations in (7.1.21).
7.1.35∗ Explain the underlying idea of the estimator (7.1.22). Then evaluate its mean and
variance.
7.1.36 Formulate a sufficient condition for a consistent density estimation.
7.1.37 Explain the simulation used in Figure 7.1.
7.1.38 Repeat Figure 7.1 and conduct an analysis of the diagrams.
7.1.39 Explain how censoring affects estimation of the survival function. Use both the
theory and Figure 7.1 to justify your answer.
7.1.40∗ Propose better parameters of the E-estimator used in Figure 7.1. Does your answer
depend on an underlying density, censoring and sample size? Comment on your answer and
support it by simulations.
7.1.41 What is the difference between cases considered in Figures 7.1 and 7.2? Is a consistent
estimation possible in both cases?
7.1.42 Repeat Figure 7.2 with different parameters and present analysis of your observa-
tions. Do you have any suggestions on how to improve the estimators?
7.1.43 Explain limitations of estimators used in Figure 7.2.
7.2.1 What is the difference, if any, between MAR and MNAR?
7.2.2 Explain a right censoring model with the MNAR satisfying (7.2.1).
7.2.3 Present an example of a model considered in Section 7.2. Explain all variables and
the missing mechanism.
7.2.4 Does the availability likelihood (7.2.1) imply MAR or MNAR?
7.2.5∗ Explain why the considered MNAR is nondestructive. Hint: Recall a discussion in
Chapter 5.
7.2.6 Verify all equalities in (7.2.4).
7.2.7 Verify (7.2.5).
7.2.8 Compare (7.2.4) with (7.2.5) and explain why the expressions are different and there
is no traditional symmetry.
7.2.9 Explain why the cumulative hazard may be written as (7.2.6). Then comment on the
inequality.
7.2.10 Verify (7.2.7). Is any assumption needed?
7.2.11∗ Find the mean and the variance of estimator (7.2.8). Explore the probability of
large deviations.
7.2.12∗ Explain the underlying idea of the estimator (7.2.9) of the cumulative hazard. Then
evaluate its mean and variance.
7.2.13 Is it possible to increase the interval for which estimator (7.2.9) is proposed?
7.2.14 Explain how the survival function of X may be estimated.
7.2.15∗ Find the mean and variance of the estimator (7.2.10).
7.2.16 Explain the expression (7.2.11) for the Fourier coefficient of the density f X (x).
7.2.17 What is the underlying idea of Fourier estimator (7.2.13)?
7.2.18∗ Is it possible to avoid bounding from below the denominator in (7.2.13)?
7.2.19∗ Find the mean and variance of the estimator (7.2.13).
7.2.20 Explain the simulation used in Figure 7.3.
7.2.21∗ Repeat Figure 7.3 using different shapes of the availability likelihood function. Then
comment on shapes that benefit or worsen estimation of the density. Is your conclusion
robust toward an underlying density?
7.2.22 Repeat Figure 7.3 with different distributions of the censoring variable. Comment
on your observations.
7.2.23∗ Consider the case when the MNAR is defined by the hidden censoring variable C.
Propose estimators for the survival function and the density of the lifetime of interest. Hint:
276 MISSING DATA IN SURVIVAL ANALYSIS
Begin with writing down the assumption about the MNAR,
7.9 Notes
It is possible to consider censoring as a special example of missing when a logical NA (not
available), which indicates a missed value, is replaced by the value of a censoring vari-
able and by an indicator of this replacement. Truncation is also related to missing via a
hidden underlying sequential sampling with missing observations. Further, all these modifi-
cations imply biased data, and this sheds light on similarity in the estimation methodology.
Nonetheless, it is a long-standing tradition to consider survival analysis and missing data
as separate branches of statistical science.
A review of the literature devoted to analysis of truncated and censored data by methods
of the theory of missing data can be found in the book by van Buuren (2012) where a number
of imputation procedures and software packages are discussed. The prevention of missing data in
clinical trials is discussed in Little et al. (2012). An example of the treatment of missing
data in survival analysis of a large clinical study can be found in Little R. et al. (2016).
Klein et al. (2014), Allison (2014), Harrell (2015), and Little T.D. et al. (2016) cover a
number of interesting settings and practical examples.
The literature on optimal (efficient) nonparametric estimation for missing LTRC data
is practically nonexistent, and this is an interesting and new area of research; see Efromovich (2017). It is reasonable to conjecture that the E-estimation approach yields efficient
estimation for considered problems.
7.1-7.2 For the literature, Chen and Cai (2017) and Zou and Liang (2017) are among
more recent publications where further references may be found. Applications and the Bayesian
approach are discussed in Allen (2017). Functional data analysis is an interesting extension,
see Kokoszka and Reimherr (2017).
7.3-7.7 Developing the asymptotic theory of efficient nonparametric regression estima-
tion for missing survival data is an open problem. It is possible to conjecture that the
E-estimation methodology is still efficient and implies sharp minimax results. Here the
asymptotic theory developed in Efromovich (1996a; 2000b; 2011a,b,d; 2012a; 2013a; 2014c,f;
2016a,b; 2017; 2018a,b) will be instrumental. A book-length treatment of interval-censored
failure time data can be found in Sun and Zhao (2013).
Sequential estimation is a natural approach for the considered settings. Following Efro-
movich (2007d,e; 2008a,c; 2009c) it is possible to consider estimation with assigned risk. The
problem becomes complicated because now both the missing and the underlying LTRC af-
fect the choice of optimal sequential estimator. Multivariate regression is another attractive
topic of research. See also Efromovich (1980a,b; 1989; 2004b,d,g; 2007f,g).
Chapter 8
Time Series
So far we have considered cases of samples where observations are independent. In many sta-
tistical applications the observations are dependent. Time series, stochastic process, Markov
chain, Brownian motion, mixing, weak dependence and long-memory are just a few exam-
ples of the terminology used to describe dependent observations. Dependency may create
dramatic complications in statistical analysis that should be understood and taken into
account. It is worthwhile to note that, in a number of practically interesting cases, depen-
dent observations may be considered as a modification of independent ones, and this is
exactly how many classical stochastic processes are defined and/or generated. Further, the
dependency will allow us to test the limits of the proposed E-estimation methodology.
Dependent observations are considered in this and the next chapter. This chapter is devoted
to stationary time series with the main emphasis on estimation of the spectral density
and missing data, while the next chapter is primarily devoted to nonstationary dependent
observations.
Sections 8.1 and 8.2 serve as the introduction to dependent observations and the spectral
density, respectively. Sections 8.3 and 8.4 consider estimation of the spectral density for time
series with missing observations. In particular, Section 8.4 presents a case of destructive
missing. Estimation of the spectral density for censored time series is discussed in Section
8.5. Probability density estimation for dependent observations is explored in Section 8.6.
Finally, Section 8.7 is devoted to the problem of nonparametric autoregression.
A time series is called strictly stationary (or stationary) if the joint distribution of
(Xs1 , Xs2 , . . . , Xsm ) and (Xs1+k , Xs2+k , . . . , Xsm+k ) is the same for all sets (s1 , . . . , sm )
and all integers k, m. In other words, a shift in time does not change the joint distribution
and thus the time series is stationary in time. Note that no assumption about moments
is made, and for instance a time series of independent realizations of a Cauchy random
variable is a strictly stationary time series.
A time series {Xt } is called zero-mean if E{Xt } = 0 for all t. Note that a zero-mean time
series assumes existence of the first moment, but no other assumptions about moments or
the distribution are made.
A time series {Xt } := {. . . , X−1 , X0 , X1 , . . .} is called a second-order stationary time series
if: (i) E{Xt²} < ∞ for all t, that is, the second moment is finite; (ii) E{Xt } = µ for all t,
that is, the expectation is constant; (iii) the autocovariance function γ X (l, s) := E{(Xl −
µ)(Xs − µ)} satisfies the relation γ X (l, s) = γ X (l + k, s + k) for all integers l, s, and k,
that is, a translation in time does not affect the autocovariance function. The property (iii)
implies that γ X (l, s) =: γ X (l − s) = γ X (s − l). To see this just set k = −s and k = −l. Thus
a zero-mean and second-order stationary time series is characterized by its autocovariance
function γ^X(k) at the lag k, and further there is a nice relation γ^X(0) = E{Xt²} = V(Xt )
which holds for all t. Also note that no assumptions about higher moments are made for a
second-order stationary time series.
Now let us comment about estimation of the mean of a second-order stationary time
series {Yt } := {µ + Xt } where {Xt } is a zero-mean and second-order stationary time series
with the autocovariance function γ X (t) := E{(Xt − E{Xt })(X0 − E{X0 })} = E{Xt X0 }.
Suppose that we observe a realization Y1 , . . . , Yn of {Yt } and estimate µ by the sample mean µ̄ := n^{−1} Σ_{l=1}^{n} Y_l. First of all we note that
V(µ̄) = E{[n^{−1} Σ_{l=1}^{n} (Y_l − µ)]²} = E{[n^{−1} Σ_{l=1}^{n} X_l]²}.   (8.1.3)
The squared sum in the right side of (8.1.3) may be written as a double sum, and we
continue (8.1.3),
V(µ̄) = E{[n^{−1} Σ_{l=1}^{n} X_l]²} = E{n^{−2} Σ_{l,t=1}^{n} X_l X_t}
= n^{−2} Σ_{l,t=1}^{n} γ^X(l, t) = n^{−1} [n^{−1} Σ_{l=1}^{n} {Σ_{t=1}^{n} γ^X(l − t)}].   (8.1.4)
Note that only in the last equality we used the second-order stationarity of {Xt }.
Relation (8.1.4) is the result that we need. Let us consider several possible scenarios. If
the observations are independent, then γ^X(l) = 0 for any l ≠ 0, and (8.1.4) together with
γ^X(0) = V(X1) imply the familiar formula V(µ̄) = n^{−1} V(X1). Similarly, if Σ_{l=0}^{∞} |γ^X(l)| <
c < ∞ then V(µ̄) < cn^{−1} and we again have the parametric rate of the variance convergence.
In the latter case the stochastic process may be referred to as a short-memory time series.
However, if the sum Σ_{l=0}^{n} γ^X(l) diverges, then we lose the rate n^{−1}. For instance, consider
the case of a long-memory time series when γ^X(t) is proportional to |t|^{−α}, 0 < α < 1. Then
the sum in the curly brackets on the right side of (8.1.4) is proportional to n^{1−α} and the
variance is proportional to n^{−α}. This is a dramatic slowing down of the rate of convergence
caused by dependence between observations. The above-explained phenomenon sheds light
on complexity of dealing with stochastic processes.
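To make this effect concrete, here is a minimal R sketch (my own illustration, not the book's software) that evaluates the right side of (8.1.4) for three assumed autocovariance functions: an independent case, a summable (short-memory) case, and a long-memory case.

# Variance of the sample mean via (8.1.4): V(mu-bar) = n^{-2} * sum over l,t of gamma(l - t).
var_sample_mean <- function(gammafun, n) {
  lags <- outer(1:n, 1:n, "-")                          # matrix of lags l - t
  sum(gammafun(lags)) / n^2
}
n <- 1000
g_iid   <- function(l) ifelse(l == 0, 1, 0)             # independent observations
g_short <- function(l) 0.5^abs(l)                       # summable autocovariance
g_long  <- function(l) ifelse(l == 0, 1, abs(l)^-0.5)   # long memory with alpha = 0.5
c(iid = var_sample_mean(g_iid, n),                      # about 1/n
  short = var_sample_mean(g_short, n),                  # still of order 1/n
  long = var_sample_mean(g_long, n))                    # of order n^{-0.5}, far larger

The long-memory variance is far larger than n^{−1}, in line with the n^{−α} rate discussed above.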
The simplest zero-mean and second-order stationary time series is a process in which the
random variables {Xt } are uncorrelated (that is, γ^X(t) = 0 for t ≠ 0) and have zero mean
and unit variance. Let us denote this time series as {Wt } and call it a standard (discrete
time) white noise. A classical example is a time series of independent standard Gaussian
random variables, which is the white noise that will be used in the following simulations,
and we call it a standard Gaussian white noise.
In its turn, a white noise allows us to define a wide variety of dependent second-order
stationary and zero-mean processes via a set of linear difference equations. This leads us
to the notion of an autoregressive moving average process of orders p and q, an ARMA(p,
q) process for short. By definition, the process {Xt , t = . . . , −1, 0, 1, . . .} is said to be an
ARMA(p, q) process if {Xt } is zero-mean and second-order stationary, and for every t
X_t − a_1 X_{t−1} − . . . − a_p X_{t−p} = σ(W_t + b_1 W_{t−1} + . . . + b_q W_{t−q}),   (8.1.5)
where {Wt } is a standard white noise, σ > 0, the orders p and q are nonnegative integers,
and a_1 , . . . , a_p , b_1 , . . . , b_q are real numbers. For the case of a Gaussian white noise we shall
refer to the corresponding ARMA process as a Gaussian ARMA process.
Two particular classical examples of an ARMA process are a moving average MA(q)
process, which is a moving average of q + 1 consecutive realizations of a white noise,
X_t = σ(W_t + b_1 W_{t−1} + . . . + b_q W_{t−q}), and an autoregressive AR(p) process satisfying the difference equation
X_t − a_1 X_{t−1} − . . . − a_p X_{t−p} = σW_t.   (8.1.7)
The MA and AR processes play an important role in the analysis of time series. For
instance, prediction of values {Xt , t ≥ n + 1} in terms of {X1 , . . . , Xn } is relatively sim-
ple and well understood for an autoregressive process because E{Xt |Xt−1 , Xt−2 , . . .} =
a1 Xt−1 + . . . + ap Xt−p . Also, for a given autocovariance function it is simpler to find an
AR process with a similar autocovariance function. More precisely, if an autocovariance
function γ X (j) vanishes as j → ∞, then for any integer k one can find an AR(k) process
with the autocovariance function equal to γ X (j) for |j| ≤ k. The “negative” side of an AR
process is that it is not a simple issue to find a stationary solution for (8.1.7), and moreover,
it may not exist. For instance, the difference equation Xt − Xt−1 = σWt has no stationary
solution, and consequently there is no AR(1) process with a1 = 1. A thorough discussion of
this issue is beyond this short introduction, and in what follows a range for the coefficients
that “keeps us out of trouble” will be always specified.
The advantages of a moving average process are its simple simulation, the given expres-
sion for a second-order stationary solution, and that it is very close by its nature to white
noise, namely, while realizations of a white noise are uncorrelated, realizations of an MA(q)
284 TIME SERIES
process are also uncorrelated whenever the lag is larger than q. The disadvantages, with
respect to AR processes, are more complicated procedures for prediction and estimation
of parameters. Thus, among the two, typically AR processes are used for modeling and
prediction. Also, AR processes are often used to approximate an ARMA process.
For a time series, and specifically ARMA processes, the notion of causality (future
independence) plays an important role. The idea is that for a causal ARMA process {Xt } (or
more specifically, a causal process with respect to an underlying white noise {Wt }) it is quite
natural to expect that an ARMA time series {Xt } depends only on current and previous
(but not future!) realizations of the white noise. Thus, we say that an ARMA process {Xt }
generated by a white noise {Wt } is causal if X_t = Σ_{j=0}^{∞} c_j W_{t−j}, where the coefficients c_j are
absolutely summable. Clearly, MA(q) processes are causal, but not all AR(p) processes are;
for instance, a stationary process corresponding to the difference equation Xt − 2Xt−1 =
Wt is not causal. We shall not elaborate more on this issue and only note that in what
follows we are considering simulations of Gaussian ARMA(1,1) processes corresponding
to the difference equation X_t − aX_{t−1} = σ(W_t + bW_{t−1}) with |a| < 1 and −a ≠ b. It
may be directly verified that for such an a this equation has a stationary and causal solution
X_t = σW_t + σ(a + b) Σ_{j=1}^{∞} a^{j−1} W_{t−j}.
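As an aside, the ARMA(1,1) recursion just displayed is easy to simulate directly; the following R sketch is my own illustration (with an assumed burn-in period so that the realization is approximately stationary), not the book's simulation code.

# Simulate a Gaussian ARMA(1,1) series X_t - a X_{t-1} = sigma * (W_t + b W_{t-1}).
simARMA11 <- function(n, a, b, sigma, burn = 200) {
  stopifnot(abs(a) < 1)                       # stay in the causal, stationary case
  W <- rnorm(n + burn + 1)                    # standard Gaussian white noise
  X <- numeric(n + burn + 1)
  for (t in 2:(n + burn + 1))
    X[t] <- a * X[t - 1] + sigma * (W[t] + b * W[t - 1])
  X[(burn + 2):(n + burn + 1)]                # drop the burn-in part
}
set.seed(1)
x <- simARMA11(n = 120, a = -0.3, b = -0.6, sigma = 0.5)   # parameters used in Figure 8.1
acf(x, lag.max = 5, type = "covariance", plot = FALSE)     # sample autocovariances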
As we know from our discussion of (8.1.4), for a statistical inference it is important to
know how fast autocovariance function γ X (k) decreases in k. Introduce a class of autoco-
variance functions
Let us present several general properties and examples of mixing coefficients. Mixing
coefficient α^X(s) is either positive or equal to zero, and it is nonincreasing in s. For the case of a
stationary time series {Xt } of independent variables we have αX (s) = 0. Another important
example is an m-dependent series satisfying αX (s) = 0 for s > m. The meaning of m-
dependence is that variables, separated in time for more than m time-units, are independent.
Another class of time series, often considered in the mixing theory, is when for τ > 8
Σ_{s=0}^{∞} (s + 1)^{τ/2−1} α^X(s) ≤ Q < ∞.   (8.1.11)
These are the types of results that allow us to develop statistical inference in problems
with stationary dependent variables; more results (including rigorous measure-theoretical
formulations) may be found in the references mentioned in the Notes.
The reader may also recall that a discrete time Markov chain is another classical ap-
proach for modeling and analysis of dependent observations. It will be briefly discussed in
Section 8.3.
Here the frequency λ is in units radians/time, and to establish the equality (8.2.2) we used
the relation γ X (−j) = γ X (j). The spectral density is symmetric in λ about 0, i.e., the
spectral density is an even function. Thus, it is customary to consider a spectral density on
the interval [0, π]. The spectral density is also a nonnegative function (like the probability
density), and this explains why it is called a density.
One of the important applications of the spectral density is searching for a deterministic
periodic component (often referred to as a seasonal component) in nonstationary time series.
Namely, a peak in g X (λ) at frequency λ∗ indicates a possible periodic phenomenon with
period
T∗ = 2π/λ∗.   (8.2.3)
This formula explains why spectral domain analysis is the main tool in searching for the
period of a seasonal component, and the estimator will be discussed in Section 9.3.
Let us explain how the spectral density may be estimated using our E-estimation
methodology. But first let us pause for a moment and stress the following important remark.
By its definition, spectral density is a cosine series, and this is an example where the basis
is chosen not due to its convenience, as we did in the cases of E-estimation of the probability
density and the regression, but due to definition of the estimand. In other words, spectral
density estimation is the most appealing example of using a series approach and the cosine
basis. Now let us return to estimation of the spectral density.
Denote by X1 , . . . , Xn the realization of a second-order stationary and zero-mean time
series. The classical empirical autocovariance estimator is defined as
γ̃^X(j) := n^{−1} Σ_{l=1}^{n−j} X_l X_{l+j},  j = 0, 1, . . . , n − 1,   (8.2.4)
while the sample mean autocovariance estimator uses the divisor n − j,
γ̂^X(j) := (n − j)^{−1} Σ_{l=1}^{n−j} X_l X_{l+j},  j = 0, 1, . . . , n − 1.   (8.2.5)
Note that in the empirical autocovariance the divisor n is not equal to the number n − j of
terms in the sum, and hence it is a biased estimator. On the other hand, this divisor ensures
that an estimate corresponds to some second-order stationary series. For all our purposes
there is no difference between using the two estimators, but it is always a good idea to check
which one is used by a statistical software. In what follows, proposed E-estimators will be
based on the sample mean autocovariance estimator (8.2.5), and the reason is to follow
our methodology of sample mean estimation. On the other hand, many classical spectral
estimators, like the periodogram discussed below, use the estimator (8.2.4).
Based on (8.2.2), if one wants to estimate a spectral density and is not familiar with
basics of nonparametric estimation discussed in Chapter 2, it is natural to plug the empirical
autocovariance (8.2.4) in place of unknown autocovariance. And sure enough, this approach
is well known and the resulting estimator (up to the factor 1/2π) is called a periodogram,
I^X(λ) := γ̃^X(0) + 2 Σ_{j=1}^{n−1} γ̃^X(j) cos(jλ) = n^{−1} |Σ_{l=1}^{n} X_l e^{−ilλ}|².   (8.2.6)
Here i is the imaginary unit, i.e., i2 := −1, eix = cos(x) + i sin(x), and the periodogram
is defined at the so-called Fourier frequencies λk := 2πk/n, where k are integers satisfying
−π < λk ≤ π.
The periodogram, as a tool for spectral-domain analysis, was proposed in the late nineteenth
century. It has been both the glory and the curse of spectral analysis. The glory, because
many interesting practical problems were solved at a time when no computers were avail-
able. The curse, because the periodogram, which had demonstrated its value for locating
periodicities (recall (8.2.3)), proved to be an erratic and inconsistent estimator. The reason
for the failure of the periodogram is clear from the point of view of nonparametric curve
estimation theory discussed in Chapter 2. Indeed, based on n observations, the periodogram
estimates n Fourier coefficients, and this explains the erratic performance and inconsistency.
Nonetheless, it is still a popular estimator.
Using the sample mean estimator (8.2.5), we may use the E-estimator of Section 2.2
for estimation of the spectral density. Of course, the theory of E-estimation was explained
for the case of independent observations, but as we will see shortly, it can be extended to
dependent observations. We begin with a simulated example which sheds light on
the problem, the periodogram and the E-estimator, and then explore the theory.
Figure 8.1 allows us to visualize an ARMA process, its spectral density and the two
above-defined estimates of the spectral density. A particular realization of the Gaussian
ARMA(1, 1) time series Xt +0.3Xt−1 = 0.5(Wt −0.6Wt−1 ) is shown in the top diagram. Note
how fast observations oscillate over time. This is because here the covariance between Xt and
Xt−1 is negative. This follows from the following formula for calculating the autocovariance
function of the causal ARMA(1, 1) process X_t − aX_{t−1} = σ(W_t + bW_{t−1}) with |a| < 1: γ^X(0) = σ²(1 + 2ab + b²)/(1 − a²), γ^X(1) = σ²(a + b)(1 + ab)/(1 − a²), and γ^X(j) = a^{j−1}γ^X(1) for j ≥ 1.
Note that if a > 0 and b > 0, then γ(1) > 0, and a realization of the time series will “slowly”
change over time. On the other hand, if a + b < 0 and 1 + ab > 0 then a realization of the
time series may change its sign almost every time. Thus, depending on a and b, we may see
either slow or fast oscillations in a realization of an ARMA(1, 1) process. Figure 8.1 allows
us to change parameters of the ARMA process and observe different interesting patterns in
this pure stochastic process. In particular, to make the process slowly changing and even
see interesting repeated patterns in a time series, choose positive parameters a and b.
The solid line in the bottom diagram of Figure 8.1 shows us the underlying theoretical
spectral density of the ARMA(1, 1) process. As we see, because here both a and b are
negative, in the spectral domain high frequencies dominate low frequencies. The formula
for calculating the spectral density is g X (λ) = σ 2 |1 + beiλ |2 /[2π|1 − aeiλ |2 ], and it is a
particular case of the following formula for a causal ARMA(p, q) process (8.1.5),
g^X(λ) = (σ²/(2π)) |1 + Σ_{j=1}^{q} b_j e^{−ijλ}|² / |1 − Σ_{j=1}^{p} a_j e^{−ijλ}|².   (8.2.8)
The middle diagram shows us that the periodogram has a pronounced mode at frequency
λ∗ ≈ 2.6 which, according to (8.2.3), indicates a possibility of a deterministic periodic
(seasonal) component with the period which is either 2 or 3. One may see or not see such a
component in the data, but thanks to the simulation we do know that there is no periodic
component in the data. The reader is advised to repeat this figure and get used to reading
a periodogram because it is commonly used by statistical software. The bottom diagram
exhibits the spectral density E-estimate which correctly shows the absence of any periodic
Figure 8.1 ARMA(1,1) time series and two estimates of the spectral density. The top diagram
shows a particular realization of a Gaussian ARMA(1,1) time series Yt − aYt−1 = σ(Wt + bWt−1 ),
t = 1, 2, . . . , n, where a = −0.3, b = −0.6, σ = 0.5, and n = 120. The middle diagram shows the
periodogram. The spectral density E-estimate (the dashed line) and the underlying spectral density
(the solid line) are exhibited in the bottom diagram. {The length n of a realization is controlled
by the argument n. Parameters of simulated ARMA(1,1) process are controlled by the arguments
sigma, a, and b. Use |a| < 1. All the other arguments control parameters of the E-estimator. Note
that the string sp is added to these arguments to indicate that they control coefficients of the spectral
density E-estimator.} [n = 120, sigma = 0.5, a = -0.3, b = -0.6, cJ0sp = 2, cJ1sp = 0.5, cTHsp
= 4]
(seasonal) component, and the E-estimate nicely resembles the underlying spectral density.
Please pay attention to the relatively small sample size n = 120. Again, it is important
to repeat this figure, with different sample sizes and for different ARMA processes, to get
first-hand experience in spectral analysis of stationary time series.
We finish this section with a theoretical analysis of the MISE of a series estimator.
This is an interesting and technically challenging problem because the relatively simple
technique of inference for a sum of independent variables is no longer applicable.
First, let us begin with an example of using the Parseval identity. We can write that
∫_{−π}^{π} [g^X(λ)]² dλ = (2π)^{−1} [γ^X(0)]² + π^{−1} Σ_{j=1}^{∞} [γ^X(j)]².   (8.2.9)
Similarly, for a series estimator
ḡ^X(λ, J) := (2π)^{−1} γ̂^X(0) + π^{−1} Σ_{j=1}^{J} γ̂^X(j) cos(jλ),   (8.2.10)
its MISE may be written as
MISE(ḡ^X(λ, J), g^X(λ)) = [(2π)^{−1} E{[γ̂^X(0) − γ^X(0)]²} + π^{−1} Σ_{j=1}^{J} E{[γ̂^X(j) − γ^X(j)]²}] + π^{−1} Σ_{j>J} [γ^X(j)]².   (8.2.11)
In (8.2.11) the term in the large square brackets is the integrated variance (or simply
variance) of ḡ X (λ, J), and the last term is the integrated squared bias of ḡ X (λ, J). This is
a classical decomposition of the MISE (recall our discussion in Chapter 2).
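The series estimator (8.2.10) is straightforward to compute once the sample mean autocovariances are in hand. The following R sketch uses a cutoff J fixed by hand, whereas the E-estimator of Section 2.2 would select the cutoff (and threshold the Fourier coefficients) from the data; the function names below are mine.

# Cosine-series spectral density estimator (8.2.10) with a fixed cutoff J; a sketch only.
spec_series <- function(x, J, lambda) {
  n <- length(x)
  gam <- sapply(0:J, function(j) sum(x[1:(n - j)] * x[(1 + j):n]) / (n - j))  # sample mean (8.2.5)
  sapply(lambda, function(la) gam[1] / (2 * pi) + sum(gam[-1] * cos((1:J) * la)) / pi)
}
set.seed(3)
x <- as.numeric(arima.sim(list(ar = -0.3, ma = -0.6), n = 120, sd = 0.5))
lambda <- seq(0, pi, length.out = 50)
plot(lambda, spec_series(x, J = 5, lambda), type = "l",
     xlab = "frequency", ylab = "estimated spectral density")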
Now we need to learn the technique of inference for a sum of dependent variables. In
what follows we are considering j < n and assume that {Xt } is a Gaussian zero-mean and
second-order stationary time series. Recall that the expectation of a sum is always the sum
of expectations, and hence the mean of the sample mean autocovariance (8.2.5) is
E{γ̂^X(j)} = (n − j)^{−1} Σ_{l=1}^{n−j} E{X_l X_{l+j}} = γ^X(j).   (8.2.12)
We conclude that the sample mean estimator (8.2.5) is unbiased and that the dependence
does not change this nice property of a sample mean estimator. Now we are considering the
variance of the sample mean estimator, and this problem is more involved because we need
to learn several technical steps. Write,
V(γ̂^X(j)) = (n − j)^{−2} E{[Σ_{l=1}^{n−j} (X_l X_{l+j} − γ^X(j))]²}.   (8.2.13)
This is the place where we may either continue by going from the squared sum to a double
sum and then do a number of calculations, or use the following nice formula. Consider
a zero-mean and second-order stationary time series {Zt }. Then, using the technique of
(8.1.3)-(8.1.4) we get a formula
E{[Σ_{l=1}^{k} Z_l]²} = Σ_{l=−k}^{k} (k − |l|) E{Z_0 Z_l}.   (8.2.14)
Note how simple, nice and symmetric the formula is, and it allows us to write down the
variance of a sum of dependent variables as a sum of expectations. Using (8.2.14) in (8.2.13)
we get for j < n,
V(γ̂^X(j)) = (n − j)^{−2} Σ_{l=−(n−j)}^{n−j} (n − j − |l|)[E{X_0 X_j X_l X_{l+j}} − (γ^X(j))²].   (8.2.15)
Our next step is to understand how to evaluate terms E{X0 Xj Xl Xl+j }. In Section 8.1
several possible paths, depending on the assumption about dependency, were discussed.
Here it is assumed that the time series is Gaussian, and this implies the following nice
formula (a particular case of the classical identity for products of four zero-mean jointly Gaussian variables): E{X_0 X_j X_l X_{l+j}} = [γ^X(j)]² + [γ^X(l)]² + γ^X(l + j)γ^X(l − j).
We need to add one more assumption that holds for all short-memory time series (in-
cluding ARMA)
Σ_{l=0}^{∞} |γ^X(l)| < ∞.   (8.2.18)
Note that the Cauchy-Schwarz inequality implies that |γ X (j)| ≤ γ X (0), and this together
with (8.2.18) yields another useful inequality
Σ_{l=0}^{∞} [γ^X(l)]² ≤ γ^X(0) Σ_{l=0}^{∞} |γ^X(l)| < ∞.   (8.2.19)
Using these two inequalities in (8.2.17), together with the Cauchy inequality
where c_{n,j} → 0 as both n and j increase in such a way that j < J_n = o_n(1)n, and
d(g^X) := [γ^X(0)]² + 2 Σ_{l=1}^{∞} [γ^X(l)]² = 2π ∫_{−π}^{π} [g^X(λ)]² dλ.   (8.2.22)
This is a general expression for the MISE of a series estimator ḡ X (λ, Jn ) defined in
(8.2.10). To simplify it further, we need to add an assumption which will allow us to evalu-
ate the second term (the ISB) on the right side of (8.2.23). As an example, let us additionally
assume that the considered Gaussian time series is ARMA. Then the autocovariance func-
tion belongs to a class A(Q, q, β, r) defined in (8.1.8). For this class the assumption (8.2.18)
holds, we can bound from above the sum in (8.2.23) and get the upper bound for the MISE,
Note that this choice of the cutoff makes the integrated squared bias asymptotically
smaller in order than the variance. This is an important property which is typical for
ARMA processes. Further, this also means that in (8.2.25) we can replace the inequality by
an equality and get
Further, note that Jn0 uses only parameter r of the class (8.1.8).
Asymptotic theory shows that no other estimator can improve the right side of (8.2.26)
uniformly over the class (8.1.8), namely for any (not necessarily series) estimator ǧ X (λ) the
following lower bound holds,
sup_{g^X ∈ A(Q,q,β,r)} { MISE(ǧ^X(λ), g^X(λ)) / ∫_{−π}^{π} [g^X(λ)]² dλ } ≥ r^{−1} ln(n)n^{−1} (1 + o_n(1)).   (8.2.27)
The right sides of (8.2.26) and (8.2.27) coincide up to a factor 1 + on (1), and this allows
us to say that the lower bound (8.2.27) is sharp and that the series estimator ḡ X (λ, Jn0 )
is asymptotically minimax (optimal, efficient). Of course, Jn0 depends on parameter r of
the analytic class, and this is why the E-estimator chooses a cutoff using data, or we may
say that E-estimator is an adaptive estimator because it adapts to an underlying class of
spectral densities.
There is another interesting conclusion from (8.2.23) that can be made about a reason-
able estimator that does not require adaptation to the class (8.1.8) and may choose a cutoff
Jn a priori before getting data. Let us consider Jn = Jn∗ which is the largest integer smaller
than (cn /2) ln(n) where cn → ∞ as slow as desired, say cn = ln(ln(ln(n))). Then a direct
calculation shows that
MISE(ḡ^X(λ, J_n^∗), g^X(λ)) / ∫_{−π}^{π} [g^X(λ)]² dλ = c_n ln(n)n^{−1} (1 + o_n(1)).   (8.2.28)
Note that the rate of the MISE convergence is just “slightly” slower than the minimax rate
ln(n)n−1 .
As we have seen, estimation of spectral densities is an exciting topic with rich history,
fascinating asymptotic theory, and numerous practical applications.
Yl := Al Xl , l = 1, 2, . . . , n. (8.3.1)
To conclude our brief introduction to Markov chains, let us note that a Markov chain of
order m (or a Markov chain with memory m), where m is a finite positive integer, is a
process satisfying
P(At = at |At−1 = at−1 , At−2 = at−2 , . . .)
= P(At = at |At−1 = at−1 , At−2 = at−2 , . . . , At−m = at−m ).
In other words, only the last m past states define the future state. Note that the classical Markov
chain may be referred to as the Markov chain of order 1.
Figure 8.2 Markov–Bernoulli missing mechanism. Markov chain is generated according to transition
probabilities P(At+1 = 0|At = 0) = α and P(At+1 = 1|At = 1) = β. The top diagram shows n
realizations of an underlying Gaussian ARMA(1,1) time series Xt defined in Figure 8.1 only here
parameters are a = 0.4, b = 0.5, σ = 0.5. The middle diagram shows the observed time series
{At Xt } with missing observations. Set N_j := Σ_{l=1}^{n−j} A_l A_{l+j}, and then N := N_0 and N_min :=
min_{0≤j≤cJ0sp+cJ1sp ln(n)} N_j are shown in the title. The bottom diagram shows by the solid line the
underlying spectral density as well as the following three estimates. The proposed E-estimate is
shown by the dashed line. The naı̈ve E-estimate (the dotted line) is based on available observations
of time series {At Xt } shown in the middle diagram. Oracle’s E-estimate, based on the hidden
underlying realizations of {Xt }, is shown by the dot-dashed line. [n = 240, sigma = 0.5, a = 0.4,
b = 0.5, alpha = 0.4, beta = 0.8, cJ0sp = 2, cJ1sp = 0.5, cTHsp = 4]
Using the assumed second-order stationarity of {Xt } and stationarity of {At }, we can
evaluate the expectation of the sample mean autocovariance. Write,
E{γ̂^Y(j)} = E{(n − j)^{−1} Σ_{l=1}^{n−j} (A_l X_l)(A_{l+j} X_{l+j})}
= (n − j)^{−1} Σ_{l=1}^{n−j} E{A_l A_{l+j}} E{X_l X_{l+j}}
= γ^X(j)[(n − j)^{−1} Σ_{l=1}^{n−j} E{A_l A_{l+j}}] = γ^X(j) E{A_1 A_{1+j}}.   (8.3.4)
Note that the expectation E{A1 A1+j }, which we see in (8.3.4), looks like the autocovariance
but it is not because the expectation of At is not zero.
We conclude that the expectation of the estimator γ̂ Y (j), based on the observed process
Yt , is the product of the underlying autocovariance of interest γ X (j) and the function
E{A1 A1+j } = P(A1 A1+j = 1). The function E{A1 A1+j } can be estimated by its sample
mean, and this yields the following plug-in sample mean estimator of the autocovariance of
interest γ X (j),
γ̂^X(j) := [(n − j)^{−1} Σ_{l=1}^{n−j} Y_l Y_{l+j}] / [(n − j)^{−1} Σ_{l=1}^{n−j} I(Y_l Y_{l+j} ≠ 0)] = Σ_{l=1}^{n−j} Y_l Y_{l+j} / Σ_{l=1}^{n−j} I(Y_l Y_{l+j} ≠ 0).   (8.3.5)
Similarly to Section 4.1, we need to comment on the term N_j := Σ_{l=1}^{n−j} I(Y_l Y_{l+j} ≠ 0)
= Σ_{l=1}^{n−j} A_l A_{l+j} which is used in the denominator of (8.3.5). Theoretically the number
of available pairs of observations may be zero, and then using the assumed 0/0 := 0
in (8.3.5) we get γ̂ X (j) = 0. This is a reasonable outcome for the case when no in-
formation about γ X (j) is available. Another remedy is to consider only samples with
Nmin := min{0≤j≤cJ0sp +cJ1sp ln(n)} Nj > k for some k ≥ 0, because cJ0sp + cJ1sp ln(n) is
the largest frequency used by the E-estimator (recall (2.2.4)).
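A hedged R sketch of the estimator (8.3.5) follows; missing values are encoded by zeros in the observed series {Yt}, a simple Bernoulli availability is used here instead of the Markov-Bernoulli mechanism of Figure 8.2, and the function names are mine.

# Autocovariance estimator (8.3.5) from Y_t = A_t X_t, where zeros mark missing values; a sketch.
acv_missing <- function(y, j) {
  y1 <- y[1:(length(y) - j)]; y2 <- y[(1 + j):length(y)]
  Nj <- sum(y1 * y2 != 0)                      # number of available pairs N_j
  if (Nj == 0) return(0)                       # the convention 0/0 := 0
  sum(y1 * y2) / Nj
}
set.seed(4)
x <- as.numeric(arima.sim(list(ar = 0.4, ma = 0.5), n = 240, sd = 0.5))
A <- rbinom(240, 1, 0.7)                       # illustrative Bernoulli availabilities
y <- A * x
sapply(0:5, function(j) acv_missing(y, j))     # estimates of gamma^X(0), ..., gamma^X(5)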
Is the plug-in estimator (8.3.5) unbiased whenever Nj > k ≥ 0? It is often not the case
when a statistic is plugged in the denominator of an unbiased estimator, and hence it is of
interest to check this property. Using independence between {At } and {Xt } we can write,
E{γ̂^X(j) | N_j > k} = E{ E{ [Σ_{l=1}^{n−j} (A_l X_l)(A_{l+j} X_{l+j})] / [Σ_{l=1}^{n−j} A_l A_{l+j}] | A_1 , . . . , A_n , N_j > k } | N_j > k }
= E{ [Σ_{l=1}^{n−j} A_l A_{l+j} E{X_l X_{l+j}}] / [Σ_{l=1}^{n−j} A_l A_{l+j}] | N_j > k } = γ^X(j).   (8.3.6)
This establishes that, whenever Nmin > 0, the proposed autocovariance estimator (8.3.5)
is unbiased, and it can be used to construct the spectral density E-estimator ĝ X (λ).
Now let us return to Figure 8.2. The title of the middle diagram indicates Nmin = 117,
and this tells us about complexity of the problem of the spectral density estimation with
missing data when the size n = 240 of the hidden time series is decreased to N = 172
available observations, and then the minimal number of available pairs for estimation of
the autocovariance is decreased to 117, that is, more than a twofold decrease from n. It
is also of interest to present the underlying N0 , . . . , N5 that are 172, 134, 120, 117, 122
and 119, respectively. We may conclude that, in the analysis of a time series with missing
observations, it is important to take into account not only n and N but also Nmin .
The bottom diagram in Figure 8.2 exhibits three estimates and the underlying spectral
density of interest g X (λ) (the solid line). The dashed line is the proposed data-driven E-
estimate. Note that it correctly exhibits the slowly decreasing shape of the underlying
density (compare with the solid line which is the underlying spectral density). The dotted
line shows us the E-estimate based on γ̂ Y (j), in other words, this is a naı̈ve estimate which
ignores the missing data and deals with {Yt } as if it were the time series of interest. While
overall it is a poor estimate of the underlying spectral density, note that it correctly shows
smaller power of the observed time series at low frequencies. The dot-dashed line shows us
oracle’s E-estimate based on the hidden time series {Xt } shown in the top diagram. The
bottom diagram allows us to compare performance of the same E-estimator based on three
different datasets, and based on this single simulation we may conclude that the proposed
data-driven E-estimator performs relatively well, and ignoring the missing, as the naı̈ve E-
estimator does, is a mistake. The interested reader is advised to repeat this figure, possibly
with different parameters, to get first-hand experience in dealing with the Markov-Bernoulli
missing mechanism.
Now let us consider a different mechanism of creating missing observations. Here the
number L of consecutive missing observations, that is the length of a batch of missing
observations, is generated by a random variable. Correspondingly, we will generate missing
observations by choosing the distribution of L and refer to this missing mechanism as batch-
Bernoulli. The following example clarifies the definition. Suppose that each hour we need to
conduct an experiment whose outcomes create a time series of hourly observations. However,
this task is of lower priority with respect to other jobs that may arrive in batches with each
job requiring one hour to be fulfilled. As a result we may have intervals of time when the
experiment is not conducted. A modification of that example is when the experiment cannot
be performed if the equipment malfunctions and then a random number of hours is required
to fix the equipment.
A distribution of L, which is often used to model batches, is Poisson with E{L} = λ, and
definition of this distribution can be found in Section 1.3. To simulate the corresponding
batch-Bernoulli missing, we generate a sample L1 , L2 , . . . of independent random variables
from L. If L1 = 0 then no missing occurs and if L1 = k > 0 then k consecutive observations
X1 , . . . , Xk are missed, etc. The following inequality sheds light on how large Poisson batches
can be,
P(L ≥ k) ≤ e^{−λ} k^{−k} (eλ)^k = e^{−k ln(k/(eλ))−λ},  k > λ.   (8.3.7)
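One possible reading of the batch-Bernoulli mechanism just described can be coded as follows; this is only my sketch of the verbal description, and the book's own generator may resolve the details between batches differently.

# Generate a batch-Bernoulli availability sequence with Poisson batch lengths; a sketch.
batch_avail <- function(n, lambda = 0.5) {
  A <- integer(0)
  while (length(A) < n) {
    L <- rpois(1, lambda)
    if (L == 0) A <- c(A, 1L)        # no batch: the next observation is available
    else A <- c(A, rep(0L, L))       # a batch of L consecutive missing observations
  }
  A[1:n]
}
set.seed(5)
A <- batch_avail(240, lambda = 0.5)
c(N = sum(A), proportion_missing = mean(A == 0))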
Figure 8.3 Batch-Bernoulli missing mechanism. Lengths L1 , L2 , . . . of missing batches are indepen-
dent Poisson random variables with mean E{L} = λ. Otherwise the simulation and structure of
diagrams are similar to Figure 8.2. {Parameter λ is controlled by the argument lambda.} [n = 240,
sigma = 0.5, a = 0.4, b = 0.5, lambda=0.5, cJ0sp = 2, cJ1sp = 0.5, cTHsp = 4]
For a batch-Bernoulli missing the relation (8.3.6) still holds and hence the same spectral
density E-estimator may be used.
Figure 8.3 illustrates the batch-Bernoulli missing mechanism, and here λ = 0.5. Apart
from the new missing mechanism, the simulation and diagrams are the same as in Figure 8.2.
The top diagram shows a realization of the ARMA process. The middle diagram shows us
the missing pattern, and it sheds new light on the name of the missing mechanism. We can
also note from the title that in this particular simulation, the Poisson batches decreased the
number n = 240 of hidden observations to just N = 127 available observations; in other
words, almost half of the hidden underlying observations are missed. Further, the minimal
number of available pairs for calculation of the autocovariance coefficients of the E-estimator is
Nmin = 61, and it is almost a quarter of n = 240. Let us also add the information about
the underlying N0 , . . . , N5 that are 127, 71, 66, 62, 61 and 64, respectively. Not surprisingly,
these small numbers of pairs, available for calculation of the autocovariance, take a toll on
the E-estimate (compare the dashed line of the E-estimate with the dot-dashed line of the
oracle’s E-estimate based on n = 240 hidden realizations of {Xt }). At the same time, the
proposed estimate still nicely exhibits the shape of the underlying spectral density and it is
dramatically better than the naı̈ve E-estimate (the dotted line) which ignores the missing
and treats {Yt } as the time series of interest. It is a good exercise to repeat this figure with
different parameters and then analyze patterns in time series and performance of estimators.
Because {Xt } is zero-mean and it is assumed that the three time series are mutually in-
dependent, the observed time series {Yt } is also zero-mean. Then, using (8.4.1) we may
write,
E{γ̂^Y(j)} = E{(U_1 A_1 X_1)(U_{1+j} A_{1+j} X_{1+j})}
= E{U_1 U_{1+j}} E{A_1 A_{1+j}} E{X_1 X_{1+j}}
= [µ_2 I(j = 0) + µ² I(j ≠ 0)] E{A_1 A_{1+j}} γ^X(j),   (8.4.3)
where µ_2 := E{U_1²} and µ := E{U_1}.
Equation (8.4.3) sheds light on a possibility to solve the problem of estimation of the
spectral density g X (λ). Namely, if parameters µ2 and µ are known, then it is possible to
estimate the spectral density using the approach of the previous section. Further, if the two
parameters are unknown but a sample from U is available, then the two parameters can be
estimated by sample mean estimators. Note that in both cases some extra information is
needed. On the other hand, (8.4.3) indicates that, based solely on amplitude-modulated data
we cannot consistently estimate the spectral density g X (λ). In other words, the considered
modification is destructive.
Despite this gloomy conclusion, the following approach may be a feasible
remedy in some practical situations. Introduce a function sX (λ) which is called the spectral
shape or the shape of spectral density,
s^X(λ) := π^{−1} Σ_{j=1}^{∞} γ^X(j) cos(jλ).   (8.4.4)
Note that g^X(λ) = (2π)^{−1} γ^X(0) + s^X(λ), and hence in a graph the spectral shape is just
the spectral density shifted vertically, and the shift is such that the integral of the
shape over [0, π] is zero. Further, apart from the variance of Xt , the spectral shape provides us
with all values of the autocovariance function γ^X(j), j > 0. Moreover, in practical applications
the spectral density is often of interest in terms of its modes, and then knowing either the
spectral density, or the spectral shape, or the scaled spectral shape sX (λ, µ) defined as
s^X(λ, µ) = µ² s^X(λ) =: π^{−1} Σ_{j=1}^{∞} (µ² γ^X(j)) cos(jλ) =: π^{−1} Σ_{j=1}^{∞} η^X(j) cos(jλ)   (8.4.5)
is equivalent. Recall that µ := E{U } is an unknown parameter (the mean) of the amplitude-
modulating distribution.
The above-made remark makes the problem of estimation of the scaled spectral shape
of a practical interest. Furthermore, estimation of the scaled shape of spectral density is
possible based on the available amplitude-modulated time series {Yt }. Indeed, consider the
estimator
η̂^X(j) := Σ_{l=1}^{n−j} Y_l Y_{l+j} / Σ_{l=1}^{n−j} I(Y_l Y_{l+j} ≠ 0),  j ≥ 1   (8.4.6)
of coefficients η^X(j) in the cosine expansion (8.4.5). Following Section 8.3, set N_j :=
Σ_{l=1}^{n−j} A_l A_{l+j}, N := N_0 and N_min := min_{0≤j≤cJ0sp+cJ1sp ln(n)} N_j. Let us explore the
expectation of the estimator (8.4.6) given N_j > 0. Write,
E{η̂^X(j) | N_j > 0} = E{ E{ Σ_{l=1}^{n−j} Y_l Y_{l+j} / Σ_{l=1}^{n−j} I(Y_l Y_{l+j} ≠ 0) | A_1 , . . . , A_n , N_j > 0 } }
Figure 8.4 Amplitude-modulated time series and estimation of the scaled shape of the spectral den-
sity. Amplitude-modulated observations Yt := Ut At Xt , t = 1, 2, . . . , n are generated by a Markov-
Bernoulli time series {At } used in Figure 8.2, and U1 , U2 , . . . , Un is a sample from a scaled corner
density whose mean E{Ut } = µ. The structure of diagrams is identical to Figure 8.2. For the con-
sidered setting only estimation of the scaled shape sX (λ, µ) of the spectral density is possible, and
the scaled shape and its estimates are shown in the bottom diagram. {Parameter µ is controlled by
the argument mu, the choice of corner function is controlled by argument corn.} [n = 240, sigma =
0.5, a = 0.4, b = 0.5, alpha = 0.4, beta = 0.8, corn = 2, mu = 3, cJ0sp = 2, cJ1sp = 0.5, cTHsp
= 4]
In its turn, the autocovariance estimator yields the E-estimator of the spectral density.
Figure 8.5, whose structure is similar to Figure 8.2, illustrates both the setting and
the E-estimator. The top diagram shows the underlying time series of interest. The mid-
dle diagram shows the observed amplitude-modulated time series, as well as the number
N := Σ_{l=1}^{n} I(Z_l X_l ≠ 0) = 193 of available observations, and N_min = 149. Also, let us
present numbers N0 , . . . , N5 of available pairs 193, 151, 157, 150, 149 and 153, respectively.
Returning to the middle diagram, please note how visually different are realizations of the
amplitude-modulated time series and the underlying one. Further, note the difference in
their scales. It is difficult to believe that it is possible to restore the underlying spectral
density of the hidden time series {Xt } after its amplitude-modulation by the Poisson vari-
able. Nonetheless, the bottom diagram shows that the spectral density E-estimate is close
to the oracle’s estimate based on the hidden time series, and overall it is very good. On
the other hand, the naı̈ve spectral density E-estimate of Section 8.2, based on observations
Z1 X1 , . . . , Zn Xn , is clearly poor. Note that it indicates a dramatically larger overall power
of the observed time series, especially on low frequencies.
The overall conclusion is that it is absolutely prudent to pay attention to a possible
modification of an underlying time series of interest.
f V1 ,V1+j ,∆1 ,∆1+j (v1 , v1+j , 1, 1) = f X1 ,X1+j (v1 , v1+j )GC1 ,C1+j (v1 , v1+j ), (8.5.1)
where GC1 ,C1+j (v1 , v1+j ) = P(C1 > v1 , C1+j > v1+j ) is the bivariate survival function. This
formula allows us to write,
E{V_1 V_{1+j} ∆_1 ∆_{1+j}}
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} v_1 v_{1+j} f^{X_1,X_{1+j}}(v_1, v_{1+j}) G^{C_1,C_{1+j}}(v_1, v_{1+j}) dv_1 dv_{1+j}.   (8.5.2)
This relation points to a possible solution. Recall that {Xt } is a zero-mean and
stationary time series. Using this assumption, together with (8.5.1), allows us to write
down the autocovariance function of interest as
where GZ (z) := P(Z > z) denotes the survival function of Z. This allows us to write
f^{V_t,∆_t}(v, 0) = f^C(v) G^{X_t}(v) = f^C(v) G^{V_t}(v) / G^C(v).   (8.5.6)
Next, recall that the hazard rate function of the censoring variable is defined as
h^C(v) := f^C(v) / G^C(v).   (8.5.7)
Using this relation in (8.5.6) yields that
h^C(v) = f^{V_t,∆_t}(v, 0) / G^{V_t}(v).   (8.5.9)
Using this estimator in (8.5.11) implies the sample mean estimator of the cumulative hazard
H C (v),
Ĥ^C(v) := n^{−1} Σ_{l=1}^{n} (1 − ∆_l) I(V_l ≤ v) / Ĝ^V(V_l).   (8.5.13)
Figure 8.6 Censored time series. The underlying time series {Xt } is generated by the same Gaus-
sian ARMA process as in Figure 8.2. The censoring time series {Ct } is generated by a sample from
a Gaussian variable with zero mean and standard deviation σ_C. In the third from the top diagram
squares and circles show cases with uncensored and censored X_t, respectively. N := Σ_{l=1}^{n} ∆_l is the
number of uncensored observations. In the bottom diagram the underlying spectral density g X (λ),
its E-estimate and naı̈ve E-estimate are shown by the solid, dashed and dotted lines, respectively.
{Parameter σC of the Gaussian censoring time series is controlled by the argument sigmaC.} [n
= 240, sigma = 0.5, a = 0.4, b = 0.5, sigmaC = 1, cJ0sp = 2, cJ1sp = 0.5, cTHsp = 4]
As soon as the cumulative hazard is estimated, we use the relation, familiar from Chapter 4,
G^C(v) = exp(−H^C(v))   (8.5.14)
to estimate the survival function of C by the plug-in estimator Ĝ^C(v) := exp(−Ĥ^C(v)).
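A minimal R sketch of (8.5.13) and the plug-in step follows; it assumes that Ĝ^V is the empirical survival function of V (truncated away from zero, my own crude safeguard), and independent lifetimes are used only so that the output can be checked against the known censoring distribution.

# Estimate the censoring survival function from (V_l, Delta_l) via (8.5.13)-(8.5.14); a sketch.
surv_censoring <- function(V, Delta, v) {
  n  <- length(V)
  GV <- sapply(V, function(u) mean(V > u))     # empirical survival function of V at each V_l
  GV <- pmax(GV, 1 / n)                        # crude truncation away from zero (an assumption)
  H  <- sapply(v, function(t) mean((1 - Delta) * (V <= t) / GV))   # cumulative hazard (8.5.13)
  exp(-H)                                      # plug-in survival estimate, using (8.5.14)
}
set.seed(6)
X <- rnorm(240); C <- rnorm(240)               # lifetimes and Gaussian censoring variables
V <- pmin(X, C); Delta <- as.integer(X <= C)
v <- seq(-2, 2, by = 0.5)
round(surv_censoring(V, Delta, v), 2)          # compare with pnorm(v, lower.tail = FALSE)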
where
θ_j := ∫_0^1 ϕ_j(x) f^X(x) dx   (8.6.2)
are Fourier coefficients of f X (x). The idea of E-estimation is based on the fact that the
right side of (8.6.2) can be written as the expectation, namely θ_j = E{ϕ_j(X_1)}.
The expectation immediately implies the following sample mean Fourier estimator,
θ̂_j := n^{−1} Σ_{l=1}^{n} ϕ_j(X_l).   (8.6.4)
The Fourier estimator is unbiased and
Using this Fourier estimator in the E-estimator of Section 2.2 yields the desired density
estimator. Further, Section 2.2 shows via simulations a good performance of the E-estimator
for relatively small samples. In short, E-estimation is based on a good sample mean Fourier
estimator.
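For orientation, the sample mean Fourier estimator (8.6.4) and a truncated cosine-series density estimate can be sketched in R as follows; a fixed cutoff J and a crude nonnegativity correction stand in for the data-driven E-estimator of Section 2.2, and the helper names are mine.

# Sample mean Fourier estimator (8.6.4) and a truncated cosine-series density estimate on [0, 1].
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
density_series <- function(X, J, x) {
  theta <- sapply(0:J, function(j) mean(phi(j, X)))                # hat(theta)_j
  est <- rowSums(sapply(0:J, function(j) theta[j + 1] * phi(j, x)))
  pmax(est, 0)                                                     # simple nonnegativity correction
}
set.seed(7)
X <- rbeta(100, 2, 3)                                              # an i.i.d. sample supported on [0, 1]
x <- seq(0, 1, length.out = 101)
plot(x, density_series(X, J = 6, x), type = "l", ylab = "density estimate")
lines(x, dbeta(x, 2, 3), lty = 2)                                  # underlying density for comparison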
Now let us relax the assumption about independence and consider a realization
X1 , X2 , . . . , Xn of a stationary time series {Xt }. The problem is again to estimate the
density f X (x), and note that now this density is marginal with respect to the joint den-
sity f^{X_1,...,X_n}(x_1 , . . . , x_n). Further, let us relax another assumption, namely that the support of the
density is known. And what happens if some of the dependent observations are missing?
These are the issues that we would like to address.
First, let us relax the assumption about independence while still considering densities
supported on [0, 1]. Let us carefully look at the Fourier estimator (8.6.4). For a stationary
time series the estimator is still a sample mean estimator and unbiased. Indeed, we know that
the expectation of a sum of random variables is always a sum of expectations, regardless of
dependence between the random variables, and hence even if X1 , X2 , . . . , Xn are dependent
we get
E{θ̂_j} = E{n^{−1} Σ_{l=1}^{n} ϕ_j(X_l)} = n^{−1} Σ_{l=1}^{n} E{ϕ_j(X_l)}.   (8.6.6)
This proves that the Fourier estimator (8.6.4) is unbiased. Further, the proof of (8.6.7) is based
only on the property that the expectation E{ϕ_j(X_t)} does not depend on t, and this is a
weaker assumption than stationarity.
Next we need to evaluate the variance of the estimator (8.6.4). Write,
V(θ̂_j) = E{[n^{−1} Σ_{l=1}^{n} ϕ_j(X_l) − θ_j]²} = E{[n^{−1} Σ_{l=1}^{n} (ϕ_j(X_l) − θ_j)]²}
= n^{−2} Σ_{l,t=1}^{n} E{(ϕ_j(X_l) − θ_j)(ϕ_j(X_t) − θ_j)} = n^{−2} Σ_{l,t=1}^{n} E{ϕ_j(X_l)ϕ_j(X_t)} − θ_j².   (8.6.8)
E{ϕj (Xl )ϕj (Xt )} = E{ϕj (X0 )ϕj (X|t−l| )}. (8.6.9)
The expression (8.6.10) allows us to make the following important conclusion. If the
following sum converges,
Σ_{l=−n}^{n} |E{ϕ_j(X_0)ϕ_j(X_l)}| < c∗ < ∞,   (8.6.11)
then the variance of θ̂j decreases with the parametric rate n−1 despite the dependence
between the observations, namely
Otherwise the dependence may slow down the rate of the variance decrease and hence make
the estimation less accurate.
As an example, consider the case of a stationary series {Xt } with the mixing coefficient
αX (s) defined in (8.1.10). Then using (8.1.13) we get
Using this relation we conclude that if the mixing coefficients are summable, that is
Σ_{s=0}^{∞} α^X(s) < ∞,   (8.6.14)
then (8.6.11) holds and we get the classical rate n−1 for the variance. If (8.6.14) holds then
we are dealing with the case of weak dependency (short memory). For instance, if αX (s) is
proportional to s−β with β > 1 then this is the case of weak dependency. Another classical
case of weak dependency is a Gaussian ARMA time series where mixing coefficients decrease
exponentially. On the other hand, if β < 1, and an example will be presented shortly, the
variance convergence slows down to n−β . In the latter case the dependence is called strong
and we may say that we are dealing with variables having long memory of order β.
Now let us explain our approach for the case when the support of X is unknown and
may be the real line, as in the case of a Gaussian stationary time series. Denote by X_{(1)} and
X_{(n)} the smallest and largest observations (recall that this is our traditional notation for
ordered observations), define the observations rescaled onto [0, 1] as Y_l := (X_l − X_{(1)})/(X_{(n)} − X_{(1)}),
construct for the rescaled observations the density E-estimator f̂^Y(y), and then define the
rescaled back density E-estimator
f̂^X(x) := (X_{(n)} − X_{(1)})^{−1} f̂^Y((x − X_{(1)})/(X_{(n)} − X_{(1)})) I(x ∈ [X_{(1)}, X_{(n)}]).   (8.6.15)
Note that the support of X′ is the interval [a, a + b], and hence via differentiation of (8.6.16)
we get
f^{X′}(x) = b^{−1} f^{Y′}((x − a)/b) I(x ∈ [a, a + b]).   (8.6.17)
This is a classical formula for a scale-location transformation, and it motivated (8.6.15).
Now let us get a feeling of the effect of dependency on estimation of the density via
simulated examples. Figure 8.7 allows us to understand performance of the E-estimator
and complexity of the problem for the case of short-memory processes. The figure allows
us to consider two ARMA(1,1) processes with different parameters. In the left column we
consider the case of a highly oscillated zero-mean Gaussian ARMA(1,1) process {Xt }. In
the right column we are also considering the case of a zero-mean Gaussian ARMA(1,1)
Figure 8.7 Density estimation for short-memory stationary time series. Realizations of time series
{Xt } and {Zt } are generated by Gaussian ARMA(1,1) processes whose parameters are shown in
the subtitles. Note that they are similar to those in Figures 8.1 and 8.2, respectively. The histograms
are overlaid by the underlying density (the solid line) and the E-estimate (the dashed line) shown
over the estimated support. {Parameters a, b and sigma control {Xt } while a1, b1 and sigma1
control {Zt }.} [n = 100, sigma = 1, a = -0.4, b = -0.6, sigma1 = 1, a1 = 0.4, b1 = 0.5 , cJ0 =
4, cJ1 = 0.5, cTH = 4]
process {Zt } only with parameters implying slower oscillations. Note that the horizontal
dashed line in the top diagrams helps us to visualize oscillations around zero. Both {Xt }
and {Zt } are weak-dependent and short-memory processes with exponentially decreasing
mixing coefficients, but they have different shapes of spectral densities, see Figures 8.1 and
8.2, respectively. For each process, the corresponding bottom diagram shows us the histogram
of available observations as well as the underlying Gaussian density and its E-estimate.
Let us analyze and compare these diagrams. First of all, the left histogram is clearly more
symmetric about zero than the right one. The right histogram is skewed to the right, and this
reflects the fact that the observed series {Zt } spends more time above zero than below. This
is because {Zt } has more power on lower frequencies and it requires more time (larger n) to
exhibit its stationarity and zero-mean property. In terms of modes, the E-estimates indicate
a pronounced single mode, but again the more frequently oscillating {Xt } yields the better
shape of the E-estimate. Finally, let us look at the effect of using the empirical support
(note that this is the same support used by the famous empirical cumulative distribution
function F̂^X(x) := n^{−1} Σ_{l=1}^{n} I(X_l ≤ x)). Of course, a Gaussian variable is supported on a
real line, and here we used a finite empirical support. This did a good job for {Xt } and a
reasonable one for {Zt }.
Now let us consider an even more extreme case of dependence, a long-memory time
series. The top diagram in Figure 8.8 exhibits a zero-mean Gaussian time series with long
memory of order β = 0.4. It looks like the above-discussed time series {Zt } on steroids. Note
that it begins above the zero and rarely goes into negative territory (the horizontal dashed
line helps us to see this). Of course, eventually the series will stay negative a long time, but
much larger samples are needed to see this. The reader is advised to generate more series and
get used to processes with long-memory. In the second (from the top) diagram we see the
histogram of available observations overlaid by the underlying density (the solid line) and
its E-estimate (the dashed line) shown over the range of observations. The data is clearly
skewed to the right, the estimated support is skewed to the right, and there is nothing that
can be done about this. We simply need more observations to correctly estimate the density,
and this is an important lesson to learn.
What happens if some observations are missing? How does missing, coupled with dependency,
affect E-estimation of the density? First, let us answer these natural questions
analytically. Consider a time series {At Xt } where the availabilities At are independent
Bernoulli random variables and P(At = 1|{Xt }) = w, w ∈ (0, 1]. Additionally, we are as-
suming that P(Xt = 0) = 0 and hence P(At = I(At Xt ≠ 0)) = 1. As a result, even if we do
not directly observe a particular At , we do know it from the available time series {At Xt }.
We are dealing with MCAR (missing completely at random) and hence it is natural to
try a complete-case approach which yields the following Fourier estimator (compare with
(8.6.4)),
θ̃_j := Σ_{l=1}^{n} I(A_l X_l ≠ 0) ϕ_j(A_l X_l) / Σ_{l=1}^{n} I(A_l X_l ≠ 0).   (8.6.18)
Let us check that the Fourier estimator is unbiased given N := Σ_{l=1}^{n} A_l > 0. Using the rule
of calculation of an expectation via the expectation of a conditional expectation, together
with E{ϕ_j(X_l)} = θ_j, we get
E{θ̃_j | N > 0} = E{ E{ Σ_{l=1}^{n} I(A_l X_l ≠ 0) ϕ_j(A_l X_l) / Σ_{l=1}^{n} I(A_l X_l ≠ 0) | {A_t}, N > 0 } | N > 0 }
= E{ Σ_{l=1}^{n} A_l θ_j / Σ_{l=1}^{n} A_l | N > 0 } = θ_j.   (8.6.19)
This is a pivotal result which yields unbiasedness of the sample mean estimator for the
case of missing dependent observations. Hence we again may use our density E-estimator
based on complete cases of time series {At Xt }.
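A short R sketch of the complete-case Fourier estimator (8.6.18) follows; for simplicity the hidden observations are generated as an independent sample supported on [0, 1], since only the estimator itself is illustrated here, and the helper names are mine.

# Complete-case Fourier estimator (8.6.18) for MCAR observations A_t X_t; a sketch.
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
theta_cc <- function(y, j) {                   # y = A_t X_t, zeros mark missing values
  avail <- y != 0
  if (!any(avail)) return(0)
  mean(phi(j, y[avail]))                       # average of phi_j over complete cases only
}
set.seed(8)
X <- rbeta(100, 2, 3)                          # hidden observations on [0, 1] with P(X = 0) = 0
A <- rbinom(100, 1, 0.6)                       # MCAR availabilities with w = 0.6
y <- A * X
sapply(0:4, function(j) theta_cc(y, j))        # estimates of theta_0, ..., theta_4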
Let us look at a simulation of a long-memory time series with missing observations. The
third (from the top) diagram in Figure 8.8 shows us such a realization, and note that the time
series of interest is shown in the top diagram. Clearly the density of available observations
will be skewed to the right because there are only several negative observations. Further,
note that we have dramatically fewer, just N = 58, available realizations of the long-memory
time series. How will the E-estimator perform under these circumstances? The bottom
diagram shows us the underlying density and the E-estimate. Yes, the estimate is skewed,
and support is shown incorrectly, but overall this estimate is on par with its benchmark
Figure 8.8 Density estimation for long-memory of order β zero-mean Gaussian time series {Xt }
without and with missing observations. The missing is created by a time series {At } of independent
Bernoulli variables with P(A = 1) = w, and then the available time series is {At Xt }. The title of
the third diagram shows w and the number N := Σ_{l=1}^{n} A_l of available observations. In the second
and fourth diagrams the histograms are based on data in the first and third diagrams, respectively.
The histograms are overlaid by solid and dashed lines showing the underlying density and its E-
estimate over the range of available observations. {Parameter β is controlled by argument beta.
Standard deviation of the time series is controlled by argument sigma.} [n = 100, sigma = 1, beta
= 0.4, w = 0.6, cJ0 = 4, cJ1 = 0.5, cTH = 4]
in the second (from the top) diagram. Despite all the complications, we do see a unimodal
and surprisingly symmetric shape of the E-estimate. Note that the missing makes available
observations less dependent, and then the main complication is the smaller number N of
available observations.
A conclusion is that dependency in observations should not be taken lightly, and typically
a larger sample size is the remedy. Further, while short-memory dependence rarely produces
a dramatic effect on statistical estimation or inference, a long-memory dependence, coupled
with a relatively small sample size, may be destructive. Working with and repeating Figures
8.7 and 8.8 will help the reader to understand and appreciate the dependency.
Consider the nonparametric autoregression model
X_t = m(X_{t−1}) + σ(X_{t−1}) W_t,   (8.7.1)
where m(x) and σ(x) > 0 are smooth functions, and {Wt } is a standard white noise, that is
a time series of independent and identically distributed variables with zero mean and unit
variance.
The time series (8.7.1) has many applications, interpretations, and it is known under
different names. If in (8.7.1) m(Xt ) = aXt and σ(Xt ) = σ then, according to (8.1.7), the
process becomes AR(1) autoregression. This explains the name nonparametric autoregres-
sion for (8.7.1). Further, consider the problem of prediction (forecasting) Xt given Xt−1 .
For instance, we would like to predict the temperature for tomorrow based on the tem-
perature today, or predict a stock return for tomorrow, etc. Then the best predictor, that
minimizes the mean squared error, is m(x) := E{Xt |Xt−1 = x} and it may be referred
to as a nonlinear predictor or nonparametric regression. Further, in the theory of dynamic
models, the equation is called a nonlinear dynamic model, Xt is called a state of the model,
m(x) is called an iterative map, and σ(x) is called a scale map. Note that if σ(x) = 0, then
Xt = m(Xt−1 ) and a current state of this dynamic model is defined solely by its previous
state (the states are iterated). This explains the name iterative map of m(x).
Let us explain how we may estimate the autoregression function m(x) in model (8.7.1).
Set Zt := Xt−1 and rewrite (8.7.1) as
X_t = m(Z_t) + σ(Z_t) W_t.   (8.7.2)
Let us look at (8.7.2) more closely. First, we observe n − 1 pairs (X2 , Z2 ), (X3 , Z3 ), . . . ,
(Xn , Zn ). Second, using independence of Z_t := X_{t−1} and W_t, together with
the zero mean property of W_t, we can write E{X_t | Z_t = z} = m(z).
We may conclude that (8.7.2) is a regression problem with Zt being the predictor and Xt
being the response. Hence we can use the regression E-estimator of Section 2.3 to estimate
m(z) and the scale E-estimator of Section 3.6 to estimate σ(z).
Further, we can generalize model (8.7.1) and consider
X_t = m(X_{t−1}) + σ(X_{t−1}) U_t,   (8.7.4)
where {Ut } is a stationary and unit-variance time series satisfying E{Ut |Xt−1 } = 0. Then
E{Xt |Xt−1 = x} = m(x) and we again may use the regression and scale E-estimators.
Figure 8.9 presents the proposed statistical analysis of the nonparametric autoregression
(8.7.4) with {Ut } being a Gaussian ARMA(1,1) process; this process allows us to test
robustness of the E-estimator to the assumption E{Ut |Xt−1 } = 0. The underlying simulation
is explained in the caption. Diagram 1 shows us a particular realization. It is difficult to
gain anything useful from its visualization, and it looks like there is nothing special in
this highly oscillating time series. Keeping in mind that the autoregression is often used to
model a stock price over a short period of time, it is easy to understand why trading stocks
is a complicated issue. Diagram 2 shows us the scattergram of pairs (Xt−1 , Xt ); it sheds
light on the underlying iterative process of autoregression, and note how inhomogeneous
the scattergram is.
[Figure 8.9 comprises four diagrams; the titles of the last three are "2. Scattergram of Xt Versus Xt−1", "3. Estimate of m(x)", and "4. Estimate of Scale", with Xt−1 on the horizontal axes.]
Figure 8.9 Nonparametric autoregression. The simulation uses model (8.7.4) where m(x) :=
C exp(λm x)/(1 + exp(λm x)), σ(x) := a1 + b1 exp(λs x)/(1 + exp(λs x)), and {Ut } is a Gaussian
ARMA(1, 1) time series with parameters (a, b, σ). In the second and third diagrams the solid line is
the underlying function m(x), and in Diagram 4 the solid line is the underlying scale σ(x). The
dashed lines show E-estimates. {Parameters λm and λs are controlled by arguments lambdam and
lambdas, respectively.} [n = 240, a = -0.3, b = 0.6, sigma = 1, a1 = 0.5, b1 = 1, lambdam = -2,
lambdas = 2, C = 3, cJ0 = 4, cJ1= 0.5, cTH = 4]
The solid line in Diagram 2 is the underlying autoregression function m(x), and it definitely
sheds light on the data (of course, for real data this line would not be available).
The interesting and rather typical feature is a small number of observations in the tails.
Another interesting feature of the data is that the variability of observations depends on
the predictor Xt−1 and it is clearly larger for positive predictors.
Now an important remark is due. In a nonparametric autoregression the role of noise
{Ut } is absolutely crucial because it forces {Xt } to have a sufficiently large range of values
which, in its turn, allows us to estimate the autoregression function. To appreciate the role
[Figure 8.10 comprises four diagrams; panels 3 and 4 are titled "3. Estimate of m(x)" and "4. Estimate of Scale", with Xt−1 on the horizontal axes.]
Figure 8.10 Nonparametric autoregression with Markov-Bernoulli missing. Markov chain {At } is
generated according to transition probabilities P(At+1 = 0|At = 0) = α and P(At+1 = 1|At = 1) =
β. Diagram 1 shows the observed time series {At Xt } where {Xt } is generated as in Figure 8.9.
N := Σ_{l=1}^{n} A_l is the number of available observations of Xt while M := Σ_{l=2}^{n} A_{l−1}A_l is the number
of available pairs of observations; these statistics are shown in the titles. [n = 240, alpha = 0.4,
beta = 0.8, a = -0.3, b = 0.6, sigma = 1, a1 = 0.5, b1 = 1, lambdam = -2, lambdas = 2, C = 3,
cJ0 = 4, cJ1 = 0.5, cTH = 4]
of the noise, just set it to zero and then check the outcome theoretically and using Figure
8.9 (in the figure set a1 = 0 and b1 = 0).
Diagram 3 shows us the estimated autoregression function, and the E-estimate is good.
Diagram 4 shows us the E-estimate of the scale function, and it is also good. Note that its
left tail is smaller than the underlying scale, but you can check Diagram 2 and conclude that
the data do support the opinion of the E-estimate.
Now let us consider the same model only when some observations are missing according
to a Markov-Bernoulli time series {At } discussed in Section 8.3. This model is illustrated
in Figure 8.10 whose caption explains the simulation and notation. The underlying process
{Xt } is the same as in Figure 8.9. Diagram 1 shows us a realization of the available time series
{At Xt } where only N = Σ_{l=1}^{n} A_l = 175 of the n = 240 observations are available. Now note
that E-estimation is based on pairs (Xl−1 , Xl ), l = 2, . . . , n and these pairs are available
only if Al−1 Al = 1. The available pairs are shown in Diagram 2, and the number M of
available pairs is 135. Note that we lost almost half of the underlying pairs (compare with
Diagram 2 in Figure 8.9). This loss definitely has affected estimation of the autoregression
function (Diagram 3) and the scale function (Diagram 4). At the same time, with the help
of Diagram 2 we may conclude that the estimates reflect the data. Indeed, let us look at
the left tail in Diagram 2. Clearly all observations are below the solid line (the underlying
autoregression), and this is what the E-estimate in Diagram 3 tells us. Further, note that
observations of Xt in the left tail of Diagram 2 exhibit a minuscule variability, and this is
correctly reflected by the E-estimate in Diagram 4. It is possible to make a similar conclusion
about the right tail.
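The missing mechanism itself is simple to reproduce. The sketch below uses the α and β of Figure 8.10, generates the two-state Markov chain {At }, and counts the number N of available observations and the number M of available pairs, that is, indices with Al−1 Al = 1; an arbitrary stationary series stands in for the autoregression.

# Sketch: Markov-Bernoulli missing and the available lagged pairs.
set.seed(3)
n <- 240; alpha <- 0.4; beta <- 0.8            # P(A_{t+1}=0|A_t=0) and P(A_{t+1}=1|A_t=1)
A <- numeric(n); A[1] <- 1
for (t in 2:n)
  A[t] <- if (A[t - 1] == 1) rbinom(1, 1, beta) else rbinom(1, 1, 1 - alpha)
X <- as.numeric(arima.sim(n = n, list(ar = -0.3, ma = 0.6)))  # stand-in stationary series
N <- sum(A)                                    # available observations
keep <- which(A[-n] * A[-1] == 1)              # indices l-1 with A_{l-1} = A_l = 1
M <- length(keep)                              # available pairs (X_{l-1}, X_l)
Z <- X[keep]; Y <- X[keep + 1]                 # data that may enter the E-estimators
c(N = N, M = M)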
We conclude that the proposed methodology of E-estimation is robust and performs
relatively well for the complicated model of nonparametric autoregression with missing
observations. The reader is advised to repeat Figures 8.9 and 8.10 with different parameters
and get used to this important stochastic model.
8.8 Exercises
8.1.1 Consider a not necessarily stationary time series {Xt } with a uniformly bounded
second moment, that is E{Xt2 } ≤ c < ∞ for any t. Is the first moment of Xt uniformly
bounded?
8.1.2 Consider a stationary time series with a finite second moment. Does the first moment
exist? If the answer is “yes,” then can the first moment E{Xt } change in time?
8.1.3 Consider a zero-mean and second-order stationary time series {Xt }. Prove that the
autocovariance function γ X (0) = E{Xt2 } = V(Xt ) for any t.
8.1.4 Verify (8.1.3) and (8.1.4).
8.1.5∗ Show that for a short-memory second-order stationary time series the sample mean
estimator of its mean has variance which vanishes with the rate n−1 .
8.1.6∗ Consider the case of a long-memory second-order stationary time series with γX(t) proportional to |t|^{−α}, 0 < α < 1. Evaluate the variance of the sample mean estimator μ̄ := n^{−1} Σ_{l=1}^{n} X_l.
8.1.7 Let {Wt } be a standard Gaussian white noise. For the following time series, verify
the second-order stationarity and calculate the mean and autocovariance function:
(i) Xt = a + bWt .
(ii) Xt = a + bWt sin(ct).
(iii) Xt = Wt Wt−2 .
(iv) Xt = aWt cos(ct) + bWt−2 sin(ct).
(v) Zt = a + bWt + cWt2 .
8.1.8 Suppose that {Xt } and {Zt } are two uncorrelated second-order stationary time series.
What can be said about a time series {Yt } := a{Xt } + b{Zt } where the sum is understood
as the elementwise sum. Hint: Calculate the mean and autocovariance.
8.1.9 Prove that an autocovariance function always satisfies the inequality γ X (k) ≤ γ X (0).
Hint: Use Cauchy-Schwarz inequality.
8.1.10∗ Consider a realization X1 , . . . , Xn , n ≥ p of a causal AR(p) process (8.1.7). Prove that the estimator X̂_{n+1} := Σ_{k=1}^{p} a_k X_{n+1−k} is the best linear predictor of X_{n+1} that minimizes the mean squared error E{(X̃_{n+1} − X_{n+1})^2} over all linear estimators X̃_{n+1} = Σ_{k=1}^{n} b_k X_{n+1−k}. Hint: Write down the mean squared error and then minimize it with respect to b_1 , . . . , b_n. Also, note that the mean squared error is not smaller than E{W_{n+1}^2}.
8.1.11 What is the definition of a causal ARMA process? Why are these processes of
interest?
8.1.12∗ Consider the process Xt = Xt−1 + Wt . Show that it does not have a stationary
solution.
8.1.13 Explain how MA(q) and AR(p) processes can be simulated.
8.1.14∗ Consider a Gaussian ARMA(1,1) process {Xt } defined by Xt − aXt−1 = σ(Wt + bWt−1 ) with |a| < 1 and −a ≠ b. Show that the process is stationary and causal. Hint: Show that the process may be written as Xt = σWt + σ(a + b) Σ_{j=1}^{∞} a^{j−1} W_{t−j}.
8.1.15 Consider the product of two independent second-order stationary time series one of
which belongs to class (8.1.8). Does the product belong to class (8.1.8)?
8.1.16 For a second-order stationary time series {Xt } from class (8.1.8), what can be said about the variance of a linear sum Σ_{l=1}^{n} a_l X_l? Hint: Introduce reasonable restrictions on the numbers a_l, l = 1, 2, . . . , n.
8.1.17 Consider a MA(2) process. What can be concluded about its mixing coefficient
(8.1.10)?
8.1.18 Give an ARMA example of m-dependent time series.
8.1.19 Consider a stationary time series {Xt } with a known mixing coefficient αX(s). Evaluate the mean and variance of Y := n^{−1} Σ_{l=1}^{n} sin(X_l). Hint: Use (8.1.13) and any additional assumption needed.
8.1.20 Consider a stationary Gaussian ARMA process. What can be said about its mixing
coefficient (8.1.10)? Hint: Use the Kolmogorov-Rosanov result.
8.1.21∗ Prove (8.1.12).
8.1.22∗ Verify (8.1.13).
8.2.1 Is the spectral density symmetric about zero (an even function)? Also, calculate ∫_{−π}^{π} g^X(λ)dλ.
8.2.2∗ Show that a spectral density is a nonnegative function.
8.2.3 How can Fourier coefficients of a spectral density g X (λ) be expressed via the autoco-
variance function?
8.2.4∗ Explain formula (8.2.3). Present a motivating example.
8.2.5 What is the difference, if any, between estimators (8.2.4) and (8.2.5)?
8.2.6 Compare biases of autocovariance estimators (8.2.4) and (8.2.5) for a zero-mean and
second-order stationary time series.
8.2.7∗ Calculate variance and the mean squared error of the autocovariance estimator
(8.2.4). Hint: Make your own assumptions.
8.2.8 Calculate variance of the estimator (8.2.5). Hint: Make your own assumptions.
8.2.9∗ Calculate the mean and variance of the periodogram.
8.2.10 Using your knowledge of nonparametric estimation, explain why a periodogram
cannot be a consistent estimator.
8.2.11 What is the definition of a Gaussian ARMA(1,1) process?
8.2.12∗ Verify (8.2.7).
8.2.13 Prove that the autocovariance of an ARMA(1,1) process decreases exponentially.
8.2.14 In Figure 8.1 the ARMA process exhibits high fluctuations. Suggest parameters of
an ARMA process that fluctuates slower. Use Figure 8.1 to verify your recommendation.
8.2.15 The periodogram in Figure 8.1 indicates a possible periodic (so-called seasonal)
component in the underlying process. Use formula (8.2.3) to find the period of a possible
periodic component. Do you believe that this component is present in the process shown in
the top diagram?
8.2.16 Explain how the spectral density E-estimator is constructed.
8.2.17 Using Figure 8.1, propose better parameters of the E-estimator.
8.2.18 Using Figure 8.1, conduct simulations with different values of parameter σ, and
report your findings.
8.2.19 Is it reasonable to believe that, as (8.2.8) indicates, the spectral density of an ARMA
process is proportional to σ 2 ?
8.2.20 Explain relation (8.2.9).
8.2.21 In the right side of (8.2.11), one part is called the integrated variance (or simply
variance), and another the integrated squared bias. Write down these two components and
explain their names.
8.2.22 Prove (8.2.12).
8.2.23∗ Evaluate the variance of the sample autocovariance.
8.2.24 Verify (8.2.14). Hint: Write down the squared sum via a corresponding double sum,
and then think about addends as elements of a matrix.
8.2.25 Verify (8.2.15).
8.2.26∗ Prove (8.2.16). Hint: Use the assumption that the time series is Gaussian.
8.2.27 Verify (8.2.17).
8.2.28 Why do we need the assumption (8.2.18)? Does it hold for ARMA processes?
8.2.29 Using Cauchy-Schwarz inequality, establish (8.2.19).
8.2.30 Prove (8.2.20).
8.2.31∗ Prove (8.2.21).
8.2.32 What do (8.2.21) and (8.2.22) tell us about the sample autocovariance estimator?
8.2.33∗ Use formula (8.2.23) and find optimal cutoff for an ARMA process. Then compare
estimation of the spectral density with estimation of a single parameter.
8.2.34 Verify (8.2.25).
8.2.35 Explain how (8.2.28) is obtained and why the proposed estimator ḡ X presents a
practical interest.
8.2.36∗ Suppose that (8.2.26) is correct. The aim is to propose an estimator and to choose
a minimal sample size that the MISE does not exceed a fixed positive constant ε. Propose
such an estimator and the sample size.
8.3.1 Explain the missing mechanism (8.3.1).
8.3.2 In the case of a time series with missing observations, we observe realizations of two
time series {At Xt } and {At }. Explain why under the assumption P(Xt = 0) = 0 it is
sufficient to know realizations of only one time series {At Xt }.
8.3.3 Prove that if P(Xt = 0) = 0, then P(At = I(Yt ≠ 0)) = 1 where Yt is defined in
(8.3.1). Does this imply that At = I(Yt ≠ 0)?
8.3.4 Suppose that the chance of rain tomorrow depends only on rain or no rain today.
Suppose that if it rains today, then tomorrow it will rain with probability α, and if there is no
rain today, then tomorrow it will rain with probability β. Find the probability that if there is
no rain today, then two days from today there will be rain. Hint: Consider a two-state Markov
chain.
8.3.5∗ For the previous problem, consider 10 consecutive days and find the expected number
of rainy days.
8.3.6∗ Consider a Markov-Bernoulli missing mechanism with α := P(At+1 = 0|At = 0). Let
L be the length of a batch of missing cases. Explain why, given the batch length L ≥ 1, the
distribution of L is geometric with P(L = k) = αk−1 (1 − α), k = 1, 2, 3, . . .
8.3.7∗ For the setting of Exercise 8.3.6, find the mean and the variance of the batch length
L.
8.3.8 Is the autocovariance γ A (j) equal to E{A1 A1+j }?
8.3.9 Give the definition of a Markov chain.
8.3.10 Is an ARMA(1,1) process a Markov chain? If the answer is “yes,” then what is
the order of the Markov chain? Hint: Think about the effect of the parameters of an ARMA
process.
8.3.11 Explain the simulation used to create Figure 8.2.
8.3.12 Repeat Figure 8.2 for different Markov-Bernoulli processes. Explain how its param-
eters affect the missing and estimation of the spectral density.
8.3.13∗ Why does the naïve estimate in Figure 8.2 indicate a lower (with respect to the
E-estimate) spectrum power on low frequencies?
8.3.14 Explain the three estimates shown in Figure 8.2.
8.3.15 Using repeated simulations of Figure 8.2, propose better parameters of the E-
estimator.
8.3.16∗ Find the mean and variance of the available number N = Σ_{l=1}^{n} A_l of observations.
8.3.17 Verify (8.3.3).
8.3.18 Explain every equality in (8.3.4). Do not forget to comment on used assumptions.
8.3.19 What is the underlying idea of the estimator (8.3.5)? Explain its numerator and
denominator.
8.3.20 Show that the autocovariance estimator (8.3.5) is unbiased.
8.3.21∗ Calculate the variance of estimator (8.3.5). Hint: Propose your assumptions.
8.3.22 Explain the underlying simulation in Figure 8.3.
8.3.23 Explain how parameter λ affects the number N of available observations in the
simulation of Figure 8.3.
8.3.24 Formulate basic statistical properties of a batch-Bernoulli process.
8.3.25∗ Explain how the three estimates, shown in Figure 8.3, are constructed.
8.3.26∗ Given the same number of available observations, is Markov-Bernoulli or batch-
Bernoulli missing mechanism better for estimation? You may use either a theoretical ap-
proach or simulations to answer the question.
8.3.27 How many parameters are needed to define a stationary Markov chain of order 2?
Hint: It may be helpful to begin with a Markov chain of order 1.
8.4.1 Explain an amplitude-modulated missing mechanism. Give several examples.
8.4.2 Suppose that {Xt }, {At } and {Ut } are (second-order) stationary time series. Is their
product a (second-order) stationary time series?
8.4.3 What do we need the assumption (8.4.1) for?
8.4.4∗ Consider the case when {At } and {Ut } are dependent. Does this affect estimation
of the spectral density g X ?
8.4.5∗ Consider the case when {At } and {Xt } are dependent. Does this affect estimation
of the spectral density g X ?
8.4.6∗ Find the mean and variance of the sample mean autocovariance (8.4.2).
8.4.7 Verify each equality in (8.4.3). Explain where and how the made assumptions about
the three processes are used.
8.4.8 Can the spectral density g X be consistently estimated based on amplitude-modulated
observations? In other words, is this missing destructive?
8.4.9∗ Give definition of the shape of a function. Then explain when and how shape of the
spectral density may be estimated for amplitude-modulated data.
8.4.10∗ Find the mean and variance of estimator (8.4.6).
8.4.11 Verify and explain all steps in establishing (8.4.7).
8.4.12 Explain the simulation used in Figure 8.4.
8.4.13 Explain how parameters of the processes {At } and {Ut } affect the number N of
available observations. Support your conclusion using Figure 8.4.
8.4.14 Find better parameters of the E-estimator used in Figure 8.4.
8.4.15 Explain why the naïve estimate in Figure 8.4 indicates a smaller spectrum power on
low frequencies.
8.4.16 Explain the model of amplitude-modulation by a Poisson variable.
8.4.17 In general, an amplitude-modulation implies inconsistent estimation of the spectral
density. On the other hand, the Poisson amplitude modulation does allow a consistent
estimation. Why?
8.4.18∗ Explain the estimator (8.4.8) of the mean of a Poisson distribution. Then evaluate
its mean and variance.
8.4.19∗ Prove (8.4.9). Explain the used assumption.
8.4.20∗ Find the mean and variance of the estimator (8.4.10).
8.4.21∗ Consider the case when {Xt } and Poisson {Ut } are dependent. Explore the possi-
bility of a consistent estimation of the spectral density or its shape.
8.4.22 Explain the simulation used in Figure 8.5.
8.4.23 Find the mean and variance of the available number N of observations in Figure
8.5. Does an underlying (hidden) time series {Xt } affect N ?
8.4.24 Consider the bottom diagram in Figure 8.5. The naïve estimate exhibits a larger
spectrum power at all frequencies. Why?
8.4.25 Use Figure 8.5 to answer the following question. How do parameters of the underlying
ARMA process affect estimation of its spectral density?
8.5.1 Explain the model of right censored time series. Present examples.
8.5.2 What are the available observations when a time series is censored?
8.5.3 Explain formula (8.5.1).
8.5.4 What is the definition of a bivariate survival function? What are its properties?
8.5.5 Prove (8.5.3).
8.5.6∗ Find the mean and variance of estimator (8.5.4). Is it unbiased?
8.5.7∗ Explain the method of estimation of the survival function GC (v).
8.5.8 Verify (8.5.5).
8.5.9 Explain why the formula (8.5.6) is of interest.
8.5.10 What is the definition of a hazard rate? What are its properties?
8.5.11 Assume that the hazard rate is known. Suggest a formula for the corresponding
probability density.
8.5.12 Verify (8.5.9).
8.5.13∗ Explain how the numerator and denominator in (8.5.9) may be estimated.
8.5.14 Prove validity of (8.5.11).
8.5.15∗ Find the mean and variance of the estimator (8.5.12).
8.5.16∗ Use an exponential inequality to infer about estimator (8.5.12).
8.5.17∗ Evaluate the mean and variance of estimator (8.5.13). Is it unbiased? Is it asymp-
totically unbiased?
8.5.18 Prove (8.5.14).
8.5.19 Explain why (8.5.15) is a reasonable estimator of the survival function.
8.5.20 Explain the simulation used to create Figure 8.6.
8.5.21 Explain all diagrams in Figure 8.6.
8.5.22 Consider Figure 8.6 and answer the following question. Why is the censored time
series highly oscillated, while the underlying {Xt } is not?
8.5.23 Explain how the naïve spectral density estimate is constructed.
8.5.24 The considered problem is complicated. Repeat Figure 8.6 a number of times and
make your own conclusion about the E-estimator.
8.5.25 Use Figure 8.6 and then explain your observations about the size N of uncensored
observations.
8.5.26 Suggest better parameters for the E-estimator. Hint: Use Figure 8.6 with different
sample sizes and parameters.
8.5.27∗ Consider the case of a stationary time series {Ct }. Propose a consistent estimator
of the bivariate survival function GC1 ,C1+j (v, u).
8.5.28∗ Consider the same setting only for time series of lifetimes. Propose a spectral density
estimator and justify its choice. Hint: Estimate the mean, subtract it, and then check how
this step affects statistical properties of the E-estimator.
8.6.1 Give definition of the probability density f X (x) of a continuous random variable X.
8.6.2 Suppose that [0, 1] is the support for a random variable X (or we may say the support
of the probability density f X ). What is the meaning of this phrase?
8.6.3 Find the mean and variance of the sample mean estimator (8.6.4) for the case of
independent observations of (a sample from) X.
8.6.4 Consider a time series {Xt }. What is the assumption that allows us to define the
density f Xt ? What is the assumption that makes feasible the problem of estimation of the
density f Xt ?
8.6.5 Explain why (8.6.6) is still valid for the case of dependent observations.
8.6.6 Is the sample mean estimator θ̂j unbiased? Is it robust toward dependence between
observations?
8.6.7∗ Find the variance of the sample mean estimate θ̂j for the case of dependent obser-
vations. Explain all steps and the assumptions made.
8.6.8 Explain all steps in establishing (8.6.8). Write down all used assumptions.
8.6.9 What is the assumption needed for validity of (8.6.9)?
8.6.10 Explain how the equality (8.6.10) was obtained.
8.6.11 Why is the assumption (8.6.11) important?
8.6.12 Explain importance of conclusion (8.6.12) for estimation of the probability density.
8.6.13 Give a definition of a weak dependence. Compare with the case of processes with
long memory.
8.6.14 Suppose that (8.6.14) holds. What can be said about dependency between observa-
tions?
8.6.15 Consider a continuous random variable Z with density f Z . What is the density of
Y := aZ + b? Note that we are dealing with a scale-location transformation of Z.
8.6.16 Explain the motivation behind estimator (8.6.15). Why do we use such a complicated
density estimator?
8.6.17∗ The density estimator (8.6.15) tells us that the support of X is [X(1) , X(n) ]. Is this
also the case for the empirical cumulative distribution function?
8.6.18∗ Consider a sample of size n from X. Find the probability P(X ≥ X(n) ). Use your
result to improve the approach (8.6.15).
8.6.19 Explain formulas (8.6.16) and (8.6.17).
8.6.20 Explain the simulation that creates top diagrams in Figure 8.7.
8.6.21 Explain how estimates, shown in Figure 8.7, are calculated.
8.6.22 What is the difference, if any, between left and right columns of diagrams in Figure
8.7?
8.6.23 Repeat Figure 8.7 a number of times, and make a conclusion about which type of
ARMA processes benefits estimation of the density.
8.6.24 Repeat Figure 8.7 with different parameters σ and σ1 . Report on how they affect
estimation of the density.
8.6.25 Find better parameters of the E-estimator used in Figure 8.7.
8.6.26 Explain the simulation that creates the top diagram in Figure 8.8.
8.6.27 Explain the estimate shown in the second from the top diagram in Figure 8.8.
8.6.28 Explain the simulation that creates the third from the top diagram in Figure 8.8.
8.6.29 Explain the estimate shown in the bottom diagram in Figure 8.8.
8.6.30 Evaluate the mean and variance of the available number N of observations in sim-
ulation in Figure 8.8.
8.6.31 Suggest better parameters of the E-estimator used in Figure 8.8.
8.6.32∗ Using Figure 8.8, as well as your understanding of the theory, comment on the
effect of missing data on density estimation for processes with long memory. Hint: Think
about the case when the sample size of a hidden sample and the number of complete cases
in a larger sample with missing observations are the same.
8.6.33 Explain the underlying idea of the estimator (8.6.18).
8.6.34∗ Find the mean and variance of the estimator (8.6.18).
8.6.35∗ Suppose that you have a sample of observations. What test would you suggest for
independence of observations?
8.6.36∗ Propose a generator of a time series with long memory.
8.7.1 Explain the model of nonparametric autoregression.
8.7.2 Why can the model (8.7.1) be referred to as the prediction (forecasting) model?
8.7.3∗ Consider a model Xt = m1 (Xt−1 ) + m2 (Xt−2 ) + σ(Xt−1 , Xt−2 )Wt . Propose E-
estimators for functions m1 (x), m2 (x) and σ(x1 , x2 ).
8.7.4 Is the process, defined by (8.7.1), second-order stationary? Hint: Think about as-
sumptions.
8.7.5 Explain (8.7.2).
8.7.6 Verify (8.7.3).
8.7.7 Explain how the nonparametric autoregression model is converted into a nonpara-
metric regression model.
8.7.8 Describe the simulation used in Figure 8.9.
8.7.9 How does a Markov-Bernoulli missing mechanism perform?
8.7.10 Is there any useful information that may be gained from analysis of Diagram 1 in
Figure 8.9?
8.7.11 How was Diagram 2 in Figure 8.9 created?
8.7.12∗ Explain all steps in construction of E-estimator of m(x). Then conduct a number
of simulations, using Figure 8.9, and comment on performance of the estimator.
8.7.13∗ Use the theory and Figure 8.9 to explain how parameters of the ARMA process
affect estimation of m(x). Hint: Check the assumption E{Ut |Xt−1 } = 0.
8.7.14 Scale function in model (8.7.1) is an important function on its own, and it is often
referred to as the volatility. Explain how it may be estimated using E-estimator. Then check
its performance using Figure 8.9.
8.7.15 Repeat Figure 8.9 several times, make hard copies of figures, and then write down
a report that explains performance of the E-estimators.
8.7.16 Find better parameters of the E-estimator used in Figure 8.9.
8.7.17 Explain how parameters of the E-estimator affect estimation of m(x). Then test
your conclusion using Figure 8.9.
8.7.18 Explain how parameters of the E-estimator affect estimation of σ(x). Then test your
conclusion using Figure 8.9.
8.7.19∗ Using Figure 8.9, find how parameters of the underlying model for m(x) affect
estimation of m(x). Then explain your conclusion theoretically.
8.7.20∗ Using Figure 8.9, find how parameters of the underlying model for m(x) affect
estimation of σ(x). Then explain your conclusion theoretically.
8.7.21 Using your understanding of the theory and simulations conducted by Figure 8.10,
explain how parameters of the Markov-Bernoulli missing affect estimation of m(x).
8.7.22 Scale function in model (8.7.1) is an important function on its own, and it is often
referred to as the volatility. Explain how it may be estimated using E-estimator.
8.7.23 Using your understanding of the theory and simulations conducted by Figure 8.10,
explain how parameters of the Markov-Bernoulli missing affect estimation of σ(x).
8.7.24 Repeat Figure 8.10 several times, make hard copies of figures, and then write a
report which explains shapes of the recorded E-estimates.
8.7.25∗ Model (8.7.1) is often referred to as a nonlinear one-step prediction. What will be
the definition of a nonlinear two-step prediction model? Suggest an E-estimator.
8.7.26∗ Consider a functional-coefficient autoregression model
Dependent Observations
This chapter is a continuation of Chapter 8 and it uses its notions and notations. The
primary interest is in the study of nonstationary processes where the joint distribution of
studied time series and/or the modification mechanism may change in time. The nonsta-
tionarity should test the limits of the E-estimation methodology.
The chapter begins with the analysis of a nonparametric regression with dependent re-
gression errors, in particular long-memory ones. We will learn in Section 9.1 that the design
of predictors, which may be either fixed or random, has a dramatic effect on the quality of
estimation. Recall that this phenomenon does not exist for independent regression errors.
Section 9.2 discusses classical continuous time processes, including Brownian motion and
white noise, and we learn how to filter a continuous signal from a white noise. In
Section 9.3 we consider a nonstationary discrete-time series and learn how to detrend, desea-
sonalize and descale it so we may estimate the spectral density of an underlying stationary
time series. The case of missing observations is considered as well. Section 9.4 considers
a classical decomposition of amplitude-modulated time series. Section 9.5 explains how to
deal with a missing mechanism that changes in time. Section 9.6 considers the case of a
nonstationary time series whose spectral density changes in time. Section 9.7 introduces us
to Simpson's paradox, which explains the importance of paying attention to lurking vari-
ables that may dramatically change our opinion about data. Finally, Section 9.8 introduces
us to sequential estimation. Here we explore the potential of a controlled design of
predictors in a regression.
2^{1/2} cos(πjx), j = 1, 2, . . .,

m(x) = Σ_{j=0}^{n} θ_j ϕ_j(x),   (9.1.2)
Let us also assume that the regression function is differentiable with bounded derivative on
[0, 1]. Then the variance V(θ̂j ) and the mean squared error of the Fourier estimator decrease
with the classical parametric rate n−1 .
The Fourier estimator (9.1.4) yields the regression E-estimator m̂(x) introduced in Sec-
tion 2.3.
Now let us explore the same characteristics of Fourier estimator (9.1.4) for the new case
when regression errors are a realization of a zero-mean and second-order stationary time
series {εt }. The dependence does not change the mean of the Fourier estimator (9.1.4) due
to the zero-mean property of regression errors. Let us evaluate the variance of θ̂j .
We begin with analysis of the variance of θ̂0 . Using (8.2.14) and the zero-mean property
of the errors we may write,
V(θ̂_0) = E{[n^{−1} Σ_{l=1}^{n} (m(X_l) − E{m(X_l)})]^2} + σ^2 E{(n^{−1} Σ_{l=1}^{n} ε_l)^2}
= [E{[n^{−1} Σ_{l=1}^{n} (m(X_l) − E{m(X_l)})]^2}] + σ^2 [n^{−2} Σ_{l=−n}^{n} (n − |l|)γ_ε(l)],   (9.1.5)
where γ ε (l) := E{ε0 εl } is the autocovariance function of the regression errors. There are
two terms on the right side of (9.1.5) highlighted by square brackets. The first one is either
zero for the fixed design or it decreases with the rate not slower than n−1 for the random
design. Further, the regression errors have no effect on the first term.
Regression errors affect only the second term in (9.1.5), and it is important to stress
that the design of predictors has no effect on the term. On the other hand, the dependence
in regression errors may dramatically change the rate of convergence to zero of the second
term. Indeed, if the dependence has short memory (say the errors are generated by an
ARMA process or are m-dependent) then Σ_{l=0}^{∞} |γ_ε(l)| < ∞ and the second term, as well as
the variance of θ̂_0, are proportional to n^{−1}. For a long-memory dependence of order β
with β ∈ (0, 1), when the autocovariance function γ ε (j) is proportional to j −β , the rate
of convergence to zero of the second term slows down to n−β . The latter yields the same
slowing down for the rate of convergence to zero of the variance of θ̂0 . Further, according
to the Parseval identity, we may expect lower rates for the MISE convergence.
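The slowdown of the rate is easy to see numerically. The sketch below evaluates the second term of (9.1.5), n^{−2} Σ_{l=−n}^{n}(n − |l|)γ_ε(l), for an assumed summable (short-memory) autocovariance and for an assumed long-memory autocovariance proportional to |l|^{−β} with β = 0.4; both choices are purely illustrative.

# Sketch: the second term of (9.1.5) under short- and long-memory autocovariances.
term2 <- function(n, gam) {                    # gam(l): autocovariance at lags l >= 0
  l <- 1:(n - 1)
  (n * gam(0) + 2 * sum((n - l) * gam(l))) / n^2
}
gam.short <- function(l) 0.5^abs(l)            # summable (e.g. ARMA-type decay)
gam.long  <- function(l) (1 + abs(l))^(-0.4)   # long memory of order beta = 0.4
for (n in c(100, 400, 1600))
  cat(n, " short:", term2(n, gam.short), " long:", term2(n, gam.long), "\n")
# The short-memory term decreases like n^(-1); the long-memory one only like n^(-0.4).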
The conclusion is that while the design of predictors does not affect the rate of con-
vergence of the variance of estimator θ̂0 , the dependence in regression errors may have
a pronounced effect on the variance. Furthermore, as we will see shortly, a long-memory
dependence in regression errors may make a reliable estimation of θ0 impossible for small
samples.
Now let us consider the effect of dependent regression errors on the variance of Fourier
estimator θ̂j for j ≥ 1. Write,
V(θ̂_j) = E{(θ̂_j − E{θ̂_j})^2} = n^{−2} E{[Σ_{l=1}^{n} (Y_l ϕ_j(X_l) − E{θ̂_j})]^2}
= n^{−2} E{[Σ_{l=1}^{n} ((m(X_l)ϕ_j(X_l) − E{θ̂_j}) + σε_l ϕ_j(X_l))]^2}
= n^{−2} Σ_{l,t=1}^{n} E{(m(X_l)ϕ_j(X_l) − E{θ̂_j})(m(X_t)ϕ_j(X_t) − E{θ̂_j})}   (9.1.6)
+ 2n^{−2} Σ_{l,t=1}^{n} E{(m(X_l)ϕ_j(X_l) − θ_j)σε_t ϕ_j(X_t)} + n^{−2} σ^2 Σ_{l,t=1}^{n} E{ε_l ε_t ϕ_j(X_l)ϕ_j(X_t)}.   (9.1.7)
Because the time series {εt } is zero-mean, the term (9.1.6) does not depend on the
distribution of regression errors. Further, for the fixed design this term is zero while for the
random design it decreases with the rate not slower than n−1 . To verify the last assertion
for the random design we use E{m(X)ϕ_j(X)} = E{θ̂_j} = θ_j, and for the fixed design we
use X_l = l/n and n^{−1} Σ_{l=1}^{n} m(l/n)ϕ_j(l/n) = E{θ̂_j}.
The first sum in (9.1.7) is zero because {εt } is zero-mean and independent of predictors.
The main term to explore is the second sum in (9.1.7), and this is where the design of
predictors becomes critical. Note that a particular expectation in that sum can be written
as
E{εl εt ϕj (Xl )ϕj (Xt )} = E{εl εt }E{ϕj (Xl )ϕj (Xt )} = γ ε (t − l)E{ϕj (Xl )ϕj (Xt )}. (9.1.8)
E{ϕj (Xl )ϕj (Xt )} = E{ϕj (Xl )}E{ϕj (Xt )}I(l 6= t) + E{[ϕj (Xl )]2 }I(l = t). (9.1.9)
Relations (9.1.10) and (9.1.11) explain the pronounced difference between the effect of
dependence in the time series {εt } on regressions with random and fixed designs. For the
random design we use (9.1.10) to evaluate the second term in (9.1.7) and get
n^{−2} σ^2 Σ_{l,t=1}^{n} E{ε_l ε_t ϕ_j(X_l)ϕ_j(X_t)} = n^{−2} σ^2 Σ_{l=1}^{n} E{ε_l^2}E{ϕ_j^2(X_l)}
Figure 9.1 Regression with long-memory errors of order β ∈ (0, 1). Two columns correspond to
different simulations of the same experiment. The same regression errors are used in generating
random and fixed design regressions. The predictors are uniform in both designs. Circles show a
scattergram of observations overlaid by the underlying regression function (the solid line) and its
E-estimate (the dashed line). {Parameter β is controlled by the argument beta.} [corn = 3, n =
100, beta = 0.2, sigma = 1, cJ0 = 4, cJ1 = 0.5, cTH = 4]
n^{−1} for all j ≥ 1. Of course, this is not the case for θ̂_0, but if we are interested solely
in estimation of the shape Σ_{j=1}^{∞} θ_j ϕ_j(x) of the regression function m(x), then this is a
remarkable statistical outcome.
Unfortunately, there is no similarly nice conclusion for the fixed-design case because
(9.1.11) does not allow us to eliminate the effect of dependency.
The teachable moment from our theoretical analysis is that a random design may at-
tenuate the effect of dependency if we are interested in estimating the shape of a regression
function.
Figure 9.1 allows us to understand the setting and appreciate performance of the E-
estimator for the case of long-memory regression errors and the uniform random and fixed
designs of predictors. Let us look at the left column of diagrams. Here we have the same
regression function, the same time series of long-memory regression errors, and the only
difference is in the design of predictors. What we see here is a typical long-memory series
that begins with negative values and, even after 100 realizations, it is still negative. Note
that the errors are zero mean, that is, eventually they will be positive and then they will
stay positive over a long period of time. Nonetheless, we do see the two modes, and even the
relation between the modes is shown reasonably well. In other words, while the regression
is shifted down (because we cannot reliably estimate θ0 for this sample size), the shape of
the regression is clearly visualized. The outcome is worse for the case of the fixed design,
and we know why. The right column of diagrams presents another realization of the same
underlying experiment. Here again we can see how the design, together with dependent
regression errors, affects both the scattergram and the E-estimator.
Let us make two more remarks. The first one is that there is no way for us to reliably
estimate the parameter θ_0 = ∫_0^1 m(x)dx for the considered sample size n = 100. It will be a
teachable moment to repeat Figure 9.1 with different sample sizes and parameter β and
analyze chances of feasible estimation of θ0 . The second remark is about the observed time
series {Yt } of responses. Consider the case of a stationary time series {εt } of regression
errors. Then, under the random-design of predictors the time series {Yt } of responses is also
stationary, and it is nonstationary for the fixed-design predictors whenever the regression
function m(x) is not constant. To see the latter, note that the mean of the time series {Yt }
changes in time.
We may conclude that for a regression problem the dependence between responses may
dramatically affect quality of estimation. Further, there is a striking difference between
random- and fixed-design regressions, and knowing this fact may help in designing an ex-
periment and choosing an appropriate methodology of estimation. Finally, the E-estimator
still may be used and the asymptotic theory asserts optimality of the series methodology of
estimation.
As we see, the power of a frequency-limited white noise increases to infinity as its fre-
quency domain increases. There is nothing wrong with dealing with such a process, at least
mathematically, but one needs to be aware of this fact.
There is a simple way to overcome the last complication via taking an integral of the
frequency-limited white noise. Indeed, introduce its integral
B(t, k) := ∫_0^t W(u, k)du = Σ_{j=0}^{k−1} W_j ∫_0^t ϕ_j(u)du
= W_0 t + Σ_{j=1}^{k−1} W_j (πj)^{−1} 2^{1/2} sin(πjt),   0 ≤ t ≤ 1.   (9.2.3)
The process B(t, k) is called the frequency-limited standard Brownian motion. Let us look
at its distribution. For a fixed time t, the frequency-limited standard Brownian motion
B(t, k) is a Gaussian variable with zero mean and variance
V(B(t, k)) = Σ_{j=0}^{k−1} [∫_0^t ϕ_j(u)du]^2 = t^2 + 2 Σ_{j=1}^{k−1} (πj)^{−2} sin^2(πjt),   0 ≤ t ≤ 1.   (9.2.4)
We can immediately say that the variance is bounded by a constant for all k, and we also will
establish shortly that V(B(t, k)) ≤ t. Hence, at least formally, we can introduce a standard
Brownian motion (also known as a Wiener process)
B(t) := Σ_{j=0}^{∞} W_j ∫_0^t ϕ_j(u)du.   (9.2.5)
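A path of the frequency-limited Brownian motion is straightforward to generate directly from (9.2.3); in the sketch below the cutoff k and the time grid are illustrative choices.

# Sketch: one path of the frequency-limited Brownian motion B(t, k) of (9.2.3).
set.seed(5)
k <- 50; t <- seq(0, 1, length.out = 201)
W <- rnorm(k)                                  # W_0, W_1, ..., W_{k-1}
B <- W[1] * t                                  # the W_0 * t term (R indexes from 1)
for (j in 1:(k - 1))
  B <- B + W[j + 1] * (pi * j)^(-1) * sqrt(2) * sin(pi * j * t)
plot(t, B, type = "l", ylab = "B(t, k)")
# Increasing k adds finer fluctuations while V(B(t, k)) stays bounded by t.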
Brownian motion plays a central role in the theory of stochastic processes, similarly to
the role of a Gaussian variable in the classical theory of probability. Let us look at some
of basic properties of the Brownian motion. First, B(0) = 0. Second, at any moment t a
Brownian motion B(t) is a Gaussian variable with zero mean. Let us calculate its variance,
Rt
and this is a very nice exercise on using the Parseval identity. Note that 0 ϕj (u)du is the
jth Fourier coefficient of function I(0 ≤ u ≤ t), and then the Parseval identity implies
t = ∫_0^1 [I(0 ≤ u ≤ t)]^2 du = Σ_{j=0}^{∞} [∫_0^t ϕ_j(u)du]^2.   (9.2.6)
The conclusion is that B(t) is a Gaussian variable with zero mean and variance t. Third, the
random process B(t0 + t) − B(t0 ) is again a standard Brownian motion that starts at time
Figure 9.2 Examples of a Brownian motion B(t, k) and a corresponding white noise W (t, k). {The
signals are frequency-limited by a cutoff k.} [k = 50]
t0 . Finally, we have an interesting property that for any 0 ≤ t1 < t2 ≤ t3 < t4 ≤ 1 Gaussian
variables B(t2 ) − B(t1 ) and B(t4 ) − B(t3 ) are independent. To prove this assertion, we note
that these two variables have a bivariate Gaussian distribution, and then we may use the
Parseval identity (1.3.39) and get,
0 = ∫_0^1 I(t_1 ≤ u ≤ t_2)I(t_3 ≤ u ≤ t_4)du = Σ_{j=0}^{∞} ∫_{t_1}^{t_2} ϕ_j(u)du ∫_{t_3}^{t_4} ϕ_j(u)du
Equation (9.2.9) explains why the problem of estimating m(t) is called filtering a signal
from a white Gaussian noise.
As we already know, a white noise is a pure mathematical notion. Indeed, a white
noise W(t) := Σ_{j=0}^{∞} W_j ϕ_j(t) has the same power at all frequencies (this explains the name
“white”). Thus its total power is infinite, and no physical system can generate a white
noise. On the other hand, its frequency-limited version W(t, k) = Σ_{j=0}^{k−1} W_j ϕ_j(t) has a
perfect physical sense, and at least theoretically, W (t, k) may be treated as W (t) passed
through an ideal low-pass rectangular filter. This explains why a white noise is widely used
in communication theory.
Our E-estimation methodology perfectly fits the problem of filtering a signal from white
noise. Indeed, as usual we write the signal m(t) as a Fourier series,
m(t) = Σ_{j=0}^{∞} θ_j ϕ_j(t),   0 ≤ t ≤ 1,   (9.2.10)
where
θ_j := ∫_0^1 m(t)ϕ_j(t)dt   (9.2.11)
are Fourier coefficients of the signal. Then using (9.2.8) or (9.2.9) we may introduce the
statistic
θ̂_j := ∫_0^1 ϕ_j(t)dY(t) = ∫_0^1 ϕ_j(t)m(t)dt + σ ∫_0^1 ϕ_j(t)dB(t) = θ_j + σW_j.   (9.2.12)
Figure 9.3 Filtering a signal from a white Gaussian noise by the E-estimator. Two columns of
diagrams correspond to the Normal and the Bimodal underlying signals, respectively. Top diagrams
show processes Y (t) simulated according to (9.2.8) with B(t) replaced by B(t, k). A bottom diagram
shows an underlying signal m(t) and its E-estimate by the solid and dashed lines, respectively.
{Parameter σ := σ ∗ /n1/2 , and σ ∗ is controlled by the argument sigma. The argument J controls
the parameter J in (9.2.13). The choice of the two underlying signals is controlled by the argument set.c.}
[set.c = c(2,3), sigma = 1, n = 50, k = 50, cJ0 = 4, cJ1 = .5, cTH = 4, J = 20]
know variables that were used to calculate it. What can be done in this case? A possible
solution is based on the result of Section 2.1 that Fourier coefficients of a smooth signal
decrease fast. Recall that in Section 2.2 we also used this fact in restricting our attention
to estimating only the first cJ0 + cJ1 ln(n) Fourier coefficients. As a result, we may introduce
the following estimator of the unknown parameter σ,
σ̂ := [J^{−1} Σ_{j=J+1}^{2J} θ̂_j^2]^{1/2}   (9.2.13)
Figure 9.4 Filtering a signal from a white Gaussian noise where a part of the noisy signal is missed.
The simulation is identical to the one in the right column of Figure 9.3. In the bottom diagram
the solid line shows the underlying signal and the dashed line shows the filtered signal over the two
periods when the noisy signal is observed. [corn = 3, sigma = 1, n = 50, k = 50, cJ0 = 4, cJ1 =
0.5, cTH = 4, J = 20]
with some reasonably large integer J. Statistics (9.2.12) and (9.2.13) allow us to use the
E-estimator of Section 2.2 with an artificial parameter n defined below.
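A small sequence-space sketch of the statistics (9.2.12) and (9.2.13) is given next. The underlying signal is an assumed smooth curve standing in for the book's corner functions, σ is taken as σ∗/n^{1/2} with σ∗ = 1 and n = 50 as in the caption of Figure 9.3, and a simple fixed cutoff replaces the data-driven E-estimator of Section 2.2.

# Sketch: filtering via (9.2.12)-(9.2.13) with an assumed signal and a fixed cutoff.
set.seed(6)
sigma <- 1 / sqrt(50); J <- 20; k <- 2 * J + 1
tt <- seq(0, 1, length.out = 201)
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
m <- function(x) dnorm(x, 0.5, 0.15)                       # assumed underlying signal
theta <- sapply(0:(k - 1),
                function(j) integrate(function(u) m(u) * phi(j, u), 0, 1)$value)
theta.hat <- theta + sigma * rnorm(k)                      # observed coefficients, (9.2.12)
sigma.hat <- sqrt(mean(theta.hat[(J + 2):(2 * J + 1)]^2))  # (9.2.13), j = J+1, ..., 2J
m.hat <- rowSums(sapply(0:7, function(j) theta.hat[j + 1] * phi(j, tt)))  # fixed cutoff
plot(tt, m(tt), type = "l"); lines(tt, m.hat, lty = 2)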
Let us shed light on the parameter σ and how it is related to the sample size n in
the density estimation problem discussed in Section 2.2. Using the notation of that section,
θ_j := ∫_0^1 ϕ_j(x) f^X(x)dx is the jth Fourier coefficient of the density f^X(x) supported on
[0, 1]. The Fourier coefficients are estimated by the sample mean estimator
θ̂_j = n^{−1} Σ_{l=1}^{n} ϕ_j(X_l) = θ_j + n^{−1} Σ_{l=1}^{n} [ϕ_j(X_l) − θ_j].   (9.2.14)
This decomposition resembles a fixed-design regression model, only now the aim is to eval-
uate statistical properties of the time series {Xt } while all other functions are consid-
ered as a nuisance. The following terminology is used. In (9.3.1) m(t) is a slowly changing
function known as a trend component, S(t) is a periodic function with period T (that is,
S(t + T ) = S(t) for all t), known as a seasonal (cyclical) component (it is also customarily
assumed that the integral or sum of the values of the seasonal component over the period
is zero), σ(t) is called a scale function (it is also often referred to, especially in finance and
econometrics literature, as a volatility), and {Xt } is a zero-mean and unit-variance station-
ary time series. Of course, the nuisance components may be of a practical interest on their
own, and we will discuss their estimation.
Another typical complication in analysis of time series is that some observations may
be missed, and instead of (9.3.1) we observe a time series

A_t Y_t,   t = 1, 2, . . . , n.   (9.3.2)
Here {At } is a Bernoulli time series discussed in Section 8.3. In particular, we will consider
examples of Markov-Bernoulli and batch-Bernoulli time series. It is assumed that time
series {At } and {Xt } are independent. The main aim is again to restore the modified time
series {Xt } and evaluate its statistical characteristics. In particular, we are interested in the
spectral density of {Xt }.
The underlying idea of solving the problem is to estimate the three nuisance components
by statistics m̃(t), S̃(t) and σ̃(t), then use the detrended, deseasoned and rescaled statistics
At [Yt − m̃(t) − S̃(t)][σ̃(t)]−1 as a proxy for At Xt , and finally invoke the technique of Section
8.3 of the analysis of stationary time series with missing observations. This is the plan that
we will follow in this section. In the meantime, it will be also explained how to estimate a
trend, a seasonal component and a scale function. This is the reason why it is convenient to
divide the rest of this section into subsections devoted to a particular statistical component
of the proposed solution with the last subsection being a detailed example explaining all
steps of the analysis. The reader may also first look at the example in Subsection 9.3.6 and
then return to subsections of interest.
In what follows, it is always assumed that the (hidden) time series of interest {Xt } is
zero-mean, unit-variance and second-order stationary.
9.3.1 Estimation of a Trend. Estimation of a trend is a regression problem, and as usual
it is convenient to rescale times of observations onto the unit interval [0, 1]. As a result, in
this subsection it is convenient to consider times t = 1/n, 2/n, . . . , 1 where n is the number
of available observations.
We begin with the case of no missing data, that is, the model (9.3.1). In that model
the sum m(t) + S(t) is deterministic, and we distinguish between the trend m(t) and the
seasonal component S(t) in frequency domain. Namely, it is assumed that the trend is a
slowly changing component. As a result, it is natural to use an orthogonal series approach to
find Fourier coefficients of m(t) + S(t) and then discuss how to separate the trend from the
seasonal component. (In some cases this separation may be a tricky issue and we postpone
this discussion until particular examples). In what follows c1 , c2 , . . . denote finite positive
constants whose specific values are not of interest. Set q(t) := m(t) + S(t), t ∈ [0, 1] and,
with some abuse of the notation, rewrite our observations of the time series (9.3.1) as

Y_l = q(l/n) + σ(l/n)X_l,   l = 1, 2, . . . , n.
+ [ Σ_{l=1}^{n} ∫_{(l−1)/n}^{l/n} [q(t) − q(l/n)]ϕ_j(t)dt − Σ_{l=1}^{n} X_l σ(l/n) ∫_{(l−1)/n}^{l/n} ϕ_j(t)dt ]
is a Fourier estimator of θj . Let us explore its properties via the analysis of νj and ηj .
In (9.3.4) the term νj is deterministic and defines the bias of estimator θ̃j , while the
term ηj is random, has zero mean and defines the variance of the Fourier estimator. For νj
we need to evaluate its absolute value. Let us assume that maxt∈[0,1] |dq(t)/dt| ≤ c1 , that is
the trend and the seasonal components are differentiable and their derivatives are bounded
on [0, 1]. Then the mean value theorem allows us to write
|ν_j| ≤ c_1 n^{−1} Σ_{l=1}^{n} ∫_{(l−1)/n}^{l/n} |ϕ_j(t)|dt = n^{−1} [c_1 ∫_0^1 |ϕ_j(t)|dt].   (9.3.6)
We conclude that the deterministic term νj is of order n−1 . Recall that a sample mean
estimate, based on a sample of independent and identically distributed variables, is unbiased
and its variance is of order n−1 . Estimator θ̃j is biased but the squared bias decreases in order
faster than n−1 . As a result, we may conclude that the biased nature of θ̃j has no effect on
its statistical properties as long as we are dealing with reasonably large samples. Further,
(9.3.6) explains why in (9.3.5) it is better to use ∫_{(l−1)/n}^{l/n} ϕ_j(t)dt in place of n^{−1} ϕ_j(l/n).
Indeed, in the latter case we would have an extra factor j in (9.3.6) because the derivative
of ϕj (t) is proportional to j.
Now let us consider the stochastic component ηj . Write,
E{η_j} = E{ Σ_{l=1}^{n} X_l σ(l/n) ∫_{(l−1)/n}^{l/n} ϕ_j(t)dt } = 0.   (9.3.7)
Here we used the assumed zero-mean property of {Xt }. For the variance we get,
V(η_j) = Σ_{l,s=1}^{n} E{X_l X_s}σ(l/n)σ(s/n)[∫_{(l−1)/n}^{l/n} ϕ_j(t)dt][∫_{(s−1)/n}^{s/n} ϕ_j(t)dt]
= Σ_{l,s=1}^{n} γ_X(l − s)σ(l/n)σ(s/n)[∫_{(l−1)/n}^{l/n} ϕ_j(t)dt][∫_{(s−1)/n}^{s/n} ϕ_j(t)dt].   (9.3.8)
Assume that {Xt } is a short-memory time series, with the main example of interest
being an ARMA process. Then we have
Σ_{l=−∞}^{∞} |γ_X(l)| ≤ c_2 < ∞.   (9.3.9)
Let us also assume that maxt∈[0,1] |σ(t)| ≤ c3 < ∞. Then we can continue evaluation of the
right side of (9.3.8),
V(η_j) ≤ Σ_{l,s=1}^{n} |γ_X(l − s)| n^{−2} 2c_3^2 ≤ n^{−1}[2c_2 c_3^2].   (9.3.10)
Note that the only reason why this estimator may perform poorly is if the intervals of
integration (Zs+1 − Zs )/n do not vanish. This is where our discussion of properties of
Markov-Bernoulli and batch-Bernoulli time series {At } becomes handy because we know
how to evaluate the probability of large gaps in observation of the underlying time series.
Let us explore the proposed Fourier estimator (9.3.12). Similarly to (9.3.4) and using
the above-introduced notation we can write,
θ_j − θ̂_j = Σ_{s=0}^{N} ∫_{Z_s/n}^{Z_{s+1}/n} [q(t) − q_n(Z_s/n)]ϕ_j(t)dt − Σ_{s=0}^{N} X_{Z_s} σ_n(Z_s/n) ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt.   (9.3.13)
Consider the first sum in (9.3.13). Using the assumption that the absolute value of the
derivative of q(t) is bounded by a constant c1 , we write,
| Σ_{s=0}^{N} ∫_{Z_s/n}^{Z_{s+1}/n} [q(t) − q_n(Z_s/n)]ϕ_j(t)dt |
≤ Σ_{s=0}^{n} I(s ≤ N) ∫_{Z_s/n}^{Z_{s+1}/n} |q(t) − q_n(Z_s/n)||ϕ_j(t)|dt
≤ 2^{1/2} c_1 n^{−2} Σ_{s=0}^{n} I(s ≤ N)[Z_{s+1} − Z_s]^2.   (9.3.14)
Note how we were able to replace the random number N + 1 of terms in the sum by the
fixed n + 1, and this will allow us to use our traditional methods for calculation of the
expectation and the variance. As we already know from Section 8.3, for the considered time
series {At } we have E{[Zs+1 − Zs ]4 } ≤ c4 < ∞ , and this yields (compare with (9.3.11))
E{| Σ_{s=0}^{N} ∫_{Z_s/n}^{Z_{s+1}/n} [q(t) − q_n(Z_s/n)]ϕ_j(t)dt |} ≤ c_5 n^{−1}.   (9.3.15)
Using the same technique we can evaluate the second moment of the sum,
E{[ Σ_{s=0}^{N} ∫_{Z_s/n}^{Z_{s+1}/n} [q(t) − q_n(Z_s/n)]ϕ_j(t)dt ]^2} ≤ 2c_1^2 n^{−4} Σ_{s,r=0}^{n} E{(Z_{s+1} − Z_s)^2 (Z_{r+1} − Z_r)^2} ≤ c_6 n^{−2}.   (9.3.16)
Again, it is of interest to compare (9.3.16) with (9.3.11). We conclude that the effect of
the first sum in (9.3.13) on the Fourier estimator is negligible.
Now we are considering the second sum in (9.3.13) which is the main term. Because
{Xt } is zero-mean and independent of {At } (and hence of {Zs }), we get
E{ Σ_{s=0}^{N} X_{Z_s} σ_n(Z_s/n) ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt } = 0.   (9.3.17)
V{ Σ_{s=0}^{N} X_{Z_s} σ_n(Z_s/n) ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt }
= V{ Σ_{s=0}^{n} I(s ≤ N) X_{Z_s} σ_n(Z_s/n) ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt }
= E{ Σ_{s,r=0}^{n} I(s ≤ N)I(r ≤ N) X_{Z_s} X_{Z_r} σ_n(Z_s/n)σ_n(Z_r/n) ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt ∫_{Z_r/n}^{Z_{r+1}/n} ϕ_j(t)dt }
≤ 2c_3^2 n^{−2} Σ_{s,r=0}^{n} E{|γ_X(Z_s − Z_r)|(Z_{s+1} − Z_s)(Z_{r+1} − Z_r)} ≤ c_7 n^{−1}.   (9.3.18)
In the last inequality we used the assumed (9.3.9), and recall that ci are positive constants
whose specific values are not of interest.
Combining the results we get that θ̂j , suggested for time series {At Yt } with missing
observations, has the same property as θ̃_j for {Yt }, namely

E{(θ_j − θ̂_j)^2} ≤ c_8 n^{−1}.   (9.3.19)
We conclude that, depending on a given time series, we may use either Fourier estimator
θ̃j or Fourier estimator θ̂j to construct a corresponding regression E-estimator of function
q(t) := m(t) + S(t).
Recall that in this subsection our aim is to estimate the trend m(t), and to do this we
simply bound the largest frequency J in the E-estimator. How to choose a feasible frequency
bound is explained in the next subsection.
9.3.2 Separation of Trend from Seasonal Component. These two components may
be separated using either frequency or time domains. By the latter it is understood that a
deterministic function with the period less than Tmax is referred to as a seasonal component,
and as a trend component otherwise. In some applications it may be easier to think about a
seasonal component in terms of periods and the choice of Tmax comes naturally. For instance,
for a long-term money investor Tmax is about several years, while for an active stock trader
it may be just several hours or even minutes.
If Tmax is specified, then in the above-proposed regression E-estimator, the largest fre-
quency used by the estimator should not exceed Jmax which is defined as the minimal integer
such that ϕJmax (x + Tmax ) ≈ ϕJmax (x) for all x. For instance, for the cosine basis on [0, n]
with the elements ϕ0 (t) := n−1/2 , ϕj (t) := (n/2)−1/2 cos(πjt/n), j = 1, 2, . . ., 0 ≤ t ≤ n,
we get
Jmax = ⌊2n/Tmax⌋.   (9.3.20)
Recall that ⌊x⌋ denotes the rounded-down x.
Using (9.3.20) in the E-estimator, proposed for q(t) in subsection 9.3.1, yields the E-
estimator m̂(t) of the trend.
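The trend step can be sketched in a few lines of R. The components below are assumed stand-ins for the book's corner functions; the Fourier coefficients are computed with the integral weights of Subsection 9.3.1 on the rescaled interval [0, 1], and only frequencies up to Jmax = ⌊2n/Tmax⌋ are kept, as in (9.3.20). The sketch is a simplified stand-in for the regression E-estimator.

# Sketch: trend estimation by a cosine-series projection truncated at Jmax.
set.seed(7)
n <- 200; Tmax <- 35; Jmax <- floor(2 * n / Tmax)
tt <- (1:n) / n                                            # rescaled observation times
m.true <- 1 + 0.5 * dnorm(tt, 0.3, 0.1)                    # assumed slowly changing trend
S.true <- 0.8 * sin(2 * pi * (1:n) / 20)                   # seasonal component, period 20
Y <- m.true + S.true + 0.7 * as.numeric(arima.sim(n = n, list(ar = 0.4)))
phi.int <- function(j, l)                                  # integral of phi_j over ((l-1)/n, l/n]
  if (j == 0) 1 / n else sqrt(2) / (pi * j) * (sin(pi * j * l / n) - sin(pi * j * (l - 1) / n))
theta <- sapply(0:Jmax, function(j) sum(Y * sapply(1:n, function(l) phi.int(j, l))))
phi <- function(j, x) if (j == 0) rep(1, length(x)) else sqrt(2) * cos(pi * j * x)
m.hat <- rowSums(sapply(0:Jmax, function(j) theta[j + 1] * phi(j, tt)))
plot(1:n, Y, col = "gray"); lines(1:n, m.true); lines(1:n, m.hat, lty = 2)
# Because the seasonal frequency 2n/20 exceeds Jmax, the period-20 component is not absorbed by the trend estimate.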
9.3.3 Estimation of a Seasonal Component. If T is the period of a seasonal component
S(t), then by definition of a seasonal component, we have S(t + T ) = S(t) for any t, and if a
time series is defined at integer points (and in this subsection this convention is convenient),
then Σ_{l=1}^{T} S(l) = 0 (a seasonal component should be zero-mean because the mean of a time
series is a part of the trend).
A classical time series theory assumes that the period T of an underlying seasonal
component S(x) is given. And indeed, in many practical examples, such as monthly housing
starts, hourly electricity demands, migration of birds, or monthly average temperatures,
periods of possible cyclical components are apparent. If the period is unknown, we will use
spectral density to estimate it.
For now, let us assume that the period T is known and m̃(t) is an estimate of the trend.
We begin with the case when observations of {Yt } are not missed. Set Ỹt := Yt − m̃(t) and
introduce the estimator
S̃(t) := (⌊(n − t)/T⌋ + 1)^{−1} Σ_{r=0}^{⌊(n−t)/T⌋} Ỹ_{t+rT},   t = 1, 2, . . . , T.   (9.3.21)
where W′_t := k^{−1/2} Σ_{r=0}^{k−1} W_{t+rT} are again independent standard normal variables. Thus, if
k is large enough (that is, if n is large and T is relatively small), then the estimator should
perform well.
Estimation of the seasonal component for the case of missed observations (9.3.2) is
similar,
Ŝ(t) := [Σ_{r=0}^{⌊(n−t)/T⌋} I(A_{t+rT}Y_{t+rT} ≠ 0)(A_{t+rT}Y_{t+rT} − m̃(t + rT))] / [Σ_{r=0}^{⌊(n−t)/T⌋} I(A_{t+rT}Y_{t+rT} ≠ 0)],   t = 1, 2, . . . , T.   (9.3.23)
9.3.4 Estimation of Scale Function. Estimation of the scale σ(x) in models (9.3.1) and
(9.3.2) is not a part of a classical time series analysis. The primary concern of the classical
time series theory is that the stochastic term {Xt } should be second-order stationary, that
is, the scale function σ(x) should be constant. Since this is typically not the case, the usually
recommended approach is to transform a dataset at hand in order to produce a new data
set that can be successfully modeled as a stationary time series. In particular, to reduce
the variability (volatility) of data, Box–Cox transformations are recommended when the
original positive observations Y1 , . . . , Yn are converted to ψλ (Y1 ), . . . , ψλ (Yn ), where two
popular choices are ψλ(y) := (y^λ − 1)/λ, λ ≠ 0, and ψλ(y) := log(y), λ = 0. By a suitable
choice of λ, the variability may be significantly reduced.
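For completeness, here is the Box–Cox transformation as a small helper; the data and the values of λ are illustrative.

# Sketch: the Box-Cox transformation psi_lambda used to reduce variability.
boxcox.tr <- function(y, lambda) {
  stopifnot(all(y > 0))                        # defined for positive observations
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
y <- rlnorm(200, sdlog = 1)                    # skewed positive data, for illustration
c(var.raw = var(y), var.log = var(boxcox.tr(y, 0)), var.sqrt = var(boxcox.tr(y, 0.5)))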
Our aim is twofold. First, we would like to estimate the scale function because in a
number of applications, specifically in finance where the scale is called the volatility, it is
important to know this function. Second, when the scale is estimated, this gives us access
to the hidden {Xt }.
Section 3.6 explains, for the case of complete observations, how to convert the problem
of estimation of the scale function into a classical regression one. Furthermore, here again
an E-estimator σ̂(t) may be used. The same conclusion holds for the case of missed data.
We will see shortly how the scale E-estimator performs.
9.3.5 Estimation of Spectral Density Function. This is our final step. As soon as
we have estimated trend, seasonal component and scale, we can use a plug-in estimator
for values of X1 , . . . , Xn or A1 X1 , . . . , An Xn depending on the data, and then utilize the
corresponding spectral density E-estimator of Chapter 8.
9.3.6 Example of the Nonparametric Analysis of a Time Series. This subsection
presents a simulated example of a nonstationary time series (9.3.2) and an explanation of
how the spectral density of an underlying zero-mean and second-order stationary time series
{Xt } can be estimated. The example is simulated by Figure 9.6; it contains 10 diagrams, and
to improve their visualization the first four diagrams are shown in Figure 9.5 and the rest
in Figure 9.6. Captions to these figures present explanations of the corresponding diagrams.
The used model of nonstationary time series is (9.3.2). Namely, we observe the product
{At Yt } of two time series at times t = 1, 2, . . . , n. The stationary time series {At } is Markov-
Bernoulli, its definition may be found in Section 8.3, and it creates missing observations
whenever At = 0. The nonstationary time series {Yt } is defined as
Y_t := m(t) + S(t) + σ(t)X_t,   t = 1, 2, . . . , n.
Here {Xt } is a zero-mean second-order stationary Gaussian time series of interest, in the
simulation it is an ARMA(1,1) process similar to those in Figure 8.1. The main aim is to
estimate its spectral density. Deterministic function m(t) is a trend, in the simulation it
is one of our corner functions with domain [1, n]. Deterministic function S(t) is a periodic
seasonal component. In the simulation it is a trigonometric function S(t) := ss sin(2πt/T ) +
sc cos(2πt/T ) with the period T . Deterministic function σ(t) is the scale, and it is defined as
σ_sc(1 + f(t/n)), where f(x) is one of our corner functions. How to choose specific underlying
parameters and functions is explained in the caption.
Now let us look at a realization of the underlying time series {Xt } shown in Diagram
1 of Figure 9.5. Here, crosses show the time series. The diagram is congested due to the
sample size n = 200, but overall the time series looks like a reasonable stationary realization.
Diagram 2 shows us the available time series with missing data. The horizontal line y = 0
helps us to see times when missing occurs. Note that the diagram is dramatically less
congested because only N = 144 observations are available and more than a quarter of
observations are missed. Note that the time series in Diagram 2 is all that we have for
the statistical analysis. Can you visualize the underlying trend, seasonal component, scale
Figure 9.5 Analysis of a nonstationary time series. Figure 9.6 creates ten diagrams and here the first
four are shown. Diagram 1 shows the underlying (hidden) time series {Xt } which is an ARMA(1,1)
similar to those in Figure 8.1. The sample size n = 200 is shown in the subtitle. Diagram 2 shows the
observed time series {At Yt } defined in (9.3.2). The time series {At } is generated as in Figure 8.2.
The horizontal line helps to recognize missing observations, and the number of available observations
N := ∑_{l=1}^{n} A_l = 144 is shown in the subtitle. The underlying trend and its E-estimate are shown
in Diagram 3 by the solid and dashed lines, respectively. Diagram 4 shows the detrended data. Note
that detrended data may be calculated only when At = 1, and hence only N = 144 observations are
shown in this diagram.
function, and the spectral structure of the stochastic noise from the data? The answer is
probably “no,” so let us see how the nonparametric data-driven procedure, discussed earlier,
handles the data.
The first step is the nonparametric estimation of the trend. This is done by the regression
E-estimator whose largest frequency is bounded by 2n/Tmax where the possible largest
period of seasonal component Tmax must be chosen manually. For the considered data we
choose the default Tmax = 35, which implies Jmax = 7. The E-estimate of the trend (the
dashed line) is shown in the third diagram. Note that it is based on the time series with
missed data. The estimate is relatively good, it undervalues the modes but otherwise nicely
shows the overall shape of the Bimodal corner function. The right tail goes down too much,
but this is what the data indicate (look at the right tail of the observed time series). The
reader is advised to repeat Figure 9.6 and get used to possible regression estimates because
here we are dealing with a very complicated setting and a rather delicate procedure of
estimation of the trend.
As soon as the trend is estimated, we can detrend the data (subtract the estimated
trend from observations shown in Diagram 2), and the result is exhibited in Diagram 4.
[Figure 9.6 is shown here; among its panel titles are "5. Spectral Density of Detrended Data" and "6. The Estimated Seasonal Component."]
Figure 9.6 Analysis of a nonstationary time series. Here the last 6 diagrams of Figure 9.6 are
presented while the first four are shown in Figure 9.5. Time series in Diagram 4 (see Figure 9.5) is
used to estimate its spectral density. Diagram 5 shows the estimated spectral density whose mode is
used to estimate a possible period of a seasonal component shown in the subtitle. Then the rounded
period, shown in the subtitle of Diagram 6, is used to estimate an underlying seasonal component.
The seasonal component (the triangles) and its estimate (the circles) are shown in Diagram 6. The
estimated seasonal component is subtracted from the detrended data of Diagram 4, and the result
is shown by circles in Diagram 7. Again, only available observations (when At = 1) are exhibited.
These statistics are used to estimate the scale function, and Diagram 8 shows the underlying scale
function (the solid line) and its E-estimate (the dashed line). Data, shown in Diagram 9, are the
observations in Diagram 7 divided by the scale E-estimate. Finally, the time series of Diagram 9
is used by the spectral density E-estimate. Diagram 10 shows the underlying spectral density (the
solid line) of the underlying time series {Xt } and its E-estimate (the dashed line). {Choosing the
trend and scale functions is controlled by arguments trendf and scalef. Parameter σ is controlled
by sigmasc. The estimate of the scale is bounded below by lbscale. The seasonal component is
S(t) = ss sin(2πt/T ) + sc cos(2πt/T ) whose parameters are controlled by arguments ss, sc and
Tseas. ARMA(1,1) time series {Xt } is generated as in Figure 8.1 and it is controlled by arguments
a and b. Time series {At } is generated as in Figure 8.2 and it is controlled by arguments alpha
and beta. Intervals for the search of an underlying period are controlled by arguments set.period (in
the time domain) and set.lambda (in the spectral domain). Argument TMAX separates trend from
seasonal component. Setting ManualPer=T allows the user to manually choose a period, and this
option is illustrated in Figure 9.8. A warning is issued if the estimated period is beyond a wished
range.} [n = 200, trendf = 3, scalef = 2, sigmasc = 0.5, ss = 1, sc = 1, a = -0.4, b = -0.5, alpha
= 0.4, beta = 0.8, TMAX = 35, Tseas = 10, ManualPer = F, set.period = c(8,12), set.lambda =
c(0,2), lbscale = 0.1, cJ0 = 4, cJ1 = 0.5, cTH = 4, cJ0sp = 2, cJ1sp = 0.5, cTHsp = 4]
Let us look at it. First, while a majority of observations in the second diagram are positive,
here it is fair to say that the sample mean is likely zero. Second, note that only available
observations are shown, and their number is N = 144. Third, based on this time series
with missed data, we need to recover an underlying seasonal component if one exists. Can
you recognize a seasonal component in the detrended data? Even if you know that this
is a smooth function with period 10, it is difficult to see it in the data. Furthermore, the
pronounced scale function complicates visualization of a seasonal component.
Estimation of the seasonal component is explained in diagrams of Figure 9.6 (recall that
it presents continuation of the analysis of data shown in Figure 9.5). Diagram 5 shows us the
E-estimate of the spectral density of the detrended time series with missing data. (Recall
that as in the previous sections, arguments of the spectral density E-estimator have the
attached string sp, for instance, cJ0sp is the argument that controls the coefficient cJ0 of
the spectral density E-estimator. This allows us to use separate arguments for the regression
estimator, which recovers the trend and scale functions, and the spectral density estimator.)
Diagram 5 indicates that the detrended data have a spectral density with a pronounced mode
at the frequency about 0.6. The period 9.62 (the estimated period), calculated according to
the formula T = 2π/λ, is given in the subtitle. The corresponding rounded (to the nearest
integer) period is 10, and this is exactly the underlying period. What we see is one of the
important practical applications of the spectral density that allows us to find periods of
seasonal components.
While for this particular simulated time series the rounded estimated period has been
determined correctly, this is not always the case. The small sample sizes and large errors
may take their toll and lead to an incorrect estimate of the period. We will return to this
issue shortly.
The rounded estimated period is used to estimate the underlying seasonal component.
Here the estimator (9.3.23) is used and recall that it uses the fact that S(t + T ) = S(t).
Circles in the sixth diagram show the estimate while triangles show the underlying seasonal
component. The estimate is not perfect, but it is not chaotic or unreasonable. Note that
its magnitude is fair, and the phase is shown absolutely correctly. Keep in mind that each
point is the average of about 14 observations, so even for a parametric setting this would
be considered a small sample size.
It is fair to say that for this particular data the nonparametric technique produced a
remarkable outcome keeping in mind complexity of data in Diagram 2.
As soon as the seasonal component is estimated, we can subtract it from the detrended
time series, and the resulting time series with missing observations is shown in Diagram
7. Now we may almost feel the shape of the scale function which still makes the series
nonstationary. The E-estimate of the scale function (the dashed line) is shown in Diagram
8, and we may compare it with the underlying scale (the solid line). Overall the E-estimate
is good, and note that it is based on a time series where more than a quarter of observations
are missed.
The data shown in Diagram 7, using the regression terminology, may be referred to
as the time series of residuals. As soon as the scale function is estimated, we divide the
residuals by the scale. To avoid a zero divisor, the estimate is truncated from below by the
argument lbscale; the default value is 0.1. The resulting time series, called rescaled residuals,
is shown in Diagram 9 for times t when At = 1. Note that the rescaled residuals are our
plug-in E-estimates X̂t of the underlying stationary time series Xt . Visual analysis shows
that there is no apparent trend, or a seasonal component, or a scale function.
Finally, we arrive at the last step of estimation of the spectral density of the rescaled
residuals. The E-estimate (the dashed line) and the underlying spectral density (the solid
line) are shown in Diagram 10. Let us explain what to look at here. The spectral density
estimate is clearly different from the one in Diagram 5. The main issue is that we do not
have a pronounced mode near frequency 0.6. This tells us that a seasonal component was
successfully removed from the data. Otherwise we would again see a pronounced mode in
the spectral density. The spectral density E-estimate is relatively good keeping in mind the
complexity of the problem.
This finishes our analysis of this particular time series. Repeated simulations may show
different outcomes, and Figure 9.6 allows us to address some issues. In Diagram 5 the
mode correctly indicates the period of seasonal component, but in general this may not be
the case. First, there may be several local modes created by both a seasonal component
and a stochastic component, and large errors also may produce a wrong global mode. As
a result, the period may be estimated incorrectly. One of the possibilities to avoid such
a complication is to use prior information about the domain of possible periods. To play
around with this possibility, two arguments are added to Figure 9.6, namely, set.period
and set.lambda. The first one, set.period = c(T1,T2), allows one to skip estimation of a
seasonal component whenever an estimated period is beyond the interval [T1, T2]. The
second argument, set.lambda = c(λ1 , λ2 ), allows one to restrict the search for the mode
to this particular frequency interval. While these two arguments do a similar job, they are
good tools for gaining the necessary experience in dealing with the time and frequency
domains. Note that Diagrams 6 and 7 are skipped if the estimated period is beyond the
interval [T1,T2] or the frequency is beyond its interval, and then a warning statement is
issued.
The second reason for the failure of the estimation of the period is that due to large noise
and small sample size, the mode of an estimated spectral density may be relatively flat. To
understand why, consider, as an example, frequencies λ∗1 = 0.6, λ∗2 = 0.59, and λ∗3 = 0.54.
Then, the corresponding periods (recall the formula T = 2π/λ) are T1∗ = 2π/0.6 = 10.47,
T2∗ = 2π/0.59 = 10.64, and T3∗ = 2π/0.54 = 11.63, which imply the rounded periods 10,
11, and 12, respectively. We conclude that a relatively small error in the location of a mode
may imply a significant error in the estimated period.
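This arithmetic is easy to verify in R:

## Rounded periods corresponding to nearby mode locations, using T = 2*pi/lambda.
lambda <- c(0.60, 0.59, 0.54)
cbind(lambda, period = 2 * pi / lambda, rounded = round(2 * pi / lambda))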
Two questions immediately arise: how to detect such a case and how to correct the
mistake. An incorrect period will not remove the mode in the spectral estimate shown in
Diagram 10. Further, an incorrect period will not show a reasonable seasonal component in
Diagram 6. These are the key points to check. The obvious method to deal with a wrong
period is to use a manual period, and Figure 9.6 allows us to do this. Set the argument
ManualPer = T (in R-language “T” stands for “True” and “F” for “False”). This stops the
calculations at Diagram 5. Then the program prompts for entering a wished period from
the keyboard. At the prompt 1: enter a period (here it should be 10, but any integer period
may be tried) from the keyboard and then press Return; then at the prompt 2: just press
Return. This completes the procedure, and the seasonal component will be calculated with
the period entered. The period will be shown in the subtitle of Diagram 6. This option will
be illustrated in Figure 9.8.
One more comment is due. In some cases a spectral density estimate of rescaled residuals
has a relatively large left tail, as in Diagram 10, while an underlying theoretical spectral
density does not. One of the typical reasons for such a mistake is a poorly estimated trend.
Unfortunately, for the cases of small sample sizes and relatively large errors there is no
cure for this “disease,” but knowledge of this phenomenon may shed light on a particular
outcome.
9.4 Decomposition of Amplitude-Modulated Time Series
Consider a nonstationary time series
Y_t := m(t) + S(t) + σ(t)X_t,   t = 1, 2, . . . , n.   (9.4.1)
Here, as in Section 9.3, m(t) is the trend, S(t) is the seasonal component, and σ(t) is the
scale. As a result, the time series {Yt } is neither zero-mean, nor unit-variance, nor station-
ary. Second, {Yt } is not observed directly, and instead we observe its modification by an
amplitude-modulating process {Ut } (recall Section 8.4). Namely, the available observation
is a time series with elements
V_t := U_t Y_t,   t = 1, 2, . . . , n.   (9.4.2)
Here {Ut } is a time series of independent and identically distributed Poisson random vari-
ables with an unknown mean λ. Note that {Ut } creates both the missing and the amplitude
modulation of the time series {Yt }. Let us additionally assume that P(Yt = 0) = 0; then it
is sufficient to observe only Vt to recognize whether the underlying Yt is missed (Ut = 0) or scaled
(Ut > 0). Indeed, using the assumption we get P(I(Ut = 0) = I(Vt = 0)) = 1, and hence
I(Vt ≠ 0) may be used as the availability.
The main aim is to estimate the spectral density of {Xt } based on observations V1 , . . . , Vn
of the time series {Vt }, and we also would like to estimate the trend, the seasonal component,
the scale and the parameter λ.
The problem looks similar to the one considered in Section 9.3, but here we have an
additional complication that not only some observations of time series {Yt } are missed but
they are also amplitude-modulated. In other words, we are dealing with a sophisticated
modification of the hidden time series of interest {Xt }. Nonetheless, the only way to solve
the problem is to estimate all nuisance functions and get an access to the underlying time
series of interest.
To shed light on a possible solution, similarly to Section 9.3 we begin with estimation
of q(t) := m(t) + S(t) and only for estimation of this function assume that the times of
observations are tl := l/n, l = 1, 2, . . . , n. In a general case we may rescale all observations
on the unit interval, and we do this to use our traditional regression approach of estimation
over the unit interval.
To employ a regression E-estimator for estimation of q(t), we need to estimate a Fourier
coefficient
θ_j := ∫_0^1 q(t)ϕ_j(t)dt.   (9.4.3)
To do this, set N := ∑_{l=1}^{n} I(V_l ≠ 0) for the number of available observations of the un-
derlying time series, assume that N > 1, and then similarly to Section 9.3 we denote by
Z_s, s ∈ {1, 2, . . . , N}, random variables such that V_{Z_s} ≠ 0. Recall our discussion in Sections
4.1 and 9.3 about cases with small N and that for a feasible estimation we need to have N
comparable with sizes n used for directly observed time series.
Set Z_0 := 0, Z_{N+k} := n for k ≥ 1, with some plain abuse of notation set V_0 := V_1, and
introduce a statistic
θ̌_j := ∑_{s=0}^{N} V_{Z_s} ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt.   (9.4.4)
We can continue (9.4.4) and replace the random number of terms in the sum by a deterministic
one,
θ̌_j := ∑_{s=0}^{n} I(s ≤ N) V_{Z_s} ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt.   (9.4.5)
Taking the expectation and using the fact that the time series {X_t} is zero-mean, we can write
E{θ̌_j} = ∑_{s=0}^{n} E{E{U_{Z_s}|Z_s} I(s ≤ N) q(Z_s/n) ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt}.   (9.4.6)
Note that for a given Zs the Poisson variable UZs is positive, and hence
E{U_{Z_s}|Z_s} = E{U_t I(U_t > 0)} / P(U_t > 0) = λ / (1 − e^{−λ}).   (9.4.7)
We conclude that the statistic θ̌j cannot be used for estimation of θj , but there is a
simple remedy. Assume that λ is known and q(t) is differentiable and the absolute value of
the derivative is bounded. Introduce a Fourier estimator
θ̃_j := ((1 − e^{−λ})/λ) ∑_{s=0}^{N} V_{Z_s} ∫_{Z_s/n}^{Z_{s+1}/n} ϕ_j(t)dt.   (9.4.8)
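A minimal R sketch of the estimator (9.4.8), for a known λ and with the cosine basis used as ϕ_j (an assumption of this sketch, not necessarily the book's basis), may look as follows.

## Sketch of the Fourier estimator (9.4.8); V is the observed series U_t*Y_t.
phi <- function(j, t) if (j == 0) rep(1, length(t)) else sqrt(2) * cos(pi * j * t)
fourier.est <- function(V, lambda, j) {
  n <- length(V)
  Z <- which(V != 0)                         # availability times Z_1 < ... < Z_N
  N <- length(Z)
  if (N < 1) return(NA)
  Zext <- c(0, Z, n)                         # Z_0 := 0 and Z_{N+1} := n
  Vext <- c(V[Z[1]], V[Z])                   # the s = 0 term reuses the first available value
  total <- 0
  for (s in 0:N) {
    a <- Zext[s + 1] / n; b <- Zext[s + 2] / n
    total <- total + Vext[s + 1] * integrate(function(t) phi(j, t), a, b)$value
  }
  (1 - exp(-lambda)) / lambda * total
}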
[Figure 9.7 is shown here; the subtitles of its diagrams report n = 300 and N = 231.]
Figure 9.7 Analysis of amplitude-modulated nonstationary time series. Here the first four diagrams,
created by Figure 9.8, are shown. The underlying simulation of Yt := m(t) + S(t) + σ(t)Xt is
identical to Figure 9.6. The only difference is that here the observed time series is {Ut Yt } where Ut
are independent and identically distributed Poisson variables with mean λ = 1.5. The structure of
diagrams is similar to Figure 9.5.
Further, the magnitude of detrended observations is still relatively large with respect to the
hidden data in Diagram 1.
We continue our analysis of the data in Figure 9.8. Diagram 5 shows us the E-estimate of the
spectral density of the detrended time series. Look how flat the left mode is; it lies in the
range of frequencies of interest. We know from our discussion in Section 9.3 that this may
lead to inconsistency in estimation of the period of a possible seasonal component. Because
Figure 9.8 uses the default argument ManualPer=T, the program stops after Diagram 5 and
allows us to enter a wished period (the caption explains how to do this). Here the period
10 was entered, and Diagram 6 shows the estimated seasonal component by circles and the
underlying seasonal component by triangles. The estimate is reasonable given the complexity
of the problem. A wrongly chosen period would not produce a reasonable shape of the seasonal
component, and the left mode of the spectral density, observed in Diagram 5, would be
seen again in Diagram 10. These two facts may help to choose a correct period for Diagram
6.
The detrended and deseasonalized time series is shown in Diagram 7. It is clear that the
Figure 9.8 Analysis of amplitude-modulated nonstationary time series {Vt } := {Ut Yt }. Here {Ut }
is the Poisson time series of independent variables with the mean λ, time series {Yt } and {Ut } are
independent, and {Yt } is generated as in Figure 9.6. The case of a manually chosen period T = 10
of seasonal component is presented. This option is set by the argument ManualPer = T which
stops the program after exhibiting Diagram 5. Then the program prompts for entering a wished
period. At prompt 1: enter a period (here it is 10, but any integer period may be tried) from the
keyboard and press Return, then at the prompt 2: press Return. This completes the procedure, the
seasonal component will be calculated with the manual period which will be also shown in the subtitle
of Diagram 6. Apart from this, the structure of diagrams is identical to those in Figure 9.6. [n = 200,
lambda = 1.5, trendf = 3, scalef = 2, sigmasc = 0.5, ss = 1, sc = 1, a = -0.3,b = -0.5, TMAX =
35, Tseas = 10, ManualPer = T, set.period = c(8,12), set.lambda = c(0,2), lbscale = 0.1, cJ0 =
4, cJ1 = 0.5, cTH = 4, cJ0sp = 2, cJ1sp = 0.5, cTHsp = 4]
scale is not constant and it is larger for t around 200. This is what we see in the scale
estimate (the dashed line) shown in Diagram 8. The estimate is skewed to the right with
respect to the underlying scale (the solid line), and its tails are also wrong. At the same
time, let us note that the E-estimator knows only the time series shown in Diagram 7, and
the data support the conclusion of the E-estimator. The main issue here is the large volatility
created by Poisson amplitude-modulation, recall the discussion in Section 8.4.
We are almost done with the decomposition. Diagram 9 shows us the rescaled residuals,
and Diagram 10 shows the spectral density E-estimate (the dashed line) and the true spectral
density (the solid line) of time series {Xt } shown in Diagram 1. First of all, note that the
left mode, observed in the spectral density of detrended data in Diagram 5, is completely
gone. This tells us that the underlying time series {Yt } has a seasonal component with
period 10, the estimated seasonal component (see Diagram 6) is very reasonable, and that
the procedure of removing the seasonal component was successful. The spectral density
estimate by itself is far from being perfect, but at least it correctly shows that {Xt } has a
larger power at high frequencies.
The explored model of amplitude-modulated nonstationary time series is complicated,
and the reader is encouraged to repeat Figure 9.8 with different parameters, appreciate
its complexity, and get a good training experience. Let us stress one more time that here
we are dealing with compounded effects of the scale-location modification {Yt } = {m(t) +
S(t) + σ(t)Xt } of the process of interest {Xt }, missing observations of {Yt } created by zero
realizations of a Poisson process, and amplitude-modulation of nonstationary {Yt } by the
Poisson process. Figure 9.8 helps to shed light on all these modifications.
E{γ̂^Y(j)} = γ^X(j)[(n − j)^{−1} ∑_{l=1}^{n−j} E{U_{l/n}}E{U_{(l+j)/n}}].   (9.5.3)
Set λ(t) := E{Ut }, t ∈ [0, 1] and assume that the function λ(t) is differentiable and
the absolute value of the derivative is bounded on [0, 1]. Then we can continue (9.5.3) and,
to simplify formulae, we are considering separately cases j = 0 and j > 0. Using relation
E{U²_{l/n}} = λ(l/n) + [λ(l/n)]², we can continue (9.5.3) for the case j = 0,
E{γ̂^Y(0)} = γ^X(0)[n^{−1} ∑_{l=1}^{n} λ(l/n)(1 + λ(l/n))]
= γ^X(0) ∫_0^1 λ(t)(1 + λ(t))dt + {γ^X(0)[n^{−1} ∑_{l=1}^{n} λ(l/n)(1 + λ(l/n)) − ∫_0^1 λ(t)(1 + λ(t))dt]}.   (9.5.4)
The absolute value of the term in the square brackets, due to the assumed smoothness
of function λ(t), is not larger than c1 n−1 (note that we are evaluating the remainder for
a Riemann sum and we have done a similar calculation in Section 9.3). Here and in what
follows ci are some positive finite constants whose specific values are not of interest to us.
We conclude that
|γ^X(0) − E{γ̂^Y(0)} / ∫_0^1 λ(t)(1 + λ(t))dt| ≤ c₂ γ^X(0) n^{−1}.   (9.5.5)
For j > 0 we can continue (9.5.3) and write,
E{γ̂^Y(j)} = γ^X(j)[(n − j)^{−1} ∑_{l=1}^{n−j} λ(l/n)λ((l + j)/n)]
= γ^X(j) ∫_0^1 (λ(t))² dt + {γ^X(j)[(n − j)^{−1} ∑_{l=1}^{n−j} λ(l/n)λ((l + j)/n) − ∫_0^1 (λ(t))² dt]}.   (9.5.6)
The absolute value of the term in the square brackets, due to the assumed smoothness of
function λ(t), is not larger than c3 jn−1 , and we get
|γ^X(j) − E{γ̂^Y(j)} / ∫_0^1 (λ(t))² dt| ≤ c₄ |γ^X(j)| j n^{−1}.   (9.5.7)
We conclude that if the function λ(t), defining the missing mechanism, is known, then
γ̂^Y(j)/∫_0^1 (λ(t))² dt may be used as a Fourier estimator of the autocovariance function of
interest γ^X(j).
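In R the correction is a single division; a minimal sketch, assuming λ(t) is known and supplied as a vectorized function, follows.

## Correcting the sample autocovariance of the observed series for a known
## availability intensity lambda(t); cf. (9.5.5) and (9.5.7).
gammaX.est <- function(gammaY.hat, j, lambda.fun) {
  denom <- if (j == 0) {
    integrate(function(t) lambda.fun(t) * (1 + lambda.fun(t)), 0, 1)$value
  } else {
    integrate(function(t) lambda.fun(t)^2, 0, 1)$value
  }
  gammaY.hat / denom
}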
In general function λ(t) is unknown. Recall the assumption P(Xt = 0) = 0, and using it
we can write,
E{I(Y_l = 0)} = P(U_{l/n} = 0) = e^{−λ(l/n)},   l = 1, 2, . . . , n.   (9.5.8)
Set
m(t) := e^{−λ(t)}.   (9.5.9)
Then (9.5.8) implies that m(t) is the regression function in the fixed design Bernoulli re-
gression with the response I(Yl = 0) and the predictor l/n, l = 1, 2, . . . , n. Hence we can
use the Bernoulli regression E-estimator m̂(t) and propose the plug-in estimator λ̂(t) := −ln(m̂(t)).
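A minimal R sketch of this idea, with a loess smoother standing in for the Bernoulli regression E-estimator, is given below.

## Estimating lambda(t) from the zero observations; loess is only a stand-in
## for the Bernoulli regression E-estimator.
lambda.est <- function(Y) {
  n <- length(Y)
  t <- (1:n) / n
  m.hat <- fitted(loess(as.numeric(Y == 0) ~ t))  # estimate of m(t) = exp(-lambda(t))
  m.hat <- pmin(pmax(m.hat, 1 / n), 1)            # keep the estimate inside (0, 1]
  -log(m.hat)                                     # plug-in estimate of lambda(t)
}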
[Figure 9.9 is shown here; the subtitles of its diagrams report n = 240 and N = 175.]
Figure 9.9 illustrates the problem and the proposed solution. Diagram 1 shows a realization
of an underlying stationary ARMA(1,1) time series, the sample size n = 240 is shown in
the subtitle. Diagram 2 shows us observed realizations of the nonstationary amplitude-
modulated time series {Ut Xt }. Here missed realizations, when Ut = 0, are skipped. This
presentation allows us to conclude that the availability likelihood increases as t increases
because as t increases the number of missing realizations decreases. We also clearly observe
the amplitude-modulated structure of the time series, which is in stark contrast to the time
series in Diagram 1. Diagram 3 shows us the scattergram used to estimate the regression
function m(t) defined in (9.5.9). The solid line shows the underlying regression and the
dashed line shows the regression E-estimate. Note that the function λ(t) is not differentiable
but still it is possible to show that all our conclusions hold for this function because it is
piecewise differentiable. As we see from Diagram 3, the E-estimate is not perfect but it does
follow the data. Further, Diagram 3 is a nice place to realize that the amplitude-modulating
time series is not stationary.
The final result of the estimation procedure is exhibited in Diagram 4. Here the solid
line shows the spectral density of the underlying (hidden) process {Xt }. We see that it
has a larger power at lower frequencies, and returning to Diagram 1 we realize why. The
dotted line shows us the estimated spectral density of {Ut Xt }. Note that it has much
larger power, thanks to the Poisson amplitude-modulation. Clearly using the naı̈ve estimate
would be misleading. Nonetheless, please note that its overall decreasing shape is correct.
The interested reader may return to the above-presented formulae to test this conclusion
theoretically. The dot-dashed line is the oracle E-estimate based on the hidden (but known
to us from the simulation) time series {Xt } shown in Diagram 1. This estimate is good and
it indicates that the underlying time series is reasonable to begin with. Finally, the proposed
E-estimate is shown by the dashed line. This is a fair estimate keeping in mind complexity
of the nonstationary modification and that almost 27% of initial observations are missed,
and note that the E-estimate is dramatically better than the naı̈ve estimate.
Here γ X (j) is the autocovariance function which for any l = 1, 2, . . . , n − j can be defined
as
γ^X(j) := γ_l^X(j) := E{X_{l/n} X_{(l+j)/n}}.   (9.6.2)
Because the time series is second-order stationary, in (9.6.2) the autocovariance does not
depend on a particular l. In general this may not be the case.
Let us present an example where (9.6.2) does not hold. Consider a causal ARMA(1,1)
process Xt − aXt−1/n = σ(Wt + bWt−1/n ), |a| < 1. If a and b are constants, we know from
Section 8.2 that
γ^X(0) = σ²[(a + b)² + 1 − a²]/(1 − a²),   γ^X(1) = σ²(a + b)(1 + ab)/(1 − a²),
γ^X(j) = a^{j−1}γ^X(1),   j ≥ 2,   (9.6.3)
g^X(λ) = σ²|1 + be^{iλ}|² / (2π|1 − ae^{iλ}|²).   (9.6.4)
Note that if a > 0 and b > 0, then γ(1) > 0, and a realization of the time series will
“slowly” change over time. On the other hand, if a + b < 0 and 1 + ab > 0, then a realization
of the time series may change its sign almost every time. Thus, depending on parameters a
and b, we may see either slow or fast oscillations in a realization of an ARMA(1,1) process.
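Formulas (9.6.3) and (9.6.4) are straightforward to evaluate in R; a minimal sketch follows, with the parameters of Figures 8.1 and 8.2 used as an example.

## Autocovariances (9.6.3) and spectral density (9.6.4) of a causal ARMA(1,1).
arma11.gamma <- function(a, b, sigma, jmax = 5) {
  g0 <- sigma^2 * ((a + b)^2 + 1 - a^2) / (1 - a^2)
  g1 <- sigma^2 * (a + b) * (1 + a * b) / (1 - a^2)
  c(g0, g1 * a^(0:(jmax - 1)))                    # gamma(0), gamma(1), ..., gamma(jmax)
}
arma11.spec <- function(lambda, a, b, sigma) {
  sigma^2 * Mod(1 + b * exp(1i * lambda))^2 / (2 * pi * Mod(1 - a * exp(1i * lambda))^2)
}
arma11.gamma(-0.3, -0.6, 1)   # parameters of Figure 8.1
arma11.gamma(0.4, 0.5, 1)     # parameters of Figure 8.2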
Figures 8.1 and 8.2 allow us to visualize ARMA(1,1) processes with (a = −0.3, b = −0.6)
and (a = 0.4, b = 0.5), respectively. What happens if the parameters of the ARMA(1,1) process
change from those in Figure 8.1 to those in Figure 8.2 as time t increases from 0 to 1?
In other words, suppose that the parameters a = a(t) and b = b(t) change in time t ∈ [0, 1].
Then, according to (9.6.3) and (9.6.4), we have changing in time (dynamic, nonstationary)
autocovariance function and spectral density.
This example motivates us to introduce the following characteristics of a zero-mean
time series {X_t , t = 1/n, 2/n, . . .} with bounded second moments E{X_t²} ≤ c∗ < ∞. We
introduce a changing in time (dynamic, nonstationary) autocovariance function
γ_t^X(j) := E{X_t X_{t+j/n}},   j = 0, 1, . . . .   (9.6.6)
In its turn, the changing in time autocovariance yields the dynamic (nonstationary) spectral
density
g_t^X(λ) := (2π)^{−1}γ_t^X(0) + π^{−1} ∑_{j=1}^{∞} γ_t^X(j) cos(jλ),   −π < λ ≤ π,   0 ≤ t ≤ 1.   (9.6.7)
These characteristics explain a possible approach for the spectral analysis of nonsta-
tionary time series. Suppose that we are dealing with a realization X1/n , X2/n , . . . , Xn/n
of a zero-mean time series {Xt } with a bounded second moment. Then the problem is to
estimate the dynamic autocovariance γtX (j) and the dynamic spectral density gtX (λ).
Note that a dynamic autocovariance is a bivariate function in t and j. As a result, the
proposed solution is to fix j < n and consider (9.6.6) as a nonparametric regression function
with t being the predictor and Xt Xt+j/n being the response. Indeed, we may formally write
X_{l/n} X_{(l+j)/n} = γ_{l/n}^X(j) + ε_{l,j,n},   l = 1, 2, . . . , n − j,   (9.6.8)
and use the regression E-estimator of Section 2.3 to estimate γtX (j) as a function in t.
Then we can use these estimators to construct E-estimator of the dynamic spectral density.
Because the dynamic spectral density is a bivariate function, it is better to visualize the
corresponding dynamic autocovariance functions, and this is the recommended approach.
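A minimal R sketch of this regression approach, with loess standing in for the regression E-estimator of Section 2.3, follows.

## Dynamic autocovariance gamma_t(j) via the regression (9.6.8); loess is a stand-in.
dyn.autocov <- function(X, j) {
  n <- length(X)
  l <- 1:(n - j)
  resp <- X[l] * X[l + j]            # responses X_{l/n} * X_{(l+j)/n}
  pred <- l / n                      # predictor t = l/n
  fit <- loess(resp ~ pred)
  list(t = pred, gamma = fitted(fit))
}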
Now, when we know the model and the estimation procedure, let us analyze several
realizations of a nonstationary time series.
Figure 9.10 allows us to look at a particular realization of the above-presented example
of ARMA(1,1) process with changing in time coefficients. Its caption explains the simulation
and the diagrams. The top diagram shows us a particular realization of the nonstationary
Gaussian ARMA process. Note how, over the time of observation of the process, initial
high-frequency oscillations transfer into low-frequency oscillations. It may be instructive to
look one more time at the corresponding stationary oscillations shown in Figures 8.1 and
8.2 and then compare them with the nonstationary one.
Now let us see what the traditional methods of spectral density estimation discussed in
Section 8.2, developed for stationary processes, produce. The periodogram is clearly confused
and exhibits several possible periodic components (look at the modes and try to evaluate
possible periods using the rule T = 2π/λ discussed in Section 8.2). The bottom diagram
shows the initial underlying spectral density (the solid line), the final underlying spectral
density (the dotted line), and the E-estimate of Section 8.2 (the dashed line). Note that
[Figure 9.10 is shown here; its middle and bottom diagrams are titled "Periodogram" and "Spectral Density."]
Figure 9.10 Nonstationary ARMA(1,1) time series and its naı̈ve spectral analysis. The top
diagram shows a particular realization of a nonstationary Gaussian ARMA(1,1) time series
Xl/n − a(l/n)X(l−1)/n = σ(Wl/n + b(l/n)W(l−1)/n ), l = 1, 2, . . . , n, where a(l/n) = a0 + (l/n)a1 ,
b(l/n) = b0 + (l/n)b1 , and {Wt } is a standard Gaussian white noise. The middle diagram shows the
periodogram. The bottom diagram shows by the solid and dotted lines spectral densities of station-
ary Gaussian ARMA(1,1) processes with parameters (a0 , b0 , σ) and (a1 , b1 , σ), respectively. In other
words, these are initial (t = 0) and final (t = 1) spectral densities of the underlying nonstationary
ARMA time series. The dashed line shows the spectral density E-estimate of Section 8.2 which
is developed for a second-order stationary time series. {Parameters a0 , b0 , a1 , b1 are controlled by
arguments a0, b0, a1, and b1, respectively.} [n = 240, sigma = 1, a0 = -0.3, b0 = -0.6, a1 = 0.4,
b1 = 0.5, cJ0sp = 2, cJ1sp = 0.5, cTHsp = 4]
the E-estimate, similarly to the periodogram, is confused and indicates that the observed
process is a white noise.
What we see in Figure 9.10 is in no way the fault of the estimators because they are
developed for stationary processes and cannot be used for nonstationary ones.
Now let us check how the proposed methodology of dynamic autocovariances estimated
via the regression E-estimator works out. Figure 9.11 illustrates the proposed approach.
Here the top diagram shows us another simulation of the same Gaussian nonstationary
process used in Figure 9.10. We again see a familiar pattern of changing the dynamic of
the process over time from high to low frequency oscillations. The three other diagrams
show underlying autocovariances (the solid line) and the E-estimates (the dashed line). As
[Figure 9.11 is shown here; its three lower diagrams are titled "Autocovariance γ_t^X(0)," "Autocovariance γ_t^X(1)," and "Autocovariance γ_t^X(2)."]
Figure 9.11 Estimation of the first three autocovariance functions for a nonstationary time series.
The time series is generated as in Figure 9.10. The solid and dashed lines show an underlying
autocovariance γtX (j) and its regression E-estimate, respectively. [n = 240, sigma = 1, a0 = -0.3,
b0 = -0.6, a1 = 0.4, b1= 0.5, cJ0 = 4, cJ1 = 0.5, cTH = 4]
we see, the E-estimates are not perfect but they correctly show the dynamics of the underlying
autocovariances. Note that for a stationary process these three curves should be constant
in time. Hence we can use the developed methodology for testing second-order stationarity.
We do not show here the corresponding dynamic spectral density gtX (λ) because it is
a bivariate function in t and λ, and we know that it is not easy to visualize a bivariate
function. On the other hand, recall that at each moment t the dynamic autocovariances are
coefficients in Fourier expansion (9.6.7). Hence, at each moment in time we can reconstruct
the spectral density E-estimate using E-estimates of dynamic autocovariances.
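A minimal R sketch of this reconstruction at a fixed time t follows; the vector of estimated dynamic autocovariances (γ̂_t(0), . . . , γ̂_t(J)) is assumed to be given.

## Plug estimated dynamic autocovariances into the expansion (9.6.7).
dyn.spec <- function(gammas, lambda) {
  J <- length(gammas) - 1
  g <- gammas[1] / (2 * pi)
  for (j in seq_len(J)) g <- g + gammas[j + 1] * cos(j * lambda) / pi
  g
}
## Example: plot(function(x) dyn.spec(c(1.2, 0.5, 0.1), x), 0, pi)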
Let us finish this section with a remark about a more general formulation of the problem.
Consider a set of time series {Xt,τ } where t is a discrete time of moments when we can
observe these time series and τ is a continuous parameter that controls second-order charac-
teristics like the autocovariance function and the spectral density. Note that we are dealing
with infinitely many underlying time series of interest that are synchronized in time. We
do not observe all these time series simultaneously. Instead, we may observe the trajectory
(realization) of a process X_{t,t} , t = 1, 2, . . . , n. Note that at every time we observe a realization
of a different underlying time series. The latter is a very special modification of the underlying
time series because each time we "jump" from observing one time series to observing
another one. Similarly to the previous setting with a dynamic ARMA process, we
would like to infer about second-order characteristics of the underlying set of time series.
The main assumption is that if τ = t, then the second-order characteristics change slowly
in time t. Under this assumption, the above-presented solution is applicable.
Our conclusion from the survey is clear: the mean salary of B graduates is only 66% of
the mean salary of the A graduates.
Now let us look more closely at the available data and take into account the lurking vari-
able “field of concentration” which can be either science or engineering. Table 9.2 presents
the corresponding data.
Table 9.2 sheds an absolutely different light on the same data and dramatically changes
our opinion about salaries of A and B graduates. The data indicate that B graduates have
larger salaries in both fields. Note that Table 9.2 is a classical two-way table which takes
into account the lurking variable “field of concentration.”
How can B graduates do better than A graduates in every field according to Table 9.2
yet still fall far behind A graduates according to Table 9.1? The explanation is in the larger
number of B graduates concentrating in science where salaries are lower than in engineering.
When salaries from both concentrations are lumped together, the B graduates place lower
because the fields they favor pay less.
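A tiny numerical illustration of the reversal (with hypothetical numbers, not those of Table 9.2) can be written in a few lines of R: within each field B graduates earn more, yet the lumped mean favors A because most B graduates are in the lower-paying field.

## Hypothetical field-wise mean salaries and counts illustrating the reversal.
meanA <- c(Eng = 100, Sci = 50);  countA <- c(Eng = 80, Sci = 20)
meanB <- c(Eng = 110, Sci = 55);  countB <- c(Eng = 20, Sci = 80)
weighted.mean(meanA, countA)   # 90: lumped mean for A graduates
weighted.mean(meanB, countB)   # 66: lumped mean for B graduates, despite higher field-wise means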
[Figure 9.12 is shown here; its two diagrams, titled "A Graduates" and "B Graduates," plot Salary versus GPA with subtitles Mean Salary = 0.47 and Mean Salary = 0.31.]
Figure 9.12 Two scattergrams overlaid by linear regressions. The left diagram presents data for
graduates with a master’s degree from college A, and the right diagram for graduates with a master’s
degree from college B. The simulation and its parameters are explained at the end of Section 9.7.
{This figure is the first part of the triplet generated by Figure 9.12. To proceed to Figure 9.13, at the
prompt Browser[1] > enter letter c and then press Return. The same procedure is used to proceed
to Figure 9.14.} [n = 100, k = 10, sigma = 0.1, a = 0.8, b = 0.2, eta = 0.4]
The original one-way Table 9.1 is misleading because it does not take into account the
lurking categorical variable "field." This misleading conclusion constitutes Simpson's paradox: an
association or comparison that holds for each of several groups can reverse direction when the
data are combined to form a single group.
Of course, there are many other examples that are similar to the above-presented case of
two colleges. For instance, you may want to compare two hospitals with the lurking variable
being the proportion of elderly patients, or you may want to compare returns of two mutual
funds with different required allocations in bonds and stocks, etc.
The main teaching moment is that Simpson’s paradox helps us to recognize the impor-
tance of paying attention to lurking variables. It truly teaches us to look “inside” the data
and challenge “obvious” conclusions.
Can the paradox be useful in understanding other familiar statistical topics? Let us
continue the discussion of the salary example, and we will be able to recognize Simpson’s
paradox in a linear regression and learn how to investigate it using the conditional density.
Another interesting feature of the example is that we will be using a continuous stochastic
process in an underlying model.
Figure 9.12 presents data and linear regressions with the salary, 10 years after gradu-
ation, being the response and GPA being the predictor. The two diagrams correspond to
graduates from schools A and B, and there are n = 100 graduates from each school. The
solid lines, overlaying the scattergrams, exhibit the least squares linear regressions for A
and B graduates. The mean salaries are shown in the subtitles.
What we see in Figure 9.12 is a definite testament to the superiority of School A. Not
only is the mean salary of A graduates significantly larger, but also the better learning at school
A, reflected by the GPA, is dramatically more rewarding in terms of the salary as we can
see from the larger slope in the linear regression. Interestingly, even A graduates with the
worst grades do better than their peers from school B (compare the y-intercepts). Note that
Figure 9.12 is the extended regression analog of Table 9.1, and it allows us to make much
[Figure 9.13 is shown here; its four diagrams, titled "A Engineers," "A Scientists," "B Engineers," and "B Scientists," plot Salary versus GPA with subtitles Mean Salary = 0.56, 0.18, 0.61, and 0.21, respectively.]
Figure 9.13 Scattergrams taking into account the field of specialization. Solid lines are least-squares
regressions. This is a continuation of Figure 9.12.
stronger conclusion about rather dramatic differences in the salary patterns of A and B
graduates. After analysis of the data, it is absolutely clear that school A does a superb job
in educating its students and preparing them for the future career.
Now, similar to Table 9.2, let us take into account the lurking variable field (science or
engineering). Figure 9.13 presents corresponding scatterplots overlaid by linear regressions.
Note that the four diagrams give us much better visualization and understanding of data
than a two-way table like Table 9.2. These diagrams also completely change our opinion
about the two schools. What we see is that in each field B graduates do better and the
success in learning, measured by the GPA, is more rewarding at school B for both fields.
This is a complete reversal of the previous conclusion based on the analysis of Figure 9.12.
Just think about a similar example when performances of two mutual funds are compared
and how wrong conclusions that do not take into account all pivotal information may be.
How can B graduates do better in all aspects of the salary in every field, yet fall far behind
when we look at all engineers and scientists? The answer is the same as in the previous
discussion of the multi-way tables. When salaries are lumped together, the B graduates
are placed lower because the fields they favor pay less. What makes the presented example
so attractive is that the lurking variable “field” dramatically changes our opinion not only
[Figure 9.14 is shown here; its two columns of perspective plots, titled "A Graduate" and "B Graduate," show surfaces over the (GPA, Salary) plane.]
Figure 9.14 Estimated conditional densities of salary given GPA. Each column of diagrams shows
the same E-estimate using two different “eye” locations. This is a final figure created by Figure
9.12.
about average salaries, but also about the effect of being a good-standing (higher GPA)
student.
As we see, Simpson’s paradox for a regression may be even more confusing and challeng-
ing than its classical multi-way-table counterpart where only mean values are compared.
As a result, the paradox motivates us to think about data and to not rush with conclusions
based on employing standard statistical tools.
Can statistical science suggest a tool for recognizing a possible Simpson's paradox
and/or the presence of an important lurking variable? The answer is “yes,” and the tool is
the conditional density. Remember that if f XY (x, y) is the joint density of the pair (X, Y )
of random variables, then the conditional density of Y given X is defined as f Y |X (y|x) :=
f XY (x, y)/f X (x) assuming that the marginal density f X (x) of X is positive. Of course, the
conditional density is a bivariate function and, as we know from Section 4.4, this presents
its own complications caused by the curse of dimensionality. Nonetheless, it may help us in
visualization of a potential issue with a lurking variable.
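As an illustration only (the book's conditional density E-estimator of Section 4.4 is used in Figure 9.14), a simple kernel-based sketch of a conditional density estimate can be written with the kde2d function from the MASS package.

## Conditional density f(y|x) obtained by dividing a bivariate kernel density
## estimate by the marginal density of x; a stand-in for the E-estimator.
library(MASS)
cond.density <- function(x, y, ngrid = 50) {
  joint <- kde2d(x, y, n = ngrid)                 # estimate of the joint density f(x, y)
  dy <- diff(joint$y[1:2])
  fx <- rowSums(joint$z) * dy                     # marginal f(x) by integrating over y
  joint$z <- sweep(joint$z, 1, pmax(fx, 1e-8), "/")
  joint                                           # $z now approximates f(y|x) on the grid
}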
Figure 9.14 allows us to look at nonparametric E-estimates of the conditional density
of salary given GPA for the two schools based on data exhibited in Figure 9.12. The two
“eye” locations help us to visualize the surfaces. Note that intersection of a vertical slice
(with constant GPA) with the surface shows us the estimated conditional density of the
salary given a specific GPA. The two pronounced ridges in conditional densities for both
schools clearly indicate the dependence of the salary on the GPA, and they also indicate
two possible strata in the data that may be explained by a lurking variable. After making
this observation, it may be a good learning experience to return to the available data shown
in Figure 9.12 and realize that the strata indicated by the conditional density may also be
seen in the scattergrams.
Now let us explain the underlying simulation used in Figure 9.12. For both schools the
rescaled GPA is a uniform random variable on [0, 1]. For the lth graduate from school A
with the field of concentration F_l , which is either engineering when F_l = 1 or science when
F_l = 0, the salary S_l(A) as a function of the GPA G is defined as the rescaling onto [0, 1] of a
random variable S_l^0(A), where
S_l^0(A) = 2 + G_l + F_l(A)(1 + 2G_l) + σW(A, G_l , k) + ηZ_l(A),   l = 1, . . . , n.   (9.7.1)
Here F_l(A) is a Bernoulli random variable with P(F_l(A) = 1) = a, W(A, t, k) =
∑_{j=0}^{k−1} W_j(A)ϕ_j(t) is a standard frequency-limited white Gaussian noise with W_j(A)
being independent standard Gaussian variables, Z_l(A) are also independent standard Gaussian
variables, and a, k, σ, η are parameters that may be changed while using Figure 9.12.
Similarly for the school B,
S_l^0(B) = 2 + 1.2G_l + F_l(B)(1 + 3G_l) + σW(B, G_l , k) + ηZ_l(B),   l = 1, . . . , n.   (9.7.2)
Here F_l(B) is a Bernoulli random variable with P(F_l(B) = 1) = b; it is the indicator that
the lth graduate from school B is an engineer, and b is the parameter that may be changed
while using Figure 9.12. The process W (B, t, k) is defined as explained above for school A.
The reader is advised to repeat Figure 9.12 with different parameters and get a better
understanding of the underlying idea of Simpson’s paradox for regression and how the
conditional density may shed light on the paradox. Another teachable moment is to visualize
data in Figure 9.12 and learn how to search for possible strata directly from scattergrams.
Figure 9.15 Testing the idea of the optimal design. Regression scattergrams, simulated according to
(9.8.1), are shown by circles overlaid by the underlying regression function (the solid line) and its
E-estimate (the dashed line). The Uniform design density is used in the top diagram, the optimal
design density f∗X (x), proportional to the scale, is used in the bottom diagram. The same errors εl ,
l = 1, . . . , n are used in the two regressions. The scale function is defined as σ(a + f2 (x)) where
f_2(x) is the Normal density. In a title, ISE is the empirical integrated squared error and AISE is the
sample mean of ISEs obtained for nsim simulations. R := ∫_0^1[σ(x)]² dx / [∫_0^1 σ(x)dx]² is the ratio
between the functionals (9.8.2) for the Uniform and optimal designs; it is shown in the title of
the bottom diagram. {The argument corn controls an underlying regression function, n controls the
sample size, sigma controls σ, nsim controls the number of repeated simulations used to calculate
AISE.} [n = 50, corn = 2, sigma = 0.1, a = 3, nsim = 100, cJ0 = 4, cJ1 = 0.5, cTH = 4]
Figure 9.15 allows us to conduct the first test, and its caption explains the diagrams.
Here the scale function is σ(x) = σ(a + f2 (x)) where f2 (x) is the Normal density, σ = 0.1
and a = 3. The top diagram uses the Uniform density as the design density, and the bottom
diagram uses the design density f∗X (x) proportional to the underlying scale function. The
same regression errors ε_l are used in both simulations, so the only difference is in
the design of predictors. What we see here is that indeed the optimal design placed more
observations in the middle of the unit interval. On the other hand, note that in the bottom
diagram we have fewer observations to estimate tails of the regression function. In particular,
under the optimal design there are no observations near the end of the right tail.
Let us stress that we are considering random designs, so each simulation may have its
own specific issues. Further, because the scale is larger near the center of the interval of
estimation, in the bottom diagram we observe more cases with larger regression errors.
Nonetheless, for this particular simulation the optimal design implied a smaller integrated
squared error (ISE) shown in the title.
Of course, we have analyzed a single simulation, and it is of interest to repeat the
simulation a number of times and then compare the sample means of corresponding ISEs.
Figure 9.15 allows us to do this. Namely, it repeats the outlined simulation nsim = 100
times, for each simulation calculates ISEs for the Uniform and optimal designs, then averages
[Figure 9.16 is shown here; its middle diagram is titled "Scale Function, k = 50."]
Figure 9.16 Testing a two-stage sequential design. The underlying regression is the same as in
Figure 9.15. Regression scattergrams are shown by circles overlaid by the underlying regression
function (the solid line) and its estimate (the dashed line). The top diagram shows a simulation
(9.8.1) with the Uniform design. The middle diagram shows estimation of the scale function based
on first k observations shown by crossed circles in the top diagram. Here k := ⌊bn⌋ where b is the
parameter. The bottom diagram shows results for a particular sequential simulation. Here the first
k pairs of observations are the same as in the top diagram (generated according to the Uniform
design density) and indicated by crossed circles, and the next n − k are generated according to the
density proportional to the estimated scale function shown in the middle diagram. Indicated ISE
and AISE are the integrated squared errors for the shown regression estimates and averaged ISEs
over nsim simulations. [n = 100, corn = 2, sigma = 0.1, a = 3, b = 0.5, nsim = 100, cJ0 = 4,
cJ1 = 0.5, cTH = 4]
them and shows AISEs in the titles. The above-described numerical study is a traditional
statistical tool to compare different statistical procedures.
The AISEs, shown in the titles of the corresponding diagrams, are relatively close to
each other but do benefit the optimal design. It is of interest to compare them with the
theoretical ratio R := d(1, σ)/d∗ (σ) = 1.06 shown in the title of the bottom diagram. As
we see, the potential for improvement is rather modest, but nonetheless it does exist. Of
course, the small ratio stresses complexity of the problem of a sequential controlled design
for small samples because this design involves more estimation procedures that may go
wrong. Nonetheless, the outcome of Figure 9.15 is encouraging, and we may proceed to
exploring a sequential design.
A sequential design, explored in Figure 9.16, is a two-stage design. First, we conduct
k = ⌊bn⌋ simulations according to the Uniform design. These observations are used to
estimate the scale function by the regression E-estimator of Section 3.6. This procedure
constitutes the first stage of the design. The second stage is to generate the remaining n − k
pairs of observations using the design density calculated according to (9.8.3) with the plug-
in scale E-estimate. Note that in the available sample the last n − k pairs of observations
are mutually independent, but observations collected during the first and second stages are
dependent.
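A minimal R sketch of such a two-stage design is given below; a loess smoother of absolute residuals stands in for the scale E-estimator, and the regression and scale functions are illustrative choices mirroring the simulation of Figure 9.15.

## Sketch of the two-stage sequential design (stage 1: Uniform design; stage 2:
## predictors sampled from a density proportional to the estimated scale).
set.seed(2)
n <- 100; b <- 0.5; k <- floor(b * n)
m <- function(x) dnorm(x, 0.5, 0.15)                       # illustrative regression function
scale.fun <- function(x) 0.1 * (3 + dnorm(x, 0.5, 0.15))   # sigma*(a + f2(x)), sigma = 0.1, a = 3
x1 <- runif(k); y1 <- m(x1) + scale.fun(x1) * rnorm(k)     # stage 1
fit.sc <- loess(abs(y1 - fitted(loess(y1 ~ x1))) ~ x1)     # stand-in scale estimate
sc.hat <- function(x) {
  p <- predict(fit.sc, newdata = data.frame(x1 = x))
  p[is.na(p)] <- 0.1
  pmax(p, 0.1)                                             # truncation from below (cf. lbscale)
}
M <- max(sc.hat(seq(0, 1, by = 0.01)))                     # bound for rejection sampling
x2 <- numeric(0)
while (length(x2) < n - k) {                               # stage 2: sample from density ~ sc.hat
  u <- runif(1)
  if (runif(1) <= sc.hat(u) / M) x2 <- c(x2, u)
}
y2 <- m(x2) + scale.fun(x2) * rnorm(n - k)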
Figure 9.16 shows us a particular two-stage sequential simulation. Its top diagram is
simulated identically to the top diagram in Figure 9.15. The only difference here is that
the first k pairs are highlighted by crossed circles. These are the pairs used to estimate the
underlying scale function, and the result is shown in the middle diagram. The E-estimate
of the scale (the dashed line) is far from being perfect, but note that only its shape is of
interest (recall formula (9.8.3)), and the E-estimate does indicate that the scale function is
symmetric about 0.5 and has a pronounced mode. Sequential design is shown in the bottom
diagram. Here the first k pairs are the same as in the top diagram, and they are highlighted
by crossed circles. The remaining n − k pairs are generated using the estimated optimal
design. The particular outcomes and the AISEs favor the sequential design but they are
too close to each other. Because of this, another simulation may reverse the outcome. The
issue here is that the size n is too small for the considered challenging regression problem.
Nonetheless, this outcome may be considered as a success. Indeed, with the ratio R = 1.06
(recall Figure 9.15), the sequential controlled design has a tiny margin for error. Further,
the studied sequential regression estimation is a dramatically more complicated procedure
than the regression E-estimator for the Uniform design. And the fact that the AISEs are
close to each other for the relatively small sample size n = 100 is encouraging.
We may conclude that a sequential controlled design is an interesting and promising idea,
but in general it is a challenging one for small samples when initial observations are used for
estimation of nuisance functions and then obtained estimates define an optimal design for
next observations. This is when a numerical study becomes a pivotal tool. For the considered
regression problem, Figures 9.15 and 9.16, repeated with different parameters, may help to
learn more about this interesting problem.
9.9 Exercises
9.1.1 Explain the problem of regression estimation with dependent responses.
9.1.2 Suppose that in the regression model (9.1.1) the predictor Xl is uniformly distributed
and εl are realizations of a stationary time series. Under this assumption, are observations
(Xl , Yl ) stationary or nonstationary?
9.1.3 Consider the previous exercise only now Xl = l/n. Are observations (Xl , Yl ) stationary
or nonstationary?
9.1.4 Explain how regression E-estimator performs for the case of a model (9.1.1) with
independent errors.
9.1.5∗ Suppose that regression errors are a zero-mean and second-order stationary time
series. What is the mean and the variance of the sample mean estimator (9.1.4)?
9.1.6 Verify every step in establishing (9.1.6). Comment about used assumptions.
9.1.7∗ Explain why the second sum in (9.1.7) is sensitive to the design of predictors.
9.1.8 Verify (9.1.8).
9.1.9 Explain why (9.1.9) holds or does not hold for both designs of predictors.
9.1.10 Verify (9.1.12) for a random design and then explain why this result is good news
for a random design regression.
9.1.11 Is (9.1.12) valid for a fixed design regression?
9.1.12∗ What is the MISE of the regression E-estimator for the case of Gaussian ARMA
regression errors and a fixed design of predictors? Make any additional assumptions that
may be helpful.
9.1.13 Explain the underlying simulation used in Figure 9.1.
9.1.14 What is the definition of long-memory errors?
9.1.15 Do long-memory errors affect regression estimation for the case of a random design?
Explain your answer theoretically and then support by simulations using Figure 9.1.
9.1.16∗ Consider the case of regression with fixed design of predictors and long-memory er-
rors. Calculate the rate of the MISE convergence for the E-estimator, and then complement
your answer by simulations.
9.1.17∗ Explore the effect of σ on regression estimation with dependent errors. Use both
theoretical and empirical approaches.
9.1.18 How well may the shape of a regression function be estimated?
9.1.19∗ Figure 9.1 indicates that the E-estimator has difficulties with estimation of the
mean of regression function. Explain why, and then answer the following question. Is this
problem specific for the E-estimator?
9.1.20 Using Figure 9.1 for different sample sizes, write a report on how the sample size
affects estimation of regression function in the presence of long-memory errors.
9.1.21 Using Figure 9.1, explore the effect of parameter β on quality of estimation.
9.1.22 Propose better parameters for the E-estimator used in Figure 9.1.
9.2.1 Give several examples of continuous stochastic processes.
9.2.2 Present several examples of stationary and nonstationary stochastic processes.
9.2.3 Find the mean and the variance of the continuous stochastic process (9.2.1). Hint:
Calculate these characteristics for a particular time t.
9.2.4 Is the process (9.2.1) stationary or nonstationary?
9.2.5 What is the distribution of process (9.2.1) at time t?
9.2.6 Why is the limit, as k → ∞, of the process W (t, k) called a white process?
9.2.7 Is it possible to simulate a process W (t, ∞)?
9.2.8 Give the definition of a Brownian (Wiener) process. What is the underlying idea
behind its definition?
9.2.9 Verify (9.2.4).
9.2.10∗ Consider two independent Brownian motions. What can be said about their differ-
ence and sum? What will change if the Brownian motions are dependent?
9.2.11 Consider times t1 < t2 . What is the distribution of B(t2 ) − B(t1 )?
9.2.12∗ Consider times t1 < t2 . What is the distribution of B(t1 ) + B(t2 )?
9.2.13∗ Explain the meaning of infinite sum in (9.2.5).
9.2.14 Verify (9.2.6).
9.2.15 Verify each equality in (9.2.7).
9.2.16 Explain the simulation used by Figure 9.2.
9.2.17∗ Explain theoretically, and then test empirically using Figure 9.2, how parameter k
affects the shape of curves.
9.2.18 Explain the problem of filtering a continuous signal from white noise.
9.2.19 Are the models (9.2.8) and (9.2.9) equivalent?
9.2.20∗ Explain how the E-estimator performs filtering a signal from white noise.
9.2.21∗ Consider the case of a signal from a Sobolev class Sα,Q defined in (2.1.11). First,
find the MISE of the E-estimator for a given cutoff J. Second, find a cutoff that minimizes
the MISE. Finally, calculate the corresponding minimal MISE. Note that the obtained rate
of the MISE convergence is the fastest and no other estimator can produce a better rate.
9.2.22∗ Consider the case of a signal from an analytic class Ar,Q defined in (2.1.12). First,
find the MISE of the E-estimator for a given cutoff J. Second, find a cutoff that minimizes
the MISE. Finally, calculate the corresponding minimal MISE. Note that the obtained rate
of the MISE convergence is the fastest and no other estimator can produce a better rate as
n → ∞. Further, even the constant of the MISE is minimal among all possible estimators.
9.2.23 Explain the relations (9.2.12).
9.2.24 Consider (9.2.12). Why are Wj independent and have a standard Gaussian distri-
bution?
9.2.25 Explain the underlying idea of estimator (9.2.13).
9.2.26∗ Find the mean and variance of estimator (9.2.13).
9.2.27 Explain the simulation used in Figure 9.3.
9.2.28 How is parameter σ chosen in Figure 9.3?
9.2.29 Repeat Figure 9.3 with different n and explain how this parameter affects perfor-
mance of the E-estimator.
9.2.30 Using Figure 9.3, find better parameters of the E-estimator.
9.2.31 Explain relation (9.2.14) and its implications.
9.2.32 Explain the principle of equivalence.
9.2.33 Explain the underlying experiment of Figure 9.4.
9.2.34 Consider the available realization of the process shown in Figure 9.4. Can you propose
a method for recovering the missed portion of the process?
9.2.35∗ Consider a filtering problem. Assume that, similarly to Figure 9.4, a portion of
the noisy signal is missed. Can you suggest a situation when an underlying signal may be
consistently estimated? Hint: Think about a parametric signal (like a linear regression) or
a seasonal component.
9.3.1 What is the definition of a stationary and second-order stationary time series?
9.3.2 Explain a classical decomposition model of a nonstationary time series. Present an
example.
9.3.3 What is the definition of a seasonal component?
9.3.4 What is the scale function?
9.3.5 Present and discuss several examples of time series with missed data.
9.3.6 Explain the idea of estimation of the trend.
9.3.7 Verify (9.3.4). Describe terms in the right side of (9.3.4).
9.3.8 What is the underlying idea of estimator (9.3.5)?
9.3.9 Explain (9.3.6). What are the assumptions needed for the validity of this inequality?
9.3.10 Verify (9.3.7).
9.3.11 Prove all equalities in (9.3.8).
9.3.12 Why do we need assumption (9.3.9) for evaluation of the variance of ηj ? What happens
if it does not hold?
9.3.13∗ What do relations (9.3.11) tell us about the estimator θ̃j ? Note that the estimator
is biased. Can an unbiased estimator decrease the rate of the mean squared error (MSE)
convergence?
9.3.14 Explain the idea of estimation of Fourier coefficients for the case of model (9.3.2)
with missing data.
9.3.15∗ Evaluate the mean and the variance of the estimator (9.3.12).
9.3.16∗ Explain how the sum, with the random number of addends on the left side of
(9.3.14), is replaced by a sum with a fixed number of addends. Then prove (9.3.14).
9.3.17∗ Verify inequality (9.3.15). What is the assumption sufficient for its validity?
9.3.18∗ Prove (9.3.16).
9.3.19∗ Estimator θ̂j , defined in (9.3.12), is studied under the assumption that the function
q(t) has a bounded derivative. Is it possible to relax this assumption and still prove that
E{(θ̂j − θj )2 } ≤ cn−1 ?
9.3.20∗ Verify (9.3.17). Note that here N is a random variable.
9.3.21∗ Verify every relation in (9.3.18). Pay attention to the fact that N is a random
variable.
9.3.22 Based on (9.3.19), what can be said about the mean squared error of the estimator
θ̂j ?
9.3.23 Explain the underlying idea of estimation of a seasonal component.
9.3.24 Explain formula (9.3.22).
9.3.25∗ How can a trend be separated from a seasonal component?
9.3.26 Explain formula (9.3.23).
9.3.27 Explain the simulation used in Figure 9.6.
9.3.28∗ Explain all steps in the analysis of nonstationary time series. Then repeat Figure
9.6 and present analysis of all ten diagrams.
9.4.1 Is the time series {Yt }, defined in (9.4.1), stationary or nonstationary?
9.4.2 Define and explain all components of the classical decomposition model (9.4.1).
9.4.3 Is there a difference, if any, between the trend and the seasonal component?
9.4.4 Present several examples of a time series (9.4.1) with a pronounced trend, seasonal
component and the scale.
9.4.5 Explain the model (9.4.2) of a nonstationary amplitude-modulated time series.
9.4.6 What is the underlying idea of estimation of the trend?
9.4.7 Explain the motivation behind estimator (9.4.4).
9.4.8∗ Evaluate the mean and the variance of estimator (9.4.4).
9.4.9 Does the statistic (9.4.5) use unavailable VZs for s > N ?
9.4.10 Suppose that P(Yt = 0) = 0. Show that in this case from an observation of Vt we
can conclude whether Yt is missed, that is, whether Ut = 0.
9.4.11∗ Formula (9.4.4) is a naïve numerical integration. Propose a more accurate formula
and then evaluate its mean and variance.
9.4.12 Verify (9.4.6).
9.4.13 Prove (9.4.7).
9.4.14 Explain the underlying idea of estimator (9.4.8).
9.4.15∗ Calculate the mean and variance of estimator (9.4.8).
9.4.16∗ Prove (9.4.9).
9.4.17∗ Evaluate the mean and variance of estimator (9.4.10).
9.4.18 Explain the underlying idea of estimator (9.4.10).
9.4.19∗ Propose another feasible estimator of λ and compare it with estimator (9.4.10).
9.4.20 Explain the simulation used in Figure 9.8.
9.4.21 Explain the difference between Diagrams 1 and 2 in Figure 9.7.
9.4.22∗ Explain how the E-estimator calculates the E-estimate of the trend in Diagram 3.
9.4.23 Explain how the detrended data in Diagram 4 are obtained. Hint: Pay attention to
missing data in Diagram 2.
9.4.24 How was the estimated period in Diagram 5 calculated?
9.4.25 Explain how the estimate of seasonal component is constructed. Repeat Figure 9.8
several times, get a wrong estimate of the seasonal component, and then explain why this
happened.
9.4.26 The E-estimate in Diagram 8 is poor: it does not resemble the underlying scale
function. Why did this happen?
9.4.27 Does the time series in Diagram 9 look stationary? Do you see any seasonal compo-
nent or trend?
9.4.28 Diagram 10 shows the estimated spectral density of the rescaled residuals. Does it
indicate a seasonal component?
9.4.29 Repeat Figure 9.8 several times using different periods for the seasonal component.
Then write a report on how the period affects estimation of the spectral density.
9.4.30 Propose better parameters for the E-estimator of the trend and scale function. Use
Figure 9.8 to check your suggestions.
9.4.31 Propose better parameters for the E-estimator of the spectral density. Use Figure
9.8 to check your suggestions.
9.4.32∗ Explain, both theoretically and using simulations, how parameter λ affects estima-
tion of the trend, scale and spectral density.
9.4.33∗ Consider a Poisson variable U . Calculate E{U 2 |(U > 0)}. Then explain how this
result can be used in estimation of the scale function in the decomposition of {Ut Yt }.
9.5.1 Explain a model of time series with nonstationary missing. Present several examples.
9.5.2 Why do we consider a model of time series on the unit time interval?
9.5.3 Explain formula (9.5.1) for the autocovariance function.
9.5.4∗ Evaluate the mean and variance of the sample autocovariance function (9.5.2).
9.5.5∗ Prove every equality in (9.5.3) and (9.5.4).
9.5.6∗ Is it possible to relax the assumption about bounded derivative of λ(t) and still have
(9.5.5)?
9.5.7 Verify (9.5.6).
9.5.8 Prove (9.5.7)
9.5.9∗ Is it possible to relax the assumption of bounded derivative of λ(t) for validity of
(9.5.7)?
9.5.10 Explain how function λ(t) may be estimated based on observations of the time series
{Ut Xt }.
9.5.11 Explain the estimator (9.5.10).
9.5.12∗ Suppose that the MISE of the regression E-estimator m̂(t) is known. Evaluate the
MISE of estimator λ̂(t) defined in (9.5.10).
9.5.13∗ Evaluate the mean and variance of estimator (9.5.11).
9.5.14∗ Evaluate the mean and variance of estimator (9.5.12).
9.5.15 Repeat Figure 9.9 with different sample sizes n. Based on the experiment, what is
the minimal sample size that may be recommended for a reliable estimation?
9.5.16 Using different parameters of an underlying ARMA(1,1) process in Figure 9.9, ex-
plore the problem of how these parameters affect estimation of the spectral density.
9.5.17∗ In Section 8.3 a stationary batch-Bernoulli missing mechanism was studied. Con-
sider a nonstationary batch-Bernoulli missing mechanism and propose a spectral density
estimator for an underlying time series {Xt }.
9.5.18∗ Explore the case of nonstationary Markov–Bernoulli missing mechanism. Hint:
Recall Section 8.3. Think about how many parameters are needed to define a stationary
Markov chain, then make them changing in time.
9.5.19∗ For the setting of the previous exercise, suggest E-estimator for the spectral density.
9.6.1 Give definition of the spectral density of a stationary time series.
9.6.2 Give definition of the autocovariance function of a stationary time series.
9.6.3∗ Prove (9.6.3) for a causal ARMA(1,1) process.
9.6.4∗ Verify (9.6.4).
9.6.5 Suppose that (9.6.5) holds. Explain what a typical realization of the corresponding
nonstationary ARMA(1,1) process will look like.
9.6.6 Present several practical situations when (9.6.5) occurs.
9.6.7 Why is (9.6.6) called a dynamic autocovariance?
9.6.8 Explain the underlying idea of the dynamic spectral density.
9.6.9 Is the dynamic spectral density a univariate or bivariate function?
9.6.10 What is the simulation used in Figure 9.10?
9.6.11 Explain the time series of observations in the top diagram in Figure 9.10. Does it
look like a stationary time series? Explain.
9.6.12 Explain how the periodogram in Figure 9.10 is calculated. What do its modes tell
us?
9.6.13 Explain the three curves in the bottom diagram in Figure 9.10.
9.6.14 Can changing parameters of the E-estimator, used in Figure 9.10, help us to realize
that the underlying time series is nonstationary?
9.6.15 Explain the diagrams in Figure 9.11.
9.6.16 What are γtX (j) shown in Figure 9.11?
9.6.17 Does Figure 9.11 alert us about nonstationarity of the time series?
9.6.18 Suppose that you would like to simulate a stationary time series using Figure 9.11.
How can this be done? Then what type of curves could be expected for γtX (j)?
9.6.19 Explain why (9.6.6) can be considered as a nonparametric regression. Hint: Use
(9.6.8).
9.6.20∗ Explain how E-estimator γ̂tX (j) may be constructed.
9.6.21∗ Propose an estimator of γtX (j), and then evaluate its mean and variance.
9.6.22∗ What may define the quality of estimation of a dynamic autocovariance function?
9.6.23∗ Propose a spectral density E-estimator for a nonstationary time series. Hint: Use
E-estimators of dynamic autocovariances.
9.6.24∗ At the end of Section 9.6, a general setting of a set of time series {Xt,τ } is presented.
Propose a feasible approach for a corresponding dynamic spectral density and how it may
be estimated.
9.7.1 What is the definition of a one-way table? Give several examples.
9.7.2 What is the definition of a two-way table? Give several examples.
9.7.3 What is a plausible definition of a three-way table? Give several examples.
9.7.4 It looks like Tables 9.1 and 9.2 contradict each other. Explain why it is possible that
they are based on the same data.
9.7.5 Suggest an example of Simpson’s paradox for performance of two mutual funds with
lurking variable being the allocation between bonds and stocks.
9.7.6 Explain the underlying simulation used in Figure 9.12.
9.7.7 Figures 9.12 and 9.13 are based on the same data and nonetheless they imply different
conclusions about the two schools. How is this possible?
9.7.8 Based on the simulation, what is the mean number of graduates with a major in
science?
9.7.9 Based on the simulation, what is the variance of the number of graduates with a
major in science?
9.7.10∗ In Figure 9.12, parameters a and b control the probability of graduates from schools
A and B being an engineer. Using both the theory and simulations, for what values of these
parameters may the Simpson’s paradox no longer be observed?
9.7.11 Explain how E-estimator of the conditional density is constructed.
9.7.12 In model (9.7.1) the GPA is considered as a continuous “time” variable in a stochastic
process. Does this make sense for a salary model?
9.7.13 Suggest a stochastic model for weekly returns of a mutual fund for the last ten years
with lurking variable being the asset allocation.
9.7.14∗ Explain how dependence between observations in model (9.7.1) affects estimation
of the linear regression and the conditional density.
9.8.1 Explain the idea of a sequential design in a controlled regression experiment.
9.8.2 Suppose that in a controlled regression the next predictor is generated according
to a density whose choice is based on previous observations. In this case, are available
observations dependent or independent?
9.8.3∗ Propose a Fourier estimator of θ_j := ∫_0^1 m(x)ϕ_j(x)dx whose mean squared error is
n^{−1} d(f^X, σ)[1 + o_j(1) + o_n(1)]. Hint: Recall our discussion in Section 2.3.
9.8.4 Show that the design density that minimizes d(f X , σ) is proportional to the scale
function σ(x).
9.8.5 Verify (9.8.4).
9.8.6 Prove inequality (9.8.5) using the functional form (1.3.33) of the Cauchy-Schwarz
inequality.
9.8.7∗ Prove the Cauchy-Schwarz inequality used in (9.8.5). Hint: Use the Cauchy inequality
2|ab| ≤ a2 + b2 and then think about choosing appropriate a and b.
9.8.8 Explain the underlying simulation used in the top diagram of Figure 9.15.
9.8.9 Explain the underlying simulation used in the bottom diagram of Figure 9.15.
9.8.10∗ Repeat Figure 9.15 a number of times and draw your own conclusion about the
opportunity of using an optimal design. Then explain all possible complications in implementing
the idea of an optimal design via a sequential estimator.
9.8.11 Repeat Figure 9.15 and notice that AISEs vary rather significantly from one exper-
iment to another. Explain the variation.
9.8.12 Find better parameters of the used E-estimator.
9.8.13 Using Figure 9.15 for other regression functions and sample sizes, draw your own
conclusion about the feasibility of using an optimal design.
9.8.14 Repeat Figure 9.16 a number of times and write a report about sensitivity of the
sequential design to the scale’s estimate.
9.8.15∗ Consider the problem of choosing the size k of a sample on the first stage. Asymp-
totically, should k be of the same order as n, or may it be of a smaller order than n?
9.8.16 Use Figure 9.16 and explore the effect of k on estimation. Hint: The size k of the
first stage is controlled by argument b.
9.10 Notes
There are a number of excellent books devoted to time series analysis like Anderson (1971),
Diggle (1990), Brockwell and Davis (1991), Fan and Yao (2003), Bloomfield (2004). Among
more recent ones that discuss nonstationary time series, let us mention Box et al. (2016),
De Gooijer (2017), and Tanaka (2017).
9.1 The book by Beran (1994) covers a number of topics devoted to dependent variables
including long-memory processes, see also Ibragimov and Linnik (1971) and Samorodnitsky
(2016). The books Dedecker et al. (2007) and Rio (2017) cover a wide spectrum of topics on
weak dependence. Hall and Hart (1990) established that dependent regression errors may
significantly slow down the MISE convergence for a fixed-design regression. The theory of
regression with dependent errors is discussed in Efromovich (1997c, 1999a, 1999c) and Yang
(2001). The books by Dryden and Mardia (1998) and Efromovich (1999a) discuss estimation
of shapes.
9.2 White noise, Brownian motion and filtering a signal are classical statistical topics
with a rich literature. See, for instance, books by Ibragimov and Khasminskii (1981), Mallat
(1998), Efromovich (2009a), Tsay (2005), Tsybakov (2009), Fan and Yao (2003, 2015), Del
Moral and Penev (2014), Pavliotis (2014), Durrett (2016) and Samorodnitsky (2016). The
book Dobrow (2016) uses R to present introduction to stochastic processes.
Asymptotic theory of efficient filtering a signal from white Gaussian noise was pioneered
by Pinsker (1980) for the case of a known class of signals, efficient adaptation was proposed
in Efromovich and Pinsker (1984, 1989), and multidimensional settings were considered in
Efromovich (1994b, 2000b). It is worthwhile to note that typically a sine-cosine basis is used
in a series expansion.
The principle of equivalence between the filtering model and other classical statistical models
was initiated by Brown and Low (1996) for regression and by Nussbaum (1996) for the
probability density. It is important to stress that the equivalence is based on some specific
assumptions, and this implies limits on its applications; see more in Efromovich and Samarov
(1996) and Efromovich (1999a, 2003a).
Missing data is discussed in Tsay (2005) and Box et al. (2016).
9.3 The decomposition model is discussed in a number of books, see Brockwell and Davis
(1991), Fan and Gijbels (1996), Efromovich (1999a), Fan and Yao (2003, 2015), Tsay (2005)
and Box et al. (2016). The typically recommended method of estimation of the trend is the
linear regression. A thorough discussion of Fourier estimators using the idea of approximation
of an integral by a Riemann sum can be found in Efromovich (1999a) and Efromovich and
Samarov (2000).
9.4 Amplitude-modulated time series are discussed in Efromovich (2014d) where further
references may be found. Wavelet and multiwavelet analysis may be of special interest
for the considered problem. See some relevant results in Efromovich (1999a, 2001c, 2004e,
2007b, 2009b, 2017), Efromovich et al. (2004), Efromovich et al. (2008), Efromovich and
Valdez-Jasso (2010), Efromovich and Smirnova (2014a,b), as well as the asymptotic theory
in the monograph Johnstone (2017).
9.5 Matsuda and Yajima (2009) consider a time series {At Xt } with missing observations
where the availability variables At are independent Bernoulli variables that are also
independent of the time series of interest {Xt}; the At are not identically distributed,
and P(At = 1) = w(t) is unknown. This model was suggested
for time series with irregularly spaced data. The case of a stationary Poisson amplitude-
modulation was studied in Vorotniskaya (2008), and Efromovich (2014d) proved efficiency
of the E-estimation approach.
9.6 Time-varying nonstationary processes are a popular topic due to numerous practical
applications; see a discussion in the books by Tsay (2005), Stoica and Moses (2005), Sandsten
(2016) and Tanaka (2017). One of the main ideas to deal with nonstationarity, complemen-
tary to the presented one, is to segment the period of observation, assume that
the process is stationary on each segment, and then use a smoothing procedure. This and
other interesting approaches may be found in Priestley (1965), Kitagawa and Akaike (1978),
Zurbenko (1991), Dahlhaus (1997), Adak (1998), and Rosen, Wood and Stoffer (2012). As
an example of application, see Chen et al. (2016) and Efromovich and Wu (2017).
9.7 Simpson’s paradox is an extreme example showing that observed relationships and
associations can be misleading when there are lurking variables; see a discussion in Moore,
McCabe and Craig (2009). Due to its confusing nature, it is an excellent pedagogical tool to
attract our attention to statistical data analysis, see Gou and Zhang (2017). It is a common
practice to use the paradox in conjunction with the explanation of multi-way tables. At the same
time, as we have seen from Figures 9.12–9.14, it is also useful for understanding regressions,
continuous processes and conditional densities. It is also important to stress that as these
figures indicate, the conditional density, and not a regression, is the ultimate description of
the relationship between the predictor and the response. As such, the conditional density
may uncover the “mystery” of Simpson’s paradox and may teach us a valuable lesson about
the practical value of the conditional density. A nice introduction to multi-way tables can
be found in Moore, McCabe and Craig (2009), to linear models in Kutner et al. (2005), and
to the theory of estimation of the conditional density in Efromovich (2007g).
9.8 Stein (1945) and Wald (1947, 1950) pioneered principles of sequential estimation,
and more good reading can be found in Prakasa Rao (1983), Wilks (1962), and Mukhopadhyay
and Solanky (1994). There are a number of good books that discuss controlled sampling
procedures, see Pukelsheim (1993), Thompson and Seber (1996), Arnab (2017), and Dean,
Voss and Draguljic (2017).
High-dimensional two-stage procedures are considered in Aoshima and Yata (2011).
Asymptotic issues of nonparametric sequential estimation are considered in Efromovich
(1989; 2004b,d; 2007d; 2008c; 2012b; 2015; 2017). The quickest detection problem is another
interesting topic; see Efromovich and Baron (2010).
Chapter 10
Ill-Posed Modifications
regression with missing responses and measurement errors in predictors when a complete
case approach may no longer be consistent.
While measurement errors, deconvolution and estimation of derivatives are probably
the better known examples of ill-posed settings, survival analysis has its own number of
interesting examples. In particular, we will consider a practically important current status
censoring (CSC) where the censoring affects the rate of the MISE convergence.
The content of the chapter is as follows. Density estimation with measurement errors,
which is a particular example of a deconvolution problem, is studied in the first three
sections. Section 10.1 considers the case of a random variable whose observations are con-
taminated by measurement errors. Section 10.2 makes the setting more complicated by
considering missing data with measurement errors. Section 10.3 adds another layer of mod-
ification via censoring. Let us note that the proposed deconvolution is based on estimation
of the characteristic function which is an important statistical problem on its own, and
this is the first time this function is discussed. The characteristic function, similar
to the cumulative distribution function, completely defines a random variable, and for a
random variable X it is defined as φX (t) := E{eitX } = E{cos(tX)} + iE{sin(tX)}, where
t ∈ (−∞, ∞) and i is the imaginary unit, that is i2 = −1. Current status censoring, also
referred to as case 1 interval censoring, is discussed in Section 10.4. Regression with mea-
surement errors in predictors is discussed in Sections 10.5 and 10.6. Estimation of derivatives
is discussed in Section 10.7.
and then, using (10.1.6), estimate the characteristic function of an underlying X by the
sample mean estimator

φ̃^X(t) := φ̂^Y(t)/φ^ε(t) = n^{−1} Σ_{l=1}^n e^{itY_l}/φ^ε(t). (10.1.8)
The only (and extremely serious) complication here is that in (10.1.8) φ^ε(t) is used in
the denominator. The following examples shed light on the complexity. The most common
measurement error is normal. If ε has a normal distribution with zero mean and variance
σ^2, then its characteristic function is φ^ε(t) = e^{−t^2σ^2/2}. Another often used distribution for
ε is Laplace (double exponential) with zero mean and variance 2b^2, and its characteristic
function is φ^ε(t) = [1 + b^2t^2]^{−1}. The fast decrease in the characteristic function of the
measurement error makes estimation of φ^X(t) for large t problematic and causes an acute
ill-posedness. Indeed, the good news is that the estimator (10.1.8) is unbiased, the bad news
is that its variance is proportional to |φ^ε(t)|^{−2}, namely

V(φ̃^X(t)) = E{|φ̂^Y(t) − φ^Y(t)|^2}/|φ^ε(t)|^2 = n^{−1}(1 − |φ^Y(t)|^2)/|φ^ε(t)|^2. (10.1.9)
Figure 10.1 Simulated hidden data and the same data contaminated by normal measurement errors
(the convolution). Hidden samples of size n = 100 are simulated according to the Uniform and
the Normal densities and are shown in the two columns. The same measurement errors are used
for both samples and the errors are generated according to a normal distribution with zero mean
and standard deviation σ. Samples are shown by histograms, underlying densities f X are shown by
solid lines. Note that each of the bottom diagrams shows the histogram of a sample from Y and the
underlying density f X (x), x ∈ [0, 1]. The latter is stressed by the horizontal axis label “Y, x”. {The
sample size is controlled by the argument n, the set of underlying densities by set.c, and σ by the
argument sigma.} [n = 100, set.c = c(1,2), sigma = 0.3]
We conclude that the variance may dramatically (exponentially for the case of a normally
distributed measurement error) increase in t.
These results shed light on the ill-posedness caused by measurement errors. According
to (10.1.6), for a large t a change in an underlying φX (t) may cause a relatively small change
in the observed φY (t), and this is what causes ill-posedness. Further, (10.1.9) shows that
the variance of φ̃X (t) may be dramatically larger than the variance of φ̂Y (t).
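To make the effect tangible, here is a minimal R sketch (not part of the book's software; all names and parameter values are illustrative) that simulates the convolution with Laplace(0, b) measurement errors, computes the sample mean estimator of φ^Y(t) and the ratio estimator (10.1.8), and prints the variance inflation factor |φ^ε(t)|^{−2} appearing in (10.1.9).

set.seed(1)
n <- 100; b <- 0.2                          # sample size and Laplace parameter (illustrative)
X <- runif(n)                               # hidden sample from the Uniform density on [0, 1]
eps <- rexp(n, 1/b) - rexp(n, 1/b)          # Laplace(0, b) errors as a difference of exponentials
Y <- X + eps                                # observed contaminated sample
phi.eps <- function(t) 1/(1 + b^2*t^2)      # known characteristic function of the Laplace error
phi.Y.hat <- function(t) mean(exp(1i*t*Y))          # sample mean estimator of phi^Y(t)
phi.X.tilde <- function(t) phi.Y.hat(t)/phi.eps(t)  # ratio estimator (10.1.8)
phi.X.tilde(pi)                                     # deconvolution estimate of phi^X(pi)
round(sapply(c(1, 5, 10), function(t) 1/abs(phi.eps(t))^2), 1)  # variance inflation in (10.1.9)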
Figure 10.1 allows us to visualize the effect of measurement errors. The left-top diagram
shows us a simulated sample from the Uniform density (the caption explains the simulation).
This is a direct sample, and an appropriate smoothing of the histogram should produce a
good estimate. The diagram below shows the same sample contaminated by independent
normal measurement errors with zero mean and standard deviation σ = 0.3. The histogram no
longer even remotely resembles the Uniform density shown by the solid line. Further,
over the interval of interest [0, 1], the histogram clearly indicates a decreasing density. Fur-
thermore, please look at the larger support of Y and at the data being asymmetric about 0.5,
and note that the distribution of Y is in fact symmetric about 0.5. Can you visualize the underlying
Uniform density in the histogram? The answer is “no.” Here only a rigorous statistical ap-
proach may help. The right column of diagrams shows a similar experiment only with the
underlying Normal density and the same measurement errors. Here the direct histogram (in
the right-top diagram) is good and we may visualize the underlying Normal density shown
by the solid line. On the other hand, the histogram of the same sample with measurement
errors does not resemble the Normal and it is clearly skewed and asymmetric about 0.5. We
may conclude that visualization is no longer a helpful step in the analysis of data modified
by measurement errors. The reader is advised to repeat this figure, use different arguments,
and get a feeling of the problem and its complexity.
Now we are in a position to explain how our E-estimator, described in Section 2.2 for the
case of direct observations of X, can be used for estimating an underlying density f X (x)
when data is modified by measurement errors. Recall that X is supported on [0, 1], and
then the density of interest can be written as
f^X(x) = 1 + 2^{1/2} Σ_{j=1}^∞ Re{φ^X(πj)} ϕ_j(x), x ∈ [0, 1], (10.1.10)
where ϕ_j(x) := 2^{1/2} cos(πjx) are elements of the cosine basis on [0, 1] and Re{z} is the
real part of a complex number z. Then, assuming that the distribution of the measurement
error is known, the characteristic function may be estimated by the sample mean estimator
(10.1.8). This characteristic function estimator yields the deconvolution E-estimator.
Furthermore, according to (10.1.4), the E-estimator simplifies and looks more familiar
if the measurement error is symmetric about zero. Indeed, in this case it is convenient to
rewrite (10.1.10) as
f^X(x) = 1 + Σ_{j=1}^∞ θ_j ϕ_j(x), (10.1.11)

where θ_j := ∫_0^1 f^X(x)ϕ_j(x)dx, and then use the sample mean Fourier estimator

θ̂_j := n^{−1} Σ_{l=1}^n ϕ_j(Y_l)/φ^ε(πj). (10.1.12)
Estimator (10.1.12) resembles the traditional Fourier estimator of Section 2.2, with one
pronounced difference. Here we divide by the quantity φε (πj) which may be small and close
to zero. The latter also complicates plugging in an estimate of the characteristic function.
The asymptotic theory shows that estimation on frequencies where the characteristic func-
tion is too small must be skipped. In particular, we may restrict our attention to frequencies
where |φε (πj)|2 > CH n−1 log(n), and if m is the sample size of an extra sample of measure-
ment errors used to estimate φε , then n is replaced by m. Here CH is a positive constant
which is used by the E-estimator.
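As a small illustration of these formulas (a sketch only, with illustrative constants CH and Jmax rather than the E-estimator's data-driven choices), the following R lines keep the frequencies passing the above restriction, compute the Fourier estimates (10.1.12), and plot the truncated expansion (10.1.11) for a Uniform X contaminated by Laplace errors.

set.seed(1)
n <- 100; b <- 0.2
Y <- runif(n) + (rexp(n, 1/b) - rexp(n, 1/b))      # contaminated sample Y = X + eps
phi.eps <- function(t) 1/(1 + b^2*t^2)             # known Laplace characteristic function
CH <- 0.1; Jmax <- 10                              # illustrative constants
usable <- (1:Jmax)[abs(phi.eps(pi*(1:Jmax)))^2 > CH*log(n)/n]   # frequencies kept
xx <- seq(0, 1, 0.01); f.hat <- rep(1, length(xx))
for (j in usable) {
  theta.j <- mean(sqrt(2)*cos(pi*j*Y))/phi.eps(pi*j)  # Fourier estimator (10.1.12)
  f.hat <- f.hat + theta.j*sqrt(2)*cos(pi*j*xx)       # truncated expansion (10.1.11)
}
plot(xx, pmax(f.hat, 0), type = "l", xlab = "x", ylab = "Density")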
Figure 10.2 illustrates performance of the E-estimator for the case of a known dis-
tribution of measurement errors. Here the distribution of measurement errors is Laplace
with parameter b = 0.2. The distribution is symmetric about zero, the density is f^ε(u) =
(2b)^{−1} exp(−|u|/b) and the characteristic function is φ^ε(t) = (1 + b^2t^2)^{−1}. Note that the
Laplace distribution may also be referred to as double exponential (a name sometimes also used
for the Gumbel distribution). The Laplace distribution has heavier tails than a Normal distribution
and a more slowly decreasing characteristic function.
Now let us look at particular simulations. The E-estimate for the case of direct observation
and the Normal density is almost perfect despite a rough histogram reflecting the underlying
sample from the Normal. The corresponding histogram for the data modified by measurement
errors (the left-bottom diagram) indicates a skewed and asymmetric sample. This
yields a far from perfect E-estimate. On the other hand, keeping in mind the underlying
data, the E-estimate correctly indicates the symmetric about 0.5 and bell-shaped density
as well as the almost perfect support. For the case of the Bimodal underlying distribution
(see the right column of diagrams) the E-estimates are worse, and this reflects the more
challenging shape of the density. We see in the right-top diagram that the histogram (and
hence the sample) does not indicate the shape of the Bimodal, and this is also reflected by
Figure 10.2 Performance of the density E-estimator for direct data and the deconvolution E-
estimator for data modified by Laplace measurement errors. Parameter of the Laplace distribution
is b = 0.2. Data are shown by histograms, underlying densities by solid lines and E-estimates by
dashed lines. {Argument CH controls parameter CH .} [n = 100, b = 0.2, set.c = c(2,3), cJ0 = 3,
cJ1 = 0.8, cTH = 4, CH = 0.1]
the E-estimate. At the same time, the E-estimate does indicate two modes with the only
caveat that the magnitude of the right one is too small. Measurement errors caused an
interesting modification of the sample from X that may be observed in the right-bottom
diagram. Surprisingly, measurement errors “corrected” the underlying sample in terms of
the relative magnitudes of the modes. On the other hand, the E-estimate shows sharper
modes and incorrectly indicates two strata. This is a teachable outcome because it shows
how the deconvolution performs. Finally, note how heavier (with respect to normal) tails of
Laplace distribution affect the modified data. The reader is advised to repeat Figures 10.1
and 10.2 and compare data modified by these two typical measurement errors.
If the distribution of measurement errors is unknown, then in general the deconvolution
is impossible. In what follows we are considering two possible scenarios that allow us to
overcome this complication. The first one is when an extra sample of measurement errors of
size m is available. Then the characteristic function φε may be estimated and used in place
of an unknown φε .
Figure 10.3 illustrates this situation. Simulations are similar to those in Figure 10.2.
We begin analysis of estimates with the left column corresponding to the Normal density
f X (x). The particular sample of direct observations, shown by the histogram in the left-
top diagram, is clearly skewed and asymmetric. The E-estimate mitigates this drawback of
the histogram but is still far from being good. The latter is also reflected by the large
ISE (compare with Figure 10.2). The extra sample of measurement errors, shown by the
histogram in the left-middle diagram, is also far from being perfect. It is used to estimate
the characteristic function φε (πj). The left-bottom diagram shows the modified data and
the plug-in density E-estimate which is relatively good. We observe a rare outcome when
Figure 10.3 Performance of the deconvolution E-estimator for the case of an unknown distribution
of measurement errors when an extra sample of size m of the errors is available. The underlying
distribution of errors is Laplace with parameter b = 0.2. Solid and dashed lines show underlying
densities and E-estimates, respectively. [n = 100, m = 50, b = 0.2, set.c = c(2,3), cJ0 = 3, cJ1 =
0.8, cTH = 4, CH = 0.1]
measurement errors improved the E-estimate. No such pleasant surprise for the Bimodal
density shown in the right column of diagrams. The unimodal shape of the plug-in E-
estimate for the modified data is typical because the Bimodal shape is too complicated
for the deconvolution and small sample sizes. At the same time, we may conclude that a
relatively small size of extra samples of measurement errors is feasible for our purposes. It
is highly advisable to repeat Figure 10.3 with different parameters and learn more about
this interesting problem and the case of small samples.
Another possibility to solve the deconvolution problem without knowing the distribution
of measurement errors is as follows. Consider the case of repeated observations when available
observations are
Ylk = Xl + εlk , l = 1, 2, . . . , n, k = 1, 2. (10.1.13)
Here εlk are iid observations (the sample of size 2n) from ε, and it is known that the
distribution of ε is symmetric about zero and its characteristic function is positive (note
that this is the case for the Normal, Laplace, Cauchy, their mixtures and a number of other
popular distributions).
To understand how the characteristic function φε (t) of the measurement error may be
estimated, let us note that for any independent and identically distributed εl1 and εl2 we
can write,
E{eit(Yl1 −Yl2 ) } = E{eit(εl1 −εl2 ) } = φε (t)φε (−t) = |φε (t)|2 . (10.1.14)
As a result, n^{−1} Σ_{l=1}^n e^{it(Y_{l1}−Y_{l2})} is an unbiased estimator of |φ^ε(t)|^2. Furthermore, if the
distribution of ε is symmetric about zero and its characteristic function is positive, then

φ̂^ε(t) := [n^{−1} Σ_{l=1}^n e^{it(Y_{l1}−Y_{l2})}]^{1/2} (10.1.15)
is a consistent (as well as asymptotically rate optimal) estimator of φε (t). This characteristic
function estimator yields the plug-in deconvolution E-estimator.
So far we have considered the case of a known finite support of X. In a general case of
an unknown support, similarly to the case of directly observed X, the density is estimated
over the range of observed sample from Y = X + ε.
We are finishing this section by presenting an interesting and practically important
case of directional (angular, circular) data where data are measured in the form of angles.
Such data may be found almost everywhere throughout science. Typical examples include
wind and ocean current directions, times of accident occurrence, and energy demand over
a period of 24 hours. It is customary to measure directions in radians with the range
[0, 2π) radians. In this case the mathematical procedure of translation of any value onto
this interval by modulo 2π (the shorthand notation is [mod 2π]) is useful. As an example,
5π[mod 2π] = 5π − 2(2π) = π, and −3.1π[mod 2π] = −3.1π + 4π = 0.9π. In words, you add
or subtract j2π (where j is an integer) to get a result in the range [0, 2π).
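In R the same reduction onto [0, 2π) is performed by the modulo operator, as a quick check of the two examples shows.

(5*pi) %% (2*pi)      # equals pi
(-3.1*pi) %% (2*pi)   # equals 0.9*pi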
The corresponding statistical setting is as follows. The data are n independent and
identically distributed realizations (so-called directions) Yl , l = 1, 2, . . . , n, of a circular
random variable Y that is defined by Y := (X + ε)[mod 2π] (or Y := (X[mod 2π] +
ε[mod 2π])[mod 2π]), where the random variable ε is independent of X. The variable ε is
referred to as the measurement error. The problem is to estimate the probability density
f X (x), 0 ≤ x < 2π, of the random variable X[mod 2π].
Before explaining the solution, several comments about circular random variables should
be made. Many examples of circular probability densities are obtained by wrapping a prob-
ability density defined on the line around the circumference of a circle of unit radius (or
similarly one may say that a continuous random variable on the line is wrapped around the
circumference). In this case, if Z is a continuous random variable on the line and X is the
corresponding wrapped random variable, then
X = Z[mod 2π], f^X(x) = Σ_{k=−∞}^∞ f^Z(x + 2πk). (10.1.16)
While the notion of a wrapped density is intuitively clear, the formulae are not simple.
For instance, a wrapped normal N (µ, σ 2 ) random variable has the circular density (obtained
after some nontrivial simplifications)
f^X(x) = (2π)^{−1}[1 + 2 Σ_{k=1}^∞ e^{−k^2σ^2/2} cos(k(x − µ))]. (10.1.17)
Fortunately, for the problem at hand these complications with wrapped densities are not
crucial because for the case of a wrapped distribution we get the following simple formulae
for the characteristic function
φ^X(j) = ∫_0^{2π} Σ_{k=−∞}^∞ f^Z(x + 2πk) e^{ijx} dx = φ^{Z[mod 2π]}(j) = φ^Z(j). (10.1.18)
Hence for independent X and ε formula (10.1.6) holds and, in its turn, this implies that
the proposed deconvolution E-estimator may be used for directional data modified by mea-
surement errors.
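As a side remark, the agreement between the wrapping sum (10.1.16) and the series (10.1.17) for a wrapped normal density is easy to check numerically; the R sketch below uses illustrative values of µ, σ and truncation limits.

mu <- 1; sigma <- 0.7; xgrid <- seq(0, 2*pi, length = 5)
wrap.sum <- sapply(xgrid, function(x) sum(dnorm(x + 2*pi*(-50:50), mu, sigma)))      # (10.1.16)
series <- sapply(xgrid, function(x)
  (1/(2*pi))*(1 + 2*sum(exp(-(1:50)^2*sigma^2/2)*cos((1:50)*(x - mu)))))             # (10.1.17)
round(cbind(xgrid, wrap.sum, series), 6)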
10.2 Density Deconvolution with Missing Data
Here we are considering the same model as in the previous section only when some obser-
vations may be missed. In a sense, this setting is a combination of models considered in
Sections 10.1, 4.1 and 5.1 only now a missing mechanism may be more complicated due to
a larger number of underlying random variables and the absence of direct observations of a
random variable of interest.
Let us formally describe the setting. There are two underlying (and hidden) samples of
size n from the continuous random variable of interest X and the continuous measurement
error ε. It is assumed that X is supported on [0, 1] and independent of ε. The element-wise
sum of these two samples Y1 := X1 + ε1 , . . . , Yn := Xn + εn , which may be considered as
a sample from Y := X + ε, is also hidden due to missing. The missing mechanism is
described by a Bernoulli random variable A, called the availability, which also generates a
sample A1 , . . . , An , and the availability likelihood is w(y) := P(A = 1|Y = y).
The missing resembles the classical MNAR (missing not at random) of Section 5.1 where
Y is the variable of interest. For instance, we may think that the missing occurs after an
observation of Y is generated, and then given Y the missing does not depend on hidden
variables X and ε. The obvious complication, with respect to the classical MNAR of Section
5.1, is that the variable of interest X is modified by the measurement error and hence a
special deconvolution is needed to recover the density f X .
If the availability likelihood w(y) is known, then for any t, such that |φε (t)| > 0, the
sample mean estimator of the characteristic function φX (t) of the random variable of interest
X is
φ̂^X(t) := n^{−1} Σ_{l=1}^n I(A_lY_l ≠ 0) e^{itA_lY_l}/[φ^ε(t)w(A_lY_l)]. (10.2.3)
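A minimal simulation sketch of the estimator (10.2.3), assuming that the availability likelihood w(y) is known; the particular w below, and the Laplace errors, are illustrative choices and not the ones used in Figure 10.4.

set.seed(3)
n <- 100; b <- 0.2
X <- runif(n); eps <- rexp(n, 1/b) - rexp(n, 1/b); Y <- X + eps
w <- function(y) pmin(0.9, pmax(0.3, 0.3 + 0.5*plogis(1 + 4*y)))   # illustrative availability likelihood
A <- rbinom(n, 1, w(Y))                                            # availability indicators
AY <- A*Y                                                          # available data (0 if missed)
phi.eps <- function(t) 1/(1 + b^2*t^2)
phi.X.hat <- function(t)
  mean((AY != 0)*exp(1i*t*AY)/(phi.eps(t)*w(AY)))                  # estimator (10.2.3)
phi.X.hat(pi)                                # estimate of phi^X(pi)
(exp(1i*pi) - 1)/(1i*pi)                     # characteristic function of Uniform(0,1) at t = pi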
Figure 10.4 Performance of the E-estimator for direct data, the data contaminated by Laplace
measurement errors, and M-sample where some observations with measurement errors are missed
according to the availability likelihood w(y). Samples are shown by histograms, underlying densities
and E-estimates are shown by solid and dashed lines, respectively. {The used availability likelihood
is w∗(y) := max(dwL , min(dwU , w(y))) where w(y) = 0.3 + 0.5exp{(1 + 4y)/(1 + exp(1 + 4y))},
measurement errors are Laplace with parameter b. Parameters dwL and dwU are controlled by
arguments dwL and dwU , function w(y) is defined by the string w.} [n = 100, set.c = c(2,4), b =
0.2, w = "0.3+0.5*exp((1+4*y)/(1+exp(1+4*y)))", dwL = 0.3, dwU = 0.9, cJ0 = 3, cJ1 = 0.8,
cTH = 4, cH = 0.1]
This is a new case for us because the missing is MNAR and it is defined by a hidden vari-
able whose observations are contaminated by measurement errors. This makes the missing
mechanism more complicated.
Our aim is to show that if the nuisance functions φε (πj) and w(x) are known, then
consistent estimation of the density of interest f X (x) is possible.
Let us explain the proposed solution. We begin with a preliminary calculation of the
expectation of the product AeitAY . Using the technique of conditional expectation we can
write,
E{AeitAY } = E{eitX eitε E{A|X, ε}} = E{w(X)eitX }φε (t). (10.2.7)
Here we used (10.2.6) and the independence of X and ε. In its turn (10.2.7) implies that
E{Ae^{itAY}} = φ^ε(t) ∫_0^1 [f^X(x)w(x)] e^{itx} dx. (10.2.8)
Equation (10.2.8) is the pivot that explains how the density of interest f X (x) may
be estimated. Recall the cosine basis ϕ0 (x) := 1, ϕj (x) = 21/2 cos(πjx), j = 1, 2, . . . on
[0, 1], set g(x) := f X (x)w(x) for the product of the density of interest and the availability
likelihood, and introduce Fourier coefficients of the function g(x),
κ_j := ∫_0^1 g(x)ϕ_j(x)dx. (10.2.9)
Then, according to (10.2.8) and the assumed symmetry about zero of the distribution of ε,
the sample mean Fourier estimator of κj is
κ̂_j := n^{−1} Σ_{l=1}^n I(A_lY_l ≠ 0) ϕ_j(A_lY_l)/φ^ε(πj). (10.2.10)
Figure 10.5 Deconvolution with missing data when the availability likelihood depends on X. Samples
are shown by histograms, underlying densities and E-estimates are shown by solid and dashed lines,
respectively. {The availability likelihood is w∗ (x) := max(dwL , min(dwU , w(x))) where w(x) = 0.3 +
0.5x, measurement errors are Laplace with parameter b. Parameters dwL and dwU are controlled
by arguments dwL and dwU , function w(x) is defined by the string w.} [n = 100, set.c = c(2,4),
b = 0.2, w = "0.3+0.5*x", dwL = 0.3, dwU = 0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4, cH = 0.1]
just by chance, the interested reader may repeat these figures and check this. At the same
time, it is a nice exercise in probability to calculate the likelihood of this event.
Case 4. Availability likelihood depends on ε. This is the case when the missing is defined by
the value of the measurement error, and the availability likelihood is
This setting is similar to the one considered in Section 5.3, and given |φε (t)| > 0 we may
introduce the following sample mean estimator of the characteristic function of X,
φ̂^X(t) = n^{−1} Σ_{l=1}^n [I(A_lY_l ≠ 0) e^{itA_lY_l}/w(Z_l)] / φ^ε(t). (10.2.17)
Let us check that this estimator is unbiased. Using independence of X and ε, together with
(10.2.16) and the rule of calculation of the expectation via conditional expectation, we can
write,
E{φ̂^X(t)} = E{E{Ae^{itAY}[w(Z)φ^ε(t)]^{−1}|X, ε, Z}}
= E{e^{itX}e^{itε}[w(Z)φ^ε(t)]^{−1}E{A|X, ε, Z}}
= E{e^{itX}e^{itε}[w(Z)φ^ε(t)]^{−1}w(Z)}
= E{e^{itX}}E{e^{itε}}/φ^ε(t) = φ^X(t). (10.2.18)
We conclude that the proposed estimator of the characteristic function is unbiased.
In its turn, the characteristic function estimator (10.2.17) yields the deconvolution esti-
mator fˆX (x), x ∈ [0, 1] of Section 10.1.
If the function w(z) is unknown, then we can estimate it from the sample (Z_1, I(A_1Y_1 ≠
0)), . . . , (Z_n, I(A_nY_n ≠ 0)). Indeed, E{I(AY ≠ 0)|Z = z} = w(z),
and hence we can use the Bernoulli regression estimator ŵ(z) of Section 2.4.
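The Bernoulli regression E-estimator of Section 2.4 is not reproduced here; purely as an illustration, a logistic regression of the availability indicator on Z may serve as a simple stand-in estimate of w(z). The model for Z follows the caption of Figure 10.6, while the function w below is an illustrative choice; note that I(A_lY_l ≠ 0) equals A_l with probability one because Y is continuous.

set.seed(4)
n <- 100
X <- runif(n); Z <- 0.2 + 0.4*X + rnorm(n)                        # auxiliary variable as in Figure 10.6
w <- function(z) pmin(0.9, pmax(0.3, 0.3 + 0.5*plogis(1 + 4*z)))  # illustrative availability likelihood
A <- rbinom(n, 1, w(Z))                                           # availability driven by Z
fit <- glm(A ~ Z, family = binomial)                              # stand-in Bernoulli regression
w.hat <- function(z) predict(fit, data.frame(Z = z), type = "response")
round(cbind(z = c(-1, 0, 1), w = w(c(-1, 0, 1)), w.hat = w.hat(c(-1, 0, 1))), 2)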
Figure 10.6 allows us to understand the setting and test performance of the proposed
deconvolution E-estimator fˆX (x). The experiment is explained in the caption. We begin
with the left column of diagrams where the case of the Normal density f X is considered. For
the hidden sample from X, shown in the top diagram, the E-estimator does a very good job.
The middle diagram shows by circles the scattergram for the Bernoulli regression of A on
Z, triangles show us values of w(Zl ) for Zl corresponding to Al = 1. Note that according to
(10.2.17) only for these values we need to know the availability likelihood. The E-estimate
is not perfect, and we know from Section 2.4 that Bernoulli regression is a complicated
problem for small sample sizes. Indeed, the right tail of the estimate is wrong, but note
that it does correspond to the scattergram. The bottom diagram shows us an E-estimate
Figure 10.6 Deconvolution with missing data when the availability likelihood depends on auxiliary
variable Z. In the experiment Z = β0 + β1 X + ση with β0 = 0.2, β1 = 0.4, σ = 1, and η being
standard normal random variable and independent of X and ε. Measurement errors are Laplace
with parameter b = 0.2. Solid and dashed lines show underlying densities of interest and their E-
estimates, respectively. In the middle diagrams observations of the pair (Z, A) are shown by circles,
and for Z corresponding to A = 1 the likelihood function and its E-estimate are shown by triangles
and crosses, respectively. {The availability likelihood is w∗ (z) := max(dwL , min(dwU , w(z))) where
w(z) = 0.3+0.5exp((1+4z)/(1+exp(1+4z))). Parameters dwL and dwU are controlled by arguments
dwL and dwU , function w(z) is controlled by the string w, set.beta = (β0 , β1 ).} [n = 100, set.c =
c(2,4), b = 0.2, set.beta = c(0.2,0.4), sigma = 1, w = "0.3+0.5*exp((1+4*z)/(1+exp(1+4*z)))",
dwL = 0.3, dwU = 0.3, cJ0 = 3, cJ1 = 0.8, cTH = 4, cH = 0.1]
of f X (x), x ∈ [0, 1]. The estimate is good keeping in mind complexity of the deconvolution
problem and when the estimate of the availability likelihood is far from perfect. As before,
the E-estimate shows a smaller support but otherwise its shape is very good. Note how
the underlying data, shown in the top diagram, is skewed to the right by the missing, and
nonetheless the E-estimate correctly indicates a symmetric about 0.5 density.
The case of the underlying density Strata is considered in the right column of diagrams.
The E-estimate for the hidden sample from X, shown in the top diagram, is reasonable for
this sample size. The E-estimate of the availability likelihood, shown in the middle diagram
by crosses, is far from being perfect but it does reflect the data at hand. The deconvolution
E-estimate of the density of interest, shown in the bottom diagram, correctly shows two
strata but shifts the modes. The estimate does not look attractive, but if we carefully look
at the histogram, which presents available (not missed) observations of Y , and note its size
N = 61, it becomes clear that the deconvolution E-estimator does a good job in recovering
the Strata. After all, we are dealing with a complicated ill-posed deconvolution problem
with missing data, and a relatively poor estimation must be expected.
It is prudent to repeat Figures 10.4–10.6 with different parameters and get used
to the deconvolution problem with missing data.
where ϕj (x) = 21/2 cos(πjx) are elements of the cosine basis on [0, 1]. Because the charac-
teristic function φε (t) of the measurement error is a known real function and φε (πj) 6= 0,
using (10.1.6) we can continue (10.3.1) and get,
θ_j = E{ϕ_j(Y)}/φ^ε(πj). (10.3.2)
Now we need to understand how the expectation in (10.3.2) can be estimated based on
a sample from (V, ∆). We note that

f^{V,∆}(x, 1) = f^Y(x)G^C(x), (10.3.3)

where G^C(x) is the survival function of the censoring variable C. Suppose that βC ≥ βY
(recall that βZ denotes the upper bound of the support of Z). This allows us to continue
(10.3.2) and write
θ_j = E{∆ϕ_j(V)/G^C(V)}/φ^ε(πj). (10.3.4)
It was explained in Section 6.5 that GC (v) may be estimated by
Ĝ^C(v) := exp{−[Σ_{l=1}^n (1 − ∆_l)I(V_l ≤ v)] / [Σ_{l=1}^n I(V_l ≥ v)]}. (10.3.5)
Figure 10.7 Density estimation for the case when observations are first contaminated by measure-
ment errors and then right censored. In the simulation the variable of interest X is the Normal,
measurement error ε is Laplace with parameter b = 0.2, and C is Uniform(0, UC ) with UC = 1.5.
Diagram 1 shows the histogram of a simulated hidden sample from the random variable of interest
X, the solid and dashed lines exhibit the underlying Normal density and its E-estimate. Diagram 2
shows the histogram of the hidden sample contaminated by Laplace measurement errors. The solid
and dashed lines exhibit the underlying Normal density and the deconvolution E-estimate. Diagram
shows the observed censored data which is a sample from (V, ∆). Here N = Σ_{l=1}^n ∆_l is the num-
ber of uncensored observations of Y := X + ε. Diagram 4 exhibits the underlying density (the solid
line), the E-estimate (the dashed line), (1 − α) pointwise (the dotted lines) and simultaneous (the
dot-dashed lines) confidence bands. {Use cens = "Expon" for an exponential censoring variable whose
mean is controlled by argument lambdaC. Argument alpha controls α.} [n = 100, corn = 2, b =
0.2, cens = "Unif", uC = 1.5, lambdaC = 1.5, alpha = 0.05, c = 1, cJ0 = 3, cJ1 = 0.8, cTH =
4]
Combining the results, we may propose the following plug-in sample mean estimator of
θ_j,

θ̂_j = [n^{−1} Σ_{l=1}^n ∆_l ϕ_j(V_l)/max(Ĝ^C(V_l), c/ln(n))] / φ^ε(πj). (10.3.6)
The Fourier estimator (10.3.6) yields the density E-estimator fˆX (x). Further, let us
recall that, according to Section 2.6, E-estimators allow us to visualize their pointwise and
simultaneous confidence bands.
Figure 10.7 illustrates the setting and the estimation procedure (see comments in the
caption). Diagram 1 shows the underlying sample from X. It is atypical for the Normal, and
this is reflected in the E-estimate and relatively large ISE. Then this sample is modified by
Laplace measurement errors. Diagram 2 shows the modified sample Y1 = X1 + ε1 , . . . , Yn =
Xn + εn . Note that the sample is heavily skewed and also check the larger support. The E-
estimate does a good job in deconvolution of f X . We see the symmetric about 0.5 E-estimate
(the dashed line) and there is only a relatively small increase, for the ill-posed problem, in
the ISE with respect to the case of direct observations. Then the sample, contaminated
by measurement errors, is right censored by a random variable C with uniform on [0, 1.5]
distribution. The available observations of the pair (V, ∆) are shown in Diagram 3. As we
see, only N = 66 observations of Y are uncensored.
Because the support [0, 1.5] of the censoring variable C is smaller than the support
(−∞, ∞) of Y , we know from Section 6.5 that consistent estimation of the survival function
GC is possible. On the other hand, consistent estimation of the density f Y is impossible
because there are no observations of Y larger than 1.5. Nonetheless, let us continue our
analysis of the censored data shown in Diagram 3. As we see, only N = 66 from n = 100
of the underlying observations of Y are available in the censored sample. This is the bad
news. The good news is that, as we know from Diagram 2, a majority of observations of Y
are smaller than βC = 1.5, and this tells us that the E-estimate of f X may be relatively
good (under the circumstances). And indeed, Diagram 4 exhibits a symmetric about 0.5 and
bell-shaped E-estimate whose ISE is comparable in order with the ones for the case of the
convoluted and directly observed data. Furthermore, the confidence bands perform well.
Figure 10.7 also allows us to simulate experiments with exponential censoring variable
C, and it is highly advisable to analyze this case, as well as to repeat Figure 10.7 with
different parameters. The simulations and their analysis will allow the reader to gain first-
hand experience in dealing with this complicated statistical problem when underlying data
are subject to the two-fold modification by measurement errors and censoring.
and

f^{∆Z,∆}(0, 0) = P(∆ = 0) = P(X > Z) = 1 − ∫_{−∞}^{∞} f^Z(z)F^X(z)dz. (10.4.6)
It immediately follows from (10.4.5) and (10.4.6) that no consistent estimation of the distri-
bution of X is possible unless f Z (z) is known. Indeed, note that the right sides of (10.4.5)
and (10.4.6) depend solely on the product f Z (z)F X (z). As a result, based on missed CSC
data we may estimate only the product f Z (z)F X (z). This implies that to estimate F X , one
needs to know f Z . Of course, if no missing occurs then f Z may be estimated based on n
direct observations of Z. On the learning side of the story, we now have an example when
MAR implies destructive missing.
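A quick Monte Carlo check of (10.4.6) may help to digest the formula; the Beta(2,2) distribution of X and the Uniform(0,1) monitoring variable Z used below are illustrative choices.

set.seed(7)
n <- 10^5
X <- rbeta(n, 2, 2); Z <- runif(n)
mean(X > Z)                                                     # empirical P(Delta = 0)
1 - integrate(function(z) dunif(z)*pbeta(z, 2, 2), 0, 1)$value  # right side of (10.4.6)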
Now, after all these preliminary remarks, we are ready to propose an E-estimator for
the density f X (x), x ∈ [0, 1] based on the above-described sample and known density
f Z (z) of the monitoring variable. Further, let us assume that the monitoring variable is
also supported on [0, 1] and f Z (z) ≥ c∗ > 0, z ∈ [0, 1] and leave the case of a larger
support as an exercise. Following the E-estimation methodology, we are using the cosine
basis {1, ϕ_j(x) := 2^{1/2} cos(πjx), j = 1, 2, . . .} on [0, 1] and write,

f^X(x) = 1 + Σ_{j=1}^∞ θ_j ϕ_j(x), x ∈ [0, 1]. (10.4.7)
The Fourier estimator is unbiased, and it yields the density E-estimator f˜X (x), x ∈ [0, 1].
The asymptotic theory asserts that no other estimator can improve the rate of the MISE
convergence achieved by the E-estimator. But the asymptotic theory also tells us that the
problem is ill-posed and the MISE convergence slows down with respect to the case of direct
observations of X.
Let us present two explanations of why the CSC modification is ill-posed. The first one is
to directly evaluate the variance of the sample mean estimator (10.4.9) and get the following
asymptotic expression,
V(θ̃_j) = n^{−1}(πj)^2 ∫_0^1 [F^X(z)/f^Z(z)] dz [1 + o_j(1)]. (10.4.10)
Recall that os (1) denotes a generic sequence in s that tends to zero as s increases. Formula
(10.4.10) shows that the variance increases as j increases, and this is what slows down the
MISE convergence (compare with the discussion in Section 10.1). At the same time, for small
sample sizes, when the E-estimator uses only a few first Fourier coefficients, that increase
in the variances may not be too large and we still may get relatively fair E-estimates (and
recall that this is always the hope with ill-posed problems).
Another way to realize the ill-posedness is via direct analysis of our probability formulas.
First of all, according to (10.4.5), the density of available observations is proportional to
the CDF F X (x). In other words, the likelihood of a particular observation is defined by the
CDF. On the other hand, the estimand of interest is not the CDF but the density f X (x).
Let us check how a change in the density of interest affects the CDF and correspondingly the density of available observations. Set $\kappa_j := \int_0^1 [2^{1/2}\sin(\pi j x)] F^X(x)\,dx$ for the $j$th Fourier coefficient of the CDF $F^X(x)$ for the sine basis on $[0, 1]$. Then, following (10.4.7), introduce a new probability density $f'(x) := 1 + \sum_{s=1}^{\infty} \theta'_s \varphi_s(x)$ with $\theta'_s := \theta_s$, $s \ne j$, and $\theta'_j := \theta_j + \gamma$. Note that the difference between the densities $f^X(x)$, defined in (10.4.7), and $f'(x)$ is only in their $j$th Fourier coefficients. Then the CDF $F'(x) := \int_0^x f'(u)\,du$ has Fourier coefficients $\kappa'_s = \kappa_s$, $s \ne j$, and $\kappa'_j = \kappa_j + \gamma/(\pi j)$. As a result, the effect of changing the estimand (here
f X (x)) on the density (10.4.5) of observations vanishes as j increases. The latter is exactly
what ill-posedness is about when observations (data) become less sensitive to changes in
the estimand. Note that no such phenomenon occurs for the case of direct observations of
X when the density of observations f X (x) and the estimand f X (x) coincide.
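This calculation is easy to check numerically. The R lines below (an illustrative verification, not part of the book's software) perturb the jth cosine coefficient of a density by a value gam and confirm that the jth sine Fourier coefficient of the corresponding CDF changes by gam/(pi j); the particular j and gam are arbitrary.

# Numerical check: a change gam in theta_j moves kappa_j by gam / (pi j)
j <- 4; gam <- 0.3
inner <- function(t) integrate(function(u) gam * sqrt(2) * cos(pi * j * u), 0, t)$value   # change in the CDF
shift <- integrate(function(x) sqrt(2) * sin(pi * j * x) * sapply(x, inner), 0, 1)$value  # its jth sine coefficient
c(numerical = shift, theory = gam / (pi * j))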
Now let us consider another model of missing CSC data when Z is observed only when ∆ = 0, that is when X > Z. In this case we observe a sample $((1-\Delta_1)Z_1, \Delta_1), \ldots, ((1-\Delta_n)Z_n, \Delta_n)$ from the pair $((1-\Delta)Z, \Delta)$, and, similarly to (10.4.5),
$$f^{(1-\Delta)Z,\Delta}(z, 0) = f^Z(z)[1 - F^X(z)]. \qquad (10.4.11)$$
Let us explain how to construct a sample mean estimator of Fourier coefficients θj of f X (x)
in the expansion (10.4.7). Note that the two top lines in (10.4.8) are still valid, and we can
write using (10.4.11) for j ≥ 1,
$$\theta_j = 2^{1/2}\cos(\pi j) + (2^{1/2}\pi j)\int_0^1 \sin(\pi j x)\Big[1 - \frac{f^{(1-\Delta)Z,\Delta}(x, 0)}{f^Z(x)}\Big]dx$$
$$= 2^{1/2}\cos(\pi j) + (2^{1/2}\pi j)\int_0^1 \sin(\pi j x)\,dx - (2^{1/2}\pi j)E\Big\{(1-\Delta)\frac{\sin(\pi j(1-\Delta)Z)}{f^Z((1-\Delta)Z)}\Big\}$$
$$= 2^{1/2} - (2^{1/2}\pi j)E\Big\{(1-\Delta)\frac{\sin(\pi j(1-\Delta)Z)}{f^Z((1-\Delta)Z)}\Big\}. \qquad (10.4.12)$$
In the last line we used $\int_0^1 \sin(\pi j x)\,dx = [-\cos(\pi j) + 1]/(\pi j)$.
Relation (10.4.12) implies the following sample mean estimator of Fourier coefficients,
$$\check\theta_j := 2^{1/2} - n^{-1}(2^{1/2}\pi j)\sum_{l=1}^{n}(1-\Delta_l)\frac{\sin(\pi j(1-\Delta_l)Z_l)}{f^Z((1-\Delta_l)Z_l)}. \qquad (10.4.13)$$
This Fourier estimator is unbiased and it yields the corresponding density E-estimator
fˇX (x), x ∈ [0, 1].
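To make the construction concrete, here is a minimal R simulation sketch (in the spirit of the book's R examples, but not its packaged software) that generates missing CSC data in which only the Zl with ∆l = 0 are observed and evaluates the sample mean estimator (10.4.13); the Beta(2,2) density of X, the Uniform monitoring density and the sample size are illustrative assumptions.

# Missing CSC data where only Z_l with Delta_l = 0 are observed; f^Z is known (Uniform on [0, 1])
set.seed(1)
n <- 5000
X <- rbeta(n, 2, 2)                  # hypothetical density of interest on [0, 1]
Z <- runif(n)                        # monitoring variable with f^Z(z) = 1
Delta <- as.numeric(X <= Z)          # current status indicator
fZ <- function(z) rep(1, length(z))  # known monitoring density
# Sample mean Fourier estimator (10.4.13) of theta_j for the cosine basis on [0, 1]
theta.check <- function(j) {
  sqrt(2) - mean((1 - Delta) * (sqrt(2) * pi * j) *
                   sin(pi * j * (1 - Delta) * Z) / fZ((1 - Delta) * Z))
}
# True Fourier coefficients of f^X for comparison
theta.true <- function(j) {
  integrate(function(x) sqrt(2) * cos(pi * j * x) * dbeta(x, 2, 2), 0, 1)$value
}
rbind(estimate = sapply(1:3, theta.check), truth = sapply(1:3, theta.true))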
The third considered model is the classical CSC. We observe a sample (Z1 , ∆1 ), . . . ,
(Zn , ∆n ) from (Z, ∆) (recall that ∆ := I(X ≤ Z)) with the joint density (10.4.1). Here we
no longer need to know density f Z (z) of the monitoring variable Z because we can use the
E-estimator fˆZ (z) of Section 2.2 based on the direct observations Z1 , . . . , Zn . Then we may
plug this estimator in (10.4.9) and (10.4.13) in place of f Z (z) and get two estimators of
f X (x). And this leads us to a new statistical problem of aggregation of the two estimators.
Aggregation of several estimators into a better one is a classical topic in statistics, and
for us it is a right time and place to explore it. We begin with a parametric estimation
problem. Suppose that there are two unbiased and independent estimators $\tilde\gamma_1$ and $\tilde\gamma_2$ of parameter $\gamma$ whose variances are $\sigma_1^2$ and $\sigma_2^2$, respectively. We are interested in finding an aggregation
$$\tilde\gamma(\lambda) := \lambda\tilde\gamma_1 + (1-\lambda)\tilde\gamma_2, \quad \lambda \in [0, 1], \qquad (10.4.14)$$
of the two estimators that minimizes the variance $V(\tilde\gamma(\lambda))$. To solve the problem, we first find an explicit expression for the variance,
$$V(\tilde\gamma(\lambda)) = \lambda^2\sigma_1^2 + (1-\lambda)^2\sigma_2^2. \qquad (10.4.15)$$
The variance is minimized by the weight
$$\lambda^* := \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}. \qquad (10.4.16)$$
Using this $\lambda^*$ in (10.4.14) and (10.4.15) we find that the optimal aggregated estimator is
$$\hat\gamma := \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\,\tilde\gamma_1 + \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\,\tilde\gamma_2, \qquad (10.4.17)$$
and
$$V(\hat\gamma) = \frac{\sigma_1^2\sigma_2^2}{\sigma_1^2 + \sigma_2^2} < \min(\sigma_1^2, \sigma_2^2). \qquad (10.4.18)$$
The inequality in (10.4.18) sheds light on performance of the aggregation.
Importance of this unbiased aggregation procedure for parametric statistics is explained
by the fact that it is optimal (no other unbiased estimator can have a smaller variance)
whenever γ is the mean of a normal random variable and we need to combine sample mean
estimators based on two independent samples.
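The following toy R sketch checks (10.4.16)–(10.4.18) by simulation for two independent unbiased normal estimators; the particular variances and the estimand are arbitrary assumptions.

# Toy check of the optimal aggregation (10.4.17)-(10.4.18)
set.seed(2)
gam <- 1                              # estimand
sigma1 <- 2; sigma2 <- 1              # standard deviations of the two estimators
nsim <- 10^5
g1 <- rnorm(nsim, gam, sigma1)        # unbiased estimator 1
g2 <- rnorm(nsim, gam, sigma2)        # independent unbiased estimator 2
lambda <- sigma2^2 / (sigma1^2 + sigma2^2)     # optimal weight (10.4.16)
g.hat <- lambda * g1 + (1 - lambda) * g2       # aggregated estimator (10.4.17)
c(var1 = var(g1), var2 = var(g2), var.aggregated = var(g.hat),
  theory = sigma1^2 * sigma2^2 / (sigma1^2 + sigma2^2))   # right side of (10.4.18)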
Formula (10.4.17) gives us a tool for aggregation of Fourier estimators θ̃j and θ̌j for the
considered CSC model. The aggregated Fourier estimator θ̂j yields the density E-estimator
fˆX (x), x ∈ [0, 1].
Now let us look at simulations and test performance of the proposed E-estimators. We
begin with the case of CSC data illustrated in Figure 10.8. Its top diagram shows us a sample
of size n = 100 from the Normal variable X. This is the classical case of direct observations
and it will serve us as a reference. The lines are explained in the caption. As we see, the
E-estimate nicely smooths the histogram. At the same time, note that the relatively small
sample size n = 100 precludes us from having observations near the boundary points. The
empirical integrated squared error (ISE) of the E-estimate is equal to 0.0017, and it will be
our benchmark. Also, the diagram shows the underlying cumulative distribution function
F X (x) and the empirical cumulative distribution function (ECDF)
$$\hat F^X(x) := n^{-1}\sum_{l=1}^{n} I(X_l \le x). \qquad (10.4.19)$$
Note that the ECDF is a classical sample mean estimator because $F^X(x) := P(X \le x) = E\{I(X \le x)\}$.
Figure 10.8 Density estimation based on a direct sample from X and its current status censored
(CSC) modification. Distributions of independent X and Z are the Normal and the Uniform. The
top diagram presents direct observations via the histogram, the solid and dashed lines show the
underlying density and its E-estimate, and the dotted and dot-dashed lines show the underlying
cumulative distribution function and the empirical cumulative distribution function (ECDF). ISE
and ISEF are the integrated squared errors for the E-estimate and the ECDF. The bottom diagram
shows CSC data. Here the histogram shows the sample from Z, and the circles show pairs (Zl , ∆l ),
l = 1, 2, . . . , n. The solid and dashed lines are the underlying density of X and its E-estimate, the
ISE is shown in the title. The dotted line shows E-estimate of the density f Z (z). [n = 100, corn =
2, cJ0 = 3, cJ1 = 0.8, cTH = 4]
This immediately implies that the estimator is unbiased and $V(\hat F^X(x)) = n^{-1}F^X(x)(1 - F^X(x))$. For the particular simulation the integrated squared error of the ECDF is ISEF = 0.00036. Note that the ISEF is an order of magnitude smaller than the ISE, and this
reflects the fact that estimation of the cumulative distribution function is possible with the
parametric rate n−1 while estimation of the density is a nonparametric problem with a
slower rate. The latter also explains why historically density estimation was considered an ill-posed problem.
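For readers who wish to see the parametric rate numerically, here is a short R sketch (with an illustrative Uniform distribution) that compares the Monte Carlo variance of the ECDF (10.4.19) at a point with $n^{-1}F^X(x)(1 - F^X(x))$.

# ECDF (10.4.19) at a fixed point x0: Monte Carlo variance versus n^{-1} F(x0)(1 - F(x0))
set.seed(3)
n <- 100; x0 <- 0.3; nsim <- 10^4
Fhat <- replicate(nsim, mean(runif(n) <= x0))   # ECDF at x0 is a sample mean of indicators
c(mc.mean = mean(Fhat), true.F = x0, mc.var = var(Fhat), theory = x0 * (1 - x0) / n)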
The bottom diagram in Figure 10.8 is devoted to the current status censoring (CSC).
The data are created by the above-presented sample from X and an independent sam-
ple from the Uniform Z shown by the histogram. The circles show the observed sample
(Z1 , ∆1 ), . . . , (Zn , ∆n ) where ∆l = I(Xl ≤ Zl ). The E-estimate of the density f Z (the dot-
ted line) is perfect, and this may lead to a good E-estimate of the density of interest f X .
And indeed, its E-estimate (the dashed line) is very respectable; it correctly indicates the
symmetric and unimodal shape of the Normal. Of course, the tails are shown incorrectly
and indicate a smaller support, but this is also what we have seen in the histogram for
direct observations. Note that the E-estimator knows only the scattergram of circles, and
from these data it reconstructs the density. Also, look at the ISE, which is an order of magnitude larger than
the benchmark. This is what defines the ill-posedness of the current status censoring. At
the same time, the interested reader can repeat Figure 10.8 and find simulations where the
E-estimate based on CSC data is better than the one based on direct observations. These
are rare but realistic outcomes for small samples, and the likelihood of such an outcome
diminishes for larger sample sizes.
Figure 10.9 is devoted to the case of missing CSC data. Here, to have as a reference
the hidden underlying data, we use the sample from (X, Z) shown in Figure 10.8. The top
diagram is devoted to the case when only observations Zl satisfying Xl ≤ Zl are available,
that is when ∆l = 1. If we return for a moment to the bottom diagram in Figure 10.8, then
the circles corresponding to ∆l = 1 are observed and others are missed. In Figure 10.9 the
histogram of available observations of Z clearly indicates that only larger observations of
Z are present. Nonetheless, the E-estimate of f X (x) is perfectly symmetric, unimodal and
its ISE is good. It is truly impressive how, based on the right-skewed observations of the
monitoring variable Z, the E-estimator has recovered the correct shape of the underlying
Normal. The diagram also shows us the underlying CDF F X (x) of the Normal and its
estimate by the dotted and dot-dashed lines, respectively.
Let us explain how the CDF estimator is constructed because it is of interest on its own.
According to (10.4.5),
$$F^X(z) = \frac{f^{\Delta Z,\Delta}(z, 1)}{f^Z(z)} \quad \text{whenever } f^Z(z) > 0. \qquad (10.4.20)$$
Note that the mixed density f ∆Z,∆ (z, 1) = f ∆Z|∆ (z|1)P(∆ = 1) can be estimated by the
E-estimator and f Z (z) is assumed to be known for the case of missing data. This yields a
naïve ratio CDF E-estimator. Note that accuracy of estimation of the CDF is defined by
the nonparametric accuracy of estimation of the density f ∆Z|∆ (z|1), and hence the MSE
converges with a slower rate than the classical n−1 . The asymptotic theory asserts that no
other estimator can improve the MSE convergence and that the proposed estimator is rate-
optimal. The same conclusion holds for the case of CSC data. This yields that estimation of
the CDF, based on CSC data, is ill-posed. This conclusion is both important and teachable
because now we have an example of the modification by censoring which makes estimation
of the CDF ill-posed. Further, comparing Figures 10.8 and 10.9, it is possible to explore the
effect of missing CSC data on estimation of the CDF.
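A minimal R sketch of the ratio CDF estimator (10.4.20) for missing CSC data may look as follows; the kernel density estimate is a simple stand-in for the E-estimator of the mixed density, and the Beta(2,2) and Uniform distributions are illustrative assumptions.

# Naive ratio CDF estimator (10.4.20) from missing CSC data (only Z_l with Delta_l = 1 observed)
set.seed(4)
n <- 400
X <- rbeta(n, 2, 2); Z <- runif(n); Delta <- as.numeric(X <= Z)
fZ <- function(z) rep(1, length(z))                  # known monitoring density
Zobs <- Z[Delta == 1]                                # available monitoring times
# f^{Delta Z, Delta}(z, 1) = f^{Delta Z | Delta}(z | 1) P(Delta = 1); kernel stand-in:
fmix <- approxfun(density(Zobs, from = 0, to = 1))
z <- seq(0.05, 0.95, by = 0.05)
F.tilde <- pmin(1, pmax(0, fmix(z) * mean(Delta) / fZ(z)))   # ratio estimate of F^X(z)
round(rbind(estimate = F.tilde, truth = pbeta(z, 2, 2)), 2)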
The bottom diagram in Figure 10.9 is constructed identically to the top one, only here
we observe Zl corresponding to ∆l = 0; recall that this is another missing scenario when
Fourier estimator (10.4.13) is used. Note that the number N = 53 of available observations
complements the number N = 47 in the top diagram, and the small sample sizes stress
how difficult the estimation problems are. The reader is advised to repeat Figure 10.9 with
different parameters and realize that the problem of missing CSC data is indeed extremely
complicated due to the combined effects of missing and ill-posedness.
Now let us finish our discussion of the CSC by considering an even more complicated
case when (10.4.3), and respectively (10.4.2), does not hold, meaning that the support of
X is no longer a subset of the support of Z. In this case no consistent estimation of the
density of interest f X is possible because no inference about tail of f X can be made. At
the same time, is it possible to gain any insight into the distribution of X?
Figure 10.9 Density estimation for missing current status censored (CSC) data. The density of the
monitoring time Z is assumed to be known. The underlying hidden sample from (X, Z) is the same
as in Figure 10.8 (any other simulation by this figure will produce its own independent sample). In
both diagrams the observed sample is from (AZ, ∆), only in the top diagram A := ∆ and in the
bottom A := 1 − ∆. As a result, in the top diagram only observations of Zl satisfying Zl ≥ Xl are
available and their number N is shown in the title. The available observations of the monitoring
time Z are shown via the histogram, the solid and dashed lines show the underlying density f X (x)
and its E-estimate, respectively. The dotted and dot-dashed lines show the underlying F X (x) and the
naïve estimate F̃ X (z) := fˆZ,∆ (z, 1)/f Z (z) where fˆZ,∆ (z, 1) is the E-estimate based on N available
observations of Z. ISE and ISEF are the (empirical) integrated squared errors for the density and
the cumulative distribution function estimates, respectively. The bottom diagram is similar to the
top one, only here observations Zl satisfying Zl < Xl are available. [n = 100, corn = 2, cJ0 = 3,
cJ1 = 0.8, cTH = 4]
We are exploring this issue for the case of missing CSC when only observations Zl
corresponding to ∆l = 1 are available. We are also relaxing our assumption that the support
of f X (x) is known. Denote by $\hat a$ and $\hat b$ the smallest and largest available monitoring times, by $\hat c := \hat b - \hat a$ the range, and by $\psi_0(x) := [1/\hat c]^{1/2}$, $\psi_j(x) := [2/\hat c]^{1/2}\cos(\pi j(x - \hat a)/\hat c)$, $j = 1, 2, \ldots$, the cosine basis on $[\hat a, \hat b]$.
Let us check the possibility to estimate f X (x) over the interval [â, b̂], which is the
empirical support for the monitoring time Z given X ≤ Z. Set
$$\kappa_j := \int_{\hat a}^{\hat b} \psi_j(x) f^X(x)\,dx, \quad j = 0, 1, \ldots \qquad (10.4.21)$$
for Fourier coefficients of f X (x) on $[\hat a, \hat b]$. Because the density of observations is directly related to F X and not to the density f X (recall (10.4.1)), it is prudent to rewrite (10.4.21) via F X . We do this separately for j = 0 and j ≥ 1. For j = 0 we can write that
$$\kappa_0 = \hat c^{-1/2}\int_{\hat a}^{\hat b} f^X(x)\,dx = \hat c^{-1/2}[F^X(\hat b) - F^X(\hat a)]. \qquad (10.4.22)$$
For $j \ge 1$, integration by parts yields
$$\kappa_j = [2/\hat c]^{1/2}\big[\cos(\pi j)F^X(\hat b) - F^X(\hat a)\big] + [2/\hat c]^{1/2}(\pi j/\hat c)\int_{\hat a}^{\hat b}\sin(\pi j(x - \hat a)/\hat c)F^X(x)\,dx. \qquad (10.4.23)$$
It is a teachable moment to compare (10.4.23) with the second line in (10.4.8) that
holds for the simpler setting (10.4.3) and known support [0, 1] of X. We can clearly see
the difference and the challenges of the considered setting. First of all, and this is very
important, we no longer know values of F X (b̂) and F X (â), and they affect every Fourier
coefficient. Recall that estimation of F X is an ill-posed problem, and this is a very serious
complication. A possible approach to estimating F X is based on using (10.4.5), which still holds in our case. It follows from (10.4.5) that
$$F^X(x) = \frac{f^{\Delta Z,\Delta}(x, 1)}{f^Z(x)}, \quad x \in [a, b], \qquad (10.4.24)$$
where [a, b] is the support of Z given ∆ = 1, and let us assume that f Z|∆ (z|1) ≥ c0 > 0
on [a, b]. Recall that f Z (z) is supposed to be known for the case of missing CSC, and we
already have discussed how to construct the mixed density E-estimator fˆ∆Z,∆ (x, 1). This
immediately yields a plug-in estimator
$$\tilde F^X(x) := \frac{\hat f^{\Delta Z,\Delta}(x, 1)}{f^Z(x)}, \quad x \in [\hat a, \hat b]. \qquad (10.4.25)$$
We can also use the bona fide properties of the CDF to improve the estimator by considering
its projection F̂ X (z) on a class of monotone and not exceeding 1 functions, see the Notes.
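One simple way to impose these bona fide properties on a raw CDF estimate given on a grid is a running-maximum correction clipped to [0, 1]; the R sketch below is only an illustrative stand-in and not necessarily the projection referred to in the Notes.

# Bona fide correction of a raw CDF estimate on a grid: monotone and within [0, 1]
bona.fide.cdf <- function(F.raw) pmin(1, pmax(0, cummax(F.raw)))
# Example: a noisy, slightly non-monotone raw estimate
F.raw <- c(0.05, 0.12, 0.10, 0.35, 0.35, 0.62, 0.60, 0.88, 1.03)
bona.fide.cdf(F.raw)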
Further, similarly to the motivation of the sample mean estimator (10.4.9) by (10.4.8), we may use (10.4.23), as well as (10.4.22), and propose a corresponding sample mean estimator of Fourier coefficients κj , j = 0, 1, . . .
This Fourier estimator implies the density E-estimator fˆX (x), x ∈ [â, b̂].
Now let us look at how the proposed E-estimator performs. Figure 10.10 sheds light on
the setting and the density estimator. Here the random variable of interest X has the Strata
Figure 10.10 Density estimation for missing CSC data when the support of monitoring time Z is the
subset of an unknown support of X. The case of missing with the availability A := ∆ := I(X ≤ Z)
is considered. The distributions of X and Z are the Strata and Uniform(a1, a2), respectively. The
top diagram shows by circles a hidden CSC sample from (Z, ∆) which is overlaid by the density
of interest f X (x) on its support [0, 1]. The support of the monitoring variable Z is [a1, a2], it is
shown in the title and highlighted by the two vertical dashed lines. The bottom diagram shows the
available observations of Z by the histogram, as well as the underlying density f X (x) (the solid
line) and its E-estimate (the dashed line) over the range of available monitoring times. Note that
the histogram corresponds to z-coordinates of the top circles in the upper diagram. The number of
available observations is $N := \sum_{l=1}^{n} \Delta_l$. [n = 200, corn = 4, a1 = 0.1, a2 = 0.6, cJ0 = 3, cJ1 = 0, cTH = 4]
distribution while the monitoring variable (time) Z has uniform distribution on [0.1, 0.6].
The top diagram allows us to understand the setting. Density f X (x) is shown by the solid
line. The monitoring variable Z is uniform on the interval indicated by two vertical dashed
lines. Hidden CSC observations are shown by the scatterplot of pairs (Z, ∆). Note that in
the missing CSC we are observing only values of Z corresponding to ∆ = 1, that is we know
only the top circles, and their total number is N = 90. In addition to the small number of
available monitoring times, note that the empirical support is also significantly smaller than
the support [0, 1] of the variable of interest X. The bottom diagram shows the histogram
of available observations of Z, and this is another visualization of the data at hand. The
histogram is overlaid by the solid line which indicates the underlying density f X (x) over
the empirical support [â, b̂]. Note that the histogram in no way resembles or hints at the
underlying density f X (x). The density E-estimate fˆX (x), x ∈ [â, b̂] is shown by the dashed
line. The density estimate is relatively good, but it is highly advisable to repeat Figure
10.10 a number of times with different parameters and to realize that here we are dealing
with an extremely complicated ill-posed problem where a chance to get a fair estimate is
slim.
Our final remark is about the possibility to estimate f X (x) by taking the derivative
of an estimate of F X (x). This approach is natural keeping in mind that the density of
observations is expressed via the cumulative distribution function F X (x) and not via the
density. Assuming that f Z (z) is known, this approach, as it follows from (10.4.5) or (10.4.24),
is equivalent to estimating the derivative of f ∆Z,∆ (z, 1). Estimation of derivatives is another
classical ill-posed problem and it will be considered in Section 10.7.
Yl = m(Xl ) + εl , Ul = Xl + ξl , l = 1, 2, . . . , n. (10.5.3)
The problem is again to estimate the regression function m(x), x ∈ [0, 1], and now this
problem is referred to as a regression problem with measurement errors in predictors or
simply a MEP regression. Note that a MEP regression is “symmetric” in terms of affecting
both the predictor and the response by additive errors. As we shall see shortly, errors in
predictors slow down the MISE convergence with respect to a standard regression and make
estimation of the regression function extremely complicated. For instance, if ξ in (10.5.3)
is normal, then the MISE decreases in n only logarithmically, and if the distribution of the
measurement error is unknown, then consistent regression estimation is impossible (note
that there is no such outcome for the classical regression model). This is important information to keep in mind because errors in the response and the predictor yield dramatically different effects on regression estimation. We conclude that a MEP regression
is ill-posed with respect to a standard regression, or we may say that the modification of
predictors by measurement errors is ill-posed.
After this bad news, let us mention some good news. First, E-estimation is still optimal
and dominates any other methodology. Second, for small samples and simple regression
functions, we may get reasonable results.
Let us explain when and how an E-estimator can be constructed for the MEP model
(10.5.3). The best way to explain the proposed solution is first to look at the statistic
$$\check\kappa_j := n^{-1}\sum_{l=1}^{n} Y_l \varphi_j(U_l), \quad j = 0, 1, \ldots \qquad (10.5.4)$$
This statistic is a naïve mimicking of the estimator $n^{-1}\sum_{l=1}^{n} Y_l\varphi_j(x_l)$ of the Fourier coefficients $\theta_j := \int_0^1 \varphi_j(x)m(x)\,dx$ for the regression model (10.5.2), and recall our notation $\varphi_0(x) := 1$, $\varphi_j(x) := 2^{1/2}\cos(\pi j x)$, $j = 1, 2, \ldots$ for the cosine basis on $[0, 1]$.
Now we calculate, step by step, the expectation of the statistic (10.5.4). For j = 0 we have
$$E\{\check\kappa_0\} = E\{Y\} = \int_0^1 [f^X(x)m(x)]\,dx. \qquad (10.5.5)$$
We continue for $j \ge 1$,
$$E\{\check\kappa_j\} = E\Big\{n^{-1}\sum_{l=1}^{n} Y_l\varphi_j(U_l)\Big\} = E\{Y\varphi_j(U)\}$$
Using it and additionally assuming that the predictor X and the measurement error ξ are
independent, we continue (10.5.7),
Relation (10.5.10) is the pivot for our understanding of how to construct a sample mean estimator of the Fourier coefficients $\theta_j := \int_0^1 \varphi_j(x)m(x)\,dx$ and hence the corresponding regression E-estimator for the MEP regression. But first let us combine the above-made assumptions. It is assumed that X, ε and ξ are mutually independent, ε has zero mean and finite variance, and the distribution of ξ is symmetric about zero. Under these assumptions (10.5.10) holds, and the characteristic function $\phi^\xi(t)$ of the measurement error ξ can be written as $\phi^\xi(t) = E\{e^{it\xi}\} = E\{\cos(t\xi)\}$, so that it is real-valued.
With $\hat g(x)$ denoting the E-estimate of $g(x) := m(x)f^X(x)$ (see (10.5.13)) and $\hat f^X(x)$ the deconvolution density E-estimate of Section 10.1, the plug-in regression estimator is
$$\hat m(x) := \frac{\hat g(x)}{\max(\hat f^X(x), c/\ln(n))}, \quad x \in [0, 1]. \qquad (10.5.16)$$
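The following R sketch illustrates (10.5.16) with a plain truncated cosine series in place of the E-estimators: the empirical Fourier coefficients of g(x) and of f X (x) are corrected by the Laplace characteristic function, in line with the relation discussed around (10.5.10). The regression function, the noise level, the scale b, the fixed cutoff J and the constant c are illustrative assumptions; the book's E-estimator selects its cutoff and thresholds adaptively.

# Minimal truncated-series sketch of the MEP regression estimator (10.5.16);
# Laplace(b) measurement error with known b, Uniform design, fixed cutoff J.
set.seed(5)
n <- 1000; b <- 0.2; J <- 6; cc <- 1
m <- function(x) 1 + 2 * exp(-50 * (x - 0.5)^2)     # hypothetical regression function
X <- runif(n)                                       # hidden predictors
Y <- m(X) + rnorm(n, sd = 0.5)                      # responses
xi <- rexp(n, 1 / b) * sample(c(-1, 1), n, replace = TRUE)   # Laplace(b) errors
U <- X + xi                                         # observed MEP predictors
phi <- function(j, t) sqrt(2) * cos(pi * j * t)     # cosine basis functions, j >= 1
phi.xi <- function(t) 1 / (1 + b^2 * t^2)           # Laplace characteristic function
xg <- seq(0, 1, length = 101)
g.hat <- rep(mean(Y), 101)                          # j = 0 term of g(x) = m(x) f^X(x)
f.hat <- rep(1, 101)                                # j = 0 term of the deconvolution estimate
for (j in 1:J) {
  g.hat <- g.hat + mean(Y * phi(j, U)) / phi.xi(pi * j) * phi(j, xg)
  f.hat <- f.hat + mean(phi(j, U)) / phi.xi(pi * j) * phi(j, xg)
}
m.hat <- g.hat / pmax(f.hat, cc / log(n))           # plug-in estimator (10.5.16)
matplot(xg, cbind(m(xg), m.hat), type = "l", lty = 1:2, ylab = "m(x)")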
Figure 10.11 Regression with measurement errors in predictors (MEP regression). The underlying
model is (10.5.3) with the regression function being the Bimodal, regression error ε being normal
with zero mean and variance σ 2 , measurement error ξ being Laplace with parameter b (the same
measurement error is used in Figure 10.2), and the predictor X being the Uniform. These variables
are mutually independent. The left-top diagram shows a hidden sample of size n from (X, Y ) via
the scattergram. The underlying regression (the solid line) and the regression E-estimate of Section
2.3 (the dashed line) are also shown. The left-bottom diagram illustrates estimation of function
g(x) defined in (10.5.13). The circles show the available sample from (U, Y ) where U := X + ξ,
and this scattergram is overlaid by the underlying function g(x) (the solid line) and its E-estimate
(the dashed line) over the interval of interest [0, 1]. The right-top diagram shows the histogram of
the sample from U , and by solid and dashed lines the underlying density f X (x), x ∈ [0, 1] and its
deconvolution E-estimate, respectively. (In this simulation the deconvolution E-estimate is perfect
and is hidden by the solid line.) Finally, the right-bottom diagram again shows us the scattergram of
available observations overlaid by the underlying regression (the solid line) and the MEP regression
estimate (the dashed line). {Argument c controls the parameter c in (10.5.16).} [n = 100, sigma = 1, b = 0.2, corn = 3, cJ0 = 3, cJ1 = 0, cTH = 4, c = 1]
This is the MAR case because the missing is defined by the always observed variable U .
According to Chapter 4, for a standard regression (10.5.2) (regression with P(ξ = 0) = 1)
the MAR is not destructive and further a complete-case approach is optimal.
In what follows all assumptions of the previous section hold, namely X is supported on
[0, 1], the design density f X (x) is positive on the support, variables X, ε and ξ are mutually
independent, ε is zero mean and has a finite variance, the distribution of ξ is symmetric
about zero and values φξ (jπ), j = 1, 2, . . . of its characteristic function are known and not
equal to zero. The problem is to estimate the regression function m(x) based on MAR data.
We begin with the case of a known availability likelihood w(u). Introduce a statistic
(compare with (10.5.4))
$$\tilde\kappa_j := n^{-1}\sum_{l=1}^{n} A_l Y_l \varphi_j(U_l)/w(U_l), \qquad (10.6.3)$$
This is the same expression as in (10.5.12), and hence we can use the plug-in regression E-estimator proposed in Section 10.5, which performs as follows. First, the E-estimator $\tilde g(x)$ of $g(x) := m(x)f^X(x)$ is constructed using the Fourier estimator
$$\tilde g_j := (n\phi^\xi(\pi j))^{-1}\sum_{l=1}^{n} A_l Y_l \varphi_j(U_l)/w(U_l) \qquad (10.6.6)$$
of the Fourier coefficients
$$g_j := \int_0^1 \varphi_j(x)[f^X(x)m(x)]\,dx. \qquad (10.6.7)$$
Second, the deconvolution density E-estimator fˆX (x) of Section 10.1 is calculated, and it is
based on the sample U1 , . . . , Un and the characteristic function φξ (πj). Finally, the plug-in
regression estimator is defined as (compare with (10.5.16))
$$\tilde m(x) := \frac{\tilde g(x)}{\max(\hat f^X(x), c/\ln(n))}, \quad x \in [0, 1]. \qquad (10.6.8)$$
Let us stress that if the characteristic function φξ (πj) is unknown then, similarly to
the previous section, consistent estimation of regression function m(x) is impossible. If the
latter is the case, then an extra sample from ξ may be used to estimate the characteristic
function.
So far it was assumed that the availability likelihood w(u) is known and it was used in
(10.6.3). If this function is unknown, and this is a typical case, then it can be estimated
from the sample (U1 , A1 ), . . . , (Un , An ). Indeed, we may write that w(u) = E{A|U = u},
and hence w(u) is a Bernoulli regression function and the regression E-estimator ŵ(u) of
Section 3.7 may be used. The only issue here is that the design density f U (u) is no longer necessarily separated from zero, and this may affect the estimation.
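Since w(u) = E{A|U = u} is a Bernoulli regression function, any reasonable regression of A on U can serve as a quick stand-in for the E-estimator of Section 3.7. The R sketch below uses a logistic fit and the availability likelihood from the caption of Figure 10.12; the Uniform design of U and the logistic fit itself are illustrative choices and not the book's procedure.

# Crude stand-in for estimating the availability likelihood w(u) = E{A | U = u}
set.seed(6)
n <- 400
U <- runif(n, -0.5, 1.5)                                             # observed MEP predictors
w <- function(u) 0.3 + 0.5 * exp(1 + 4 * u) / (1 + exp(1 + 4 * u))   # assumed likelihood
A <- rbinom(n, 1, w(U))                                              # availability indicators
fit <- glm(A ~ U, family = binomial)                                 # logistic stand-in
w.hat <- predict(fit, type = "response")                             # estimates of w(U_l)
summary(abs(w.hat - w(U)))                                           # errors at the design points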
Figure 10.12 illustrates the model and the proposed estimator. The underlying regression
model is identical to the one in Figure 10.11, only here the Normal regression function is
used. The left-top diagram shows us the underlying regression of Y on X, and we can
visualize it thanks to the simulation (in a practical example it would not be available).
The underlying regression function is the Normal, and due to relatively large noise and
small sample size the estimate is not perfect but correctly indicates the unimodal shape
of the regression. Further, note that the estimate does reflect the scattergram. It may be
instructive to compare this diagram with the left-top diagram in Figure 10.11. The new
diagram in Figure 10.12 is the left-bottom one. It shows us a classical Bernoulli regression
of A on U . According to (10.6.3), we need to know values of the availability likelihood
function w(u) only at points Ul . This is why the triangles show values of w(Ul ) and the
crosses show values of its estimate ŵ(Ul ). The estimate is clearly not good, but it does
follow the scattergram and at least correctly shows that the availability likelihood increases
in u. The right-top diagram shows that the design density estimate is perfect. Finally, the
right-bottom diagram shows the underlying regression and its estimate. For this particular
simulation only N = 75 responses from n = 100 are available. The outcome is curious
because the stochasticity, together with the small sample size, created a simulation where
the estimate based on the missing MEP regression is more accurate than the one based
on the hidden regression. It is advisable to repeat Figure 10.12 and realize that, despite
this particular optimistic outcome, the problem is ill-posed, it is complicated by missing
responses, and hence another simulation may create a worse outcome. Further, repeated
simulations indicate that the effect of inaccurate estimation of w(u) is less severe than that of inaccurate estimation of the design density f X (x).
Now let us consider a more complicated, and at the same time very interesting, case when the missing is defined by the hidden predictor X, and the availability likelihood is
$$P(A = 1|X = x, Y = y, U = u) = P(A = 1|X = x) =: w(x). \qquad (10.6.9)$$
In particular, this is the situation when missing of the response occurs before the predic-
tor X is contaminated by the measurement error ξ. In other words, we have a standard
MAR missing in a classical regression, and only then the predictor is contaminated by the
measurement error.
Let us see what can be done in this case. Introduce a statistic
$$\hat\kappa_j := n^{-1}\sum_{l=1}^{n} A_l Y_l \varphi_j(U_l). \qquad (10.6.10)$$
Figure 10.12 MEP regression (10.6.1) with missing responses, the case when the missing is de-
fined by always observable U . The underlying model is the same as in Figure 10.11, only here
the particular regression function is the Normal. The missing mechanism is (10.6.2) with the
availability likelihood w∗ (u) := max(dwL , min(dwU , w(u))), and for this particular simulation
w(u) = 0.3 + 0.5e1+4u /(1 + e1+4u ), dwL = 0.3 and dwU = 0.9. With respect to Figure 10.11,
the new diagram is the left-bottom one that shows estimation of the availability likelihood. Here
circles show the Bernoulli scattergram, the triangles and crosses show values w(Ul ) and ŵ(Ul ),
respectively. In the right-bottom diagram the circles show complete pairs and triangles show incom-
plete ones with missed responses. {Function w(u) is defined by a string w.} [n = 100, sigma = 1, b = 0.2, corn = 2, w = "0.3+0.5*exp(1+4*u)/(1+exp(1+4*u))", dwL = 0.3, dwU = 0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4, c = 1]
We can estimate this Fourier coefficient using the sample mean estimator
$$\hat q_j = (n\phi^\xi(\pi j))^{-1}\sum_{l=1}^{n} A_l Y_l \varphi_j(U_l). \qquad (10.6.15)$$
In its turn, this Fourier estimator yields the E-estimator q̂(x) of q(x).
Returning to (10.6.13), we now have a clear path for estimating the regression function
m(x) if we are able to estimate the product f X (x)w(x). Let us explain a possible approach.
There exists a useful formula,
$$f^{X|A}(x|1) = \frac{f^X(x)w(x)}{P(A = 1)}. \qquad (10.6.16)$$
The numerator on the right side of (10.6.16) is the product f X (x)w(x), which we would like to know, and the probability P(A = 1) is straightforwardly estimated by the sample mean estimator $n^{-1}\sum_{l=1}^{n} A_l$.
As a result, we need to understand how to estimate the conditional density f X|A (x|1)
based on a sample (U1 , A1 Y1 , A1 ), . . . , (Un , An Yn , An ). Is it possible to use here the decon-
volution density estimator of Section 10.1? To answer this question, consider a subsample of
(Ul , Al ) with Al = 1, that is a subsample of MEP predictors in complete cases. Because X
and ξ are independent, this, together with (10.6.9), implies the equality f ξ|A (z|1) = f ξ (z).
In its turn, this equality allows us to deduce the following convolution formula,
$$f^{U|A}(u|1) = \int_0^1 f^{X|A}(x|1) f^\xi(u - x)\,dx. \qquad (10.6.17)$$
As a result, we indeed can use the subsample of MEP predictors in complete cases and
construct the deconvolution density E-estimator fˆX|A (x|1) of Section 10.1.
Now returning to (10.6.16), we can estimate the product f X (x)w(x) by the estimator $\hat f^{X|A}(x|1)\,n^{-1}\sum_{l=1}^{n} A_l$. This, together with (10.6.13) and the above-constructed estimator q̂(x), yields the following regression estimator for the case when the availability likelihood is defined by the value of X,
$$\hat m(x) := \frac{\hat q(x)}{[n^{-1}\sum_{l=1}^{n} A_l]\,\max(\hat f^{X|A}(x|1), c/\ln(n))}, \quad x \in [0, 1]. \qquad (10.6.18)$$
Figure 10.13 MEP regression with missing responses when the missing is defined by the hidden
predictor X. The underlying model is the same as in Figure 10.12 only here the missing mechanism
is (10.6.9). With respect to Figure 10.12, the new diagrams are the left-bottom and right-top ones.
The left-bottom diagram shows estimation of the function q(x) defined in (10.6.13). Here the solid
and dashed lines are q(x) and its E-estimate, respectively. Also, the dotted line shows the underlying
availability likelihood function w(x). The right-top diagram shows the histogram of Ul corresponding
to Al = 1, as well as the conditional density f X|A (x|1) (the solid line) and its deconvolution
E-estimate fˆX|A (x|1) (the dashed line) for x ∈ [0, 1]. {In the default simulation the availability
likelihood is w∗ (x) := max(dwL , min(dwU , w(x))), w(x) = 0.1+0.9x, dwL = 0.2 and dwU = 0.9. The
function w(x) is defined by the string w, and parameters dwL and dwU are controlled by arguments
dwL and dwU.} [n = 100, sigma = 1, b = 0.2, corn = 2, w = "0.1+0.9*x", dwL = 0.2, dwU = 0.9, cJ0 = 3, cJ1 = 0.8, cTH = 4, c = 1]
Of course, instead of developing the ratio estimator (10.6.18) we could use our traditional approach of estimating Fourier coefficients of the regression function; the development of this approach is left as an exercise.
Figure 10.13 illustrates the setting and performance of the regression estimator. The
underlying MEP regression model is the same as in Figure 10.12, and the left-top diagram
shows us the scattergram of the hidden regression of Y on X. The solid and dashed lines
are the underlying Normal regression function and its E-estimate. It is a nice exercise to
use your imagination and try to propose a better fit for this particular scattergram (and, of
course, repeat the figure and visualize more simulations). The missing MEP data are shown
in the right-bottom diagram. Here the circles show complete pairs (Ul , Al Yl ) corresponding
to Al = 1 and the triangles show incomplete pairs (Ul , 0) corresponding to Al = 0. You
may see how the MEP spreads observations along corresponding horizontal lines and how
the missing, defined by an underlying value of X, causes incomplete cases. The underlying
availability likelihood function w(x) is shown by the dotted line in the left-bottom diagram. Note that only $N := \sum_{l=1}^{n} A_l = 59$ pairs are complete, and hence we are dealing with a
simulation which is difficult even for the case of a regular regression. There are three steps in
construction of the regression estimator for the MEP regression with missing responses. The
first one is to estimate the function q(x) := f X (x)w(x)m(x). We do this by the E-estimator
based on Fourier estimator (10.6.15). The function and the E-estimate are shown in the
left-bottom diagram. The outcome is good. At the same time, let us stress that estimation
of the function q(x) is an ill-posed problem with missing data, this is a difficult problem,
and it is advisable to repeat Figure 10.13 to realize the complexity.
The second step is to estimate the conditional density f X|A (x|1). Recall that the es-
timation is based on observations of Ul from complete pairs. According to (10.6.16), this
density is proportional to f X (x)w(x), and note that in the simulation the design density
f X (x) is the Uniform and the availability likelihood w(x) is shown by the dotted line in
the left-bottom diagram. Function w(x) is only piecewise differentiable, and this creates
a complication in estimation of the conditional density f X|A (x|1). The right-top diagram
illustrates the estimation. The histogram shows us N = 59 observations of Ul in complete
cases. Note that the histogram resembles neither f X (x), nor w(x), nor their product. The
underlying function f X|A (x|1) is shown by the solid line. Note that it is about twice in
magnitude as the function w(x) shown in the left-bottom diagram, and this is due to the
denominator P(A = 1) in (10.6.16). The deconvolution E-estimate fˆX|A (x|1) is shown by
the dashed line, and it is very good keeping in mind the small sample size and ill-posed
nature of the deconvolution problem.
The final step is (10.6.18) where the estimate of q(x) (the dashed line in the left-bottom
diagram) is divided by the estimate of the conditional density (the dashed line in the
right-top diagram) and then divided by the sample mean estimate of P(A = 1). Note that
(10.6.18) involves the ratio of two nonparametric estimates, and each of them is for an
ill-posed problem. This is what makes the problem of MEP regression with missing data
so complicated. The particular estimate, shown in the right-bottom diagram, is good, but
keeping in mind complexity of the problem, another simulation may imply a worse outcome.
It is highly recommended to repeat Figures 10.12 and 10.13 with different parameters
and compare simulations and performance of the estimators. Also, pay attention to the
nuisance functions that may be of interest in practical applications.
10.7 Estimation of Derivative

Let us begin with estimation of the cumulative distribution function $F^X(x)$ of a random variable X based on a direct sample $X_1, \ldots, X_n$ and the classical empirical cumulative distribution function
$$\hat F^X(x) := n^{-1}\sum_{l=1}^{n} I(X_l \le x), \qquad (10.7.1)$$
where I(·) is the indicator. Note that (10.7.1) is the sample mean estimator because
$$F^X(x) = P(X \le x) = E\{I(X \le x)\}. \qquad (10.7.2)$$
This immediately yields that the estimator (10.7.1) is unbiased and its variance decreases with the parametric rate $n^{-1}$, namely
$$V(\hat F^X(x)) = n^{-1}F^X(x)(1 - F^X(x)). \qquad (10.7.3)$$
We may conclude that, despite being a nonparametric estimation problem, the cumu-
lative distribution function can be estimated with the parametric rate n−1 . This is an
important conclusion on its own because now we have an example of nonparametric esti-
mation with the parametric rate. Further, as we will see shortly, it allows us to understand
that the notion of ill-posedness depends on a chosen benchmark.
Now let us consider the problem of estimation of the derivative of F X (x), which is called the probability density of X,
$$f^X(x) := dF^X(x)/dx. \qquad (10.7.5)$$
As we know from Section 2.2, a probability density cannot be estimated with the para-
metric rate n−1 even if it is a smooth function. If the density is α-fold differentiable and
belongs to a Sobolev class Sα,Q defined in (2.1.11) (the latter is assumed in this section),
then the best result that can be guaranteed for the mean squared error is the rate $n^{-2\alpha/(2\alpha+1)}$, and the same rate of convergence holds for the MISE. Here and in what follows $c_1, c_2, \ldots$
denote generic positive constants whose exact values are not of interest.
We may conclude that density estimation, with respect to estimation of the cumulative
distribution function, is an ill-posed problem. And indeed, this is how this problem was
treated when the research on nonparametric probability density estimation was initiated.
As time passed, more statisticians became involved in nonparametric research, the literature grew rapidly, and slowly but surely nonparametric density estimation, characterized by slower rates, became a classical statistical problem in its own right. As a result, papers and books that nowadays treat nonparametric density estimation as an ill-posed problem are next to none. At the same time, of course, the problem of nonparametric density estimation is ill-posed with respect to estimation of the cumulative distribution function.
Let us comment on why we have slower rates of the MISE convergence for the density
estimation. Consider a cosine series estimator
$$\check f^X(x) := 1 + \sum_{j=1}^{J} \hat\theta_j \varphi_j(x), \quad x \in [0, 1]. \qquad (10.7.7)$$
Here
$$\hat\theta_j := n^{-1}\sum_{l=1}^{n} \varphi_j(X_l) \qquad (10.7.8)$$
and the parameter J in (10.7.7) is called the cutoff. Assuming that the density is square integrable, we can write
$$f^X(x) = 1 + \sum_{j=1}^{\infty} \theta_j \varphi_j(x), \quad x \in [0, 1]. \qquad (10.7.10)$$
Using the Parseval identity (1.3.40) we get the following expression for the mean integrated squared error (MISE) of the estimator $\check f^X(x)$ (compare with (2.2.10)),
$$\mathrm{MISE}(\check f^X, f^X) := E\Big\{\int_0^1 (\check f^X(x) - f^X(x))^2\,dx\Big\} = \sum_{j=1}^{J} E\{(\hat\theta_j - \theta_j)^2\} + \sum_{j>J} \theta_j^2. \qquad (10.7.11)$$
Estimator $\hat\theta_j$ is the sample mean estimator, and hence it is unbiased, its variance is equal to the mean squared error, and the variance is easily evaluated (the calculation is left as an exercise and recall notation $o_s(1)$ for sequences vanishing as $s \to \infty$),
$$E\{(\hat\theta_j - \theta_j)^2\} = V(\hat\theta_j) = n^{-1}[1 + o_j(1)]. \qquad (10.7.12)$$
Further, for an $\alpha$-fold differentiable density from the Sobolev class (2.1.11),
$$\sum_{j>J} \theta_j^2 \le c_2 J^{-2\alpha}. \qquad (10.7.13)$$
Using (10.7.12) and (10.7.13) in (10.7.11) we conclude that the MISE of the estimator (10.7.7) can be bounded from above,
$$\mathrm{MISE}(\check f^X, f^X) \le c_3[n^{-1}J + J^{-2\alpha}], \qquad (10.7.14)$$
and it implies the above-mentioned rate $n^{-2\alpha/(2\alpha+1)}$ of the MISE convergence. Further, the
asymptotic theory asserts that this rate of the MISE convergence is optimal in the sense
that no other estimator can achieve a faster rate uniformly over the Sobolev class (2.1.11)
of α-fold differentiable densities.
Figure 10.14 Estimation of the cumulative distribution function (CDF), its first derivative (the
probability density), and its second derivative (derivative of the probability density). Two simulations
with different underlying distributions and different sample sizes are shown in the two columns.
The solid and dashed lines show underlying functions and their estimates, respectively. Samples
are shown by histograms. {Argument set.n allows to choose sample sizes for the two experiments.}
[set.n = c(100,200), set.corn = c(2,3), cJ0 = 3, cJ1 = 0.8, cTH = 4]
We may conclude that nonparametric estimation of the derivative of the cumulative dis-
tribution function, or equivalently nonparametric estimation of the density, is an ill-posed
problem which implies slower rates of a risk convergence with respect to the benchmark
problem of estimation of the cumulative distribution function. There is also another interesting conclusion to point out. The smoothness of F X (x), say it is piecewise continuous
or twice differentiable, has no effect on the rate n−1 of the variance or the MISE conver-
gence, see (10.7.3). At the same time, (10.7.6) indicates that smoothness of the cumulative
distribution function dramatically affects accuracy of estimation of the derivative of the
cumulative distribution function.
Now we are in a position to continue our investigation of estimation of derivatives.
Consider the problem of estimation of the second derivative of the cumulative distribution
function, or equivalently the problem of estimation of the derivative of the density. In what
follows, we are using the latter formulation to simplify notation.
Several questions immediately arise. Is the problem of estimation of the density’s deriva-
tive ill-posed with respect to the density estimation? Will the smoothness of density affect
estimation of its derivative?
To answer these questions, set
$$q^X(x) := df^X(x)/dx \qquad (10.7.16)$$
for the density's derivative. Note that because we are dealing with only one random variable X, we may skip the superscript X in the density's derivative $q^X(x)$ and write q(x) instead.
We are going to consider a very simple estimator of $q^X(x)$, which is the derivative of the series estimator (10.7.7),
$$\check q^X(x) = d\check f^X(x)/dx = -\sum_{j=1}^{J} \hat\theta_j (\pi j)\psi_j(x), \quad \psi_j(x) := 2^{1/2}\sin(\pi j x). \qquad (10.7.17)$$
Note that {ψj (x), j = 1, 2, . . .} is the sine orthonormal basis on [0, 1].
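The following R sketch (with an illustrative Beta(2,3) density and a fixed cutoff J, both assumptions) computes the sample mean Fourier coefficients (10.7.8), the series density estimate (10.7.7), and its derivative (10.7.17), and reports grid approximations of the corresponding integrated squared errors.

# Cosine series density estimate (10.7.7)-(10.7.8) and its derivative (10.7.17)
set.seed(7)
n <- 300; J <- 5
X <- rbeta(n, 2, 3)                                   # illustrative density on [0, 1]
xg <- seq(0, 1, length = 201)
theta.hat <- sapply(1:J, function(j) mean(sqrt(2) * cos(pi * j * X)))   # (10.7.8)
f.check <- 1 + sapply(xg, function(t) sum(theta.hat * sqrt(2) * cos(pi * (1:J) * t)))   # (10.7.7)
q.check <- -sapply(xg, function(t) sum(theta.hat * (pi * (1:J)) * sqrt(2) * sin(pi * (1:J) * t)))  # (10.7.17)
f.true <- dbeta(xg, 2, 3)                             # underlying density 12 x (1 - x)^2
q.true <- 12 * (1 - xg) * (1 - 3 * xg)                # its derivative
c(ISE.density = mean((f.check - f.true)^2),           # grid approximations of the ISEs
  ISE.derivative = mean((q.check - q.true)^2))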
Let us calculate the MISE of estimator (10.7.17). Using the Parseval identity we may write,
$$\mathrm{MISE}(\check q^X, q^X) := E\Big\{\int_0^1 (\check q^X(x) - q^X(x))^2\,dx\Big\} = \sum_{j=1}^{J} (\pi j)^2 E\{(\hat\theta_j - \theta_j)^2\} + \sum_{j>J} (\pi j)^2\theta_j^2. \qquad (10.7.18)$$
Recall that we are considering f X (x) from the Sobolev class (2.1.11) of α-fold differ-
entiable densities. Assume that α > 1 and evaluate the second sum on the right side of
(10.7.18),
$$\sum_{j>J} (\pi j)^2\theta_j^2 < c_4 J^{-2\alpha+2}. \qquad (10.7.19)$$
Using this and (10.7.12) in (10.7.18), we get the following upper bound for the MISE of the density's derivative estimator,
$$\mathrm{MISE}(\check q^X, q^X) \le c_5[n^{-1}J^3 + J^{-2\alpha+2}]. \qquad (10.7.20)$$
The cutoff minimizing this bound, defined in (10.7.21), is proportional to $n^{1/(2\alpha+1)}$.
It is of interest to compare this cutoff for the density’s derivative estimator with the
optimal cutoff Jn∗ for the density estimator defined in (10.7.15). As we see, they are both
proportional to n1/(2α+1) , that is estimation of the derivative does not affect how the optimal
cutoff increases in n.
Using (10.7.21) in (10.7.20) we conclude that
$$\mathrm{MISE}(\check q^X, q^X) \le c_6 n^{-2(\alpha-1)/(2\alpha+1)}.$$
The asymptotic theory asserts that no other estimator can improve the above-presented
rate of the MISE convergence uniformly over densities from the Sobolev class of α-fold
differentiable densities. The latter means that our estimator of the derivative is rate optimal.
Further, using the cutoff $J_n^*$ in place of $J_n'$ in the estimator of the derivative also yields the
optimal rate of the MISE convergence. We conclude that the derivative of the density E-
estimator of Section 2.2 may be used as the estimator of the density’s derivative.
The optimal rate n−2(α−1)/(2α+1) for estimation of the derivative is slower than the
optimal rate n−2α/(2α+1) for estimation of the density, and this implies that the problem of
estimation of the derivative is ill-posed with respect to the problem of density estimation.
In short, estimation of derivative is ill-posed.
The same conclusion holds for other classical nonparametric problems like regression,
filtering a signal or spectral analysis. This does not come as a surprise to us because we
know that all these nonparametric problems have a lot in common, and moreover there
exists a theory of equivalence between these problems.
Figure 10.14 illustrates the problem of estimation of the density and its derivative. The
top diagrams show us two simulations via histograms and estimation of the cumulative
distribution function. Note that the sample sizes of simulations are different to take into
account the more complicated shape of the Bimodal. Pay attention to the small integrated
squared errors of the empirical cumulative distribution function.
The middle diagrams illustrate the familiar density estimation problem. Of course, we
may equivalently say that this is the problem of estimation of the derivative of the cu-
mulative distribution function. In both cases E-estimates are good and this is reflected by
the corresponding empirical ISEs. At the same time, compare these ISEs with those in the
corresponding top diagrams and get the feeling of ill-posedness.
The bottom diagrams show derivatives of the underlying densities and their E-estimates.
Overall, the estimated derivatives are good but look at the large empirical ISEs. The latter,
at least in part, is explained by the fact that the derivatives are larger functions (look at
the scales), but it also reflects the ill-posed nature of the problem. Further, repeated simu-
lations indicate a relatively large number of poor estimates, and it is highly recommended
to repeat Figure 10.14 with different parameters and get used to the problem of estimation
of derivatives.
Let us also note that, using (10.7.17), it is possible to develop a sine series E-estimator of the derivative. Further, it is possible to consider more complicated settings
when data are missed and/or modified by truncation, censoring, measurement errors, etc.
These extensions are done straightforwardly using the E-methodology and they are left as
exercises.
10.8 Exercises
10.1.1 Give several examples of a problem with measurement errors. In those examples,
what can be said about the distribution of errors?
10.1.2∗ Consider a pair of random variables (X, ε). If the joint cumulative distribution
function is given, what is the cumulative distribution function of their sum? Hint: Note
that no information about the independence is provided.
10.1.3 Consider a pair of continuous random variables (X, ε). What is the density of their
sum? Hint: Begin with the case of independent variables and then look at a general case.
10.1.4 Give definition of a characteristic function. Does it always exist?
10.1.5 Calculate characteristic functions of uniform, normal and Laplace random variables.
10.1.6 What is the value of a characteristic function of X at point zero, that is φX (0)?
Hint: Note that the distribution of X is not specified.
10.1.7 Prove (10.1.4).
10.1.8 Verify assertion (10.1.5).
10.1.9 Find the mean and the variance of an empirical characteristic function. Do the mean
and the variance always exist?
10.1.10 Explain how the deconvolution problem can be solved.
10.1.11 Verify formula (10.1.9) for the variance of estimator (10.1.8).
10.1.12 Using formula (10.1.9), present examples of measurement errors when the variance
increases exponentially and as a power in t.
10.1.13 Explain the simulation used by Figure 10.1.
10.1.14 Repeat Figure 10.1 using different arguments and comment on obtained result.
10.1.15 For which density, among our four corner ones, is the effect of additive noise most visually devastating in terms of “hiding” an underlying distribution? Hint: Begin with the
theory and then confirm your conclusion using Figure 10.1.
10.1.16 Explain the estimate (10.1.12).
10.1.17 Find the mean and variance of the estimate (10.1.12).
10.1.18 Explain the simulation used in Figure 10.2.
10.1.19 Using Figure 10.2, find better values for parameters of the E-estimator. Hint: Use
empirical ISEs.
10.1.20 In Figure 10.2, what is the role of parameter CH? Repeat Figure 10.2 for different
sample sizes and suggest a better value for the parameter.
10.1.21 Explain the purpose of using an extra sample in Figure 10.3.
10.1.22 Explain the E-estimator used in Figure 10.3.
10.1.23 Suggest better parameters for E-estimator used in Figure 10.3. Hint: Use empirical
ISEs.
10.1.24 Consider sample sizes n = 100 and n = 200. Then, using Figure 10.3, explore
minimal sample sizes m of extra samples that imply a reliable estimation.
10.1.25 Explain how the case of repeated observations can be used for solving the decon-
volution problem with unknown distribution of the measurement error.
10.1.26∗ Find the mean and variance of the estimator (10.1.15).
10.1.27 What is the definition of a circular random variable? Give examples.
10.1.28 Explain formula (10.1.16).
10.1.29∗ Verify (10.1.17).
10.1.30∗ Explain why the proposed deconvolution E-estimator may be also used for circular
data.
10.1.31∗ Evaluate the MISE of the proposed deconvolution E-estimator. Hint: Begin with
the case of a known nuisance function.
10.1.32∗ Consider the case of unknown distribution of ε. Suppose that an extra sample
from ε may be obtained. Suggest an E-estimator for this case and analyze its MISE.
10.1.33∗ Suppose that the measurement error ε and the variable of interest X are depen-
dent. Suggest a deconvolution estimator.
10.2.1 Explain the problem of density deconvolution with missing data.
10.2.2 What type of observations is available for density deconvolution with missing data?
10.2.3 Explain a possible solution for the case of MCAR.
10.2.4 Explain a setting when the availability likelihood depends on Y . Present several
examples.
10.2.5 What is an underlying idea of estimator (10.2.3)?
10.2.6∗ Find the mean and variance of estimator (10.2.3). Is it unbiased?
10.2.7 Explain the underlying simulation used by Figure 10.4.
10.2.8 Repeat Figure 10.4 with different sample sizes, and comment on outcomes.
10.2.9 Present several examples when the availability likelihood depends on the underlying
variable of interest X.
10.2.10 Explain how E-estimator, proposed for the case when the availability likelihood
depends on X, is constructed.
10.2.11∗ Find the mean and variance of estimator (10.2.10). Then evaluate the MISE of
E-estimator (10.2.11).
10.2.12∗ Use Figure 10.5 and analyze the effect of the availability likelihood w(x) on estima-
tion. Which functions are less and more favorable for the E-estimator? Hint: Theoretically
analyze the MISE and then test your conclusion using empirical ISEs. Pay attention to the
number of available observations.
10.2.13∗ Explain the case when the availability likelihood depends on the measurement
error ε. Present several examples. Then propose a deconvolution E-estimator and evaluate
its MISE.
10.2.14 What is the motivation of the estimator (10.2.17)?
10.2.15 Find the mean and variance of estimator (10.2.17).
10.2.16 How can the availability likelihood w(z) be estimated?
10.2.17 Explain the underlying simulation in Figure 10.6.
10.2.18∗ Using Figure 10.6, analyze the effect of the estimator of the availability likelihood
on performance of the E-estimator. Hint: Develop a numerical study and use empirical ISEs.
10.2.19∗ Does the shape of the availability likelihood function affect the deconvolution? If
the answer is “yes,” then explain how. Hint: Use both the theory and Figure 10.6.
10.2.20 Consider sample sizes n = 100 and n = 400 for Figure 10.6. Do you see a significant
difference in quality of estimation? Hint: Support your conclusion by analysis of empirical
ISEs.
10.2.21 Suggest better values for parameters of the E-estimator used in Figure 10.6.
10.2.22 Verify each equality in (10.2.18).
10.2.23∗ Explain the idea of estimation of the availability function w(z). Propose an E-
estimator. Evaluate the MISE of that E-estimator.
10.2.24∗ Repeat Figures 10.4-10.6, compare quality of estimation, analyze the results, and
write a report.
10.2.25∗ Consider the case of unknown distribution of ε. Suppose that an extra sample
from ε may be obtained. Suggest a deconvolution E-estimator for this case and analyze its
MISE.
10.3.1 Explain the model of density deconvolution for censored data, and also present
several examples. Further, is it necessary to assume that the variable Y is nonnegative
(lifetime)?
10.3.2 Verify (10.3.2).
10.3.3 Explain why (10.3.3) is valid. Hint: Think about a condition needed for the validity
of (10.3.3).
10.3.4 Why can (10.3.1) be written as (10.3.4)?
10.3.5 Explain the underlying idea of estimator (10.3.5).
10.3.6∗ Find the mean and variance of estimator (10.3.5) of the survival function.
10.3.7 Explain the estimator (10.3.6).
10.3.8 Find the mean of statistic (10.3.6). Hint: Begin with the case of known GC .
10.3.9∗ Find the mean squared error of estimator (10.3.6) when the estimand is θj .
10.3.10∗ Describe the proposed deconvolution E-estimator of density f X . Then calculate
its MISE.
10.3.11 Explain the underlying simulation of Figure 10.7.
10.3.12 Using Figure 10.7, explain data and how the E-estimator performs.
10.3.13∗ Using Figure 10.7, explain how measurement errors and censoring affect the esti-
mation. Hint: Propose a numerical study and then make a statistical analysis of empirical
ISEs.
10.3.14 Using Figure 10.7, explain how parameters of this figure affect the quality of esti-
mation.
10.3.15 Find better arguments of the E-estimator used in Figure 10.7. Do your recommen-
dations depend on the underlying censoring distribution and sample size?
10.3.16∗ Explain, using both the theory and simulations of Figure 10.7, how the support
of censoring variable C affects the studied estimation.
10.3.17∗ Consider the case of an unknown distribution of ε. Suppose that an extra sample
from ε may be obtained. Suggest an E-estimator for this case and analyze its MISE.
10.3.18∗ Explain how the confidence bands are constructed.
10.3.19∗ Evaluate the MISE of proposed E-estimator. Hint: Begin with the case of known
nuisance functions.
10.3.20∗ Consider the case of truncated data and propose a consistent estimator.
10.4.1 Describe several examples of the current status censoring (CSC).
10.4.2 What is the statistical model of CSC?
10.4.3 Suppose that you would like to simulate CSC data. Explain how this may be done.
10.4.4 Explain formula (10.4.1) for the joint (mixed) density of (Z, ∆). What are the
assumptions?
10.4.5∗ Consider the case when support of the monitoring variable Z is the subset of the
support of the random variable of interest X. In other words, suppose that (10.4.2) does
not hold. Is it possible in this case to consistently estimate the density of X? Explain your
answer. Then explore an estimand that may be consistently estimated.
10.4.6 The studied CSC modification is ill-posed. What does this mean?
10.4.7 There are two considered models of missing CSC data. Explain them. Further, what
is the difference between truncated CSC and missing CSC?
10.4.8 Explain and prove expression (10.4.5) for the joint density of (∆Z, ∆).
10.4.9 Verify (10.4.6).
10.4.10 Explain and prove each equality in (10.4.8).
10.4.11 Explain the motivation behind the estimator (10.4.9).
10.4.12∗ Find the mean and variance of the estimator (10.4.9). Is it unbiased? What is the
effect of the nuisance function?
10.4.13 Use formula (10.4.10) to explain ill-posedness.
10.4.14∗ Consider a series density estimator $\tilde f^X(x, J) := 1 + \sum_{j=1}^{J} \tilde\theta_j \varphi_j(x)$. Find its MISE.
Then propose an estimator of the cutoff J and evaluate the MISE of the plug-in density
estimator.
10.4.15 Explain a missing mechanism which implies (10.4.11).
10.4.16 Verify each equality in (10.4.12).
10.4.17∗ Explain the underlying idea of estimator (10.4.13). Is it unbiased? Then calculate
its mean squared error.
10.4.18 Explain the idea of an aggregated estimator (10.4.14). Is it unbiased? Why may
we refer to it as linear? Hint: Recall why a regression is called linear.
10.4.19∗ Verify (10.4.15). What is the used assumption? Then consider the case of depen-
dent estimators γ̃1 and γ̃2 , and repeat calculation of the variance.
10.4.20 What is the value of parameter λ which minimizes the variance of estimator
(10.4.14)? Prove your assertion.
10.4.21 Explain why the weights in (10.4.17) for the two aggregated estimators have sense.
Hint: Think about σ12 significantly smaller than σ22 and vice versa.
10.4.22∗ Prove (10.4.18). Does the inequality justify the aggregation? Then consider a
sample of size 3n from X. Suppose that the estimand is the expectation of X. Propose
an estimator based on the first n observations and another based on the remaining 2n
observations. Aggregate the two estimators and analyze the outcome.
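As an illustration of the aggregation discussed in the last exercises, the following minimal R sketch (with a Normal sample as an arbitrary choice) combines the two sample means of Exercise 10.4.22 using weights proportional to the reciprocals of their estimated variances.

# Inverse-variance aggregation of two sample-mean estimators (sketch for Exercise 10.4.22).
set.seed(2)
n <- 100
X <- rnorm(3 * n, mean = 1, sd = 2)
g1 <- mean(X[1:n])                      # estimator based on the first n observations
g2 <- mean(X[(n + 1):(3 * n)])          # estimator based on the remaining 2n observations
v1 <- var(X[1:n]) / n                   # estimated variance of g1
v2 <- var(X[(n + 1):(3 * n)]) / (2 * n) # estimated variance of g2
lambda <- (1 / v1) / (1 / v1 + 1 / v2)  # weight assigned to g1
g.agg <- lambda * g1 + (1 - lambda) * g2
c(g1 = g1, g2 = g2, aggregated = g.agg, lambda = lambda)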
10.4.23 Suppose that you have a sample of size n. What is the empirical cumulative dis-
tribution function?
10.4.24 Describe statistical properties of the empirical cumulative distribution function.
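A minimal R sketch of the empirical cumulative distribution function, using the standard function ecdf() and an arbitrary Normal sample, may help with the last two exercises.

# Empirical cumulative distribution function of a sample of size n.
set.seed(5)
n <- 50
X <- rnorm(n)
Fhat <- ecdf(X)   # returns the empirical cdf as a step function
plot(Fhat, main = "Empirical cumulative distribution function")
Fhat(0)           # value of the empirical cdf at x = 0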
10.4.25 Explain the simulation used by Figure 10.8.
10.4.26 Explain all curves in Figure 10.8.
10.4.27 How are ISE and ISEF calculated in Figure 10.8?
10.4.28 Using Figure 10.8, consider different sample sizes and comment on the performance
of the E-estimator.
10.4.29 Using Figure 10.8, consider different corner functions. Write down a report about
your observations.
10.4.30 Explain the simulation used by Figure 10.9.
10.4.31 What is the difference between top and bottom diagrams in Figure 10.9?
10.4.32 How are ISE and ISEF calculated in Figure 10.9?
10.4.33 Consider Figure 10.9. Explain why the histogram in the top diagram is skewed to
the right and in the bottom diagram to the left.
10.4.34 Repeat Figure 10.9 with different sample sizes. Present a report about your con-
clusions.
10.4.35 Repeat Figure 10.9 with different corner functions. Present a report about your
conclusions.
10.4.36 Using Figures 10.8 and 10.9, propose optimal parameters for the E-estimator.
Explain your choice.
10.4.37∗ Evaluate the MISE of the proposed estimator.
10.4.38∗ Consider the case of missing CSC and propose a consistent estimator.
10.4.39 Explain the approach used for the case when (10.4.3) does not hold.
10.4.40 Are Fourier coefficients (10.4.21) deterministic or stochastic?
10.4.41 Verify each equality in (10.4.23).
10.4.42 Explain formula (10.4.25).
10.4.43∗ Find the mean and variance of the estimator (10.4.26).
10.4.44 Explain the simulation used in Figure 10.10.
10.4.45 Repeat Figure 10.10 with different parameters and write a report about your
findings.
10.4.46∗ Consider the case when the support of Z is larger than the support of X. Suggest
an E-estimator.
10.5.1 Explain the model of regression with measurement errors in predictors.
10.5.2 Present several examples of a regression with measurement errors in predictors. In
those examples, is it possible to avoid the errors?
10.5.3 What is the interpolation problem? Give an example and explain how accurately a
differentiable function may be interpolated.
10.5.4 What is the accuracy of estimation of a differentiable regression function? Why is
it worse than for the interpolation?
10.5.5∗ Modify statistic (10.5.4) by using n^{-1} ∑_{l=1}^{n} Y_l ϕ_j(U_l)/f^X(U_l). The underlying idea
of this modification is to mimic a known estimator for the case of a standard regression.
Calculate its expectation and make a conclusion about the usefulness of this modification.
10.5.6 Modify statistic (10.5.4) by using n^{-1} ∑_{l=1}^{n} Y_l ϕ_j(U_l)/f^U(U_l). Calculate its expectation
and make a conclusion about the usefulness of this modification.
10.5.7 Verify (10.5.6).
10.5.8∗ To get (10.5.7), it is assumed that ε is independent of U. First, is this a reasonable
assumption? Second, what happens if it does not hold?
10.5.9∗ To get (10.5.9), it is assumed that X and ξ are independent. Consider the case
when this assumption does not hold and propose a solution.
10.5.10 What is the assumption needed for validity of (10.5.10)?
10.5.11 Does the characteristic function completely define the distribution of a random variable? If the answer
is “yes,” show how it can be used to find the second moment.
10.5.12 Verify (10.5.12) and formulate all used assumptions.
10.5.13 Explain how the E-estimator can be used to estimate the function g(x) := f^X(x)m(x). Is
this a regular or an ill-posed problem?
10.5.14∗ Explain the estimator (10.5.15). Find its mean and variance.
10.5.15 Describe the underlying simulation of Figure 10.11.
10.5.16 Explain the procedure of regression E-estimation using diagrams in Figure 10.11.
10.5.17 Suggest better parameters of the E-estimator used in Figure 10.11.
10.5.18∗ Figure 10.11 uses the same parameters of the E-estimator for estimation of all
involved functions. Is this a good idea? Is it better to use different parameters for each
function? Support your conclusion via a numerical study based on repeated simulations.
10.5.19 To understand the effect of measurement errors, it is possible to compare ISEs for
the case of regular and MEP regressions. Use Figure 10.11 to conduct such a numerical
study.
10.5.20 Repeat Figure 10.11 a number of times and answer the following question. Based
on the simulations, is estimation of g(x) or f X (x) more critical for E-estimation?
10.5.21 Consider the same zero-mean normal distribution for ε and ξ. Assume that the
standard deviation is σ. Explain how σ affects the regression estimation.
10.5.22∗ Evaluate the MISE of the proposed estimator.
10.5.23∗ Consider the case of a linear regression with errors in predictors. Propose a con-
sistent estimator.
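To build intuition for the previous exercise, the following minimal R sketch illustrates the classical attenuation effect in a linear regression with errors in predictors and the simple method-of-moments correction available when the variance of the measurement error is known. This is only one well-known possibility, not necessarily the estimator intended by the exercise, and all distributions are arbitrary choices.

# Attenuation of the naive slope and a moment correction when Var(xi) is known.
set.seed(3)
n <- 1000
beta0 <- 1; beta1 <- 2
sigma.xi <- 0.5
X <- rnorm(n)                        # true predictor (unobserved)
U <- X + rnorm(n, sd = sigma.xi)     # observed predictor contaminated by measurement error
Y <- beta0 + beta1 * X + rnorm(n, sd = 0.3)
b.naive <- as.numeric(coef(lm(Y ~ U))[2])       # attenuated (biased toward zero) slope
reliability <- (var(U) - sigma.xi^2) / var(U)   # estimate of Var(X)/Var(U)
b.corrected <- b.naive / reliability            # corrected slope estimate
c(naive = b.naive, corrected = b.corrected)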
10.5.24∗ Consider the case of an unknown distribution of the measurement error ξ. Suggest a
possible solution to the regression problem.
10.5.25∗ Propose a regression E-estimator following the methodology of Section 2.3.
10.6.1 Explain the MEP regression with responses missing according to (10.6.2).
10.6.2 Present several examples of the MEP regression given (10.6.2).
10.6.3 Explain the motivation of the statistic (10.6.3).
10.6.4 Verify (10.6.4). What assumptions are used?
10.6.5 What are the conditions used to get (10.6.5)?
10.6.6 How can formula (10.6.5) be used to suggest a regression E-estimator?
10.6.7 How can the function g(x) := m(x)f^X(x) be estimated? What are the required assumptions?
10.6.8∗ Explain the E-estimator (10.6.8). Evaluate the MISE.
10.6.9 How is the estimator fˆX (x), used in (10.6.8), constructed? What are the assump-
tions?
10.6.10∗ Estimator (10.6.8) is based on the assumption that the characteristic function
φξ (πj) is known. What can be done if it is unknown?
10.6.11 How can the availability likelihood function, used in (10.6.8), be estimated? Is this
an ill-posed problem?
10.6.12 Explain the simulation used in Figure 10.12.
10.6.13 Explain diagrams in Figure 10.12.
10.6.14 Using Figure 10.12, make a suggestion about better parameters for the E-estimator.
10.6.15 Conduct a numerical study to understand the effect of the missing mechanism
(10.6.2) on the quality of estimation in the MEP regression. Hint: Use Figure 10.12.
10.6.16 Explain the missing mechanism (10.6.9) and present several examples.
10.6.17 What is the underlying idea of the statistic (10.6.10)?
10.6.18 Verify every equality in (10.6.11).
10.6.19 Explain each step in establishing (10.6.12). What assumptions are used?
10.6.20 Why do we introduce a new function q(x) in (10.6.13)?
10.6.21 Explain why (10.6.15) is a sample mean estimator.
10.6.22 What is the variance of estimator (10.6.15)?
10.6.23 Verify (10.6.16).
10.6.24 How can conditional density (10.6.16) be estimated?
10.6.25 Verify (10.6.17).
10.6.26 Describe the simulation used in Figure 10.13.
10.6.27 Using Figure 10.13, explain how the E-estimator performs.
10.6.28 Using Figure 10.13, develop a numerical study to explore the effect of the avail-
ability likelihood function on estimation. Write a report.
10.6.29 Suggest a numerical study, using Figure 10.13, to explore the effect of the parameter
σ on the quality of estimation. Write a report.
10.6.30 Conduct a numerical study, using Figure 10.13, to explore the effect of the sample size
on the quality of estimation. Write a report.
10.6.31∗ Explain the regression estimator (10.6.18). Evaluate its MISE.
10.6.32∗ Consider the case of an unknown distribution of ξ. Propose a consistent regression
estimator. Hint: Use an extra sample.
10.6.33∗ It is known that for a standard regression with MAR responses a complete-case
approach is consistent. Is it also the case for the MEP regression with missing response?
Hint: Consider the two missing mechanisms discussed in Section 10.6.
10.6.34∗ The equality f^{ξ|A}(z|1) = f^ξ(z) is used to establish (10.6.17). Prove this equality
and then explain its meaning.
10.7.1 What is the empirical cumulative distribution function? Is it a bona fide cumulative
distribution function?
10.7.2 Describe properties of the empirical cumulative distribution function.
10.7.3 Draw V(F̂^X(x)) as a function of x. Hint: First draw a cumulative distribution
function and then draw the corresponding variance. Pay attention to the values at the boundary
points.
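A minimal R sketch for this exercise uses the familiar fact that V(F̂^X(x)) = F^X(x)(1 − F^X(x))/n; the Normal(0,1) cumulative distribution function and n = 50 are arbitrary choices.

# Pointwise variance of the empirical cdf for a Normal(0,1) distribution.
n <- 50
x <- seq(-3, 3, length.out = 200)
Fx <- pnorm(x)                 # cumulative distribution function
Vx <- Fx * (1 - Fx) / n        # variance of the empirical cdf at each x
par(mfrow = c(2, 1))
plot(x, Fx, type = "l", ylab = "F(x)", main = "Cumulative distribution function")
plot(x, Vx, type = "l", ylab = "V(x)", main = "Variance of the empirical cdf")
# Note that the variance vanishes at the boundary points, where F(x) is 0 or 1.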
10.7.4 Explain (10.7.6).
10.7.5 Find the mean and variance of Fourier estimator (10.7.8).
10.7.6 Verify (10.7.11).
10.7.7 Do we need to assume that the density is periodic for (10.7.13) to be valid?
10.7.8 Verify (10.7.14).
10.7.9 Verify (10.7.15).
10.7.10∗ Explain how to construct an estimator of the derivative of a density. Then calculate
the MISE.
10.7.11 Verify (10.7.18). Do you need any assumptions?
10.7.12 Check validity of (10.7.20).
10.7.13 What are the assumptions of (10.7.20)?
10.7.14 Prove (10.7.22).
10.7.15 Density estimation is ill-posed relative to cumulative distribution function estimation,
and estimation of the density's derivative is ill-posed relative to density estimation.
In your opinion, which ill-posedness is more severe?
10.7.16∗ Consider the problem of estimation of the derivative of a regression function.
Propose a consistent E-estimator.
10.7.17∗ Evaluate the MISE of an E-estimator for the derivative of a regression function.
Hint: Make an assumption that simplifies the problem.
10.7.18∗ Consider the problem of estimation of derivative of the density when observations
are available with measurement errors (the setting of Section 10.1). Propose a consistent
E-estimator.
10.7.19∗ Consider the setting of Section 10.2 and propose a consistent E-estimator of the
density’s derivative.
10.7.20∗ Suppose that observations are truncated. Suggest a consistent estimator of the
density’s derivative.
10.7.21∗ Consider estimation of the kth derivative of the density. Formulate assumptions,
propose an E-estimator, and evaluate its MISE.
10.9 Notes
There are a number of good mathematical books devoted to deterministic (not stochastic)
ill-posed problems; see, for instance, the classical Tikhonov (1998) or the more recent Kabanikhin
(2011). The statistical theory of nonparametric deconvolution is considered in Efromovich and
Ganzburg (1999), Efromovich and Koltchinskii (2001), and Meister (2009).
10.1 The problem of measurement errors in density estimation is considered in Section
3.5 of Efromovich (1999a) as well as in Efromovich (1994c, 1997a). Statistical analysis of
directional data is as old as the analysis of linear data. As an example, Gauss developed
the theory of measurement errors to analyze directional measurements in astronomy, see
books by Mardia (1972) and Fisher (1993). Different methods for density deconvolution are
discussed in Wand and Jones (1995, Section 6.2.4). For more recent results and settings,
see Lepskii and Willer (2017), Pensky (2017), and Yi (2017).
10.2 Sequential estimation is a natural setting for the considered problem; see the discussion
in Efromovich (2004d, 2015, 2017).
10.3 There are a number of natural extensions of the considered problem. For instance,
hazard rate estimation for data that are censored and measured with error is discussed in Comte,
Mabon, and Samson (2017).
10.4 Interval censoring is discussed in the books by Chen, Sun and Peace (2012), Sun
and Zhao (2013), Groeneboom and Jongbloed (2014), and Klein et al. (2014).
10.5 Section 4.11 in Efromovich (1999a) considers the problem of MEP regression for
the case of the uniform design density, where it is assumed that the design density is known.
The classical book by Carroll et al. (2006) is devoted to measurement errors in nonlinear
models, and also see the more recent book by Grace (2017). For a nice and concise discussion
of the interpolation problem see Demidovich and Maron (1981). Quantile regression with
measurement errors is discussed in Chester (2017) where further references may be found.
10.6 Let us briefly explain another possible scenario in which the missing mechanism is
defined by the measurement error ξ. This is a more complicated case, and here our intent is
only to comment on a possible solution. For the considered missing mechanism we have
To understand how we may estimate the regression m(x) under this missing scenario,
let us look again at statistic (10.6.10). Write for its expectation,
This result is pivotal for the proposed solution. First of all, let us show how we can
estimate E{w(ξ)ϕj (ξ)}. Write following (10.9.2)-(10.9.3),
by the estimator

γ̂_j := [n^{-1} ∑_{l=1}^{n} A_l ϕ_j(U_l)] / θ̂_j = [φ^ξ(πj) ∑_{l=1}^{n} A_l ϕ_j(U_l)] / [∑_{l=1}^{n} ϕ_j(U_l)].    (10.9.9)
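As an illustration only, the following minimal R sketch evaluates the right-hand side of (10.9.9) for a single frequency j under convenient assumptions: the cosine basis ϕ_j(x) = √2 cos(πjx) on [0,1], a Normal(0, σ²) measurement error ξ so that φ^ξ(πj) = exp(−(πjσ)²/2), and arbitrary placeholder choices for X and the availability indicators A_l (the actual missing mechanism of this scenario is not reproduced here).

# Sketch of the ratio estimator (10.9.9) for one frequency j, under the stated assumptions.
set.seed(4)
n <- 500; j <- 2; sigma <- 0.1
X <- rbeta(n, 2, 2)                      # unobserved predictor (arbitrary choice)
U <- X + rnorm(n, sd = sigma)            # observed contaminated predictor
A <- rbinom(n, 1, 0.7)                   # placeholder availability indicators
phi <- function(j, x) sqrt(2) * cos(pi * j * x)   # cosine basis function
phi.xi <- exp(-(pi * j * sigma)^2 / 2)   # characteristic function of xi at pi*j
gamma.hat <- phi.xi * sum(A * phi(j, U)) / sum(phi(j, U))
gamma.hat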
References

Aalen, O., Borgan, O., and Gjessing, H. (2008). Survival and Event History Analysis: A Process
Point of View (Statistics for Biology and Health). New York: Springer.
Adak, S. (1998). Time-dependent spectral analysis of nonstationary time series. Journal of the
American Statistical Association 93 1488–1501.
Addison, P. (2017). The Illustrated Wavelet Transform Handbook. 2nd ed. Boca Raton: CRC Press.
Allen, J. (2017). A Bayesian hierarchical selection model for academic growth with missing data.
Applied Measurement in Education 30 147–162.
Allison, P. (2002). Missing Data. Thousand Oaks: Sage Publications.
Allison, P. (2010). Survival Analysis Using SAS. Cary: SAS Institute.
Allison, P. (2014). Event History and Survival Analysis. 2nd ed. Thousand Oaks: SAGE Publica-
tions.
Anderson, T. (1971). The Statistical Analysis of Time Series. New York: Wiley.
Andersen P.K., Borgan, O., Gill, R.D., and Keiding, N. (1993). Statistical Models Based on Counting
Processes. New York: Springer.
Antoniadis, A., Gregoire, G., and Nason, G. (1999). Density and hazard rate estimation for right-
censored data by using wavelets methods. Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 61 63–84.
Aoshima, M. and Yata, K. (2011). Two-stage procedures for high-dimensional data. Sequential
Analysis 30 356–399.
Arnab, R. (2017). Survey Sampling Theory and Applications. London: Academic Press.
Austin, P. (2017). A Tutorial on multilevel survival analysis: methods, models and applications.
International Statistical Review 85 185–203.
Baby, P. and Stoica, P. (2010). Spectral analysis of nonuniformly sampled data – a review. Digital
Signal Processing 20 359–378.
Bagkavos, D. and Patil, P. (2012). Variable bandwidths for nonparametric hazard rate estimation.
Communications in Statistics - Theory and Methods 38 1055–1078.
Baisch, S. and Bokelmann, G. (1999). Spectral analysis with incomplete time series: an example
from seismology. Computers and Geoscience 25 739–750.
Baraldi, A. and Enders, C. (2010). An introduction to modern missing data analyses. Journal of
School Psychology 48 5–37.
Bary, N.K. (1964). A Treatise on Trigonometric Series. Oxford: Pergamon Press.
Bellman, R.E. (1961). Adaptive Control Processes. Princeton: Princeton University Press.
Beran, J. (1994). Statistics for Long-Memory Processes. New York: Chapman & Hall.
Berglund, P. and Heeringa S. (2014). Multiple Imputation of Missing Data Using SAS. Cary: SAS
Institute.
Berk, R. (2016). Statistical Learning from a Regression Perspective. New York: Springer.
Bickel, P.J. and Ritov, Y. (1991). Large sample theory of estimation in biased sampling regression
models. The Annals of Statistics 19 797–816.
Bickel, P.J. and Doksum, K.A. (2007). Mathematical Statistics. 2nd ed. London: Prentice Hall.
Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. Festschrift for
Lucien Le Cam. (Pollard, D., ed.) New York: Springer, 55–87.
Bloomfield, P. (1970). Spectral analysis with randomly missing observations. Journal of Royal
Statistical Society, Ser. B 32 369–380.
Bloomfield, P. (2004). Fourier Analysis of Time Series. New York: Wiley.
Bodner, T. (2006). Missing data: Prevalence and reporting practices. Psychological Reports 99 675–
680.
Borrajo, M., González-Manteiga, W., and Martı́nez-Miranda, M. (2017). Bandwidth selection for
kernel density estimation with length-biased data. Journal of Nonparametric Statistics 29
636–668.
Bott, A. and Kohler, M. (2017). Nonparametric estimation of a conditional density. Annals of the
Institute of Statistical Mathematics 69 189–214.
Bouza-Herrera, C. (2013). Handling Missing Data in Ranked Set Sampling. New York: Springer.
Box, G., Jenkins, G., Reinsel, G., and Ljung, G. (2016). Time Series Analysis: Forecasting and
Control. 5th ed. Hoboken: Wiley.
Bremhorsta, V. and Lamberta, F. (2016). Flexible estimation in cure survival models using Bayesian
p-splines. Computational Statistics and Data Analysis 93 270–284.
Brockwell, P.J. and Davis, R.A. (1991). Time Series: Theory and Methods. 2nd ed. New York:
Springer.
Brown, L.D. and Low, M.L. (1996). Asymptotic equivalence of nonparametric regression and white
noise. The Annals of Statistics 24 2384–2398.
Brunel, E. and Comte, F. (2008). Adaptive estimation of hazard rate with censored data. Commu-
nications in Statistics - Theory and Methods 37 1284–1305.
Brunel, E., Comte, F., and Guilloux, A. (2009). Nonparametric density estimation in presence of
bias and censoring. Test 18 166–194.
Butcher, H. and Gillard, J. (2016). Simple nuclear norm based algorithms for imputing missing
data and forecasting in time series. Statistics and Its Interface 10 (1) 19–25.
Butzer, P.L. and Nessel, R.J. (1971). Fourier Analysis and Approximations. New York: Academic
Press.
Cai, T. and Low, M. (2006). Adaptive confidence balls. The Annals of Statistics 34 202–228.
Cai, T. and Guo, Z. (2017). Confidence intervals for high-dimensional linear regression: Minimax
rates and adaptivity. The Annals of Statistics 45 615–646.
Cao, R., Janssen, P., and Veraverbeke, N. (2005). Relative hazard rate estimation for right censored
and left truncated data. Test 14 257–280.
Cao, W., Tsiatis, A., and Davidian, M. (2009). Improving efficiency and robustness of the doubly
robust estimator for a population mean with incomplete data. Biometrika 96 723–734.
Carpenter, J. and Kenward, M. (2013). Multiple imputation and its application. Chichester: Wiley.
Carroll, R. and Ruppert, D. (1988). Transformation and Weighting in Regression. Boca Raton:
Chapman & Hall.
Carroll, R., Ruppert, D., Stefanski, L., and Crainceanu, C. (2006). Measurement Error in Nonlinear
Models: A Modern Perspective. 2nd ed. Boca Raton: Chapman & Hall.
Casella, G. and Berger, R. (2002). Statistical Inference. 2nd ed. New York: Duxbury.
Chan, K. (2013). Survival analysis without survival data: connecting length-biased and case-control
data. Biometrika 100 764–770.
Chaubey, Y., Chesneau, C., and Navarro, F. (2017). Linear wavelet estimation of the derivatives
of a regression function based on biased data. Communications in Statistics - Theory and
Methods 46 9541–9556.
Cheema, J. (2014). A Review of missing data handling methods in education research. Review of
Educational Research 84 487–508.
Chen, D., Sun, J., and Peace, K. (2012). Interval-Censored Time-to-Event Data: Methods and
Applications. Boca Raton: Chapman & Hall.
Chen, X. and Cai, J. (2017). Reweighted estimators for additive hazard model with censoring
indicators missing at random. Lifetime Data Analysis, https://doi.org/10.1007/s10985-017-
9398-z.
Chen, Y., Genovese, C., Tibshirani, R., and Wasserman, L. (2016). Nonparametric modal regression.
The Annals of Statistics 44 489–514.
Chen, Z., Wang, Q., Wu, D., and Fan, P. (2016). Two-dimensional evolutionary spectrum approach
to nonstationary fading channel modeling. IEEE Transactions on Vehicular Technology 65
1083-1097.
Chentsov, N.N. (1962). Estimation of unknown distribution density from observations. Soviet Math.
Dokl. 3 1559–1562.
Chentsov, N.N. (1980). Statistical Decision Rules and Optimum Inference. New York: Springer-
Verlag.
Chester, A. (2017). Understanding the effect of measurement error on quantile regressions. Journal
of Econometrics 200 223–237.
Choi, S. and Portnoy, S. (2016). Quantile autoregression for censored data. Journal of Time Series
Analysis 37 603–623.
Cohen, A.C. (1991). Truncated and Censored Samples: Theory and Applications. New York: Marcel
Dekker.
Collett, D. (2014). Modeling Survival Data in Medical Research. 3rd ed. Boca Raton: Chapman &
Hall.
Comte, F., and Rebafka, T. (2016). Nonparametric weighted estimators for biased data. Journal
of Statistical Planning and Inference 174 104–128.
Comte, F., Mabon, G., and Samson, A. (2017). Spline regression for hazard rate estimation when
data are censored and measured with error. Statistica Neerlandica 71 115–140.
Cortese, G., Holmboe, S., and Scheike, T. (2017). Regression models for the restricted residual
mean life for right-censored and left-truncated data. Statistics in Medicine 36 1803–1822.
Cox, D.R. and Oakes, D. (1984). Analysis of Survival Data. London: Chapman & Hall.
Crowder, M. (2012). Multivariate Survival Analysis and Competing Risks. Boca Raton: Chapman
& Hall.
Dabrowska, D. (1989). Uniform consistency of the kernel conditional Kaplan-Meier estimate. The
Annals of Statistics 17 1157–1167.
Daepp, M., Hamilton, M., West, G., and Bettencourt, L. (2015). The mortality of companies.
Journal of the Royal Society Interface 12 DOI: 10.1098/rsif.2015.0120.
Dahlhaus, R. (1997). Fitting time series models to nonstationary processes. The Annals of Statistics
25 1–37.
Dai, H., Restaino, M., and Wang, H. (2016). A class of nonparametric bivariate survival function
estimators for randomly censored and truncated data. Journal of Nonparametric Statistics 28
736–751.
Daniels, M. and Hogan, J. (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian
Modeling and Sensitivity Analysis. New York: Chapman & Hall.
Davey, A. and Salva, J. (2009). Statistical Power Analysis with Missing Data. London: Psychology
Press.
Davidian, M., Tsiatis, A., and Leon, S. (2005). Semiparametric estimation of treatment effect in a
pretest-posttest study with missing data. Statistical Science: A Review Journal of the Institute
of Mathematical Statistics 20 261–301.
Dean, A., Voss, D., and Draguljic, D. (2017). Design and Analysis of Experiments. 2nd ed. New
York: Springer.
De Gooijer, J. (2017). Elements of Nonlinear Time Series Analysis and Forecasting. New York:
Springer.
Dedecker, J., Doukhan P., Lang, G., Leon, J., Louhichi, S., and Prieur, C. (2007). Weak Dependence:
With Examples and Applications. New York: Springer.
Del Moral, P. and Penev, S. (2014). Stochastic Processes: From Applications to Theory. Boca Raton:
CRC Press.
Delecroix, M., Lopez, O., and Patilea, V. (2008). Nonlinear Censored Regression Using Synthetic
Data. Scandinavian Journal of Statistics 35 248–265.
Demidovich, B. and Maron, I. (1981). Computational Mathematics. Moscow: Mir.
De Una-Álvarez, J. (2004). Nonparametric estimation under length-biased and type I censoring: a
moment based approach. Annals of the Institute of Statistical Mathematics 56 667–681.
De Una-Álvarez, J. and Veraverbeke, N. (2017) Copula-graphic estimation with left-truncated and
right-censored data. Statistics 51 387–403.
DeVore, R.A. and Lorentz, G.G. (1993). Constructive Approximation. New York: Springer-Verlag.
Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. New York:
Wiley.
Devroye, L. (1987). A Course in Density Estimation. Boston: Birkhäuser.
Diggle, P.J. (1990). Time Series: A Biostatistical Introduction. Oxford: Oxford University Press.
Dobrow, R. (2016). Introduction to Stochastic Processes with R. Hoboken: Wiley.
Donoho, D. and Johnstone, I. (1995). Adapting to unknown smoothness via wavelet shrinkage.
Journal of the American Statistical Association 90 1200–1224.
Doukhan, P. (1994). Mixing: Properties and Examples. New York: Springer-Verlag.
Dryden, I.L. and Mardia, K.V. (1998). Statistical Shape Analysis. New York: Wiley.
Dunsmuir, W. and Robinson, P. (1981a). Estimation of time series models in the presence of missing
data. Journal of the American Statistical Association 76 560–568.
Dunsmuir, W. and Robinson, P. (1981b). Asymptotic theory for time series containing missing
and amplitude modulated observations. Sankhya: The Indian Journal of Statistics, ser. A 43
260–281.
Durrett, R. (2016). Essentials of Stochastic Processes. New York: Springer.
Dym, H. and McKean, H.P. (1972). Fourier Series and Integrals. London: Academic Press.
Dzhaparidze, K. (1985). Estimation of Parameters and Verification of Hypotheses in Spectral Anal-
ysis. New York: Springer.
Efromovich, S. (1980a). Information contained in a sequence of observations. Problems of Informa-
tion Transmission 15 178–189.
Efromovich, S. (1980b). On sequential estimation under conditions of local asymptotic normality.
Theory of Probability and its Applications 25 27–40.
Efromovich, S. (1984). Estimation of a spectral density of a Gaussian time series in the presence
of additive noise. Problems of Information Transmission 20 183–195.
Efromovich, S. (1985). Nonparametric estimation of a density with unknown smoothness. Theory
of Probability and its Applications 30 557–568.
Efromovich, S. (1986). Adaptive algorithm of nonparametric regression. Proc. of Second IFAC
symposium on Stochastic Control. Vilnuis: Science, 112–114.
Efromovich, S. (1989). On sequential nonparametric estimation of a density. Theory of Probability
and its Applications 34 228–239.
Efromovich, S. (1992). On orthogonal series estimators for random design nonparametric regression.
Computing Science and Statistics 24 375–379.
Efromovich, S. (1994a). On adaptive estimation of nonlinear functionals. Statistics and Probability
Letters 19 57–63.
Efromovich, S. (1994b). On nonparametric curve estimation: multivariate case, sharp-optimality,
adaptation, efficiency. CORE Discussion Papers 9418 1–35.
Efromovich, S. (1994c). Nonparametric curve estimation from indirect observations. Computing
Science and Statistics 26 196–200.
Efromovich, S. (1995a). Thresholding as an adaptive method (discussion). Journal of Royal Statis-
tical Society ser. B 57 343.
Efromovich, S. (1995b). On sequential nonparametric estimation with guaranteed precision. The
Annals of Statistics 23 1376–1392.
Efromovich, S. (1996a). On nonparametric regression for iid observations in general setting. The
Annals of Statistics 24 1126–1144.
Efromovich, S. (1996b). Adaptive orthogonal series density estimation for small samples. Compu-
tational Statistics and Data Analysis 22 599–617.
Efromovich, S. (1997a). Density estimation for the case of supersmooth measurement error. Journal
of the American Statistical Association 92 526–535.
Efromovich, S. (1997b). Robust and efficient recovery of a signal passed through a filter and then
contaminated by non-Gaussian noise. IEEE Transactions on Information Theory 43 1184–
1191.
Efromovich, S. (1997c). Quasi-linear wavelet estimation involving time series. Computing Science
and Statistics 29 127–131.
Efromovich, S. (1998a). On global and pointwise adaptive estimation. Bernoulli 4 273–278.
Efromovich, S. (1998b). Data-driven efficient estimation of the spectral density. Journal of the
American Statistical Association 93 762–770.
Efromovich, S. (1998c). Simultaneous sharp estimation of functions and their derivatives. The
Annals of Statistics 26 273–278.
Efromovich, S. (1999a). Nonparametric Curve Estimation: Methods, Theory, and Applications. New
York: Springer.
Efromovich, S. (1999b). Quasi-linear wavelet estimation. The Journal of the American Statistical
Association 94 189–204.
Efromovich, S. (1999c). How to overcome the curse of long-memory errors. IEEE Transactions on
Information Theory 45 1735–1741.
Efromovich, S. (1999d). On rate and sharp optimal estimation. Probability Theory and Related
Fields 113 415–419.
Efromovich, S. (2000a). Can adaptive estimators for Fourier series be of interest to wavelets?
Bernoulli 6 699–708.
Efromovich, S. (2000b). On sharp adaptive estimation of multivariate curves. Mathematical Methods
of Statistics 9 117–139.
Efromovich, S. (2000c). Sharp linear and block shrinkage wavelet estimation. Statistics and Prob-
ability Letters 49 323–329.
Efromovich, S. (2001a). Density estimation under random censorship and order restrictions: from
asymptotic to small samples. The Journal of the American Statistical Association 96 667–685.
Efromovich, S. (2001b). Second order efficient estimating a smooth distribution function and its
applications. Methodology and Computing in Applied Probability 3 179–198.
Efromovich, S. (2001c). Multiwavelets and signal denoising. Sankhya ser. A 63 367–393.
Efromovich, S. (2002). Discussion on random rates in anisotropic regression. The Annals of Statis-
tics 30 370–374.
Efromovich, S. (2003a). On the limit in the equivalence between heteroscedastic regression and
filtering model. Statistics and Probability Letters 63 239–242.
Efromovich, S. (2004a). Density estimation for biased data. The Annals of Statistics 32 1137–1161.
Efromovich, S. (2004b). Financial applications of sequential nonparametric curve estimation. In
Applied Sequential Methodologies, eds. N.Mukhopadhyay, S.Datta, and S.Chattopadhyay. 171–
192.
Efromovich, S. (2004c). Distribution estimation for biased data. Journal of Statistical Planning and
Inference 124 1–43.
Efromovich, S. (2004d). On sequential data-driven density estimation. Sequential Analysis Journal
23 603–624.
Efromovich, S. (2004e). Analysis of blockwise shrinkage wavelet estimates via lower bounds for
no-signal setting. Annals of the Institute of Statistical Mathematics 56 205–223.
Efromovich, S. (2004f). Oracle inequalities for Efromovich–Pinsker blockwise estimates. Methodol-
ogy and Computing in Applied Probability 6 303–322.
Efromovich, S. (2004g). Discussion on “Likelihood ratio identities and their applications to sequen-
tial analysis” by Tze L. Lai. Sequential Analysis Journal 23 517–520.
Efromovich, S. (2004h). Adaptive estimation of error density in heteroscedastic nonparametric
regression. In: Proceedings of the 2nd International workshop in Applied Probability IWAP
2004, Univ. of Piraeus, Greece, 132–135.
Efromovich, S. (2005a). Univariate nonparametric regression in the presence of auxiliary covariates.
Journal of the American Statistical Association 100 1185–1201.
Efromovich, S. (2005b). Estimation of the density of regression errors. The Annals of Statistics 33
2194–2227.
Efromovich, S. (2007a). A lower-bound oracle inequality for a blockwise-shrinkage estimate. Journal
of Statistical Planning and Inference 137 176–183.
Efromovich, S. (2007b). Universal lower bounds for blockwise-shrinkage wavelet estimation of a
spike. Journal of Applied Functional Analysis 2 317–338.
Efromovich, S. (2007c). Adaptive estimation of error density in nonparametric regression with small
sample size. Journal of Statistical Inference and Planning 137 363–378.
Efromovich, S. (2007d). Sequential design and estimation in heteroscedastic nonparametric regres-
sion. Invited paper with discussion. Sequential Analysis 26 3–25.
Efromovich, S. (2007e). Response on sequential design and estimation in heteroscedastic nonpara-
metric regression. Sequential Analysis 26 57–62.
Efromovich, S. (2007f). Optimal nonparametric estimation of the density of regression errors with
finite support. Annals of the Institute of Statistical Mathematics 59 617–654.
Efromovich, S. (2007g). Conditional density estimation. The Annals of Statistics 35 2504–2535.
Efromovich, S. (2007h). Comments on nonparametric inference with generalized likelihood ratio
tests. Test 16 465–467.
Efromovich, S. (2007i). Applications in finance, engineering and health sciences: Plenary Lecture.
Abstracts of IWSM-2007, Auburn University, 20–21.
Efromovich, S. (2008a). Optimal sequential design in a controlled nonparametric regression. Scan-
dinavian Journal of Statistics 35 266–285.
Efromovich, S. (2008b). Adaptive estimation of and oracle inequalities for probability densities and
characteristic functions. The Annals of Statistics 36 1127–1155.
Efromovich, S. (2008c). Nonparametric regression estimation with assigned risk. Statistics and
Probability Letters 78 1748–1756.
Efromovich, S. (2009a). Lower bound for estimation of Sobolev densities of order less 1/2. Journal
of Statistical Planning and Inference 139 2261–2268.
Efromovich, S. (2009b). Multiwavelets: theory and bioinformatic applications. Communications in
Statistics – Theory and Methods 38 2829–2842.
Efromovich, S. (2009c). Optimal sequential surveillance for finance, public health, and other areas:
discussion. Sequential Analysis 28 342–346.
Efromovich, S. (2010a). Sharp minimax lower bound for nonparametric estimation of Sobolev
densities of order 1/2. Statistics and Probability Letters 80 77–81.
Efromovich, S. (2010b). Oracle inequality for conditional density estimation and an actuarial ex-
ample. Annals of the Institute of Mathematical Statistics 62 249–275.
Efromovich, S. (2010c). Orthogonal series density estimation. WIREs Computational Statistics 2
467–476.
Efromovich, S. (2010d). Dimension reduction and oracle optimality in conditional density estima-
tion. Journal of the American Statistical Association 105 761–774.
Efromovich, S. (2011a). Nonparametric regression with predictors missing at random. Journal of
the American Statistical Association 106 306–319.
Efromovich, S. (2011b). Nonparametric regression with responses missing at random. Journal of
Statistical Planning and Inference 141 3744–3752.
Efromovich, S. (2011c). Nonparametric estimation of the anisotropic probability density of mixed
variables. Journal of Multivariate Analysis 102 468–481.
Efromovich, S. (2012a). Nonparametric regression with missing data: theory and applications. Ac-
tuarial Research Clearing House 1 1–15.
Efromovich, S. (2012b). Sequential analysis of nonparametric heteroscedastic regression with miss-
ing responses. Sequential Analysis 31 351–367.
Efromovich, S. (2013a). Nonparametric regression with the scale depending on auxiliary variable.
The Annals of Statistics 41 1542–1568.
Efromovich, S. (2013b). Notes and proofs for nonparametric regression with the scale depending
on auxiliary variable. The Annals of Statistics 41, 1–29.
Efromovich, S. (2013c). Adaptive nonparametric density estimation with missing observations.
Journal of Statistical Planning and Inference 143 637–650.
Efromovich, S. (2014a). On shrinking minimax convergence in nonparametric statistics. Journal of
Nonparametric Statistics 26 555–573.
Efromovich, S. (2014b). Efficient nonparametric estimation of the spectral density in the presence
of missing observations. Journal of Time Series Analysis 35 407–427.
Efromovich, S. (2014c). Nonparametric regression with missing data. Computational Statistics 6
265–275.
Efromovich, S. (2014d). Nonparametric estimation of the spectral density of amplitude-modulated
time series with missing observations, Statistics and Probability Letters 93 7–13.
Efromovich, S. (2014e). Nonparametric curve estimation with incomplete data, Actuarial Research
Clearing House 15 31–47.
Efromovich, S. (2015). Two-stage nonparametric sequential estimation of the directional density
with complete and missing observations. Sequential Analysis 34 425–440.
Efromovich, S. (2016a). Minimax theory of nonparametric hazard rate estimation: efficiency and
adaptation. Annals of the Institute of Statistical Mathematics 68 25–75.
Efromovich, S. (2016b). Estimation of the spectral density with assigned risk. Scandinavian Journal
of Statistics 43 70–82.
Efromovich, S. (2016c). What an actuary should know about nonparametric regression with missing
data. Variance 10 145–165.
Efromovich, S. (2017). Missing, modified and large-p-small-n data in nonparametric curve estima-
tion. Calcutta Statistical Association Bulletin 69 1–34.
Efromovich, S. and Baron, M. (2010). Discussion on “quickest detection problems: fifty years later”
by Albert N. Shiryaev. Sequential Analysis 29 398–403.
Efromovich, S. and Chu, J. (2018a). Efficient nonparametric hazard rate estimation with left trun-
cated and right censored data. Annals of the Institute of Statistical Mathematics, in press.
https://doi.org/10.1007/s10463-017-0617-x
Efromovich, S. and Chu, J. (2018b). Small LTRC samples and lower bounds in haz-
ard rate estimation. Annals of the Institute of Statistical Mathematics, in press.
https://doi.org/10.1007/s10463-017-0617-x
Efromovich, S. and Ganzburg, M. (1999). Best Fourier approximation and application in efficient
blurred signal reconstruction. Computational Analysis and Applications 1 43–62.
Efromovich, S., Grainger, D., Bodenmiller, D., and Spiro, S. (2008). Genome-wide identification of
binding sites for the nitric oxide sensitive transcriptional regulator NsrR. Methods in Enzy-
mology 437 211–233.
Efromovich, S. and Koltchinskii, V. (2001). On inverse problems with unknown operators. IEEE
Transactions on Information Theory 47 2876–2894.
Efromovich, S., Lakey, J., Pereyra, M.C., and Tymes, N. (2004). Data-driven and optimal denoising
of a signal and recovery of its derivative using multiwavelets. IEEE Transactions on Signal
Processing 52 628–635.
Efromovich, S. and Low, M. (1994). Adaptive estimates of linear functionals. Probability Theory
and Related Fields 98 261–275.
Efromovich, S. and Low, M. (1996a). On Bickel and Ritov’s conjecture about adaptive estimation
of some quadratic functionals. The Annals of Statistics 24 682–686.
Efromovich, S. and Low, M. (1996b). Adaptive estimation of a quadratic functional. The Annals
of Statistics 24 1106–1125.
Efromovich, S. and Pinsker M.S. (1981). Estimation of a square integrable spectral density for a
time series. Problems of Information Transmission 17 50–68.
Efromovich, S. and Pinsker M.S. (1982). Estimation of a square integrable probability density of a
random variable. Problems of Information Transmission 18 19–38.
Efromovich, S. and Pinsker M.S. (1984). An adaptive algorithm of nonparametric filtering. Au-
tomation and Remote Control 11 58–65.
Efromovich, S. and Pinsker M.S. (1986). Adaptive algorithm of minimax nonparametric estimating
spectral density. Problems of Information Transmission 22 62–76.
Efromovich, S. and Pinsker, M.S. (1989). Detecting a signal with an assigned risk. Automation and
Remote Control 10 1383–1390.
Efromovich, S. and Pinsker, M. (1996). Sharp-optimal and adaptive estimation for heteroscedastic
nonparametric regression. Statistica Sinica 6 925–945.
Efromovich, S. and Salter-Kubatko, L. (2008). Coalescent time distributions in trees of arbitrary
size. Statistical Applications in Genetics and Molecular Biology 7 1–21.
Efromovich, S. and Samarov, A. (1996). Asymptotic equivalence of nonparametric regression and
white noise model has its limits. Statistics and Probability Letters 28 143–145.
Efromovich, S. and Samarov, A. (2000). Adaptive estimation of the integral of squared regression
derivatives. Scandinavian Journal of Statistics 27 335–352.
Efromovich, S. and Smirnova, E. (2014a). Wavelet estimation: minimax theory and applications.
Sri Lankan Journal of Applied Statistics 15 17–31.
Efromovich, S. and Smirnova, E. (2014b). Statistical analysis of large cross-covariance and cross-
correlation matrices produced by fMRI Images. Journal of Biometrics and Biostatistics 5 1–8.
Efromovich, S. and Thomas, E. (1996). Application of nonparametric binary regression to evaluate
the sensitivity of explosives. Technometrics 38 50–58.
Efromovich, S. and Valdez-Jasso, Z.A. (2010). Aggregated wavelet estimation and its application
to ultra-fast fMRI. Journal of Nonparametric Statistics 22 841–857.
Efromovich, S. and Wu, J. (2017). Dynamic nonparametric analysis of nonstationary portfolio
returns and its application to VaR and forecasting. Actuarial Research Clearing House 1–25.
Efron, B. and Tibshirani, R. (1996). Using specially designed exponential families for density esti-
mators. The Annals of Statistics 24 2431–2461.
El Ghouch, A. and Van Keilegom, I. (2008). Nonparametric regression with dependent censored
data. Scandinavian Journal of Statistics 35 228–247.
El Ghouch, A. and Van Keilegom, I. (2009). Local linear quantile regression with dependent cen-
sored data. Statistica Sinica 19 1621–1640.
Enders, C. (2006). A primer on the use of modern missing-data methods in psychosomatic medicine
research. Psychosomatic Medicine 68 427–436.
Enders, C. (2010). Applied Missing Data Analysis. New York: The Guilford Press.
Eubank, R.L. (1988). Spline Smoothing and Nonparametric Regression. New York: Marcel Dekker.
Everitt, B. and Hothorn, T. (2011). An Introduction to Applied Multivariate Analysis with R. New
York: Springer.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and its Applications - Theory and Method-
ologies. New York: Chapman & Hall.
Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. New
York: Springer.
Fan, J. and Yao, Q. (2015). The Elements of Financial Econometrics. Beijing: Science Press.
Faraway, J. (2016). Extending the Linear Model with R: Generalized Linear, Mixed Effects and
Nonparametric Regression Models. 2nd ed. New York: Chapman & Hall.
Fisher, N.I. (1993). Statistical Analysis of Circular Data. Cambridge: Cambridge University Press.
Fleming, T.R. and Harrington, D.P. (2011). Counting Processes and Survival Analysis. New York:
Wiley.
Frumento, P. and Bottai, M. (2017). An estimating equation for censored and truncated quantile
regression. Computational Statistics and Data Analysis 113 53–63.
Genovese, C. and Wasserman, L. (2008). Adaptive confidence bands. The Annals of Statistics 36
875–905.
Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference.
Cambridge: Cambridge University Press.
Gill, R., Vardi, Y., and Wellner, J. (1988). Large sample theory of empirical distributions in biased
sampling models. The Annals of Statistics 16 1069–1112.
Gill, R. (2006). Lectures on Survival Analysis. New York: Springer.
Giné, E. and Nickl, R. (2010). Confidence bands in density estimation. The Annals of Statistics 38
1122–1170.
Glad, I., Hjort, N., and Ushakov, N. (2003). Correction of density estimators that are not densities.
The Scandinavian Journal of Statistics 30 415–427.
Goldstein, H., Carpenter, J., and Browne, W. J. (2014). Fitting multilevel multivariate models with
missing data in responses and covariates that may include interactions and non-linear terms.
Journal of the Royal Statistical Society, Ser. A 177 553–564.
Gou, J. and Zhang, F. (2017). Experience Simpson’s paradox in the classroom. American Statisti-
cian 71 61–66.
Grace, Y. (2017). Statistical Analysis with Measurement Error or Misclassification: Strategy,
Method and Application. New York: Springer.
Graham, J. (2012). Missing Data: Analysis and Design. New York: Springer.
Green, P. and Silverman, B. (1994). Nonparametric Regression and Generalized Linear Models: a
Roughness Penalty Approach. London: Chapman & Hall.
Greiner, A., Semmler, W., and Gong, G. (2005). The Forces of Economic Growth: A Time Series
Perspective. Princeton: Princeton University Press.
Groeneboom, P. and Jongbloed, G. (2014). Nonparametric Estimation under Shape Constraints:
Estimators, Algorithms and Asymptotics. Cambridge: Cambridge University Press.
Groves R., Dillman D., Eltinge J., and Little R. (2002). Survey Nonresponse. New York: Wiley.
Guo, S. (2010). Survival Analysis. Oxford: Oxford University Press.
Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonpara-
metric Regression. New York: Springer.
Györfi, L., Härdle, W., Sarda, P., and Vieu, P. (2013). Nonparametric Curve Estimation from Time
Series. New York: Springer.
Hagar, Y. and Dukic, V. (2015). Comparison of hazard rate estimation in R. arXiv:1509.03253v1.
Hall, P. and Hart, J.D. (1990). Nonparametric regression with long-range dependence. Stochastic
Processes and their Applications 36 339–351.
Han, P. and Wang, L. (2013). Estimation with missing data: beyond double robustness. Biometrika
100 417–430.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Härdle, W., Kerkyacharian, G., Picard, D., and Tsybakov, A. (1998). Wavelets, Approximation and
Statistical Applications. New York: Springer.
Harrell, F. (2015). Regression Modeling Strategies: With Applications to Linear Models, Logistic
and Ordinal Regression, and Survival Analysis. 2nd ed. London: Springer.
Hart, J.D. (1997). Nonparametric Smoothing and Lack-Of-Fit Tests. New York: Springer.
Hastie, T.J. and Tibshirani, R. (1990). Generalized Additive Models. London: Chapman & Hall.
Helsel, D. (2011). Statistics for Censored Environmental Data Using Minitab and R. 2nd. ed. New
York: Wiley.
Hoffmann, M. and Nickl, R. (2011). On adaptive inference and confidence bands. The Annals of
Statistics 39 2383–2409.
Hollander, M., Wolfe, D., and Chicken, E. (2013). Nonparametric Statistical Methods. New York:
Wiley.
Honaker, J. and King, G. (2010). What to do about missing values in time-series cross-section data.
American Journal of Political Science 54 561–581.
Horowitz, J. and Lee, S. (2017). Nonparametric estimation and inference under shape restrictions.
Journal of Econometrics 201 108–126.
Hosmer D., Lemeshow, S., and May, S. (2008). Applied Survival Analysis: Regression Modeling of
Time-to-Event Data. 2nd ed. New York: Wiley.
Ibragimov, I.A. and Khasminskii, R.Z. (1981). Statistical Estimation: Asymptotic Theory. New
York: Springer.
Ibragimov, I.A. and Linnik, Yu. V. (1971). Independent and Stationary Sequences of Random Vari-
ables. Groningen: Walters-Noordhoff.
Ingster, Yu. and Suslina, I. (2003). Nonparametric Goodness-of-Fit Testing Under Gaussian Models.
New York: Springer
Ivanoff, S., Picard, F., and Rivoirard, V. (2016). Adaptive Lasso and group-Lasso for functional
Poisson regression. Journal of Machine Learning Research 17 1–46.
Izbicki, R. and Lee, A. (2016). Nonparametric conditional density estimation in a high-dimensional
regression setting. Journal of Computational and Graphical Statistics 25 1297–1316.
Izenman, A. (2008). Modern Multivariate Statistical Techniques: Regression, Classification, and
Manifold Learning. New York: Springer.
Jankowski, H. and Wellner, J. (2009). Nonparametric estimation of a convex bathtub-shaped hazard
function. Bernoulli 15 1010–1035.
Jiang, J. and Hui, Y. (2004). Spectral density estimation with amplitude modulation and outlier
detection. Annals of the Institute of Statistical Mathematics 56 611–630.
Jiang, P., Liu, F., Wang, J., and Song, Y. (2016). Cuckoo search-designated fractal interpolation
functions with winner combination for estimating missing values in time series. Applied Math-
ematical Modeling 40 9692–9718.
Johnstone, I. (2017). Gaussian Estimation: Sequence and Wavelet Models. Manuscript, Stanford
University.
Kabanikhin, S. (2011). Inverse and Ill-Posed Problems: Theory and Applications. Berlin: De
Gruyter.
Kalbfleisch, J. and Prentice, R. (2002). The Statistical Analysis of Failure Time Data. 2nd ed. New
York: Springer.
Kitagawa, G. and Akaike, H. (1978). A procedure for the modeling of nonstationary time series.
Annals of the Institute of Statistical Mathematics 30 351–363.
Klein, J.P. and Moeschberger, M.L. (2003). Survival Analysis: Techniques for Censored and Trun-
cated Data. New York: Springer.
Klein, J.P., van Houwelingen, H., Ibrahim, J., and Scheike, T. (2014). Handbook of Survival Analysis.
Boca Raton: Chapman & Hall.
Kleinbaum, D. and Klein, M. (2012). Survival Analysis. 3rd ed. New York: Springer.
Klugman, S., Panjer, H., and Willmot, G. (2012). Loss Models: From Data to Decisions. 4th ed.
New York: Wiley.
Kokoszka, P. and Reimherr, M. (2017). Introduction to Functional Data Analysis. New York: Chap-
man & Hall.
Kolmogorov, A.N. and Fomin, S.V. (1957). Elements of the Theory of Functions and Functional
Analysis. Rochester: Graylock Press.
Kosorok, M. (2008). Introduction to Empirical Processes and Semiparametric Inference. New York:
Springer.
Kou, J. and Liu, Y. (2017). Nonparametric regression estimations over Lp risk based on biased
data. Communications in Statistics - Theory and Methods 46 2375-2395.
Krylov, A.N. (1955). Lectures in Approximate Computation. Moscow: Science (in Russian).
Kutner, M., Nachtsheim, C., Neter, J., and Li, W. (2005). Applied Linear Statistical Models. 5th
ed. Boston: McGraw-Hill.
Lang, K. and Little, T. (2016). Principled missing data treatments. Prevention Science 1–11,
https://doi.org/10.1007/s11121-016-0644-5.
Lawless, J., Kalbfleisch, J., and Wild, C. (1999). Semiparametric methods for response selective
and missing data problems in regression. Journal of the Royal Statistical Society B 61 413–438.
Lee, E. and Wang, J. (2013). Statistical Methods for Survival Analysis. 4th ed. New York: Wiley.
Lee, M. (2004). Strong consistency for AR model with missing data. Journal of the Korean Mathematical
Society 41 1071–1086.
Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation. New York: Springer.
Lepskii, O. and Willer, T. (2017). Lower bounds in the convolution structure density model.
Bernoulli 23 884–926.
Levit, B. and Samarov, A. (1978). Estimation of spectral functions. Problems of Information Trans-
mission 14, 61–66.
Li, J. and Ma, S. (2013). Survival Analysis in Medicine and Genetics. Boca Raton: Chapman &
Hall.
Li, Q. and Racine, J. (2007). Nonparametric Econometrics: Theory and Practice. Princeton: Prince-
ton University Press.
Liang H. and de Una-Álvarez J. (2011). Wavelet estimation of conditional density with truncated,
censored and dependent data. Journal of Multivariate Analysis 102 448–467
Little, R. and Rubin, D. (2002). Statistical Analysis with Missing Data. New York: Wiley.
Little, R. et al. (2016). The treatment of missing data in a large cardiovascular clinical outcomes
study. Clinical Trials 13 344–351.
Little, T.D., Lang, K.M., Wu, W. and Rhemtulla, M. (2016). Missing data. Developmental Psy-
chopathology: Theory and Method. (D. Cicchetti, ed.) New York: Wiley, 760–796.
Liu, X. (2012). Survival Analysis: Models and Applications. New York: Wiley
Longford, N. (2005). Missing Data and Small-Area Estimation. New York: Springer.
Lorentz, G., Golitschek, M., and Makovoz, Y. (1996). Constructive Approximation. Advanced Prob-
lems. New York: Springer-Verlag.
Lu, X. and Min, L. (2014). Hazard rate function in dynamic environment. Reliability Engineering
and System Safety 130 50–60.
Luo, X. and Tsai, W. (2009). Nonparametric estimation for right-censored length-biased data: a
pseudo-partial likelihood approach. Biometrika 96 873–886.
Mallat, S. (1998). A Wavelet Tour of Signal Processing. Boston: Academic Press.
Mardia, K.V. (1972). Statistics of Directional Data. London: Academic Press.
Martinussen, T. and Scheike, T. (2006). Dynamic Regression Models for Survival Data. New York:
Springer.
Matloff, N. (2017). Statistical Regression and Classification: From Linear Models to Machine Learn-
ing. New York: Chapman & Hall.
Matsuda, Y. and Yajima, Y. (2009). Fourier analysis of irregularly spaced data on R^d. Journal of
Royal Statistical Society, Ser. B 71 191–217.
McKnight, P., McKnight, K., Sidani, S., and Figueredo, A. (2007). Missing Data: A Gentle Intro-
duction. London: The Guilford Press.
Meister, A. (2009). Deconvolution Problems in Nonparametric Statistics. New York: Springer.
Miller, R.G. (1981). Survival Analysis. New York: Wiley.
Mills, M. (2011). Introducing Survival and Event History Analysis. Thousand Oaks: Sage.
Molenberghs, G., Fitzmaurice G., Kenward, M., Tsiatis A., and Verbeke G. (Eds.) (2014). Handbook
of Missing Data Methodology. Boca Raton: Chapman & Hall.
Molenberghs, G. and Kenward, M. (2007). Missing Data in Clinical Trials. Hoboken: Wiley.
Montgomery, D., Jennings, C., and Kulahci, M. (2016). Introduction to Time Series Analysis and
Forecasting. 2nd ed. Hoboken: Wiley.
Moore, D. (2016). Applied Survival Analysis Using R. New York: Springer.
Moore, D., McGabe, G., and Craig, B. (2009). Introduction to the Practice of Statistics. 6th ed.
New York: W.H.Freeman and Co.
Mukherjee, R. and Sen S. (2018). Optimal adaptive inference in random design binary regression.
Bernoulli 24 699–739.
Mukhopadhyay, N. and Solanky, T. (1994). Multistage Selection and Ranking Procedures. New
York: Marcel Dekker.
Müller, H. (1988). Nonparametric Regression Analysis of Longitudinal Data. Berlin: Springer-
Verlag.
Müller, H. and Wang, J. (2007). Density and hazard rate estimation. Encyclopedia of Statistics
in Quality and Reliability. (Ruggeri, F., Kenett, R. and Faltin, F., eds.) Chichester: Wiley,
517–522.
Müller, U. (2009). Estimating linear functionals in nonlinear regression with responses missing at
random. The Annals of Statistics 37 2245–2277.
Müller, U. and Schick, A. (2017). Efficiency transfer for regression models with responses missing
at random. Bernoulli 23 2693–2719.
Müller, U. and Van Keilegom, I. (2012). Efficient parameter estimation in regression with missing
responses. Electronic Journal of Statistics 6 1200–1219.
Nakagawa, S. (2015). Missing data: mechanisms, methods, and messages. Ecological Statistics:
Contemporary Theory and Application. (Fox, G., Negrete-Yankelevich, S. and Sosa, V., eds.)
Oxford: Oxford University Press, 81–105.
Nason, G. (2008). Wavelet Methods in Statistics with R. London: Springer.
Nemirovskii, A.S. (1999). Topics in Non-Parametric Statistics. New York: Springer.
Newgard, C. and Lewis, R. (2015). Missing data: how to best account for what is not known. The
Journal of American Medical Association 2015 314 940–941.
Nickolas, P. (2017). Wavelets: A Student Guide. Cambridge: Cambridge University Press.
Nikolskii, S.M. (1975). Approximation of Functions of Several Variables and Embedding Theorems.
New York: Springer-Verlag.
Ning, J., Qin, J., and Shen, Y. (2010). Nonparametric tests for right-censored data with biased
sampling. Journal of Royal Statistical Society, B 72 609–630.
Nussbaum, M. (1996). Asymptotic equivalence of density estimation and Gaussian white noise. The
Annals of Statistics 24 2399–2430.
O’Kelly, M. and Ratitch, B. (2014). Clinical Trials with Missing Data: A Guide to Practitioners.
New York: Wiley.
Parzen, E. (1963). On spectral analysis with missing observations and amplitude modulation.
Sankhya, ser. A 25 383–392.
Patil, P. (1997). Nonparametric hazard rate estimation by orthogonal wavelet method. Journal of
Statistical Planning and Inference 60 153–168.
Patil, P. and Bagkavos, D. (2012). Histogram for hazard rate estimation. Sankhya B 74 286–301.
Pavliotis, G. (2014). Stochastic Processes and Applications. New York: Springer.
Pensky, M. (2017). Minimax theory of estimation of linear functionals of the deconvolution density
with or without sparsity. The Annals of Statistics 45 1516–1541.
Petrov, V. (1975). Sums of Independent Random Variables. New York: Springer.
Pinsker, M.S. (1980). Optimal filtering a square integrable signal in Gaussian white noise. Problems
of Information Transmission 16 52–68.
Prakasa Rao, B.L.S. (1983). Nonparametric Functional Estimation. New York: Academic Press.
Priestly, M. (1965). Evolutionary spectra and non-stationary processes. Journal of the Royal Sta-
tistical Society 27 204–237.
Pukelsheim, F. (1993). Optimal Design of Experiments. New York: Wiley.
Qian, J. and Betensky, R. (2014). Assumptions regarding right censoring in the presence of left
truncation. Statistics and Probability Letters 87 12–17.
Qin, J. (2017). Biased Sampling, Over-Identified Parameter Problems and Beyond. New York:
Springer.
Rabhi, Y. and Asgharian, M. (2017). Inference under biased sampling and right censoring for a
change point in the hazard function. Bernoulli 23 2720–2745.
Raghunathan, T. (2016). Missing Data Analysis in Practice. Boca Raton: Chapman & Hall
Rio, E. (2017). Asymptotic Theory of Weakly Dependent Random Processes. New York: Springer.
Robinson, P. (2008). Correlation testing in time series, spatial and cross-sectional data. Journal of
Econometrics 147 5–16.
Rosen, O., Wood, S., and Stoffer, D. (2012). AdaptSPEC: adaptive spectral estimation for nonsta-
tionary time series. The Journal of American Statistical Association 107 1575–1589.
Ross, S. (2014). Introduction to Probability Models. 11th ed. New York: Elsevier.
Ross, S. (2015). A First Course in Probability. 9th ed. Upper Saddle River: Prentice Hall.
Royston, P. and Lambert, P. (2011). Flexible Parametric Survival Analysis Using Stata: Beyond
the Cox Model. College Station: Stata Press.
Rubin, D.B. (1976). Inference and missing data. Biometrika 63 581–590.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
Sakhanenko, L. (2015). Asymptotics of suprema of weighted Gaussian fields with applications to
kernel density estimators. Theory of Probability and Its Applications 59 415–451.
Sakhanenko, L. (2017). In search of an optimal kernel for a bias correction method for density
estimators. Statistics and Probability Letters 122 42–50.
Samorodnitsky, G. (2016). Stochastic Processes and Long Range Dependence. Cham: Springer.
Sandsten, M. (2016). Time-Frequency Analysis of Time-Varying Signals in Non-Stationary Pro-
cesses. Lund: Lund University Press.
Scheinok, P. (1965). Spectral analysis with randomly missed observations: binomial case. Annals
of Mathematical Statistics 36 971–977.
Scott, D.W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization. 2nd ed.
New York: Wiley.
Shaw, B. (2017). Uncertainty Analysis of Experimental Data with R. New York: Chapman & Hall.
Shen, Y., Ning, J., and Qin, J. (2017). Nonparametric and semiparametric regression estimation
for length-biased survival data. Lifetime Data Analysis 23 3–24.
Shi, J., Chen, X., and Zhou, Y. (2015). The strong representation for the nonparametric estimation
of length-biased and right-censored data. Statistics and Probability Letters 104 49–57.
Shumway, R. and Stoffer, D. (2017). Time Series Analysis and Its Applications with R Examples.
4th ed. New York: Springer.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. London: Chapman
& Hall.
Simonoff, J.S. (1996). Smoothing Methods in Statistics. New York: Springer.
Srivastava, A. and Klassen, E. (2016). Functional and Shape Data Analysis. New York: Springer.
Stein, C. (1945). A two-sample test for a linear hypothesis whose power is independent of the variance.
Annals of Mathematical Statistics 16 243–258.
Stoica, P. and Moses, R. (2005). Spectral Analysis of Signals. New York: Prentice Hall.
Su, Y. and Wang, J. (2012). Modeling left-truncated and right-censored survival data with longi-
tudinal covariates. The Annals of Statistics 40 1465–1488.
Sullivan, T., Lee, K., Ryan, P., and Salter, A. (2017). Treatment of missing data in follow-up
studies of randomized controlled trials: a systematic review of the literature. Clinical Trials
14 387–395.
Sun, J. and Zhao, X. (2013). The Statistical Analysis of Interval-Censored Failure Time Data. New
York: Springer.
Takezawa, K. (2005). Introduction to Nonparametric Regression. Hoboken: Wiley.
Talamakrouni, M., Van Keilegom, I., and El Ghouch, A. (2016). Parametrically guided nonpara-
metric density and hazard estimation with censored data. Computational Statistics and Data
Analysis 93 308–323.
Tan, M., Tian, G., and Ng, K. (2009). Bayesian Missing Data Problems: EM, Data Augmentation
and Noniterative Computation. Boca Raton: Chapman & Hall.
Tanaka, K. (2017). Time Series Analysis: Nonstationary and Noninvertible Distribution Theory.
New York: Wiley.
Tarczynski, A. and Allay, B. (2004). Spectral analysis of randomly sampled signals: Suppression of
aliasing and sampler jitter. IEEE Transactions on Signal Processing 52 3324–3334.
Tarter, M.E. and Lock, M.D. (1993). Model-Free Curve Estimation. London: Chapman & Hall.
Temlyakov, V.N. (1993). Approximation of Periodic Functions. New York: Nova Science Publishers.
Thompson, J.R. and Tapia, R.A. (1990). Nonparametric Function Estimation, Modeling, and Sim-
ulation. Philadelphia: SIAM.
Thompson, S. and Seber, A. (1996). Adaptive Sampling. New York: Wiley.
Tikhonov, A.N. (1998). Nonlinear Ill-Posed Problems. New York: Springer.
Tsai, W. (2009). Pseudo-partial likelihood for proportional hazards models with biased-sampling
data. Biometrika 96 601–615.
Tsay, R.S. (2005). Analysis of Financial Time Series. 2nd ed. New York: Wiley.
Tsiatis, A. (2006). Semiparametric Theory and Missing Data. New York: Springer.
Tsybakov, A. (2009). Introduction to Nonparametric Estimation. New York: Springer.
Tutz, G. and Schmid, M. (2016). Modeling Discrete Time-to-Event Data. Cham: Springer.
Tymes, N., Pereyra, M.C., and Efromovich, S. (2000). The application of multiwavelets to recovery
of signals. Computing Science and Statistics 33 234–241.
Uzunogullari, U. and Wang, J. (1992). A comparison of hazard rate estimators for left truncated
and right censored data. Biometrika 79 297–310.
van Buuren, S. (2012). Flexible Imputation of Missing Data. Boca Raton: Chapman & Hall.
van Houwelingen, H. and Putter, H. (2011). Dynamic Prediction in Clinical Survival Analysis. Boca
Raton: Chapman & Hall.
Vidakovic, B. (1999). Statistical Modeling by Wavelets. New York: Wiley.
Vorotniskaya, T. (2008). Estimates of covariance function and spectral density of stationary stochas-
tic process with Poisson gaps in observations. Vestnik 31 3–11.
Wahba, G. (1990). Spline Models for Observational Data. Philadelphia: SIAM.
Wald, A. (1947). Sequential Analysis. New York: Wiley.
Wald, A. (1950). Statistical Decision Functions. New York: Wiley.
Walter, G.G. (1994). Wavelets and Other Orthogonal Systems with Applications. London: CRC
Press.
Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. London: Chapman & Hall.
Wang, C. and Chan, K. (2018). Quasi-likelihood estimation of a censored autoregressive
model with exogenous variables. Journal of the American Statistical Association, in press.
dx.doi.org/10.1080/01621459.2017.1307115
Wang, J.-L. (2005). Smoothing hazard rate. Encyclopedia of Biostatistics, 2nd ed. (Armitage, P.
and Colton, T., eds.) Chichester: Wiley, 7 4486–4497.
Wang, M. (1996). Hazards regression analysis for length-biased data. Biometrika 83 343–354.
Wang, Y. (1995). Jump and sharp cusp detection by wavelets. Biometrika 82 385–397.
Wang, Y., Zhou, Z., Zhou, X., and Zhou, Y. (2017). Nonparametric and semiparametric estimation
of quantile residual lifetime for length-biased and right-censored data. The Canadian Journal
of Statistics 45 220–250.
Wasserman, L. (2006). All of Nonparametric Statistics. New York: Springer.
Watson, G. and Leadbetter, M. (1964). Hazard rate analysis. Biometrika 51 175–184.
Wienke, A. (2011). Frailty Models in Survival Analysis. Boca Raton: Chapman & Hall.
Wilks, S. (1962). Mathematical Statistics. New York: Wiley.
Wood, S. (2017). Generalized Additive Models: An Introduction with R. Boca Raton: Chapman &
Hall.
Woodroofe, M. (1985). Estimating a distribution function with truncated data. The Annals of
Statistics 13 163–177.
Wu, S. and Wells, M. (2003). Nonparametric estimation of hazard functions by wavelet methods.
Nonparametric Statistics 15 187–203.
Wu, W. and Zaffaroni, P. (2018). Asymptotic theory for spectral density estimates of general multi-
variate time series. Econometric Theory, in press. https://doi.org/10.1017/S0266466617000068
Yang, D. (2017). Handbook of Regression Methods. New York: Chapman & Hall.
Yang, Y. (2001). Nonparametric regression with dependent errors. Bernoulli 7 633–655.
Yi, G. (2017). Statistical Analysis with Measurement Error or Misclassification: Strategy, Method
and Application. New York: Springer.
Yoo, W. and Ghosal, S. (2016). Supremum norm posterior contraction and credible sets for non-
parametric multivariate regression. The Annals of Statistics 44 1069–1102.
Young W., Weckman G., and Holland, W. (2011). A survey of methodologies for the treatment
of missing values within datasets: Limitations and benefits. Theoretical Issues in Ergonomics
Science 12 15–43.
Zhang, F. and Zhou, Y. (2013). Analyzing left-truncated and right-censored data under Cox model
with long-term survivors. Acta Mathematicae Applicatae Sinica 29 241–252.
Zhou, M. (2015). Empirical Likelihood Method in Survival Analysis. Boca Raton: Chapman & Hall.
Zhou, X., Zhou, C., and Ding, X. (2014). Applied Missing Data Analysis in the Health Sciences.
New York: Wiley.
Zhu, T. and Politis, D. (2017). Kernel estimates of nonparametric functional autoregression models
and their bootstrap approximation. Electronic Journal of Statistics 11 2876–2906.
Zou, Y. and Liang, H. (2017). Wavelet estimation of density for censored data with censoring
indicator missing at random. Statistics: A Journal of Theoretical and Applied Statistics 51 1214–1237.
Zucchini, W., MacDonald, I., and Langrock, R. (2016). Hidden Markov Models for Time Series:
An Introduction Using R. Boca Raton: CRC Press.
Zurbenko, I. (1991). Spectral analysis of non-stationary time series. International Statistical Review
59 163–173.
Author Index
Del Moral, P., 369 González-Manteiga, W., 97
Delecroix, M., 242 Gou, J., 370
Demidovich, B., 421 Grace, Y., 421
DeVore, R.A., 65 Graham, J., 28, 142, 143
Devroye, L., 65 Grainger, D., 65
Diggle, P.J., 320, 369 Green, P., 66
Dillman, D., 142 Gregoire, G., 241
Ding, X., 28, 142 Greiner, A., 320
Dobrow, R., 369 Groeneboom, P., 29, 66, 241, 421
Doksum, K.A., 65, 240 Groves, R., 142
Donoho, D., 65 Guilloux, A., 97
Doukhan, P., 320 Guo, S., 29, 240
Draguljic, D., 370 Guo, Z., 66
Dryden, I.L., 369 Györfi, L., 65, 66, 320
Dukic, V., 241
Dunsmuir, W., 320 Härdle, W., 65
Durrett, R., 369 Hagar, Y., 241
Dym, H., 65 Hall, P., 369
Dzhaparidze, K., 320 Hamilton, M., 240
Harrell, F., 144, 240, 280
Efromovich, S., 29, 65, 66, 97, 143, 179, Harrington, D.P., 240
240, 241, 280, 320, 321, 369, 370, Hart, J.D., 65, 369
421, 422 Hastie, T.J., 66, 144
Efron, B., 65 Heeringa, S., 28, 143
El Ghouch, A., 241 Helsel, D., 321
Eltinge, J., 142 Hjort, N., 65
Enders, C., 28, 142, 179 Hoffmann, M., 66
Eubank, R.L., 66 Hogan, J., 142
Everitt, B., 66 Holland, W., 142
Hollander, W., 65
Fan, J., 144, 320, 321, 369 Holmboe, S., 242
Fan, P., 370 Honaker, J., 142
Faraway, J., 66 Horowitz, J., 66
Figueredo, A., 143 Hosmer, D., 29, 240
Fisher, N.I., 421 Hothorn, T., 66
Fitzmaurice, G., 142, 179 Hui, Y., 320
Fleming, T.R., 240
Flumento, P., 242 Ibragimov, I.A., 65, 369
Fomin, S.V., 65 Ibrahim, J., 240, 280
Ingster, Yu., 66
Ganzburg, M., 179, 421 Ivanoff, S., 144
Genovese, C., 66, 98 Izbicki, R., 144
Ghosal, S., 66, 144, 240 Izenman, A., 66, 144
Gijbels, I., 144, 321
Gill, R., 29, 97, 240, 241 Jankowski, H., 240
Gillard, J., 320 Janssen, P., 241
Giné, E., 66 Jenkins, G., 320, 369
Gjessing, H., 29, 240 Jennings, C., 320
Glad, I., 65 Jiang, J., 320
Goldstein, H., 144 Jiang, P., 320
Golitschek, M., 65, 66 Johnstone, I., 28, 65, 370
Gong, G., 320 Jones, M.C., 66, 97, 421
Jongbloed, G., 29, 66, 241, 421 Little, T.D., 280
Liu, X., 29, 240
Kabanikhin, S., 29, 421 Liu, Y., 97
Kalbfleisch, J., 29, 97, 240 Ljung, G., 320, 369
Keiding, N., 241 Lock, M.D., 65
Kenward, M., 28, 142, 179 Longford, N., 28
Kerkyacharian, G., 65 Lopez, O., 242
Khasminskii, R.Z., 65, 369 Lorentz, G.G., 65, 66
King, G., 142 Louhichi, S., 320
Kitagawa, G., 370 Low, M., 66, 179, 370
Klassen, E., 241 Lu, X., 240
Klein, J., 29, 240, 280, 421 Luo, X., 97
Klein, M., 29, 240
Kleinbaum, D., 29, 240 Müller, H., 66, 240
Klugman, S., 240 Müller, U., 143
Kohler, M., 66, 144 Ma, S., 29, 240
Kokoszka, P., 280 Mabon, G., 421
Kolmogorov, A.N., 65 MacDonald, I., 320
Koltchinskii, V., 421 Makovoz, Y., 65, 66
Kosorok, M., 240 Mallat, S., 65, 242, 369
Kou, J., 97 Mardia, K.V., 369, 421
Krylov, A.N., 65 Maron, I., 421
Krzyzak, A., 66 Martínez-Miranda, M., 97
Kulahci, M., 320 Martinussen, T., 29, 240
Kutner, M., 370 Massart, P., 65
Matloff, N., 66
Lambert, P., 29, 240 Matsuda, Y., 370
Lamberta, F., 241 May, S., 29, 240
Lang, G., 320 McGabe, G., 370
Lang, K., 142 McKean, H., 65
Langrock, R., 320 McKnight, K., 143
Lawless, J., 97 McKnight, P., 143
Leadbetter, M., 240 Meister, A., 29, 421
Lee, A., 144 Miller, R.G., 29
Lee, E., 29, 240 Mills, M., 29, 240
Lee, M., 320 Min, L., 240
Lee, P., 142 Moeschberger, M.L., 29, 240
Lee, S., 66 Molenberghs, G., 28, 142, 179
Lehmann, E.L., 98 Montgomery, D., 320
Lemeshow, S., 29, 240 Moore, D., 29, 240, 370
Leon, J., 320 Moses, R., 370
Leon, S., 143 Mukherjee, R., 66
Lepskii, O., 421 Mukhopadhyay, N., 370
Levit, B., 320
Lewis, R., 142 Nachtsheim, C., 370
Li, J., 29, 240 Nakagawa, S., 142
Li, Q., 66 Nason, G., 65, 241, 242
Li, Q., 321 Navarro, F., 97
Li, W., 370 Nemirovskii, A., 66
Liang, H., 241, 280 Nessel, R.J., 65
Linnik, Yu.V., 369 Neter, J., 370
Little, R., 28, 142, 179, 280 Newgard, C., 142
Ng, K., 28 Ryan, P., 142
Nickl, R., 66
Nickolas, P., 242 Sakhanenko, L., 65
Nikolskii, S.M., 65, 66 Salter, A., 142
Ning, J., 97 Salter-Kubatko, L., 65
Nussbaum, M., 369 Salva, J., 142
Samarov, A., 179, 320, 370
O’Kelly, M., 28 Samorodnitsky, G., 369
Oakes, D., 29, 240, 241 Samson, A., 421
Sandsten, M., 370
Panjer, H., 240 Scheike, T., 29, 240, 242, 280
Parzen, E., 320 Scheinok, P., 320
Patil, P., 240 Schick, A., 144
Patilea, V., 242 Schmid, M., 240
Pavliotis, G., 369 Scott, D.W., 65, 66
Peace, K., 29, 240, 421 Seber, A., 370
Penev, S., 369 Semmler, W., 320
Pensky, M., 421 Sen, S., 66
Pereyra, M.C., 242 Shaw, B., 180
Petrov, V., 29 Shen, Y., 97
Picard, D., 65, 144 Shi, J., 241
Pinsker, M.S., 65, 66, 179, 320, 369 Shumway, R., 320
Politis, D., 321 Sidani, S., 143
Portnoy, S., 321 Silverman, B., 65, 66, 240
Prakasa Rao, B.L.S., 98, 240, 370 Simonoff, J., 66, 98
Prentice, R., 29, 240 Smirnova, E., 65, 242, 370
Priestley, M., 370 Solanky, T., 370
Prieur, C., 320 Spiro, S., 65
Pukelsheim, F., 370 Srivastava, A., 241
Putter, H., 29, 240 Stefanski, L., 421
Stein, C., 370
Qian, J., 241 Stoffer, D., 320, 370
Qin, J., 97 Stoica, P., 320, 370
Su, Y., 242
Rabhi, Y., 241
Sullivan, T., 142
Racine, J., 66, 321
Sun, J., 29, 240, 280, 321, 421
Raghunathan, T., 28, 142, 144, 179
Suslina, I., 66
Ratitch, B., 28
Rebafka, T., 97 Takezawa, K., 66
Reimherr, M., 280 Talamakrouni, M., 241
Reinsel, G., 320, 369 Tan, M., 28
Restaino, M., 241 Tanaka, K., 369
Rhemtulla, M., 142 Tapia, R.A., 65
Rio, E., 369 Tarczynski, A., 320
Ritov, Y., 97 Tarter, M.E., 65
Rivoirard, V., 144 Temlyakov, V.N., 65, 66
Robinson, P., 320, 321 Thomas, E., 66
Rosen, O., 370 Thompson, J.R., 65
Ross, S., 14, 320 Thompson, S., 370
Royston, P., 29, 240 Tian, G., 28
Rubin, D.B., 28, 142, 179 Tibshirani, R., 65, 66, 98, 144
Ruppert, D., 66, 421 Tikhonov, A.N., 29, 421
Tsai, W., 97 Yang, D., 66
Tsay, R.S., 321, 369 Yang, Y., 369
Tsiatis, A., 28, 142, 143, 179 Yao, Q., 320, 321, 369
Tsybakov, A., 28, 65, 369 Yata, K., 370
Tutz, G., 240 Yi, G., 421
Tymes, N., 242 Yoo, W., 66
Young, W., 142
Ushakov, N., 65
Uzunogullari, U., 241 Zaffaroni, P., 320
Zhang, F., 242, 370
Valdez-Jasso, Z.A., 65, 370 Zhao, X., 280, 321, 421
van Buuren, S., 28, 143, 280 Zhou, C., 28, 142
van der Vaart, A., 144, 240 Zhou, M., 29, 240
van Houwelingen, H., 29, 240 Zhou, X., 28, 142, 241
Van Keilegom, I., 143, 241 Zhou, Y., 241, 242
Vardi, Y., 97 Zhou, Z., 241
Veraverbeke, N., 241 Zhu, T., 321
Verbeke, G., 142, 179 Zou, Y., 280
Vidakovic, B., 65, 242 Zucchini, W., 320
Vorotniskaya, T., 320, 370 Zurbenko, I., 370
Voss, D., 370
Wahba, G., 66
Wald, A., 370
Walk, H., 66
Walter, G.G., 65
Wand, M.P., 66, 97, 421
Wang, C., 242, 321
Wang, H., 241
Wang, J., 29, 240–242
Wang, M., 242
Wang, Q., 370
Wang, Y., 97, 241
Wasserman, L., 28, 65, 66, 98
Watson, G., 240
Weckman, G., 142
Wellner, J., 97, 240
Wells, M., 240
West, G., 240
Wienke, A., 240
Wild, C., 97
Wilks, S., 370
Willer, T., 421
Willmot, G., 240
Wolfe, D., 65
Wood, S., 144, 370
Woodroofe, M., 241
Wu, D., 370
Wu, J., 370
Wu, S., 240
Wu, W., 142, 320
Subject Index
Complete-case approach, 109 Laplace, 15
Conditional density, 16, 117, 217, 224 Normal, 15
Estimation for LT data, 217 Poisson, 10, 15
Estimation of, 117, 358 Uniform, 15
Conditional expectation, 16 Weibull, 184
Conditional survival function, 216, 224
Estimation for LT data, 216 E-estimation, 39
Confidence band Multivariate, 53
Pointwise, 58 E-estimator, 39
Simultaneous, 59 E-sample, 145
Confidence interval, 56 Ellipsoid
Consistent estimation, 47 Analytic, 61
Convolution, 372 Periodic function, 61
Corner (test) function, 31 Sobolev, 37
Bimodal, 32 Empirical autocovariance function, 286
Custom-made, 34 Empirical cumulative distribution func-
Normal, 32 tion, 410
Strata, 32 Empirical cutoff, 39
Uniform, 31 Estimand, 17
CSC problem, 390 Estimation of distribution
Cumulative distribution function, 14 LT, 208
Joint, 16 LTRC, 219
Current status censoring, 388 MAR indicator of censoring, 243
Cutoff, 33 MNAR indicator of censoring, 249
RC, 203
Deconvolution, 372, 375 Estimator, 17
Censored data, 386 Consistent, 18
Missing data, 379 Unbiased, 18
Deductible, 5 Example
Density estimation, 2, 38, 246 Actuarial, 5, 9, 10, 188, 193
Auxiliary variable, 156 Age and wage, 10
Complete data, 38 Biased data, 2
Dependent observations, 304 Clinical study, 193
Extra sample, 151 Epidemiological, 389
Known availability, 146 Housing starts, 8
LTRC data, 221 Intoxicated drivers, 2
MCAR data, 101 Lifetime of bulb, 188
Measurement error, 372 Mortgage loan, 188
MNAR indicator of censoring, 251 Startup, 193
RC data, 204 Expectation, 14
Derivative
Estimation of, 409 Failure rate, 183
Design Filtering signal, 330
Fixed, 48 Force of mortality, 183
Random, 48 Fourier coefficient, 19
Sequential, 359
Design density, 48 H-sample, 20, 100, 145
Directional variable, 378 Hazard rate, 183
Distribution Estimation of, 186, 188, 192, 197, 246
Bernoulli, 11, 14 Properties, 183
Binomial, 15 Ratio-estimator, 186
Exponential, 184 Histogram, 2
Ill-posed problem, 371, 399, 403 At random (MAR), 20
Indicator, 14 Batch-Bernoulli, 295, 333, 334
Indicator of censoring Completely at random (MCAR), 20,
MAR, 243 101
MNAR, 249 Destructive, 145
Inequality Markov–Bernoulli, 291, 334
Bernstein, 19 Nondestructive, 99
Cauchy, 21 Not at random (MNAR), 20
Cauchy–Schwarz, 19, 21 Poisson amplitude-modulation, 299
Chebyshev, 18 Mixing coefficient, 285
Generalized Chebyshev, 18 Mixing theory, 284
Hoeffding, 18 Mixture, 83
Jensen, 19 Mixtures regression, 83
Markov, 18 MNAR, 20
Minkowski, 21
Integrated square error (ISE), 36 Nelson–Aalen estimator, 241
Integrated squared bias (ISB), 54 Nonnegative projection, 41
Integration by parts, 36 Nonparametric autoregression, 310
Interpolation, 399 Nonparametric estimation, 38
Inverse problem, 372 Nonparametric regression, 47
Additive, 133
Kaplan–Meier estimator, 203 Bernoulli, 51
Unavailable failures, 87
Left truncation, 193 Biased predictors and responses, 74
Limit on payment, 5 Biased responses, 72
Linear regression, 7 Bivariate, 129
Long memory, 306 Censored predictors, 262
LT, 208 Censored responses, 258
LTRC, 197 Dependent responses, 323
Formulas, 198 Direct data, 7, 47
Generator, 197 Fixed-design, 48
LTRC predictors, 269 Heteroscedastic, 48
Homoscedastic, 48
M-sample, 20, 100, 145 LTRC predictors, 269
MA, 283 MAR censored responses, 253
MAR, 20 MAR predictors, 112, 258
MAR censored responses, 253 MAR responses, 107, 262, 264, 269
MAR responses, 262, 264, 269 MEP, 399
Markov chain, 292 MEP with missing responses, 404
Markov–Bernoulli missing, 292 Missing cases, 172
MCAR, 20, 101 MNAR predictors, 169
Mean integrated squared error (MISE), MNAR responses, 160
19, 38 Random-design, 48
Mean squared error (MSE), 17 RC predictors, 229
Measurement errors, 372, 399 RC responses, 225
Memory Truncated predictors, 264
Long, 283 Nonstationary amplitude-modulation, 348
Short, 283 Nonstationary autocovariance, 351
MEP regression, 400 Nonstationary spectral density, 351
Missing responses, 404 Nonstationary time series, 333
Missing Missing data, 333
Amplitude-modulated, 297 Nuisance function, 13, 85
Estimation of, 86 Standard deviation, 15
Stochastic process, 327
Optimal design density, 112 Support, 15
Optimal rate of convergence, 54 Unknown, 91
Ordered statistics, 17 Survival analysis, 7, 181
Survival function, 14
Parseval’s identity, 20 Estimation of, 185, 260
Periodogram, 286
Poisson amplitude-modulation, 299 Thresholding, 41
Poisson regression, 124 Time domain approach, 284
Principle of equivalence, 332 Time series, 281
Probability density, 15 m-dependent, 285
Mixed, 17 Amplitude-modulation, 343, 404
Projection estimator, 40 ARMA, 283
Causal, 284
R package, 21 Censored, 301
Random variable Decomposition, 333
Continuous, 15 Long-memory, 283
Discrete, 14 Nonstationary, 333
Rate of convergence Nonstationary amplitude-modulation,
MISE, 46 348
Optimal, 46, 47 Scale function, 333
RC, 188 Seasonal component, 333
Regression function, 48 Second-order stationary, 282
Repeated observations, 377 Short-memory, 283
Strictly stationary, 282
Sample
Trend, 333
Direct, 2, 17
Zero-mean, 282
Extra, 151
Trend, 333
Hidden (H-sample), 108
Estimation of, 334
Sample mean estimator, 18
Truncated predictors, 264
Plug-in, 18, 39
Truncated series, 33
Sample variance estimator
Truncation
Plug-in, 39
Left, 5, 192
Scale function, 127, 333
Estimation of, 127, 339 Unavailable failures, 87
Scattergram, 49
Seasonal component, 333 Variance, 14
Estimation of, 338
Period, 286 Weak dependence, 306
Sequential design, 359 White noise, 328
Shape, 297 Frequency limited, 328
Short memory, 306 Wiener process, 328
Simpson’s paradox, 355 Wrapped density, 378
Sobolev class, 37
Software, 21
Spatial data, 281
Spectral density, 286
ARMA, 287
Estimation of, 286
Nonstationary, 351
Shape, 297