
Biometrika (1976), 63, 3, pp. 581-92
Printed in Great Britain

Inference and missing data


BY DONALD B. RUBIN
Educational Testing Service, Princeton, New Jersey

SUMMARY
When making sampling distribution inferences about the parameter of the data, θ, it is
appropriate to ignore the process that causes missing data if the missing data are 'missing
at random' and the observed data are 'observed at random', but these inferences are
generally conditional on the observed pattern of missing data. When making direct-
likelihood or Bayesian inferences about θ, it is appropriate to ignore the process that causes
missing data if the missing data are missing at random and the parameter of the missing data
process is 'distinct' from θ. These conditions are the weakest general conditions under which
ignoring the process that causes missing data always leads to correct inferences.

Some key words: Bayesian inference; Incomplete data; Likelihood inference; Missing at random;
Missing data; Missing values; Observed at random; Sampling distribution inference.

1. INTRODUCTION: THE GENERALITY OF THE PROBLEM OF MISSING DATA
The problem of missing data arises frequently in practice. For example, consider a large
survey of families conducted in 1967 with many socioeconomic variables recorded, and a
follow-up survey of the same families in 1970. Not only is it likely that there will be a few
missing values scattered throughout the data set, but also it is likely that there will be a large
block of missing values in the 1970 data because many families studied in 1967 could not be
located in 1970. Often, the analysis of data like these proceeds with an assumption, either
implicit or explicit, that the process that caused the missing data can be ignored. The
question to be answered here is: when is this the proper procedure?
The statistical literature on missing data does not answer this question in general. In most
articles on unintended missing data, the process that causes missing data is ignored after
being assumed accidental in one sense or another. In some articles such as those concerned
with the multivariate normal (Afifi & Elashoff, 1966; Anderson, 1957; Hartley & Hocking,
1971; Hocking & Smith, 1968; Wilks, 1932), the assumption about the process that causes
missing data seems to be that each value in the data set is equally likely to be missing. In
other articles such as those dealing with the analysis of variance (Hartley, 1956; Healy &
Westmacott, 1956; Rubin, 1972, 1976; Wilkinson, 1958), the assumption seems to be that
values of the dependent variables are missing without regard to values that would have
been observed.
The statistical literature also discusses missing data that arise intentionally. In these
cases, the process that causes missing data is generally considered explicitly. Some examples
of methods that intentionally create missing data are: a preplanned multivariate experi-
mental design (Hocking & Smith, 1972; Trawinski & Bargmann, 1964); random sampling
from a finite population, i.e. the values of variables for unsampled units being missing
(Cochran, 1963, p. 18); randomization in an experiment, where, for each unit, the values
that would have been observed had the unit received a different treatment are missing
(Kempthorne, 1952, p. 137; Rubin, 1975); sequential stopping rules, where the values after
the last one observed are missing (Lehmann, 1959, p. 97), and even some 'robust analyses',
where observed values are considered outliers and so discarded or made missing.

2. OBJECTIVES AND BROAD REVIEW


Our objective is to find the weakest simple conditions on the process that causes missing
data such that it is always appropriate to ignore this process when making inferences about
the distribution of the data. The conditions turn out to be rather intuitive as well as non-
parametric in the sense that they are not tied to any particular distributional form. Thus
they should prove helpful for deciding in practical problems if the process that causes
missing data can be ignored.
Section 3 gives the notation for the random variables: θ is the parameter of the data, and
φ is the parameter of the missing-data process, i.e. the parameter of the conditional distribu-
tion of the missing-data indicator given the data. Section 4 presents examples of processes
that cause missing data.
Section 5 shows that when the process that causes missing data is ignored, the missing-
data indicator random variable is simply fixed at its observed value. Whether this corre-
sponds to proper conditioning depends on the method of inference and three conditions on
the process that causes missing data. These conditions place no restrictions on the missing-
data process for patterns of missing data other than the observed pattern. Their formal
definitions correspond to the following statements.
The missing data are missing at random if, for each possible value of the parameter φ, the conditional
probability of the observed pattern of missing data, given the missing data and the value of the
observed data, is the same for all possible values of the missing data.
The observed data are observed at random if, for each possible value of the missing data and the
parameter φ, the conditional probability of the observed pattern of missing data, given the missing
data and the observed data, is the same for all possible values of the observed data.
The parameter φ is distinct from θ if there are no a priori ties, via parameter space restrictions or
prior distributions, between φ and θ.
Sections 6, 7 and 8 use these definitions to prove that ignoring the process that causes
missing data when making sampling distribution inferences about θ is appropriate if the
missing data are missing at random and the observed data are observed at random, but the
resulting inferences are generally conditional on the observed pattern of missing data.
Further, ignoring the process that causes missing data when making direct-likelihood or
Bayesian inferences about θ is appropriate if the missing data are missing at random and
φ is distinct from θ.
Other results show that these conditions are the weakest simple and general conditions
under which it is always appropriate to ignore the process that causes missing data. The
reader not interested in the formal details should be able to skim §§ 3-8 and proceed to § 9.
Section 9 uses these results to highlight the distinctions between the sampling distribution
and the likelihood-Bayesian approaches to the problem of missing data. Section 10 con-
cludes the paper with the suggestion that in many practical problems, Bayesian and
likelihood inferences are less sensitive than sampling distribution inferences to the process
that causes missing data.
Throughout, measure-theoretic considerations about sets of probability zero are ignored.

3. NOTATION FOR THE RANDOM VARIABLES


Let U = (U_1, ..., U_n) be a vector random variable with probability density function f_θ.
The objective is to make inferences about θ, the vector parameter of this density. Often in
practice, the random variable U will be arranged in a 'units' by 'variables' matrix. Let
M = (M_1, ..., M_n) be the associated 'missing-data indicator' vector random variable, where
each M_i takes the value 0 or 1. The probability that M takes the value m = (m_1, ..., m_n) given
that U takes the value u = (u_1, ..., u_n) is g_φ(m|u), where φ is the nuisance vector parameter
of the distribution.
The conditional distribution g_φ corresponds to 'the process that causes missing data':
if m_i = 1, the value of the random variable U_i will be observed, while if m_i = 0, the value
of U_i will not be observed. More precisely, define the extended vector random variable
V = (V_1, ..., V_n) with range extended to include the special value * for missing data:
v_i = u_i (m_i = 1), and v_i = * (m_i = 0). The values of the random variable V are observed,
not those of the random variable U, although it is desired to make inferences about the
distribution of U.

4. EXAMPLES OF PROCESSES THAT CAUSE MISSING DATA


In order to clarify the notation in § 3 we give four examples.
Example 1. Suppose there are n samples of an alloy and on each we attempt to record some
characteristic by an instrument that has a constant probability, φ, of failing to record the
result for all possible samples. Then

g_φ(m|u) = ∏_{i=1}^{n} φ^(1−m_i) (1−φ)^(m_i).
Example 2. Let u_i be the value of blood pressure for the ith subject (i = 1, ..., n) in a
hospital survey. Suppose v_i = * if u_i is less than φ, which equals the mean blood pressure in
the population; i.e. we only record blood pressure for subjects whose blood pressures are
greater than average. Then

g_φ(m|u) = ∏_{i=1}^{n} δ{γ(u_i − φ) − m_i},

where γ(a) = 1 if a > 0 and 0 otherwise; δ(a) = 1 if a = 0 and 0 otherwise.
Example 3. Observations are taken in sequence until a particular function of the observa-
tions is in a specified critical region G. Here n is essentially infinite and, for some n_1 which
is a function of the observations, v_i ≠ * (i ≤ n_1), and v_i = * (i > n_1). Thus

g_φ(m|u) = ∏_{i=1}^{n_1} δ(1 − m_i) ∏_{i=n_1+1}^{n} δ(m_i),

where n_1 is the minimum k such that the function Φ(u_1, ..., u_k) is in G.
Example 4. Let n = 2. If u_1 > 0: with probability φ, v_1 ≠ * and v_2 = *; and with proba-
bility 1 − φ, v_1 ≠ * and v_2 ≠ *. If u_1 < 0: with probability φ, v_1 ≠ * and v_2 = *; and with
probability 1 − φ, v_1 = * and v_2 ≠ *. Thus

g_φ(m|u) = φ                        if m = (1, 0),
           (1 − φ) γ(u_1)           if m = (1, 1),
           (1 − φ){1 − γ(u_1)}      if m = (0, 1),
           0                        if m = (0, 0).
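The short Python sketch below is not part of the paper; it simply writes each of these four processes as a function returning g_φ(m|u), the probability of a missingness pattern m given the complete data u. The function names and the critical-region argument used for Example 3 are illustrative choices.

```python
# Hypothetical illustration of the four processes in Section 4; not from the paper.
import numpy as np

def g_example1(m, u, phi):
    # Each value fails to be recorded with constant probability phi, independently of u:
    # g(m|u) = prod_i phi^(1-m_i) * (1-phi)^m_i.
    m = np.asarray(m)
    return float(np.prod(phi ** (1 - m) * (1 - phi) ** m))

def g_example2(m, u, phi):
    # Blood pressure u_i is recorded exactly when it exceeds phi (the population mean),
    # so a pattern has probability 1 if it matches the indicator of u_i > phi, else 0.
    m, u = np.asarray(m), np.asarray(u)
    return float(np.all(m == (u > phi).astype(int)))

def g_example3(m, u, in_critical_region):
    # Sequential stopping: observe u_1, u_2, ... until a statistic of the observations
    # first falls in the critical region G; everything after that point is missing.
    m, u = np.asarray(m), np.asarray(u)
    n1 = next(k for k in range(1, len(u) + 1) if in_critical_region(u[:k]))
    return float(np.all(m == np.array([1] * n1 + [0] * (len(u) - n1))))

def g_example4(m, u, phi):
    # n = 2: with probability phi only the first slot is displayed; otherwise which
    # slot is displayed depends on the sign of u_1.
    m = tuple(m)
    if m == (1, 0):
        return phi
    if m == (1, 1):
        return (1 - phi) * (1.0 if u[0] > 0 else 0.0)
    if m == (0, 1):
        return (1 - phi) * (1.0 if u[0] <= 0 else 0.0)
    return 0.0                      # the pattern (0, 0) never occurs

print(g_example1([1, 0, 1], [1.2, 0.3, -0.7], phi=0.2))   # 0.8 * 0.2 * 0.8 = 0.128
print(g_example2([1, 0], [2.0, 0.5], phi=1.0))            # 1.0: pattern matches u > phi
print(g_example4([1, 1], [0.4, -0.2], phi=0.3))           # 0.7: u_1 > 0, both displayed
```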

5. IGNORING THE PROCESS THAT CAUSES MISSING DATA


Let ṽ = (ṽ_1, ..., ṽ_n) be a particular sample realization of V, i.e. each ṽ_i is either a known
number or a missing value, *. These observed values imply an observed value for the random
variable M, m̃ = (m̃_1, ..., m̃_n), and imply observed values for some of the scalar random
variables in U. That is, if ṽ_i is a number, then the observed value of M_i, m̃_i, is one, and the
observed value of U_i, ũ_i, equals ṽ_i; if ṽ_i = *, then m̃_i = 0 and the value of U_i is not known;
in special cases, knowing values in ṽ may imply observed values for some U_i with ṽ_i = *,
for example if f_θ specifies u_1 = u_2 + u_3 and we observe ṽ_1 = *, ṽ_2 = 3.1 and ṽ_3 = 5.2.

Table 1. Classifying the examples in § 4

Example   Missing data, missing at random   Observed data, observed at random   φ distinct from θ
1         Always MAR                        Always OAR                          Always distinct
2         MAR only if all m_i = 1           OAR only if all m_i = 0             Distinct only if mean blood pressure
                                                                                in the population is known a priori
3         Always MAR                        Never OAR                           Always distinct
4         MAR unless m = (0, 1)             OAR unless m = (1, 1)               Distinct if a priori φ is not
                                                                                restricted by θ

Hence, the observed value of M, namely m̃, effects a partition of each of the vectors of
random variables and the vectors of observed values into two vectors corresponding to
m̃_i = 0 for missing data and m̃_i = 1 for observed data. For convenience write

V = (V_(0), V_(1)),   U = (U_(0), U_(1)),   ṽ = (ṽ_(0), ṽ_(1)),

where by definition ṽ_(0) = (*, ..., *) and ũ_(1) = ṽ_(1). It is important to remember that these
partitions are those corresponding to m = m̃, the observed pattern of missing data. For
further notational convenience, we let u = (u_(0), ũ_(1)); u consists of a vector of arguments, u_(0),
corresponding to unobserved random variables, and a vector of known numbers, ũ_(1) = ṽ_(1),
corresponding to values of observed random variables.
The objective is to use ṽ, or equivalently m̃ and ũ_(1), to make inferences about θ. It is com-
mon practice to ignore the process that causes missing data when making these inferences.
Ignoring the process that causes missing data means proceeding by: (a) fixing the random
variable M at the observed pattern of missing data, m̃, and (b) assuming that the values of
the observed data, ũ_(1), arose from the marginal density of the random variable U_(1):

∫ f_θ(u) du_(0).    (5.1)


The central question here concerns the weakest simple conditions on g_φ such that ignoring
the process that causes missing data will always yield proper inferences about θ.
Three conditions are relevant to answering this question. These conditions place no
restrictions on g_φ(m|u) for values of m other than m̃.

Definition 1. The missing data are missing at random if for each value of φ, g_φ(m̃|u) takes
the same value for all u_(0).
Definition 2. The observed data are observed at random if for each value of φ and u_(0),
g_φ(m̃|u) takes the same value for all ũ_(1).
Definition 3. The parameter φ is distinct from θ if their joint parameter space factorizes
into a φ-space and a θ-space, and when prior distributions are specified for φ and θ, if these
are independent.
Table 1 classifies the four examples of § 4 in terms of these definitions.

6. MISSING DATA AND SAMPLING DISTRIBUTION INFERENCE


A sampling distribution inference is an inference that results solely from comparing the
observed value of a statistic, e.g. an estimator, test criterion or confidence interval, with the
sampling distribution of that statistic under various hypothesized underlying distributions.
Within the context of sampling distribution inference, the parameters θ and φ have fixed
hypothesized values.
Ignoring the process that causes missing data when making a sampling distribution infer-
ence about the true value of θ means comparing the observed value of some vector statistic
S(ṽ), equivalently S(m̃, ũ_(1)), to the distribution of S(ṽ) found from f_θ. More precisely, the
sampling distribution of S(ṽ) ignoring the process that causes missing data is found by
fixing M at the observed m̃ and assuming that the sampling distribution of the observed data
follows from density (5.1). The problem with this approach is that for the fixed m̃, the
sampling distribution of the observed data, ũ_(1), does not follow from (5.1), which is the
marginal density of U_(1), but from the conditional density of U_(1) given that the random
variable M took the value m̃:

∫ f_θ(u) g_φ(m̃|u) du_(0) / h_{θ,φ}(m̃),    (6.1)

where h_{θ,φ}(m̃) = ∫ f_θ(u) g_φ(m̃|u) du, which is the marginal probability that M takes the
value m̃. Hence, the correct sampling distribution of S(ṽ) depends in general not only on the
fixed hypothesized f_θ but also on the fixed hypothesized g_φ.
THEOREM 6.1. Suppose that (a) the missing data are missing at random and (b) the observed
data are observed at random. Then the sampling distribution of S(ṽ) under f_θ ignoring the process
that causes missing data, i.e. calculated from density (5.1), equals the correct conditional sampling
distribution of S(ṽ) given m̃ under f_θ g_φ, that is calculated from density (6.1), assuming
h_{θ,φ}(m̃) > 0.
Proof. Under conditions (a) and (b), for each value of φ, g_φ(m̃|u) takes the same value for
all u; notice that this does not imply U and M are independently distributed unless it holds
for all possible m̃. Hence h_{θ,φ}(m̃) = g_φ(m̃|u), and thus the distribution of every statistic under
density (5.1) is the same as under density (6.1).
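Theorem 6.1 is easy to check by simulation. The sketch below is not from the paper; it uses the Example 1 process, in which each value is lost with a fixed probability φ independently of u, so that both conditions hold, and the particular θ, φ and pattern m̃ are arbitrary illustrative choices.

```python
# Monte Carlo check (hypothetical illustration) that, under MAR + OAR, the distribution of a
# statistic computed as if the observed components were draws from their marginal density
# matches the correct distribution conditional on the observed pattern m-tilde.
import numpy as np

rng = np.random.default_rng(0)
theta, phi, n = 2.0, 0.3, 5
m_tilde = np.array([1, 0, 1, 1, 0])              # observed pattern
n_sims = 200_000

# (i) Ignoring the process: observed components are simply draws from f_theta.
u = rng.normal(theta, 1.0, size=(n_sims, n))
stat_ignoring = u[:, m_tilde == 1].mean(axis=1)

# (ii) Correct conditional distribution: simulate (U, M) jointly, keep replications
# whose pattern equals m_tilde.  Here P(M_i = 0) = phi independently of U.
u2 = rng.normal(theta, 1.0, size=(n_sims, n))
m = (rng.random((n_sims, n)) > phi).astype(int)
keep = np.all(m == m_tilde, axis=1)
stat_conditional = u2[keep][:, m_tilde == 1].mean(axis=1)

# The two empirical distributions agree up to Monte Carlo error.
print(stat_ignoring.mean(), stat_ignoring.std())
print(stat_conditional.mean(), stat_conditional.std())
```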
THEOREM 6.2. The sampling distribution of S(ṽ) under f_θ calculated by ignoring the process
that causes missing data equals the correct conditional sampling distribution of S(ṽ) given m̃
under f_θ g_φ for every S(ṽ), if and only if

∫ f_θ(u) g_φ(m̃|u) du_(0) / ∫ f_θ(u) du_(0) = h_{θ,φ}(m̃).    (6.2)

Proof. The sampling distribution of every S(ṽ) found from density (5.1) will be identical
to that found from density (6.1) if and only if these two densities are equal. This equality
may be written as equation (6.2) by dividing by (5.1) and multiplying by h_{θ,φ}(m̃).
The phrase 'ignoring the process that causes missing data when making sampling distri-
bution inferences' may suggest not only calculating sampling distributions with respect to
density (5.1) but also interpreting the resulting sampling distributions as unconditional
rather than conditional on m̃.

THEOREM 6.3. The sampling distribution of S(ṽ) under f_θ calculated ignoring the process
that causes missing data equals the correct unconditional sampling distribution of S(ṽ) under
f_θ g_φ if and only if g_φ(m̃|u) = 1.
Proof. The sufficiency is immediate. To establish the necessity consider the statistic
S(ṽ) = 1 if m = m̃ and 0 otherwise.

7. MISSING DATA AND DIRECT-LIKELIHOOD INFERENCE


A direct-likelihood inference is an inference that results solely from ratios of the likelihood
function for various values of the parameter (Edwards, 1972). Within the context of direct-
likelihood inference, θ and φ take values in a joint parameter space Ω_{θ,φ}.
Ignoring the process that causes missing data when making a direct-likelihood inference
for θ means defining a parameter space for θ, Ω_θ, and taking ratios, for various θ ∈ Ω_θ, of the
'marginal' likelihood function based on density (5.1):

L(θ|ṽ) = δ(θ, Ω_θ) ∫ f_θ(u) du_(0),    (7.1)

where δ(a, Ω) is the indicator function of Ω. Likelihood (7.1) is regarded as a function of θ
given the observed m̃ and ũ_(1).
The problem with this approach is that M is a random variable whose value is also
observed, so that the actual likelihood is the joint likelihood of the observed data ũ_(1) and m̃:

L(θ, φ|ṽ) = δ{(θ, φ), Ω_{θ,φ}} ∫ f_θ(u) g_φ(m̃|u) du_(0),    (7.2)

regarded as a function of θ, φ given the observed ũ_(1) and m̃.


THEOREM 7.1. Suppose (a) that the missing data are missing at random, and (b) that φ is
distinct from θ. Then the likelihood ratio ignoring the process that causes missing data, that is
L(θ_1|ṽ)/L(θ_2|ṽ), equals the correct likelihood ratio, that is L(θ_1, φ|ṽ)/L(θ_2, φ|ṽ), for all φ ∈ Ω_φ
such that g_φ(m̃|ṽ) > 0.

Proof. Conditions (a) and (b) imply from equations (7.1) and (7.2) that

L(θ, φ|ṽ) = g_φ(m̃|ṽ) δ(φ, Ω_φ) L(θ|ṽ).
THEOREM 7.2. Suppose L(θ|ṽ) > 0 for all θ ∈ Ω_θ. All likelihood ratios for θ ∈ Ω_θ ignoring
the process that causes missing data are correct for all φ ∈ Ω_φ, if and only if (a) Ω_{θ,φ} = Ω_θ × Ω_φ,
and (b) for each φ ∈ Ω_φ, E{g_φ(m̃|u) | m̃, ũ_(1), θ, φ} takes the same positive value for all θ ∈ Ω_θ.
Proof. First we show that

L(θ, φ|ṽ) = E{g_φ(m̃|u) | m̃, ũ_(1), θ, φ} δ{(θ, φ), Ω_{θ,φ}} L(θ|ṽ).    (7.3)

This is immediate if L(θ|ṽ) > 0 for all θ ∈ Ω_θ, and is true otherwise because L(θ|ṽ) = 0 implies
L(θ, φ|ṽ) = 0 for all θ, φ and ṽ. If conditions (a) and (b) hold, (7.2) factorizes into a θ-factor
and a φ-factor; thus these conditions are sufficient even if L(θ|ṽ) = 0 for some θ ∈ Ω_θ.
Now consider the necessity of conditions (a) and (b). Since L(θ|ṽ) > 0 for all θ ∈ Ω_θ, if the
likelihood ratios for θ ignoring the process that causes missing data are correct for all φ ∈ Ω_φ,
then for each (θ, φ) ∈ Ω_θ × Ω_φ we have L(θ, φ|ṽ) > 0. Hence condition (a) in the theorem is
necessary. Now using condition (a) and (7.3) write, for all θ_1, θ_2 ∈ Ω_θ and φ ∈ Ω_φ,

L(θ_1, φ|ṽ)/L(θ_2, φ|ṽ) = [L(θ_1|ṽ) E{g_φ(m̃|u) | m̃, ũ_(1), θ_1, φ}] / [L(θ_2|ṽ) E{g_φ(m̃|u) | m̃, ũ_(1), θ_2, φ}].    (7.4)

If (7.4) equals L(θ_1|ṽ)/L(θ_2|ṽ) for all θ_1, θ_2 ∈ Ω_θ and all φ ∈ Ω_φ, we have condition (b) in the
theorem.
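To make Theorem 7.1 concrete, the sketch below, which is not part of the paper, uses the Example 1 process: the factor g_φ(m̃|u) reduces to a constant not involving θ or u, so the full log likelihood from (7.2) and the log likelihood from (7.1) that ignores the process differ by a constant and give identical likelihood ratios for θ. The observed values and the counts of missing values are arbitrary.

```python
# Hypothetical illustration of Theorem 7.1 under the Example 1 process; not from the paper.
import numpy as np
from scipy.stats import norm

u_obs = np.array([4.8, 5.3, 5.1])          # observed components u_(1)
phi = 0.3                                  # known chance that any one value is lost
n_missing = 2                              # two of the five attempted values were lost

def loglik_ignoring(theta):
    # log of (7.1): the marginal density of the observed components under f_theta.
    return norm.logpdf(u_obs, loc=theta, scale=1.0).sum()

def loglik_full(theta):
    # log of (7.2): g_phi(m|u) = phi^2 (1-phi)^3 does not involve u or theta.
    return loglik_ignoring(theta) + n_missing * np.log(phi) + len(u_obs) * np.log(1 - phi)

for t1, t2 in [(5.0, 4.0), (5.2, 4.9)]:
    print(loglik_ignoring(t1) - loglik_ignoring(t2),   # log likelihood ratios agree
          loglik_full(t1) - loglik_full(t2))
```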

8. MISSING DATA AND BAYESIAN INFERENCE


A Bayesian inference is an inference that results solely from posterior distributions corre-
sponding to specified prior distributions, e.g. the posterior mean and variance of a parameter
having a specified prior distribution. Within the context of Bayesian inference, θ and φ are
random variables whose marginal distribution is specified by the product of the prior
densities, p(θ) p(φ|θ).
Bayesian inference for θ ignoring the process that causes missing data means choosing
p(θ) and assuming that the observed data, ũ_(1), arose from density (5.1). Hence the posterior
distribution of θ ignoring the process that causes missing data is proportional to

p(θ) ∫ f_θ(u) du_(0).    (8.1)

The problem with this approach is that the random variable M is being fixed at m̃ and thus
is being implicitly conditioned upon without being explicitly conditioned upon. That is,
correct conditioning on both the observed data, ũ_(1), and on the observed pattern of missing
data, m̃, leads to the joint posterior distribution of θ and φ, which is proportional to

p(θ) p(φ|θ) ∫ f_θ(u) g_φ(m̃|u) du_(0).    (8.2)
THEOREM 8.1. Suppose (a) that the missing data are missing at random, and (b) that φ is
distinct from θ. Then the posterior distribution of θ ignoring the process that causes missing
data, i.e. calculated from equation (8.1), equals the correct posterior distribution of θ, that is cal-
culated from (8.2), and the posterior distributions for θ and φ are independent.
Proof. By conditions (a) and (b), equation (8.2) equals {p(θ) ∫ f_θ(u) du_(0)} {p(φ) g_φ(m̃|ṽ)}.
THEOREM 8.2. The posterior distribution of θ ignoring the process that causes missing data
equals the correct posterior distribution of θ if and only if

∫ p(φ|θ) {∫ f_θ(u) g_φ(m̃|u) du_(0) / ∫ f_θ(u) du_(0)} dφ    (8.3)

takes a constant positive value.


Proof. The posterior distribution of θ is proportional to (8.2) integrated over φ. This can
be written as

{p(θ) ∫ f_θ(u) du_(0)} {∫ p(φ|θ) ∫ f_θ(u) g_φ(m̃|u) du_(0) dφ / ∫ f_θ(u) du_(0)}.    (8.4)

Expressions (8.4) and (8.1) yield the same distribution for θ if and only if they are equal.
Hence, the second factor in (8.4), which is expression (8.3), must take a constant positive
value.
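A small grid-posterior sketch, which is not in the paper, makes Theorems 8.1 and 8.2 concrete: with the Example 1 process and a known φ distinct from θ, the factor contributed by the missing-data process is a constant and cancels in the normalization, whereas a display probability tied to θ, as in the example of § 9, shifts the posterior. The observed values, the grid and the prior range are arbitrary choices.

```python
# Hypothetical grid-posterior illustration of Theorems 8.1 and 8.2; not from the paper.
import numpy as np
from scipy.stats import norm

u_obs = np.array([4.8, 5.3, 5.1, 4.6])        # observed values; two others are missing
n_obs, n_miss = len(u_obs), 2
theta_grid = np.linspace(0.1, 10.0, 2000)     # flat prior on theta > 0

def posterior_mean(weights):
    w = weights / weights.sum()
    return (theta_grid * w).sum()

lik_ignoring = np.array([norm.pdf(u_obs, t, 1.0).prod() for t in theta_grid])

# Example 1 process with known phi = 0.3 distinct from theta: g_phi(m|u) is a constant
# in theta, so the posterior for theta is unchanged (Theorem 8.1).
g_constant = 0.3 ** n_miss * 0.7 ** n_obs
print(posterior_mean(lik_ignoring))
print(posterior_mean(lik_ignoring * g_constant))

# Non-distinct process: each value is displayed with probability theta/(1+theta); the extra
# factor depends on theta, the condition of Theorem 8.2 fails, and the posterior shifts.
phi_t = theta_grid / (1.0 + theta_grid)
print(posterior_mean(lik_ignoring * phi_t ** n_obs * (1 - phi_t) ** n_miss))
```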

9. COMPARING INFERENCES IN A SIMPLE EXAMPLE


Suppose that we want to estimate the weight of an object, say θ, using a scale that has a
digital display, including a sign bit. The weighing mechanism has a known normal error
distribution with mean zero and variance one. We propose to weigh the object ten times and
so obtain ten independent, identically distributed observations from N(θ, 1). A colleague

tells us that in his experience sometimes no value will be displayed. Nevertheless in our ten
weighings we obtain ten values whose average is 5.0.
Let us first ignore the process that causes missing data. This might seem especially reason-
able since there are in fact no missing data. Under f_θ, the sampling distribution of the sample
average, 5.0, is N(θ, 0.1), and with a flat prior on θ > 0 the posterior distribution of θ is
approximately N(5.0, 0.1). Also, 5.0 is the maximum likelihood estimate of θ, and for
example the likelihood ratio of θ = 5.0 to θ = 4.0 is e^5.
Now let us consider the process that causes missing data. Since there are no missing
observations, the missing data are missing at random. We discuss two processes that cause
missing data. First suppose that the manufacturer informs us that the display mechanism
has the flaw that for each weighing the value is displayed with probability φ = θ/(1 + θ).
This fact means that the observed data are observed at random, and that φ is not distinct
from θ. With a flat prior on θ > 0 the posterior distribution for θ is proportional to the
posterior distribution ignoring the process that causes missing data times {θ/(1 + θ)}^10. Thus,
because θ and φ are not distinct, the posterior distribution for θ may be affected by the
process that causes missing data; i.e. all ten weighings yielding values suggests that θ/(1 + θ)
is close to unity and hence suggests that θ is large compared to unity. The maximum likeli-
hood estimate of θ is now about 5.03 and the likelihood ratio of θ = 5.0 to θ = 4.0 is about
1.5 e^5.
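These numbers are easy to reproduce. The sketch below is not part of the paper; it evaluates the two log likelihoods as functions of θ through the sample average 5.0 and maximizes the full one numerically.

```python
# Hypothetical numerical check of the weighing example; not from the paper.
import numpy as np
from scipy.optimize import minimize_scalar

ubar, n = 5.0, 10                                   # average of the ten displayed values

def loglik_ignoring(theta):
    return -n * (ubar - theta) ** 2 / 2.0           # normal kernel in the sample average

def loglik_full(theta):
    # add the factor {theta/(1+theta)}^10 from the display mechanism
    return loglik_ignoring(theta) + n * np.log(theta / (1.0 + theta))

print(np.exp(loglik_ignoring(5.0) - loglik_ignoring(4.0)))   # e^5, about 148
print(np.exp(loglik_full(5.0) - loglik_full(4.0)))           # about 1.5 e^5, about 223
mle = minimize_scalar(lambda t: -loglik_full(t), bounds=(0.1, 20.0), method="bounded")
print(mle.x)                                                  # about 5.03
```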
However, since in this case the missing data are missing at random and the observed data
are observed at random, the sampling distribution of the sample average ignoring the
process that causes missing data equals the conditional sampling distribution of the sample
average given that all values are observed. The unconditional sampling distribution of the
sample average is the mixture of eleven distributions, the ith (i = 1, ..., 10) being N(θ, 1/i)
with mixing weight [10!/{i!(10 − i)!}] θ^i/(1 + θ)^10, and the eleventh being the distribution of the
'sample average' if no data are observed, e.g. zero with probability 1, with mixing weight (1 + θ)^{−10}.
Now suppose that the manufacturer instead informs us that the display mechanism has
the flaw that it fails to display a value if the value that is going to be displayed is less than φ.
Then the missing data are still missing at random, but the observed data are not observed
at random since the values are observed because they are greater than φ. Also θ and φ are
now distinct since φ is a property of the machine and θ is a property of the object. It follows
that sampling distribution inferences may be affected by the process that causes missing
data. Thus, the sampling distribution of the sample average given that all ten values are
observed is now the convolution of ten values from the distribution N(θ, 0.01) truncated
below φ, and the unconditional sampling distribution of the sample average is the mixture
of eleven distributions, the jth (j = 1, ..., 10) being the convolution of j N(θ, 1/j)'s with mixing
weight equal to [10!/{j!(10 − j)!}] ξ(φ, θ)^j {1 − ξ(φ, θ)}^{10−j}, where ξ(φ, θ) equals the area from
φ to ∞ under the N(θ, 1) density, and the eleventh being the distribution of the 'sample
average' if no data are observed, with mixing weight {1 − ξ(φ, θ)}^10.
However, since the missing data are missing at random and φ is distinct from θ, the
posterior distribution for θ with each fixed prior is unaffected by the process that causes
missing data. Hence, with a flat prior on θ > 0, the posterior distribution for θ remains
approximately N(5.0, 0.1). Also, 5.0 remains the maximum likelihood estimate of θ, and
e^5 remains the likelihood ratio of θ = 5.0 to θ = 4.0.
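A brief simulation, again not in the paper, illustrates the sampling-distribution side of this comparison: conditionally on all ten values being displayed, the average of the displayed values is centred above θ and has a smaller spread than N(θ, 0.1). The particular θ and φ are arbitrary.

```python
# Hypothetical simulation of the truncation mechanism in Section 9; not from the paper.
import numpy as np

rng = np.random.default_rng(1)
theta, phi, n, n_sims = 5.0, 4.5, 10, 100_000

u = rng.normal(theta, 1.0, size=(n_sims, n))
all_displayed = np.all(u > phi, axis=1)        # pattern m = (1, ..., 1)
cond_means = u[all_displayed].mean(axis=1)

print(all_displayed.mean())                    # P(all ten displayed) = xi(phi, theta)^10
print(cond_means.mean())                       # noticeably above theta = 5.0
print(cond_means.std())                        # smaller than sqrt(0.1), about 0.32
```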

10. PRACTICAL IMPLICATIONS


In order to have a practical problem in mind, consider the example in § 1 of the survey
of families in 1967 and the follow-up survey in 1970, where a number of families in the 1967
survey could not be located in 1970. Notice that it may be plausible that the missing data
are missing at random; that is, families were not located in 1970 basically because of their
values on background variables that were recorded in 1967, e.g. low scores on socioeconomic
status measures. Also it may be plausible that the parameter of the distribution of the data
and the parameter relating 1967 family characteristics to locatability in 1970 are not tied
to each other. However, it is more difficult to believe that the missing data are missing at
random and that the observed data are observed at random, because these would imply that
families were not located in 1970 independently of both the values that were recorded in
1967 and those that would have been recorded in 1970.
This example seems to suggest that if the process that causes missing data is ignored,
Bayesian and direct-likelihood inferences will be proper Bayesian, or likelihood, inferences
more often than sampling distribution inferences will be proper sampling distribution
inferences. Since explicitly considering the process that causes missing data requires a model
for the process, it seems simpler to make proper Bayesian and likelihood inferences in
many cases.
One might argue, however, that this apparent simplicity of likelihood and Bayesian
inference really buries the important issues. Many Bayesians feel that data analysis should
proceed with the use of 'objective' or 'noninformative' priors (Box & Tiao, 1973; Jeffreys,
1961), and these objective priors are determined from sampling distributions of statistics,
e.g. Fisher information. In addition, likelihood inferences are at times surrounded with
references to the sampling distributions of likelihood statistics. Thus practically, when
there is the possibility of missing data, some interpretations of Bayesian and likelihood
inference face the same restrictions as sampling distribution inference.
The inescapable conclusion seems to be that when dealing with real data, the practising
statistician should explicitly consider the process that causes missing data far more often
than he does. However, to do so, he needs models for this process and these have not received
much attention in the statistical literature.

I would like to thank A. P. Dempster, P. W. Holland, T. W. F. Stroud and a referee for


helpful comments on earlier versions of this paper.

REFERENCES
AFIFI, A. A. & ELASHOFF, R. M. (1966). Missing observations in multivariate statistics. I. Review of
the literature. J. Am. Statist. Assoc. 61, 595-604.
ANDERSON, T. W. (1957). Maximum likelihood estimates for a multivariate normal distribution when
some observations are missing. J. Am. Statist. Assoc. 52, 200-3.
BOX, G. E. P. & TIAO, G. C. (1973). Bayesian Inference in Statistical Analysis. Reading, Mass.: Addison-
Wesley.
COCHRAN, W. G. (1963). Sampling Techniques. New York: Wiley.
EDWARDS, A. W. F. (1972). Likelihood. Cambridge University Press.
HARTLEY, H. O. (1956). Programming analysis of variance for general purpose computers. Biometrics
12, 110-22.
HARTLEY, H. O. & HOCKING, R. R. (1971). The analysis of incomplete data. Biometrics 27, 783-823.
HEALY, M. J. R. & WESTMACOTT, M. (1956). Missing values in experiments analysed on automatic
computers. Appl. Statist. 5, 203-6.
HOCKING, R. R. & SMITH, W. B. (1968). Estimation of parameters in the multivariate normal distribu-
tion with missing observations. J. Am. Statist. Assoc. 63, 159-73.
HOCKING, R. R. & SMITH, W. B. (1972). Optimum incomplete multinormal samples. Technometrics 14,
299-307.
JEFFREYS, H. (1961). Theory of Probability, 3rd edition. Oxford: Clarendon Press.
KEMPTHORNE, O. (1952). The Design and Analysis of Experiments. New York: Wiley.
LEHMANN, E. L. (1959). Testing Statistical Hypotheses. New York: Wiley.
RUBIN, D. B. (1972). A non-iterative algorithm for least squares estimation of missing values in any
analysis of variance design. Appl. Statist. 21, 136-41.
RUBIN, D. B. (1975). Bayesian inference for causality: The importance of randomization. Proc. Social
Statistics Section, Am. Statist. Assoc., pp. 233-9.
RUBIN, D. B. (1976). Noniterative least squares estimates, standard errors and F-tests for analyses of
variance with missing data. J. R. Statist. Soc. B 38. To appear.
TRAWINSKI, I. M. & BARGMANN, R. E. (1964). Maximum likelihood estimation with incomplete multi-
variate data. Ann. Math. Statist. 35, 647-57.
WILKINSON, G. N. (1958). Estimation of missing values for the analysis of incomplete data. Biometrics
14, 257-86.
WILKS, S. S. (1932). Moments and distributions of estimates of population parameters from frag-
mentary samples. Ann. Math. Statist. 3, 163-95.

[Received April 1974. Revised November 1975]

Comments on paper by D. B. Rubin

BY R. J. A. LITTLE
Department of Statistics, University of Chicago

In the following comments, a notation close to that of Dr Rubin's paper is used. Thus U = (u_1, ..., u_n)
denotes the full data, with density f(u; θ) (θ ∈ Ω_θ), and M = (m_1, ..., m_n) indicates the observed pattern,
with conditional density g(m|u; φ) (φ ∈ Ω_φ) given U = u. The distribution of obs(U, M), the observed
data, can be described as follows. It has M = m with probability

g(m; θ, φ) = ∫ g(m|u; φ) f(u; θ) du = E_U{g(m|U; φ); θ}.    (1)

Given M = m, the conditional density of obs(U, M) is

f(u_(1); θ) g(m|u_(1); θ, φ)/g(m; θ, φ),    (2)

where

g(m|u_(1); θ, φ) = ∫ g(m|u; φ) f(u_(0)|u_(1); θ) du_(0),    (3)

and U_(1) is the observed part of U and U_(0) is the missing part of U.
For sampling based inferences, a first crucial question concerns when it is justified to condition on the
observed pattern, that is on the event M = m, and to use the distribution (2) and (3). A natural condition
is that M should be ancillary, that is that g(m; θ, φ) should be independent of θ for all m, φ. Otherwise
the pattern on its own carries at least some information about θ, which should in principle be used.
Suppose now that this ancillarity condition is satisfied. As Dr Rubin stresses, ignoring the deletion
mechanism involves not only conditioning on M = m̃, but also assuming that U_(1) has a distribution with
marginal density f(u_(1); θ), that is that for the observed pattern M = m̃,

f(u_(1)|m̃; θ, φ) = f(u_(1); θ),    (4)

or that g(m̃|u_(1); θ, φ) = E_{U_(0)}{g(m̃|U_(0), u_(1); φ); θ} is independent of u_(1), which is Dr Rubin's condition (6.2).
A sufficient condition for (4) is a combination of Dr Rubin's conditions, missing at random and
observed at random, namely that

g(m̃|u; φ) is independent of u.    (5)

This implies ancillarity if and only if it holds for all observable patterns m, and not just for the observed
pattern m̃, and also the parameter space for (θ, φ) is Ω_θ × Ω_φ; then the deletion pattern can be ignored.
For example, consider Dr Rubin's weighing problem in § 9, when a weighing value is displayed with
probability θ/(1 + θ), and all values are displayed. Then (5) is satisfied for all patterns m, but φ = θ/(1 + θ),
so that θ and φ are dependent, and ancillarity fails to hold. Thus in principle the rather complicated distri-
bution of obs(U, M) described by Dr Rubin should be used. However this deletion mechanism seems
highly unlikely in practice.
Necessary conditions for ignoring the deletion mechanism are unfortunately not obvious, and it is
worth considering some further examples.
Example 1. Suppose that for the observed value m̃, U_(0) and U_(1) are independently distributed, and that
the probability that M = m̃ depends on U_(0) but not U_(1), that is g(m̃|u; φ) = g(m̃|u_(0); φ). Then clearly (4) is
satisfied but not (5), so (5) is not necessary for (4).
Example 2. Let U_i be independent N(θ, 1) (i = 1, ..., n) and suppose m_i = 1 if and only if |U_i − Ū| < φ,
for some constant φ. A simple computation of (1) establishes that m is ancillary for θ. However we cannot
ignore the deletion mechanism, since the correct distribution for sampling inference has density

∫_{R(m)} f(u; θ) du_(0) / ∫_{R(m)} f(u; θ) du,

where R(m) = {u : |u_i − ū| > φ or < φ according as m_i = 0 or 1} is a region of R^n; this is clearly not the
normal density.
The case of pure likelihood inferences is much simpler, since we can fix U_(1) and M at their observed
values ũ_(1), m̃, and the rather complex sample space of obs(U, M) is not relevant. Dr Rubin's sufficient
conditions in Theorem 7.1 are perhaps more remarkable than his examples would suggest. His Example 3,
for instance, is already well known: see Examples 2.34 and 2.40 of Cox & Hinkley (1974). We give a
multivariate example of some practical importance.
Example 3. Consider an incomplete bivariate normal sample of size n of random variables X and Y,
which have respective means μ_1, μ_2, variances σ_1², σ_2², and correlation ρ. Suppose X is always observed.
Two possible deletion mechanisms for Y are: (a) observe Y if and only if Y > c; (b) observe Y if and
only if X > c. It is easily seen that Dr Rubin's 'missing at random' condition is satisfied in (b) but not
in (a), and so for maximum likelihood estimation we can ignore the deletion mechanism in (b) but not
in (a). To illustrate this, the estimates of Table 1 were found from generated data with 50 observations,
c = 0 and μ_1 = μ_2 = 0, so that about half the Y values were deleted in (a) and (b). Note that estimates
of μ_2, σ_2² and ρ in situation (ii a) are biased, confirming previous theory. However the estimates in
situation (ii b) are maximum likelihood, and are close to their true values. Thus here we can ignore the
deletion pattern, although the observed values of Y do not follow the marginal N(0, 2) distribution, and
in particular their sample mean will overestimate zero.
In a real set of data for which (ii b) is appropriate, X might be blood pressure, and Y a medical test
which for safety reasons is not carried out when X is below a certain level c.

Table 1. Maximum likelihood estimates, ignoring the deletion mechanism, for
μ_1 = 0, μ_2 = 0, σ_1² = 1, σ_2² = 2, ρ = 0.71

                               μ_1      μ_2      σ_1²     σ_2²     ρ
(i)    Complete data           0.013    0.085    0.917    1.827    0.780
(ii a) Data censored by (a)    0.013    0.930    0.917    0.456    0.510
(ii b) Data censored by (b)    0.013   -0.140    0.917    1.991    0.645
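A sketch along the following lines, which is my reconstruction rather than Dr Little's code, reproduces the qualitative pattern of Table 1. It uses the factored-likelihood formulae of Anderson (1957) for the maximum likelihood estimates that ignore the deletion mechanism, and a large sample size so the bias under mechanism (a) is unmistakable.

```python
# Hypothetical simulation of Example 3; not Dr Little's original computation.
import numpy as np

def mle_ignoring(x, y, observed):
    # Factored likelihood (Anderson, 1957): marginal of X from all units, regression of
    # Y on X from the complete pairs; these are the ML estimates ignoring the mechanism.
    mu_x, var_x = x.mean(), x.var()
    xc, yc = x[observed], y[observed]
    beta1 = np.cov(xc, yc, bias=True)[0, 1] / xc.var()
    beta0 = yc.mean() - beta1 * xc.mean()
    resid_var = ((yc - beta0 - beta1 * xc) ** 2).mean()
    mu_y = beta0 + beta1 * mu_x
    var_y = resid_var + beta1 ** 2 * var_x
    rho = beta1 * np.sqrt(var_x / var_y)
    return mu_x, mu_y, var_x, var_y, rho

rng = np.random.default_rng(2)
n, c = 50_000, 0.0
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 1.0, n)             # var(Y) = 2, rho = 1/sqrt(2), about 0.71

print(mle_ignoring(x, y, y > c))            # (a) delete Y when Y < c: mu_2, sigma_2^2, rho biased
print(mle_ignoring(x, y, x > c))            # (b) delete Y when X < c: close to (0, 0, 1, 2, 0.71)
```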

In summary, Dr Rubin's paper should stimulate thought about the many mechanisms which produce
data with missing values.
REFERENCE
COX, D. R. & HINKLEY, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.

Reply to comments

BY D. B. RUBIN

First, I want to thank Dr Little for his Example 3, which numerically illustrates the point being made
in the beginning of § 10. Secondly, I must reject his restriction that M should be ancillary when making
sampling distribution inferences for θ which are conditional on M. As Theorem 6.1 states, if (a) the
missing data are missing at random and (b) the observed data are observed at random, then a sampling
distribution probability statement that ignores the process that causes missing data is correct if
interpreted as being conditional on M. Given (a) and (b), Theorem 7.1 on likelihood inference implies that
such a probability statement cannot generally be fully efficient for inference about θ unless (c) θ is distinct
from φ. Nevertheless, sampling distribution inferences that are less than fully efficient are often quite
useful. Furthermore, given (a), (b) and (c), sampling distribution inference for θ should be conditional
on M whether or not M is ancillary. For a simple case, consider my Example 4 with m̃ = (1, 0), φ = 0.1,
and f_θ(u_1, u_2) ~ N{(θ, θ), I}. The conditional probability of the event S = (ū − 1.96 < θ < ū + 1.96), where
ū = Σ m_i u_i / Σ m_i, is 0.95 for all θ, while the unconditional probability of S is nearly 0.99 for θ quite positive.
This example suggests that the usual definition of ancillary (Cox & Hinkley, 1974, p. 35) is incorrect for
inference about θ and should be modified to be conditional on the observed value of the ancillary statistic.
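The two coverage probabilities quoted here can be checked with a few lines of simulation; the sketch below is mine, not part of the reply.

```python
# Hypothetical Monte Carlo check of the coverage claim in the reply; not from the paper.
import numpy as np

rng = np.random.default_rng(3)
theta, phi, n_sims = 5.0, 0.1, 500_000

u = rng.normal(theta, 1.0, size=(n_sims, 2))
only_u1 = rng.random(n_sims) < phi                      # pattern m = (1, 0), probability phi
# Otherwise both values are observed when u_1 > 0, and only u_2 when u_1 < 0 (Example 4).
ubar = np.where(only_u1, u[:, 0],
                np.where(u[:, 0] > 0, u.mean(axis=1), u[:, 1]))
covered = np.abs(ubar - theta) < 1.96

print(covered[only_u1].mean())    # conditional on m = (1, 0): about 0.95
print(covered.mean())             # unconditional: about 0.99 for theta = 5
```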
