Biometrika (1976), 63, 3, pp. 581-92
Printed in Great Britain

Inference and missing data

BY DONALD B. RUBIN

SUMMARY
When making sampling distribution inferences about the parameter of the data, $\theta$, it is appropriate to ignore the process that causes missing data if the missing data are 'missing at random' and the observed data are 'observed at random', but these inferences are generally conditional on the observed pattern of missing data. When making direct-likelihood or Bayesian inferences about $\theta$, it is appropriate to ignore the process that causes missing data if the missing data are missing at random and the parameter of the missing data process is 'distinct' from $\theta$. These conditions are the weakest general conditions under which ignoring the process that causes missing data always leads to correct inferences.

Some key words: Bayesian inference; Incomplete data; Likelihood inference; Missing at random; Missing data; Missing values; Observed at random; Sampling distribution inference.
Example 2. Let $u_i$ be the value of blood pressure for the $i$th subject $(i = 1, \ldots, n)$ in a hospital survey. Suppose $v_i = *$ if $u_i$ is less than $\phi$, which equals the mean blood pressure in the population; i.e. we only record blood pressure for subjects whose blood pressures are greater than average. Then

$$g_\phi(m \mid u) = \prod_{i=1}^{n} \delta\{\gamma(u_i - \phi) - m_i\},$$

where $\gamma(a) = 1$ if $a > 0$ and $0$ otherwise, and $\delta(a) = 1$ if $a = 0$ and $0$ otherwise.
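As a concrete illustration, here is a minimal Python sketch of this recording rule; the population mean of 120, the spread, and the sample size are my illustrative assumptions, not values from the paper.

    # Sketch of the Example 2 mechanism: u_i is recorded only when it
    # exceeds the population mean phi; '*' is coded as NaN.
    import numpy as np

    rng = np.random.default_rng(0)
    phi = 120.0                          # assumed population mean
    u = rng.normal(phi, 15.0, size=8)    # hypothetical true blood pressures

    m = (u > phi).astype(int)            # m_i = gamma(u_i - phi)
    v = np.where(m == 1, u, np.nan)      # v_i = u_i if observed, else '*'
    print(m, v)                          # g_phi assigns probability 1 to this m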
Example 3. Observations are taken in sequence until a particular function of the observations is in a specified critical region $G$. Here $n$ is essentially infinite and, for some $n_1$ which is a function of the observations, $v_i \ne *$ $(i \le n_1)$ and $v_i = *$ $(i > n_1)$. Thus

$$g_\phi(m \mid u) = \prod_{i=1}^{n_1} \delta(1 - m_i) \prod_{i=n_1+1}^{n} \delta(m_i),$$

where $n_1$ is the minimum $k$ such that the function $\Phi(u_1, \ldots, u_k) \in G$.
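A minimal sketch of such a stopping rule appears below; the particular stopping function $\Phi$ (a running mean) and critical region $G$ are my illustrative choices, not the paper's.

    # Sketch of the Example 3 mechanism: observe sequentially until the
    # running mean of the u_i falls in the critical region G = (1, inf).
    import numpy as np

    rng = np.random.default_rng(1)

    def n1(draw, max_n=10_000):
        # smallest k with Phi(u_1, ..., u_k) in G
        total = 0.0
        for k in range(1, max_n + 1):
            total += draw()
            if total / k > 1.0:
                return k
        return max_n

    print(n1(lambda: rng.normal(1.2, 1.0)))  # number of observed values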
Example 4. Let $n = 2$. If $u_1 > 0$: with probability $\phi$, $v_1 \ne *$ and $v_2 = *$; and with probability $1 - \phi$, $v_1 \ne *$ and $v_2 \ne *$. If $u_1 \le 0$: with probability $\phi$, $v_1 \ne *$ and $v_2 = *$; and with probability $1 - \phi$, $v_1 = *$ and $v_2 \ne *$. Thus

$$g_\phi(m \mid u) =
\begin{cases}
\phi & \text{if } m = (1, 0), \\
(1 - \phi)\,\gamma(u_1) & \text{if } m = (1, 1), \\
(1 - \phi)\{1 - \gamma(u_1)\} & \text{if } m = (0, 1), \\
0 & \text{if } m = (0, 0).
\end{cases}$$
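A minimal simulation of this two-component mechanism (my coding; the pattern $m$ records which components are observed):

    # Sketch of the Example 4 mechanism with n = 2: with probability phi only
    # the first component is recorded; otherwise which component is recorded
    # depends on the sign of u_1.
    import numpy as np

    def pattern(u, phi, rng):
        if rng.random() < phi:
            return (1, 0)                       # v_1 observed, v_2 missing
        return (1, 1) if u[0] > 0 else (0, 1)   # depends on the sign of u_1

    rng = np.random.default_rng(2)
    u = rng.normal(0.0, 1.0, size=2)
    print(pattern(u, phi=0.1, rng=rng))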
Hence, the observed value of $M$, namely $\tilde m$, effects a partition of each of the vectors of random variables and the vectors of observed values into two vectors corresponding to $\tilde m_i = 0$ for missing data and $\tilde m_i = 1$ for observed data. For convenience write

$$V = (V_{(0)}, V_{(1)}), \quad U = (U_{(0)}, U_{(1)}), \quad \tilde v = (\tilde v_{(0)}, \tilde v_{(1)}), \quad \tilde u = (\tilde u_{(0)}, \tilde u_{(1)}),$$

where by definition $\tilde v_{(0)} = (*, \ldots, *)$ and $\tilde v_{(1)} = \tilde u_{(1)}$. It is important to remember that these partitions are those corresponding to $m = \tilde m$, the observed pattern of missing data. For further notational convenience, we let $u = (u_{(0)}, u_{(1)})$; $u$ consists of a vector of arguments, $u_{(0)}$, corresponding to unobserved random variables, and a vector of known numbers, $u_{(1)} = \tilde u_{(1)}$, corresponding to values of observed random variables.
The objective is to use $\tilde v$, or equivalently $\tilde m$ and $\tilde u_{(1)}$, to make inferences about $\theta$. It is common practice to ignore the process that causes missing data when making these inferences. Ignoring the process that causes missing data means proceeding by: (a) fixing the random variable $M$ at the observed pattern of missing data, $\tilde m$, and (b) assuming that the values of the observed data, $\tilde u_{(1)}$, arose from the marginal density of the random variable $U_{(1)}$:

$$f_\theta(\tilde u_{(1)}) = \int f_\theta(u)\, du_{(0)}. \qquad (5.1)$$
Definition 1. The missing data are missing at random if for each value of $\phi$, $g_\phi(\tilde m \mid u)$ takes the same value for all $u_{(0)}$.

Definition 2. The observed data are observed at random if for each value of $\phi$ and $u_{(0)}$, $g_\phi(\tilde m \mid u)$ takes the same value for all $u_{(1)}$.
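Both definitions fix the observed pattern $\tilde m$ and may fail for other patterns. The sketch below (my coding) checks this for Example 4 with $\tilde m = (1, 0)$: $g_\phi\{(1,0) \mid u\} = \phi$ for every $u$, so the missing data are missing at random and the observed data are observed at random for this pattern, even though $g_\phi\{(1,1) \mid u\}$ depends on $u$.

    # g_phi(m | u) for the Example 4 mechanism (n = 2).
    def g(m, u, phi):
        if m == (1, 0):
            return phi
        if m == (1, 1):
            return (1 - phi) if u[0] > 0 else 0.0
        if m == (0, 1):
            return 0.0 if u[0] > 0 else (1 - phi)
        return 0.0                                   # m = (0, 0) never occurs

    phi = 0.1
    print([g((1, 0), (u1, 0.0), phi) for u1 in (-2.0, 0.5, 3.0)])  # constant
    print([g((1, 1), (u1, 0.0), phi) for u1 in (-2.0, 0.5, 3.0)])  # varies with u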
Definition 3. The parameter $\phi$ is distinct from $\theta$ if their joint parameter space factorizes into a $\phi$-space and a $\theta$-space, and when prior distributions are specified for $\phi$ and $\theta$, if these are independent.
Table 1 classifies the four examples of § 4 in terms of these definitions.
The correct conditional density of $\tilde v$ given $m = \tilde m$ under $f_\theta g_\phi$ is

$$\int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du_{(0)} \Big/ h_{\theta,\phi}(\tilde m), \qquad (6.1)$$

where $h_{\theta,\phi}(\tilde m) = \int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du$, which is the marginal probability that $M$ takes the value $\tilde m$. Hence, the correct sampling distribution of $S(\tilde v)$ depends in general not only on the fixed hypothesized $f_\theta$ but also on the fixed hypothesized $g_\phi$.
THEOREM 6.1. Suppose that (a) the missing data are missing at random and (b) the observed data are observed at random. Then the sampling distribution of $S(\tilde v)$ under $f_\theta$ ignoring the process that causes missing data, i.e. calculated from density (5.1), equals the correct conditional sampling distribution of $S(\tilde v)$ given $\tilde m$ under $f_\theta g_\phi$, that is calculated from density (6.1) assuming $h_{\theta,\phi}(\tilde m) > 0$.
Proof. Under conditions (a) and (b), for each value of $\phi$, $g_\phi(\tilde m \mid u)$ takes the same value for all $u$; notice that this does not imply that $V$ and $M$ are independently distributed unless it holds for all possible $\tilde m$. Hence $h_{\theta,\phi}(\tilde m) = g_\phi(\tilde m \mid u)$, and thus the distribution of every statistic under density (5.1) is the same as under density (6.1).
THEOREM 6.2. The sampling distribution of $S(\tilde v)$ under $f_\theta$ calculated by ignoring the process that causes missing data equals the correct conditional sampling distribution of $S(\tilde v)$ given $\tilde m$ under $f_\theta g_\phi$ for every $S(\tilde v)$, if and only if

$$\int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du_{(0)} \Big/ \int f_\theta(u)\, du_{(0)} = h_{\theta,\phi}(\tilde m). \qquad (6.2)$$

Proof. The sampling distribution of every $S(\tilde v)$ found from density (5.1) will be identical to that found from density (6.1) if and only if these two densities are equal. This equality may be written as equation (6.2) by dividing by (5.1) and multiplying by $h_{\theta,\phi}(\tilde m)$.
The phrase 'ignoring the process that causes missing data when making sampling distribution inferences' may suggest not only calculating sampling distributions with respect to density (5.1) but also interpreting the resulting sampling distributions as unconditional rather than conditional on $\tilde m$.
THEOREM 6.3. The sampling distribution of $S(\tilde v)$ under $f_\theta$ calculated ignoring the process that causes missing data equals the correct unconditional sampling distribution of $S(\tilde v)$ under $f_\theta g_\phi$ if and only if $g_\phi(\tilde m \mid u) = 1$.

Proof. The sufficiency is immediate. To establish the necessity consider the statistic $S(\tilde v) = 1$ if $m = \tilde m$ and $0$ otherwise.
The likelihood of $\theta$ ignoring the process that causes missing data is any function of $\theta \in \Omega_\theta$ proportional to the density (5.1):

$$\mathcal{L}(\theta \mid \tilde v) \propto f_\theta(\tilde u_{(1)}). \qquad (7.1)$$

The full likelihood of $(\theta, \phi)$ given the observed data $(\tilde m, \tilde u_{(1)})$ is any function of $(\theta, \phi) \in \Omega_{\theta,\phi}$ proportional to

$$\mathcal{L}(\theta, \phi \mid \tilde v) \propto \int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du_{(0)}. \qquad (7.2)$$

THEOREM 7.1. Suppose (a) that the missing data are missing at random and (b) that $\phi$ is distinct from $\theta$. Then all likelihood ratios for $\theta$ ignoring the process that causes missing data are correct.

Proof. Conditions (a) and (b) imply from equations (7.1) and (7.2) that $\mathcal{L}(\theta, \phi \mid \tilde v) = g_\phi(\tilde m \mid \tilde u_{(1)})\, \mathcal{L}(\theta \mid \tilde v)$, so that for each $\phi$ the likelihood ratio of any $\theta_1$ to any $\theta_2$ calculated from (7.2) equals that calculated from (7.1).
THEOREM 7.2. Suppose $\mathcal{L}(\theta \mid \tilde v) > 0$ for all $\theta \in \Omega_\theta$. All likelihood ratios for $\theta \in \Omega_\theta$ ignoring the process that causes missing data are correct for all $\phi \in \Omega_\phi$, if and only if (a) $\Omega_{\theta,\phi} = \Omega_\theta \times \Omega_\phi$, and (b) for each $\phi \in \Omega_\phi$, $E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta, \phi\}$ takes the same positive value for all $\theta \in \Omega_\theta$.
Proof. First we show that

$$\mathcal{L}(\theta, \phi \mid \tilde v) = E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta, \phi\}\, \mathcal{L}(\theta \mid \tilde v), \quad (\theta, \phi) \in \Omega_{\theta,\phi}. \qquad (7.3)$$

This is immediate if $\mathcal{L}(\theta \mid \tilde v) > 0$ for all $\theta \in \Omega_\theta$, and is true otherwise because $0 \le \mathcal{L}(\theta, \phi \mid \tilde v) \le \mathcal{L}(\theta \mid \tilde v)$ for all $\theta$, $\phi$ and $\tilde v$. If conditions (a) and (b) hold, (7.3) factorizes into a $\theta$-factor and a $\phi$-factor; thus these conditions are sufficient even if $\mathcal{L}(\theta \mid \tilde v) = 0$ for some $\theta \in \Omega_\theta$.
Now consider the necessity of conditions (a) and (b). Since $\mathcal{L}(\theta \mid \tilde v) > 0$ for all $\theta \in \Omega_\theta$, if the likelihood ratios for $\theta$ ignoring the process that causes missing data are correct for all $\phi \in \Omega_\phi$, then for each $(\theta, \phi) \in \Omega_\theta \times \Omega_\phi$ we have $\mathcal{L}(\theta, \phi \mid \tilde v) > 0$. Hence condition (a) in the theorem is necessary. Now using condition (a) and (7.3) write, for all $\theta_1, \theta_2 \in \Omega_\theta$ and $\phi \in \Omega_\phi$,

$$\frac{\mathcal{L}(\theta_1, \phi \mid \tilde v)}{\mathcal{L}(\theta_2, \phi \mid \tilde v)} = \frac{E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta_1, \phi\}\, \mathcal{L}(\theta_1 \mid \tilde v)}{E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta_2, \phi\}\, \mathcal{L}(\theta_2 \mid \tilde v)}. \qquad (7.4)$$

If (7.4) equals $\mathcal{L}(\theta_1 \mid \tilde v)/\mathcal{L}(\theta_2 \mid \tilde v)$ for all $\theta_1, \theta_2 \in \Omega_\theta$ and all $\phi \in \Omega_\phi$, we have condition (b) in the theorem.
The posterior distribution of $\theta$ ignoring the process that causes missing data is

$$p(\theta \mid \tilde u_{(1)}) \propto p(\theta)\, f_\theta(\tilde u_{(1)}), \qquad (8.1)$$

while the correct posterior distribution of $(\theta, \phi)$ is

$$p(\theta, \phi \mid \tilde m, \tilde u_{(1)}) \propto p(\theta)\, p(\phi \mid \theta) \int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du_{(0)}. \qquad (8.2)$$

THEOREM 8.1. Suppose (a) that the missing data are missing at random, and (b) that $\phi$ is distinct from $\theta$. Then the posterior distribution of $\theta$ ignoring the process that causes missing data, i.e. calculated from equation (8.1), equals the correct posterior distribution of $\theta$, that is calculated from (8.2), and the posterior distributions for $\theta$ and $\phi$ are independent.

Proof. By conditions (a) and (b), equation (8.2) equals $\{p(\theta) \int f_\theta(u)\, du_{(0)}\} \times \{p(\phi)\, g_\phi(\tilde m \mid u)\}$ up to a normalizing constant, since under (a) $g_\phi(\tilde m \mid u)$ does not depend on $u_{(0)}$.
THEOREM 8.2. The posterior distribution of $\theta$ ignoring the process that causes missing data equals the correct posterior distribution of $\theta$ if and only if

$$\int E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta, \phi\}\, p(\phi \mid \theta)\, d\phi \ \text{ takes the same value for all } \theta \in \Omega_\theta. \qquad (8.3)$$
Consider now a simple example. The weight $\theta > 0$ of an object is to be determined from ten weighings on a machine whose displayed values are independently distributed $N(\theta, 1)$; the machine, however, sometimes fails to display a value, and the salesman tells us that in his experience sometimes no value will be displayed. Nevertheless in our ten weighings we obtain ten values whose average is 5.0.
Let us first ignore the process that causes missing data. This might seem especially reasonable since there are in fact no missing data. Under $f_\theta$, the sampling distribution of the sample average, 5.0, is $N(\theta, 0.1)$, and with a flat prior on $\theta > 0$ the posterior distribution of $\theta$ is approximately $N(5.0, 0.1)$. Also, 5.0 is the maximum likelihood estimate of $\theta$, and for example the likelihood ratio of $\theta = 5.0$ to $\theta = 4.0$ is $e^5$.
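These numbers are easy to check; a minimal sketch, using only the fact that the log likelihood for $\theta$ given the sample average $\bar u$ is $-n(\bar u - \theta)^2/2$ up to a constant:

    # Ignoring the missing-data process: likelihood ratio of theta = 5 to 4.
    import numpy as np

    ubar, n = 5.0, 10
    loglik = lambda th: -n * (ubar - th) ** 2 / 2   # up to an additive constant
    print(np.exp(loglik(5.0) - loglik(4.0)))        # e**5, about 148.4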
Now let us consider the process that causes missing data. Since there are no missing observations, the missing data are missing at random. We discuss two processes that cause missing data. First suppose that the manufacturer informs us that the display mechanism has the flaw that for each weighing the value is displayed with probability $\phi = \theta/(1 + \theta)$. This fact means that the observed data are observed at random, and that $\phi$ is not distinct from $\theta$. With a flat prior on $\theta > 0$ the posterior distribution for $\theta$ is proportional to the posterior distribution ignoring the process that causes missing data times $\{\theta/(1 + \theta)\}^{10}$. Thus, because $\theta$ and $\phi$ are not distinct, the posterior distribution for $\theta$ may be affected by the process that causes missing data; i.e. all ten weighings yielding values suggests that $\theta/(1 + \theta)$ is close to unity and hence that $\theta$ is large compared to unity. The maximum likelihood estimate of $\theta$ is now about 5.03 and the likelihood ratio of $\theta = 5.0$ to $\theta = 4.0$ is about $1.5\, e^5$.
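A sketch checking these two numbers under the stated mechanism (scipy's bounded scalar minimizer is my choice of tool):

    # With all ten values displayed, the log likelihood gains the term
    # 10 * log(theta / (1 + theta)).
    import numpy as np
    from scipy.optimize import minimize_scalar

    ubar, n = 5.0, 10
    loglik = lambda th: -n * (ubar - th) ** 2 / 2 + n * np.log(th / (1 + th))
    res = minimize_scalar(lambda th: -loglik(th), bounds=(0.1, 20), method="bounded")
    print(res.x)                               # maximum likelihood estimate, ~5.03
    print(np.exp(loglik(5.0) - loglik(4.0)))   # ~1.5 * e**5, about 223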
However, since in this case the missing data are missing at random and the observed data are observed at random, the sampling distribution of the sample average ignoring the process that causes missing data equals the conditional sampling distribution of the sample average given that all values are observed. The unconditional sampling distribution of the sample average is the mixture of eleven distributions, the $i$th $(i = 1, \ldots, 10)$ being $N(\theta, 1/i)$ with mixing weight $10!\, \theta^i / \{(1 + \theta)^{10}\, i!\, (10 - i)!\}$, and the eleventh being the distribution of the 'sample average' if no data are observed, e.g. zero with probability 1, with mixing weight $(1 + \theta)^{-10}$.
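This eleven-component mixture is easy to simulate; a minimal sketch at $\theta = 5$, following the text's convention of reporting zero when nothing is displayed:

    # Unconditional sampling distribution of the sample average under the
    # first mechanism: each value is displayed with probability theta/(1+theta).
    import numpy as np

    rng = np.random.default_rng(3)
    theta, n = 5.0, 10
    avgs = []
    for _ in range(100_000):
        shown = rng.random(n) < theta / (1 + theta)
        vals = rng.normal(theta, 1.0, n)[shown]
        avgs.append(vals.mean() if shown.any() else 0.0)
    print(np.mean(avgs), np.var(avgs))   # variance exceeds the conditional 1/10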
Now suppose that the manufacturer instead informs us that the display mechanism has the flaw that it fails to display a value if the value that is going to be displayed is less than $\phi$. Then the missing data are still missing at random, but the observed data are not observed at random, since the values are observed because they are greater than $\phi$. Also, $\theta$ and $\phi$ are now distinct, since $\phi$ is a property of the machine and $\theta$ is a property of the object. It follows that sampling distribution inferences may be affected by the process that causes missing data. Thus, the sampling distribution of the sample average given that all ten values are observed is now the distribution of the average of ten values from the $N(\theta, 1)$ distribution truncated below $\phi$, and the unconditional sampling distribution of the sample average is the mixture of eleven distributions, the $j$th $(j = 1, \ldots, 10)$ being the distribution of the average of $j$ such truncated values with mixing weight $[10!/\{j!\,(10 - j)!\}]\, \xi(\phi, \theta)^j \{1 - \xi(\phi, \theta)\}^{10 - j}$, where $\xi(\phi, \theta)$ equals the area from $\phi$ to $\infty$ under the $N(\theta, 1)$ density, and the eleventh being the distribution of the 'sample average' if no data are observed, with mixing weight $\{1 - \xi(\phi, \theta)\}^{10}$.
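A sketch of the conditional distribution under this second mechanism, using scipy's truncated normal; the cut-off $\phi = 4.5$ is my illustrative choice:

    # Average of ten displayed values when values below phi are never shown:
    # each displayed value follows N(theta, 1) truncated below phi.
    import numpy as np
    from scipy.stats import truncnorm

    theta, phi, n = 5.0, 4.5, 10
    draws = truncnorm.rvs(a=phi - theta, b=np.inf, loc=theta, scale=1.0,
                          size=(100_000, n),
                          random_state=np.random.default_rng(4))
    print(draws.mean(axis=1).mean())   # noticeably above theta = 5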
However, since the missing data are missing at random and $\phi$ is distinct from $\theta$, the posterior distribution for $\theta$ with each fixed prior is unaffected by the process that causes missing data. Hence, with a flat prior on $\theta > 0$, the posterior distribution for $\theta$ remains approximately $N(5.0, 0.1)$. Also, 5.0 remains the maximum likelihood estimate of $\theta$, and $e^5$ remains the likelihood ratio of $\theta = 5.0$ to $\theta = 4.0$.
COMMENTS ON PAPER BY D. B. RUBIN

BY R. J. A. LITTLE

Department of Statistics, University of Chicago
In the following comments, a notation close to that of Dr Rubin's paper is used. Thus $U = (u_1, \ldots, u_n)$ denotes the full data, with density $f(u; \theta)$ $(\theta \in \Omega_\theta)$, and $M = (m_1, \ldots, m_n)$ indicates the observed pattern, with conditional density $g(m \mid u; \phi)$ $(\phi \in \Omega_\phi)$ given $U = u$. The distribution of $\mathrm{obs}(U, M)$, the observed data, can be described as follows. It has $M = m$ with probability

$$g(m; \theta, \phi) = \int g(m \mid u; \phi)\, f(u; \theta)\, du = E_U\{g(m \mid U; \phi); \theta\}. \qquad (1)$$

Given $M = m$, the conditional density of $\mathrm{obs}(U, M)$ is

$$f(u_{(1)} \mid m; \theta, \phi) = f(u_{(1)}; \theta)\, g(m \mid u_{(1)}; \theta, \phi) \big/ g(m; \theta, \phi), \qquad (2)$$

where $U_{(1)}$ is the observed part of $U$, $U_{(0)}$ is the missing part of $U$, and

$$g(m \mid u_{(1)}; \theta, \phi) = \int g(m \mid u; \phi)\, f(u_{(0)} \mid u_{(1)}; \theta)\, du_{(0)} = E_{U_{(0)}}\{g(m \mid U_{(0)}, u_{(1)}; \phi); \theta\}. \qquad (3)$$
For sampling-based inferences, a first crucial question concerns when it is justified to condition on the observed pattern, that is on the event $M = m$, and to use the distributions (2) and (3). A natural condition is that $M$ should be ancillary, that is that $g(m; \theta, \phi)$ should be independent of $\theta$ for all $m$, $\phi$. Otherwise the pattern on its own carries at least some information about $\theta$, which should in principle be used.
Suppose now that this ancillarity condition is satisfied. As Dr Rubin stresses, ignoring the deletion mechanism involves not only conditioning on $M = m$, but also assuming that $U_{(1)}$ has a distribution with marginal density $f(u_{(1)}; \theta)$, that is, that for the observed pattern $M = \tilde m$,

$$f(u_{(1)} \mid \tilde m; \theta, \phi) = f(u_{(1)}; \theta), \qquad (4)$$

or that $g(\tilde m \mid u_{(1)}; \theta, \phi)$, defined in (3), is independent of $u_{(1)}$, which is Dr Rubin's condition (6.2). A sufficient condition for (4) is a combination of Dr Rubin's conditions, missing at random and observed at random, namely that

$$g(\tilde m \mid u; \phi) \text{ is independent of } u. \qquad (5)$$

This implies ancillarity if and only if it holds for all observable patterns $m$, and not just for the observed pattern $\tilde m$, and also the parameter space for $(\theta, \phi)$ is $\Omega_\theta \times \Omega_\phi$; then the deletion pattern can be ignored.
For example, consider Dr Rubin's weighing problem in § 9, where a weighing value is displayed with probability $\theta/(1 + \theta)$, and all values are displayed. Then (5) is satisfied for all patterns $m$, but $\phi = \theta/(1 + \theta)$, so that $\theta$ and $\phi$ are dependent, and ancillarity fails to hold. Thus in principle the rather complicated distribution of $\mathrm{obs}(U, M)$ described by Dr Rubin should be used. However, this deletion mechanism seems highly unlikely in practice.
Necessary conditions for ignoring the deletion mechanism are unfortunately not obvious, and it is
worth considering some further examples.
Example 1. Suppose that for the observed value $\tilde m$, $U_{(0)}$ and $U_{(1)}$ are independently distributed, and that the probability that $M = \tilde m$ depends on $U_{(0)}$ but not $U_{(1)}$, that is, $g(\tilde m \mid u; \phi) = g(\tilde m \mid u_{(0)}; \phi)$. Then clearly (4) is satisfied but not (5), so (5) is not necessary for (4).
Example 2. Let $u_i$ be independent $N(\theta, 1)$ $(i = 1, \ldots, n)$ and suppose $m_i = 1$ if and only if $|u_i - \bar u| < \phi$, for some constant $\phi$. A simple computation of (1) establishes that $m$ is ancillary for $\theta$. However, we cannot ignore the deletion mechanism, since from (2) and (3) the correct distribution for sampling inference has density

$$\int_{R(m)} f(u; \theta)\, du_{(0)} \Big/ \int_{R(m)} f(u; \theta)\, du,$$

where $R(m) = \{u : |u_i - \bar u| < \phi \text{ or } \ge \phi \text{ according as } m_i = 1 \text{ or } 0\}$ is a region of $R^n$; this is clearly not the normal density $f(u_{(1)}; \theta)$.
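The ancillarity claim can be checked by simulation; a minimal sketch (my parameter choices) estimating the probability of one fixed pattern at two values of $\theta$:

    # The pattern probabilities do not change with theta, because the
    # deletion rule depends only on the location-invariant |u_i - ubar|.
    import numpy as np

    rng = np.random.default_rng(7)
    n, phi, reps = 5, 1.0, 400_000
    for theta in (0.0, 3.0):
        u = rng.normal(theta, 1.0, size=(reps, n))
        m = np.abs(u - u.mean(axis=1, keepdims=True)) < phi
        print(theta, m.all(axis=1).mean())   # nearly identical frequencies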
The case of pure likelihood inferences is much simpler, since we can fix $U_{(1)}$ and $M$ at their observed values $\tilde u_{(1)}, \tilde m$, and the rather complex sample space of $\mathrm{obs}(U, M)$ is not relevant. Dr Rubin's sufficient conditions in Theorem 7.1 are perhaps more remarkable than his examples would suggest. His Example 3, for instance, is already well known: see Examples 2.34 and 2.40 of Cox & Hinkley (1974). We give a multivariate example of some practical importance.
Example 3. Consider an incomplete bivariate normal sample of size $n$ of random variables $X$ and $Y$, which have respective means $\mu_1$, $\mu_2$, variances $\sigma_1^2$, $\sigma_2^2$, and correlation $\rho$. Suppose $X$ is always observed. Two possible deletion mechanisms for $Y$ are: (a) observe $Y$ if and only if $Y > c$; (b) observe $Y$ if and only if $X > c$. It is easily seen that Dr Rubin's 'missing at random' condition is satisfied in (b) but not in (a), and so for maximum likelihood estimation we can ignore the deletion mechanism in (b) but not in (a). To illustrate this, the estimates of Table 1 were found from generated data with 50 observations, $c = 0$ and $\mu_1 = \mu_2 = 0$, so that about half the $Y$ values were deleted in (a) and (b). Note that the estimates of $\mu_2$, $\sigma_2^2$ and $\rho$ in situation (ii a) are biased, confirming previous theory. However, the estimates in situation (ii b) are maximum likelihood, and are close to their true values. Thus here we can ignore the deletion pattern, although the observed values of $Y$ do not follow the marginal $N(0, 2)$ distribution, and in particular their sample mean will overestimate zero.

In a real set of data for which (ii b) is appropriate, $X$ might be blood pressure, and $Y$ a medical test which for safety reasons is not carried out when $X$ is below a certain level $c$.
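The flavour of this example is easy to reproduce; the sketch below (my parameter values, not those of Table 1) compares the naive mean of the observed $Y$'s with a regression-type estimate of $\mu_2$, which is the maximum likelihood estimate under normality, for both deletion mechanisms:

    # Bivariate normal (X, Y); delete Y when Y <= c (mechanism a) or when
    # X <= c (mechanism b); X is always observed.
    import numpy as np

    rng = np.random.default_rng(5)
    n, c, rho = 50_000, 0.0, 0.6
    x = rng.normal(0.0, 1.0, n)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(0.0, 1.0, n)

    for name, keep in (("(a) Y > c", y > c), ("(b) X > c", x > c)):
        naive = y[keep].mean()               # biased upward in both cases
        b1 = np.cov(x[keep], y[keep], ddof=0)[0, 1] / x[keep].var()
        reg = y[keep].mean() + b1 * (x.mean() - x[keep].mean())
        print(name, round(naive, 3), round(reg, 3))  # reg ~ 0 only in (b)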
In summary, Dr Rubin's paper should stimulate thought about the many mechanisms which produce data with missing values.
REFERENCE

Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
REPLY TO COMMENTS

BY D. B. RUBIN
First, I want to thank Dr Little for his Example 3, which numerically illustrates the point being made in the beginning of § 10. Secondly, I must reject his restriction that $M$ should be ancillary when making sampling distribution inferences for $\theta$ which are conditional on $M$. As Theorem 6.1 states, if (a) the missing data are missing at random and (b) the observed data are observed at random, then a sampling distribution probability statement that ignores the process that causes missing data is correct if interpreted as being conditional on $M$. Given (a) and (b), Theorem 7.1 on likelihood inference implies that such a probability statement cannot generally be fully efficient for inference about $\theta$ unless (c) $\theta$ is distinct from $\phi$. Nevertheless, sampling distribution inferences that are less than fully efficient are often quite useful. Furthermore, given (a), (b) and (c), sampling distribution inference for $\theta$ should be conditional on $M$ whether or not $M$ is ancillary. For a simple case, consider my Example 4 with $m = (1, 0)$, $\phi = 0.1$, and $(u_1, u_2) \sim N\{(\theta, \theta), I\}$. The conditional probability of the event $S = \{\bar u - 1.96 < \theta < \bar u + 1.96\}$, where $\bar u = \sum m_i u_i / \sum m_i$, is 0.95 for all $\theta$, while the unconditional probability of $S$ is nearly 0.99 for $\theta$ quite positive. This example suggests that the usual definition of ancillarity (Cox & Hinkley, 1974, p. 35) is incorrect for inference about $\theta$ and should be modified to be conditional on the observed value of the ancillary statistic.
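A small simulation checking the two coverage figures (my coding; $\theta = 10$ stands in for '$\theta$ quite positive'):

    # Example 4 with phi = 0.1 and (u_1, u_2) ~ N((theta, theta), I);
    # interval ubar +/- 1.96, where ubar averages the observed components.
    import numpy as np

    rng = np.random.default_rng(6)
    theta, phi, reps = 10.0, 0.1, 200_000
    u = rng.normal(theta, 1.0, size=(reps, 2))
    only_first = rng.random(reps) < phi              # pattern m = (1, 0)
    ubar = np.where(only_first, u[:, 0],
                    np.where(u[:, 0] > 0, u.mean(axis=1), u[:, 1]))
    cover = np.abs(ubar - theta) < 1.96
    print(cover[only_first].mean())   # ~0.95, conditional on m = (1, 0)
    print(cover.mean())               # ~0.99, unconditional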