Biometrika (1976), 63, 3, pp. 581-92
Printed in Great Britain

Inference and missing data

BY DONALD B. RUBIN

SUMMARY
When making sampling distribution inferences about the parameter of the data, $\theta$, it is appropriate to ignore the process that causes missing data if the missing data are 'missing at random' and the observed data are 'observed at random', but these inferences are generally conditional on the observed pattern of missing data. When making direct-likelihood or Bayesian inferences about $\theta$, it is appropriate to ignore the process that causes missing data if the missing data are missing at random and the parameter of the missing data process is 'distinct' from $\theta$. These conditions are the weakest general conditions under which ignoring the process that causes missing data always leads to correct inferences.

Some key words: Bayesian inference; Incomplete data; Likelihood inference; Missing at random; Missing data; Missing values; Observed at random; Sampling distribution inference.
Example 2. Let $u_i$ be the value of blood pressure for the $i$th subject $(i = 1, \ldots, n)$ in a hospital survey. Suppose $v_i = *$ if $u_i$ is less than $\phi$, which equals the mean blood pressure in the population; i.e. we only record blood pressure for subjects whose blood pressures are greater than average. Then

$$g_\phi(m \mid u) = \prod_{i=1}^{n} \delta\{\gamma(u_i - \phi) - m_i\},$$

where $\gamma(a) = 1$ if $a > 0$ and $0$ otherwise, and $\delta(a) = 1$ if $a = 0$ and $0$ otherwise.
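As a concrete illustration, here is a minimal Python sketch of this recording rule; the population mean of 120, the spread, and the sample size are my illustrative assumptions, not values from the paper.

    # Sketch of the Example 2 mechanism: u_i is recorded only when it
    # exceeds the population mean phi; '*' is coded as NaN.
    import numpy as np

    rng = np.random.default_rng(0)
    phi = 120.0                          # assumed population mean
    u = rng.normal(phi, 15.0, size=8)    # hypothetical true blood pressures

    m = (u > phi).astype(int)            # m_i = gamma(u_i - phi)
    v = np.where(m == 1, u, np.nan)      # v_i = u_i if observed, else '*'
    print(m, v)                          # g_phi assigns probability 1 to this m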
Example 3. Observations are taken in sequence until a particular function of the observations is in a specified critical region $G$. Here $n$ is essentially infinite and, for some $n_1$ which is a function of the observations, $v_i \ne *$ $(i \le n_1)$ and $v_i = *$ $(i > n_1)$. Thus

$$g_\phi(m \mid u) = \prod_{i=1}^{n_1} \delta(1 - m_i) \prod_{i=n_1+1}^{n} \delta(m_i),$$

where $n_1$ is the minimum $k$ such that the function $\Phi(u_1, \ldots, u_k) \in G$.
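A minimal sketch of such a stopping rule appears below; the particular stopping function $\Phi$ (a running mean) and critical region $G$ are my illustrative choices, not the paper's.

    # Sketch of the Example 3 mechanism: observe sequentially until the
    # running mean of the u_i falls in the critical region G = (1, inf).
    import numpy as np

    rng = np.random.default_rng(1)

    def n1(draw, max_n=10_000):
        # smallest k with Phi(u_1, ..., u_k) in G
        total = 0.0
        for k in range(1, max_n + 1):
            total += draw()
            if total / k > 1.0:
                return k
        return max_n

    print(n1(lambda: rng.normal(1.2, 1.0)))  # number of observed values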
Example 4. Let $n = 2$. If $u_1 > 0$: with probability $\phi$, $v_1 \ne *$ and $v_2 = *$; and with probability $1 - \phi$, $v_1 \ne *$ and $v_2 \ne *$. If $u_1 \le 0$: with probability $\phi$, $v_1 \ne *$ and $v_2 = *$; and with probability $1 - \phi$, $v_1 = *$ and $v_2 \ne *$. Thus

$$g_\phi(m \mid u) =
\begin{cases}
\phi & \text{if } m = (1, 0), \\
(1 - \phi)\,\gamma(u_1) & \text{if } m = (1, 1), \\
(1 - \phi)\{1 - \gamma(u_1)\} & \text{if } m = (0, 1), \\
0 & \text{if } m = (0, 0).
\end{cases}$$
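A minimal simulation of this two-component mechanism (my coding; the pattern $m$ records which components are observed):

    # Sketch of the Example 4 mechanism with n = 2: with probability phi only
    # the first component is recorded; otherwise which component is recorded
    # depends on the sign of u_1.
    import numpy as np

    def pattern(u, phi, rng):
        if rng.random() < phi:
            return (1, 0)                       # v_1 observed, v_2 missing
        return (1, 1) if u[0] > 0 else (0, 1)   # depends on the sign of u_1

    rng = np.random.default_rng(2)
    u = rng.normal(0.0, 1.0, size=2)
    print(pattern(u, phi=0.1, rng=rng))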
Hence, the observed value of $M$, namely $\tilde m$, effects a partition of each of the vectors of random variables and the vectors of observed values into two vectors corresponding to $\tilde m_i = 0$ for missing data and $\tilde m_i = 1$ for observed data. For convenience write

$$V = (V_{(0)}, V_{(1)}), \quad U = (U_{(0)}, U_{(1)}), \quad \tilde v = (\tilde v_{(0)}, \tilde v_{(1)}), \quad \tilde u = (\tilde u_{(0)}, \tilde u_{(1)}),$$

where by definition $\tilde v_{(0)} = (*, \ldots, *)$ and $\tilde v_{(1)} = \tilde u_{(1)}$. It is important to remember that these partitions are those corresponding to $m = \tilde m$, the observed pattern of missing data. For further notational convenience, we let $u = (u_{(0)}, u_{(1)})$; $u$ consists of a vector of arguments, $u_{(0)}$, corresponding to unobserved random variables, and a vector of known numbers, $u_{(1)} = \tilde u_{(1)}$, corresponding to values of observed random variables.
The objective is to use $\tilde v$, or equivalently $\tilde m$ and $\tilde u_{(1)}$, to make inferences about $\theta$. It is common practice to ignore the process that causes missing data when making these inferences. Ignoring the process that causes missing data means proceeding by: (a) fixing the random variable $M$ at the observed pattern of missing data, $\tilde m$, and (b) assuming that the values of the observed data, $\tilde u_{(1)}$, arose from the marginal density of the random variable $U_{(1)}$:

$$f_\theta(\tilde u_{(1)}) = \int f_\theta(u)\, du_{(0)}. \qquad (5.1)$$
Definition 1. The missing data are missing at random if for each value of $\phi$, $g_\phi(\tilde m \mid u)$ takes the same value for all $u_{(0)}$.

Definition 2. The observed data are observed at random if for each value of $\phi$ and $u_{(0)}$, $g_\phi(\tilde m \mid u)$ takes the same value for all $u_{(1)}$.
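Both definitions fix the observed pattern $\tilde m$ and may fail for other patterns. The sketch below (my coding) checks this for Example 4 with $\tilde m = (1, 0)$: $g_\phi\{(1,0) \mid u\} = \phi$ for every $u$, so the missing data are missing at random and the observed data are observed at random for this pattern, even though $g_\phi\{(1,1) \mid u\}$ depends on $u$.

    # g_phi(m | u) for the Example 4 mechanism (n = 2).
    def g(m, u, phi):
        if m == (1, 0):
            return phi
        if m == (1, 1):
            return (1 - phi) if u[0] > 0 else 0.0
        if m == (0, 1):
            return 0.0 if u[0] > 0 else (1 - phi)
        return 0.0                                   # m = (0, 0) never occurs

    phi = 0.1
    print([g((1, 0), (u1, 0.0), phi) for u1 in (-2.0, 0.5, 3.0)])  # constant
    print([g((1, 1), (u1, 0.0), phi) for u1 in (-2.0, 0.5, 3.0)])  # varies with u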
Definition 3. The parameter $\phi$ is distinct from $\theta$ if their joint parameter space factorizes into a $\phi$-space and a $\theta$-space, and when prior distributions are specified for $\phi$ and $\theta$, if these are independent.
Table 1 classifies the four examples of § 4 in terms of these definitions.
The correct conditional density of $\tilde v$ given $m = \tilde m$ under $f_\theta g_\phi$ is

$$\int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du_{(0)} \Big/ h_{\theta,\phi}(\tilde m), \qquad (6.1)$$

where $h_{\theta,\phi}(\tilde m) = \int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du$, which is the marginal probability that $M$ takes the value $\tilde m$. Hence, the correct sampling distribution of $S(\tilde v)$ depends in general not only on the fixed hypothesized $f_\theta$ but also on the fixed hypothesized $g_\phi$.
THEOREM 6.1. Suppose that (a) the missing data are missing at random and (b) the observed data are observed at random. Then the sampling distribution of $S(\tilde v)$ under $f_\theta$ ignoring the process that causes missing data, i.e. calculated from density (5.1), equals the correct conditional sampling distribution of $S(\tilde v)$ given $\tilde m$ under $f_\theta g_\phi$, that is calculated from density (6.1) assuming $h_{\theta,\phi}(\tilde m) > 0$.
Proof. Under conditions (a) and (b), for each value of $\phi$, $g_\phi(\tilde m \mid u)$ takes the same value for all $u$; notice that this does not imply that $V$ and $M$ are independently distributed unless it holds for all possible $\tilde m$. Hence $h_{\theta,\phi}(\tilde m) = g_\phi(\tilde m \mid u)$, and thus the distribution of every statistic under density (5.1) is the same as under density (6.1).
THEOREM 6.2. The sampling distribution of $S(\tilde v)$ under $f_\theta$ calculated by ignoring the process that causes missing data equals the correct conditional sampling distribution of $S(\tilde v)$ given $\tilde m$ under $f_\theta g_\phi$ for every $S(\tilde v)$, if and only if

$$\int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du_{(0)} \Big/ \int f_\theta(u)\, du_{(0)} = h_{\theta,\phi}(\tilde m). \qquad (6.2)$$

Proof. The sampling distribution of every $S(\tilde v)$ found from density (5.1) will be identical to that found from density (6.1) if and only if these two densities are equal. This equality may be written as equation (6.2) by dividing by (5.1) and multiplying by $h_{\theta,\phi}(\tilde m)$.
The phrase 'ignoring the process that causes missing data when making sampling distribution inferences' may suggest not only calculating sampling distributions with respect to density (5.1) but also interpreting the resulting sampling distributions as unconditional rather than conditional on $\tilde m$.
THEOREM 6.3. The sampling distribution of $S(\tilde v)$ under $f_\theta$ calculated ignoring the process that causes missing data equals the correct unconditional sampling distribution of $S(\tilde v)$ under $f_\theta g_\phi$ if and only if $g_\phi(\tilde m \mid u) = 1$.

Proof. The sufficiency is immediate. To establish the necessity consider the statistic $S(\tilde v) = 1$ if $m = \tilde m$ and $0$ otherwise.
The likelihood of $\theta$ ignoring the process that causes missing data is any function of $\theta \in \Omega_\theta$ proportional to the density (5.1):

$$\mathcal{L}(\theta \mid \tilde v) \propto f_\theta(\tilde u_{(1)}). \qquad (7.1)$$

The full likelihood of $(\theta, \phi)$ given the observed data $(\tilde m, \tilde u_{(1)})$ is any function of $(\theta, \phi) \in \Omega_{\theta,\phi}$ proportional to

$$\mathcal{L}(\theta, \phi \mid \tilde v) \propto \int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du_{(0)}. \qquad (7.2)$$

THEOREM 7.1. Suppose (a) that the missing data are missing at random and (b) that $\phi$ is distinct from $\theta$. Then all likelihood ratios for $\theta$ ignoring the process that causes missing data are correct.

Proof. Conditions (a) and (b) imply from equations (7.1) and (7.2) that $\mathcal{L}(\theta, \phi \mid \tilde v) = g_\phi(\tilde m \mid \tilde u_{(1)})\, \mathcal{L}(\theta \mid \tilde v)$, so that for each $\phi$ the likelihood ratio of any $\theta_1$ to any $\theta_2$ calculated from (7.2) equals that calculated from (7.1).
THEOREM 7.2. Suppose $\mathcal{L}(\theta \mid \tilde v) > 0$ for all $\theta \in \Omega_\theta$. All likelihood ratios for $\theta \in \Omega_\theta$ ignoring the process that causes missing data are correct for all $\phi \in \Omega_\phi$, if and only if (a) $\Omega_{\theta,\phi} = \Omega_\theta \times \Omega_\phi$, and (b) for each $\phi \in \Omega_\phi$, $E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta, \phi\}$ takes the same positive value for all $\theta \in \Omega_\theta$.
Proof. First we show that

$$\mathcal{L}(\theta, \phi \mid \tilde v) = E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta, \phi\}\, \mathcal{L}(\theta \mid \tilde v), \quad (\theta, \phi) \in \Omega_{\theta,\phi}. \qquad (7.3)$$

This is immediate if $\mathcal{L}(\theta \mid \tilde v) > 0$ for all $\theta \in \Omega_\theta$, and is true otherwise because $0 \le \mathcal{L}(\theta, \phi \mid \tilde v) \le \mathcal{L}(\theta \mid \tilde v)$ for all $\theta$, $\phi$ and $\tilde v$. If conditions (a) and (b) hold, (7.3) factorizes into a $\theta$-factor and a $\phi$-factor; thus these conditions are sufficient even if $\mathcal{L}(\theta \mid \tilde v) = 0$ for some $\theta \in \Omega_\theta$.
Now consider the necessity of conditions (a) and (b). Since $\mathcal{L}(\theta \mid \tilde v) > 0$ for all $\theta \in \Omega_\theta$, if the likelihood ratios for $\theta$ ignoring the process that causes missing data are correct for all $\phi \in \Omega_\phi$, then for each $(\theta, \phi) \in \Omega_\theta \times \Omega_\phi$ we have $\mathcal{L}(\theta, \phi \mid \tilde v) > 0$. Hence condition (a) in the theorem is necessary. Now using condition (a) and (7.3) write, for all $\theta_1, \theta_2 \in \Omega_\theta$ and $\phi \in \Omega_\phi$,

$$\frac{\mathcal{L}(\theta_1, \phi \mid \tilde v)}{\mathcal{L}(\theta_2, \phi \mid \tilde v)} = \frac{E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta_1, \phi\}\, \mathcal{L}(\theta_1 \mid \tilde v)}{E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta_2, \phi\}\, \mathcal{L}(\theta_2 \mid \tilde v)}. \qquad (7.4)$$

If (7.4) equals $\mathcal{L}(\theta_1 \mid \tilde v)/\mathcal{L}(\theta_2 \mid \tilde v)$ for all $\theta_1, \theta_2 \in \Omega_\theta$ and all $\phi \in \Omega_\phi$, we have condition (b) in the theorem.
The posterior distribution of $\theta$ ignoring the process that causes missing data is

$$p(\theta \mid \tilde u_{(1)}) \propto p(\theta)\, f_\theta(\tilde u_{(1)}), \qquad (8.1)$$

while the correct posterior distribution of $(\theta, \phi)$ is

$$p(\theta, \phi \mid \tilde m, \tilde u_{(1)}) \propto p(\theta)\, p(\phi \mid \theta) \int f_\theta(u)\, g_\phi(\tilde m \mid u)\, du_{(0)}. \qquad (8.2)$$

THEOREM 8.1. Suppose (a) that the missing data are missing at random, and (b) that $\phi$ is distinct from $\theta$. Then the posterior distribution of $\theta$ ignoring the process that causes missing data, i.e. calculated from equation (8.1), equals the correct posterior distribution of $\theta$, that is calculated from (8.2), and the posterior distributions for $\theta$ and $\phi$ are independent.

Proof. By conditions (a) and (b), equation (8.2) equals $\{p(\theta) \int f_\theta(u)\, du_{(0)}\} \times \{p(\phi)\, g_\phi(\tilde m \mid u)\}$ up to a normalizing constant, since under (a) $g_\phi(\tilde m \mid u)$ does not depend on $u_{(0)}$.
THEOREM 8.2. The posterior distribution of $\theta$ ignoring the process that causes missing data equals the correct posterior distribution of $\theta$ if and only if

$$\int E\{g_\phi(\tilde m \mid u) \mid \tilde m, \tilde u_{(1)}, \theta, \phi\}\, p(\phi \mid \theta)\, d\phi \ \text{ takes the same value for all } \theta \in \Omega_\theta. \qquad (8.3)$$
Consider now a simple example. The weight $\theta > 0$ of an object is to be determined from ten weighings on a machine whose displayed values are independently distributed $N(\theta, 1)$; the machine, however, sometimes fails to display a value, and the salesman tells us that in his experience sometimes no value will be displayed. Nevertheless in our ten weighings we obtain ten values whose average is 5.0.
Let us first ignore the process that causes missing data. This might seem especially reasonable since there are in fact no missing data. Under $f_\theta$, the sampling distribution of the sample average, 5.0, is $N(\theta, 0.1)$, and with a flat prior on $\theta > 0$ the posterior distribution of $\theta$ is approximately $N(5.0, 0.1)$. Also, 5.0 is the maximum likelihood estimate of $\theta$, and for example the likelihood ratio of $\theta = 5.0$ to $\theta = 4.0$ is $e^5$.
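These numbers are easy to check; a minimal sketch, using only the fact that the log likelihood for $\theta$ given the sample average $\bar u$ is $-n(\bar u - \theta)^2/2$ up to a constant:

    # Ignoring the missing-data process: likelihood ratio of theta = 5 to 4.
    import numpy as np

    ubar, n = 5.0, 10
    loglik = lambda th: -n * (ubar - th) ** 2 / 2   # up to an additive constant
    print(np.exp(loglik(5.0) - loglik(4.0)))        # e**5, about 148.4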
Now let us consider the process that causes missing data. Since there are no missing observations, the missing data are missing at random. We discuss two processes that cause missing data. First suppose that the manufacturer informs us that the display mechanism has the flaw that for each weighing the value is displayed with probability $\phi = \theta/(1 + \theta)$. This fact means that the observed data are observed at random, and that $\phi$ is not distinct from $\theta$. With a flat prior on $\theta > 0$ the posterior distribution for $\theta$ is proportional to the posterior distribution ignoring the process that causes missing data times $\{\theta/(1 + \theta)\}^{10}$. Thus, because $\theta$ and $\phi$ are not distinct, the posterior distribution for $\theta$ may be affected by the process that causes missing data; i.e. all ten weighings yielding values suggests that $\theta/(1 + \theta)$ is close to unity and hence that $\theta$ is large compared to unity. The maximum likelihood estimate of $\theta$ is now about 5.03 and the likelihood ratio of $\theta = 5.0$ to $\theta = 4.0$ is about $1.5\, e^5$.
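A sketch checking these two numbers under the stated mechanism (scipy's bounded scalar minimizer is my choice of tool):

    # With all ten values displayed, the log likelihood gains the term
    # 10 * log(theta / (1 + theta)).
    import numpy as np
    from scipy.optimize import minimize_scalar

    ubar, n = 5.0, 10
    loglik = lambda th: -n * (ubar - th) ** 2 / 2 + n * np.log(th / (1 + th))
    res = minimize_scalar(lambda th: -loglik(th), bounds=(0.1, 20), method="bounded")
    print(res.x)                               # maximum likelihood estimate, ~5.03
    print(np.exp(loglik(5.0) - loglik(4.0)))   # ~1.5 * e**5, about 223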
However, since in this case the missing data are missing at random and the observed data are observed at random, the sampling distribution of the sample average ignoring the process that causes missing data equals the conditional sampling distribution of the sample average given that all values are observed. The unconditional sampling distribution of the sample average is the mixture of eleven distributions, the $i$th $(i = 1, \ldots, 10)$ being $N(\theta, 1/i)$ with mixing weight $10!\, \theta^i / \{(1 + \theta)^{10}\, i!\, (10 - i)!\}$, and the eleventh being the distribution of the 'sample average' if no data are observed, e.g. zero with probability 1, with mixing weight $(1 + \theta)^{-10}$.
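This eleven-component mixture is easy to simulate; a minimal sketch at $\theta = 5$, following the text's convention of reporting zero when nothing is displayed:

    # Unconditional sampling distribution of the sample average under the
    # first mechanism: each value is displayed with probability theta/(1+theta).
    import numpy as np

    rng = np.random.default_rng(3)
    theta, n = 5.0, 10
    avgs = []
    for _ in range(100_000):
        shown = rng.random(n) < theta / (1 + theta)
        vals = rng.normal(theta, 1.0, n)[shown]
        avgs.append(vals.mean() if shown.any() else 0.0)
    print(np.mean(avgs), np.var(avgs))   # variance exceeds the conditional 1/10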
Now suppose that the manufacturer instead informs us that the display mechanism has the flaw that it fails to display a value if the value that is going to be displayed is less than $\phi$. Then the missing data are still missing at random, but the observed data are not observed at random, since the values are observed because they are greater than $\phi$. Also, $\theta$ and $\phi$ are now distinct, since $\phi$ is a property of the machine and $\theta$ is a property of the object. It follows that sampling distribution inferences may be affected by the process that causes missing data. Thus, the sampling distribution of the sample average given that all ten values are observed is now the distribution of the average of ten values from the $N(\theta, 1)$ distribution truncated below $\phi$, and the unconditional sampling distribution of the sample average is the mixture of eleven distributions, the $j$th $(j = 1, \ldots, 10)$ being the distribution of the average of $j$ such truncated values with mixing weight $[10!/\{j!\,(10 - j)!\}]\, \xi(\phi, \theta)^j \{1 - \xi(\phi, \theta)\}^{10 - j}$, where $\xi(\phi, \theta)$ equals the area from $\phi$ to $\infty$ under the $N(\theta, 1)$ density, and the eleventh being the distribution of the 'sample average' if no data are observed, with mixing weight $\{1 - \xi(\phi, \theta)\}^{10}$.
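A sketch of the conditional distribution under this second mechanism, using scipy's truncated normal; the cut-off $\phi = 4.5$ is my illustrative choice:

    # Average of ten displayed values when values below phi are never shown:
    # each displayed value follows N(theta, 1) truncated below phi.
    import numpy as np
    from scipy.stats import truncnorm

    theta, phi, n = 5.0, 4.5, 10
    draws = truncnorm.rvs(a=phi - theta, b=np.inf, loc=theta, scale=1.0,
                          size=(100_000, n),
                          random_state=np.random.default_rng(4))
    print(draws.mean(axis=1).mean())   # noticeably above theta = 5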
However, since the missing data are missing at random and $\phi$ is distinct from $\theta$, the posterior distribution for $\theta$ with each fixed prior is unaffected by the process that causes missing data. Hence, with a flat prior on $\theta > 0$, the posterior distribution for $\theta$ remains approximately $N(5.0, 0.1)$. Also, 5.0 remains the maximum likelihood estimate of $\theta$, and $e^5$ remains the likelihood ratio of $\theta = 5.0$ to $\theta = 4.0$.
COMMENTS ON PAPER BY D. B. RUBIN

BY R. J. A. LITTLE

Department of Statistics, University of Chicago
In the following comments, a notation close to that of Dr Rubin's paper is used. Thus $U = (u_1, \ldots, u_n)$ denotes the full data, with density $f(u; \theta)$ $(\theta \in \Omega_\theta)$, and $M = (m_1, \ldots, m_n)$ indicates the observed pattern, with conditional density $g(m \mid u; \phi)$ $(\phi \in \Omega_\phi)$ given $U = u$. The distribution of $\mathrm{obs}(U, M)$, the observed data, can be described as follows. It has $M = m$ with probability

$$g(m; \theta, \phi) = \int g(m \mid u; \phi)\, f(u; \theta)\, du = E_U\{g(m \mid U; \phi); \theta\}. \qquad (1)$$

Given $M = m$, the conditional density of $\mathrm{obs}(U, M)$ is

$$f(u_{(1)} \mid m; \theta, \phi) = f(u_{(1)}; \theta)\, g(m \mid u_{(1)}; \theta, \phi) \big/ g(m; \theta, \phi), \qquad (2)$$

where $U_{(1)}$ is the observed part of $U$, $U_{(0)}$ is the missing part of $U$, and

$$g(m \mid u_{(1)}; \theta, \phi) = \int g(m \mid u; \phi)\, f(u_{(0)} \mid u_{(1)}; \theta)\, du_{(0)} = E_{U_{(0)}}\{g(m \mid U_{(0)}, u_{(1)}; \phi); \theta\}. \qquad (3)$$
For sampling-based inferences, a first crucial question concerns when it is justified to condition on the observed pattern, that is on the event $M = m$, and to use the distributions (2) and (3). A natural condition is that $M$ should be ancillary, that is that $g(m; \theta, \phi)$ should be independent of $\theta$ for all $m$, $\phi$. Otherwise the pattern on its own carries at least some information about $\theta$, which should in principle be used.
Suppose now that this ancillarity condition is satisfied. As Dr Rubin stresses, ignoring the deletion mechanism involves not only conditioning on $M = m$, but also assuming that $U_{(1)}$ has a distribution with marginal density $f(u_{(1)}; \theta)$, that is, that for the observed pattern $M = \tilde m$,

$$f(u_{(1)} \mid \tilde m; \theta, \phi) = f(u_{(1)}; \theta), \qquad (4)$$

or that $g(\tilde m \mid u_{(1)}; \theta, \phi)$, defined in (3), is independent of $u_{(1)}$, which is Dr Rubin's condition (6.2). A sufficient condition for (4) is a combination of Dr Rubin's conditions, missing at random and observed at random, namely that

$$g(\tilde m \mid u; \phi) \text{ is independent of } u. \qquad (5)$$

This implies ancillarity if and only if it holds for all observable patterns $m$, and not just for the observed pattern $\tilde m$, and also the parameter space for $(\theta, \phi)$ is $\Omega_\theta \times \Omega_\phi$; then the deletion pattern can be ignored.
For example, consider Dr Rubin's weighing problem in § 9, where a weighing value is displayed with probability $\theta/(1 + \theta)$, and all values are displayed. Then (5) is satisfied for all patterns $m$, but $\phi = \theta/(1 + \theta)$, so that $\theta$ and $\phi$ are dependent, and ancillarity fails to hold. Thus in principle the rather complicated distribution of $\mathrm{obs}(U, M)$ described by Dr Rubin should be used. However, this deletion mechanism seems highly unlikely in practice.
Necessary conditions for ignoring the deletion mechanism are unfortunately not obvious, and it is
worth considering some further examples.
Example 1. Suppose that for the observed value $\tilde m$, $U_{(0)}$ and $U_{(1)}$ are independently distributed, and that the probability that $M = \tilde m$ depends on $U_{(0)}$ but not $U_{(1)}$, that is, $g(\tilde m \mid u; \phi) = g(\tilde m \mid u_{(0)}; \phi)$. Then clearly (4) is satisfied but not (5), so (5) is not necessary for (4).
Example 2. Let $u_i$ be independent $N(\theta, 1)$ $(i = 1, \ldots, n)$ and suppose $m_i = 1$ if and only if $|u_i - \bar u| < \phi$, for some constant $\phi$. A simple computation of (1) establishes that $m$ is ancillary for $\theta$. However, we cannot ignore the deletion mechanism, since from (2) and (3) the correct distribution for sampling inference has density

$$\int_{R(m)} f(u; \theta)\, du_{(0)} \Big/ \int_{R(m)} f(u; \theta)\, du,$$

where $R(m) = \{u : |u_i - \bar u| < \phi \text{ or } \ge \phi \text{ according as } m_i = 1 \text{ or } 0\}$ is a region of $R^n$; this is clearly not the normal density $f(u_{(1)}; \theta)$.
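The ancillarity claim can be checked by simulation; a minimal sketch (my parameter choices) estimating the probability of one fixed pattern at two values of $\theta$:

    # The pattern probabilities do not change with theta, because the
    # deletion rule depends only on the location-invariant |u_i - ubar|.
    import numpy as np

    rng = np.random.default_rng(7)
    n, phi, reps = 5, 1.0, 400_000
    for theta in (0.0, 3.0):
        u = rng.normal(theta, 1.0, size=(reps, n))
        m = np.abs(u - u.mean(axis=1, keepdims=True)) < phi
        print(theta, m.all(axis=1).mean())   # nearly identical frequencies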
The case of pure likelihood inferences is much simpler, since we can fix $U_{(1)}$ and $M$ at their observed values $\tilde u_{(1)}, \tilde m$, and the rather complex sample space of $\mathrm{obs}(U, M)$ is not relevant. Dr Rubin's sufficient conditions in Theorem 7.1 are perhaps more remarkable than his examples would suggest. His Example 3, for instance, is already well known: see Examples 2.34 and 2.40 of Cox & Hinkley (1974). We give a multivariate example of some practical importance.
Example 3. Consider an incomplete bivariate normal sample of size $n$ of random variables $X$ and $Y$, which have respective means $\mu_1$, $\mu_2$, variances $\sigma_1^2$, $\sigma_2^2$, and correlation $\rho$. Suppose $X$ is always observed. Two possible deletion mechanisms for $Y$ are: (a) observe $Y$ if and only if $Y > c$; (b) observe $Y$ if and only if $X > c$. It is easily seen that Dr Rubin's 'missing at random' condition is satisfied in (b) but not in (a), and so for maximum likelihood estimation we can ignore the deletion mechanism in (b) but not in (a). To illustrate this, the estimates of Table 1 were found from generated data with 50 observations, $c = 0$ and $\mu_1 = \mu_2 = 0$, so that about half the $Y$ values were deleted in (a) and (b). Note that the estimates of $\mu_2$, $\sigma_2^2$ and $\rho$ in situation (ii a) are biased, confirming previous theory. However, the estimates in situation (ii b) are maximum likelihood, and are close to their true values. Thus here we can ignore the deletion pattern, although the observed values of $Y$ do not follow the marginal $N(0, 2)$ distribution, and in particular their sample mean will overestimate zero.

In a real set of data for which (ii b) is appropriate, $X$ might be blood pressure, and $Y$ a medical test which for safety reasons is not carried out when $X$ is below a certain level $c$.
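The flavour of this example is easy to reproduce; the sketch below (my parameter values, not those of Table 1) compares the naive mean of the observed $Y$'s with a regression-type estimate of $\mu_2$, which is the maximum likelihood estimate under normality, for both deletion mechanisms:

    # Bivariate normal (X, Y); delete Y when Y <= c (mechanism a) or when
    # X <= c (mechanism b); X is always observed.
    import numpy as np

    rng = np.random.default_rng(5)
    n, c, rho = 50_000, 0.0, 0.6
    x = rng.normal(0.0, 1.0, n)
    y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(0.0, 1.0, n)

    for name, keep in (("(a) Y > c", y > c), ("(b) X > c", x > c)):
        naive = y[keep].mean()               # biased upward in both cases
        b1 = np.cov(x[keep], y[keep], ddof=0)[0, 1] / x[keep].var()
        reg = y[keep].mean() + b1 * (x.mean() - x[keep].mean())
        print(name, round(naive, 3), round(reg, 3))  # reg ~ 0 only in (b)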
In summary, Dr Rubin's paper should stimulate thought about the many mechanisms which produce data with missing values.
REFERENCE

Cox, D. R. & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
REPLY TO COMMENTS

BY D. B. RUBIN
First, I want to thank Dr Little for his Example 3, which numerically illustrates the point being made in the beginning of § 10. Secondly, I must reject his restriction that $M$ should be ancillary when making sampling distribution inferences for $\theta$ which are conditional on $M$. As Theorem 6.1 states, if (a) the missing data are missing at random and (b) the observed data are observed at random, then a sampling distribution probability statement that ignores the process that causes missing data is correct if interpreted as being conditional on $M$. Given (a) and (b), Theorem 7.1 on likelihood inference implies that such a probability statement cannot generally be fully efficient for inference about $\theta$ unless (c) $\theta$ is distinct from $\phi$. Nevertheless, sampling distribution inferences that are less than fully efficient are often quite useful. Furthermore, given (a), (b) and (c), sampling distribution inference for $\theta$ should be conditional on $M$ whether or not $M$ is ancillary. For a simple case, consider my Example 4 with $m = (1, 0)$, $\phi = 0.1$, and $(u_1, u_2) \sim N\{(\theta, \theta), I\}$. The conditional probability of the event $S = \{\bar u - 1.96 < \theta < \bar u + 1.96\}$, where $\bar u = \sum m_i u_i / \sum m_i$, is 0.95 for all $\theta$, while the unconditional probability of $S$ is nearly 0.99 for $\theta$ quite positive. This example suggests that the usual definition of ancillarity (Cox & Hinkley, 1974, p. 35) is incorrect for inference about $\theta$ and should be modified to be conditional on the observed value of the ancillary statistic.
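A small simulation checking the two coverage figures (my coding; $\theta = 10$ stands in for '$\theta$ quite positive'):

    # Example 4 with phi = 0.1 and (u_1, u_2) ~ N((theta, theta), I);
    # interval ubar +/- 1.96, where ubar averages the observed components.
    import numpy as np

    rng = np.random.default_rng(6)
    theta, phi, reps = 10.0, 0.1, 200_000
    u = rng.normal(theta, 1.0, size=(reps, 2))
    only_first = rng.random(reps) < phi              # pattern m = (1, 0)
    ubar = np.where(only_first, u[:, 0],
                    np.where(u[:, 0] > 0, u.mean(axis=1), u[:, 1]))
    cover = np.abs(ubar - theta) < 1.96
    print(cover[only_first].mean())   # ~0.95, conditional on m = (1, 0)
    print(cover.mean())               # ~0.99, unconditional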