Missing Data
Rev. Nr.: Effective date:
Handling Missing Data 1.1 5-7-2012
HB Nr. : 1.4-09 aw
1. Aim
To give researchers a structured guideline for handling missing data.
2. Definitions
3. Keywords
Missing data, missing completely at random, missing at random, missing not at random, imputation.
4. Description
4.1 Introduction
Missing data is a common problem in all kinds of research. The way you deal with it depends on how
much data is missing, the kind of missing data (single items, a full questionnaire, a measurement
wave), and why the data are missing. Handling missing data is an important step in several phases
of your study.
Data collection:
Closely monitor the completeness of the data when you receive or obtain the data. When you detect
missing data during data collection, try to complete your data. Look back in the raw data
(questionnaires), or ask your respondents to fill out the missing items. Describe in your logbook why
data are missing. This helps you to decide whether data are missing at random or not.
Data processing:
Investigate how much data is missing (see 4.4), estimate the need for imputation, and consider the
most appropriate imputation method (see 4.5 and further).
Data analyses:
If you still have missing values in your data set when you start your analyses, remember that
casewise or listwise deletion (the default in SPSS regression and ANOVA) may compromise the
reliability of your results (see 4.2).
SPSS can help you to identify the amount of missing data. When you are interested in the percentage
of missing values for each variable separately (e.g. an item on a questionnaire), use the
Frequencies option in SPSS:
1. Select Analyze > Descriptive Statistics > Frequencies.
2. Move all variables into the “Variable(s)” window.
3. Click OK. The “Statistics” box tells you the number of missing values for each variable.
However, be aware that this only gives you information about the percentage of missing values for
each variable separately. It is more important to study the overall percentage of missing data,
especially when you use several variables in your analysis.
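Outside SPSS, the per-variable check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set, the variable names and the helper `percent_missing_per_variable` are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical toy data set: one dict per respondent, None marks a missing value.
records = [
    {"age": 34, "income": 2100, "item1": 3},
    {"age": 51, "income": None, "item1": 4},
    {"age": None, "income": None, "item1": 2},
    {"age": 29, "income": 1800, "item1": None},
]


def percent_missing_per_variable(rows):
    """Return {variable: percentage of rows in which that variable is missing}."""
    variables = list(rows[0].keys())
    return {
        var: 100.0 * sum(row[var] is None for row in rows) / len(rows)
        for var in variables
    }


per_variable = percent_missing_per_variable(records)
for var, pct in per_variable.items():
    print(f"{var}: {pct:.1f}% missing")
```

As in the SPSS Frequencies output, each percentage describes one variable in isolation and says nothing about how many respondents are affected overall.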
When you are interested in the overall percentage of missing data, use the following option:
1. Select Analyze > Multiple Imputation > Analyze Patterns.
2. Move all variables into the “Variable(s)” window.
3. Click OK. The output tells you the percentage of variables with missing data, the
percentage of cases with missing data, and the number of missing values. The final pie
chart tells you the overall percentage of missing data. Note the 5% borderline. Patterns of
missing data are presented as well.
4. Tip: use the Help button, and click “Show me” for more information about the options and
output in SPSS.
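The three summary figures mentioned above (variables, cases and individual values with missing data) can also be computed by hand. The sketch below assumes a hypothetical data set in which `None` marks a missing value; the function name and the dictionary keys are illustrative and do not correspond to SPSS output labels.

```python
# Hypothetical toy data set: 5 respondents, 4 variables, None marks a missing value.
records = [
    {"age": 34, "income": 2100, "item1": 3, "item2": 5},
    {"age": 51, "income": None, "item1": 4, "item2": 4},
    {"age": 47, "income": 2500, "item1": 2, "item2": 1},
    {"age": 29, "income": None, "item1": None, "item2": 2},
    {"age": 62, "income": 3000, "item1": 5, "item2": 3},
]


def missing_summary(rows):
    """Summarise missingness at the level of values, cases and variables."""
    variables = list(rows[0].keys())
    n_cells = len(rows) * len(variables)
    missing_cells = sum(row[v] is None for row in rows for v in variables)
    incomplete_cases = sum(any(row[v] is None for v in variables) for row in rows)
    incomplete_vars = sum(any(row[v] is None for row in rows) for v in variables)
    return {
        "pct_values_missing": 100.0 * missing_cells / n_cells,
        "pct_cases_incomplete": 100.0 * incomplete_cases / len(rows),
        "pct_variables_incomplete": 100.0 * incomplete_vars / len(variables),
    }


summary = missing_summary(records)
exceeds_borderline = summary["pct_values_missing"] > 5.0  # the 5% borderline
print(summary, "above 5%:", exceeds_borderline)
```

Note how a modest share of missing values can still leave a large share of cases incomplete, which is what matters for listwise deletion.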
When you want to find out more about the patterns of missing data and how missingness on one
variable relates to missingness on other variables, use the following option:
1. Select Analyze > Missing Value Analysis.
2. Move all variables of interest into the Quantitative or Categorical Variable(s) window.
3. Use the “Patterns” button to get information about the relation between missing data on
several variables.
4. A tutorial on the Missing Value Analysis procedures in SPSS (SPSS 16 and later) can be
found via the Help button. A user’s guide can be downloaded freely from the internet.
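To make the idea of a missing-data pattern concrete, the following sketch tabulates which combinations of variables are missing together. It is an illustration only, not a reproduction of the SPSS Patterns output; the data and the helper `missing_patterns` are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical toy data set: None marks a missing value.
records = [
    {"age": 34, "income": 2100, "item1": 3},
    {"age": 51, "income": None, "item1": 4},
    {"age": 47, "income": 2500, "item1": 2},
    {"age": 29, "income": None, "item1": None},
    {"age": 62, "income": 3000, "item1": 5},
]


def missing_patterns(rows):
    """Count how often each combination of missing variables occurs.

    Each pattern is a tuple with the names of the variables that are missing
    for a respondent; the empty tuple stands for a complete case.
    """
    variables = list(rows[0].keys())
    return Counter(
        tuple(v for v in variables if row[v] is None) for row in rows
    )


patterns = missing_patterns(records)
for pattern, count in patterns.most_common():
    label = ", ".join(pattern) if pattern else "(complete case)"
    print(f"{label}: {count}")
```

Patterns in which several variables are missing for the same respondents suggest that the missingness on those variables is related.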
The way you deal with missing data depends on the type of missing data
Think about your dataset. Is it possible that the missing values are non-random?
MCAR: Missing Completely At Random
The data are MCAR when the probability that a value for a certain variable is missing is unrelated
to both the observed values on other variables and the scores on that variable itself.
MAR: Missing At Random (most of the time)
The data are MAR when the probability that a value for a certain variable is missing is related to
observed values on other variables. An example is when older respondents have more missing values
than younger respondents. However, within the group of older and within the group of younger
respondents, the data are still MCAR. Another example is when respondents with low scores on the
first wave are not invited for a second wave.
MNAR: Missing Not At Random
The data are MNAR when the probability that a value for a certain variable is missing is related to
the scores on that variable itself. An example is that respondents with a low income intentionally
skip the income question because they feel that reporting it violates their privacy. In that case,
the probability that an observation is missing depends on information that is not observed, namely
the value of the income score itself, because only low values are missing. MNAR is a serious
problem, which cannot be solved with a technique such as multiple imputation.
It is important to note that you cannot test whether your missing data are MAR or MNAR. The
above-mentioned procedures (1 and 2) will only give you an indication. Stay alert to the
possibility of MNAR, because all analyses have serious problems when your missing data are
MNAR.
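The difference between the mechanisms can be illustrated with a small simulation. The sketch below is not part of the original procedure: the population (income rising with age), the missingness probabilities and all names are assumptions chosen only to show how the mean of the observed incomes behaves under each mechanism.

```python
import random
import statistics

rng = random.Random(2012)  # fixed seed so the sketch is reproducible
n = 20000

# Hypothetical population: income rises with age, plus random noise.
ages = [rng.uniform(20, 80) for _ in range(n)]
incomes = [2000 + 20 * age + rng.gauss(0, 300) for age in ages]
true_mean = statistics.fmean(incomes)


def observed_mean(keep_probability):
    """Mean income among respondents whose income is recorded.

    keep_probability(age, income) gives the chance that the income is observed.
    """
    kept = [
        income
        for age, income in zip(ages, incomes)
        if rng.random() < keep_probability(age, income)
    ]
    return statistics.fmean(kept)


# MCAR: missingness is unrelated to age or income.
mcar_mean = observed_mean(lambda age, income: 0.7)
# MAR: missingness depends only on age, which is always observed.
mar_mean = observed_mean(lambda age, income: 0.4 if age >= 50 else 0.9)
# MNAR: missingness depends on the income value itself (low incomes are skipped).
mnar_mean = observed_mean(lambda age, income: 0.9 if income >= 3000 else 0.4)

print(f"true mean: {true_mean:.0f}")
print(f"MCAR observed mean: {mcar_mean:.0f}")
print(f"MAR observed mean:  {mar_mean:.0f}")
print(f"MNAR observed mean: {mnar_mean:.0f}")
```

In this setup the observed mean stays close to the true mean under MCAR, is too low under MAR (older, higher-earning respondents are underrepresented) and is too high under MNAR. Under MAR the distortion can in principle be corrected using the observed variable age; under MNAR the information needed for the correction is itself missing.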
For MCAR and MAR, there are roughly two kinds of imputation techniques: single and multiple
imputation.
Single imputation is possible in SPSS and is an easy way to handle missing values when only a few
values are missing (less than 5%) and you think your missing values are MCAR or MAR. However,
after single imputation the cases are more similar to each other, which may result in an
underestimation of the standard errors, i.e. confidence intervals that are too narrow. This
increases the chance of a type I error (the null hypothesis of no effect is rejected, while there
is truly no effect). Therefore, this method is less adequate when you have more than 5% missing data.
Multiple imputation is more complex, but is also implemented in SPSS 17.0 and later versions.
Multiple imputation takes into account the uncertainty about the missing values (uncertainty that
is present in all values of a variable) and is therefore preferred over single imputation. When
your missingness is high (it exceeds 5% in several variables and in different persons), multiple
imputation is more adequate.
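How multiple imputation carries the uncertainty about the missing values into the final result can be shown with the pooling step commonly known as Rubin's rules. The sketch below covers only this pooling step, not the creation of the imputed data sets; the function name `pool_rubin` and the example numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter over m imputed data sets using Rubin's rules.

    estimates: the parameter estimate from each imputed data set
    variances: the squared standard error of that estimate in each data set
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average sampling variance
    between = statistics.variance(estimates)  # variance between imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical regression coefficient estimated in three imputed data sets.
estimates = [1.0, 1.2, 0.8]
variances = [0.04, 0.05, 0.03]
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
naive_se = math.sqrt(statistics.fmean(variances))

print(f"pooled estimate: {pooled_estimate:.2f}")
print(f"pooled SE: {pooled_se:.3f} (ignoring between-imputation variance: {naive_se:.3f})")
```

The between-imputation term is what single imputation lacks: it makes the pooled standard error larger whenever the imputed data sets disagree with each other.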
Imputation techniques
Single imputation
Single imputation techniques are based on the idea that in a random sample every person can be
replaced by a new person, given that this new person is randomly chosen from the same source
population as the original person. In that case you can use the observed data of the other
persons to estimate the distribution of the test result in the source population.
It is called single imputation because each missing value is imputed once.
There are many methods for single imputation, such as replacement by the mean, regression,
and expectation maximization (EM). Expectation maximization is preferred, because with the other
methods the variance and standard error are reduced and the chance of type I errors increases.
Expectation maximization forms a missing data correlation matrix by assuming the shape of a
distribution for the missing data and imputes missing values based on the likelihood under that
distribution. Single imputation is possible in SPSS (Analyze > Missing Value Analysis > EM button
for Expectation Maximization). Contact a statistician from EMGO+ who is willing to help you with
this procedure.
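Why replacement by the mean reduces the variance and the standard error can be seen in a small numerical example. The scores below are hypothetical and `None` is assumed to mark a missing value; the sketch illustrates mean imputation only, not expectation maximization.

```python
import math
import statistics

# Hypothetical item scores; None marks a missing value.
scores = [2.0, 4.0, 6.0, 8.0, None, None]

observed = [s for s in scores if s is not None]
mean_observed = statistics.fmean(observed)

# Mean imputation: every missing score is replaced by the observed mean.
imputed = [mean_observed if s is None else s for s in scores]


def standard_error(values):
    """Standard error of the mean, based on the sample variance."""
    return math.sqrt(statistics.variance(values) / len(values))


var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)
se_observed = standard_error(observed)
se_imputed = standard_error(imputed)

print(f"variance: observed {var_observed:.2f}, after mean imputation {var_imputed:.2f}")
print(f"standard error: observed {se_observed:.2f}, after mean imputation {se_imputed:.2f}")
```

The imputed values add no spread but do increase the apparent sample size, so both the variance and the standard error shrink although no new information was collected.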
For the imputation of a missing score on a single item in a questionnaire (see 4.5), SPSS
syntaxes can be found at:
http://www.tilburguniversity.edu/nl/over-tilburg-university/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/
tw.zip: Software for two-way imputation in SPSS (Van Ginkel & Van der Ark, 2003a), or
rf.zip: Software for response function imputation in SPSS (Van Ginkel & Van der Ark, 2003b).
Sensitivity analysis
After imputation, sensitivity analysis is needed to determine how your substantive results depend
on how you handled the missing data.
Follow these steps:
1. Do a complete case analysis (default option in SPSS; cases with missing values are not
included).
2. Repeat the analysis on the data set after you have imputed the missing values.
3. Compare the substantive conclusions and decide how to report them.
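The three steps above can be sketched on a toy example. Everything in it is an assumption made for illustration: the data are hypothetical, `None` marks a missing value, and simple regression imputation on one fully observed variable stands in for whichever imputation method you actually used.

```python
import statistics

# Hypothetical data: x is fully observed, y has missing values (None).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, None, None, 12.0]

# Step 1: complete case analysis (cases with a missing y are dropped).
complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
cc_mean = statistics.fmean(yi for _, yi in complete)

# Step 2: impute y from x with a least-squares line fitted on the complete
# cases, then repeat the analysis on the imputed data.
mean_x = statistics.fmean(xi for xi, _ in complete)
slope = sum((xi - mean_x) * (yi - cc_mean) for xi, yi in complete) / sum(
    (xi - mean_x) ** 2 for xi, _ in complete
)
intercept = cc_mean - slope * mean_x
y_imputed = [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]
imputed_mean = statistics.fmean(y_imputed)

# Step 3: compare the two results.
difference = imputed_mean - cc_mean
print(f"complete case mean of y: {cc_mean:.2f}")
print(f"mean of y after imputation: {imputed_mean:.2f}")
print(f"difference: {difference:.2f}")
```

A clear difference between the two estimates, as in this toy example, signals that the conclusions depend on how the missing data were handled and that this should be reported.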
4. Summary
• Make every effort to avoid missing data, or failing that, to understand how much and
why data is missing.
• Understand missing data mechanisms (MCAR, MAR, MNAR) and their implications
• Avoid default methods (listwise deletion, pairwise deletion)
• Avoid default fixups (mean imputation, etc.) where possible
• Use multiple imputation to take proper account of missings
• Do a sensitivity analysis
5. Details
[Flowchart: "I have missings in my data", branching into MCAR/MAR and MNAR]
6. Appendices/references/links
Multiple Imputation Methods, Niels Smits (technical literature).
http://www2.chass.ncsu.edu/garson/pa765/missing.htm
http://www.ssc.upenn.edu/~allison/MultInt99.pdf (especially for Multiple Imputation)
Ask EMGO+ statisticians for help via:
http://www.emgo.nl/kc/preparation/research%20design/3%20Advice%20and%20support.html
1. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR.
Multiple imputation for missing data in epidemiological and clinical research: potential and
pitfalls. BMJ 2009;338:b2393. doi: 10.1136/bmj.b2393.
2. Allison, P.D. (2001). Missing Data (Sage University Papers Series on Quantitative
Applications in the Social Sciences, series no. 07-136). Thousand Oaks: Sage.
3. Schafer, J.L. & Graham, J.W. (2002). Missing data: Our view of the state of the art.
Psychological Methods, 7, 147-177.
4. Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to
imputation of missing values. J Clin Epidemiol. 2006; 59(10):1087-91. Review.
5. http://www.stat.psu.edu/~jls/mifaq.html (Multiple Imputation FAQ page with explanation)
6. Van Ginkel, J. R., & Van der Ark, L. A. (2003a). SPSS syntax for two-way imputation of
missing test data [computer software and manual]. Retrieved from
http://www.tilburguniversity.edu/nl/over-tilburg-university/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/
7. Van Ginkel, J. R., & Van der Ark, L. A. (2003b). SPSS syntax for response function imputation
of missing test data [computer software and manual]. Retrieved from
http://www.tilburguniversity.edu/nl/over-tilburg-university/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/
7. Amendments
V1.0 1-12-2011
V1.1 5-7-2012 : addition to section When is imputation of missing data not necessary?