
Missing Data

The document provides guidelines for researchers on how to handle missing data, emphasizing the importance of understanding the type and extent of missingness before analysis. It categorizes missing data into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), and outlines appropriate imputation techniques for each scenario. The document also highlights the need for sensitivity analysis post-imputation to assess the impact of missing data handling on research conclusions.

Uploaded by

arif özer

Title of the document: Handling Missing Data
Rev. Nr.: 1.1
Effective date: 5-7-2012
HB Nr.: 1.4-09 aw

1. Aim
To give researchers a structured guideline for handling missing data

2. Definitions

3. Keywords
Missing data, Missing completely at random, Missing at Random, Missing not at random, Imputation.

4. Description
4.1 Introduction
Missing data is a common problem in all kinds of research. The way you deal with it depends on how
much data is missing, the kind of missing data (single items, a full questionnaire, a measurement
wave), and why it is missing, i.e. the reasons that the data are missing. Handling missing data is an
important step in several phases of your study.

4.2 Why do you need to do something with missing data?


The default option in SPSS is that cases with missing values are not included in the analyses. This
has two consequences. First, deleting cases or persons results in a smaller sample size and larger
standard errors. As a result, the power to find a significant result decreases: the chance that you
correctly accept the alternative hypothesis of an effect (compared to the null hypothesis of no effect)
is smaller. Second, you may introduce bias in effect estimates, such as mean differences (from t-tests)
or regression coefficients (from regression analyses). When the group of non-responders is large and
you delete them, your sample characteristics differ from those of your original sample and of the
population you study, because responders and non-responders may differ in their characteristics.
Therefore, always inspect the missing data in your data set before starting your analyses, and never
simply delete persons with missing values from your dataset (the default option in SPSS).

4.3 What to do with missing data in different phases of your study


Data preparation:
If you work with questionnaires, make sure that all questions are clear and applicable to your respondents.
If necessary, use the ‘not applicable’ answer option. To decrease the chance of missing data, use digital
applications to collect your data, such as Web based questionnaires where you can set the option that
answering the question is required. You can also use these applications for sending reminders and
tracking the respondents’ progress. If you work with physical or physiological data, the most frequent
cause of missing data is a technical problem with the instruments. Testing the instruments in a pilot study
will help to prevent some of these problems.

Data collection:
Closely monitor the completeness of the data when you receive or obtain the data. When you detect
missing data during data collection, try to complete your data. Look back in the raw data
(questionnaires), or ask your respondents to fill out the missing items. Describe in your logbook why
data are missing. This helps you to decide whether data are missing at random or not.

Data processing:
Investigate the number of missing data you have (see 4.4) and estimate the need for imputation and
think about the most adequate imputation method (see 4.5 and further).

Data analyses:
If you have missing values in your data set when starting your analyses, remember that listwise
(casewise) deletion, the default in SPSS regression and ANOVA, may hamper the reliability of your
results (see 4.2).

4.4 How much data is missing?



SPSS can help you to identify the amount of missing data. When you are interested in the percentage
of missing values for each variable separately (e.g. item on a questionnaire) use the Frequency
option in SPSS:
1. Select Analyze > Descriptive Statistics > Frequencies
2. Move all variables into the “Variable(s)” window.
3. Click OK. The “Statistics” box tells you the number of missing values for each variable.
However, be aware that this only gives you information about the percentage of missing values for
each variable separately. It is more important to study the full percentage of missing data, especially
when you use more variables in your analysis.

When you are interested in the full percentage of missing data use the following option:
1. Select Analyze > Multiple Imputation > Analyze Patterns
2. Move all variables into the “Variable(s)” window.
3. Click OK. The output tells you the percentage of variables with missing data, the
percentage of cases with missing data, and the number of missing values. This final pie
chart tells you the full percentage of missing data. Note the 5% borderline. Also patterns of
missing data are presented.
4. Tip: use the Help button, and click “show me” for more information about the options and
output in SPSS.
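
For readers who want to check these percentages outside SPSS, the three figures described above (missing per variable, cases with missing data, and missing values overall) can be sketched in plain Python. The variable names and values below are made up for illustration, and `missing_summary` is a hypothetical helper, not an SPSS function.

```python
# Rows are cases (respondents); None marks a missing value.
# The variables and values are invented for illustration only.
data = [
    {"age": 34, "income": 2500, "score": 7},
    {"age": 51, "income": None, "score": 6},
    {"age": 29, "income": 3100, "score": None},
    {"age": None, "income": None, "score": 8},
    {"age": 45, "income": 2800, "score": 5},
]


def missing_summary(rows):
    """Return the percentage of missing values per variable, per case, and overall."""
    variables = list(rows[0].keys())
    n_cases = len(rows)

    # Percentage of missing values for each variable separately.
    per_variable = {
        var: 100.0 * sum(row[var] is None for row in rows) / n_cases
        for var in variables
    }
    # Cases that have at least one missing value.
    incomplete_cases = sum(any(v is None for v in row.values()) for row in rows)
    # Missing cells over all cells (the "full" percentage of missing data).
    missing_cells = sum(v is None for row in rows for v in row.values())

    return {
        "per_variable": per_variable,
        "pct_incomplete_cases": 100.0 * incomplete_cases / n_cases,
        "pct_missing_values": 100.0 * missing_cells / (n_cases * len(variables)),
    }


summary = missing_summary(data)
for name, value in summary.items():
    print(name, value)
```

Note how the percentage of incomplete cases is much higher than the percentage for any single variable. This is why the per-variable frequencies alone can be misleading when several variables enter the same analysis.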

When you want to find out more about the patterns of missing data and the relation between missing
data on different variables, use the following option:
1. Select Analyze > Missing Value Analysis.
2. Move all variables of interest into the Quantitative or Categorical Variable(s) window.
3. Use the ‘Patterns’ button to get information about the relation between missing data on
several variables.
4. A tutorial of the Missing Value Analysis procedures (SPSS 16 and later) can be found via
the Help button. A user’s guide can be downloaded freely on the internet.

4.5 What kind of data is missing?


The next step is to identify the kind of data that is missing. You can derive this information from the
steps described in 4.4.
1. A single item, or several items, of a questionnaire
2. A full questionnaire or a single variable (such as blood pressure)
3. A measurement wave (in longitudinal / randomized studies)

The way you deal with missing data depends on the kind of data that is missing.

4.6 What type of missings do you have?


Missing values are either random or non-random. Random missing values may occur because the
subject accidentally did not answer some questions. For example, the subject may be tired and/or not
paying attention, and misses the question. Random missing values may also result from data entry
mistakes. Non-random missing values may occur because subjects purposefully do not answer some
questions. For example, the question may be confusing, so respondents do not answer the question.
Also, the question may not provide appropriate answer choices, such as “no opinion” or “not
applicable”, so the subject chooses not to answer the question. Also, subjects may be reluctant to
answer some questions because of social desirability concerns about the content of the question,
such as questions about sensitive topics like income, past crimes, sexual history, or prejudice or bias
toward certain groups.

Think about your dataset. Is it possible that the missing values are non-random?

In 1976, Rubin developed a typology for missing data.

Type of missings and description

MCAR: Missing Completely At Random
The data are MCAR when the probability that a value for a certain variable is missing is unrelated
both to the values of other observed variables and to the values of the variable with missing values
itself. An example is when respondents accidentally skip questions. In other words, the observed
values in your dataset are just a random sample of the values you would have had if the dataset had
been complete.

MAR: Missing At Random (most common in practice)
The data are MAR when the probability that a value for a certain variable is missing is related to
observed values on other variables. An example is when older respondents have more missing
values than younger respondents. However, within the groups of older and younger respondents,
the data are still MCAR. Another example is when respondents with low scores on the first wave
are not invited for a second wave.

MNAR: Missing Not At Random
The data are MNAR when the probability that a value for a certain variable is missing is related to
the scores on that variable itself. An example is that respondents with a low income intentionally
skip the income question because they feel that answering it violates their privacy. In that case, the
probability that an observation is missing depends on information that is not observed, namely the
value of the income score, because only low values are missing. MNAR is a serious problem, which
cannot be solved with a technique such as multiple imputation.
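
The three mechanisms can be made concrete with a small simulation. The sketch below uses plain Python and an invented population (age and an income that rises with age). The missingness probabilities and the income threshold of 2500 are arbitrary assumptions chosen for illustration, not values from this guideline.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: income rises with age, plus random noise.
n = 20000
ages = [random.uniform(20, 70) for _ in range(n)]
incomes = [1000 + 40 * age + random.gauss(0, 300) for age in ages]

# True means "income is observed", False means "income is missing".
# MCAR: every income has the same 30% chance of being missing.
keep_mcar = [random.random() > 0.30 for _ in range(n)]
# MAR: missingness depends only on the observed variable age.
keep_mar = [random.random() > (0.50 if age > 45 else 0.10) for age in ages]
# MNAR: missingness depends on the income value itself (low incomes are skipped).
keep_mnar = [random.random() > (0.50 if inc < 2500 else 0.10) for inc in incomes]


def observed_mean(values, keep):
    """Mean of the values that remain observed."""
    return statistics.mean(v for v, k in zip(values, keep) if k)


mean_full = statistics.mean(incomes)
mean_mcar = observed_mean(incomes, keep_mcar)
mean_mar = observed_mean(incomes, keep_mar)
mean_mnar = observed_mean(incomes, keep_mnar)

# Under MAR the observed data are still representative within an age group.
older_full = statistics.mean(i for a, i in zip(ages, incomes) if a > 45)
older_mar = statistics.mean(
    i for a, i, k in zip(ages, incomes, keep_mar) if a > 45 and k
)

print("complete data      :", round(mean_full))
print("observed under MCAR:", round(mean_mcar))
print("observed under MAR :", round(mean_mar))
print("observed under MNAR:", round(mean_mnar))
print("older group, complete vs MAR:", round(older_full), round(older_mar))
```

In this sketch the observed mean stays close to the complete-data mean under MCAR, shifts under MAR (but not within the older group, where age explains the missingness), and shifts under MNAR in a way that no observed variable can explain.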

How do you know what kind of missings you have?


There are three kinds of methods.
1. First, you can inspect the data yourself. Are the missings equally distributed in the data? Are
low and/or high scores missing? If the missings are not equally spread, this might be an
indication that the data are MNAR. For this method you must know a priori what the
distribution of the variable normally is, i.e. is it normal or skewed? You need this information
before you can judge which part of the data suffers from missing values. This method only
applies if your dataset is large.
2. Second, SPSS can test whether the respondents with missing data differ from the
respondents without missing data on important variables (Analyze > Missing Value
Analysis > select important variables > Descriptives > t tests with groups formed by indicator
variables). A significant difference is an indication for MAR. Be aware that if your sample size
is large (>500) this t-test might be significant even if the data are not truly MAR. So, just
looking at the means and their difference might be good enough. If this mean difference is
very small, this might be an indication of MCAR.
3. Third, in SPSS (Analyze > Missing Value Analysis, EM button) it is also possible to test for
MCAR data. This is called Little's test. A tutorial of the Missing Value Analysis procedures
(SPSS 16 and later) can be found via the Help button.

It is important to note that you cannot test whether your missing data are MAR or MNAR. The
above-mentioned procedures (1 and 2) will only give you an indication. Pay attention to the
possibility of MNAR, because all analyses have serious problems when your missing data are
MNAR.
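
The idea behind method 2 can be sketched in plain Python: split the cases by whether a target variable is missing, and compare the two groups on another variable. This is only an illustration of the principle, using a Welch t statistic; it is not the SPSS implementation, and the function name and data are hypothetical.

```python
import math
import statistics


def indicator_t_test(rows, target, other):
    """Compare `other` between cases with and without a value on `target`.

    Returns the mean difference (present minus missing) and a Welch t statistic.
    """
    present = [r[other] for r in rows
               if r[target] is not None and r[other] is not None]
    missing = [r[other] for r in rows
               if r[target] is None and r[other] is not None]

    mean_diff = statistics.mean(present) - statistics.mean(missing)
    std_error = math.sqrt(
        statistics.variance(present) / len(present)
        + statistics.variance(missing) / len(missing)
    )
    return mean_diff, mean_diff / std_error


# Invented example: older respondents skip the income question more often.
rows = (
    [{"age": a, "income": 2500} for a in (30, 35, 28, 40, 33)]
    + [{"age": a, "income": None} for a in (60, 65, 58, 70)]
)

diff, t_stat = indicator_t_test(rows, target="income", other="age")
print("mean age difference:", round(diff, 2))
print("Welch t statistic  :", round(t_stat, 2))
```

A large difference in mean age between the two groups suggests that missingness on income is related to age, which points towards MAR rather than MCAR. As noted above, it says nothing about MNAR.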

4.7 How to handle missing data?


Missing data is random:
For MCAR and MAR, many missing data methods have been developed in the last two decades
(Schafer & Graham, 2002). Although MCAR seems to be the least problematic mechanism,
deleting cases can still reduce the power to find an effect. It is argued that the MAR
mechanism is the one most frequently seen in practice. An argument for this is that most research
studies multifactorial or multivariable problems, so when data on a variable are missing, the
missingness is mostly related to other variables in the dataset.

Missing data is not random:


For MNAR, imputation is not sufficient, because the missing data are systematically different from
the available data, i.e. your complete cases have become a selective group of persons. If you think
your data are MNAR, it is wise to contact a statistician from EMGO+ who is willing to help you.

For MCAR and MAR, there are roughly two kinds of techniques for imputation: single imputation and
multiple imputation.

Single imputation is possible in SPSS and is an easy way to handle missings when just a few
values are missing (less than 5%) and you think your missing values are MCAR or MAR. However,
after single imputation the cases are more similar, which may result in an underestimation of the
standard errors, i.e. confidence intervals that are too narrow. This increases the chance of a type I
error (the null hypothesis of no effect is rejected, while there is truly no effect). Therefore, this
method is less adequate when you have >5% missing data.

Multiple imputation is more complex, but is also implemented in SPSS 17.0 and later versions.
Multiple imputation takes into account the uncertainty about the missing values and is therefore
preferred over single imputation. When your missingness is high (exceeds 5% in several variables
and different persons), multiple imputation is more adequate.

Imputation techniques
Single imputation
Single imputation techniques are based on the idea that in a random sample every person can be
replaced by a new person, given that this new person is randomly chosen from the same source
population as the original person. In that case you can use the observed data of the other persons
to estimate the distribution of the test result in the source population. It is called single imputation
because each missing value is imputed once.
There are many methods for single imputation, such as replacement by the mean, regression,
and expectation maximization (EM). Expectation maximization is preferred, because in the other
methods the variance and standard error are reduced and the chance of Type I errors increases.
Expectation maximization forms a correlation matrix for the incomplete data by assuming the shape
of a distribution for the missing data, and imputes missing values based on the likelihood under that
distribution. Single imputation is possible in SPSS (Analyze > Missing Value Analysis > EM button,
for Expectation Maximization). Contact a statistician from EMGO+ who is willing to help you with
this procedure.
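
The claim that replacement by the mean reduces the variance can be checked with a few lines of plain Python. The scores below are invented for illustration.

```python
import statistics

# Invented questionnaire scores; None marks a missing value.
scores = [4.0, 7.0, None, 5.0, 9.0, None, 6.0, 8.0, None, 3.0]

observed = [s for s in scores if s is not None]
mean_observed = statistics.mean(observed)

# Mean imputation: every missing score is replaced by the observed mean.
imputed = [mean_observed if s is None else s for s in scores]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print("mean before / after:", mean_observed, statistics.mean(imputed))
print("SD before imputation:", round(sd_observed, 3))
print("SD after imputation :", round(sd_imputed, 3))
```

The mean is unchanged, but the standard deviation shrinks because the imputed values add cases without adding any spread. Standard errors based on the imputed data are therefore too small.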

For the imputation of a missing score on a single item in a questionnaire (see 4.5), SPSS
syntaxes can be found at:
http://www.tilburguniversity.edu/nl/over-tilburg-university/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/
tw.zip: Software for two-way imputation in SPSS (Van Ginkel & Van der Ark, 2003a), or
rf.zip: Software for response function imputation in SPSS (Van Ginkel & Van der Ark, 2003b).

Multiple imputation (MI)


The difference with single imputation is that in MI each missing value is imputed several times, so
that several imputed datasets are created. The different imputations are based on random draws from
different estimates of the underlying distribution in the source population. In this way, the
imputed values come from different distributions and are therefore less alike. This reflects the
uncertainty about the missing values, and therefore the standard error increases. The number of
imputations depends on the amount of missing data, but mostly 5 to 10 imputations are
enough. A drawback of this method is that several imputed datasets are created and that the
statistical analysis has to be repeated in each dataset. Finally, the results have to be pooled into a
summary measure. Most statistical packages can do this automatically. Multiple imputation is
possible in recent versions of SPSS (version 17 and later) (Analyze > Multiple Imputation > Impute
Missing Data Values). For more information see the references. Contact a statistician from EMGO+
who is willing to help you with this procedure.
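
The pooling step is commonly done with Rubin's rules: the pooled estimate is the average of the estimates from the imputed datasets, and the pooled variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance, where m is the number of imputations. The sketch below shows this in plain Python. The function name and the five example estimates are assumptions for illustration, and statistical packages handle details (such as degrees of freedom) that are left out here.

```python
import math
import statistics


def pool_rubin(estimates, standard_errors):
    """Pool one parameter over m imputed datasets using Rubin's rules.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean(se ** 2 for se in standard_errors)
    # Between-imputation variance: spread of the estimates over the datasets.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Invented regression coefficients and standard errors from 5 imputed datasets.
estimates = [2.0, 2.4, 1.8, 2.2, 2.1]
standard_errors = [0.5, 0.5, 0.5, 0.5, 0.5]

pooled, pooled_se = pool_rubin(estimates, standard_errors)
print("pooled estimate:", round(pooled, 3))
print("pooled SE      :", round(pooled_se, 3))
```

The pooled standard error is larger than the standard error from any single imputed dataset. This is the extra uncertainty that single imputation ignores.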

Sensitivity analysis
After imputation, a sensitivity analysis is needed to determine how your substantive results depend
on how you handled the missing data.
Follow these steps:
1. Do a complete case analysis (default option in SPSS; cases with missings are not
included).
2. Repeat the analysis on the data with imputed values.
3. Compare the substantive conclusions and decide how to report.

When is imputation of missing data not necessary?


1) When your missing data are MCAR or MAR, and you use Maximum Likelihood estimation
techniques in analyses such as Structural Equation Modelling (SEM) or Linear Mixed Models
(LMM), imputation of missing data is not necessary. These techniques use all available data,
ignore the missing values, and still give correct results. In such situations you do not have
to use an extra imputation technique to handle your missing values. Missing data that are
MNAR are still a problem for these methods.
2) A different approach may be used for descriptive studies. If you want to show the (observed)
study data (means and standard deviations), for example to compare them with other
countries/settings, without directly linking them to a conclusion, imputation is not immediately
needed. However, evaluative statistics (t-tests, regressions, etc.) would certainly need
complete data. So, if you use statistical tests to compare the descriptives, imputation
is needed (of course depending on the amount and type of missing data). In this final case,
you link your descriptives to a conclusion and want a corrected p-value / 95% CI, and therefore
you need to use the data with imputed values. Do not forget the reviewer, who may
sometimes have problems with the use of imputed and non-imputed data in one paper. Be clear
about imputation and point out why you choose to present imputed or non-imputed data.

4. Summary
• Make every effort to avoid missing data, or failing that, to understand how much and
why data is missing.
• Understand missing data mechanisms (MCAR, MAR, MNAR) and their implications
• Avoid default methods (listwise deletion, pairwise deletion)
• Avoid default fixups (mean imputation, etc.) where possible
• Use multiple imputation to take proper account of missings
• Do a sensitivity analysis

5. Details
I have missings in my data. Try to complete your dataset.

What is the type of missing?

MNAR:
    Ask a statistician from EMGO+ to help you.

MCAR/MAR:
    Ask a statistician from EMGO+ to help you.
    I use SEM or LMM for my analyses:
        Use ML estimation, no imputation is needed.
    I do not use SEM or LMM for my analyses:
        Imputation is needed. How much data is missing?
            < 5%: use single imputation
            > 5%: use multiple imputation
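
The decision scheme above can also be written as a small function. This is a sketch in plain Python with a hypothetical function name. The scheme does not say what to do at exactly 5% missing data, so sending that case to multiple imputation is an assumption made here.

```python
def recommend_handling(mechanism, uses_sem_or_lmm, pct_missing):
    """Follow the decision scheme for a dataset that still has missing values.

    mechanism: "MCAR", "MAR" or "MNAR"
    uses_sem_or_lmm: True if the analysis uses ML estimation in SEM or LMM
    pct_missing: percentage of missing data (0-100)
    """
    if mechanism == "MNAR":
        return "ask a statistician"
    if mechanism not in ("MCAR", "MAR"):
        raise ValueError("mechanism must be 'MCAR', 'MAR' or 'MNAR'")
    if uses_sem_or_lmm:
        return "use ML estimation, no imputation needed"
    # Assumption: exactly 5% is treated like the ">5%" branch.
    if pct_missing < 5:
        return "single imputation"
    return "multiple imputation"


print(recommend_handling("MAR", uses_sem_or_lmm=False, pct_missing=12))
print(recommend_handling("MCAR", uses_sem_or_lmm=True, pct_missing=12))
print(recommend_handling("MNAR", uses_sem_or_lmm=False, pct_missing=2))
```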

6. Appendices/references/links
Multiple Imputation Methods, Niels Smits (technical literature).
http://www2.chass.ncsu.edu/garson/pa765/missing.htm
http://www.ssc.upenn.edu/~allison/MultInt99.pdf (especially for Multiple Imputation)
Ask EMGO+ statisticians for help via:
http://www.emgo.nl/kc/preparation/research%20design/3%20Advice%20and%20support.html

EMGO+ experts on Missing Data


Martijn Heymans: [email protected]
Jos Twisk: [email protected]

Recommended (non-technical) literature.

1. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR.
Multiple imputation for missing data in epidemiological and clinical research: potential and
pitfalls. BMJ 2009;338:b2393. doi: 10.1136/bmj.b2393.
2. Allison, P.D. (2001). Missing Data (Sage University Papers Series on Quantitative
Applications in the Social Sciences, series no. 07-136). Thousand Oaks: Sage.
3. Schafer, J.L. & Graham, J.W. (2002). Missing data: Our view of the state of the art.
Psychological Methods, 7, 147-177.
4. Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to
imputation of missing values. J Clin Epidemiol. 2006; 59(10):1087-91. Review.
5. http://www.stat.psu.edu/~jls/mifaq.html (Multiple Imputation FAQ page with explanation)
6. Van Ginkel, J. R., & Van der Ark, L. A. (2003a). SPSS syntax for two-way imputation of
missing test data [computer software and manual]. Retrieved from
http://www.tilburguniversity.edu/nl/over-tilburg-university/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/
7. Van Ginkel, J. R., & Van der Ark, L. A. (2003b). SPSS syntax for response function imputation
of missing test data [computer software and manual]. Retrieved from
http://www.tilburguniversity.edu/nl/over-tilburg-university/schools/socialsciences/organisatie/departementen/mto/onderzoek/software/

7. Amendments
V1.0 1-12-2011
V1.1 5-7-2012 : addition to section When is imputation of missing data not necessary?
