Missing Data and Data Cleaning - Tagged
• SPSS can exclude missing values from an analysis automatically, but only once it
has been told which values count as missing. There are also different kinds of
‘missing’, and sometimes one particular kind is of interest in itself: working out
why so many people leave a particular question blank is one example among
many.
Missing values convention.
• There is a convention that all missing values are coded using a negative number.
The - sign makes them easy to spot, and also means that if they are included in an
analysis by mistake the results are likely to be quite strange.
“True” missing values usually coded as -9 or -99
• If a respondent simply ignores a question and writes nothing at all in response then
that is a true missing value, and is usually coded as -9, or -99.
• If we think people may refuse to answer a question, and we want to know when
they do, refusal must be offered as a response option: there must be a ‘prefer not
to answer’ box to tick (or something similar). This response is coded as a
negative number other than -9 or -99 (or -999), perhaps -7.
“Not applicable” frequently coded as -1
• These values typically occur in two part questions such as “Did you attend the research
seminar last week?” (with yes/no response options), “If yes, how interesting did you find
the presentation?”. If the respondent didn’t attend the seminar then they obviously
cannot answer the second part.
• The question would be coded using two variables, with a ‘no’ response to the first
automatically leading to a -1 code for the second. It is very important to declare the
‘not applicable’ code as a missing value so that SPSS does not include it in any analysis.
• It is possible to have a missing value response to the second part instead of not
applicable. If a respondent indicated that they did attend the seminar, but then did not
complete the second part, the missing value code would be used instead.
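The coding rules for a two-part question can be sketched in a few lines of Python. This is only an illustration, not part of SPSS: the function name, the 1/0 codes for yes/no, and the decision to code both parts as -9 when the first part is left blank are all assumptions made for the example.

```python
# Hypothetical codes following the conventions described in this handout.
MISSING = -9          # true missing: the question was left blank
NOT_APPLICABLE = -1   # the second part does not apply to this respondent


def code_seminar_answers(attended, rating):
    """Return (attended_code, rating_code) for the two-part seminar question.

    attended: "yes", "no", or None if left blank.
    rating:   an integer rating, or None if left blank.
    """
    if attended is None:
        # Assumption: without the first answer we cannot tell whether
        # the second part applies, so both parts are coded as missing.
        return MISSING, MISSING
    if attended == "no":
        # Did not attend, so the rating is 'not applicable'.
        return 0, NOT_APPLICABLE
    # Attended: a blank rating is a true missing value, not 'not applicable'.
    return 1, (rating if rating is not None else MISSING)


print(code_seminar_answers("no", None))    # (0, -1)
print(code_seminar_answers("yes", 4))      # (1, 4)
print(code_seminar_answers("yes", None))   # (1, -9)
```

Keeping the two negative codes distinct means the reason a value is absent can still be recovered later.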
Coding missing data
• To specify missing values:
• Click in the column labelled Missing in the Variable View.
• Then click on the box with three dots in it.
• You can then define the missing values in three ways:
- Discrete missing values - you can have up to three discrete values, e.g. -1, -7, -99.
- Discrete missing value - you could declare a value greater than any value you
would expect on the scale.
- Range of values - useful if you want to exclude data between two points, e.g.
scores between 5 and 10. You can also add one discrete value alongside the range.
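To see what declaring missing values achieves, here is a minimal Python sketch of the same idea. It does not use SPSS: `is_missing` and `valid_mean` are hypothetical helpers, the scores are invented, and the range is treated as inclusive at both ends as an assumption.

```python
def is_missing(value, discrete=(), value_range=None):
    """True if value is a declared discrete code or lies inside the range."""
    if value in discrete:
        return True
    if value_range is not None:
        low, high = value_range
        return low <= value <= high   # assumption: both ends are inclusive
    return False


def valid_mean(values, discrete=(), value_range=None):
    """Mean of the values that are not declared missing (None if none remain)."""
    valid = [v for v in values if not is_missing(v, discrete, value_range)]
    return sum(valid) / len(valid) if valid else None


scores = [3, 4, -9, 5, -1, 4]   # invented data containing two missing codes

# Missing codes included by mistake: the mean looks quite strange.
print(sum(scores) / len(scores))                    # 1.0

# Missing codes declared and excluded.
print(valid_mean(scores, discrete=(-1, -7, -9)))    # 4.0
```

The first result shows why negative codes are helpful: when they slip into an analysis, the answer is obviously wrong.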
• Go back and read the handout on Missing values.
• Make sure you understand the differences between the percent column and valid
percent column.
• Make sure you understand where the missing data have gone to.
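The difference between the two columns can be worked through by hand. The sketch below is an illustration in Python rather than SPSS output, with invented answers and a hypothetical `frequency_table` helper: percent divides each count by all cases, while valid percent divides by the non-missing cases only.

```python
from collections import Counter


def frequency_table(values, missing_codes):
    """Return rows of (value, count, percent, valid_percent).

    percent uses all cases as the base; valid_percent uses only the
    non-missing cases and is None for the missing codes themselves.
    """
    counts = Counter(values)
    total = len(values)
    n_valid = sum(n for v, n in counts.items() if v not in missing_codes)
    rows = []
    for value in sorted(counts):
        n = counts[value]
        percent = 100 * n / total
        valid_percent = None if value in missing_codes else 100 * n / n_valid
        rows.append((value, n, percent, valid_percent))
    return rows


answers = [1, 1, 2, 2, 2, -9, -9, 1, 2, -9]   # invented data, -9 = missing
for row in frequency_table(answers, missing_codes={-9}):
    print(row)
```

Here the three -9 cases still appear in the count and percent columns, but they are left out of the base for valid percent, so the valid percentages of the real answers sum to 100.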
Data Cleaning
Open the SPSS file <Hooray for mistakes with mistakes.sav>
Errors
• Errors fall into two categories: ‘definitely wrong’ and ‘likely to be wrong but needs
to be checked’.
• For example, it is extremely unlikely that either a lecturer or a student would have
an income of over £100,000 per year, but it is technically possible so it would
need to be checked against the original questionnaire.
• Of course, it is then possible that the figure was entered wrongly on the
questionnaire itself! This is where more complex data cleaning comes in: by
checking the response to one question against the response to another, errors of
this kind can be identified.
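Checking one response against another can be expressed as a rule that every record should satisfy. The following Python sketch is purely illustrative: the field names (`age_wave1`, `age_wave2`), the records and the `find_inconsistent` helper are hypothetical and do not come from the practical dataset.

```python
def find_inconsistent(records, rule):
    """Return the IDs of the records for which rule(record) is False."""
    return [r["id"] for r in records if not rule(r)]


# Invented records from an imaginary two-wave study.
records = [
    {"id": 1, "age_wave1": 34, "age_wave2": 36},
    {"id": 2, "age_wave1": 51, "age_wave2": 15},   # appears to get younger
    {"id": 3, "age_wave1": 22, "age_wave2": 24},
]

# Rule: nobody should be younger at the second wave than at the first.
print(find_inconsistent(records, lambda r: r["age_wave2"] >= r["age_wave1"]))  # [2]
```

A flagged ID only says that the two answers disagree; the original questionnaire is still needed to decide which one is wrong.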
Finding Errors - Example
• The easiest errors to find are those where a categorical variable has been
incorrectly coded using an out-of-range value.
• There is only one categorical variable in the dataset, ‘Group’, so we can run a
frequency table to see if there are any out-of-range values.
• The ‘3’ in the frequency table suggests either that a value has been entered
wrongly in the SPSS file, or that there is a further category that has not been coded.
• All the other variables are scale, so frequencies will tell us nothing useful.
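The logic of using a frequency table to spot out-of-range codes can be sketched as follows. This is a Python illustration with invented data, not the contents of the practical file; the helper `out_of_range` is hypothetical and the valid codes are assumed to be 1 and 2.

```python
from collections import Counter


def out_of_range(values, valid_codes):
    """Return {value: count} for every value that is not a valid code."""
    counts = Counter(values)
    return {v: n for v, n in counts.items() if v not in valid_codes}


group = [1, 2, 1, 1, 2, 2, 3, 1, 2, 2]   # invented 'Group' column

# The frequency table itself: each value with its count.
print(sorted(Counter(group).items()))            # [(1, 4), (2, 5), (3, 1)]

# Only the values outside the assumed valid codes.
print(out_of_range(group, valid_codes={1, 2}))   # {3: 1}
```

This works for categorical variables because they take only a handful of values, so every code can be inspected at a glance.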
• However, if we run some descriptives then we can see the minimum and
maximum values for each variable.
• Through examining this we can identify places where something looks ‘not quite
right’.
• It is possible to run the descriptives for all the scale variables at the same time.
• Go to “Analyze”, then “Descriptive Statistics”, then “Descriptives”. Click all of the scale
variables across into the Variable(s) box and then click on the ‘Options’ button. For the
moment we will only look at the ‘Minimum’ and ‘Maximum’ values, so uncheck the other
options to leave just those boxes ticked. Click ‘Continue’ then ‘OK’ and before you can
blink SPSS will have produced the table.
• Unlike with the categorical variable, it is not immediately apparent whether there are any
errors in the data. Take some time to look at this table and make a note of where you think
errors may be.
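The same minimum and maximum check can be sketched outside SPSS. The Python below is an illustration only: the variable names and values are invented and are deliberately not those of the practical file, and `min_max` is a hypothetical helper.

```python
def min_max(dataset):
    """Return {variable: (minimum, maximum)} for each column of numbers."""
    return {name: (min(values), max(values)) for name, values in dataset.items()}


# Invented scale variables, each containing one suspicious value.
data = {
    "age": [19, 21, 210, 20],
    "height_cm": [172, 165, 18, 180],
}

for name, (lowest, highest) in min_max(data).items():
    print(f"{name:10} min={lowest:>4} max={highest:>4}")
```

An implausible minimum or maximum does not prove an error, but it shows which variables need to be checked against the original questionnaires.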
Potential Errors
Both income and neuroticism have large differences between minimum and
maximum values. A maximum of £50,000 looks OK, but a minimum of £10 per
year? Maybe it is correct, or maybe someone missed a couple of zeros off? The
maximum for neuroticism, 14,000, is very high. It would be extremely unlikely that
any scale would have a maximum so large, so again that is worth checking. The
minimum for friends and alcohol consumption is zero. Again, this is possible, but is
it likely?
• We now have to go back to the data to find where these potential errors are.
Finding Errors (1)
• Like most programs, SPSS has a ‘find’ facility. So either highlight the ‘group’
column or click on the top cell of the column and then go to “Edit”, then “Find”.
Type 3 in the box and click ‘Find Next’. The offending 3 will be highlighted.
• Follow the row towards the left-hand edge to find the ID number, 7. We could
then go to the original questionnaire to see which group this person should
belong to, or to see if there should have been an additional category. We know
from the frequency table that there is only one 3, so we have found it!
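What the ‘find’ facility does here is locate every row holding a given value so that its ID number can be noted. A minimal Python sketch of that idea, using invented rows and a hypothetical `find_ids` helper (only the ID 7 and the value 3 are taken from the example above):

```python
def find_ids(rows, variable, target):
    """Return the IDs of all rows where the given variable equals target."""
    return [row["id"] for row in rows if row[variable] == target]


# Invented extract of the data; the respondent with ID 7 has group coded 3.
rows = [
    {"id": 5, "group": 1},
    {"id": 6, "group": 2},
    {"id": 7, "group": 3},
    {"id": 8, "group": 2},
]

print(find_ids(rows, "group", 3))   # [7]
```

Returning a list of IDs rather than a single ID covers the case where the same wrong value was entered more than once.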
Finding Errors (2)
• Whenever an error is corrected, the check that revealed it must be re-run to make
sure the correction has worked. Replace the 3 with a 2 and re-run the frequency
table. It should now look OK.
• We can follow the same procedure to find the 14,000 neuroticism score, but we do
not know if there is one score or many. Just keep clicking on ‘find next’ until there
are no more to be found. Once again make a note of the ID number. This time
looking at the original questionnaire reveals that the number should have been 14
so change it to a 14 and re-run the analysis. Follow the same procedure to find the
ID number of the respondent with 0 friends and the respondent with an income of
£10 per year.
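The correct-then-re-run cycle can be sketched as follows. This is a Python illustration, not an SPSS procedure: the rows, the ID 12 and the helper `apply_correction` are hypothetical, while the change from 14,000 to 14 follows the example above.

```python
def apply_correction(rows, row_id, variable, new_value):
    """Change one value in place and return how many rows were updated."""
    updated = 0
    for row in rows:
        if row["id"] == row_id:
            row[variable] = new_value
            updated += 1
    return updated


# Invented extract; the ID number 12 is made up for this example.
rows = [
    {"id": 11, "neuroticism": 12},
    {"id": 12, "neuroticism": 14000},
    {"id": 13, "neuroticism": 17},
]

print(max(r["neuroticism"] for r in rows))      # 14000

# Value confirmed against the original questionnaire, then corrected.
apply_correction(rows, 12, "neuroticism", 14)

# Re-run the same check to confirm that the correction has worked.
print(max(r["neuroticism"] for r in rows))      # 17
```

Checking that exactly one row was updated is a simple guard against correcting the wrong case or a duplicated ID.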
Importance of Data Cleaning
• Data cleaning is important because these errors do happen! Examples include
grandparents aged 22 and people getting younger over the course of a
longitudinal study.
• Data cleaning is an important final part of the data entry process; working with
uncleaned data can seriously damage your analysis.
• N.B. When working with official secondary data, the data will usually have been
cleaned before being made public.