Missing Data and Data Cleaning - Tagged

Uploaded by

Asad Ashraf

SXU4004: SPSS 2

Missing Data and Data Cleaning
Dr. Alison J. Orrell
Missing Data
Missing Values
• When a respondent completes a questionnaire there will invariably be some
questions that have not been answered, either because of the design of the
questionnaire or because the respondent neglected to answer them. It may be
tempting to think these “do not count”, but they do count, and they matter.

• SPSS can be set to automatically exclude missing values from an analysis, but
it has to be told to do so. There are also different kinds of ‘missing’, and
sometimes we are interested in one kind in particular. Trying to work out why so
many people left a particular question blank is one instance, but there are many
more.
Missing values convention.
• There is a convention that all missing values are coded using a negative number.
The - sign makes them easy to spot, and also means that if they are included in an
analysis by mistake the results are likely to be quite strange.
“True” missing values usually coded as -9 or -99
• If a respondent simply ignores a question and writes nothing at all in response then
that is a true missing value, and is usually coded as -9, or -99.

• It is important that a distinction is made between this and ‘refused to answer’. We
cannot assume that a blank response is a refusal; the respondent may simply have
not seen the question or they could have been interrupted and forgotten about it.

• If we think people may refuse to answer a question and want to know whether
they do, then refusal must be an explicit option: there must be a ‘prefer not to
answer’ box to tick (or something similar). This response would be coded as a
negative number other than -9 or -99 (or -999), perhaps -7.
“Not applicable” frequently coded as -1
• These values typically occur in two part questions such as “Did you attend the research
seminar last week?” (with yes/no response options), “If yes, how interesting did you find
the presentation?”. If the respondent didn’t attend the seminar then they obviously
cannot answer the second part.

• The question would be coded using two variables, a ‘no’ response for the first
automatically leading to a -1 code for the second. It is very important to declare this
‘not applicable’ code as a missing value so that SPSS does not include it in any analysis.

• It is possible to have a missing value response to the second part instead of not
applicable. If a respondent indicated that they did attend the seminar, but then did not
complete the second part, the missing value code would be used instead.
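
The coding rules for the two-part seminar question can be sketched in a few lines of Python. This is only an illustration, not part of SPSS: the function name, the 0/1 codes for no/yes, and the handling of a blank first part are all assumptions made for the example.

```python
# Codes following the convention described above.
MISSING = -9         # question left blank ("true" missing value)
NOT_APPLICABLE = -1  # question did not apply to this respondent


def code_seminar_question(attended, rating):
    """Return (attended_code, rating_code) for the two-part seminar question.

    attended: "yes", "no", or None if the first part was left blank.
    rating:   an integer rating, or None if the second part was left blank.
    The 0 = no / 1 = yes coding is an assumption for this sketch.
    """
    if attended is None:
        # First part blank: we cannot tell whether the second part applies,
        # so both parts are treated as true missing values (an assumption).
        return MISSING, MISSING
    if attended == "no":
        # A 'no' automatically leads to 'not applicable' for the second part.
        return 0, NOT_APPLICABLE
    if rating is None:
        # Attended but skipped the rating: true missing, not 'not applicable'.
        return 1, MISSING
    return 1, rating


print(code_seminar_question("no", None))   # (0, -1)
print(code_seminar_question("yes", None))  # (1, -9)
print(code_seminar_question("yes", 4))     # (1, 4)
```

Keeping -1 and -9 separate preserves the difference between "could not answer" and "did not answer", which would be lost if both were left as blanks.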
Coding missing data
• To specify missing values:
• Click in the column labelled Missing in the Variable View.
• Then click on the box with three dots in it.
• You can choose to define the missing values in three ways:
- Discrete missing values – you can have up to three discrete values, e.g. -1, -7, -99.
- Discrete missing value – you could use a value greater than any value you
would expect on the scale.
- Range of values – useful if you want to exclude data between two points, e.g.
scores between 5 and 10. You can also add one discrete value to a range.
• Go back and read the handout on Missing values.

• Rerun your frequency analysis with some missing data.

• Make sure you understand the differences between the percent column and valid
percent column.

• Make sure you understand where the missing data have gone too.
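
As a rough illustration of the difference between the percent and valid percent columns, the sketch below builds a small frequency table in plain Python. The responses are made up for the example and the function name is hypothetical; SPSS produces the equivalent table itself once the missing codes have been declared.

```python
from collections import Counter

# Hypothetical responses to one question; -9 is declared as a missing value.
responses = [1, 2, 2, 1, -9, 2, 1, 2, -9, 2]
missing_codes = {-9}


def frequency_table(values, missing):
    """Return {value: (count, percent, valid_percent)}.

    percent uses all cases as the denominator; valid_percent uses only the
    non-missing cases, so missing codes get a valid_percent of None.
    """
    counts = Counter(values)
    total = len(values)
    valid_total = sum(n for v, n in counts.items() if v not in missing)
    table = {}
    for value in sorted(counts):
        n = counts[value]
        percent = 100 * n / total
        valid = None if value in missing else 100 * n / valid_total
        table[value] = (n, percent, valid)
    return table


table = frequency_table(responses, missing_codes)
for value, (n, pct, valid) in table.items():
    print(value, n, pct, valid)
```

With these made-up data, the value 1 accounts for 30% of all cases but 37.5% of valid cases, because the two missing responses are dropped from the valid percent denominator.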
Data Cleaning
Open the SPSS file <Hooray for mistakes with mistakes.sav>
Errors
• Errors fall into two categories: ‘definitely wrong’ and ‘likely to be wrong but needs
to be checked’.

• For example, it is extremely unlikely that either a lecturer or a student would have
an income of over £100,000 per year, but it is technically possible so it would
need to be checked against the original questionnaire.

• Of course it is then possible that the figure was entered wrongly on the
questionnaire! This is where complex data cleaning comes in: by checking the
response to one question against the response to another, errors can be identified.
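
A cross-check of one response against another might look like the sketch below. The rule used here (a respondent's age should not go down between two waves of a study) and all of the data are hypothetical, chosen only to show the idea.

```python
# Hypothetical records: age recorded at two waves of a longitudinal study.
records = [
    {"id": 1, "age_wave1": 34, "age_wave2": 36},
    {"id": 2, "age_wave1": 51, "age_wave2": 49},
    {"id": 3, "age_wave1": 27, "age_wave2": 29},
]


def got_younger(records):
    """Return the IDs of respondents whose age decreased between waves."""
    return [r["id"] for r in records if r["age_wave2"] < r["age_wave1"]]


flagged = got_younger(records)
print(flagged)  # [2]
```

A flagged ID is only a prompt to go back to the original questionnaire; the check cannot say which of the two values is the wrong one.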
Finding Errors - Example
• The easiest errors to find are those where a categorical variable has been
incorrectly coded using an out of range value.

• There is only one categorical variable in the dataset: ‘Group’, so we can run a
frequency table to see if there are any out of range values.

• First of all produce a frequency table for the variable ‘Group’.

• Look at the Output Window. We can now see that there are 5 respondents in the
lecturer category, 4 in the student category and 1 in a category coded ‘3’.

• The ‘3’ suggests either that a value has been entered wrongly in the SPSS file, or
that there is a further category that has not been coded.
• All the other variables are scale, so frequencies will tell us nothing useful.

• However, if we run some descriptives then we can see the minimum and
maximum values for each variable.

• Through examining this we can identify places where something looks ‘not quite
right’.
• It is possible to run the descriptives for all the scale variables at the same time.

• Go to “Analyze”, then “Descriptive Statistics”, then “Descriptives”. Click all of the scale
variables across into the Variable(s) box and then click on the ‘Options’ button. For the
moment we will only look at the ‘Minimum’ and ‘Maximum’ values, so uncheck the other options to
leave just those boxes ticked. Click ‘Continue’ then ‘OK’ and before you can blink SPSS will
have produced the table.

• Unlike the categorical variable it is not immediately apparent if there are any errors in the
data. Take some time to look at this table and make a note of where you think errors may
be.
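
The same minimum and maximum check can be sketched outside SPSS. The rows below are invented stand-ins for the scale variables, not the contents of the actual data file, and the function name is hypothetical.

```python
# Hypothetical rows standing in for the scale variables in the data file.
rows = [
    {"id": 1, "income": 50000, "neuroticism": 12, "friends": 5},
    {"id": 2, "income": 10, "neuroticism": 14000, "friends": 0},
    {"id": 3, "income": 18000, "neuroticism": 9, "friends": 3},
]


def min_max(rows, variables):
    """Return {variable: (minimum, maximum)} for each named variable."""
    summary = {}
    for name in variables:
        values = [row[name] for row in rows]
        summary[name] = (min(values), max(values))
    return summary


summary = min_max(rows, ["income", "neuroticism", "friends"])
for name, (low, high) in summary.items():
    print(name, low, high)
```

As with the SPSS table, the output only shows that a suspicious value exists somewhere in the column; locating it is a separate step.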
Potential Errors
• Both income and neuroticism have large differences between minimum and
maximum values. A maximum of £50,000 looks OK, but a minimum of £10 per
year? It may be correct, or someone may have missed a couple of zeros off. The maximum
for neuroticism is high, at 14,000. It would be extremely unlikely that any scale
would have a maximum so large, so that is worth checking too. The minimum for
friends and alcohol consumption is zero. Again, this is possible, but is it likely?

• We now have to go back to the data to find where these potential errors are.
Finding Errors (1)
• Like most programs, SPSS has a ‘find’ facility. Either highlight the ‘Group’
column or click on the top cell of the column and then go to “Edit”, then “Find”.
Type 3 in the box and click ‘Find Next’. The offending 3 will be highlighted.

• It is possible to follow the row towards the left hand edge to find the ID number, 7.
We could then go to the original questionnaire to see which group this person
should belong to, or to see if there should have been an additional category. We
know from the frequency table that there is only one 3, so we have found it!
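
The frequency table and the ‘find’ step can be combined in a short sketch. The (ID, group) pairs below are hypothetical values chosen to match the counts reported above, and the coding 1 = lecturer, 2 = student is an assumption.

```python
from collections import Counter

# Hypothetical (id, group) pairs; assumed coding: 1 = lecturer, 2 = student.
records = [(1, 1), (2, 1), (3, 2), (4, 1), (5, 2),
           (6, 2), (7, 3), (8, 1), (9, 2), (10, 1)]
valid_codes = {1, 2}

# Frequency table: how many respondents fall under each group code.
counts = Counter(group for _, group in records)

# 'Find' step: which respondent IDs carry a code outside the valid set.
bad_ids = [rid for rid, group in records if group not in valid_codes]

print(dict(counts))
print(bad_ids)  # [7]
```

Listing the IDs directly has the advantage that every out-of-range case is reported at once, rather than one ‘Find Next’ click at a time.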
Finding Errors (2)
• Whenever an error is corrected, the analysis must be re-run to make sure the
correction has worked. Replace the 3 with a 2 and re-run the frequency table. It should now
look OK.

• We can follow the same procedure to find the 14,000 neuroticism score, but we do
not know if there is one score or many. Just keep clicking on ‘find next’ until there
are no more to be found. Once again make a note of the ID number. This time
looking at the original questionnaire reveals that the number should have been 14
so change it to a 14 and re-run the analysis. Follow the same procedure to find the
ID number of the respondent with 0 friends and the respondent with an income of
£10 per year.
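
The correct-then-re-run cycle can be sketched as follows. The scores, the respondent IDs and the upper bound of 100 are all assumptions for illustration; only the correction from 14,000 to 14 comes from the example above.

```python
# Hypothetical neuroticism scores keyed by respondent ID.
scores = {1: 12, 2: 9, 3: 14000, 4: 17}
plausible_max = 100  # assumed upper bound, for illustration only


def out_of_range(scores, upper):
    """Return the sorted IDs whose score exceeds the plausible maximum."""
    return sorted(rid for rid, value in scores.items() if value > upper)


before = out_of_range(scores, plausible_max)

# Correction confirmed against the original questionnaire: 14000 should be 14.
corrections = {3: 14}
scores.update(corrections)

# Re-run the same check to confirm nothing is still out of range.
after = out_of_range(scores, plausible_max)

print(before)  # [3]
print(after)   # []
```

Recording corrections in one place, as in the `corrections` dictionary here, also leaves a note of which values were changed and why.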
Importance of Data Cleaning
• Data cleaning is important because these errors do happen: grandparents aged
22, people getting younger over the course of a longitudinal study, and so on.

• Data cleaning is an important final part of the data entry process; working with
uncleaned data can seriously damage your analysis.

• N.B. When working with official secondary data the data will have been cleaned
before being made public.
