SEMINAR Data Screening
Dataset
Please download ‘Data – Data Screening.sav’ from the course Moodle page to your Desktop.
Learning Outcomes
Successful completion of the session means you will be able to use SPSS to perform
comprehensive data screening and cleaning. In particular, you will be able to:
1. Screen for missing data, violations of normality, outliers, linearity and
homoscedasticity
2. Tackle problems revealed by data screening appropriately
Dataset to be Used
This hypothetical dataset of 465 male athletes contains information on various characteristics
that might predict sporting success. We would typically go on to explore how well these
characteristics predicted some dependent measure of sporting performance (e.g. time taken
to run a mile), probably in a multiple regression analysis (covered in a few weeks). But for the
purposes of this exercise we are only performing data screening and cleaning in order to
ensure the data would be ready for this, or any, type of multivariable analysis.
Continuous variables
Height (in cm)
Age (in years)
Weight (in lbs)
Strength (rated 0 - 10 on some performance test)
Speed (graded on a 0 - 5 scale on some performance test)
Categorical variables
High-protein diet (0=no, 1=yes)
Train more than 4 times a week (0=no, 1=yes)
Note: To keep the length of the seminar session manageable, we will only perform data
screening on a subset of variables (usually on variables where there is a problem). You would
usually perform full data screening on all variables. Note also that there are often alternative,
equally valid, data screening procedures that are not performed in this session.
Question 1 – Missing Data
First, we will check how many values are missing for each variable. Go to
Analyse > Missing Value Analysis
Put the continuous variables into the ‘Quantitative Variables’ box and the categorical
variables into the ‘Categorical Variables’ box (see below)
(a) From the ‘Univariate Statistics’ table, which variables have missing data?
(the relevant column is highlighted below)
(b) For one of these variables, the single missing case could simply be deleted (i.e. excluded
from the analysis). For the other, however, excluding missing cases could be potentially
problematic – why?
If the people with missing values for strength differ in some way (e.g. in age, height etc.)
from the people for whom strength values are present, this suggests these cases (people) are
not missing at random. You can find out by comparing the group with missing strength values
vs. the group with strength values present, using t-tests on the other continuous variables.
The second table in the SPSS output gives you the results of these t-tests. Each column
shows a t-test comparing the strength-missing group with the strength-present group, with
that column's variable as the DV; e.g. column 1 below shows the results of a t-test
with height as the DV (and strength missing vs. strength present as the IV).
Note: t-tests are not performed for DVs of protein diet or training as these are not continuous
DVs (which is a requirement of t-tests).
(c) Row 3 of the t-tests table shows the p-values. Are there any significant differences for
missing vs. non-missing cases for any of the continuous variables? Do the missing cases
therefore appear to be missing randomly or non-randomly?
Although we will not do this now, the Missing Value Analysis option in SPSS offers several
methods of imputing (estimating) missing data.
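The logic of Question 1 can be sketched outside SPSS as follows. This is a minimal illustration only: the toy values are hypothetical (not from the seminar dataset), missing values are represented as None, and a separate-variance (Welch) t statistic is assumed for the missing vs. present comparison. The helper names count_missing and welch_t are invented for this sketch.

```python
import math
from statistics import mean, variance

def count_missing(values):
    """Number of missing entries (missing values are represented as None)."""
    return sum(v is None for v in values)

def welch_t(group_a, group_b):
    """Separate-variance (Welch) t statistic and its approximate df."""
    na, nb = len(group_a), len(group_b)
    va = variance(group_a) / na  # squared standard error of mean, group A
    vb = variance(group_b) / nb  # squared standard error of mean, group B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical toy data: None marks a missing strength score.
strength = [6, None, 7, 5, None, 8, 6, None]
age = [24, 31, 22, 25, 33, 23, 26, 30]

# Split age (the DV) by whether strength is present or missing (the IV).
age_present = [a for a, s in zip(age, strength) if s is not None]
age_missing = [a for a, s in zip(age, strength) if s is None]

n_missing = count_missing(strength)
t, df = welch_t(age_present, age_missing)
print(n_missing, round(t, 2), round(df, 1))
```

A large absolute t (relative to its df) on any continuous variable would suggest that the cases with missing strength scores differ systematically from the rest.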
Question 2 – Normality and Univariate Outliers
To check for univariate outliers and normality, inspect the histograms for the continuous
variables in the dataset. Go to
Analyse > Descriptive Statistics > Frequencies
Put height, age, weight, strength and speed in the Variables box
Hit the Charts button, select Histograms and click Continue
Hit the Statistics button and from the pop-up window, check skewness and
kurtosis and click Continue
Untick the option to display frequency tables, then click OK
(a) Write down the name of the variable that appears to contain an obvious univariate outlier.
(b) In this case we will delete the outlier. The easiest way to identify the outlier is to order
weight values from highest to lowest. To do this, right-click the name of the problem variable
(at the top of the column) and choose Sort Descending.
Look at the data and you will see there are two values of 234 (lbs); delete these values.
NOTE: It is very important (A) that any deletion of data is performed on a COPY of the
dataset, and (B) that you consider very carefully whether deletion is the right option (the
lecture notes provide more detailed notes on handling outliers). In this case deletion is
unproblematic because the number of univariate outliers is so small.
(c) Which variable appears to be skewed, and is this skew positive or negative? (The ‘Statistics’
table also gives information about skewness and kurtosis and should confirm what the chart
suggests.)
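To see what the skewness and kurtosis figures measure, the sketch below computes them by hand. It uses the bias-adjusted sample formulas commonly reported by statistics packages; that these match the SPSS output exactly is an assumption, and the speed values are hypothetical.

```python
from statistics import mean, stdev

def sample_skewness(xs):
    """Bias-adjusted sample skewness (positive = long right tail)."""
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def sample_excess_kurtosis(xs):
    """Bias-adjusted sample excess kurtosis (0 for a normal distribution)."""
    n = len(xs)
    m, s = mean(xs), stdev(xs)
    z4 = sum(((x - m) / s) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical scores: most values are low, with a few high ones.
speed = [0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.9, 1.5, 2.8, 4.6]
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(speed) > 0)             # long right tail
print(abs(sample_skewness(symmetric)) < 1e-9)  # no skew
```

A clearly positive skewness value corresponds to a histogram with a long tail to the right; a value near zero corresponds to a roughly symmetric histogram.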
Given the apparent non-normality of Speed, remedial action is required. We can try to reduce
the positive skew by a logarithmic transformation.
Plot the histogram for log_speed as shown below, using the same procedure as before.
Note: Sometimes (not here) you will need to add ‘+1’ to the variable you are transforming to
ensure you do not ask SPSS to calculate the log of 0, which is undefined.
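A minimal sketch of the transformation, including the optional ‘+1’ offset, is given here. The base of the logarithm is not stated in this handout, so base 10 is an assumption; the function name log10_transform and the data are hypothetical.

```python
import math

def log10_transform(values, offset=0.0):
    """Base-10 log of each value after adding a constant offset."""
    if any(v + offset <= 0 for v in values):
        raise ValueError("log is undefined for values <= 0; use a larger offset")
    return [math.log10(v + offset) for v in values]

# Hypothetical positively skewed scores with no zeros: no offset needed.
speed = [0.2, 0.5, 1.0, 2.0, 5.0]
log_speed = log10_transform(speed)

# A variable containing 0 needs the '+1' offset before taking logs.
with_zero = log10_transform([0, 9, 99], offset=1)  # logs of 1, 10, 100
print(log_speed, with_zero)
```

The transformation compresses large values more than small ones, which is why it pulls in a long right tail.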
Question 3 - Homoscedasticity & Linearity
Use the Scatterplot command from the Graphs menu, as shown below, to plot the
relationship between strength (x-axis) and weight (y-axis).
Note: In a full data screening, you would usually examine scatterplots for more variable
pairings.
HOMEWORK EXTRA
Try these additional exercises at home (or in class if you have finished all of the other exercises)
to develop a more in-depth understanding of how to handle outliers in SPSS.
Univariate Outliers
Another way to identify the highest and lowest values for weight, or any other variable, is to
use the Explore option:
Go to Analyse > Descriptive Statistics > Explore and enter weight in the Dependent List box.
Then click on the Statistics button, select Outliers and Percentiles, then click Continue
and OK.
The resulting output will give you a boxplot of the data to help you identify univariate outliers
(as an alternative to histograms). The ‘Extreme Values’ table will also give you the 5 lowest
and 5 highest values for this variable along with the row number showing where these values
occur (under the ‘Case Number’ column).
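The two pieces of output described above can be imitated with a short sketch. The weights are hypothetical, and the boxplot rule used here (values more than 1.5 interquartile ranges beyond the quartiles) is an assumption about the usual convention; SPSS may define the quartiles slightly differently, so borderline cases could differ.

```python
from statistics import quantiles

def extreme_values(values, k=5):
    """k highest and k lowest values as (case number, value), 1-based cases."""
    cases = sorted(enumerate(values, start=1), key=lambda c: c[1])
    highest = cases[-k:][::-1]  # largest value first
    lowest = cases[:k]          # smallest value first
    return highest, lowest

def boxplot_outliers(values, multiplier=1.5):
    """Cases lying more than multiplier * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [(case, v) for case, v in enumerate(values, start=1)
            if v < low or v > high]

# Hypothetical weights in lbs; two cases share an extreme value.
weight = [150, 162, 158, 171, 234, 166, 149, 175, 160, 234, 155, 168]

highest, lowest = extreme_values(weight, k=2)
print(highest, lowest)
print(boxplot_outliers(weight))
```

The case numbers returned play the same role as the ‘Case Number’ column: they tell you which rows to inspect in the data.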
Multivariate Outliers
To check for multivariate outliers (see next page for SPSS screenshot):
Note: The multivariate outlier statistic we want looks only at the independent variables, so we
aren’t interested in what goes in the dependent variable box as it isn’t actually included in the
calculations. A dependent variable is entered purely to get the analysis to run.
You are not interested in the output of this regression analysis, but it calculates the
Mahalanobis distance (MD) values you want and puts them in the SPSS dataset for you. Have a
look at the dataset and you will
see the new MAH_1 variable which has a value for every participant.
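To give a feel for what an MD value captures, the sketch below computes squared Mahalanobis distances for just two variables, where the formula can be written with z-scores and the correlation r. It assumes the saved statistic is the squared distance (the quantity normally compared against chi-square); the data and the function name mahalanobis_sq_2d are hypothetical.

```python
from statistics import mean, stdev

def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each case from the centroid (2 variables)."""
    n = len(xs)
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)  # Pearson correlation
    return [(a * a - 2 * r * a * b + b * b) / (1 - r * r)
            for a, b in zip(zx, zy)]

# Hypothetical data: the last case is of average height but unusually heavy.
height = [170, 175, 180, 185, 190, 172, 188, 178]
weight = [150, 160, 170, 180, 190, 154, 186, 220]

d2 = mahalanobis_sq_2d(height, weight)
most_unusual_case = d2.index(max(d2)) + 1  # 1-based case number
print(most_unusual_case, [round(d, 2) for d in d2])
```

Because the distance takes the correlation between variables into account, a case can have a large MD through an unusual combination of values even when each value alone looks unremarkable.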
You can identify who is a multivariate outlier by finding the participants whose value on
MAH_1 exceeds the critical value of χ². This critical value corresponds to that for p =
0.001 and df = the number of IVs; see the MV Outliers slide of the lecture notes. (You could
use the Compute command on the Transform menu to find the relevant critical value of χ², by
using the IDF.CHISQ function that SPSS provides.) The critical value when df = 7 is 24.32.
The easiest way to identify if any MD values have exceeded this critical value is to Sort
MAH_1 in descending order as we did earlier with the univariate outlier for Weight.
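The critical value and the comparison against it can be checked with a short sketch. This is an illustration rather than the SPSS implementation: the chi-square CDF is computed from a standard series for the incomplete gamma function, the critical value is found by bisection, and the MAH_1 values are hypothetical.

```python
import math

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = total = 1.0 / a
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= half_x / (a + n)
        total += term
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))

def chi2_critical(p, df):
    """Value whose upper-tail probability is p, found by bisection."""
    lo, hi = 0.0, float(df)
    while chi2_cdf(hi, df) < 1 - p:  # widen the bracket until it contains the answer
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < 1 - p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

critical = chi2_critical(0.001, 7)  # p = 0.001, df = 7 IVs

# Hypothetical MAH_1 values keyed by case number (not from the seminar dataset).
mah_1 = {12: 31.7, 205: 9.4, 318: 25.1, 77: 3.2}
flagged = sorted((case for case, d in mah_1.items() if d > critical),
                 key=lambda case: mah_1[case], reverse=True)
print(round(critical, 2), flagged)
```

Sorting the flagged cases from largest to smallest distance mirrors the Sort Descending step: any multivariate outliers appear at the top.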