Exploring Data: The Beast of Bias
Sources of Bias
A bit of revision. We’ve seen that having collected data we usually fit a model that represents the hypothesis that we
want to test. This model is usually a linear model, which takes the form of:
outcomei = (b0 + b1X1i + b2X2i + … + bnXni) + errori     (Eq. 1)
Outliers
An outlier is a score very different from the rest of the data. When I published my first book (Field, 2000), I obsessively
checked the book’s ratings on Amazon.co.uk. Customer ratings can range from 1 to 5 stars, where 5 is the best. Back in
2002, my first book had seven ratings (in the order given) of 2, 5, 4, 5, 5, 5, and 5. All but one of these ratings are fairly
similar (mainly 5 and 4) but the first rating was quite different from the rest—it was a rating of 2 (a mean and horrible
rating). The mean of these scores was 4.43. The only score that wasn’t a 4 or 5 was the first rating of 2. This score is an
example of an outlier—a weird and unusual person (sorry, I mean score) that deviates from the rest of humanity (I
mean, data set). The mean of the scores when the outlier is not included is 4.83 (it increases by 0.4). This example shows
how a single score, from some mean-spirited badger turd, can bias a parameter such as the mean: the first rating of 2
drags the average down. Based on this biased estimate new customers might erroneously conclude that my book is
worse than the population actually thinks it is.
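If you want to check the arithmetic for yourself, here is a minimal sketch in Python (the ratings are the seven listed above; a calculator would do the same job):

ratings = [2, 5, 4, 5, 5, 5, 5]                         # the seven Amazon ratings, in order
mean_all = sum(ratings) / len(ratings)                  # 4.43: the estimate biased by the outlier
mean_without_outlier = sum(ratings[1:]) / len(ratings[1:])  # 4.83: the mean once the rating of 2 is excluded
print(round(mean_all, 2), round(mean_without_outlier, 2))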
To plot a histogram we use Chart Builder (see last week’s handout) which is accessed through the
menu. Select Histogram in the list labelled Choose from to bring up the gallery shown in Figure 1. This gallery has four
icons representing different types of histogram, and you should select the appropriate one either by double-clicking on
it, or by dragging it onto the canvas in the Chart Builder:
® Simple histogram: Use this option when you just want to see the frequencies of scores for a single variable (i.e.
most of the time).
® Stacked histogram and Population Pyramid: If you had a grouping variable (e.g., whether men or women attended the festival) you could produce a histogram in which each bar is split by group (stacked histogram), or a population pyramid, which plots the outcome (in this case hygiene) on the vertical axis and each group (i.e., men vs. women) on the horizontal axis (i.e., the histograms for men and women appear back to back on the graph).
® Frequency Polygon: This option displays the same data as the simple histogram except that it uses a line instead
of bars to show the frequency, and the area below the line is shaded.
Figure 2: Plotting a histogram
Figure 3: Histogram of the Day 1 Download Festival hygiene scores
Select the menu, then select Boxplot in the list labelled Choose from to bring up the gallery shown
in Figure 4. There are three types of boxplot you can choose:
® Simple boxplot: Use this option when you want to plot a boxplot of a single variable, but you want different
boxplots produced for different categories in the data (for these hygiene data we could produce separate
boxplots for men and women).
® Clustered boxplot: This option is the same as the simple boxplot except that you can select a second categorical
variable on which to split the data. Boxplots for this second variable are produced in different colours. For
example, we might have measured whether our festival-goer was staying in a tent or a nearby hotel during the
festival. We could produce boxplots not just for men and women, but within men and women we could have
different-coloured boxplots for those who stayed in tents and those who stayed in hotels.
® 1-D Boxplot: Use this option when you just want to see a boxplot for a single variable. (This differs from the
simple boxplot only in that no categorical variable is selected for the x-axis.)
Figure 5: Boxplot for the download festival data
The outlier that we detected in the histogram has shown up as an extreme score (*) on the boxplot. SPSS helpfully tells
us the number of the case (611) that’s producing this outlier. If we go to the data editor (data view), we can locate this
case quickly by clicking on and typing 611 in the dialog box that appears. That takes us straight to case 611. Looking
at this case reveals a score of 20.02, which is probably a mistyping of 2.02. We’d have to go back to the raw data and
check. We’ll assume we’ve checked the raw data and this score should be 2.02, so replace the value 20.02 with the
value 2.02 before we continue this example.
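Outside SPSS you could make the same correction in a couple of lines; the sketch below assumes, purely hypothetically, that the data sit in a CSV file called download.csv with a column named day1 (in SPSS itself you simply edit the cell for case 611):

import pandas as pd

df = pd.read_csv("download.csv")             # hypothetical file containing the festival data
df.loc[df["day1"] == 20.02, "day1"] = 2.02   # replace the mistyped score with the checked value
df.to_csv("download_corrected.csv", index=False)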
SELF TEST: Now we have removed the outlier in the data, re-plot the histogram and boxplot.
Figure 6: Histogram (left) and boxplot (right) of hygiene scores on day 1 of the Download Festival
Figure 6 shows the histogram and boxplot for the data after the extreme case has been corrected. The distribution is
nicely symmetrical and doesn’t seem too pointy or flat. Neither plot indicates any particularly extreme scores: the
boxplot suggests that case 574 is a mild outlier, but the histogram doesn’t seem to show any cases as being particularly
out of the ordinary.
Assumptions
Most of our potential sources of bias come in the form of ‘violations of assumptions’. An assumption is a condition that
ensures that what you’re attempting to do works. For example, when we assess a model using a test statistic, we have
usually made some assumptions, and if these assumptions are true then we know that we can take the test statistic
(and, therefore, p-value) associated with a model at face value and interpret it accordingly. Conversely, if any of the
assumptions are not true (usually referred to as a violation) then the test statistic and p-value will be inaccurate and
could lead us to the wrong conclusion if we interpret them at face value.
Assumptions are often presented as though different statistical procedures have their own unique set of assumptions.
However, because we’re usually fitting variations of the linear model to our data, all of the tests in my book (Field, 2013)
basically have the same assumptions. These assumptions relate to the quality of the model itself, and the test statistics
used to assess it (which are usually parametric tests based on the normal distribution). The main assumptions that we’ll
look at are:
• Additivity and linearity
• Normality of something or other
• Homoscedasticity/homogeneity of variance
• Independence
Additivity and Linearity
The assumption of additivity and linearity means that the outcome variable is, in reality, linearly related to any predictors, and that the combined effect of several predictors is best described by adding their effects together. If this assumption is not true then you are fitting the wrong model to your data. It's a bit like calling your pet cat a dog: you can try to get it to go in a kennel, or to fetch sticks, or to sit when you tell it to, but don't be surprised when its behaviour isn't what you expect, because even though you've called it a dog, it is in fact a cat.
Normality
Many people take the ‘assumption of normality’ to mean that your data need to be normally distributed. However, that
isn’t what it means. What it does mean is:
1. For confidence intervals around a parameter estimate (e.g., the mean, or a b) to be accurate, that estimate
must come from a normal distribution.
2. For significance tests of models (and the parameter estimates that define them) to be accurate the sampling
distribution of what’s being tested must be normal. For example, if testing whether two means are different,
the data do not need to be normally distributed, but the sampling distribution of means (or differences
between means) does. Similarly, if looking at relationships between variables, the significance tests of the
parameter estimates that define those relationships (the bs in Eq. 1) will be accurate only when the sampling
distribution of the estimate is normal.
3. For the estimates of the parameters that define a model (the bs in Eq. 1) to be optimal (using the method of
least squares) the residuals (the errori in Eq. 1) in the population must be normally distributed.
The misconception that people often have about the data themselves needing to be normally distributed probably
stems from the fact that if the data are normally distributed then it’s reasonable to assume that the errors in the model
and the sampling distribution are also (and remember, we don’t have direct access to the sampling distribution so we
have to make educated guesses about its shape).
Figure 7: A distribution that looks non-normal (left) could be made up of different groups of normally-distributed scores (happiness is plotted on the horizontal axis and density on the vertical; the right-hand panel shows separate curves for humans and Muppets)
When you have a categorical predictor variable (such as people falling into different groups) you wouldn’t expect the
overall distribution of the outcome (or residuals) to be normal. For example, if you have seen the movie ‘the Muppets’,
you will know that Muppets live among us. Imagine you predicted that Muppets are happier than humans (on TV they
seem to be). You collect happiness scores in some Muppets and some Humans and plot the frequency distribution. You
get the graph on the left of Figure 7 and decide that your data are not normal: you think that you have violated the
assumption of normality. However, you haven’t because you predicted that Humans and Muppets will differ in
happiness; in other words, you predict that they come from different populations. If we plot separate frequency
distributions for humans and Muppets (right of Figure 7) you’ll notice that within each group the distribution of scores
is very normal. The data are as you predicted: Muppets are happier than humans and so the centre of their distribution
is higher than that of humans. When you combine all of the scores this gives you a bimodal distribution (i.e., two humps).
This example illustrates that it is not the normality of the outcome (or residuals) overall that matters, but normality at
each unique level of the predictor variable.
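The point is easy to demonstrate by simulation. The sketch below is a Python illustration with made-up happiness scores (not data from any real study): two normally distributed groups with different means fail a normality test when combined, even though each group passes it on its own.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
humans  = rng.normal(loc=10, scale=2, size=500)   # made-up human happiness scores
muppets = rng.normal(loc=15, scale=2, size=500)   # made-up Muppet happiness scores (happier on average)
combined = np.concatenate([humans, muppets])

for label, scores in [("combined", combined), ("humans", humans), ("muppets", muppets)]:
    w, p = stats.shapiro(scores)                  # Shapiro-Wilk test of normality
    print(label, round(p, 4))                     # combined: p < .05; each group on its own: typically p > .05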
The central limit theorem means that there are a variety of situations in which we can assume normality regardless of
the shape of our sample data (Lumley, Diehr, Emerson, & Chen, 2002):
1. Confidence intervals: The central limit theorem tells us that in large samples, the estimate will have come from
a normal distribution regardless of what the sample or population data look like. Therefore, if we are interested
in computing confidence intervals then we don’t need to worry about the assumption of normality if our
sample is large enough.
2. Significance tests: the central limit theorem tells us that the shape of our data shouldn’t affect significance
tests provided our sample is large enough. However, the extent to which test statistics perform as they should
do in large samples varies across different test statistics—for more information read Field (2013).
3. Parameter estimates: The method of least squares will always give you an estimate of the model parameters
that minimizes error, so in that sense you don’t need to assume normality of anything to fit a linear model and
estimate the parameters that define it (Gelman & Hill, 2007). However, there are other methods for estimating
model parameters, and if you happen to have normally distributed errors then the estimates that you obtained
using the method of least squares will have less error than the estimates you would have got using any of these
other methods.
To sum up then, if all you want to do is estimate the parameters of your model then normality doesn’t really matter. If
you want to construct confidence intervals around those parameters, or compute significance tests relating to those
parameters then the assumption of normality matters in small samples, but because of the central limit theorem we
don’t really need to worry about this assumption in larger samples (but see Field (2013) for a discussion of what we
might mean by a larger sample). In practical terms, as long as your sample is fairly large, outliers are a more pressing
concern than normality.
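A quick way to convince yourself of the central limit theorem is to simulate it. The sketch below (Python, using an arbitrary, strongly skewed population chosen just for illustration) draws thousands of samples of 100 scores and shows that the distribution of sample means is far less skewed than the raw scores.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

raw_scores = rng.exponential(scale=2.0, size=100_000)   # a strongly positively skewed population
sample_means = np.array([rng.exponential(scale=2.0, size=100).mean() for _ in range(5000)])

print(stats.skew(raw_scores))     # roughly 2: very skewed
print(stats.skew(sample_means))   # much closer to 0: the sampling distribution of the mean is roughly normal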
Homogeneity of Variance/Homoscedasticity
The second assumption we’ll explore relates to variance and it can impact on the two main things that we might do
when we fit models to data:
• Parameters: If we use the method of least squares to estimate the parameters in the model, then this will give
us optimal estimates if the variance of the outcome variable is equal across different values of the predictor
variable.
• Null hypothesis significance testing: test statistics often assume that the variance of the outcome variable is
equal across different values of the predictor variable. If this is not the case then these test statistics will be
inaccurate.
Therefore, to make sure our estimates of the parameters that define our model and significance tests are accurate we
have to assume homoscedasticity (also known as homogeneity of variance).
In designs in which you test several groups of participants this assumption means that each of these samples comes
from populations with the same variance. In correlational designs, this assumption means that the variance of the
outcome variable should be stable at all levels of the predictor variable. In other words, as you go through levels of the
predictor variable, the variance of the outcome variable should not change.
In terms of estimating the parameters within a linear model if we assume equality of variance then the estimates we
get using the method of least squares will be optimal. If variances for the outcome variable differ along the predictor
variable then the estimates of the parameters within the model will not be optimal. They will be ‘unbiased’ (Hayes &
Cai, 2007) but not optimal.
Unequal variances/heteroscedasticity also creates bias and inconsistency in the estimate of the standard error
associated with the parameter estimates (Hayes & Cai, 2007). This basically means that confidence intervals and
significance tests will be biased (because they are computed using the standard error). Confidence intervals can be
‘extremely inaccurate’ when homogeneity of variance/homoscedasticity cannot be assumed (Wilcox, 2010). Some test
statistics are designed to be accurate even when this assumption is violated.
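To see what the heteroscedasticity-consistent standard errors discussed by Hayes and Cai (2007) look like in practice, here is a sketch using Python's statsmodels with simulated heteroscedastic data (the data and model are made up for illustration; this is not part of the SPSS procedure used in this module):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = 2 + 0.5 * x + rng.normal(0, 0.5 + 0.3 * x)        # error variance grows with x (heteroscedasticity)

X = sm.add_constant(x)
ordinary = sm.OLS(y, X).fit()                         # assumes homoscedastic errors
robust   = sm.OLS(y, X).fit(cov_type="HC3")           # heteroscedasticity-consistent (HC3) standard errors

print(ordinary.bse)   # standard errors that can be biased when variances are unequal
print(robust.bse)     # typically larger, more trustworthy standard errors in this situation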
Independence
This assumption means that the errors in your model (the errori in Eq. 1) are not related to each other. Imagine Paul
and Julie were participants in an experiment where they had to indicate whether they remembered having seen
particular photos. If Paul and Julie were to confer about whether they’d seen certain photos then their answers would
not be independent: Julie’s response to a given question would depend on Paul’s answer. We know already that if we
estimate a model to predict their responses, there will be error in those predictions and because Paul and Julie’s scores
are not independent the errors associated with these predicted values will also not be independent. If Paul and Julie
were unable to confer (if they were locked in different rooms) then the error terms should be independent (unless
they’re telepathic): the error in predicting Paul’s response should not be influenced by the error in predicting Julie’s
response.
The equation that we use to estimate the standard error is valid only if observations are independent. Remember that
we use the standard error to compute confidence intervals and significance tests, so if we violate the assumption of
independence then our confidence intervals and significance tests will be invalid. If we use the method of least squares,
then model parameter estimates will still be valid but not optimal (we could get better estimates using a different
method). In general, if this assumption is violated, there are techniques you can use described in Chapter 20 of (Field,
2013).
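The sketch below (a Python simulation with made-up numbers) illustrates why this matters: when pairs of scores share an error component, as Paul and Julie's would if they conferred, the usual standard-error formula underestimates how much sample means really vary.

import numpy as np

rng = np.random.default_rng(3)

def correlated_sample(n_pairs=50):
    shared = rng.normal(0, 1, n_pairs)                      # error shared within each pair (non-independence)
    return np.concatenate([shared + rng.normal(0, 0.6, n_pairs),
                           shared + rng.normal(0, 0.6, n_pairs)])

samples = [correlated_sample() for _ in range(5000)]
true_sd_of_means = np.std([s.mean() for s in samples])                   # how much the mean actually varies
naive_se = np.mean([s.std(ddof=1) / np.sqrt(s.size) for s in samples])   # the usual standard-error formula

print(true_sd_of_means, naive_se)   # the usual formula comes out too small when observations are not independent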
Testing Assumptions
Testing normality
You can look for normality in three ways: (1) graphs; (2) numerically; and (3) significance tests. We can do all three using
the Explore command in SPSS. In terms of graphs we can look at histograms (which we’ve already learnt about) and P-
P plots and Q-Q plots. P-P and Q-Q plots basically show the same thing: a P-P plot plots the cumulative probability of a
variable against the cumulative probability of a particular distribution (in this case a normal distribution). A Q-Q plot
does the same thing but expressed as quantiles. With large data sets Q-Q plots are a bit easier to interpret. In a sense
they plot the ‘actual data’ against ‘what you’d expect to get from a normal distribution’, so if the data are normally
distributed then the actual scores will be the same as the expected scores and you’ll get a lovely straight diagonal line.
This ideal scenario is helpfully plotted on the graph and your job is to compare the data points to this line. If values fall
on the diagonal of the plot then the variable is normally distributed; however, when the data sag consistently above or
below the diagonal then this shows that the kurtosis differs from a normal distribution, and when the data points are S-
shaped, the problem is skewness.
Numerically, SPSS uses methods to calculate skew and kurtosis (see Field (2013) if you have forgotten what these
concepts are) that give values of zero in a normal distribution. Positive values of skewness indicate a pile-up of scores
on the left of the distribution, whereas negative values indicate a pile-up on the right. Positive values of kurtosis indicate
a pointy and heavy-tailed distribution, whereas negative values indicate a flat and light-tailed distribution. The further
the value is from zero, the more likely it is that the data are not normally distributed.
Finally, we can see whether the distribution of scores deviates from a comparable normal distribution. The Kolmogorov-
Smirnov test and Shapiro-Wilk test do this: they compare the scores in the sample to a normally distributed set of
scores with the same mean and standard deviation.
• If the test is non-significant (p > .05) it tells us that the distribution of the sample is not significantly different
from a normal distribution (i.e., it is probably normal).
• If, however, the test is significant (p < .05) then the distribution in question is significantly different from a
normal distribution (i.e., it is non-normal).
These tests seem great: in one easy procedure they tell us whether our scores are normally distributed (nice!). However,
the Jane Superbrain Box explains some really good reasons not to use them. If you insist on using them, bear Jane’s
advice in mind and always plot your data as well and try to make an informed decision about the extent of non-normality
based on converging evidence.
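If you ever need to run the same checks outside SPSS, the sketch below shows rough Python equivalents (the scores are a simulated stand-in for the hygiene data; note also that SPSS's Explore command applies a Lilliefors correction to its K-S test, so the p-values will not match scipy's plain K-S test exactly):

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
scores = rng.normal(1.77, 0.7, 810)     # simulated stand-in for the 810 day 1 hygiene scores

# Skew and kurtosis (both zero in a perfectly normal distribution)
print(stats.skew(scores), stats.kurtosis(scores))

# Kolmogorov-Smirnov test against a normal distribution with the sample mean and SD
d, p_ks = stats.kstest(scores, "norm", args=(scores.mean(), scores.std(ddof=1)))

# Shapiro-Wilk test
w, p_sw = stats.shapiro(scores)

# Values for a Q-Q plot: theoretical quantiles vs ordered data
(theoretical_q, ordered_data), (slope, intercept, r) = stats.probplot(scores, dist="norm")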
Figure 8 shows the dialog boxes for the Explore command ( ). First, enter
any variables of interest in the box labelled Dependent List by highlighting them on the left-hand side and transferring
them by clicking on . For this example, select the hygiene scores for the three days. If you click on a dialog
box appears, but the default option is fine (it will produce means, standard deviations and so on). The more interesting
option for our current purposes is accessed by clicking on . In this dialog box select the option
, and this will produce both the K-S test and some normal Q-Q plots. You can also split the
analysis by a factor or grouping variable (for example, we could do a separate analysis for males and females by dragging gender to the Factor List box; we'll do this later in the handout).
Figure 8: Dialog boxes for the explore command
We also need to click on to tell SPSS how to deal with missing values. This is important because although we start off with 810 scores on day 1, by day 2 we have only 264, and this is reduced to 123 on day 3. By default SPSS will use only cases for which there are valid scores on all of the selected variables. This would mean that for day 1, even though we have 810 scores, it will use only the 123 cases for which there are scores on all three days. This is known as excluding cases listwise. However, we want it to use all of the scores it has on a given day, which is known as pairwise.
Once you have clicked on select Exclude cases pairwise, then click on to return to the main dialog box
and click on to run the analysis.
SELF-TEST: We have already plotted a histogram of the day 1 scores. Using what you learnt before, plot histograms for the hygiene scores for days 2 and 3 of the Download Festival.
Figure 9: Histograms (left) and Q-Q plots (right) of the hygiene scores over the three days of the Download Festival
Figure 9 shows the histograms (from the self-test tasks) and the corresponding Q-Q plots. The day 1 scores look quite
normal; The Q-Q plot echoes this view because the data points all fall very close to the ‘ideal’ diagonal line. However,
the distributions for days 2 and 3 are not nearly as symmetrical as day 1: they both look positively skewed. Again, this
can be seen in the Q-Q plots by the data points deviating away from the diagonal. In general, this seems to suggest that
by days 2 and 3, hygiene scores were much more clustered around the low end of the scale. Remember that the lower
the score, the less hygienic the person is, so generally people became smellier as the festival progressed. The skew
occurs because a substantial minority insisted on upholding their levels of hygiene (against all odds) over the course of
the festival (baby wet-wipes are indispensable I find).
Output 1 shows the table of descriptive statistics for the three variables in this example. On average, hygiene scores
were 1.77 (out of 5) on day 1 of the festival, but went down to 0.96 and 0.98 on days 2 and 3 respectively. The other
important measures for our purposes are the skewness and the kurtosis, both of which have an associated standard
error. For day 1 the skew value is very close to zero (which is good) and kurtosis is a little negative. For days 2 and 3,
though, there is a skewness of around 1 (positive skew).
We can convert these values to z-scores which enables us to (1) compare skew and kurtosis values in different samples
that used different measures, and (2) calculate a p-value that tells us if the values are significantly different from 0 (i.e.,
normal). Although there are good reasons not to do this (see Jane Superbrain Box), if you want to you can do it by
subtracting the mean of the distribution (in this case zero) from the score and then dividing by the standard error of the
distribution:
z_skewness = (S − 0) / SE_skewness          z_kurtosis = (K − 0) / SE_kurtosis
In the above equations, the values of S (skewness) and K (kurtosis) and their respective standard errors are produced
by SPSS. These z-scores can be compared against values that you would expect to get if skew and kurtosis were not
different from 0. So, an absolute value greater than 1.96 is significant at p < .05, above 2.58 is significant at p < .01 and
above 3.29 is significant at p < .001. However, you really should use these criteria only in small samples: in larger samples
look at the shape of the distribution visually, interpret the value of the skewness and kurtosis statistics, and possibly
don’t even worry about normality at all (Jane Superbrain Box).
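The calculation is simple enough to do by hand, but here is a small Python helper if you prefer (the example numbers in the last line are made up, not values from Output 1):

from scipy import stats

def z_test(statistic, se):
    # z = (statistic - 0) / standard error, with a two-tailed p-value from the normal distribution
    z = (statistic - 0) / se
    p = 2 * stats.norm.sf(abs(z))
    return z, p

print(z_test(1.00, 0.15))   # e.g., a skewness of 1.00 with SE = 0.15 gives z = 6.67, p < .001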
Output 1
Using the values in Output 1, calculate the z-scores for skewness and kurtosis for each day of the Download Festival.
Your Answers:
Output 2
Output 2 shows the K-S test. Remember that a significant value (Sig. less than .05) indicates a deviation from normality.
Are the results of the K-S tests surprising given the histograms we have already seen?
Your Answers:
For day 1 the K-S test is just about not significant (p = .097), which is surprisingly close to significant given how normal the day 1 scores looked in the histogram (Figure 3). However, the sample size on day 1 is very large (N = 810), and the fact that the K-S test comes this close to significance for these data shows how, in large samples, even small and unimportant deviations from normality can be deemed significant by this test (Jane Superbrain Box). For days 2 and 3 the test is highly significant, indicating that these distributions are not normal, which is likely to reflect the skew seen in the histograms for these data (Figure 9), but could again be down to the large samples.
The test statistic for the K-S test is denoted by D and we should report the degrees of freedom (df) from the table in
brackets after the D. We can report the results in Output 2 in the following way:
✓ The hygiene scores on day 1, D(810) = 0.029, p = .097, did not deviate significantly from normal; however,
day 2, D(264) = 0.121, p < .001, and day 3, D(123) = 0.140, p < .001, scores were both significantly non-
normal.
Throughout this handout we will look at various significance tests that have been devised to look at whether
assumptions are violated. These include tests of whether a distribution is normal (the Kolmorgorov-
Smirnoff and Shapiro-Wilk tests), tests of homogeneity of variances (Levene’s test), and tests of
significance of skew and kurtosis. All of these tests are based on null hypothesis significance testing and
this means that (1) in large samples they can be significant even for small and unimportant effects, and
(2) in small samples they will lack the power to detect violations of assumptions.
The central limit theorem means that as sample sizes get larger, the less the assumption of normality
matters because the sampling distribution will be normal regardless of what our population (or indeed
sample) data look like. So, the problem is that in large samples, where we don’t need to worry about normality, a test
of normality is more likely to be significant, and therefore likely to make us worry about and correct for something that
doesn’t need to be corrected or worried about. Conversely, in small samples, where we might want to worry about
normality, a significance test won’t have the power to detect non-normality and so is likely to encourage us not to worry
about something that we probably ought to. Therefore, the best advice is that if your sample is large then don’t use
significance tests of normality, in fact don’t worry too much about normality at all. In small samples then pay attention
if your significance tests are significant but resist being lulled into a false sense of security if they are not.
¹ These standardized errors are called standardized residuals.
Testing homogeneity of variance
Levene's test tests the null hypothesis that the variances in different groups are equal. If it is significant (p < .05) then we conclude that the variances differ significantly and the assumption of homogeneity of variance has been violated; if it is non-significant then the variances are roughly equal. Although Levene's test can be selected as an option in many of the statistical tests that require it, it's best to look at it when you're exploring data because it informs the model you fit. As with the K-S test you need to take Levene's test with a pinch of salt (Jane Superbrain Box).
Output 3
Levene’s test can be denoted with the letter F and there are two different degrees of freedom. As such you can report
it, in general form, as F(df1, df2) = value, sig:
✓ For the hygiene scores on day 1 of the festival, the variances were unequal for males and females, F(1, 808) = 4.74, p = .03.
Your Answers:
Reducing Bias
Having looked at potential sources of bias, the next issue is how to reduce the impact of bias. Essentially there are four
methods for correcting problems with the data, which can be remembered with the handy acronym of TWAT:
• Trim the data: delete a certain number of scores from the extremes.
• Winsorizing: substitute outliers with the highest value that isn't an outlier.
• Analyse with Robust Methods: this typically involves a technique known as bootstrapping.
• Transform the data: this involves applying a mathematical function to scores to try to correct any problems
with them.
Probably the best of these choices is to use robust tests, which is a term applied to a family of procedures to estimate
statistics that are reliable even when the normal assumptions of the statistic are not met. For the purposes of this
handout we’ll look at transforming data, and throughout the module we’ll use bootstrapping (which is a robust method
explained in your lecture), but you can find more detail on these techniques and the others in Chapter 5 of Field (2013).
http://youtu.be/mNrxixgwA2M
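To give a flavour of what bootstrapping does, here is a minimal percentile-bootstrap sketch in Python. It is a simplified illustration of the general idea, not the specific procedure SPSS implements; see Efron and Tibshirani (1993) for the full story.

import numpy as np

def bootstrap_ci_mean(scores, n_boot=10_000, alpha=0.05, seed=0):
    # Resample the data with replacement many times and take the middle 95% of the
    # resampled means as a confidence interval that makes no normality assumption.
    rng = np.random.default_rng(seed)
    boot_means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                           for _ in range(n_boot)])
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

print(bootstrap_ci_mean(np.array([2, 5, 4, 5, 5, 5, 5])))   # e.g., the Amazon ratings from earlier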
Transforming Data
The final thing that you can do to combat problems with normality and linearity is to transform your data. The idea
behind transformations is that you do something to every score to correct for distributional problems, outliers, lack of
linearity or unequal variances. If you are looking at relationships between variables (e.g., regression) just transform the
problematic variable, but if you are looking at differences between variables (e.g., change in a variable over time) then
you need to transform all of those variables. For example, our festival hygiene data were not normal on days 2 and 3 of
the festival. Now, we might want to look at how hygiene levels changed across the three days (i.e., compare the mean
on day 1 to the means on days 2 and 3 to see if people got smellier). The data for days 2 and 3 were skewed and need
to be transformed, but because we might later compare the data to scores on day 1, we would also have to transform
the day 1 data (even though scores were not skewed). If we don’t change the day 1 data as well, then any differences
in hygiene scores we find from day 1 to day 2 or 3 will be due to us transforming one variable and not the others.
However, if we were going to look at the relationship between day 1 and day 2 scores (not the difference between
them) we could transform only the day 2 scores and leave the day 1 scores alone.
Choosing a transformation
There are various transformations that you can do to the data that are helpful in correcting various problems. Table 1 shows some common transformations and their uses. The way to decide which transformation to use is good old-fashioned trial and error: try one out, see if it helps, and if it doesn't then try a different one.
Table 1: Data transformations and their uses
Log transformation (log(Xi)): Taking the logarithm of a set of numbers squashes the right tail of the distribution. As such it's a good way to reduce positive skew. This transformation is also very useful if you have problems with linearity (it can sometimes make a curvilinear relationship linear). However, you can't get a log value of zero or negative numbers, so if your data tend to zero or produce negative numbers you need to add a constant to all of the data before you do the transformation. For example, if you have zeros in the data then do log(Xi + 1), or if you have negative numbers add whatever value makes the smallest number in the data set positive. Can correct for: positive skew, positive kurtosis, unequal variances, lack of linearity.
Square root transformation (√Xi): Taking the square root of large values has more of an effect than taking the square root of small values. Consequently, taking the square root of each of your scores will bring any large scores closer to the centre, rather like the log transformation. As such, this can be a useful way to reduce positive skew; however, you still have the same problem with negative numbers (negative numbers don't have a square root). Can correct for: positive skew, positive kurtosis, unequal variances, lack of linearity.
Reciprocal transformation (1/Xi): Dividing 1 by each score also reduces the impact of large scores. The transformed variable will have a lower limit of 0 (very large numbers will become close to 0). One thing to bear in mind with this transformation is that it reverses the scores: scores that were originally large in the data set become small (close to zero) after the transformation, but scores that were originally small become big after the transformation. For example, imagine two scores of 1 and 10; after the transformation they become 1/1 = 1 and 1/10 = 0.1: the small score becomes bigger than the large score. However, you can avoid this by reversing the scores before the transformation, by finding the highest score and changing each score to the highest score minus the score you're looking at; that is, you use 1/(XHighest − Xi). Like the log transformation, you can't take the reciprocal of 0, so if you have zeros in the data you need to add a constant to all scores before doing the transformation. Can correct for: positive skew, positive kurtosis, unequal variances.
Reverse score transformations: Any one of the above transformations can be used to correct negatively skewed data, but first you have to reverse the scores. To do this, subtract each score from the highest score obtained, or from the highest score + 1 (depending on whether you want your lowest score to be 0 or 1). If you do this, don't forget to reverse the scores back afterwards, or to remember that the interpretation of the variable is reversed: big scores have become small and small scores have become big. Can correct for: negative skew.
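The same transformations are one-liners in most software. The sketch below shows Python equivalents for a hypothetical column of day 2 scores (the values and the data frame are made up; in SPSS you will use the Compute command described in the next section):

import numpy as np
import pandas as pd

df = pd.DataFrame({"day2": [0.00, 0.29, 1.35, 1.41, 3.02]})   # made-up hygiene scores (note the zero)

df["logday2"]  = np.log10(df["day2"] + 1)            # log transform; +1 because the data contain a zero
df["sqrtday2"] = np.sqrt(df["day2"])                 # square root transform
df["recday2"]  = 1 / (df["day2"] + 1)                # reciprocal transform; +1 avoids dividing by zero
df["revday2"]  = df["day2"].max() + 1 - df["day2"]   # reverse the scores (the first step for negative skew)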
Trying out different transformations can be quite time consuming; however, if heterogeneity of variance is your issue
then we can see the effect of a transformation quite quickly. When we ran Levene’s test (Figure 11) we ran the analysis
selecting the raw scores ( ). However, if the variances turn out to be unequal, as they did in our example,
you can use the same dialog box but select . When you do this you should notice a drop-down list that
becomes active and if you click on this you’ll notice that it lists several transformations including the ones that I have
just described. If you select a transformation from this list (Natural log perhaps or Square root) then SPSS will calculate
what Levene’s test would be if you were to transform the data using this method. This can save you a lot of time trying
out different transformations.
The compute command enables us to carry out various functions on columns of data in the data editor. Some typical
functions are adding scores across several columns, taking the square root of the scores in a column, or calculating the
mean of several variables. To access the compute dialog box, use the mouse to select . The
resulting dialog box is shown in Figure 13; it has a list of functions on the right-hand side, a calculator-like keyboard in
the centre and a blank space that I’ve labelled the command area. The basic idea is that you type a name for a new
variable in the area labelled Target Variable and then you write some kind of command in the command area to tell
SPSS how to create this new variable. You use a combination of existing variables selected from the list on the left and
numeric expressions. So, for example, you could use it like a calculator to add variables (i.e. add two columns in the data
editor to make a third). There are hundreds of built-in functions that SPSS has grouped together. In the dialog box it lists
these groups in the area labelled Function group; upon selecting a function group, a list of available functions within
that group will appear in the box labelled Functions and Special Variables. If you select a function, then a description of
that function appears in the box indicated in Figure 13. You can enter variable names into the command area by selecting
the variable required from the variables list and then clicking on . Likewise, you can select a certain function from the
list of available functions and enter it into the command area by clicking on .
First type a variable name in the box labelled Target Variable, then click on and another dialog box appears,
where you can give the variable a descriptive label and specify whether it is a numeric or string variable (see your
handout from week 1). Then when you have written your command for SPSS to execute, click on to run the
command and create the new variable. There are functions for calculating means, standard deviations and sums of
columns. We’re going to use the square root and logarithm functions, which are useful for transforming data that are
skewed.
Figure 13: Dialog box for the Compute command (the annotations show the command area, the variable list, the categories of functions, the functions within the selected category, and the description of the selected function)
Log Transformation
Let’s return to our Download festival data. Open the main Compute dialog box by selecting .
Enter the name logday1 into the box labelled Target Variable, click on and give the variable a more
descriptive name such as Log transformed hygiene scores for day 1 of Download festival. In the list box labelled Function
group click on Arithmetic and then in the box labelled Functions and Special Variables click on Lg10 (this is the log
transformation to base 10, Ln is the natural log) and transfer it to the command area by clicking on . When the
command is transferred, it appears in the command area as ‘LG10(?)’ and the question mark should be replaced with a
variable name (which can be typed manually or transferred from the variables list). So replace the question mark with
the variable day1 by either selecting the variable in the list and dragging it across, clicking on , or just typing ‘day1’
where the question mark is.
For the day 2 hygiene scores there is a value of 0 in the original data, and there is no logarithm of the value 0. To
overcome the problem we add a constant to our original scores before we take the log of those scores. Any constant
will do (although sometimes it can matter), provided that it makes all of the scores greater than 0. In this case our lowest
score is 0 in the data so we could add 1 to all of the scores to ensure that all scores are greater than zero. Even though
this problem affects the day 2 scores, we need to be consistent and do the same to the day 1 scores as we will do with
the day 2 scores. Therefore, make sure the cursor is still inside the brackets and click on and then . The final dialog
box should look like Figure 13. Note that the expression reads LG10(day1 + 1); that is, SPSS will add one to each of the
day1 scores and then take the log of the resulting values. Click on to create a new variable logday1 containing the
transformed values.
SELF TEST: Have a go at creating similar variables logday2 and logday3 for the day 2 and day 3
data. Plot histograms of the transformed scores for all three days.
To use the square root transformation, we could run through the same process, by using a name such as sqrtday1 and
selecting the SQRT(numexpr) function from the list. This will appear in the box labelled Numeric Expression: as SQRT(?),
and you can simply replace the question mark with the variable you want to change—in this case day1. The final
expression will read SQRT(day1).
SELF TEST: Repeat this process for day2 and day3 to create variables called sqrtday2 and sqrtday3.
Plot histograms of the transformed scores for all three days.
Reciprocal Transformation
To do a reciprocal transformation on the data from day 1, we could use a name such as recday1 in the box labelled
Target Variable. Then we can simply click on and then . Ordinarily you would select the variable name that you
want to transform from the list and drag it across, click on or just type the name of the variable. However, the day 2
data contain a zero value and if we try to divide 1 by 0 then we’ll get an error message (you can’t divide by 0). We need
to add a constant to our variable just as we did for the log transformation. Any constant will do, but 1 is a convenient
number for these data. So, instead of selecting the variable we want to transform, click on . This places a pair of
brackets into the box labelled Numeric Expression; then make sure the cursor is between these two brackets and select
the variable you want to transform from the list and transfer it across by clicking on (or type the name of the variable
manually). Now click on and then (or type + 1 using your keyboard). The box labelled Numeric Expression should
now contain the text 1/(day1 + 1). Click on to create a new variable containing the transformed values.
SELF TEST: Repeat this process for day2 and day3. Plot histograms of the transformed scores for
all three days.
Figure 14 shows the distributions for days 1 and 2 of the festival after the three different transformations. Compare
these to the untransformed distributions in Figure 9. Now, you can see that all three transformations have cleaned up
the hygiene scores for day 2: the positive skew is reduced (the square root transformation in particular has been useful).
However, because our hygiene scores on day 1 were more or less symmetrical to begin with, they have now become
slightly negatively skewed for the log and square root transformations, and positively skewed for the reciprocal transformation². If we're using scores from day 2 alone or looking at the relationship between day 1 and day 2, then we could use the transformed scores; however, if we wanted to look at the change in scores then we'd have to weigh up whether the benefits of the transformation for the day 2 scores outweigh the problems it creates in the day 1 scores. Data analysis can be frustrating sometimes!
Figure 14: Distributions of the hygiene data on day 1 and day 2 after the log, square root and reciprocal (1/x) transformations
² The reversal of the skew for the reciprocal transformation is because, as I mentioned earlier, the reciprocal has the effect of reversing the scores.
References
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. Boca Raton, FL: Chapman & Hall/CRC.
Field, A. P. (2000). Discovering statistics using SPSS for Windows: Advanced techniques for the beginner. London: Sage.
Field, A. P. (2013). Discovering statistics using IBM SPSS Statistics: And sex and drugs and rock 'n' roll (4th ed.). London:
Sage.
Gelman, A., & Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge
University Press.
Hayes, A. F., & Cai, L. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An
introduction and software implementation. Behavior Research Methods, 39(4), 709-722.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow & H. B.
Mann (Eds.), Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (pp. 278-292).
Stanford, CA: Stanford University Press.
Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of the normality assumption in large public health
data sets. Annual Review of Public Health, 23, 151-169.
Pearson, E. S., & Hartley, H. O. (1954). Biometrika tables for statisticians, volume I. New York: Cambridge University
Press.
Wilcox, R. R. (2010). Fundamentals of modern statistical methods: substantially improving power and accuracy. New
York: Springer.
Terms of Use
This handout contains material from:
Field, A. P. (2013). Discovering statistics using SPSS: and sex and drugs and rock ‘n’ roll (4th Edition). London: Sage.
This material is copyright Andy Field (2000-2016).
This document is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
License (https://creativecommons.org/licenses/by-nc-nd/4.0/), basically you can use it for teaching and non-profit
activities but not meddle with it without permission from the author.