In this chapter: introduction to the Kendall tau-b correlation; introduction to scatterplots; partial correlation; linear regression; logistic regression.

Sometimes we wish to collect a score on two variables from a set of participants to see whether there is a relationship between the variables. For example, we might measure a person’s experience in a job (number of weeks) and the number of items they produce each day to ask if there is a relationship between experience and productivity. Where we have two variables like this, produced by the same or related participants, we are able to examine the association between the variables by a correlation. A correlation is performed to test the degree to which the scores on the two variables co-relate – that is, the extent to which the variation in the scores on one variable results in a corresponding variation in the scores on the second variable.
We will only be considering linear correlations in this book. The simplest relationship between two variables is a linear relationship, and a linear relationship is the underlying model we propose for our data. The reasons for this are explained in an earlier chapter (see Chapter 7). In this case, with two variables, we are arguing that if they are correlated, then if we plot the points on a graph they will follow a straight line. We refer to this line as the regression line. However, we are unlikely to find that our data points lie exactly on a straight line. We explain this by claiming that it arises from random factors referred to as ‘error’. Given that the equation of a straight line is defined mathematically as Y = a + bX, where X and Y are the variables, with ‘a’ the intercept (or ‘constant’) and ‘b’ the slope of the line, our observed values are defined as follows:

Y = a + bX + error
We can work out the regression line by finding the equation that gives us the smallest
amount of error.
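For reference, ‘the smallest amount of error’ here means the least squares solution, a standard statistical result: the regression line minimises the sum of the squared errors, and the slope and intercept that achieve this are

b = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad a = \bar{Y} - b\bar{X}

where \bar{X} and \bar{Y} denote the means of the two variables. SPSS calculates these values for us in the linear regression procedure later in this chapter.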
A strong correlation indicates that there is only a small amount of error and
most of the points lie close to the regression line; a weak correlation indicates that
there is a lot of error and the points are more scattered. In the second case we are
likely to conclude that a linear relationship is not a good model for our data.
High values of one variable associated with high values of the second variable
indicate that the correlation is positive. For example, we might find a positive
correlation between height and foot size, with taller people having larger feet and
shorter people having smaller feet. When high values of the first variable are associ-
ated with low values of the second variable then we refer to this as a negative
correlation. So, for a car travelling at a constant speed along a track, we will find
that the distance travelled is negatively correlated with the amount of petrol left in
the tank.
The statistical measures of correlation in this chapter (Pearson, Spearman,
Kendall tau-b) all produce a statistic that ranges from –1, indicating a perfect negative
correlation, to +1, indicating a perfect positive correlation. A value of zero indicates
no correlation at all.
The important point to remember is that we are considering a linear correlation here, so our key assumption is that the points follow a straight line if they are correlated. If we believe the relationship between the variables is not linear, then we do not use the Pearson statistic but instead use the Spearman or Kendall tau-b statistics, explained later in the chapter.
Scenario
A researcher postulated that students who spent the most time studying would
achieve the highest marks in science examinations, whereas those who did the least
studying would achieve lower marks. The researcher noted the results of ten first-
year university students, showing how much time (in hours) they spent studying
(on average per week throughout the year) along with their end-of-year examination
marks (out of 100).
As our prediction states that we are expecting a positive correlation, this indicates a direction. Our prediction is therefore one-tailed. If we did not know in which direction the relationship between the two variables would go, we would have a two-tailed prediction, and we would be looking for either a positive or a negative correlation.
Data entry
Enter the dataset as shown in the example.

Remember when entering data without decimal places to change the decimal places to zero in the Variable View. See Chapter 2 for the full data entry procedure.
• Select the Analyze drop-down menu, then Correlate and Bivariate. In the Bivariate Correlations window, send both variables across to the Variables box.
• You can see that SPSS selects the Pearson coefficient as a default.
• Select whether your prediction is One-tailed or Two-tailed. Ours is one-tailed as we stated there would be a positive correlation.
• Click on OK.
• The Flag significant box is selected as default. Significant correlations are highlighted
underneath the output table with a * for a significance of p < .05 and ** for p < .01.
• If you require the means and standard deviations for each variable, click on the
Options button and tick means and standard deviations, then Continue and OK.
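If you prefer syntax to the menus, a minimal sketch of the equivalent command is given below; the variable names StudyTime and ScienceExam are assumed here, so substitute the names in your own data file.

CORRELATIONS
  /VARIABLES=StudyTime ScienceExam
  /PRINT=ONETAIL NOSIG
  /MISSING=PAIRWISE.

This mirrors the dialog choices above: a one-tailed Pearson correlation with significant correlations flagged.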
SPSS output
SPSS produces one output table, the Correlations table, unless descriptive statistics
have been selected.
SPSS essential

The appropriate graph to support a correlation is a scatterplot – see later in this chapter for more details.

• The Pearson Correlation test statistic = .721. SPSS indicates with ** that it is significant at the .01 level for a one-tailed prediction. The actual p value is shown to be .009. These figures are duplicated in the matrix.
• A conventional way of reporting these figures would be as follows:

r = .72, N = 10, p < .01
• These results indicate that as study time increases, science exam performance also increases, which is a positive correlation.
• As the r value reported is positive and p < .01, we can state that we have a positive correlation between our two variables and our null hypothesis can be rejected. If the r value was negative this would indicate a negative correlation, and would be counter to our hypothesis.
• The Pearson Correlation output matrix also shows the r value when ‘Study time’ is correlated with itself, and there is a perfect correlation coefficient of 1.000. Similarly, ‘Science Exam’ has a perfect correlation with itself, r = 1.000. These values are therefore not required.
Scenario
Two teachers were asked to rate the same eight teenagers on the variable ‘how well
they are likely to do academically at university’ on a 0–20 scale, from unlikely (0)
to highly likely (20). It was thought that there would be a significant correlation between the teachers’ ratings.
As our prediction does not state whether we expect a positive or negative correlation, we
have a two-tailed prediction. If we predicted that our correlation would be either positive
or negative, then we would have a one-tailed prediction.
Data entry
Enter the dataset as shown in the example.
Remember when entering data without decimal places to change the decimal places to zero in the Variable View. See Chapter 2 for the full data entry procedure.
The Flag significant correlations box is selected as default. Significant correlations are highlighted underneath the output table with a * for a significance of p < .05 and ** for p < .01.

• Highlight both teacher variables and send them to the Variables box.
• As SPSS selects the Pearson correlation coefficient as a default, deselect that box and put a tick in the Spearman box.
• Select whether your prediction is One-tailed or Two-tailed.
• Click on OK.
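Again, for syntax users, a rough equivalent of these dialog choices is sketched below (the variable names Teacher1 and Teacher2 are our assumption):

NONPAR CORR
  /VARIABLES=Teacher1 Teacher2
  /PRINT=SPEARMAN TWOTAIL NOSIG
  /MISSING=PAIRWISE.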
SPSS output
In order to check if there is a significant correlation between the two teachers’ ratings
the Correlations table must be observed.
SPSS essential
• Spearman’s rho correlation test statistic = .833. This shows a positive correlation between the two teachers’ ratings. SPSS also illustrates with * that it is significant at the .05 level for a two-tailed prediction. The actual p value is shown to be .010. (By double clicking on the figure of .010 in the output table the value appears to six decimal places, .010176, showing that it is just over .01.) These figures are duplicated in the matrix.
• By observing the Spearman correlation output matrix it can be seen that teacher 1 is (of course) perfectly correlated with teacher 1, hence the Spearman’s rho correlation coefficient of 1.000. Similarly, teacher 2 is perfectly correlated with teacher 2, with a Spearman’s rho correlation coefficient of 1.000.
• A conventional way of reporting the correlation between the two teachers is as follows:

rs = .83, N = 8, p < .05

• These results indicate that as one teacher’s ratings increase the other teacher’s ratings increase as well. Therefore, each teacher’s ratings of the teenagers’ expected academic performance are similar, with a student rated highly by one teacher rated highly by the other as well.
While a scatterplot is generally the most appropriate illustrative statistic to support a correlation, when conducting a Spearman’s test it should be used with caution. The Spearman’s rho correlation coefficient is produced by using the rank of the scores rather than the actual raw data, whereas the scatterplot displays the raw scores. This means that the test is predicting that the scores are monotonically related – that they are increasing together – and not that they lie along a straight line. The procedure for a scatterplot is at the end of this section.
Scenario
A consumer testing company wanted to pilot a new breakfast cereal. They decided to ask people if the new brand was as tasty as a current leading brand. They gave twenty people the current leading brand (brand A) and also gave them the new breakfast cereal (brand B). The order in which the participants tasted each brand differed to counterbalance order effects. Each participant was then asked to rate their enjoyment of the cereal on a 1–10 scale (1 = they didn’t enjoy it, 10 = it was very tasty).
• Remember when entering data without decimal places to change the decimal places to zero in the Variable View.
• As we are worried about the number of tied ranks in our dataset, we are going to carry out a Kendall tau-b rather than a Spearman correlation. See Chapter 2 for the full data entry procedure.
The Flag significant correlations box is selected as default. Significant correlations are highlighted underneath the output table with a * for a significance of p < .05 and ** for p < .01.
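The dialog choices are the same as for the Spearman correlation, except that a tick is put in the Kendall’s tau-b box instead. A sketch of the equivalent syntax, assuming the variables are named BrandA and BrandB, is:

NONPAR CORR
  /VARIABLES=BrandA BrandB
  /PRINT=KENDALL ONETAIL NOSIG
  /MISSING=PAIRWISE.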
SPSS output
The Kendall tau-b output is displayed in the Correlations table.
SPSS essential
• The Kendall tau-b correlation output matrix shows a correlation coefficient of .397. As this value is a positive number it shows that our data is positively correlated. SPSS also indicates with * that it is significant at the .05 level for a one-tailed prediction. The actual p value is shown to be .017.
• A conventional way of reporting these figures is as follows:

Kendall tau-b = .40, N = 20, p < .05

• These results indicate that as the ratings for breakfast cereal brand A increase, so do the ratings for breakfast cereal brand B. Therefore, as there are similar ratings of enjoyment for both cereals, the consumer testing company are happy to recommend the tasty new breakfast cereal.
INTRODUCTION TO SCATTERPLOTS

While a scatterplot is generally the most appropriate illustrative statistic to support a correlation, when conducting a Kendall tau-b test it should be used with caution. The Kendall tau-b correlation coefficient is produced by using the rank of the scores rather than the actual raw data, whereas the scatterplot displays the raw scores. The procedure for producing a scatterplot is shown next.

A scatterplot or scattergram illustrates the scores or data that we wish to correlate, where the axes are the two variables. If the scores on one variable increase and so do the scores on the second variable, this is known as a positive correlation. If scores on one variable increase while the scores on the other variable decrease, this is known as a negative correlation. When the points are randomly scattered there is generally no correlation between the two variables. Although a scatterplot is recommended as an illustration supporting a correlation, it must be used with caution in conjunction with Spearman and Kendall tau-b correlations because these nonparametric analyses use the rank scores rather than the actual raw data, whereas the scatterplot displays the raw scores.

When producing a scatterplot you can ask SPSS to produce the regression line – the line of best fit. This line minimises the squared vertical distances of the points from the straight line.

Scatterplot procedure through Chart Builder

The procedure for creating scatterplots through the Chart Builder command is similar to other interactive charts and graphs that were produced in an earlier chapter (see Chapter 4). We shall use the data of the Pearson correlation example.
• Select the Graphs drop-down menu and select Chart Builder. A Chart Builder window appears, giving you the option to set the measurement level of your variables. Normally, you have already set the measurement level.
• Double click on the type of scatterplot you require. We are using the simple one on the top left of the eight choices. Alternatively, you can drag the simple scatterplot icon into the preview pane.
• The Chart preview will show the type of scatterplot you wish to produce.
• The Element Properties window also appears. We are not changing the element properties.
• Press OK.
• Although the positive linear relationship can be seen from the chart, adding a regression line will enable a more accurate judgement to be made.
• To insert the regression line, double click inside the scatterplot output and the SPSS Chart Editor window appears.
• Select the Elements drop-down menu and click Fit Line at Total. The Properties window appears. Check that the Linear radio button is selected and then click Close. Close the Chart Editor.
• Your scatterplot now displays a regression line showing the positive correlation between study time and science exam marks.
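Chart Builder pastes rather verbose GGRAPH syntax, but the older GRAPH command produces the same simple scatterplot in one step. A sketch, with our assumed variable names, is:

GRAPH
  /SCATTERPLOT(BIVAR)=StudyTime WITH ScienceExam.

The regression line is still added afterwards through the Chart Editor, as described above.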
PARTIAL CORRELATION
Previously we have analysed some example data to show a significant correlation
between study time and science examination performance. However, we might decide
that a third variable, ‘intelligence’, could be influencing the correlation. If intelligence
positively correlates with study time – that is, the more intelligent students spend
the most time studying – and if it also positively correlates with examination
performance – that is, the more intelligent students get the higher marks in the
examination – then the correlation of study time and examination performance might
simply be due to the third factor, ‘intelligence’. If this is the case, then the relationship
between study time and examination performance is not genuine, in that the reason
they correlate is because they are both an outcome of ‘intelligence’. That is, the more
intelligent students both study more and get higher marks in the examination. If we
take out the effect of intelligence, the relationship of study time to examination
performance could disappear.
To answer the question of the influence of intelligence on the study time/
examination performance correlation we need to examine the correlation of study
time and examination performance after removing the effects of intelligence. If the
correlation disappears, we will know that it was due to the third factor. To do this
we calculate a partial correlation.
Data entry
Enter the dataset as shown in the example.
We could add labels to our variables as in the earlier example. See Chapter 2 for the full data entry procedure.
partial correlation: The correlation of two variables after having removed the effects of a third variable from both.

• In the Partial Correlations window highlight the two variables that we want to correlate and then send them across to the Variables box.
• The third factor that we want to control for needs to be sent to the Controlling for box.
• Change the Test of Significance to One-tailed. Click OK.
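A sketch of the equivalent syntax, assuming the variables are named StudyTime, ScienceExam and Intelligence, is:

PARTIAL CORR
  /VARIABLES=StudyTime ScienceExam BY Intelligence
  /SIGNIFICANCE=ONETAIL
  /MISSING=LISTWISE.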
SPSS essential
• The correlation test statistic = .665. The p value is shown to be .025. As this value is under .05, there is a significant correlation. These figures are duplicated in the matrix.
• A conventional way of reporting these figures is r = .665, df = 7, p < .05.
• These results indicate that as study time increases science exam performance also increases, when the effects of intelligence have been controlled for. This is a positive correlation. So the relationship between study time and science exam performance is not a result of intelligence.
• The output matrix also shows that study time is perfectly correlated with itself, r = 1.000. Similarly, science exam has a perfect correlation with itself, r = 1.000. These values are therefore not required.
LINEAR REGRESSION

linear regression: A regression that is assumed to follow a linear model. For two variables this is a straight line of best fit, which minimises the ‘error’.

We have previously identified a positive correlation between the two variables ‘study time’ and ‘science exam’ mark. We may wish to investigate this relationship further by examining whether study time reliably predicts the science exam mark. To do this we use a linear regression.

Linear regression test procedure
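The regression is run from the menus via Analyze, Regression and Linear, sending ‘science exam’ to the Dependent box and ‘study time’ to the Independent(s) box. A minimal syntax sketch, with our assumed variable names, is:

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT ScienceExam
  /METHOD=ENTER StudyTime.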
SPSS output
The first table reminds us that we are predicting science scores (the dependent
variable) from the study time (the independent variable).
The next table is the Model Summary, which provides us with the correlation
coefficient. We can compare this table with the output from the Pearson correlation
on the same data, shown earlier.
SPSS essential
• The R Square value in the Model Summary table shows the amount of variance in the dependent variable that can be explained by the independent variable.
• In our example the independent variable of study time accounts for 51.9 per cent of the variability in science exam scores.
SPSS advanced
• The R value (.721) indicates that as study time increases the science score also increases, and this is a positive correlation, with r = .721. We know this to be statistically significant from the Pearson correlation output.
• The Adjusted R Square adjusts for a bias in R Square. R Square is sensitive to the number of variables and scores there are, and the Adjusted R Square corrects for this.
• The Std. Error of the Estimate is a measure of the accuracy of the prediction.
The ANOVA summary table that follows shows details of the significance of the
regression.
SPSS essential
• The ANOVA tests the significance of the regression model. In our example, does the independent variable, study time, explain a significant amount of the variance in the dependent variable, science exam result?
• As with any ANOVA, the essential pieces of information needed are the df, the F value and the probability value. We can see from the table that F(1,8) = 8.647, p < .05, and we can therefore conclude that the regression is statistically significant.
Now we have the Coefficients output table, which gives us the regression
equation.
SPSS essential

• The Unstandardized Coefficients B column gives us the value of the intercept (from the Constant row) and the slope of the regression line (from the Study Time row). This gives us the following regression equation:

Science exam score = 34.406 + (.745 × Study time)

• The Standardized Beta Coefficient column informs us of the contribution that an individual variable makes to the model. From the table we can see that study time ‘contributes’ .721 to science exam performance, which is our Pearson’s r value.
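To illustrate how the equation is used for prediction: a student who studies for 20 hours per week would be predicted to score 34.406 + (.745 × 20) = 49.3 (to one decimal place) in the science exam.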
SPSS advanced
• The t value for the Constant (t = 4.539, p < .01) tells us that the intercept is significantly different from zero.
• The t value for study time (t = 2.941, p < .05) shows that the regression is significant.
LOGISTIC REGRESSION
Sometimes we wish to create a regression equation that predicts a binary dependent
variable rather than a continuous dependent variable. So, rather than predicting an
examination mark (as in the example above) we wish to predict whether the value
of a variable will be 0 or 1, or no or yes. A logistic regression aims to see whether
a value of the binary dependent variable can be predicted by the scores of an
independent variable – for example, which factor(s) might significantly predict
whether students will succeed or fail at a specific exam.
With binary dependent variables the regression is not based directly on the
function of the straight line but on the logistic function, which ranges between 0
and 1. If the probability of a ‘yes’ response to a yes/no question in a questionnaire
is p, then the odds of getting a ‘yes’ response is p/(1 – p). The natural logarithm of
these odds is called the logit, ln (p/(1 – p)). It is the logit, rather than Y (as we saw
in the linear regression above) that is predicted to be linearly related to the
independent variable X. We can rearrange the regression equation of logit = a + bX to predict the values of p. We predict that Y = 1 when p ≥ .5 and Y = 0 when p < .5. The point where a + bX = 0 is the point at which the prediction changes from 0 to 1: a 1 (or ‘yes’ value) is predicted for positive values of a + bX and a 0 (or ‘no’ value) for negative values.
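Rearranging logit = a + bX to give p directly produces the logistic function itself, a standard result:

p = \frac{1}{1 + e^{-(a + bX)}}

This is the S-shaped curve, bounded by 0 and 1, on which the regression is based.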
A logistic regression is the appropriate test to use when we have one or more independent variables measured on an interval scale and we are trying to predict group membership on a dependent variable measured as a nominal category.
Scenario
It was noted in a large town that many drivers used their cars to drive to work.
In order to reduce traffic congestion and support sustainable travel methods, twenty
drivers in a commuter car park were asked their distance to work (measured in miles)
and also whether they would use public transport if the price was reduced by
20 per cent (responses ‘yes’ or ‘no’).
Data entry
Enter the dataset as shown in the example. The Switch variable records whether the commuter would change to public transport with the price reduction. A value of 0 is coded for ‘no’ responses and 1 for ‘yes’ responses. See Chapter 2 for the full data entry procedure.
• Binary responses are often recorded as either a 0 or a 1. For a logistic regression you need to code your values as either a 0 or a 1.
• Remember that to see the numerical values instead of the value labels you need to go to the View drop-down menu and deselect Value Labels.
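The analysis is run from the menus via Analyze, Regression and Binary Logistic, with Switch as the dependent variable and Distance as a covariate. A minimal syntax sketch, with our assumed variable names, is:

LOGISTIC REGRESSION VARIABLES Switch
  /METHOD=ENTER Distance
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).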
SPSS output
The first table is the Case Processing Summary, which tells us how many cases are included in the analysis.

• We can see that there were 20 cases included for analysis and there were no Missing Cases.

The Block 0: Beginning Block shows the model with no predictor variables included.

• The rows in the Classification Table display the observed number of 0s and 1s in our dependent variable.
• It shows that a basic model predicting all the results as ‘yes’ responses – without including the ‘Distance’ variable – would give an accuracy of prediction of 60 per cent.
• The Variables in the Equation table shows the significance of the basic model without having included the ‘Distance’ variable.
• These values are not usually reported in reports or academic papers.
• By examining the Variables not in the Equation box we can see that the variable ‘Distance’ (if it had been entered into the equation) would have been a significant predictor of the ‘Switch to public transport’.
SPSS essential
• The first table in the block is the Omnibus Tests of Model Coefficients. This shows how much the current step, block and model predict the dependent variable compared to the basic model in Block 0.

In our example, as we have just one independent variable, all of the values are the same. There is a chi-square of 8.477, df = 1, p = .003. As the chi-square is significant, it indicates that the new model is a better predictor than the basic model in Block 0.
SPSS advanced
• The Model Summary table presents estimations of the amount of variance explained by the logistic regression model.
• The Cox & Snell R Square value and the Nagelkerke R Square value in the Model Summary table show estimates of the amount of variance in the dependent variable that can be explained by the independent variable. The Nagelkerke value is usually the larger of the two and more often reported, indicating that 46.7 per cent of the variation is explained by the model.
SPSS essential
• The Classification Table displays the number of observed cases that are correctly predicted by the model. It also shows the Overall Percentage of the cases that are correctly predicted by the model. We can see that 70 per cent of the cases are correctly predicted by the model. This is a higher value than the previous Classification Table of the basic model in Block 0, showing that the model has more predictive power.
• The Variables in the Equation table shows the output of the model including the predictor variable of ‘Distance’ plus the constant. This table shows the logistic regression model that has been produced for the data.
• The Wald statistic gives the significance of each component of the logistic regression. Distance is significant at p < .05 (Wald = 4.965, df = 1, Sig = .026).
• As we can see, ‘Distance’ has a significant effect on the prediction. The Exp(B) value, as it is larger than 1, indicates that as Distance increases by one mile, the odds that the person will switch to a ‘yes’ response increase.
SPSS advanced
• The B values give the logistic regression coefficients that can be used in the formula to predict the logit values. The logit is 0 at the point where the prediction changes from no to yes, so we can use the B values to find this point. Using the regression equation logit = a + bX, the table gives the slope b = .104 (the coefficient for Distance) and the intercept a = –2.522 (the Constant), so 0 = .104 × X – 2.522 can be solved to give the value of Distance at the cut-off point. So X = 2.522/.104 = 24.25. Therefore, the logistic regression model predicts a ‘yes’ response for commuters who travel over 24.25 miles and a ‘no’ response for those driving less than 24.25 miles.
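Note also that the Exp(B) value is simply e raised to the power of B: with B = .104 for Distance, Exp(B) = e^.104 ≈ 1.11, so each additional mile of commute multiplies the odds of switching to public transport by about 1.11.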
FAQ
I’ve carried out a correlation and my output states the significance for a two-tailed test.
However, my prediction is one-tailed. What have I done wrong?
When carrying out correlations in SPSS you need to specify whether you require the test to be
one- or two-tailed in the Bivariate Correlations window. To obtain a one-tailed significance
value after selecting a two-tailed calculation, divide the p value by 2, or redo the correlation but
this time select the one-tailed option.
FAQ
I have predicted a positive correlation in my study (one-tailed). My result has come out as r = –0.7, which is highly significant (p < .01). Have I found support for my hypothesis?
No, you have predicted a positive correlation, which is that r would be between 0 and +1.
However, your results show a negative correlation with r = – 0.7, so the result has gone in the
opposite direction to that which you predicted.
FAQ
I have got two variables but am unsure whether to calculate a correlation or regression.
This will depend on what exactly you want to find out from your analysis. If you are interested
in the strength of the linear relationship between the two variables, then a correlation will be
the most appropriate. However, if you wish to predict values of one variable by the values of the
other variable, you should be calculating a linear regression.
Further details of linear correlation and linear regression can be found in Hinton (2014), Chapter 20.