Statistical Analysis Using SPSS and R - Chapter 4
4.1 Introduction
The term ‘Parametric tests’ will be used here to refer to statistical techniques,
such as the t-tests and the One-Way Anova test, that rely on a number of
assumptions on the distribution of the covariate of interest. One of these
assumptions is that the covariate observations obtained on the different
subgroups being considered come from a normally distributed population or
that the distribution of the population can be approximated by a normal
distribution. Since there are many situations where all necessary assumptions
cannot be met, statisticians have developed alternative tests that are based on
less stringent assumptions, known as non-parametric tests. In particular, these
non-parametric tests are used when the data entries are ranks or when the data
does not satisfy the normality condition.
Parametric                    Non-Parametric
One-Sample t-test             One-Sample Wilcoxon Signed Rank Test / Sign Test
Independent Samples t-test    Mann-Whitney U Test
One-Way ANOVA                 Kruskal-Wallis Test
Paired Samples t-test         Wilcoxon Signed Ranks Test
Repeated Measures ANOVA       Friedman Test
Table 4.1.1
If the variable follows an approximate normal distribution and satisfies any
other assumptions upon which the test relies, we use parametric tests; if not,
we consider the non-parametric alternatives.
Two popular tests for checking normality (via hypothesis testing) are the Shapiro-Wilk test and the well-known Kolmogorov-Smirnov test. Both test the following hypotheses:
H0: The data come from a normally distributed population.
H1: The data do not come from a normally distributed population.
We shall see how each of these tests is conducted in SPSS and R/RStudio, because both tests are widely used. However, if the two tests give conflicting results, the result obtained from the Shapiro-Wilk test is considered, since it has been found to have considerably better power than the Kolmogorov-Smirnov test.
Note that, to avoid unnecessary repetition, a detailed description of a test
is provided when demonstrating how the test is conducted using SPSS but
such detail will be skipped when demonstrating how this is conducted
using R. Readers focusing on R software might need to look for more
detail in the sections on SPSS. Remember the tests are the same no matter
which software you use. Interpretation does not change from one software package to another; it is the execution and presentation that change.
In SPSS
Normality tests are obtained by choosing Analyze from the bar menu, selecting Descriptive Statistics and clicking on Explore. The variable of interest, Current Salary in this example, is moved into the Dependent List and Normality plots with tests is selected under Plots.
Figure 4.1.1
Tests of Normality
                   Kolmogorov-Smirnov(a)            Shapiro-Wilk
                   Statistic    df    Sig.    Statistic    df    Sig.
Current Salary        .208     474    .000       .771     474    .000
a. Lilliefors Significance Correction
Table 4.1.2
Since both the Kolmogorov-Smirnov test and the Shapiro-Wilk test give a p-value (Sig.) reported as .000, which is smaller than 0.05, the hypothesis that Current Salary follows a normal distribution has to be rejected. Current Salary is thus not normally distributed.
Figure 4.1.2
A Q-Q plot charts observed values against a known distribution, in this case
a normal distribution. The expected normal distribution is the straight line
and the line made up of little circles is obtained from the observed values from
our data. If our data followed a normal distribution, the observations would lie close to the straight line. Hence our plot shows that the distribution of our data deviates from normality almost everywhere.
Figure 4.1.3
The Detrended Normal Q-Q plot shows the differences between the observed
and the expected values of a normal distribution. If the distribution is normal,
the points should cluster in a horizontal band around zero with no pattern. Our
plot again indicates deviation from normality.
In R/RStudio
The following packages will be needed for some of the tests in this chapter and should be installed: lawstat, MBESS, PMCMR.
Example: Consider the following sample of the ages, in years, of twelve subjects:
65, 61, 63, 86, 70, 55, 74, 35, 72, 68, 45, 58.
We shall now check whether the age sample comes from a normal
distribution. We enter the above sample into a variable mydata as follows:
mydata<-c(65, 61, 63, 86, 70, 55, 74, 35, 72, 68, 45, 58)
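The Shapiro-Wilk test is then applied to this sample with the following command, which produces the output that follows:
shapiro.test(mydata)

Shapiro-Wilk normality test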
data: mydata
W = 0.97107, p-value = 0.9216
Given that the p-value is greater than 0.1, 0.05 and 0.01, the standard levels of significance (0.05 being the most widely used), we deduce that the Shapiro-Wilk test does not reject normality for the age variable.
The QQ-plot provides a graphical way to check for normality. This is done as
follows:
qqnorm(mydata,plot.it=TRUE)
qqline(mydata)
More detail on the Q-Q plot has been provided earlier when discussing how
these plots are obtained using SPSS. It is important that one never uses the Q-Q plot alone to draw conclusions about normality or the lack of it; it should be used as an indicative tool. The proper way to determine whether the variable of interest is normally distributed is through a goodness-of-fit test such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test, which are discussed next.
The Kolmogorov-Smirnov test can also be used to test for distributions other than the normal. For example, to test whether the data come from a t-distribution with 5 degrees of freedom we type:
ks.test(mydata,"pt",df=5)
where "pt" refers to the cumulative distribution of the t-distribution and
df=5 refers to the degrees of freedom parameter being tested.
To see how the command works, generate 1000 readings from a t-distribution
with 5 degrees of freedom as follows:
mydata<-rt(1000,df=5)
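Applying the Kolmogorov-Smirnov test to this simulated sample should then, with high probability, fail to reject the hypothesised t-distribution; a minimal sketch:
ks.test(mydata, "pt", df=5)   # a large p-value is expected, since the data were generated from a t-distribution with 5 df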
4.2 Parametric Tests
In this section we shall start by discussing a number of t-tests. There are three
types of t-tests: One-Sample t-test, Independent-Samples t-test, Paired-
Samples t-test.
While each of the t-tests compares mean values they are designed for
distinctly different situations as we shall shortly see.
4.2.1 One-Sample t-test
The one-sample t-test is used to compare the mean of a single sample with a specified population mean.
Example: A company claims that its product contains, on average, 14.1g of sugar. The following 36 readings of sugar content, in grams, were recorded:
14.5 16.2 14.4 15.8 13.1 12.9 17.3 15.5 16.2 14.9 13.9 15.0
14.4 15.6 13.9 15.6 14.4 16.4 17.9 15.0 14.8 13.6 16.1 15.2
14.3 15.8 16.4 16.6 17.1 13.5 15.8 14.7 16.0 13.4 15.8 16.7
The null and alternative hypotheses for this two-tailed test are:
H0: μ = 14.1g
H1: μ ≠ 14.1g
Note that this is a two-tailed test since in the alternative hypothesis we are considering both values > 14.1g and values < 14.1g.
First using the normality tests discussed earlier we test if the data set under
study can be assumed to follow a normal distribution. To be able to check for
normality we need to enter the data in the format required by the software.
In SPSS
The list of readings making up the covariate ‘sugar’ is inserted in the cells of
the first column.
From Table 4.2.1, below, we note that the p-values are greater than 0.05 for
both the Kolmogorov-Smirnov test and Shapiro Wilk test, hence we cannot
reject the assumption of normality.
Tests of Normality
                          Kolmogorov-Smirnov(a)           Shapiro-Wilk
                          Statistic   df   Sig.     Statistic   df   Sig.
Sugar content in grams      .091      36   .200*      .984      36   .876
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
Table 4.2.1
Choose Analyze from the bar menu, select Compare Means and click on One-
Sample t Test. Move ‘sugar’ to the test variable box and input 14.1 for the
test value. Click OK to run the procedure:
Figure 4.2.1
One-Sample Statistics
N Mean Std. Deviation Std. Error Mean
Sugar content in grams 36 15.2417 1.23297 .20549
Table 4.2.2
One-Sample Test
                                        Test Value = 14.1
                                                     Mean        95% Confidence Interval of the Difference
                         t      df   Sig. (2-tailed) Difference  Lower     Upper
Sugar content in grams  5.556   35        .000        1.14167    .7245     1.5588
Table 4.2.3
Table 4.2.2 shows that the mean sample sugar content is 15.24g with a
standard deviation of 1.23g. Now, accepting the null hypothesis is the same
as saying that the mean difference between the population mean, 14.1g and
the sample mean, 15.24g is not statistically significant. This mean difference
in fact plays a major role in calculating the t-statistic, which is simply the ratio of the mean difference to the standard error of the mean.
From Table 4.2.3, the mean difference is 1.14167g and from Table 4.2.2, the standard error of the mean is 0.20549g.
The t-statistic is thus 1.14167/0.20549 = 5.556.
Working out manually, we would have to get the critical values from the t-
distribution and hence check whether 5.556 lies in the acceptance or the
rejection region. The SPSS output reports a t-statistic and degrees of freedom
for all t-test procedures. Every unique value of the t-statistic and its associated
degrees of freedom have a significance value. So, when using SPSS, we just
have to consider the p-value (sig) so as to determine whether to accept or reject
H0 .
In this case, our p-value is approximately 0 (reported as .000). Now, by default SPSS works at 95% confidence (0.05 level of significance). Hence, since the p-value is smaller than 0.05 we reject H0, whilst if the p-value were greater than 0.05 we would not reject H0. We thus conclude, at the 0.05 level of significance, that the actual mean sugar content is not equal to 14.1g.
If on the other hand it was required for us to test whether the mean sugar
content is larger than 14.1g, we would be dealing with a one-tailed test and
the hypotheses are:
H0: μ = 14.1g
H1: μ > 14.1g
In this case we have a one-tailed test since in the alternative hypothesis we are
only considering values >14.1g. Now the p-value given by SPSS is always
computed for a two-tailed test. For a one-tailed test this p-value is simply
divided by 2. So for this example, the p-value provided by SPSS is first
divided by two. The resulting value is approximately 0, and smaller than the
level of significance (0.05), so we reject H 0 .
In R/RStudio
Save the data in a .csv (comma-separated values) file, which can be created in Excel. Place all the data in a column with the first element of the column containing the variable name: sugar in this case.
Recall the following commands that are used to upload and view the data file:
dataonesample<-read.csv(file.choose(), header=TRUE) # upload the data file
attach(dataonesample)
names(dataonesample) # list the variables in my data
str(dataonesample) # list the structure of my data
View(dataonesample) # opens the data viewer
The one-sample t-test is then performed by typing:
t.test(dataonesample$sugar,mu=14.1)
where mu=14.1 above refers to the hypothetical value of 14.1g. This yields
the following output:
data: dataonesample$sugar
t = 5.5557, df = 35, p-value = 2.976e-06
alternative hypothesis: true mean is not equal to 14.1
95 percent confidence interval:
14.82449 15.65884
sample estimates:
mean of x
15.24167
We can see that the sample mean is 15.24167, the t-statistic is 5.5557, and that
a p-value of 2.976e-06 rejects the null hypothesis that the true population
mean is equal to 14.1g. Also note the 95% confidence interval for the
population mean, with the lower confidence bound given by 14.82449 and
the upper confidence bound given by 15.65884.
If we want to change the confidence level to, say, 99%, we add conf.level=0.99 to the t.test command.
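For instance, a sketch of the same test with a 99% confidence interval:
t.test(dataonesample$sugar, mu=14.1, conf.level=0.99)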
If on the other hand it was required for us to test whether the mean sugar
content is larger than 14.1g, we would be dealing with a one-tailed test and
the hypotheses are:
H0: μ = 14.1g
H1: μ > 14.1g
In this case we have a one-tailed test since in the alternative hypothesis we are
only considering values >14.1g. Thus we type:
t.test(dataonesample$sugar, alternative="greater",mu=14.1)
data: dataonesample$sugar
t = 5.5557, df = 35, p-value = 1.488e-06
alternative hypothesis: true mean is greater than 14.1
95 percent confidence interval:
14.89447 Inf
sample estimates:
mean of x
15.24167
In this case, the t-statistic is the same as before but the p-value has been halved
due to the fact that we are now using a one-tailed test. The p-value is smaller
than 0.05, meaning that the null hypothesis is rejected at a 0.05 level of
significance. Hence the company’s claim is incorrect and the actual average
amount of sugar is greater than 14.1g.
4.2.2 Independent Samples T-test
Example: The number of weekly hours spent at a fitness centre is recorded for a sample of male clients and a sample of female clients, as listed below:
Male Clients 20, 18, 15, 10, 13.5, 5, 9.5, 15, 22.25, 25, 30.5, 15.25, 14,
10, 10.5, 8, 5.5, 6.6, 9.9, 8.2, 4.4, 2.2, 3.3, 4.5, 2.3
Female Clients 3,2, 3.5, 4.5, 6.6, 8.8, 9, 10, 2.5, 4.5, 5, 6.25, 4, 3, 2, 5.5,
6, 6, 4.25, 4, 4.5, 8, 9, 8.8, 8, 4, 2, 3.5, 5.25, 6
To be able to use the independent samples t-test we need to first check whether
the male client data and female client data are both normally distributed. Also,
the independent samples t-test relies on the assumption of equality of
population variances, meaning that to be able to use this test, the male
population and female population need to have equal variances. The former
is checked by means of normality tests. The latter is checked by means of
Levene’s test for equality of variances. If both assumptions are satisfied, the
independent samples t-test may then be used to compare the average number
of weekly hours spent at the fitness centre by male and female clients.
The null and alternative hypotheses for a two-tailed test would be:
H 0 : On average, male and female clients spend the same number of weekly
hours at the fitness centre.
H1 : On average, male and female clients do not spend the same number of
weekly hours at the fitness centre.
The null and alternative hypotheses for a one-tailed test could be:
H 0 : On average, male and female clients spend the same number of weekly
hours at the fitness centre.
H1 : On average, male clients tend to spend more time at the fitness centre
than female clients.
In SPSS
Figure 4.2.2
where for the variable Gender, 1=Male and 2=Female.
We start by testing if the variable Time follows a normal distribution for both
independent groups (that is for males and females). The output that follows
confirms that this is in fact true since all p-values are greater than 0.05.
Tests of Normality
                                        Kolmogorov-Smirnov(a)          Shapiro-Wilk
                              Gender   Statistic   df   Sig.    Statistic   df   Sig.
Time in hours, spent at the   Male       .156      25   .119      .930      25   .089
fitness centre                Female     .136      30   .162      .939      30   .087
a. Lilliefors Significance Correction
Table 4.2.4
Choose Analyze from the bar menu, select Compare Means and click on Independent-Samples T Test, where the variable Time is moved underneath Test Variable(s) and the
variable Gender is moved into the Grouping Variable box.
The button Define Groups is then clicked to define the groups according to
the index given to represent Males and Females (for example 1 and 2).
Figure 4.2.3
The resulting outputs are as follows:
Table 4.2.5
Table 4.2.6
From Table 4.2.5, it should be noted that SPSS automatically computes the
descriptive statistics (sample mean, sample standard deviation, standard error
for the sample mean), Levene’s test for equality of variances, 95% confidence
interval (or otherwise if specified), t-statistic (with corresponding degrees of
freedom) and the two-tailed p-value, which will once again be the main focus
in our hypothesis testing.
Now, Table 4.2.6 contains two sets of values: the first line assumes equal
variances and the second does not. To assess which line of values to use, the
result for Levene’s test for Equality of Variances has to be taken into
consideration.
The Levene’s test checks whether our datasets come from populations with
equal variances and again, this hypothesis is accepted or rejected on the basis
of the p-value being greater or smaller than α. The corresponding hypothesis
test is:
H 0 : The variances of the two populations from which the samples are
extracted are equal.
H 1 : The variances of the two populations from which the samples
are extracted are not equal.
Taking α = 0.05, the resulting p-value in this case is 0.000 < 0.05 and thus we reject H0. So, the two datasets come from populations with unequal
variances. Thus, we should use the statistics in the row labelled Equal
variances not assumed.
Considering the two-tailed test mentioned earlier, our null and alternative hypotheses can be written as:
H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0
From Table 4.2.6 we note that the t-statistic (obtained using the Welch t-test)
under the assumption of unequal variances has a value of 4.063 with degrees
of freedom of 28.05 and an associated p-value of 0.000. Since the p-value is
less than 0.05, we reject H0. The sample means in Table 4.2.5 in fact suggest that males, on average, spend more time at the fitness centre than females, and such a small p-value indicates that this observed difference in sample means is very unlikely to be due to chance alone.
If we instead consider the one-tailed test with alternative hypothesis
H1: On average, male clients tend to spend more time at the fitness centre than female clients,
then, once again, we can consider the t-statistic value obtained under the assumption of unequal variances, 4.063 with 28.05 degrees of freedom, and check whether this value falls inside the rejection region or not.
Otherwise, as is typically done when using software, we consider the associated one-tailed p-value of 0.000/2 ≈ 0.000. Since the p-value is less than 0.05, we reject H0. Thus we can conclude that male clients tend to spend more time at the fitness centre than female clients.
In R/RStudio
Suppose that the data has been entered in Excel in the same layout as was used
in SPSS. The data should be saved with extension .csv.
dataindepsamples<-read.csv(file.choose(), header=TRUE) # upload the data file
attach(dataindepsamples)
names(dataindepsamples)
# list the variables in my data
str(dataindepsamples)
# list the structure of mydata
View(dataindepsamples)
#opens data viewer
dataindepsamples$Gender<-
factor(dataindepsamples$Gender,levels=c(1,2),labels=
c("Male", "Female"))
#Adding labels to the levels of the fixed factor
We can check the assumption of normality for both subgroups using the
command:
by(dataindepsamples$Time, dataindepsamples$Gender,
shapiro.test) #Checking if both groups satisfy normality
Based on the p-values from the output below, we cannot reject normality at
either 0.01 or 0.05 levels of significance.
dataindepsamples$Gender: Male
data: dd[x, ]
W = 0.93048, p-value = 0.08916
-----------------------------------------------------------
----------------
dataindepsamples$Gender: Female
data: dd[x, ]
W = 0.9392, p-value = 0.08655
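The homogeneity of variances assumption is checked next with Levene's test from the lawstat package. A command along the following lines produces the output that follows (in which the data set appears under a different name, Independent_Samples_ttest, used when the file was originally imported):
library(lawstat)
levene.test(dataindepsamples$Time, dataindepsamples$Gender, location="mean")
# test for homogeneity of variance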
data: Independent_Samples_ttest$Time
Test Statistic = 22.808, p-value = 1.455e-05
With a test statistic of 22.808 and a p-value of 1.455e-05, the null hypothesis
of variance homogeneity is rejected, whether at the 0.1, 0.05 or 0.01 level of
significance. Typing:
by(dataindepsamples$Time,dataindepsamples$Gender,var)
we can in fact see that the observed times for males have a variance of 54.05
and those for females have a variance of 5.346.
H0: μ1 − μ2 = 0
H1: μ1 − μ2 ≠ 0
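The test itself can then be run with a command along these lines (a sketch consistent with the options described below):
t.test(dataindepsamples$Time ~ dataindepsamples$Gender, mu=0, var.equal=FALSE)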
In the above command, mu=0 is the value set by the null hypothesis, and
var.equal=FALSE is required because we cannot assume equal variances.
When it is possible to assume equal variances this is replaced with
var.equal=TRUE.
The t-statistic of 4.0631 rejects the null hypothesis for levels of significance 0.01, 0.05 and 0.1, with a p-value of 0.0003539. This means that the mean time
spent at the fitness centre by male clients is significantly different from the
mean time spent at the fitness centre by female clients.
The output also shows that the sample mean times for males and females are
11.54 and 5.32 respectively, which indicate that on average male clients spend
more than double the time at the fitness centre than females.
H0: μ1 − μ2 = 0
H1: μ1 − μ2 > 0
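The corresponding one-tailed command could take the following form (again a sketch consistent with the options described below):
t.test(dataindepsamples$Time ~ dataindepsamples$Gender, mu=0, var.equal=FALSE, alternative="greater")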
As can be seen in the above command, due to the fact that we have a one-
tailed test and we are checking whether the average time spent at the fitness
centre by males is greater than the average time spent by females, we need to
set alternative="greater".
sample estimates:
mean in group Male mean in group Female
11.536 5.315
Note that the t-statistic and the sample means are the same as those obtained
for the two-tailed test. The p-value is smaller at 0.0001769 (due to dividing
by two). This value is once again smaller than 0.05, thus the null hypothesis
is rejected.
If on the other hand we wanted to test
H0: μ1 − μ2 = 5
H1: μ1 − μ2 ≠ 5
or
H0: μ1 − μ2 = 5
H1: μ1 − μ2 > 5
we would simply set the hypothesized difference through mu=5 in the t.test command.
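A sketch of the corresponding commands:
t.test(dataindepsamples$Time ~ dataindepsamples$Gender, mu=5, var.equal=FALSE)  # two-tailed
t.test(dataindepsamples$Time ~ dataindepsamples$Gender, mu=5, var.equal=FALSE, alternative="greater")  # one-tailed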
4.2.3 Paired Samples t-test
Example: The following are the average weekly losses, in hours, due to
accidents in 10 industrial plants before and after a certain safety program was
put into operation. We want to check whether the program is significantly
reducing the mean weekly losses due to accidents. A parametric test is used
because both sets of readings may be shown to come from a population
following a normal distribution.
Before 45 73 46 124 33 57 83 34 26 17
After 36 60 44 119 35 51 77 29 24 11
The paired-samples t-test is the ideal test to use to check for significant
difference between the before and after readings. Our null and alternative
hypotheses are:
H0: μBefore − μAfter = 0
H1: μBefore − μAfter > 0
Null: The program is not effective (Mean weekly losses are unaltered)
Alternative: The program has significantly reduced the mean weekly losses
In SPSS
Enter all the observations of one sample in the cells of the first column and
the observations of the second sample in the cells of the second column.
These columns define the ‘before’ and the ‘after’ average weekly losses on
man hours due to accidents.
Before we can proceed to use the paired samples t-test, we need to create a
new variable with the differences between Before and After. This new
variable is created so that we can use it to check for normality of the data.
Figure 4.2.4
In this case, focus lies with the differences since the paired samples t-test is
actually a one sample t-test applied to the set of differences. To create the new
variable go to Transform, Compute Variable, give a name to the new variable
underneath Target Variable, move the variables Before and After underneath
Numeric Expression and include the subtraction sign in between as shown in
Figure 4.2.4.
Having created the new variable Differences, we use the Shapiro-Wilk test to check whether this variable is normally distributed or not. This test yields a
p-value of 0.682. Hence the null hypothesis that this variable follows a normal
distribution cannot be rejected. Thus the paired t-test can be applied.
Choose Analyze from the bar menu, select Compare Means and click on
Paired samples t-test. Move the variables ‘before’ and ‘after’ simultaneously
to the paired variables box and click on OK to run the procedure.
Figure 4.2.5
From Table 4.2.7, the average of the paired differences is 5.2. Our null
hypothesis says that there is no difference in the average weekly losses on
man-hours due to accidents before and after the program; in other words we
are testing whether this mean paired difference, 5.2, is significantly different
from zero. SPSS calculates the value of the t-statistic, which is simply the ratio
of mean paired difference and the standard error. So t = 5.2/1.289 = 4.033.
Since the p-value, 0.0015 (0.003/2, because the alternative hypothesis is one-tailed), is smaller than the level of significance (0.05), we have enough evidence to reject the null hypothesis. This implies that the program was effective in reducing the weekly losses on man-hours due to accidents. Had the program had no effect, the probability of observing a difference at least this large would be only 0.0015, which is very small.
In R/RStudio
In this section we will show how the paired samples t-test may be conducted
in R/RStudio.
datapairedsamples<-read.csv(file.choose(), header=TRUE) # upload the data file
attach(datapairedsamples)
names(datapairedsamples) # list the variables in my data
str(datapairedsamples) # list the structure of my data
View(datapairedsamples) # opens data viewer
As a first step, we need to check that the difference between Before and After
readings follows a normal distribution. In R this is done through the following
command:
shapiro.test(datapairedsamples$Before-datapairedsamples$Aft
er) # Checking if differences satisfy normality
This test yields a p-value of 0.6824 so we cannot reject the null hypothesis
that the differences between before and after readings follow a normal
distribution.
We proceed to test the hypothesis stated for the example under study, that is,
we want to check whether the mean of the after data is significantly smaller
than that of the before data. We perform the paired samples t-test as follows:
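The following command may be used (a sketch, with the one-tailed alternative that the Before mean exceeds the After mean):
t.test(datapairedsamples$Before, datapairedsamples$After, paired=TRUE, alternative="greater")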
Paired t-test
The outputs are very similar to those obtained in SPSS. The sample mean of the differences is 5.2, and with a t-statistic of 4.033 and a p-value of 0.0015, the null hypothesis is rejected at the 0.1, 0.05 and 0.01 levels of significance. For
more detail on how to interpret this result from an application point of view
refer to the interpretation given for the SPSS output.
4.2.4 One-Way ANOVA
The null hypothesis for a one-way ANOVA is that the mean values of the
independent groups/subpopulations are equal. The alternative hypothesis is
that some of the mean values of the independent groups/subpopulations are
different.
Example: The following are the scores obtained by three flavours, Orange, Lemon and Cherry:
Orange   Lemon   Cherry
  13       12       7
  17        8      19
  19        6      15
  11       16      14
  20       12      10
  15       14      16
  18       10      18
   9       18      11
  12        4      14
  16       11      11
   9        5
   9        6
            4
The null and alternative hypotheses for this test are:
H0: The mean scores between flavours are equal (there is no significant
difference in mean scores due to flavour)
H1: The mean scores between flavours are not equal (there is a significant
difference in mean scores due to flavour)
In SPSS
Enter all the observations of the three samples in the cells of the first column,
which defines the covariate ‘score’. In the second column enter the values 1,
2 or 3 besides the readings of Orange, Lemon and Cherry respectively. This
column defines the factor ‘flavour’ which has three levels.
Figure 4.2.6
First we need to check that the assumption of normality is satisfied for the
three levels of the factor flavour. Normality tests can be done by going to
Analyze, Descriptive Statistics, Explore and follow the instructions given
earlier in the notes. The procedure to check for normality here is the same as
what has been described before but in this case the factor flavour is entered
underneath Factor list as shown in Figure 4.2.7.
Figure 4.2.7
The following output confirms that at each level of the factor flavour, the
normality assumption is satisfied by the dependent variable score.
Tests of Normality
                  Kolmogorov-Smirnov(a)           Shapiro-Wilk
        flavour  Statistic   df   Sig.     Statistic   df   Sig.
score   Orange     .142      12   .200*      .916      12   .256
        Lemon      .172      13   .200*      .937      13   .416
        Cherry     .153      10   .200*      .970      10   .894
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
Table 4.2.9
From the bar menu choose Analyze, select Compare Means and click on One
Way Anova. Move score into the Dependent List and move ‘flavour into the
grouping variable to be defined as a factor. Click on Options and select
Descriptives, Homogeneity of Variance Test and Means Plot. Finally click on
Continue and OK to run the procedure.
Figure 4.2.8
Descriptives
score
95% Confidence Interval
for Mean
Std. Std. Lower Upper
N Mean Deviation Error Bound Bound Minimum Maximum
Orange 12 14.0000 4.04520 1.16775 11.4298 16.5702 9.00 20.00
Lemon 13 9.6923 4.62574 1.28295 6.8970 12.4876 4.00 18.00
Cherry 10 13.5000 3.74907 1.18556 10.8181 16.1819 7.00 19.00
Total 35 12.2571 4.53965 .76734 10.6977 13.8166 4.00 20.00
Table 4.2.10
Table 4.2.10 gives the sample average, standard deviation, standard error and
95% confidence intervals of the mean scores obtained by the three flavours.
For each flavour, there is a 95% chance that the actual mean score lies within the given confidence interval. So there is a 95% chance that the actual mean score for Orange lies between 11.43 and 16.57.
The sample means suggest that, on average, the scores for Orange and Cherry can be considered equal while the score for Lemon is lower. The statistical significance of these observed differences still needs to be tested.
The following table presents the results of Levene's test. Levene's test is
used to check whether the assumption of equal variances is satisfied. In this
case we have the following hypothesis:
H0: The variances of the subpopulations from which the samples are extracted
are equal.
H1: The variances of the subpopulations from which the samples are extracted
are not equal.
Since the p-value is 0.599, which is greater than 0.05, then we cannot reject
the null hypothesis. So there is no significant difference in the three
subpopulation variances.
Since all assumptions made by the one-way ANOVA test have been shown to
be satisfied we can proceed to interpret the output for this test.
ANOVA
score
Sum of Squares df Mean Square F Sig.
Between Groups 137.416 2 68.708 3.903 .030
Within Groups 563.269 32 17.602
Total 700.686 34
Table 4.2.12
SPSS calculates the value of the F-statistic, which is simply the ratio of the mean square (between groups) to the mean square (within groups). So F = 68.708/17.602 = 3.903. Since the p-value (0.03) is smaller than the level of significance (0.05), we reject H0. This implies that there is a significant
difference in mean scores due to flavour. The differences between the sample
means are displayed visually in the means plot shown below:
Figure 4.2.9
For a more detailed study on which means are significantly different from the
rest, one can consider Post Hoc tests. More detail about such tests is provided
later in section 4.4.
In R/ RStudio
Data is saved in .csv format following the same structure used when entering
data in SPSS. So again we have two columns, the first defines the covariate
‘score’ and the second the factor with three levels, ‘flavour’.
We start by loading the data and then assigning the labels to the factor.
dataflavours<-read.csv(file.choose(), header=TRUE) # upload the data file
attach(dataflavours)
names(dataflavours) # list the variables in my data
dataflavours$flavour<-factor(dataflavours$flavour,
levels=c(1,2,3),labels=c("Orange","Lemon","Cherry"))
# adding labels
We next need to check for the normality assumption and the variance
homogeneity assumption. For the first, we write the command:
by(dataflavours$score,dataflavours$flavour,shapiro.test)
# testing for normality
dataflavours$flavour: Lemon
Shapiro-Wilk normality test
data: dd[x, ]
W = 0.93673, p-value = 0.416
-----------------------------------------------------------
dataflavours$flavour: Cherry
Shapiro-Wilk normality test
data: dd[x, ]
W = 0.97035, p-value = 0.8941
We can see that for all three flavours, the Shapiro-Wilk test fails to reject the
normality assumption at all levels of significance (0.1, 0.05 and 0.01). This
means that we can assume that the normality assumption is satisfied in all
three cases. For the Levene's test on the other hand we write:
levene.test(dataflavours$score,dataflavours$flavour,location="mean")
# test for homogeneity of variance
data: dataflavours$score
Test Statistic = 0.52004, p-value = 0.5994
With a test statistic of 0.52, and a p-value of 0.5994 we fail to reject the
homogeneity of variance hypothesis at 0.1, 0.05 and 0.01 levels of
significance. Having failed to reject both assumptions stipulated by the
ANOVA test, we can now proceed to implement one-way ANOVA. There
are various ways in which one-way ANOVA may be performed in R. The
following is just one of them:
output=aov(dataflavours$score~dataflavours$flavour)
summary(output)
Df Sum Sq Mean Sq F value Pr(>F)
dataflavours$flavour 2 137.4 68.71 3.903 0.0304 *
Residuals 32 563.3 17.60
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the resulting output, we fail to reject the null hypothesis that there is no difference in mean scores due to flavour at the 0.01 level of significance. The null hypothesis is, however, rejected at the 0.05 and 0.1 levels of significance.
The output obtained in R is the same as that obtained in SPSS and hence can
be interpreted in the same way. So for more details refer to the SPSS section.
4.2.5 Repeated Measures ANOVA
Example: Suppose that a number of patients with chronic pain have been
prescribed some new form of medication. Readings on the Visual Analog
Pain Scale (VAS) where 0 means no pain and 10 means worst pain ever, were
recorded for each patient after two months, four months and six months of
taking the medication. Having collected the data, it would then be of interest
to see whether there is any significant difference in the pain measurements
recorded. An improvement in the measurements recorded over time would
mean that the prescribed medication is effectively reducing the pain
experienced by the patients.
Since the multivariate normality assumption cannot be checked in SPSS, we
will start this section by showing how to perform repeated measures ANOVA
in R.
In R/RStudio
There are various ways in which repeated measures ANOVA can be carried
out in R. The command used depends on the output required. The test for
sphericity is for example not available with the aov() function that has been
used in one-way ANOVA. The command that will be used here is Anova() from the car package.
For the example being considered, the data entry consists of creating one
column for each time point together with a variable for Subject, as shown in
Figure 4.1.12. The data being shown here has been created in SPSS. As for
any other statistical technique shown earlier in the notes, the data could also
have been created in Excel. The import menu in RStudio or the command
read.csv is then used to import the data in R.
Figure 4.2.10
# Conducting repeated measures ANOVA in R
datapain<-read.csv(file.choose(), header=TRUE) # upload the data file
attach(datapain)
names(datapain) # list the variables in my data
Having entered the data into the required format, we can now proceed to start
analysing the data. Descriptive statistics of the data and plots will give insight
into whether there are any differences in the mean pain scores due to time.
The boxplot which follows shows that there seems to be a reduction in pain
scores with time.
Figure 4.1.13 (boxplot of the pain scores recorded at Month 2, Month 4 and Month 6)
Note that to be able to obtain the boxplot in Figure 4.1.13 we need the data to
be changed into what is known as long format. To change from the current
short format into long format, the command reshape needs to be used as
follows:
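A sketch of such a reshape call, assuming the wide data frame is called datapain with columns Subject, Month_2, Month_4 and Month_6:
datapainlong<-reshape(datapain, direction="long",
varying=c("Month_2","Month_4","Month_6"), v.names="score",
timevar="time", idvar="Subject")
# converts the data from wide (one column per month) to long format,
# producing a column time (coded 1, 2, 3) and a column score
attach(datapainlong)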
time<-factor(time,levels=c(1,2,3),labels=
c("Month_2","Month_4","Month_6"))
# adding labels
plot(time, score)
The multivariate normality assumption can be checked using the energy test of multivariate normality, available through the command mvnorm.etest from the energy package:
library(energy)
mvnorm.etest(datapain[,2:4],R=200)
The p-value for this test is 0.385, which is greater than 0.05, hence we cannot reject the hypothesis that the data follows a multivariate normal distribution.
The Mauchly test of sphericity can be carried out using the command
mauchly.test. This command is quite complicated for the non-advanced R
user. We can alternatively use the function Anova() from the car package. The
Mauchly test result is obtained as part of the output for repeated measures
ANOVA. The commands to perform such a test and the output which results
follow:
options(contrasts=c("contr.sum", "contr.poly"))
# defines the way the sums of squares are worked out
# with this setting calculations match those of SPSS and of
# common ANOVA textbooks
design<-factor(c("Month_2","Month_4","Month_6"))
# defining the time variable
library(car)
# loading the package car to use the command ANOVA
rmmodel<-lm(as.matrix(datapain[,2:4]) ~ 1)
# note that if for example besides the pain levels over time
# we were also considering other variables in the study,
# such as age, gender etc, any possible influence that these
# between subjects variables might have posed on the pain
# experienced by the subjects would need to be catered for
# by including these variables in the right hand side of the
# model, instead of the value 1, in the form :
# rmmodel<-lm(as.matrix(datapain[,2:4]) ~ age*gender)
results<-Anova(rmmodel, idata=data.frame(design),
idesign=~design, type="III")
# note that the commands idata and idesign can be adapted
# for use with any within subjects design. In our example
# we are only considering one within subjects variable,
# Time. Suppose that besides measuring pain levels in month
# 2, 4 and 6, pain levels were also measured three times a
# day, morning, afternoon, evening. A two-way repeated
# measures ANOVA would have to be used in such a case.
# Assuming that the data would have been entered in a
# similar format to the one being used here with columns
# for Morning_Month2, Afternoon_Month2, Evening_Month2,…,
# Evening_Month6, the left hand side of the model inside
# the lm function above would need to be changed to include
# all the repeated measures columns in the data and the
# commands idata and idesign would need to be changed to:
# idata <- expand.grid(timeofday= c("Morning","Afternoon",
# "Evening"),time= c("Month_2","Month_4","Month_6"))
# idesign=~time*timeofday
summary(results, multivariate=F)
Mauchly Tests for Sphericity

        Test statistic  p-value
design           0.994   0.8845

Greenhouse-Geisser and Huynh-Feldt Corrections for Departure from Sphericity

         GG eps  Pr(>F[GG])
design   0.9936   4.446e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

           HF eps    Pr(>F[HF])
design   1.046776  4.129517e-07
Since the p-value obtained from the Mauchly test of sphericity is 0.8845 > 0.05, we do not have enough evidence to reject the sphericity assumption. We can thus proceed with interpreting the output that results for the repeated measures ANOVA. The p-value of interest in this case is approximately 0 (the value 4.13e-07 reported for design), meaning that there is a significant difference in the pain scores due to time.
If it was the case that the Mauchly test of sphericity gave enough evidence to
reject the null hypothesis of sphericity, we would not be able to trust the
resulting F-statistic and p-value. We would then need to move on to the
output given by Greenhouse-Geisser and Huynh-Feldt Corrections.
Greenhouse-Geisser and Huynh-Feldt are two methods which estimate the degree of sphericity in the data (called eps in the R output) and are used to correct the p-value according to how severe the violation of sphericity is. If the estimated epsilon is greater than 0.75, the Huynh-Feldt correction is used; if it is less than 0.75, the Greenhouse-Geisser correction is used. An epsilon of 1 or close to 1, as is the case in this example, shows that the condition of sphericity is not violated.
Note that as with many other tests, Mauchly’s test of sphericity might be
affected by the use of a small sample size. In small sample sizes, large
deviations from sphericity might be interpreted as being non-significant.
Having found that there is a significant difference in mean pain scores due to
Time, it might be of interest to proceed with doing post hoc tests. Such tests
will be discussed in Section 4.4.
In SPSS
For the example being considered, the data entry consists of creating one
column for each time point together with a variable for Subject, as shown in
Figure 4.2.10.
The pain data being analysed in this section has been shown to be multivariate
normally distributed by means of the command mvnorm.etest from the
package energy in R. SPSS provides us with two tests for normality, the
Shapiro-Wilk test and Kolmogorov-Smirnov test. Both of these tests,
however, are tests for univariate normality, not multivariate normality.
Multivariate tests for normality are not available in SPSS.
Figure 4.2.12
Choose Analyze from the bar menu, select General Linear Model and click on Repeated Measures. Enter a Within-Subject Factor Name (for example Time), set the Number of Levels to 3, enter Score as the Measure Name and press the Add button underneath Number of Levels and Score. Then go on Define, select the variables Month_2, Month_4 and Month_6 and move them to Within Subjects Variables as shown in Figure 4.2.13. Note that if Between Subjects Variables such as gender and age were also available in the data, these would need to be moved underneath Between Subjects Factor(s).
Descriptive statistics may also be obtained from the Options button. Then
press Continue and OK.
Figure 4.2.13
Table 4.2.13 shows descriptive statistics for the sets of pain scores obtained
in the three different months being considered in this analysis.
Descriptive Statistics
Mean Std. Deviation N
Month_2 6.00 1.340 40
Month_4 5.05 1.648 40
Month_6 3.88 1.522 40
Table 4.2.13
Table 4.2.14 shows the results from Mauchly test of sphericity. Since the p-
value obtained from the Mauchly test of sphericity is 0.885 > 0.05 then we do
not have enough evidence to reject the sphericity assumption.
Mauchly's Test of Sphericity(a)
Measure: Score
                                                                   Epsilon(b)
Within Subjects   Mauchly's   Approx.                 Greenhouse-   Huynh-   Lower-
Effect            W           Chi-Square   df   Sig.  Geisser       Feldt    bound
Time              .994        .245          2   .885  .994          1.000    .500
Table 4.2.14
So we can now proceed to focus on the p-value obtained from the repeated measures ANOVA (Tests of Within-Subjects Effects) table. The p-value of interest is approximately 0, meaning that there is a significant difference in the pain scores due to time.
Having found that there is a significant difference in mean pain scores due
to Time, it might be of interest to proceed with doing post hoc tests. These
will be discussed later in section 4.4.
For more details on the output, refer to the repeated measures ANOVA section
for R/RStudio.
4.3 Non-Parametric Tests
The tests presented in Section 4.2 can only be applied when the assumption
of normality is satisfied. If we need to test such hypotheses but the assumption
of normality is not satisfied we turn to the non-parametric alternatives which
are presented here.
Important: Note that all non-parametric tests can still be used when data
follows a normal distribution but in such cases their power is less than that
of the respective parametric tests.
4.3.1 The One-Sample Wilcoxon Signed Rank Test
Example: Henna Ltd. claims that the 20g tubs of yogurt it produces contain
an average of 2g of fat. A quality control officer takes a sample of 39 (20g)
tubs of yogurts produced by this company and notes the amount of fats they
contain (rounded to the nearest gram). The observed values are listed in the
table which follows:
1.77 1.86 4.04 1.38 3.65
3.8 1.79 2.04 4.56 1.66
2.69 4 2.61 3.07 2.59
3.08 1.85 0.53 1.33
1.7 2.88 1.58 3.85
3.59 1.14 1.49 1.82
1.24 2.49 1.94 1
1.06 6.33 1.61 1.01
2.9 2.34 1 0.48
Since the data given in this example may be shown to not be normally distributed, a one-sample Wilcoxon signed ranks test can be used to test the following hypotheses:
H0: median = 2g
H1: median ≠ 2g
Note that only if the assumption of normality is not satisfied should one proceed to use the one-sample Wilcoxon test to test the above hypotheses.
In SPSS
From the bar menu choose Analyze, select NonParametric Tests and click on One Sample.
Figure 4.3.1
Figure 4.3.2
On the Fields tab specify the variable for which the one-sample Wilcoxon test
is desired.
Figure 4.3.3
On the Settings tab select Customize tests, check the box for Compare median to hypothesized (Wilcoxon signed-rank test), specify the Hypothesized median value (in this case enter 2) and click Run.
Figure 4.3.4
The following output is obtained:
Table 4.3.1
From Table 4.3.1 we see that the null hypothesis cannot be rejected at the 0.05 level of significance (p-value (Sig.) = 0.332 > 0.05). This means that there is no evidence against the company's claim; that is, on average, 20g tubs of yogurt produced by Henna Ltd. contain 2g of fat.
In R/RStudio
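The data is first imported into a data frame, here called dataonesamplewil (the name appearing in the output below), and the test is run by typing:
wilcox.test(dataonesamplewil$Fat, mu = 2)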
Wilcoxon signed rank test with continuity correction
data: dataonesamplewil$Fat
V = 459, p-value = 0.3391
alternative hypothesis: true location is not equal to 2
From the output above we see that the null hypothesis cannot be rejected at the 0.05 level of significance (p-value = 0.339 > 0.05). This means that there is no evidence against the company's claim; that is, on average, 20g tubs of yogurt produced by Henna Ltd. contain 2g of fat.
Note that:
if you want to test whether the median amount of fat is less than 2g (one-
tailed test), type:
wilcox.test(dataonesamplewil$Fat, mu = 2,alternative =
"less")
Or, if you want to test whether the median amount of fat is greater than 2g (one-tailed test), type:
wilcox.test(dataonesamplewil$Fat, mu = 2,alternative =
"greater")
4.3.2 The Mann-Whitney U Test
Example: A new drug for shrinking tumours is to be compared with an old, established drug. A sample of mice with tumours is divided into two groups of 24. The first group is injected with the new drug while the second group is the control, injected with the old established drug. After 2 weeks, the bodies of the mice are scanned, and any decrease in the size of the tumour (in cm) is recorded.
New_Drug 0.71 0.83 0.89 0.57 0.68 0.74 0.75 0.67 0.9 0.25 0.88 0.13
Control
(Old Drug) 0.72 0.68 0.69 0.66 0.57 0.66 0.7 0.63 0.86 0.5 0.34 0.76
New_Drug 0.83 0.72 0.11 0.45 0.06 0.57 0.22 0.8 0.43 0.97 0.19 0.24
Control
(Old Drug) 0.31 0.42 0.12 0.77 0.81 0.87 0.14 0.37 0.29 0.42 0.44 0.15
When preparing the data to be uploaded for the software the same format as
that used for the independent sample-test is to be used. We should have two
columns of data, the first column is a fixed factor with two levels which
specifies the drug used. So this variable will be called Drug. The second
column which will be called Size_decrease contains the decrease in the size
of the tumour (in cm) after 2 weeks.
Upon testing for normality using the Shapiro-Wilk test, p-values of 0.14 and 0.03 were obtained for the size decrease of the Control group and of the New Drug group respectively. This means that normality for the variable Size_decrease is only satisfied for the control group. Thus, the Mann-Whitney test should be used. The following hypotheses will be tested:
H0: On average, the two drugs lead to the same reduction in tumour size.
H1: On average, the two drugs lead to different reductions in tumour size.
In SPSS
From the bar menu choose Analyze, select NonParametric tests, Legacy Dialogs and then click on 2 Independent Samples. Move Size_decrease
under Test Variable List and Drug under Grouping Variable.
Figure 4.3.5
Press the tab Define Groups to define the groups considered. For ‘Group 1’
we enter 1 which refers to the Control group and for ‘Group 2’ we enter 2 which refers to the new drug group. Press Continue then OK and we get the
following output:
Test Statistics(a)
                          Decrease in size of tumour, in cm
Mann-Whitney U                255.000
Wilcoxon W                    555.000
Z                               -.681
Asymp. Sig. (2-tailed)           .496
a. Grouping Variable: Drug
Table 4.3.2
Since the p-value (0.496) is bigger than the level of significance (0.05), we
cannot reject H 0 . So we can conclude that, on average, both drugs have the
same effect on reduction of tumour size.
In R/RStudio
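Assuming the data have been imported into a data frame, say datamannwhitney (a hypothetical name), with the two columns Drug and Size_decrease described above, the Mann-Whitney test can be run with a command along the following lines:
wilcox.test(datamannwhitney$Size_decrease ~ datamannwhitney$Drug)
# two-sided Mann-Whitney (Wilcoxon rank sum) test comparing the two drug groups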
Since the p-value (0.503) is bigger than the level of significance (0.05), we
cannot reject H 0 . So we can conclude that, on average, both drugs have the
same effect on reduction of tumour size.
4.3.3 The Wilcoxon Signed Ranks Test
Example: The following hourly pollution readings were recorded in a city on a day with normal car traffic and on a day when streets were closed to car traffic:
With traffic 214, 159, 169, 202, 103, 119, 200, 109, 132, 142, 194, 104, 219,
119, 234
Without traffic 159, 135, 141, 101, 102, 168, 62, 167, 174, 159, 66, 118, 181, 171,
112
Considering the fact that observations were obtained from the same city (with
its peculiarities, weather, ventilation, etc.) albeit in two different days, the data
may be considered to be paired data. The Shapiro Wilk test applied to both
variables leads us to conclude that we are unable to assume a normal
distribution for the pollution values recorded. Thus, we must proceed with
testing using a non-parametric test, the Wilcoxon signed rank test.
The structure that should be used when entering the data should involve two
columns, the first one called WithTraffic and the second one called
WithoutTraf.
H 0 : Pollution levels are not affected by closing streets to car traffic (Median
hourly pollution is unaltered)
H 1 : Closing streets to car traffic has significantly reduced the hourly median
pollution level
In SPSS
From the bar menu choose Analyze, select NonParametric tests, Legacy
Dialogs and the click on 2 Related-Samples. The following dialogue box will
appear:
Figure 4.3.6
Follow the instructions shown in Figure 4.3.6 and press Ok. The following
output is obtained:
Test Statisticsa
WithoutTraf -
WithTraffic
Z -3.681b
Asymp. Sig. (2-tailed) .000
a. Wilcoxon Signed Ranks Test
b. Based on positive ranks.
Table 4.3.3
Since the p-value (0.000/2 ≈ 0, since we have a one-tailed test) is smaller than the level of significance (0.05), we reject H0. So we can conclude that closing streets to car traffic has significantly reduced the hourly median pollution level.
In R/RStudio
In R or RStudio the Wilcoxon Signed Ranks test is also conducted using the
command wilcox.test .
datapairedwil<-read.csv(file.choose(), header=TRUE) # upload the data file
shapiro.test(datapairedwil$WithTraffic)#Checking normality
shapiro.test(datapairedwil$WithoutTraf)#Checking normality
We then use the Wilcoxon signed ranks test to conduct the one-tailed test
described earlier:
wilcox.test(datapairedwil$WithTraffic,
datapairedwil$WithoutTraf,alternative="greater",
paired=TRUE,exact = FALSE)
Since the p-value (0.0001266) is smaller than the level of significance (0.05),
we reject H 0 . So we can conclude that, closing streets to car traffic has
significantly reduced the hourly median pollution level.
4.3.4 The Friedman Test
Example: Suppose that 10 subjects are asked to rate 4 different wines from 0
(I hate it!) to 5 (I love it!). The results are found in the following table:
We would like to test whether the median score for each of the above wines is the same throughout, or whether some wines are preferred more than others. Hence we test the following hypotheses:
H0: The median scores given to the four wines are all equal.
H1: The median scores given to the four wines are not all equal.
In other words, in the alternative hypothesis we assume that some wines are rated better than others.
The Friedman test is the ideal test to use in this case, since the different wines
have been rated by the same 10 subjects. It thus stands to reason that the rating
that one subject gives to one particular wine is related to the rating the same
subject gives to another wine. So here we are dealing with four related
samples.
Note that if we had to test for multivariate normality, as we have done for repeated measures ANOVA, from the output below we would note that this assumption is satisfied and hence this data could also be analysed using repeated measures ANOVA. Here we shall consider the Friedman test, keeping in mind that for such data its power is less than that of the repeated measures ANOVA.
> library(energy)
> mvnorm.etest(Friedman,R=200)
In SPSS
For the purpose of examining this data using SPSS, it needs to be entered into four different columns. Each column will contain the scores given by the subjects to one particular wine. We shall thus call the variables Wine1, Wine2, Wine3 and Wine4, respectively.
From the bar menu choose Analyze, select NonParametric tests, Legacy
Dialogs and then click on K Related Samples. The following dialogue box
will appear:
Figure 4.3.7
Descriptive Statistics
N Mean Std. Deviation Minimum Maximum
Score for Wine1 10 2.10 1.663 0 5
Score for Wine2 10 3.30 1.337 1 5
Score for Wine3 10 2.30 1.494 0 5
Score for Wine4 10 4.00 .816 3 5
Table 4.3.4
Test Statisticsa
N 10
Chi-Square 7.979
df 3
Asymp. Sig. .046
a. Friedman Test
Table 4.3.5
The mean rating scores for wines 1, 2, 3 and 4 are respectively 2.10, 3.30, 2.30
and 4. Since the p-value for the Friedman test (0.046) is less than the level of
significance (0.05), we reject H 0 . So there is a significant difference in the
quality of the different wines. However note that the p-value is almost equal
to 0.05 and hence our conclusion is not very ‘strong’. When this happens it is
recommended to increase the sample size. In this example this would imply
asking more subjects to score the wines.
The mean rating scores suggest that wine 4 is superior to wines 1 and 3, but to confirm this we would need to consider post hoc tests. These tests are discussed later in Section 4.4.
In R/RStudio
For the purpose of examining this data using R software it needs to be entered
into three columns:
- The first column will represent a factor with 4 levels, each level representing a different wine, so we call this variable Wine.
- The second column will contain a factor with 10 levels, each level referring to a different subject; call this variable Subject.
- The third column will contain the scores given by the subjects to each wine and hence will be called Scores.
datafriedman<-read.csv( file.choose(), header = TRUE)
# this command will open a window which will allow us
# to look for and open the required data file
attach(datafriedman)
names(datafriedman) # list the variables in my data
str(datafriedman) # list the structure of mydata
View(datafriedman) # opens data viewer
We then type:
friedman.test(datafriedman$Scores,datafriedman$Wine,
datafriedman$Subject)
The order in which we input the variables inside the friedman.test command
is important. We first need to put the response variable (in this case the score).
The second input must be the grouping variable (in this case the wine).
Finally, the third input must be the variable pertaining to the subject
(sometimes also referred to as blocks). The output for the Friedman test, in
this case, is:

Friedman rank sum test
data: datafriedman$Scores, datafriedman$Wine and datafriedman$Subject
Friedman chi-squared = 7.9787, df = 3, p-value = 0.04645
The test statistic related to this test is the Friedman chi-squared statistic. For
the data being considered, it has a value of 7.9787. The p-value of 0.04645 is
small enough to reject the null hypothesis of no difference between groups
(that is no difference in the quality of wines) at 0.1 and 0.05 level of
significance. At a 0.01 level of significance, however, the null hypothesis is
not rejected.
4.3.5 The Kruskal Wallis Test
Example: (https://statistics.laerd.com/spss-tutorials/kruskal-wallis-h-test-using-spss-
statistics.php) A medical researcher has heard anecdotal evidence that certain
anti-depressive drugs can have the positive side-effect of lowering
neurological pain in those individuals with chronic, neurological back pain,
when administered in doses lower than those prescribed for depression. The
medical researcher would like to investigate this anecdotal evidence with a
study. The researcher identifies 3 well-known, anti-depressive drugs which
might have this positive side effect, and labels them Drug A, Drug B and Drug
C. The researcher then recruits a group of 60 individuals with a similar level
of back pain and randomly assigns them to one of three groups – Drug A,
Drug B or Drug C treatment groups – and prescribes the relevant drug for a 4
week period. At the end of the 4 week period, the researcher asks the
participants to rate their back pain on a scale of 1 to 10, with 10 indicating the
greatest level of pain. The recorded data is found in the table which follows:
The Kruskal-Wallis test is the ideal test to use in this case, since we have three
different sets of independent samples and the Shapiro-Wilk normality test
shows that the scores obtained on Drug A and Drug C are not normally
distributed.
To run a Kruskal-Wallis H test to compare the pain score between the three
drug treatments the data should be entered into two columns. The first column
which we label Pain_Score represents the dependent variable consisting of
the scores given by the patients and the second column, which we label
Drug_Treatment_Group, represents the factor with three levels, each level
represents a different drug.
In SPSS
From the bar menu choose Analyze, select NonParametric tests, Legacy Dialogs and then click on K Independent Samples. The following dialogue box
will appear:
Figure 4.3.8
Move Pain_Score into the test variable list and then move
Drug_Treatment_Group, into the grouping variable list. Click on Define
range and write 1 for the minimum value and 3 for the maximum value.
Test Statistics(a,b)
                 Pain_Score
Chi-Square           47.280
df                        2
Asymp. Sig.            .000
a. Kruskal Wallis Test
b. Grouping Variable: Drug given
Table 4.3.4
Since the p-value (approximately 0) is smaller than the level of significance (0.05), H0 is rejected. So there is a significant difference in the median pain scores given by patients administered different drugs. Post hoc tests can then be applied to test which medians actually differ significantly.
In R/RStudio
The command for running the Kruskal-Wallis test is kruskal.test. For the data
under study this is used as follows:
kruskal.test(datakw$Pain_Score,datakw$Drug_Treatment_Group)
The output for the Kruskal-Wallis test reports the Kruskal-Wallis chi-squared statistic, its degrees of freedom and a p-value. Since the p-value is very small, it can be considered to be equal to 0 and hence smaller than the level of significance (0.05). Thus H0 is rejected. So there is a significant difference in the median pain scores given by patients administered different drugs. Post hoc tests can then be applied to test which medians actually differ significantly. Such tests are discussed in the next section.
4.4 Post Hoc Tests
When comparing more than two means/medians, whether we are dealing with
independent or related samples, parametric or non-parametric tests, post-hoc
analysis is the procedure of looking into the data to identify where the
differences truly lie. As a result, we can group the different samples in the
analysis into homogenous subgroups. Thus far, we have considered four tests
for comparing more than two means/medians. These are the following:
The most straightforward way to compare where the differences lie, is to
perform pairwise comparisons on all possible pairs using the two-sample
equivalent of the test in question. The two-sample equivalent for one-way
ANOVA is the independent samples t-test, for Kruskal-Wallis test is the
Mann-Whitney test, for repeated measures ANOVA is the paired samples t-test and for the Friedman test is the Wilcoxon signed ranks test. This will give us an indication
of where the differences lie. However, it would be naïve to use these pairwise
comparisons individually, when the aforementioned four tests are comparing
all means simultaneously. The reason for this is the following.
On the other hand, under the assumption that the hypothesis of no difference holds true for all m pairwise comparisons, the probability of accepting the null hypothesis of no difference for all m pairwise comparisons is greater than or equal to 1 − α if we test each with level of significance α/m. Consequently, the probability of wrongly rejecting at least one of them is less than or equal to α.
In most statistical software, the Bonferroni method is applied by adjusting the
p-value of each individual pairwise comparison: the p-value is multiplied by m
and, if the result exceeds 1, the adjusted p-value is set to 1.
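As a quick illustration of this adjustment, the following R sketch applies the
built-in p.adjust function to three hypothetical unadjusted p-values (the values
are made up purely for illustration):

p.raw <- c(0.015, 0.040, 0.300)          # hypothetical unadjusted p-values for m = 3 comparisons
p.adjust(p.raw, method = "bonferroni")   # multiplies each p-value by 3 and caps the result at 1
# gives 0.045 0.120 0.900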
The Bonferroni approach is not the only post-hoc method, but it is the method
we will focus on in this unit. Note, however, that the Bonferroni approach
depends on the number of pairwise comparisons m, and it will therefore have
low power for large m. There are other approaches, some of which differ in the
p-value adjustment method, and some of which do not depend on m.
Furthermore, some post-hoc approaches are specific to particular tests and
cannot be applied across all of them, unlike the Bonferroni approach, which can.
We shall now have a look at how we can perform Bonferroni post-hoc tests in
both SPSS and R.
In SPSS
Figure 4.4.1
The output for the Bonferroni multiple comparisons can be seen in Table
4.4.1. Since there are 3 possible pairwise comparisons (Orange-Lemon,
Orange-Cherry and Lemon-Cherry), each p-value has, in comparison to the
standard t-test, been adjusted by being multiplied by 3. It can be seen that,
for these simultaneous multiple comparisons, only Orange and Lemon are
significantly different at the 0.05 level of significance (but not at the 0.01
level of significance).
Multiple Comparisons
Dependent Variable: score
Bonferroni
                            Mean Difference                        95% Confidence Interval
(I) flavour  (J) flavour    (I-J)            Std. Error   Sig.     Lower Bound   Upper Bound
Orange       Lemon          4.30769*         1.67954      .046     .0645         8.5509
             Cherry         .50000           1.79640      1.000    -4.0385       5.0385
Lemon        Orange         -4.30769*        1.67954      .046     -8.5509       -.0645
             Cherry         -3.80769         1.76472      .116     -8.2661       .6507
Cherry       Orange         -.50000          1.79640      1.000    -5.0385       4.0385
             Lemon          3.80769          1.76472      .116     -.6507        8.2661
*. The mean difference is significant at the 0.05 level.
Table 4.4.1
The following are the homogeneous subgroups arising from the post-hoc test
at the 0.05 level of significance:
Figure 4.4.2
Repeated measures ANOVA:
Pairwise Comparisons
Measure: Score
                         Mean Difference                       95% Confidence Interval for Difference b
(I) Time    (J) Time     (I-J)          Std. Error   Sig.b     Lower Bound    Upper Bound
Month 2     Month 4      .950*          .351         .030      .073           1.827
            Month 6      2.125*         .347         .000      1.256          2.994
Month 4     Month 2      -.950*         .351         .030      -1.827         -.073
            Month 6      1.175*         .370         .009      .249           2.101
Month 6     Month 2      -2.125*        .347         .000      -2.994         -1.256
            Month 4      -1.175*        .370         .009      -2.101         -.249
Based on estimated marginal means
*. The mean difference is significant at the .05 level.
b. Adjustment for multiple comparisons: Bonferroni.
Table 4.4.2
Since all the resulting p-values are less than 0.05 after the Bonferroni p-value
adjustment, all the pairwise comparisons contributed towards a significant
difference in mean pain scores at the 0.05 level of significance, showing that
there was a change in the pain levels experienced by the patients throughout
the whole period of study. Since all levels of the time factor are significantly
different from each other, each level forms a distinct homogeneous group.
Friedman test:
To perform Bonferroni post-hoc tests for the Friedman test in the previously
conducted wine example, go to Analyze, Nonparametric Tests, Related
Samples. The window displayed in Figure 4.4.3 is opened. In the tab Fields,
select Use custom field assignments and move the variables ‘Wine1’,
‘Wine2’, ‘Wine3’ and ‘Wine4’ simultaneously into the Test Fields list. In the
tab Settings (see Figure 4.4.4) select Customize tests, Friedman’s 2-Way
ANOVA by Ranks (k Samples) and make sure that the Multiple Comparisons
option below the latter test is set to All pairwise.
Figure 4.4.3
Figure 4.4.4
In the output, you get Table 4.4.3.
Table 4.4.3
To obtain a more detailed output, double click on this table in the SPSS output
file. The window in Figure 4.4.5 is opened. In the lower right corner (see the
red circle in Figure 4.4.5) there is a View option; select Pairwise comparisons
to see the results of the post hoc tests displayed in Figure 4.4.6.
Figure 4.4.6
Figure 4.4.7
From the output displayed in Figure 4.4.7, we can see that the Bonferroni
corrected p-values suggest that all medians are equal at the 0.05 level of
significance, though Wine 1 and Wine 4 are significantly different at the 0.1
level of significance. This means that the difference between Wine 1 and
Wine 4 is most likely the largest contributor to the overall difference between
the groups. This apparent contradiction between the Friedman test, which
detected an overall difference, and the pairwise comparisons, none of which
is significant at the 0.05 level, is probably due to the fact that we are working
with small sample sizes and hence the power of the tests is weak. Also, the
p-value for the Friedman test was very close to 0.05, which is a ‘grey area’
value. The only way one can improve these results is by increasing the sample
sizes. However, at the 0.05 level of significance, all levels of the wine factor
fall in one homogeneous group. At the 0.1 level of significance we have:
Kruskal-Wallis test:
Figure 4.4.8
Figure 4.4.9
SPSS then gives us the output shown in Table 4.4.3, which we double-click.
If we select ‘Multiple Comparisons’ in the ‘View’ option, we obtain the table
we require (Figure 4.4.10).
Table 4.4.3
Figure 4.4.10
It can be seen in Figure 4.4.10 that, after the Bonferroni p-value adjustment,
all 3 drugs are still significantly different from each other. This means that,
at the 0.05 level of significance, and even at the 0.01 level of significance,
every drug forms a distinct homogeneous group.
In R/RStudio
Using the dataflavours variable from the one-way ANOVA example, we use
the following command for the Bonferroni approach using the pairwise t-test:
> pairwise.t.test(dataflavours$score, dataflavours$flavour,
                  p.adjust.method = "bonferroni", pool.sd = TRUE,
                  paired = FALSE, alternative = "two.sided")
Orange Lemon
Lemon 0.046 -
Cherry 1.000 0.116
For these simultaneous multiple comparisons, only Orange and Lemon are
significantly different at the 0.05 level of significance (but not at the 0.01
level of significance). The following are the homogeneous subgroups arising
from the post-hoc test at the 0.05 level of significance:
In this case, we can also apply the pairwise paired samples t-test to the
Paindatalong variable created earlier, applying the Bonferroni approach,
as follows:
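The exact command is not reproduced here; the following is a minimal sketch,
assuming Paindatalong is in long format with a score column and a time factor
(the column names Pain_Score and Time are assumptions):

pairwise.t.test(Paindatalong$Pain_Score, Paindatalong$Time,
                p.adjust.method = "bonferroni", paired = TRUE)
# paired = TRUE assumes the rows are ordered so that each patient's
# measurements occupy the same position within every time group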
Pairwise comparisons using paired t tests
Since all the resulting p-values are less than 0.05, all the pairwise comparisons
contributed towards a significant difference in mean pain scores at the 0.05
level of significance, showing that there was a change in the pain levels
experienced by the patients throughout the whole period of study. Since all
levels of the time factor are significantly different from each other, each level
forms a distinct homogeneous group.
In this case, we can also apply the pairwise Wilcoxon test to the datafriedman
variable created earlier, applying the Bonferroni approach, as follows:
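Again, the exact command is not reproduced here; a minimal sketch, assuming
datafriedman is in long format with a rating column and a wine factor (the
column names Taste and Wine are assumptions), is:

pairwise.wilcox.test(datafriedman$Taste, datafriedman$Wine,
                     p.adjust.method = "bonferroni", paired = TRUE)
# paired = TRUE assumes the rows are ordered so that each taster's
# ratings occupy the same position within every wine group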
From the output, we can see that the Bonferroni corrected p-values suggest
that all medians are equal at 0.05 level of significance, and even at 0.1 level
of significance.
It can be seen, though, that the differences between Wine 1 and Wine 4, and
between Wine 3 and Wine 4, are the major contributors to the significant
difference detected by the Friedman test, even though neither comparison
reaches significance after the Bonferroni adjustment. This is probably due to
the fact that we are working with small sample sizes and hence the power of
the tests is weak. Also, the p-value for the Friedman test was very close to
0.05, which is a ‘grey area’ value. The only way one can improve these results
is by increasing the sample sizes. In this case, all wines belong to just one
homogeneous group, both at the 0.05 and the 0.1 level of significance.
In this case, we can also apply the pairwise Mann-Whitney test to the datakw
variable, applying the Bonferroni approach, as follows:
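The command is again not reproduced here; a minimal sketch, using the datakw
columns referred to earlier (pairwise.wilcox.test with paired = FALSE carries
out pairwise Mann-Whitney tests), is:

pairwise.wilcox.test(datakw$Pain_Score, datakw$Drug_Treatment_Group,
                     p.adjust.method = "bonferroni", paired = FALSE)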
It can be seen that, after the Bonferroni p-value adjustment, all 3 drugs are
still significantly different from each other, even at the 0.01 level of
significance. Since every drug differs significantly from the others, each drug
forms a distinct homogeneous group.
We shall now discuss other post-hoc tests which can be applied to the
abovementioned tests.
First of all, the Bonferroni is not the only possible adjustment for simultaneous
multiple comparisons, and the R commands pairwise.t.test and
pairwise.wilcox.test also allow, through the p.adjust.method argument, for the
following adjustments: Holm ("holm"), Hochberg ("hochberg"), Hommel
("hommel"), Benjamini-Hochberg ("BH", also available as "fdr") and
Benjamini-Yekutieli ("BY"). The option "none" is also included, which does
not involve any p-value adjustment whatsoever. SPSS also implements the
Sidak correction (Sidak 1967).
There are also other post-hoc tests specific to the individual tests. For the
one-way ANOVA test, one can also implement the Scheffe and the Tukey tests.
For the Friedman test, one can also implement the Nemenyi and the Conover
tests. For the Kruskal-Wallis test, one can also implement the Kruskal-Nemenyi,
Kruskal-Conover and the Kruskal-Dunn tests. For repeated measures ANOVA,
we show how one can do post-hoc analysis using the Pillai test statistic, which
is used in multivariate ANOVA (MANOVA). To be able to use this approach,
the package phia is used as follows:
library(phia)
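The commands that follow are not reproduced here; the sketch below is one
possible way of doing this under stated assumptions, namely that the pain
scores are stored in wide format in a data frame called Painwide with one
column per time point (the names Painwide, Month2, Month4 and Month6 are
hypothetical):

mlm.fit <- lm(cbind(Month2, Month4, Month6) ~ 1, data = Painwide)    # multivariate linear model
idata <- data.frame(Time = factor(c("Month2", "Month4", "Month6")))  # within-subject design
testInteractions(mlm.fit, pairwise = "Time", idata = idata,
                 idesign = ~Time, adjustment = "bonferroni")
# pairwise contrasts of the Time factor are tested with a multivariate statistic
# (Pillai's trace by default) and Bonferroni-adjusted p-values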
Since all the resulting p-values are less than 0.05, all the pairwise comparisons
contributed towards a significant difference in mean pain scores, showing that
there was a change in the pain levels experienced by the patients throughout
the whole period of study. This post-hoc test can also be applied when
between-subject effects are present.
Consider, for example, the height and weight of individuals. These are related:
taller people tend to be heavier than shorter people. However, this relationship
is not perfect. People of the same height vary in weight, and we may easily
think of two people we know where the shorter one is heavier than the taller
one. Nonetheless, the average weight of people of height 1.6 m is less than the
average weight of people of height 1.7 m, which in turn is less than that of
people who are 1.8 m tall, and so on. Correlation can tell us just how much of
the variation in people's weights is related to their heights.
N.B.: Note that each one of the coefficients considered here is sensitive to
outliers; that is, in the presence of outliers they can give misleading results.
Hence, before computing such coefficients, the variables should be screened
and any outlying values should be corrected or removed.
2. A linear relationship exists between the variables.
Whilst there are a number of ways to check whether a linear relationship
exists between your two variables, the simplest way is to plot a scatterplot
(as shown in Chapter 3) of one variable against the other and then visually
inspect it for linearity. Your scatterplot may look something like the ones
presented in Figure 4.5.1, where panels a, b, c and d display linear
relationships while panels e and f display non-linear relationships.
(https://statistics.laerd.com/spss-tutorials/pearsons-product-moment-
correlation-using-spss-statistics.php, September 2017)
Figure 4.5.1 Examples of scatter plots. The Pearson correlation
coefficient is denoted by r.
Example: An ice cream manufacturer wants to test whether temperatures have
an effect on ice cream sales. The following is a table of 10 daily pairs of
readings concerning temperatures and ice cream sales.
Temperature (Celsius)        14.2   16.2   11.9   15.2   18.5   22.1   19.4   25.1   23.4   18.1
Ice-cream sales (in Euros)   2154   3256   1859   3321   4062   5221   4120   6144   5449   4211
Use a 0.05 level of significance to test whether the correlation between the
two variables is significant.
Box plots drawn for both variables identified no outliers and the multivariate
normality test conducted in R gave a p-value of 0.905; hence we accept the
hypothesis that our data follows a bivariate normal distribution. Furthermore,
the scatter plot indicated the presence of a linear relationship between the two
variables. Thus we can proceed to calculate the Pearson correlation coefficient
together with its significance test.
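The text does not show which multivariate normality test was used; one
possibility is sketched below, using the mshapiro.test function from the
mvnormtest package (an assumption, not necessarily the author's choice) on the
datapearson data frame referred to in the R example later in this section:

library(mvnormtest)
mshapiro.test(t(as.matrix(datapearson[, c("Temperature", "Ice_Cream_Sales")])))
# mshapiro.test expects variables in rows and observations in columns, hence t()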
In SPSS
Enter one column for each of the two variables in the data. Then go to Analyze,
Correlate, Bivariate, move the two variables underneath Variables, select
Pearson from underneath Correlation Coefficients (as shown in Figure 4.5.2)
and press OK.
Figure 4.5.2
Correlations
                                            Temperature   Ice_Cream_Sales
Temperature        Pearson Correlation      1             .985**
                   Sig. (2-tailed)                        .000
                   N                        10            10
Ice_Cream_Sales    Pearson Correlation      .985**        1
                   Sig. (2-tailed)          .000
                   N                        10            10
**. Correlation is significant at the 0.01 level (2-tailed).
Table 4.5.1
From Table 4.5.1, the Pearson correlation coefficient is 0.985 and the resulting
p-value is reported as .000, which is less than 0.05, meaning that the correlation
is significant. So there is a strong, positive linear relationship between the
two variables.
In R/RStudio
library(Hmisc)   # the rcorr() function used below comes from the Hmisc package
rcorr(datapearson$Temperature, datapearson$Ice_Cream_Sales,
      type = "pearson")   # type can be pearson or spearman
x y
x 1.00 0.98
y 0.98 1.00
n= 10
P
x y
x 0
y 0
From the above output we note that, for the variables under study, the Pearson
correlation coefficient is 0.98, which is close to 1. The p-value is reported as
0, which is less than 0.05, hence the coefficient is significantly different from
zero. We therefore conclude that the variables are strongly and positively
linearly related.
4.5.2 The Kendall Rank Correlation Coefficient
This correlation coefficient was developed to cater for two different scenarios:
as a measure of agreement for the same subject/object
as a measure of agreement between different subjects/objects
In the first instance, suppose that a trainee was asked to arrange (rank) a
number of objects during quality control. Suppose that this same trainee was
asked to arrange (rank) the same objects, the following day. The Kendall
correlation coefficient will provide a measure of agreement on the two sets of
ranks given by the same person. This coefficient may thus be used as a
measure of repeatability (reliability) on the quality of judgement of an
individual.
In the second instance, suppose that the ranks obtained by the trainee on the
first day, have to be compared to the ranks given by an experienced employee.
The Kendall correlation coefficient will in this case provide a measure of
agreement between two different persons. It will give a measure on the
similarity of judgment of different individuals.
Example: Two interviewers ranked 10 candidates according to their
suitability to fill in a vacancy within the company, with a rank of 1 meaning
the most suitable candidate and a rank of 10 meaning the least suitable
candidate. The ranks given are shown in the table which follows.
Interviewer 1 1 3 5 9 7 10 8 4 2 6
Interviewer 2 3 2 6 10 7 8 9 5 1 4
Use a 0.05 level of significance to test whether the two variables are
independent.
Since the data being considered in this example is rank data, testing whether
the two variables are related or not should be carried out using the Kendall tau
correlation coefficient. No bivariate normality testing is needed here. Pearson
correlation coefficient should not be used with rank data. We test the
following hypothesis:
In SPSS
To be able to use the Kendall tau-b correlation coefficient in SPSS, enter one
column for each of the two variables in the data. Then go to Analyze,
Correlate, Bivariate, move the two variables underneath Variables, select
Kendall’s tau-b from underneath Correlation Coefficients (as shown in Figure
4.5.3) and press OK.
Figure 4.5.3
Correlations
                                                            Interviewer_1   Interviewer_2
Kendall's tau_b   Interviewer_1   Correlation Coefficient   1.000           .733**
                                  Sig. (2-tailed)           .               .003
                                  N                         10              10
                  Interviewer_2   Correlation Coefficient   .733**          1.000
                                  Sig. (2-tailed)           .003            .
                                  N                         10              10
**. Correlation is significant at the 0.01 level (2-tailed).
Table 4.5.2
From Table 4.5.2, the Kendall tau-b correlation coefficient is 0.733 and the
resulting p-value is 0.003, which is less than 0.05, meaning that the correlation
is significant. So there is agreement in the rankings (similar ranks) given by
the two interviewers.
In R/RStudio
The following commands were used to find the Kendall correlation coefficient
for the two variables being considered in this study:
attach(dataKendall)
names(dataKendall) # list the variables in my data
str(dataKendall) # list the structure of mydata
View(dataKendall) # open data viewer
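The command that actually computes the coefficient is not reproduced above;
a minimal sketch, assuming the two columns of dataKendall are named
Interviewer_1 and Interviewer_2 as in the SPSS example, is:

cor.test(dataKendall$Interviewer_1, dataKendall$Interviewer_2,
         method = "kendall")
# with no tied ranks, as here, the tau reported by cor.test coincides with tau-b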
From the resulting output, the Kendall tau-b correlation coefficient is 0.733
and the resulting p-value is 0.003, which is less than 0.05, meaning that the
correlation is significant. So there is agreement in the rankings (similar ranks)
given by the two interviewers.
4.5.3 The Spearman Rank Correlation Coefficient
The Spearman correlation coefficient measures the strength and direction of a
monotonic relationship between two variables. A monotonic function is a
function that is either entirely non-decreasing or entirely non-increasing: it
does not have to increase (decrease) at every point, it simply must never
decrease (increase). See Figures 4.5.4 to 4.5.6, which are taken from
https://en.wikipedia.org/wiki/Monotonic_function.
Figure 4.5.6: A function which is not monotonic.
Like the Kendall tau, this test is non-parametric as it does not rely on any
assumptions on the distributions of the two variables under study. Hence it is
a suitable alternative to the Pearson correlation coefficient if the two variables
under study are ordinal variables or covariates for which the assumption of
bivariate normality is not satisfied.
Example: Consider once again the example presented in the Kendall tau
section. This time we shall repeat the correlation analysis using Spearman
correlation. Given that we already know that the assumption of bivariate
normality is not satisfied, we move on to checking monotonicity by plotting a
scatter plot of the two variables.
Figure 4.5.7
The scatter plot in Figure 4.5.7 indicates that the relationship between the two
variables is linear and hence monotonic. Next, we test the following hypothesis:
In SPSS
Figure 4.5.8
Correlations
                                                            Interviewer_1   Interviewer_2
Spearman's rho    Interviewer_1   Correlation Coefficient   1.000           .891**
                                  Sig. (2-tailed)           .               .001
                                  N                         10              10
                  Interviewer_2   Correlation Coefficient   .891**          1.000
                                  Sig. (2-tailed)           .001            .
                                  N                         10              10
**. Correlation is significant at the 0.01 level (2-tailed).
Table 4.5.3
From Table 4.5.3, the Spearman correlation coefficient is 0.891 and the
resulting p-value is 0.001, which is less than 0.05, meaning that the correlation
is significant. So there is a monotonic relationship between the rankings given
by the two interviewers.
As you can see, the conclusion drawn here is different from that drawn when
looking at the Kendall tau correlation coefficient, but this is not surprising
since one measures agreement and the other measures the presence of a
monotonic relationship. A monotonic relationship can exist even when there
is no agreement.
In R/RStudio
The following commands were used to find the Spearman correlation
coefficient for the two variables being considered in this study:
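The commands themselves are not reproduced here; a minimal sketch, reusing
the rcorr function from the Hmisc package shown in the Pearson example and
assuming the ranks are stored in a data frame called dataspearman with columns
Interviewer_1 and Interviewer_2 (names assumed), is:

library(Hmisc)
rcorr(dataspearman$Interviewer_1, dataspearman$Interviewer_2,
      type = "spearman")
# cor.test(..., method = "spearman") is an equivalent base-R alternative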
From the resulting output, the Spearman correlation coefficient is 0.891 and
the resulting p-value is 0.001, which is less than 0.05, meaning that the
correlation is significant. So there is a monotonic relationship between the
rankings given by the two interviewers. As before, the conclusion drawn here
is different from that drawn when looking at the Kendall tau correlation
coefficient, but this is not surprising since one measures agreement and the
other measures the presence of a monotonic relationship. A monotonic
relationship can exist even when there is no agreement.
4.5.4 The Pearson Partial Correlation Coefficient
The Pearson partial correlation coefficient measures the strength of the linear
relationship between two variables while controlling for the effect of one or
more other variables. For example, suppose we have data on the height, weight
and age of a number of individuals and we find that there is correlation between
the variables weight, height and age. Now suppose we are interested in the
relationship between height and weight when the effect of age is eliminated.
We might
suspect that the correlation between weight and height might be due to the fact
that both weight and height are correlated with age and not because there truly
exists a relationship between height and weight. Thus we can use a partial
correlation coefficient to eliminate the effect of age from the two variables
and see if they are still correlated.
1. The higher the rating, the more popular the cereal.
Carbs 65 70 55 50 100 110 110 120
Sugars 7.20 9.16 8.56 6.45 10.78 14.50 12.73 14.23
Rating 58.14 68.40 66.73 93.70 39.70 22.74 22.40 21.87
Carbs 120 50 90 131 120 116 80 90
Sugars 12.26 5.52 9.712 13.31 15.10 12.37 11.58 13.64
Rating 28.04 53.31 45.81 19.82 39.26 23.80 68.24 74.47
Carbs 90 88 141 142
Sugars 11.89 10.11 17.72 14.38
Rating 72.80 89.24 29.24 39.10
The multivariate test for normality introduced earlier when discussing Pearson
correlation coefficients was conducted and it confirmed that the variables
under study can be assumed to follow a multivariate normal distribution.
Pearson correlation coefficients were computed for each pair of variables and
were found to be significant. Their values are reported in the table below:
Table 4.5.4
In SPSS
To be able to use the Pearson partial correlation coefficient in SPSS, enter one
column for each of the three variables in the data. Then go to Analyze,
Correlate, Partial, move the two variables of interest underneath Variables,
and the variable you want to control for underneath Controlling for (as shown
in Figure 4.5.9) and press OK.
Figure 4.5.9
Correlations
Control Variables                                      Rating    Carbohydrates
Sugars    Rating          Correlation                  1.000     -.612
                          Significance (2-tailed)      .         .005
                          df                           0         17
          Carbohydrates   Correlation                  -.612     1.000
                          Significance (2-tailed)      .005      .
                          df                           17        0
Table 4.5.5
Note that, from Table 4.5.4, the Pearson correlation that was obtained for
rating and carbohydrates, when not controlling for sugar, was -0.744. Recall
that the partial coefficient gives us a measure of the association between rating
and carbohydrates after removing the association between sugar and
carbohydrates and between sugar and rating. From Table 4.5.5 we note that the
partial Pearson correlation coefficient for rating and carbohydrates is -0.612
with a p-value of 0.005, which is less than 0.05; hence the coefficient is still
significant, showing that the linear relationship between these two variables
is not merely a result of the influence of sugar.
In R/RStudio
The following commands were used to find the partial Pearson correlation
coefficient for the three variables being considered in this study:
library(ppcor)
attach(datapartial)
names(datapartial) # list the variables in my data
str(datapartial) # list the structure of mydata
View(datapartial) # open data viewer
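The command that actually computes the partial coefficient is not reproduced
above; a minimal sketch using the pcor.test function from the ppcor package,
and assuming the columns of datapartial are named Rating, Carbohydrates and
Sugars (names assumed to match the SPSS labels), is:

pcor.test(datapartial$Rating, datapartial$Carbohydrates,
          datapartial$Sugars, method = "pearson")
# partial correlation of Rating and Carbohydrates controlling for Sugars,
# together with its p-value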
Note that, from Table 4.5.4, the Pearson correlation that was obtained for
rating and carbohydrates, when not controlling for sugar, was -0.744. Recall
that the partial coefficient gives us a measure of the association between rating
and carbohydrates after removing the association between sugar and
carbohydrates and between sugar and rating. From Table 4.5.5 we note that the
partial Pearson correlation coefficient for rating and carbohydrates is -0.612
with a p-value of 0.005, which is less than 0.05; hence the coefficient is still
significant, showing that the linear relationship between these two variables
is not merely a result of the influence of sugar.