SPSS Without Pain
First Edition
ASA Publications
Dhaka, Bangladesh
ISBN 978-984-34-8254-9
Publisher
ASA Publications
Dhaka, Bangladesh
Email: [email protected]
Distributor
Altaf Medical Book Center
Shop: 121 & 128; Lane: 3; Islamia Market, Nilkhet, Dhaka 1205
Cell: +880-1711-985991; +880-1611-985991; +880-1511-985991
Production credits
Publisher: ASA Publications
Production Director: Md Altaf Hossain
Marketing Manager: Md Altaf Hossain
Composition: Zariath Al-Mamun Badhon and Md Mamunur Rasel
Cover Design: Zariath Al-Mamun Badhon
Printing: ASA Publications
Binding: Rahim Bindings
Printed in Bangladesh
To all my family members, students and young researchers in
health and social sciences
Preface
This manual is intended for students (MPH, FCPS, MD, MS, MPhil, and others), teachers and young researchers in health and social sciences. It is written in very simple language and uses health-related data as examples. The manual answers three basic questions related to data analysis: a) what to do (which statistics to use to achieve the objectives); b) how to do it (how to analyze the data in SPSS); and c) what the outputs mean (how to interpret them). All these questions are answered in a simple and understandable manner with examples. The manual covers the basic statistical methods of data analysis used in health and social science research. It is a gateway to learning SPSS and will help users go further. The manual is organized into 22 sections covering data management, descriptive statistics, hypothesis testing using bivariate and multivariable analysis, and other topics. It is easier to learn through exploration than through reading alone, and users can explore further once the basics are known. In my experience, using the statistics covered in this manual, students and researchers will be able to analyze most of the data from their epidemiological studies and publish the results in international peer-reviewed journals. I am optimistic that this manual will make it easier for students and researchers to analyze data and interpret the outputs meaningfully. Users can download the datasets used in this manual from the following links. If you have any comments about the manual, feel free to write to the e-mail address below.
M. Tajul Islam
[email protected]
[email protected]
Acknowledgements
To the users
Of course, this e-manual is free for all. However, if you can afford it, please donate Tk. 100 (US$2 for users outside Bangladesh) to any charity or a needy person. This small amount is sufficient to offer a meal to an orphan in a developing country.
Contents
Section 1 Introduction 1
1.1 Steps of data analysis 1
Section 11 Repeated Measures ANOVA: One-way 69
11.1 One-way repeated measures ANOVA 69
11.1.1 Commands 70
11.1.2 Outputs 70
11.1.3 Interpretation 73
Annex 165
References 166
Section 1
Introduction
The commonly used tools for data collection are questionnaires and record sheets, while the commonly used data collection methods are face-to-face interviews, observation, physical examination, lab tests and others. Sometimes we use available (secondary) data for our research studies, for example, hospital records and data from other studies (e.g., Bangladesh Demographic and Health Survey data). Once data is collected, the steps of data analysis are:
• Data coding, if pre-coded questionnaire or record sheet is not used
• Development of data file and data entry
• Data cleaning (checking for errors in data entry)
• Data screening (checking assumptions for statistical tests)
• Data analysis
• Interpretation of results
In the following sections, I have discussed the development of data file, data
management, data analysis and interpretation of the outputs.
Section 2
Generating Data File
Like other data analysis programs, SPSS has to read a data file to analyze data. We,
therefore, need to develop a data file for the use of SPSS. The data file can be
generated by SPSS itself or by any other program. Data files generated in other
programs can be easily transferred to SPSS for analysis. Here, I shall discuss how
to generate a data file in SPSS.
Table 2.1. Codebook
SPSS variable name Actual variable name Variable code
V1 Age in years Actual value
V2 Sex m= Male
f= Female
V3 Religion 1= Islam
2= Hindu
3= Others
V4 Occupation 1= Business
2= Government job
3= Private job
4= Others
V5 Monthly family income Actual value
V6 Marital status 1= Married
2= Unmarried
3= Others
V7 Have diabetes mellitus 1= Yes
2= No
V8_a Systolic blood pressure Actual value
V8_b Diastolic blood pressure Actual value
Note: Instead of V1, V2, etc., you can use any other name as SPSS variable name.
For example, you can use the variable name “age” instead of V1, “sex” instead of
V2, etc.
Now, open the SPSS program by double-clicking the SPSS icon. You will see the dialogue box shown in fig 2.1. Click on the cancel button to close the "SPSS for Windows" box. You will then have the dialogue box shown in fig 2.2 (SPSS Data Editor).
Figure 2.1. Dialogue box for defining variables
Figure 2.2. SPSS data editor: dialogue box for defining variables
The SPSS Data Editor (fig 2.2) shows Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align and Measure in the top row. If you do not see this, click on "Variable View" at the bottom-left corner of the window.
Name: In this column, type the brief SPSS variable name as shown in the code-
book. For example, type V1 (you can also use “age” as the variable name) for the
first variable, age. Note that this short name will be used to identify the variable in
the data file.
Type: This column indicates the type of the variable, whether it is numeric or string. Numeric means expressed in numbers (e.g., 1, 2, 3, etc.), while string means expressed in letters (e.g., m, f, y, n, etc.). In SPSS, the default value for "Type" is numeric. If the variable is a string (letter or text) variable, we need to change it. To change the variable type to string, follow these steps:
Click on the cell under the column “Type” (you will see a box with three dots)
> Click on the “three-dot box” (you will see the options in a separate dialogue
box) > Select “String” from the options > Click OK
Similarly, if it is a date variable (e.g., date of hospital admission), you have to
change the variable type into a date format in the same manner.
Width: The default value for width is 8. In most cases, this is sufficient and serves the purpose. However, if the variable has very large values, we need to increase it using the up and down arrows in the box. For practical purposes, keep the width at 8 unless the variable values are longer than 8 characters.
Decimals: This is applicable for the numeric variables and the default value is 2.
If your data does not have any decimal value, you can make it “0” using the down
arrow or keep it as it is.
Label: This is the space where we write the longer description of the variable
(actual variable name as shown in the codebook). For example, we have used “V1”
to indicate age in years. We should, therefore, write “Age in years” in the label
column for the variable name “V1”.
Values: This is applicable for variables whose levels are defined by code numbers (such as 1, 2 or m, f, etc.). It allows SPSS to retain the meaning of the code numbers you have used in the dataset. For example, our variable 2 is sex and is defined by "V2". It has two levels, male (coded as "m") and female (coded as "f"). Follow the commands below to set the value labels.
Click on cell under the column “Value” (you will see a box with three dots) >
Click on “three-dot box” > Click in the box “Value” > Type “m” > Click in the
box “Value label” > Type “male” > Click on “Add” > Repeat the same process
for female (value “f”, value label “female”, add) > OK
In this way, complete value labels for all the variables, if applicable. Note that
value labels are needed only for the variables that have been coded.
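If you prefer to work from a syntax window, the same value labels can be set with the VALUE LABELS command. Below is a minimal sketch based on the codebook in table 2.1; adjust the variable names and codes if you have used different ones.

VALUE LABELS V2 'm' 'Male' 'f' 'Female'
  /V3 1 'Islam' 2 'Hindu' 3 'Others'
  /V7 1 'Yes' 2 'No'.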
Missing: If there is any missing value in the dataset, SPSS has the option to indi-
cate that. If we want to set a missing value for a variable, we have to select a value
that is not possible (out of range) for that variable. For example, we have conduct-
ed a study and the study population was women aged 15-49 years. There are sever-
al missing values for age in the data (i.e., age was not recorded on the question-
naire or respondent did not tell the age). First, we have to select a missing value for
age. We can select any value which is outside the range 15-49 as the missing value.
Say, we have decided to use 99 as the missing value for age. Now, to put the miss-
ing value for age in SPSS, use the following commands.
Click on the cell under the column “Missing” (you will see a box with three
dots) > Click on the “three-dot box” > Select “Discrete missing values” > Click
on the left box > type “99” > Ok
However, you may omit this. Just keep the cell blank while entering data in the
data file. SPSS will consider the blank cells in the data file as missing values (sys-
tem missing).
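The same setting can also be made in syntax. A sketch, assuming the age variable is named V1 and 99 is the chosen missing code:

MISSING VALUES V1 (99).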
Columns: The default value for this is 8, which is sufficient in most cases. Change it only if you have a long variable name. For practical purposes, just keep it as it is.
Measure: This cell indicates the measurement scale of the data. If the variable is
categorical use “Nominal” for nominal or “Ordinal” for ordinal scale of measure-
ment. Otherwise use “Scale” for interval or ratio scale of measurement. You can
also keep it as it is.
In this way, define all the variables of your questionnaire/record sheet in the
SPSS data editor. The next step is data entry.
2.1.2 Data entry in SPSS:
Once all the variables are defined, click on the “Data View” tab at the bottom-left
corner of the window. You will see the following dialogue box (fig 2.3) with the
variable names at the top row. This is the spreadsheet for data entry. Now, you can
enter data starting from row 1 for each of the variables. Complete your data entry
in this spreadsheet and save the data file at your desired location/folder (save the
file as you save your file in MS Word, such as click on File> click on Save as ….
etc.).
If you want to open the data file later, then use the following steps:
Click on File > Open > Data > Select the folder you have saved your SPSS data
file > Select the file > Click “Open”
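Equivalently, a saved file can be opened from the syntax window with the GET FILE command. A sketch, with a hypothetical path:

GET FILE='C:\data\Data_3.sav'.
* Replace the path above with the actual location of your file.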
Figure 2.3 Spreadsheet for data entry (SPSS data editor)
Section 3
Data Cleaning and Data Screening
Once data is entered into SPSS, we need to be sure that there are no errors in the dataset (i.e., that no errors were made during data entry). Data cleaning is commonly done by generating frequency distribution tables of all the variables to look for out-of-range values, and by cross-tabulations (or other means) to check conditional values. If errors are identified, they need to be corrected. Simultaneously, we also need to check whether the data fulfil the assumptions of the desired statistical test (data screening), e.g., is the data normally distributed enough for a t-test? You may skip this section for the time being and go to section 4, and come back once you have developed some skill in data analysis. Use the data file <Data_3.sav> for practice. The codebook of this data file can be seen in the annex (table A.1).
The table shows that the values range from 1 to 3 (minimum 1 and maximum
3), which are within the range of our code numbers. Therefore, there is no
out-of-range error in this variable.
Table 3.2. Descriptives
Section 4
Data Analysis: Descriptive Statistics
Descriptive statistics are always used at the beginning of data analysis. The objec-
tive of using the descriptive statistics is to organize and summarize data. Common-
ly used descriptive statistics are frequency distribution, measures of central
tendency (mean, median, and mode) and measures of dispersion (range, standard
deviation, and variance). Measures of central tendency convey information about
the average value of a dataset, while a measure of dispersion provides information
about the amount of variation present in the dataset. Other descriptive statistics
include quartile and percentile. Use the data file <Data_3.sav> for practice.
Figure 4.2. Selection of variables for frequency distribution
You will see the following outputs (I have shown only the table of sex) (table
4.1). The table indicates that there are in total 210 subjects, out of which 133 or
63.3% are female and 77 or 36.7% are male. If there is any missing value, the table
will show it. In that case, use the “valid percent” instead of “percent” for reporting.
For example, table 4.2 shows 4 missing values. You should, therefore, report 130
or 63.1% are female and 76 or 36.9% are male. Note that the “Percent” and “Valid
Percent” will be the same, if there is no missing value.
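As a syntax alternative, the same frequency table can be produced with the FREQUENCIES command. A sketch, assuming the variable is named "sex" as in the practice file:

FREQUENCIES VARIABLES=sex.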
Table 4.1. Frequency distribution of sex

Sex
Frequency Percent Valid Percent Cumulative Percent
Valid Female 133 63.3 63.3 63.3
Male 77 36.7 36.7 100.0
Total 210 100.0 100.0
Table 4.2. Frequency distribution of sex (with missing values)

Sex
Frequency Percent Valid Percent Cumulative Percent
Valid Female 130 61.9 63.1 63.1
Male 76 36.2 36.9 100.0
Total 206 98.1 100.0
Missing 9 4 1.9
Total 210 100.0
4.2 Central tendency and dispersion
We calculate the central tendency and dispersion for the quantitative variables.
Suppose, you want to find the mean, median, mode, standard deviation (SD), vari-
ance, standard error (SE), skewness, kurtosis, quartile, percentile (e.g., 30th and
40th percentile), minimum and maximum values of the variable “age” of the study
subjects. All these statistics can be obtained in several ways. However, using the
following commands is the easiest way to get them together (fig 4.3-4.5).
Analyze > Descriptive statistics > Frequency > Select the variable “age” and
push it into the "Variable(s)” box > Statistics > Select all the descriptive mea-
sures you desire (mean, median, mode, SD, SE, quartile, skewness, kurtosis) >
Select "Percentiles" > Write “30” in the box > Add > Write “40” in the box >
Add > Continue > OK
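The same output can be requested in one step from the syntax window. A sketch follows; the /FORMAT=NOTABLE line suppresses the long frequency table of age and is optional:

FREQUENCIES VARIABLES=age
  /FORMAT=NOTABLE
  /STATISTICS=MEAN MEDIAN MODE STDDEV VARIANCE SEMEAN SKEWNESS SESKEW KURTOSIS SEKURT MINIMUM MAXIMUM
  /PERCENTILES=25 30 40 50 75.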
Figure 4.4. Selection of variable(s)
4.2.1 Outputs:
The SPSS will produce the following output (table 4.3).
Table 4.3. Descriptive statistics of age
AGE
N Valid 210
Missing 0
Mean 26.5143
Std. Error of Mean .51689
Median 27.0000
Mode 26.00
Std. Deviation 7.49049
Variance 56.10745
Skewness -.092
Std. Error of Skewness .168
Kurtosis -.288
Std. Error of Kurtosis .334
Minimum 6.00
Maximum 45.00
Percentiles 25 21.0000
30 22.3000
40 25.0000
50 27.0000
75 32.0000
4.2.2 Interpretation:
We can see in table 4.3 all the descriptive statistics (central tendency and dispersion) that we selected for the variable "age", including the statistics for skewness and kurtosis. You are presumably already familiar with the mean (average), median (middle value of the data set), mode (most frequently occurring value), SD (average difference of individual observations from the mean), variance (square of the SD) and SE of the mean. As presented in table 4.3, the mean age is 26.5 and the SD is 7.49 years. Let me discuss the other statistics provided in table 4.3, especially the skewness, kurtosis, quartiles and percentiles.
Skewness and Kurtosis: These two statistics are used to judge whether the data have come from a normally distributed population or not. In table 4.3, we can see the statistics for skewness (-.092) and kurtosis (-.288). Skewness indicates the asymmetry of the distribution. Skewness >0 indicates data skewed to the right; skewness <0 indicates data skewed to the left; while skewness ~0 indicates a symmetrical (normal) distribution. The acceptable range for normality of
a data set is skewness lying between “-1” and “+1”. However, normality should
not be judged based on skewness alone. We need to consider the statistics for
kurtosis as well. Kurtosis indicates the "peakedness" or "flatness" of the distribution.
Like skewness, the acceptable range of kurtosis for a normal distribution is
between “-1” and “+1”. Data for “age” has skewness -.092 and kurtosis -.288,
which are within the normal limits of a normal distribution. We may, therefore,
consider that the variable “age” in the population may be normally distributed.
Quartile and Percentile: When a dataset is divided into four equal parts after
arranging into ascending order, each part is called a quartile. It is expressed as Q1
(first quartile or 25th percentile), Q2 (second quartile or median or 50th percentile)
and Q3 (third quartile or 75th percentile). On the other hand, when data is divided
into 100 equal parts (after ordered array), each part is called a percentile. We can
see in table 4.3 that, Percentile 25 (means Q1), Percentile 50 (Q2) and Percentile
75 (Q3) for age are 21, 27 and 32 years, respectively. Q1 or the first quartile is 21
years, means that 25% of the study subjects’ age is less than or equal to 21 years.
On the other hand, 30th percentile (P30) is 22.3 years, which means that 30% of the
study subjects' age is less than or equal to 22.3 years. You should now be able to interpret P40 in the same way.
4.3.1 Outputs:
The outputs are shown in table 4.4 and figs 4.6 to 4.9.
Table 4.4. Descriptive statistics of age

Figure 4.7. Stem-and-leaf plot of age

Frequency   Stem & Leaf
2.00 0. 66
10.00 1. 0122344444
24.00 1. 555566666777788888889999
44.00 2. 00000000000000111112222222233333334444444444
63.00 2. 555555556666666666666666777777777788888888888888999999999999999
34.00 3. 0000111111112222222333333333444444
26.00 3. 55556666666677777788888999
6.00 4. 001133
1.00 4. 5
Figure 4.8. Box plot of age (Q1, Q2 and Q3 marked)
4.3.2 Interpretation:
Before we look at the graphs, let us see the descriptive statistics provided in table 4.4. We can see that SPSS has provided the mean and the 5% trimmed mean of age. The 5% trimmed mean is the mean after discarding the upper 5% and the lower 5% of the values of age. The extent of the effect of outliers can be checked by comparing the mean with the 5% trimmed mean. If they are close together (as we see in table 4.4; mean = 26.51 and 5% trimmed mean = 26.56), the outliers have no significant influence on the mean (or there are no outliers in the dataset). If they are very different, the outliers have a significant influence on the mean value, and this suggests checking the dataset for outliers and extreme values. The
table 4.4 also shows the 95% Confidence Interval (CI) for the mean of “age”,
which is 25.49-27.53. The 95% CI for the mean indicates that we are 95% confi-
dent/sure that the mean age of the population lies between 25.49 and 27.53 years.
The SPSS has provided several graphs (figs 4.6 to 4.9): a histogram, a stem-and-leaf plot, and box plots. The histogram gives us information about – a)
distribution of the dataset (whether symmetrical or not); b) concentration of
values; and c) range of values. Looking at the histogram (fig 4.6), it seems that the
data is more or less symmetrical. This indicates that age may be normally (approx-
imately) distributed in the population.
Stem and leaf chart (fig 4.7) provides information similar to a histogram, but
retains the actual information on data values. Looking at the stem and leaf chart,
we can have an idea about the distribution of the dataset (whether symmetrical or
not). Data displayed in figure 4.7 shows that the data is more or less symmetrical.
Stem and leaf charts are suitable for small datasets.
The box plot (fig 4.8) provides information about the distribution of a dataset. It also provides summary statistics of a variable, like Q1 (first quartile or 25th percentile), the median (second quartile or Q2) and Q3 (third quartile or 75th percentile), as well as information about outliers and extreme values. The lower boundary of the box indicates the value of Q1, while the upper boundary indicates the value of Q3. The median is represented by the horizontal line within the box. The smallest and largest values are indicated by the horizontal lines of the whiskers.
In the box plot, outliers are indicated by the case ID number and a circle, while extreme values are indicated by "*". Outliers are values lying between 1.5 and 3 box lengths from the edge (upper or lower) of the box. Extreme values lie 3 or more box lengths from the upper or lower edge of the box. Fig 4.8 shows that there are no outliers in the data for age. I have provided another box plot, for the variable "diastolic BP" (fig 4.9). Figure 4.9 shows that there are 3 outliers (ID no. 19, 27 and 121) in the data of diastolic BP, but no extreme values.
Descriptives
Age, by Sex   Statistic   Std. Error
Female Mean 26.8872 .58981
95% Confidence Interval for Mean: Lower Bound 25.7205; Upper Bound 28.0539
5% Trimmed Mean 26.8413
Median 27.0000
Variance 46.267
Std. Deviation 6.80202
Minimum 10.00
Maximum 45.00
Range 35.00
Interquartile Range 9.50
Skewness .074 .210
Kurtosis -.212 .417
Male Mean 25.8701 .97549
95% Confidence Interval for Mean: Lower Bound 23.9273; Upper Bound 27.8130
5% Trimmed Mean 26.0144
Median 26.0000
Variance 73.272
Std. Deviation 8.55993
Minimum 6.00
Maximum 41.00
Range 35.00
Interquartile Range 13.00
Skewness -.153 .274
Kurtosis -.606 .541
Extreme Values
Case Number id no. Value
age Highest 1 210 210 45.00
2 209 209 43.00
3 208 208 43.00
4 207 207 41.00
5 206 206 41.00
Lowest 1 3 3 10.00
2 1 1 10.00
3 4 4 11.00
4 2 2 11.00
5 6 6 12.00
a. Only a partial list of cases with the value 12.00 are shown in the table of lower
extremes.
Section 5
Checking Data for Normality
5.1 How to understand that the data have come from a normally distributed
population
This is an important assumption for doing a parametric test. Whether the data have
come from a normally distributed population or not, can be assessed in three
different ways. They are by:
a) Graphs, such as histogram and Q-Q chart;
b) Descriptive statistics, using skewness and kurtosis; and
c) Formal statistical tests, such as 1-sample Kolmogorov Smirnov (K-S) test
and Shapiro Wilk test.
Now, let us see how to get the histogram and Q-Q chart, and do the formal
statistical tests (K-S test and Shapiro Wilk test).
Suppose, we want to know whether the variable “systolic BP (SPSS variable
name: sbp)” is normally distributed in the population or not. We shall first
construct the histogram and Q-Q chart. To construct a histogram for systolic BP,
use the following commands:
Graphs > Legacy dialogs > Histogram > Select the variable "sbp" and push it into the "Variable" box > Select "Display normal curve" by clicking its box > Ok
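The equivalent legacy-graph syntax is sketched below:

GRAPH
  /HISTOGRAM(NORMAL)=sbp.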
The SPSS will produce a histogram of systolic BP, as shown in fig 5.1.
Figure 5.1. Histogram of systolic BP
To get the Q-Q plot for systolic BP, use the following commands:
Analyze > Descriptive statistics > Q-Q Plots > Select the variable “sbp” and
push it into the “Variables” box > For Test Distribution select “Normal” (usual-
ly remains as default) > Ok
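In syntax form, the Q-Q plot can be requested with the PPLOT command; a minimal sketch:

PPLOT
  /VARIABLES=sbp
  /TYPE=Q-Q
  /DIST=NORMAL.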
The computer will produce a Q-Q plot of systolic BP, as shown in figure 5.2.
To do the formal statistical tests (K-S test and Shapiro Wilk test) to understand
the normality of the data, use the following commands.
Analyze > Descriptive Statistics > Explore > Select the variable “sbp” and
push it into the "Dependent List" box > Plots > Deselect "Stem-and-leaf" and
select “Histogram” > Select “Normality plots with test” > Continue > OK
Note that these commands will also produce the histogram and Q-Q plot. You
may not need to develop histogram and Q-Q plot separately as mentioned earlier.
5.1.1 Outputs:
You will get the following tables (tables 5.1 and 5.2) along with the histogram, Q-Q plot and box plot. The histogram, Q-Q plot and box plot generated by these commands have been omitted here to avoid repetition.
Table 5.2. Tests of normality of systolic BP

Tests of Normality
              Kolmogorov-Smirnov(a)      Shapiro-Wilk
              Statistic   df   Sig.      Statistic   df   Sig.
SYSTOLIC BP   .119        210  .000      .956        210  .000
a. Lilliefors Significance Correction
5.1.2 Interpretation:
The SPSS has generated the histogram and Q-Q plot for “systolic BP” (fig 5.1 and
5.2) and tables 5.1 and 5.2. While getting the specific statistical tests (KS test and
Shapiro Wilk test) to check the normality of a dataset, the SPSS automatically
provides the descriptive statistics of the variable (systolic BP) (table 5.1). I have
already discussed the measures of skewness and kurtosis to assess the normality of
a dataset earlier (section 4).
Histogram (fig 5.1) provides an impression about the distribution of the dataset
(whether symmetrical or not). If we look at the histogram of systolic BP, it seems
that the data is slightly skewed to the right (i.e., distribution is not symmetrical).
The Q-Q plot (fig 5.2) also provides information on whether data have come
from a normally distributed population or not. The Q-Q plot compares the distribu-
tion of data with the standardized theoretical distribution from a specified family
of distribution (in this case normal distribution). If data are normally distributed,
all the points (dots) lie on the straight line. Note that our interest is in the central
portion of the line. Deviation from the central portion of the line means non-nor-
mality. Deviations at the ends of the plot indicate the existence of outliers. We can
see (in fig 5.2) that there is a slight deviation at the central portion as well as at the
ends. This may indicate that the data may not have come from a normally distribut-
ed population.
The specific tests (objective tests) to assess if the data have come from a
normally distributed population are the K-S (Kolmogorov-Smirnov) test and
Shapiro Wilk test. The results of these two tests are provided in table 5.2.
Look at the Sig (significance) column of table 5.2. Here, Sig (indicates the
p-value) is 0.000 for both tests. A p-value of <0.05 indicates that the data have not come from a normally distributed population. In our example, the p-value is
0.000 for both the tests, which is <0.05. This means that the data of systolic BP
have not come from a normally distributed population. The null hypothesis here is
“data have come from a normally distributed population”. The alternative hypoth-
esis is “data have not come from a normally distributed population”. We will reject
the null hypothesis, since the p-value is <0.05.
Note that the K-S test is very sensitive to sample size. With a large sample (n>100), the K-S test may be significant even for slight deviations from normality. Conversely, with a small sample (n<20, for example), the likelihood of getting a p-value <0.05 is low even when the data are non-normal. Therefore, the rules of thumb for normality checking are:
1) Sample size <30: Assume non-normal;
2) Moderate sample size (30-100): If the formal test is significant (p<0.05),
consider non-normal distribution, otherwise check by other methods, e.g.,
histogram, Q-Q plot, etc.; and
3) Large sample size (n>100): If the formal test is not significant (p>0.05),
accept normality, otherwise check with other methods.
However, for practical purposes, just look at the histogram. If it seems that the
distribution is approximately symmetrical, consider that the data have come from
a normally distributed population.
Section 6
Data Management
While analyzing data, you may need to make class intervals, classify a group of people with a specific characteristic using a cutoff value (e.g., you may want to classify people as hypertensive using a cutoff value of either systolic or diastolic BP), and recode data for other specific purposes. In this section, I shall discuss the data manipulations that are commonly needed during data analysis:
• Recoding of data
• Making class intervals
• Combining data to form an additional variable
• Data transformation
• Calculation of total score
• Extraction of time
• Selection of a subgroup for data analysis
Use the data file <Data_3.sav> for practice.
Transform > Recode into Same Variables > Select "sex" and push it into the "Variables" box > Click on "Old and New Values" > Select "Value" (the default) under "Old Value" > Type m in the box below > Type 1 in the "Value" area under "New Value" > Click "Add" > Type f in the value area under "Old Value" > Type 2 in the "Value" area under "New Value" > Click Add > Continue > Ok (Fig 6.1 and 6.2)
Figure 6.1
Figure 6.2
Check the data file in the "data view" option. You will notice that all the "m"s have been replaced by 1 and the "f"s by 2. Now, go to the "variable view" of the data file to update the value labels. Click in the "Values" box against the variable "sex" and redefine the codes as 1 is "Male" and 2 is "Female".
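A syntax sketch of the same recode is shown below. Note that, because "sex" is a string variable, the recoded values "1" and "2" remain strings unless you create a new numeric variable instead:

RECODE sex ('m'='1') ('f'='2').
EXECUTE.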
Figure 6.3
Figure 6.4
Check on the “variable view” option of the data file. You will notice that SPSS
has generated a new variable “sex1” (the last variable both in the variable view and
data view options). Like before, in the variable view, define the value labels of the
new variable sex1 as 1 is “male” and 2 is “female”.
Figure 6.5
Figure 6.6
Go to the data file in the “data view” option and then to the “variable view”
option. You will notice that SPSS has generated a new variable “age1” (the last
variable both in the data view and variable view options). Like before, in the “vari-
able view” option, define the value labels of the variable “age1” as 1 is “≤ 20
years”, 2 is “21-30 years”, 3 is “31-40 years” and 4 is “>40 years”.
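If you prefer syntax, the same class intervals can be created with RECODE ... INTO. A sketch, assuming the new variable is to be named "age1" as above:

RECODE age (LOWEST THRU 20=1) (21 THRU 30=2) (31 THRU 40=3) (41 THRU HIGHEST=4) INTO age1.
* Adjust the cut points if age contains decimal values.
VALUE LABELS age1 1 '<=20 years' 2 '21-30 years' 3 '31-40 years' 4 '>40 years'.
EXECUTE.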
Using transformation, you can also classify people who have hypertension and
who do not have hypertension (for example). To do this, you shall have to use a
cutoff point to define hypertension. For example, we have collected data on
diastolic BP (SPSS variable name is “dbp”). We want to classify those as “hyper-
tensive” if the diastolic BP is >90 mmHg. Now, recode the variable diastolic BP
into a new variable (say, d_hyper) using “Recode into Different Variables” option
as ≤ 90 (normal BP) and > 90 (as hypertensive). Hope, you can do it now. If you
cannot, use the following commands:
Transform > Recode into Different Variables > Select "dbp" and push it into the
“Input Variable –Output Variables” box > Type “d_hyper” in the “Name” box
and type “diastolic hypertension” in the “Label” box under “Output Variable”
> Click “Change” > Click on “Old and New Values” > Select “System-miss-
ing” under “Old value” > Select “System-missing” under “New Value” > Add
> Select “Range, LOWEST through value” under “Old value” and type “90” in
the box below > Select “Value” under “New Value” and type “1” > Add >
Select “All other values” under “Old value” > Select “Value” under “New
Value” and type “2” > Add > Continue >Ok
This will create a new variable “d_hyper” with code numbers 1 and 2 (the last
variable both in the variable and data view options). Code 1 indicates the persons
without hypertension (diastolic BP ≤ 90) and code 2 indicates the persons with
hypertension (diastolic BP >90). As done before, in the “variable view” option,
define the value labels of the new variable “d_hyper” as 1 is “Do not have hyper-
tension” and 2 is “Have hypertension”.
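The equivalent recode in syntax form is sketched below; the SYSMIS line simply keeps blank cells missing in the new variable:

RECODE dbp (SYSMIS=SYSMIS) (LOWEST THRU 90=1) (ELSE=2) INTO d_hyper.
VARIABLE LABELS d_hyper 'diastolic hypertension'.
VALUE LABELS d_hyper 1 'Do not have hypertension' 2 'Have hypertension'.
EXECUTE.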
Figure 6.7
This will generate the new variable “HTN” with all the values 0 (you can check
it in the “data view” option; the last variable). Now use the following commands:
Transform > Compute Variable > Click in the box under “Numeric Expres-
sion” > Delete 0 > Click on 1 > Click If (optional case selection condition) >
Select “Include if case satisfies condition” > Select “dbp” and push it into the
box > Click “greater than sign (>)” then write “90” (always use the number
pad) > Click “&” on the “number pad” > Select “sex_1” and push it into the
box > Click on “=” and then “1” (note: 1 is the code no. for male) > Continue
box > Click on "=" and then "1" (note: 1 is the code no. for male) > Continue > Ok > SPSS will show the message "Change existing variable?" > Click on "Yes" (fig 6.8 and 6.9)
Again,
Transform > Compute Variable > Click “If (optional case selection condition)”
> Delete “90” and write “85” (for dbp) and delete “1” and click “0” (for sex_1,
since 0 is the code for female) > Continue > Ok > (SPSS will give you the mes-
sage “Change existing variable” > Click “Yes”
Go to the “data view” option of the data file. You will notice that the new vari-
able “HTN” (the last variable both in the “data view” and “variable view” options)
has values either “0” or “1”. “0” indicates “no hypertension”, while “1” indicates
“have hypertension”. Like before, go to the “variable view” option and define the
value labels of the variable “HTN” as “0” is “No hypertension” and “1” is “Have
hypertension”.
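The whole sequence can also be written as syntax, which makes the sex-specific cutoffs easier to see at a glance. A sketch, assuming "sex_1" is coded 1 = male and 0 = female as in the example:

COMPUTE HTN = 0.
IF (dbp > 90 & sex_1 = 1) HTN = 1.
IF (dbp > 85 & sex_1 = 0) HTN = 1.
VALUE LABELS HTN 0 'No hypertension' 1 'Have hypertension'.
EXECUTE.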
Figure 6.8
Figure 6.9
Figure 6.10
6.5 Calculation of total score
Suppose, you have conducted a study to assess the knowledge of the secondary
school children on how HIV is transmitted. To assess their knowledge, you have
set the following questions (data file: HIV.sav).
Figure 6.11
Figure 6.12
SPSS will generate a new variable “t_know (total knowledge on HIV)” (look
at the “variable view”). This variable has the total score of knowledge of the
students. Now, you can get the descriptive statistics and frequency of the variable
“t_know” by using the following commands:
Analyze > Descriptive statistics > Frequencies > Select “t_know” and push it
into the “Variables” box > Statistics > Select “Mean, Median and Std. devia-
tion” > Continue > Ok
You will get the tables (table 6.2 and 6.3), showing the descriptive statistics
(mean, median, etc.) and frequency distribution of total knowledge of the students.
Table 6.2 shows that the mean of the total knowledge score is 2.18 (SD 0.64) and the median is 2. Table 6.3 shows that there are 2 (1%) students who do not have any
knowledge on HIV transmission (since the score is 0, i.e., could not answer any
question correctly). One hundred and twenty five (63.8%) students know 2 ways
of HIV transmission, while only 1.5% of the students know all the ways of HIV
transmission. You can also classify the students as having “Good” or “Poor”
knowledge using a cutoff value based on the total score.
Table 6.2. Descriptive Statistics
total knowledge
N Valid 196
Missing 0
Mean 2.18
Median 2.00
Std. Deviation .638
There is an alternative way of getting the total score. In that case, the correct
answers have to be coded as 1, while the incorrect answers must be coded as 0
(zero). The commands are as follows:
Transform > Compute variable > Write “t_know” in the “Target variable” box
> Select “k1” under ‘Type and label” and push it into the “Numeric expres-
sion” box > From the key pad click “+” > Select “k2” and push it into the “Nu-
meric expression” box > From the key pad click “+” > Select “k3” and push it
into the “Numeric expression” box > From the key pad click “+” > Select “k4”
and push it into the “Numeric expression” box > OK (fig 6.13)
You will get the same results.
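The same computation in syntax form is a one-line COMPUTE statement (assuming the knowledge items are named k1 to k4 and the correct answers are coded 1):

COMPUTE t_know = k1 + k2 + k3 + k4.
EXECUTE.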
Figure 6.13
Select “date_ad” from “Type and Label” box and push it into the “Numeric
Expression” box > Ok
You will notice that SPSS has generated a new variable “dura” (the last vari-
able) that contains the duration of hospital stay of each subject in the dataset.
Section 7
Testing of Hypothesis
The current and following sections provide basic information on how to select
statistical tests for testing hypotheses, perform the statistical tests in SPSS and
interpret the results of common problems related to health and social science
research. Before I proceed, let me discuss a little bit about the hypothesis.
A hypothesis is a statement about one or more populations, usually concerning the parameters of the populations about which the statement is made. There are two types of statistical hypotheses, the Null (H0) and the Alternative (HA) hypothesis. The null hypothesis is the hypothesis of equality or no difference; it always says that two or more quantities (parameters) are equal. Note that we always test the null hypothesis, not the alternative hypothesis. We either reject or fail to reject the null hypothesis. Only if we can reject the null hypothesis can we accept the alternative hypothesis. It is, therefore, necessary to have a very clear understanding of the null hypothesis.
Suppose we are interested in determining the association between coffee drinking and stomach cancer. In this situation, the null hypothesis is "there is no association between coffee drinking and stomach cancer" (or, coffee drinking and stomach cancer are independent), while the alternative hypothesis is "there is an association between coffee drinking and stomach cancer" (or, coffee drinking and stomach cancer are not independent). Only if we can reject the null hypothesis by a statistical test (i.e., if the test is significant; p-value <0.05) can we say that there is an association between coffee drinking and stomach cancer.
Various statistical tests are available to test hypotheses. Selecting an appropriate statistical test is the key to analyzing data. Which statistical test to use depends on the study design, the data type, the distribution of the data, and the objective of the study. It is, therefore, important to understand the nature of the variables (categorical or quantitative), the measurement type, and the study design. The following table (table 7) provides basic guidelines on the choice of statistical test depending on the type of data and the situation.
Table 7. Selecting a statistical test for hypothesis testing

7.1 Association between quantitative and qualitative or quantitative variables

1. Comparison with a single population mean (with a fixed value).
Data normally distributed: 1-sample t-test. Data non-normal: Sign test/Wilcoxon Signed Rank test.
Example: You have taken a random sample from a population of diabetic patients to assess the mean age. Now, you want to test the hypothesis whether the mean age of diabetic patients in the population is 55 years or not.

2. Comparison of means of two related samples.
Data normally distributed: Paired t-test. Data non-normal: Sign test/Wilcoxon Signed Rank test.
Example: You want to test the hypothesis whether the drug "Inderal" reduces blood pressure (BP) or not. To do this study, you have selected a group of subjects and measured their BP before administration of the drug (measurements before treatment; or pre-test). Then you have given the drug "Inderal" to all the subjects and measured their BP after one hour (measurements after treatment; or post-test). Now you want to compare if the mean BP before (pre-test) and after (post-test) administration of the drug is the same or not.

3. Comparison between two independent sample means [association between a quantitative and a qualitative variable with 2 levels].
Data normally distributed: Independent samples t-test. Data non-normal: Mann-Whitney U test (also called the Wilcoxon Rank-Sum test).
Example: You have taken a random sample of students of a university. Now, you want to test the hypothesis if the mean systolic blood pressure of male and female students is the same or not.

4. Comparison of more than two independent sample means [association between a quantitative and a categorical variable with more than 2 levels].
Data normally distributed: One-way ANOVA. Data non-normal: Kruskal-Wallis test.
Example: You have taken a random sample from a population. You want to test the hypothesis if the mean income of different religious groups (e.g., Muslim, Hindu and Christian) is the same or not. Another example: you have three drugs, A, B & C. You want to investigate whether all these three drugs reduce BP equally or not.

5. Association between two quantitative variables.
Data normally distributed: Pearson's correlation. Data non-normal: Spearman's correlation (also valid for ordinal qualitative data).
Example: You want to test the hypothesis if there is a correlation between systolic BP and age.
7.3 Multivariable analysis
7.4 Agreement analysis
Section 8
Student's t-test for Hypothesis Testing

8.1 One-sample t-test
Suppose we want to test the hypothesis that the mean diastolic BP of the students in the population is 80 mmHg.

Assumptions:
1. The distribution of diastolic BP in the population is normal;
2. The sample is a random sample from the population.
The first job, before hypothesis testing, is to check whether the distribution of
diastolic BP is normal or not in the population (assumption 1). To do this, check
the histogram and/or Q-Q plot of diastolic BP and do the formal statistical test of
normality (K-S test or Shapiro Wilk test) as discussed in section 5. If the assump-
tion is met (diastolic BP is at least approximately normal), do the 1-sample t-test,
otherwise we have to use the non-parametric test (discussed later). Suppose,
diastolic BP is normally distributed in the population. Use the following com-
mands to do the 1-sample t-test:
Analyze > Compare means > One sample t-test > Select the variable “dbp”
and push it into the “Test variable(s)” box > Click in the “Test value” box and
write “80” > OK
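The matching syntax for the one-sample t-test is sketched below:

T-TEST
  /TESTVAL=80
  /VARIABLES=dbp.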
8.1.1 Outputs:
The computer will provide the following tables (table 8.1 and 8.2).
Table 8.1. Descriptive statistics of diastolic BP
One-Sample Statistics
N Mean Std. Deviation Std. Error Mean
DIASTOLIC BP 210 83.0429 12.45444 .85944
Table 8.2. One-sample t-test results

One-Sample Test
Test Value = 80
t       df    Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference (Lower, Upper)
DIASTOLIC BP 3.541 209 .000 3.04286 1.3486 4.7371
8.1.2 Interpretation:
In this example, we have tested the null hypothesis “the mean diastolic BP of the
students is equal to 80 mmHg in the population”. Data shows that the mean
diastolic BP of the sample of the students is 83.04 mmHg with an SD of 12.4
mmHg (table 8.1). One-sample t-test results are shown in table 8.2. The calculated
value of “t” is 3.541 and the p-value (Sig. 2-tailed) is 0.000. Since the p-value is
<0.05, we can reject the null hypothesis at 95% confidence level. This means that
the mean diastolic BP of the students (in the population) from where the sample is
drawn is different from 80 mmHg (p<0.001). The SPSS has also provided the
difference between the observed value (83.04) and hypothetical value (80.0) as
mean difference (which is 3.042) and its 95% confidence interval (1.34 – 4.73)
(table 8.2).
8.2 Independent samples t-test
Suppose we want to know if the mean age of diabetic and non-diabetic patients is the same or not. Here, the test variable (dependent variable) is age (quantitative) and the categorical variable is diabetes, which has two levels/categories (have diabetes and do not have diabetes).
Hypothesis:
H0: The mean age of the diabetic and non-diabetic patients is same in the
population.
HA: The mean age of the diabetic and non-diabetic patients is different (not
same) in the population.
Assumptions:
1. The dependent variable (age) is normally distributed at each level of the
independent (diabetes) variable;
2. The variances of the dependent variable (age) at each level of the indepen-
dent variable (diabetes) are same/equal;
3. Subjects represent random samples from the populations.
8.2.1 Commands:
Use the following commands to do the independent samples t-test. Before doing
the test, we have to remember/check (from code book or variable view) the catego-
ry code numbers of diabetes. In our example, we have used code “1” for defining
“have diabetes” and “2” for “do not have diabetes”.
Analyze > Compare means > Independent samples t-test > Select “age” and
push it into the “test variable(s)” box and select “diabetes” for “grouping vari-
able” box > Click on “define groups” > Type 1 in “Group 1” box and type 2 in
“Group 2” box > Continue > OK
Note: You must use exactly the same code numbers as in the dataset for the grouping variable. Otherwise, SPSS cannot analyze the data.
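In syntax form, the same test is:

T-TEST GROUPS=diabetes(1 2)
  /VARIABLES=age.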
8.2.2 Outputs:
The SPSS will produce the outputs as shown in table 8.3 and 8.4.
Table 8.3. Descriptive statistics of age by grouping variable (having diabetes)
Group Statistics
DIABETES MELLITUS   N   Mean   Std. Deviation   Std. Error Mean
AGE Yes 45 27.9111 8.46335 1.26164
No 165 26.1333 7.18360 .55924
Table 8.4. Independent samples t-test results

Levene's Test for Equality of Variances: F = 3.218, Sig. = .074
t-test for Equality of Means (equal variances assumed): t = 1.415, df = 208, Sig. (2-tailed) = .159, Mean Difference = 1.7778, Std. Error Difference = 1.25671, 95% CI of the Difference: -.699 to 4.255
8.2.3 Interpretation:
Table 8.3 shows the descriptive measures of age by grouping variable (diabetes).
We can see that there are 45 persons with diabetes and 165 persons without diabe-
tes. The mean age of the diabetic persons is 27.9 (SD 8.46) and that of the non-dia-
betic persons is 26.1 (SD 7.18) years.
Table 8.4 shows the t-test results. The first portion of the table indicates the
Levene’s test results. This test is done to understand if the variances of age in the
two categories of diabetes are homogeneous (equal) or not (assumption 2). Look at
the p-value (Sig.) of the Levene’s test, which is 0.074. Since the p-value is >0.05,
it indicates that the variances of age of the diabetic and non-diabetic persons are
equal (assumption 2 is fulfilled).
Now, look at the other portion of the table, the t-test for equality of means.
Here, we have to decide which p-value we shall consider. If Levene’s test p-value
is >0.05, take the t-test results at the upper row, i.e., t-test for “Equal variances
assumed”. If the Levene’s test p-value is ≤0.05, take the t-test results at the lower
row, i.e., t-test for “Equal variances not assumed”.
In this example, as the Levene’s test p-value is >0.05, we shall consider the
t-test results of “Equal variances assumed”, i.e., the upper row. Table 8.4 shows
that the t-value (calculated) is 1.415, and the p-value (2-tailed) is 0.159 (which is
>0.05) with 208 degrees of freedom. We cannot, therefore, reject the null hypothe-
sis. This means that the mean age of diabetic and non-diabetic persons in the popu-
lation from where samples are drawn is not different (p=0.159).
8.3 Paired samples t-test
Hypothesis:
H0: There is no difference between the mean scores before and after the training (for example 1).
HA: The mean scores before and after the training are different.
Assumptions:
1. The difference between two measurements (pre- and post-test) of the depen-
dent variable (examination scores) is normally distributed;
2. Subjects represent a random sample from the population.
8.3.1 Commands:
Analyze > Compare means > Paired-samples t-test > Select the variables
“post-test” and “pre-test” and push them into the “Paired variables” box > OK
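A syntax sketch for the paired test, assuming the two score variables are named "pretest" and "posttest" (adjust to the names in your data file):

T-TEST PAIRS=posttest WITH pretest (PAIRED).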
8.3.2 Outputs:
The SPSS will produce the following outputs (tables 8.5-8.7).
Table 8.5. Descriptive statistics of pre- and post-test results
8.3.3 Interpretation:
Table 8.5 shows the means of both the pre- (53.5) and post-test (90.9) scores along
with the standard deviations and SEs. Looking at the mean scores, we can have an
impression on whether the training has increased the mean score or not. We can see
that the post-test mean is 90.9, while the pre-test mean is 53.5. To understand, if
the difference between post-test mean and pre-test mean is significant or not, we
have to check the paired samples t-test results (table 8.7). Table 8.7 shows that the
mean difference between the post- and pre-test scores is 37.4. The calculated t-test
value is 15.09 and the p-value (Sig.) is 0.000. As the p-value is <0.05, we reject the null hypothesis. This indicates that the mean knowledge score increased significantly after the training (p<0.001). Note that table 8.6 is not needed for this conclusion.
Section 9
Analysis of Variance (ANOVA): One-way ANOVA

9.1 One-way ANOVA
Suppose we want to test whether the mean monthly family income is the same across religious groups; income is the dependent variable and religion (SPSS variable name: religion_2) is the factor.
Hypothesis:
H0: The mean income of all the religious groups is same/equal.
HA: Not all the means (of income) of religious groups are same.
Assumptions:
1. The dependent variable (income) is normally distributed at each level of the
independent variable (religion);
2. The variances of the dependent variable (income) for each level of the inde-
pendent variable (religion) are same; and
3. Subjects represent random samples from the populations.
If the variances of the dependent variable in all the categories are not equal
(violation of assumption 2), but sample size in all the groups is large and similar,
ANOVA can be used.
9.1.1 Commands:
Analyze > Compare means > One-way ANOVA > Select “income” and push it
into the “Dependent list” box > Select “religion_2” for the “Factor” box >
Options > Select “Descriptive” and “Homogeneity of variance test” > Contin-
ue > OK
9.1.2 Outputs:
The SPSS will generate the following outputs (table 9.1-9.3).
Table 9.1. Descriptive statistics of income by religion

Descriptives
INCOME
N   Mean   Std. Deviation   Std. Error   95% Confidence Interval for Mean (Lower Bound, Upper Bound)   Minimum   Maximum
MUSLIM 126 88180.90 17207.61 1532.976 85146.95 91214.85 55927 117210
HINDU 36 79166.03 17804.63 2967.439 73141.81 85190.25 53435 110225
CHRISTIAN 26 79405.62 19857.02 3894.282 71385.19 87426.04 52933 114488
BUDDHISM 22 84796.59 14447.34 3080.185 78391.00 91202.19 56249 109137
Total 210 85194.49 17724.03 1223.074 82783.34 87605.63 52933 117210
Table 9.2. Levene’s test for homogeneity of variances of income in different religious groups
Table 9.3. ANOVA test of income by religion

ANOVA
INCOME
                 Sum of Squares     df    Mean Square      F      Sig.
Between Groups   3306848581.156     3     1102282860.385   3.642  .014
Within Groups    62348694323.301    206   302663564.676
Total            65655542904.457    209
9.1.3 Interpretation:
In this example, I have used “income” as the dependent variable and “religion” as
the independent (or factor) variable. The independent variable (religion) has 4
categories (levels) – Muslim, Hindu, Christian and Buddhism.
Table 9.1 provides all the descriptive measures (mean, SD, SE, 95% CI, etc.)
of income by religion. For example, the mean income of Muslims is 88,180.9 with
SD of 17,207.6.
The second table (table 9.2) shows the test results of homogeneity of variances
(Levene’s test). This test was done to understand if all the group variances of
income are equal or not (assumption 2). Look at the p-value (Sig.), which is 0.107.
The p-value is >0.05, which means that the variances of income in all the religious
groups are equal (i.e., assumption 2 is not violated).
Now, look at the ANOVA table (table 9.3). The value of F-statistic is 3.642 and
the p-value is 0.014. Since the p-value is <0.05, we reject the null hypothesis. This means that not all the group means (of income) are the same.
However, the ANOVA test does not provide information about which group
means are different. To understand which group means are different, we need to
use the post hoc multiple comparison test, such as Tukey’s test or Bonferroni test.
Use the following commands to get the post-hoc test results. Note that if the ANOVA test (F-test) is not significant (i.e., the p-value is >0.05), the post-hoc test is not needed.
Analyze > Compare means > One-way ANOVA > Select “income” and push it
into the “Dependent list” box > Select “religion_2” for the “Factor” box >
Options > Select “Descriptive”, and “Homogeneity of variance test” > Contin-
ue > Post Hoc > Select “Tukey” (or Bonferroni) > Continue > OK
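For reference, all of the above (descriptives, Levene's test, Welch's robust test, and the post-hoc comparisons) can be requested at once in syntax. A sketch follows; drop the keywords you do not need, and replace TUKEY with GH for the Games-Howell test used later in section 9.1.5:

ONEWAY income BY religion_2
  /STATISTICS DESCRIPTIVES HOMOGENEITY WELCH
  /POSTHOC=TUKEY ALPHA(0.05).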
The SPSS will produce the following table (table 9.4) in addition to others.
Table 9.4. Comparisons of mean income between the religious groups
Multiple Comparisons
Dependent Variable: INCOME
Tukey HSD
(I) RELIGION   (J) RELIGION   Mean Difference (I-J)   Std. Error   Sig.   95% Confidence Interval (Lower Bound, Upper Bound)
MUSLIM HINDU 9014.88(*) 3287.767 .033 499.15 17530.60
CHRISTIAN 8775.29 3747.399 .092 -930.94 18481.52
BUDDHISM 3384.31 4019.891 .834 -7027.71 13796.33
HINDU MUSLIM -9014.88(*) 3287.767 .033 -17530.60 -499.15
CHRISTIAN -239.59 4477.525 1.000 -11836.93 11357.76
BUDDHISM -5630.56 4707.946 .630 -17824.73 6563.60
CHRISTIAN MUSLIM -8775.29 3747.399 .092 -18481.52 930.94
HINDU 239.59 4477.525 1.000 -11357.76 11836.93
BUDDHISM -5390.98 5039.677 .708 -18444.37 7662.41
BUDDHISM MUSLIM -3384.31 4019.891 .834 -13796.33 7027.71
HINDU 5630.56 4707.946 .630 -6563.60 17824.73
CHRISTIAN 5390.98 5039.677 .708 -7662.41 18444.37
* The mean difference is significant at the .05 level.
[Figure: distribution of income by religion; N = 126, 36, 26 and 22 for Muslim, Hindu, Christian and Buddhism, respectively]
9.1.5.1 Outputs:
The SPSS will produce the following additional tables (table 9.5 and 9.6).
Table 9.6. Comparison of means of income between the religious groups
Multiple Comparisons
income
Games-Howell
(I) religion   (J) religion   Mean Difference (I-J)   Std. Error   Sig.   95% Confidence Interval (Lower Bound, Upper Bound)
MUSLIM HINDU 9014.877* 3340.016 .044 166.37 17863.38
CHRISTIAN 8775.289 4185.146 .175 -2541.95 20092.53
BUDDHISM 3384.314 3440.575 .760 -5931.90 12700.53
HINDU MUSLIM -9014.877* 3340.016 .044 -17863.38 -166.37
CHRISTIAN -239.588 4896.032 1.000 -13248.23 12769.05
BUDDHISM -5630.563 4277.059 .557 -16986.14 5725.02
CHRISTIAN MUSLIM -8775.289 4185.146 .175 -20092.53 2541.95
HINDU 239.588 4896.032 1.000 -12769.05 13248.23
BUDDHISM -5390.976 4965.176 .700 -18635.83 7853.88
BUDDHISM MUSLIM -3384.314 3440.575 .760 -12700.53 5931.90
HINDU 5630.563 4277.059 .557 -5725.02 16986.14
CHRISTIAN 5390.976 4965.176 .700 -7853.88 18635.83
*. The mean difference is significant at the 0.05 level.
9.1.5.2 Interpretation:
Table 9.5 shows the Welch test results of comparison of means (Robust Tests of
Equality of Means). Just look at the p-value (Sig.). The p-value is 0.027, which is
<0.05. This means that the mean income of all the religious groups is not same in
the population.
Table 9.6 conveys the same information as Tukey's test, which I discussed earlier. Here, the mean income of Muslims and Hindus is significantly different, as indicated by the p-value (Sig.) (p=0.044). The difference of means
among the other religious groups is not significant. The table has also provided the
95% CI of the differences.
Section 10
Two-way ANOVA

10.1 Two-way ANOVA
Suppose we want to examine the effects of occupation and sex on systolic BP; systolic BP (SPSS variable name: sbp) is the dependent variable, and occupation and sex are the independent variables (fixed factors).
Assumptions:
1. The dependent variable (systolic BP) is normally distributed at each level of
the independent variables (occupation and sex);
2. The variances of the dependent variable (systolic BP) at each level of the
independent variables are same; and
3. Subjects represent random samples from the populations.
First of all, we have to check for normality of data (systolic BP) in different
categories of occupation and sex separately using histogram, Q-Q plot and Shapiro
Wilk test (or, K-S test). We also need to check the homogeneity of variances in
each group of the independent variables (occupation and sex) using the Levene's
test.
10.1.1 Commands:
Analyze > General linear model > Univariate > Select “sbp” for "Dependent
variable" box and select “occupation” and “sex” for “Fixed factors” box >
Options > Select “Descriptive statistics, Estimates of effect size and Homoge-
neity test" > Continue > OK
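 Continue > OK">
The equivalent UNIANOVA syntax is sketched below; the /DESIGN line spells out the two main effects and their interaction:

UNIANOVA sbp BY occupation sex
  /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
  /DESIGN=occupation sex occupation*sex.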
10.1.2 Outputs:
The SPSS will give you the following outputs (table 10.1-10.4).
Table 10.1. Between-subjects factors

Between-Subjects Factors
Value Label N
OCCUPATION 1 GOVT JOB 60
2 PRIVATE JOB 49
3 BUSINESS 49
4 OTHERS 52
SEX f FEMALE 133
m MALE 77
Table 10.2. Descriptive statistics of systolic BP by occupation and sex
Descriptive Statistics
Dependent Variable: SYSTOLIC BP
OCCUPATION SEX Mean Std. Deviation N
GOVT JOB FEMALE 130.84 21.264 38
MALE 126.86 19.548 22
Total 129.38 20.574 60
PRIVATE JOB FEMALE 131.26 21.534 31
MALE 117.89 13.394 18
Total 126.35 19.894 49
BUSINESS FEMALE 131.10 24.023 31
MALE 123.44 14.448 18
Total 128.29 21.178 49
OTHERS FEMALE 125.73 18.772 33
MALE 129.26 19.084 19
Total 127.02 18.778 52
Total FEMALE 129.73 21.309 133
MALE 124.56 17.221 77
Total 127.83 20.021 210
10.1.3 Interpretation:
Table 10.1 (between-subjects factors) shows the frequencies of occupation and
sex. Table 10.2 (descriptive statistics) shows the descriptive measures of systolic
BP at different levels of occupation and sex. For example, the mean systolic BP of
females doing the government job is 130.84 (SD 21.2) and that of males doing the
government job is 126.8 (SD 19.5).
Table 10.3 shows the Levene’s test results for homogeneity of variances. The
p-value (Sig.) of the test, as shown in the table, is 0.090. A p-value >0.05 indicates
that the variances of systolic BP at each level of the independent variables (occu-
pation and sex) are not different. Thus, the assumption 2 is not violated.
The table of “Tests of between-subjects effects" (table 10.4) shows the main
effects of the independent variables. Look at the p-values (Sig.) of occupation and
sex. They are 0.758 and 0.063, respectively. This indicates that the mean systolic
BP is not different in different occupation groups as well as sex (males and
females). Now, look at the p-value for "occupation*sex", which indicates the significance of the interaction between these two variables on systolic BP. A p-value ≤0.05 indicates the presence of an interaction, meaning that the systolic BP of different occupation groups is influenced by (depends on) sex. In our example, the p-value is 0.221 (>0.05), which means that there is no interaction between occupation and sex in influencing systolic BP. The Partial Eta Squared (last column of
the table) indicates the effect size. The Eta statistics for occupation and sex are
0.006 and 0.017, which are very small. These values are equivalent to R2 (Coeffi-
cient of Determination). Eta 0.006 indicates that only 0.6% variance of systolic BP
can be explained by occupation (and 1.7% by sex). However, most of the research-
ers do not report this in their publications.
The Post-hoc test (as discussed under one-way ANOVA) is performed if the
main effect is significant (i.e., the p-values for occupation and/or sex are <0.05),
otherwise it is not necessary. To have a clearer picture of the presence of interac-
tion, it is better to get a graph of the mean systolic BP for occupation and sex. Use
the following commands to get the graph.
Graphs > Line > Select "Multiple" > Select "Summaries for groups of cases"
> Define > Select "Other summary function" > Move the dependent variable
(sbp) into the "Variable" box > Move one independent variable (occupation)
with most categories (here occupation has more categories than sex) into the
67
"Category axis" box > Move the other independent variable (sex) into "Define
line by" box > OK
This will produce the following graph (fig 10.1). The graph shows that the difference in mean systolic BP between males (117.89) and females (131.26) is greater among private job holders than in the other occupation groups. However, this difference is not large enough to yield a statistically significant interaction between occupation and sex. This means that there is no significant variation of systolic BP across the occupation groups by sex.
Figure 10.1. Mean systolic BP by occupation (X-axis: occupation; separate lines: sex)
Section 11
Repeated Measures ANOVA: One-way
The one-way repeated measures ANOVA test is analogous to the paired samples t-test discussed earlier. The main difference is that in the paired samples t-test we have two measurements at different times (e.g., before and after giving a drug, or pre-test and post-test results) on the same subjects, while in one-way repeated measures ANOVA there are three or more measurements on the same subjects at different points in time (i.e., the subjects are exposed to multiple measurements over a period of time or conditions). The one-way repeated measures ANOVA is also called one-way within-subjects ANOVA. Use the data file <Data_repeat_anova_2.sav> for practice.
Hypothesis:
H0: The mean blood sugar level is the same at each level of measurement (i.e., the population means of blood sugar at 0, 7, 14 and 24 hours are equal).
HA: The mean blood sugar level is not the same at the different levels of measurement (i.e., the population means of blood sugar at 0, 7, 14 and 24 hours are not all equal).
Assumptions:
1. The dependent variable (blood sugar level) is normally distributed in the
population at each level of within-subjects factor;
2. The population variances of the differences between all combinations of
related groups/levels are equal (called Sphericity assumption); and
3. The subjects represent a random sample from the population.
11.1.1 Commands:
Analyze > General linear model > Repeated measures > In “Within subject
factor name” box write “time” (give any other name) after deleting factor1 >
in “Number of levels” box write “4” (since there are 4 time factors) > Add >
Write “blood_sugar” in “Measures Name” box > Add > Define > Select vari-
ables “sugar_0, sugar_7, sugar_14 and sugar_24” and push them into "With-
in-Subjects Variables" box > Options > Select "Descriptive statistics, Esti-
mates of effect size and Homogeneity tests" > Select “time” and push it into the
“Display means for” box > Select “Compare main effects” > Select “Bonfer-
roni” in the “Confidence interval adjustment” box > Continue > Plots > Select “time” and push it into the “Horizontal axis” box > Add > Continue > OK
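A minimal syntax sketch of the same analysis (assuming the variable names “sugar_0” to “sugar_24” in the practice file) is:

* One-way repeated measures ANOVA on blood sugar measured at 4 time points.
GLM sugar_0 sugar_7 sugar_14 sugar_24
  /WSFACTOR=time 4 Polynomial
  /MEASURE=blood_sugar
  /EMMEANS=TABLES(time) COMPARE ADJ(BONFERRONI)
  /PLOT=PROFILE(time)
  /PRINT=DESCRIPTIVE ETASQ
  /WSDESIGN=time.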
11.1.2 Outputs:
The SPSS will produce several tables; however, we need only the following tables (tables 11.1-11.7) to interpret the results. The tables are arranged here in a convenient order for interpretation (not in the order provided by SPSS).
Table 11.1. Codes for different levels of measurements of blood sugar
Within-Subjects Factors
Measure: Blood_sugar
Time Dependent Variable
1 sugar_0
2 sugar_7
3 sugar_14
4 sugar_24
Table 11.2. Descriptive statistics of blood sugar at different levels (times) of measurement
Descriptive Statistics
Mean Std. Deviation N
Blood sugar at hour 0 109.200 5.12975 15
Blood sugar at hour 7 103.733 3.73146 15
Blood sugar at hour 14 97.8667 4.08598 15
Blood sugar at hour 24 98.1333 5.86596 15
Table 11.3. Descriptive statistics of blood sugar at different levels of measurement with 95% CI
Estimates
Measure: Blood_sugar
Time Mean Std. Error 95% Confidence Interval
Lower Bound Upper Bound
1 109.200 1.324 106.359 112.041
2 103.733 .963 101.667 105.800
3 97.867 1.055 95.604 100.129
4 98.133 1.515 94.885 101.382
Table 11.4. Mauchly's test of sphericity
Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
a. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the Tests of Within-Subjects Effects table.
b. Design: Intercept
Within Subjects Design: Time
Table 11.5. Test results of within-subjects effects (standard and alternative univariate tests)
Table 11.6. Multivariate test results
Multivariate Testsb
Effect Value F Hypothesis Error df Sig. Partial Eta
df Squared
Time Pillai's Trace .918 44.724a 3.000 12.000 .000 .918
Wilks' Lambda .082 44.724a 3.000 12.000 .000 .918
Hotelling's Trace 11.181 44.724a 3.000 12.000 .000 .918
Roy's Largest Root 11.181 44.724a 3.000 12.000 .000 .918
a. Exact statistic
b. Design: Intercept
Within Subjects Design: Time
Table 11.7. Pair-wise comparison of mean blood sugar at different times of measurement
Pairwise Comparisons
Measure:Blood_sugar
(I) Time (J) Time Mean Difference (I-J) Std. Error Sig.a 95% Confidence Interval for Differencea
Lower Bound Upper Bound
1 2 5.467* 1.064 .001 2.202 8.732
3 11.333* 1.260 .000 7.467 15.200
4 11.067* 2.256 .001 4.143 17.990
2 1 -5.467* 1.064 .001 -8.732 -2.202
3 5.867* .608 .000 4.000 7.734
4 5.600* 1.492 .013 1.021 10.179
3 1 -11.333* 1.260 .000 -15.200 -7.467
2 -5.867* .608 .000 -7.734 -4.000
4 -.267 1.240 1.000 -4.072 3.539
4 1 -11.067* 2.256 .001 -17.990 -4.143
2 -5.600* 1.492 .013 -10.179 -1.021
3 .267 1.240 1.000 -3.539 4.072
11.1.3 Interpretation:
The outputs of the analysis are shown in tables 11.1-11.7 and figure 11.1. Table 11.1 shows the value labels (times) of the blood sugar measurements. Tables 11.2 and 11.3 (descriptive statistics and estimates) show the descriptive statistics (mean, standard deviation, number of study subjects, SE of the means, 95% CI, etc.) of the blood sugar levels at the different times of measurement: hour-0, hour-7, hour-14 and hour-24.
One of the important issues for the repeated measures ANOVA test is the Sphericity assumption, as mentioned earlier under "assumptions". Table 11.4 shows the results of Mauchly's test of Sphericity, which indicates whether the Sphericity assumption holds or is violated. The table shows that Mauchly's W is 0.095 and the p-value is 0.000. Since the p-value is <0.05, the Sphericity assumption is violated (not correct).
Three types of tests are conducted if within-subjects factors (here, it is the
times of measurement of blood sugar, which has 4 levels – hour-0, hour-7, hour-14
and hour-24) have more than 2 levels (here, we have 4 levels). The tests are:
1. Standard univariate test (Sphericity Assumed) [table 11.5];
2. Alternative univariate tests (Greenhouse-Geisser; Huynh-Feldt; Lower-
bound) [table 11.5]; and
3. Multivariate tests (Pillai's Trace; Wilks' Lambda; Hotelling's Trace; Roy's
Largest Root) [table 11.6]
All these tests evaluate the same hypothesis – i.e., the population means are
equal at all levels of the measurement. The standard univariate test is based on the
Sphericity assumption, i.e., its result is considered only if the Sphericity assumption is correct (not violated). In practice, in most cases (including our example), the Sphericity assumption is violated, and we cannot use the standard univariate test result (Sphericity Assumed, as given in table 11.5).
In our example, we see that the Sphericity assumption is violated, since the
Mauchly’s test p-value is 0.000 (table 11.4). Therefore, we shall have to pick up
the test results either from alternative univariate tests (table 11.5) or multivariate
tests (table 11.6). For practical purposes, it is recommended to use the multivariate test results for reporting, since they do not depend on the Sphericity assumption.
However, for the sake of better understanding, let me discuss table 11.5, which
indicates the standard and alternative univariate test results. Table 11.5 shows the
univariate test results of within-subjects effects. The standard univariate ANOVA
test result is indicated by the row “Sphericity Assumed”. Use this test result, when
Sphericity assumption is correct/not violated (i.e., Mauchly’s test p-value is
>0.05). Since, our data shows that the Sphericity assumption is violated, we cannot
use this test result.
When the Sphericity assumption is violated (not correct), you can use the results of one of the alternative univariate tests (i.e., Greenhouse-Geisser, Huynh-Feldt or Lower-bound) for interpretation. Researchers most commonly report the Greenhouse-Geisser test. Table 11.5 shows that this test (Greenhouse-Geisser) gives the same F-value and p-value as the other tests. Since the test's p-value is 0.000, reject the null hypothesis. This means that the mean blood sugar levels at the different times of measurement are not the same.
To make it simple, I would suggest using the multivariate test results, which
are not dependent on the Sphericity assumption. Table 11.6 shows the multivariate
test results. In the multivariate tests table, the SPSS has given several test results,
such as Pillai's Trace, Wilks’ Lambda, Hotelling’s Trace and Roy’s Largest Root.
All these multivariate tests have given the same results. It is recommended to use
the Wilks’ Lambda test results for reporting. In our example, the multivariate test
indicates a significant time effect on blood sugar levels, as the p-value of Wilks' Lambda is 0.000. This means that the population means of blood sugar levels at the different times of measurement are not the same.
The last table (table 11.7) shows pairwise comparison of means at different
times of measurement. It shows the results as we have seen under one-way ANO-
VA (table 9.4; Tukey HSD). Look at the p-values. It is better to assess the differ-
ences of adjacent measurements, such as the difference of blood sugar levels
between “time 1 & 2”, “time 2 & 3” and “time 3 & 4”. The table shows that all the
differences have p-values <0.05, except for “time 3 and 4” (p=1.0). This means
that mean blood sugar levels are significantly different in all adjacent time periods
except for the time between 3 and 4. The mean blood sugar levels at different times
of measurement are depicted in figure 11.1.
Note that if the overall test is not significant (i.e., p-value of Wilks’ Lambda is
>0.05), the table for pairwise comparison is not necessary.
Section 12
Repeated Measures ANOVA: Within and Between-Subjects
The within and between-subjects ANOVA is also called two-way repeated measures ANOVA. In the previous section, I discussed the one-way repeated measures ANOVA, which is also called within-subjects ANOVA. In within-subjects ANOVA, we have only one treatment (intervention) group. The within and between-subjects ANOVA, on the other hand, is used when there is more than one treatment group. In this method, at least 3 variables are involved – one dependent quantitative variable and two independent categorical variables with two or more levels. Use the data file <Data_repeat_anova_2.sav> for practice.
experiment, the researcher has selected 10 subjects and randomly allocated the
treatments (5 in each group). Blood sugar levels of the subjects were measured at
the baseline (sugar_0), after 7 hours (sugar_7), after 14 hours (sugar_14) and after
24 hours (sugar_24). Data is provided in the data file <Data_repeat_anova_2.sav>.
Hypothesis:
We test two hypotheses here. One is for within-subjects effects and the other is for
between-subjects effects.
H0: Daonil and Metformin are equally effective in reducing the blood sugar
levels over time (between-subjects effects).
HA: These drugs are not equally effective in reducing the blood sugar levels over time (you can also use a one-sided hypothesis, such as "Daonil is more effective in reducing the blood sugar levels over time compared to Metformin").
We can also test the hypothesis whether these drugs are effective in reducing
the blood sugar levels over time (within-subjects effects; discussed in section 11).
The assumptions of two-way repeated measures ANOVA are same as one-way
repeated measures ANOVA.
12.1.1 Commands:
Analyze > General linear model > Repeated measures > Write “time” in
“Within subject factor name” box > Write “4” in “Number of levels” box
(since we have 4 time levels) > Add > Write “blood_sugar” in “Measures
name” box > Add > Define > Select variables “sugar_0, sugar_7, sugar_14 and
sugar_24” and push them into "Within-Subjects Variables" box > Select “treat-
ment” and push it into “Between-subjects factors” box > Options > Select
“treatment” and push it into the “Display means for” box> Select “Compare
main effects” > Select “Bonferroni” in “Confidence interval adjustment” box
> Select "Descriptive statistics, and homogeneity tests" > Continue > Contrasts
> Select “time” > Select “Repeated” in the “Contrast” box under “Change
contrast” > Change > Continue > Plots > Select “time” and push it into the “Horizontal axis” box > Select “treatment” and push it into the “Separate lines” box > Add > Continue > OK
12.1.2 Outputs:
The SPSS will provide many tables, but only the relevant ones are shown below. The outputs are arranged as follows: A) Basic tables; B) Tables related to within-subjects effects; C) Tables related to between-subjects effects; D) Tables to check the assumptions; and E) Additional tables.
Table 12.1. Codes for different levels of measurement of blood sugar
Within-Subjects Factors
Measure: Bloodsugar
Time Dependent Variable
1 sugar_0
2 sugar_7
3 sugar_14
4 sugar_24
Table 12.2. Treatment groups (between-subjects factors)
Between-Subjects Factors
Value Label N
treatment groups 1 Daonil 5
2 Metformin 5
Table 12.3. Descriptive statistics of blood sugar at different levels and treatment groups
Descriptive Statistics
treatment groups Mean Std. Deviation N
Blood sugar at hour 0 Daonil 112.8000 2.16795 5
Metformin 108.4000 7.09225 5
Total 110.6000 5.46097 10
Blood sugar at hour 7 Daonil 104.0000 4.18330 5
Metformin 103.0000 4.69042 5
Total 103.5000 4.22295 10
Blood sugar at hour 14 Daonil 97.4000 3.43511 5
Metformin 98.6000 3.91152 5
Total 98.0000 3.52767 10
Blood sugar at hour 24 Daonil 94.4000 2.70185 5
Metformin 97.6000 2.50998 5
Total 96.0000 2.98142 10
B. Within-subjects effects (table 12.4-12.6):
Table 12.4. Within-subjects multivariate test results
Multivariate Testsb
Effect Value F Hypothesis Error df Sig. Partial Eta
df Squared
time Pillai's Trace .955 42.767a 3.000 6.000 .000 .955
Wilks' Lambda .045 42.767a 3.000 6.000 .000 .955
Hotelling's Trace 21.384 42.767a 3.000 6.000 .000 .955
Roy's Largest Root 21.384 42.767a 3.000 6.000 .000 .955
time * treatment Pillai's Trace .452 1.649a 3.000 6.000 .275 .452
Wilks' Lambda .548 1.649a 3.000 6.000 .275 .452
Hotelling's Trace .825 1.649a 3.000 6.000 .275 .452
Roy's Largest Root .825 1.649a 3.000 6.000 .275 .452
a. Exact statistic
b. Design: Intercept + treatment
Within Subjects Design: time
Table 12.5. Estimated means of blood sugar at different times of measurement with 95% CI
Estimates
Measure: Bloodsugar
Time Mean Std. Error 95% Confidence Interval
Lower Bound Upper Bound
1 110.600 1.658 106.776 114.424
2 103.500 1.405 100.259 106.741
3 98.000 1.164 95.316 100.684
4 96.000 .825 94.098 97.902
Table 12.6. Pairwise comparisons of blood sugar levels at different time intervals
C. Between-subjects effects (table 12.7-12.9 & fig 12.1):
Table 12.7. Test results of between-subjects effects
Table 12.8. Estimated means of blood sugar by treatment group
Estimates
Measure:Bloodsugar
treatment Mean Std. Error 95% Confidence Interval
groups Lower Bound Upper Bound
Daonil 102.150 1.503 98.683 105.617
Metformin 101.900 1.503 98.433 105.367
Table 12.9. Pairwise comparison of blood sugar between treatment groups
Pairwise Comparisons
Measure:Bloodsugar
(I) treatment groups (J) treatment groups Mean Difference (I-J) Std. Error Sig.a 95% Confidence Interval for Differencea
Lower Bound Upper Bound
Daonil Metformin .250 2.126 .909 -4.653 5.153
Metformin Daonil -.250 2.126 .909 -5.153 4.653
Figure 12.1. Blood sugar levels by treatment group (Daonil and Metformin)
D. Tables for checking assumptions (table 12.10-12.12):
Table 12.10. Box’s M test
Tests the null hypothesis that the observed covariance matrices of the
dependent variables are equal across groups.
a. Design: Intercept + treatment
Table 12.12. Mauchly's test of sphericity
Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent
variables is proportional to an identity matrix.
a. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are
displayed in the Tests of Within-Subjects Effects table.
b. Design: Intercept + treatment
Within Subjects Design: time
Table 12.13. Descriptive statistics
Descriptive Statistics
treatment groups Mean Std. Deviation N
Blood sugar at hour 0 placebo 110.4000 3.36155 5
Daonil 112.8000 2.16795 5
Total 111.6000 2.95146 10
Blood sugar at hour 7 placebo 108.6000 2.60768 5
Daonil 104.0000 4.18330 5
Total 106.3000 4.08384 10
Blood sugar at hour 14 placebo 108.6000 4.15933 5
Daonil 97.4000 3.43511 5
Total 103.0000 6.91215 10
Blood sugar at hour 24 placebo 109.4000 2.60768 5
Daonil 94.4000 2.70185 5
Total 101.9000 8.29257 10
Table 12.14. Multivariate test results
Multivariate Testsb
Effect Value F Hypothesis Error df Sig.
df
time Pillai's Trace .949 37.505a 3.000 6.000 .000
Wilks' Lambda .051 37.505a 3.000 6.000 .000
Hotelling's Trace 18.752 37.505a 3.000 6.000 .000
Roy's Largest Root 18.752 37.505a 3.000 6.000 .000
time * treatment Pillai's Trace .927 25.566a 3.000 6.000 .001
Wilks' Lambda .073 25.566a 3.000 6.000 .001
Hotelling's Trace 12.783 25.566a 3.000 6.000 .001
Roy's Largest Root 12.783 25.566a 3.000 6.000 .001
a. Exact statistic
b. Design: Intercept + treatment
Within Subjects Design: time
Table 12.15. Pairwise comparison between time adjacent blood sugar levels
Table 12.16. Test results of between-subjects effects
Estimates
Measure:Bloodsugar
treatment Mean Std. Error 95% Confidence Interval
groups Lower Bound Upper Bound
placebo 109.250 1.208 106.465 112.035
Daonil 102.150 1.208 99.365 104.935
Figure 12.2. Blood sugar levels by treatment group (Placebo & Daonil)
12.1.3 Interpretation:
A. Basic tables (outputs under A)
The outputs of the analysis are shown in the tables and graphs. Tables 12.1 and 12.2 show the value labels of the blood sugar measurements and treatment groups, respectively. Table 12.3 shows the descriptive statistics (mean, standard deviation and number of study subjects) of blood sugar levels at the different times of measurement by treatment group.
B. Within-subjects effects (outputs under B)
Table 12.4 shows the multivariate test results of within-subjects effects. Since the
Sphericity assumption is frequently violated, we would consider the multivariate
test results (table 12.4) as discussed in section 11. First, look at the interaction term
(time*treatment) in the row of Wilks’ Lambda. The p-value (Sig.) is 0.275, which
is not significant. This means that there is no interaction between time and treat-
ment (i.e., blood sugar levels over time are not dependent on the treatment groups).
Now, look at the row of Wilks’ Lambda at time. The p-value is 0.000, which is
statistically significant. This means that the mean blood sugar levels measured at
different times are significantly different (i.e., there is a significant reduction of
blood sugar levels over time in both the treatment groups; fig 12.1). Table 12.5
shows the means and 95% confidence intervals of blood sugar levels at different
times of measurement. Table 12.6 shows the difference in blood sugar levels
between the adjacent measurements. The table shows that there is significant
difference in blood sugar levels between time 1 and 2, and time 2 and 3 (p=0.000),
but not between time 3 and 4 (p=0.081). Note that the interaction is not significant for any of these comparisons.
D. Test of assumptions (outputs under D)
Whether the assumptions are violated or not is checked by: a) Box's M test (table 12.10); b) Levene's test (table 12.11); and c) Mauchly's test (table 12.12). If the assumptions are met, the p-values of all these tests will be >0.05. We can see that the p-values of all these tests are >0.05 except for Mauchly's test (p=0.017). Note that Mauchly's test evaluates the Sphericity assumption. As discussed earlier, to
interpret the results, it is recommended to use the multivariate test, which is not
dependent on Sphericity assumption.
Section 13
Association between Two Categorical Variables:
Chi-Square Test of Independence
The Chi-square test is a commonly used statistical test for testing hypotheses in health research. This test is suitable for determining the association between two categorical variables, whether the data are from cross-sectional, case-control or cohort studies. In epidemiology, cross-tabulations are also commonly done to calculate the Odds Ratio (OR) [for case-control studies] or Relative Risk (RR) [for cohort studies] with 95% Confidence Intervals (CI). The OR and RR are measures of the strength of association between two variables. Use the data file <Data_3.sav> for practice.
Hypothesis:
H0: There is no association between gender and diabetes (it can also be stated
as, gender and diabetes are independent).
HA: There is an association between gender and diabetes (or, gender and diabe-
tes are not independent).
Assumption:
1. Data have come from a random sample drawn from a selected population.
13.1.1 Commands:
Analyze > Descriptive statistics > Crosstabs > Select “sex” and push it into the
“Row(s)” box > Select “diabetes” for the “Column(s)” box > Statistics > Select
“Chi-square” and “Risk” > Continue > Cells > Select “Row” and “Column”
under percentages > Continue > OK
Note: We have selected “risk” to get the OR and RR including their 95% CIs.
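The same cross-tabulation can be run as syntax; a minimal sketch (assuming the variable names “sex” and “diabetes”):

* Chi-square test and risk estimates for sex by diabetes.
CROSSTABS
  /TABLES=sex BY diabetes
  /STATISTICS=CHISQ RISK
  /CELLS=COUNT ROW COLUMN.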
13.1.2 Outputs:
Table 13.1. Cross-tabulation between sex and diabetes mellitus
Table 13.2. Chi-square test results
Chi-Square Tests
Value df Asymp. Sig. Exact Sig. Exact Sig.
(2-sided) (2-sided) (1-sided)
Pearson Chi-Square 8.799a 1 .003
Continuity Correctionb 7.795 1 .005
Likelihood Ratio 8.537 1 .003
Fisher's Exact Test .005 .003
Linear-by-Linear Association 8.758 1 .003
N of Valid Cases 210
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 16.50.
b. Computed only for a 2x2 table
Table 13.3. Odds Ratio (OR) and Relative Risk (RR) with 95% Confidence Interval (CI)
Risk Estimate
Value 95% Confidence Interval
Lower Upper
Odds Ratio for Sex (Male / Female) 2.716 1.385 5.327
For cohort Diabetes mellitus = Yes 2.159 1.288 3.620
For cohort Diabetes mellitus = No .795 .670 .943
N of Valid Cases 210
13.1.3 Interpretation:
Table 13.1 is a 2 by 2 table of sex and diabetes with row (% within sex) and column
(% within diabetes mellitus) percentages. The question is which percentage should
you report? It depends on the situation and what you want to report. For the data
of a cross-sectional study, it may provide better information to the readers if row
percentages are reported. For example, one can understand from the table 13.1 that
the prevalence of diabetes among males is 32.5% and that of the females is 15.0%,
when row percentages are used. However, column percentages can also be report-
ed in cross-sectional studies (most of the publications use column percentages). If
column percentage is used, the meaning would be different. In this example (table
13.1), it means that among those who have diabetes, 55.6% are male, compared to
31.5% who do not have diabetes. If data are from a case-control study, you must
report the column percentages (we cannot use row percentages for case-control
studies). On the other hand, for the data of a cohort study, one should report the
row percentages. In this case it would indicate the incidence of the disease among
males and females.
We can see in table 13.1 (in the row of total) that the overall prevalence (irre-
spective of sex) of diabetes is 21.4% (consider the data is from a cross-sectional
study). Table 13.1 also shows that 32.5% of the males have diabetes compared to
only 15.0% among females (i.e., the prevalence among males and females). The
Chi-square test actually tests the hypothesis of whether the prevalence of diabetes among males and females is the same in the population.
Table 13.2 shows the Pearson Chi-square test results, including the degree of
freedom (df) and p-value (Asymp. Sig). The table also shows other test results,
such as Continuity Correction and Fisher’s Exact test. Before we look at the
Chi-square test results, it is important to check if there is any cell in the 2 by 2 table
with expected value <5. This information is given at the bottom of the table at “a”
as “0 cells (0%) have expected count less than 5”. For the use of the Chi-square
test, it is desirable to have no cell (in a 2 by 2 table) with expected count less than
5. If this is not fulfilled, we have to use the Fisher’s Exact test p-value to interpret
the result (see table 13.2). In fact, to use the Chi-square test, no more than 20%
cells should have expected frequency <5. You can have the expected frequencies
for all the cells if you select “Expected” under “Count” in “Cell” option during
analysis.
For the Chi-square test, consider the Pearson Chi-square value (see table 13.2).
In our example, Chi-square value is 8.799 and the p-value is 0.003 (table 13.2).
Since the p-value is <0.05, there is an association between sex and diabetes. It can, therefore, be concluded that the prevalence of diabetes among males is significantly higher than that among females (p=0.003).
Table 13.3 shows the OR (2.716) and its 95% CI (1.385-5.327). Use the OR if the data are from a case-control study. The OR is also sometimes used for cross-sectional data. The table also provides the RR (2.159) and its 95% CI (1.288-3.620) [take the RR and its 95% CI for diabetes = Yes]. Use the RR if the data are from a cohort study. Note that both the OR and RR are statistically significant, as their 95% CIs do not include 1. The OR of 2.716 indicates that males are 2.7 times more likely to have diabetes than females, while the RR of 2.159 indicates that the risk of having diabetes is 2.2 times higher in males than in females. SPSS will not provide the OR and RR if either of the variables has more than 2 categories (e.g., a 2 by 3 table). In such a situation, you have to obtain the OR and RR in other ways.
Section 14
Association between Two Continuous Variables:
Correlation
The nature and strength of the relationship between two or more continuous variables can be determined by regression and correlation analysis. Correlation is concerned with measuring the strength of the relationship between continuous variables. The correlation model provides information on the relationship between two variables without distinguishing which is the dependent and which is the independent variable, but the basic procedures for the regression and correlation models are the same.
Under the correlation model, we calculate the "r" value, called the sample correlation coefficient. It indicates the degree of linear relationship between the dependent (Y) and independent (X) variables. The value of "r" ranges from +1 to –1. In this section, I shall discuss the correlation model. Use the data file <Data_3.sav> for practice.
Hypothesis:
H0: There is no correlation between systolic and diastolic BP.
HA: There is a correlation between systolic and diastolic BP.
Assumptions:
1. The variables (systolic and diastolic BP) are normally distributed in the
population;
2. The subjects represent a random sample from the population.
The first step, before doing correlation, is to generate a scatter diagram. The
scatter diagram provides information/ideas about:
• Whether there is any correlation between the variables;
• Whether the relationship (if there is any) is linear or non-linear; and
• Direction of the relationship, i.e., whether it is positive (if the value of one
variable increases with the increase of the other variable) or negative (if the
value of one variable decreases with the increase of the other variable).
If you want to get the regression line on the scatter plot, use the following
commands:
Graphs > Legacy dialogues > Interactive > Scatterplot… > Select “sbp” and
drag it into the “X-axis” box > select “dbp” and drag it into the “Y-axis” box >
Click on “Fit” tab > Select “Regression” after clicking the dropdown arrow >
Ok
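In syntax form, a basic scatter plot can be produced as sketched below (the regression/fit line can then be added in the Chart Editor by double-clicking the chart):

* Scatter plot of diastolic BP against systolic BP.
GRAPH
  /SCATTERPLOT(BIVAR)=sbp WITH dbp.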
The SPSS will produce the following scatter plot (fig 14.2) with the regression
line on it. In the same manner, you can produce the scatter diagram of age and
diastolic BP (fig 14.3).
Figure 14.2. Scatter diagram of systolic and diastolic BP with regression line
Figure 14.3. Scatter diagram of diastolic BP and age with regression line
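The Pearson correlation itself is obtained through Analyze > Correlate > Bivariate (select “sbp” and “dbp”, with “Pearson” checked). A minimal syntax sketch of the same step:

* Pearson correlation between systolic and diastolic BP.
CORRELATIONS
  /VARIABLES=sbp dbp
  /PRINT=TWOTAIL NOSIG.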
14.1.3 Outputs:
Table 14.1. Pearson correlation between systolic and diastolic BP
14.1.4 Interpretation:
In the first step, I constructed the scatter plots of systolic and diastolic BP (fig 14.1 and 14.2). Figure 14.1 shows that the data points are scattered around an invisible straight line and that the diastolic BP (Y) increases as the systolic BP (X) increases. This indicates that there may be a positive correlation between
these two variables. Look at fig 14.2, which shows the regression line in the scatter
plot. The regression line runs from near the lower left corner to the upper right corner, indicating a positive correlation between systolic and diastolic BP. If the relationship were negative (inverse), the regression line would run from the upper left corner to the lower right corner. Figure 14.3 shows the scatter plot
of diastolic BP and age. It does not indicate any correlation between diastolic BP
and age, since the dots are scattered around the regression line, which is more or
less parallel to the X-axis.
For correlation, look at the value of correlation coefficient [r] (Pearson
Correlation). Table 14.1 shows that the correlation coefficient of systolic and
diastolic BP is 0.858 and the p-value is 0.000. Correlation coefficient “r” indicates
the strength/degree of linear relationship between the two variables (systolic and
diastolic BP). As the value of “r” is positive and the p-value is <0.05, there is a
significant positive correlation between systolic and diastolic BP.
The value of “r” lies between –1 and +1. Values near to “zero” indicate no
correlation, while values near to “+1” or “–1” indicate strong correlation. Negative
(– r) value indicates an inverse relationship. A value of r ≥ 0.8 indicates very strong
correlation; “r” value between 0.6 and 0.8 indicates moderately strong correlation;
“r” value between 0.3 and 0.5 indicates fair correlation and “r” value < 0.3 indi-
cates poor correlation.
14.2 Spearman’s correlation
Spearman's correlation is done (instead of the Pearson correlation) when the normality assumption is violated (i.e., when the distribution of the dependent variable, the independent variable, or both is not normal). Spearman's correlation is also applicable to two ordinal categorical variables, such as intensity of pain (mild, moderate, severe) and grade of cancer (stage 1, stage 2, stage 3, etc.).
Suppose we want to explore whether there is any correlation between systolic BP (variable name "sbp") and income (income is not normally distributed).
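Spearman's correlation is obtained through the same Bivariate dialog by ticking “Spearman” instead of “Pearson”. A minimal syntax sketch (assuming the income variable is named “income”; adjust to the actual name in your file):

* Spearman's rank correlation between systolic BP and monthly income.
NONPAR CORR
  /VARIABLES=sbp income
  /PRINT=SPEARMAN TWOTAIL NOSIG.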
14.2.2 Outputs:
Table 14.2. Spearman's correlation between systolic BP and monthly income
Correlations
Systolic BP Monthly income
Spearman's rho Systolic BP Correlation Coefficient 1.000 .007
Sig. (2-tailed) . .919
N 210 210
Monthly income Correlation Coefficient .007 1.000
Sig. (2-tailed) .919 .
N 210 210
14.2.3 Interpretation:
Table 14.2 shows the Spearman’s correlation coefficient between systolic BP and
income. The results indicate that there is no correlation between systolic BP and
income (r= 0.007; p=0.919), since the “r” value is very small and the p-value is
>0.05.
14.3 Partial correlation
Partial correlation is used to measure the strength of the linear relationship (indicated by the "r" value) between two variables after adjusting for one or more other vari-
ables (continuous or categorical). This means that through partial correlation, we
get the adjusted “r” value after controlling for the confounding factors. For exam-
ple, if we assume that the relationship between systolic and diastolic BP may be
influenced (confounded) by other variables (such as, age, and diabetes), we should
do the partial correlation to exclude the influence of other variables (age, and
diabetes). The partial correlation will provide the correlation (r value) between
systolic and diastolic BP after controlling/ adjusting for age and diabetes.
14.3.1 Commands:
Analyze > Correlate > Partial > Select “sbp” and “dbp” for “Variables” box >
Select “age” and “diabetes” for “Controlling for” box > Ok
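The equivalent syntax sketch (with the variable names above):

* Partial correlation between systolic and diastolic BP, controlling for age and diabetes.
PARTIAL CORR
  /VARIABLES=sbp dbp BY age diabetes
  /SIGNIFICANCE=TWOTAIL.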
14.3.2 Outputs:
Table 14.3. Correlation between systolic and diastolic BP after controlling for age and diabetes mellitus
Correlations
Control Variables Systolic BP Diastolic BP
Age & Diabetes mellitus Systolic BP Correlation 1.000 .847
Significance (2-tailed) . .000
df 0 206
Diastolic BP Correlation .847 1.000
Significance (2-tailed) .000 .
df 206 0
14.3.3 Interpretation:
Table 14.3 shows the results of partial correlation between systolic and diastolic
BP after adjusting for age and diabetes mellitus. We can see that r=0.847 and
p=0.000. This means that these two variables (systolic and diastolic BP) are
significantly correlated (p=0.000), even after controlling for age and diabetes
mellitus. If the relationship between systolic and diastolic BP was influenced by
age and diabetes mellitus, the crude (unadjusted) and adjusted “r” values would be
different. Look at table 14.1, which shows the crude “r” value (0.858). After
adjusting for age and diabetes, the “r” value becomes 0.847 (table 14.3). Since the
crude and adjusted “r” values are almost similar, there is no influence of age and
diabetes mellitus in the relationship between systolic and diastolic BP (i.e., age and
diabetes mellitus are not the confounding factors in the relationship between
systolic and diastolic BP).
Section 15
Linear Regression
Assumptions:
1. Normality: For any fixed value of X (systolic BP), the sub-population of Y
values (diastolic BP) is normally distributed;
2. Homoscedasticity: The variances of the sub-populations of “Y” are all
equal;
3. Linearity: The means of the sub-populations of “Y” lie on the same straight
line;
4. Independence: Observations are independent of each other.
The first step in analyzing the data for regression is to construct a scatter
diagram, which has already been discussed in section 14. This would give an idea
about the linear relationship between the variables, systolic and diastolic BP.
15.1.1 Commands:
Analyze > Regression > Linear > Select “dbp” for “Dependent" box and “sbp”
for "Independent(s)” box > Method “Enter” (usually the default) > Statistics >
Select “Estimates, Descriptive, Confidence interval, and Model fit” > Contin-
ue > Ok
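A minimal syntax sketch of the same simple linear regression:

* Simple linear regression of diastolic BP on systolic BP.
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /STATISTICS COEFF OUTS CI(95) R ANOVA
  /DEPENDENT dbp
  /METHOD=ENTER sbp.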
15.1.2 Outputs:
Table 15.1. Descriptive statistics of diastolic and systolic BP
Descriptive Statistics
Mean Std. Deviation N
DIASTOLIC BP 83.04 12.454 210
SYSTOLIC BP 127.83 20.021 210
Table 15.2. Pearson correlation between diastolic and systolic BP
Correlations
DIASTOLIC BP SYSTOLIC BP
Pearson Correlation DIASTOLIC BP 1.000 .858
SYSTOLIC BP .858 1.000
Sig. (1-tailed) DIASTOLIC BP . .000
SYSTOLIC BP .000 .
N DIASTOLIC BP 210 210
SYSTOLIC BP 210 210
Table 15.3. Model summary
Model Summary
Model R R Square Adjusted R Std. Error of the
Square Estimate
1 .858(a) .737 .736 6.403
a Predictors: (Constant), SYSTOLIC BP
Table 15.4. ANOVA table for significance of R
ANOVA(b)
Model Sum of df Mean F Sig.
Squares Square
1 Regression 23890.586 1 23890.586 582.695 .000(a)
Residual 8528.028 208 41.000
Total 32418.614 209
a Predictors: (Constant), SYSTOLIC BP
b Dependent Variable: DIASTOLIC BP
Table 15.5. Regression coefficients and their significance
Coefficients(a)
Model Unstandardized Standardized t Sig. 95% Confidence
Coefficients Coefficients Interval for B
B Std. Beta Lower Upper
Error Bound Bound
1 (Constant) 14.779 2.862 5.164 .000 9.136 20.422
SYSTOLIC BP .534 .022 .858 24.139 .000 .490 .578
a Dependent Variable: DIASTOLIC BP
15.1.3 Interpretation:
Tables 15.1 and 15.2 provide the descriptive statistics (mean and standard deviation) and the correlation coefficient (r-value) of the diastolic and systolic BP. The model summary table (table 15.3) shows the Pearson correlation coefficient "R" (r = 0.858) and the coefficient of determination "R-square" (r2 = 0.737). The value of "R" is the same as we have seen in section 14.
It is important to note the value of R-square (coefficient of determination)
given in the model summary table (table 15.3). R-square indicates the amount of
variation in “Y” due to “X”, that can be explained by the regression line. Here, the
R-square value is 0.737 (~0.74), which indicates that 74% of the variation in diastolic BP can be explained by the systolic BP. The rest of the variation (26%) is due to other factors (unexplained variation). The adjusted R-square value (0.736), as shown in
the table, is the value when the R-square is adjusted for better population estimate.
The ANOVA table (table 15.4) indicates whether the correlation coefficient (R)
is significant or not (i.e., whether the linear regression model is useful to explain
the dependent variable by the independent variable). As the p-value (Sig.) is 0.000,
R is statistically significant at 95% confidence level. We can, therefore, conclude
that there is a significant positive correlation (because R value is positive) between
the diastolic and systolic BP, and we can use the regression equation for prediction.
The table also shows the regression (also called explained) sum of squares
(23890.586) and residual (also called error) sum of squares (8528.028). The resid-
ual indicates the difference between the observed value and predicted value (i.e.,
value on the regression line). Residual sum of squares provides an idea about how
well the regression line actually fits the data.
Table 15.5 (coefficients) provides quantification of the relationship between
the diastolic and systolic BP. The table shows the values for “a” or Y-intercept
(also called constant) and “b” (unstandardized coefficients) or slope (also called
regression coefficient, β). Note that for a single independent variable, standardized
coefficient (Beta) is equal to Pearson’s correlation value.
Here, the value of “a” is 14.779 and “b” is 0.534 (both are positive). The value,
a= +14.78, indicates that the regression line crosses/cuts the Y-axis above the
origin (zero) and at the point 14.78 (a negative value indicates that the regression
line crosses the Y-axis below the origin). This value (value for a) does not have any
practical meaning, since it indicates the average diastolic BP of individuals, if the
systolic BP is 0 (zero).
The value of “b” (the regression coefficient or slope) indicates the amount of
variation/change in “Y” (here it is diastolic BP) for each unit change in “X” (sys-
tolic BP). Here, the value of “b” is 0.534, which means that if the systolic BP
increases (or decreases) by 1 mmHg, the diastolic BP will increase (or decrease)
by 0.534 mmHg. The table also shows the significance (p-value) of “b”, which is
0.000. Note that for simple linear regression, if R is significant, “b” will also be
significant and will have the same sign (positive or negative).
We know that the simple linear regression equation is, Y = a + bX (“Y” is the
predicted value of the dependent variable; “a” is the Y-intercept or constant; “b” is
the regression coefficient and “X” is a value of the independent variable). There-
fore, the regression/prediction equation for this regression model is
Y = 14.779 + 0.534X.
With this equation, we can estimate the diastolic BP from the systolic BP. For example, what would be the estimated diastolic BP of an individual whose systolic BP is 130 mmHg? The answer: the estimated diastolic BP would be (14.779 + 0.534 × 130) ≈ 84.2 mmHg.
Note that, if we want to use the regression equation for the purpose of predic-
tion/estimation, “b” has to be statistically significant (p<0.05). In our example, the
p-value for “b” is 0.000, and we can, therefore, use the equation for the prediction
of diastolic BP by systolic BP.
Table 15.5 has actually evaluated whether “b” in the population is zero or not
by the t-test (Null hypothesis: “b” is equal to “zero” in the population; Alternative
hypothesis: the population regression coefficient is not equal to “zero”). We can
reject the null hypothesis, since the p-value is <0.05. It can, therefore, be conclud-
ed that the systolic BP can be used to predict/estimate the diastolic BP using the
regression equation, Y = 14.779 + 0.534X.
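If you want SPSS to apply this equation to every case in the file, a simple illustrative way is to compute a new variable; “predicted_dbp” below is a hypothetical name of our own choosing:

* Predicted diastolic BP from the fitted equation Y = 14.779 + 0.534X.
COMPUTE predicted_dbp = 14.779 + 0.534 * sbp.
EXECUTE.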
15.2 Multiple linear regression
Multiple linear regression is done to:
• Get the adjusted estimates of regression coefficients (B) of the explanatory
variables in the model;
• Predict or estimate the value of the dependent variable by the explanatory
variables in the model; and
• Understand the amount of variation in the dependent variable explained by
the explanatory variables in the model.
Suppose, we want to assess the contribution of four variables (age, systolic BP,
sex and religion) in explaining the diastolic BP in a sample of individuals selected
randomly from a population. Here, the dependent variable is the diastolic BP and
the explanatory variables (independent variables) are age, systolic BP, sex and
religion. Of the explanatory (independent) variables, two are quantitative (age and
systolic BP) and two are categorical variables (sex and religion). Of the categorical
variables, sex has two levels (male and female) and religion has 3 levels (Islam,
Hindu and Christian). When the independent variable is categorical with more
than two levels (e.g., religion), we need to create dummy variables for that vari-
able. For example, if we want to include the variable “religion” in the regression
model, we shall have to create dummy variables for religion.
Step 1: Create the first dummy variable “reli_1” for religion
Transform > Recode into different variables > Select “religion” and push it into the “Input variable – Output variable” box > Write “reli_1” in the “Output variable name” box > Write “Dummy variable 1 for religion” in the “Label” box >
Change > Click on “Old and New Values..” > Select “Value” under “Old
value” and write 1 in the box > Select “Value” under “New value” and write 1
in the box > Add > Select “All other values” under “Old value” > Write 0
(zero) in the box “Value” under the “New value” > Add > Continue > Ok
Step 2: Create the second dummy variable “reli_2” for religion
Transform > Recode into different variables > Select “religion” and push it into
the “Input variable – Output variable” box > Write “reli_2” in the “Output vari-
able name” box > Write “Dummy variable 2 for religion” in the “Label” box >
Change > Click on “Old and New Values..” > Select “Value” under “Old
value” and write 2 in the box > Select “Value” under “New value” and write 1
in the box > Add > Select “All other values” under “Old value” > Write 0 in the
box “Value” under the “New value” > Add > Continue > Ok
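The same two recodes can be written as a short syntax sketch:

* Dummy variables for religion (assuming codes 1= Islam, 2= Hindu, 3= Christian).
RECODE religion (1=1) (ELSE=0) INTO reli_1.
RECODE religion (2=1) (ELSE=0) INTO reli_2.
VARIABLE LABELS reli_1 'Dummy variable 1 for religion'
  reli_2 'Dummy variable 2 for religion'.
EXECUTE.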
The above commands will create two dummy variables for religion, the
“reli_1” (for which code 1= Islam and 0= other religions, i.e., Hindu and Chris-
tian)” and “reli_2” (for which code 1= Hindu and 0= other religions, i.e., Islam and
Christian)”. You can see the new variables in the variable view of the data file.
Don’t forget to provide the value labels for the dummy variables.
This would create a new variable “sex_1” (last variable in the variable view)
with codes 0= female and 1= male. Go to the “variable view” of the data file and
set these code numbers in the column “Value” of the variable sex_1.
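The multiple regression itself is run through Analyze > Regression > Linear, as in section 15.1, now with all five predictors in the “Independent(s)” box. A minimal syntax sketch:

* Multiple linear regression of diastolic BP on age, systolic BP, sex and religion (dummies).
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA
  /DEPENDENT dbp
  /METHOD=ENTER age sbp sex_1 reli_1 reli_2.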
15.2.5 Outputs:
Table 15.6. Model summary
Model Summaryb
Model R R Square Adjusted R Std. Error of the
Square Estimate
1 .852a .725 .719 6.233
a. Predictors: (Constant), Reli_2, Systolic BP, age, Sex, Reli_1
b. Dependent Variable: Diastolic BP
Table 15.7. ANOVA table for significance of R
ANOVAb
Model Sum of df Mean F Sig.
Squares Square
1 Regression 20925.182 5 4185.036 107.710 .000a
Residual 7926.385 204 38.855
Total 28851.567 209
a. Predictors: (Constant), Reli_2, Systolic BP, age, Sex, Reli _1
b. Dependent Variable: Diastolic BP
Table 15.8. Adjusted regression coefficients of explanatory variables and their significance
Coefficientsa
Model Unstandardized Standardized t Sig. 95% Confidence
Coefficients Coefficients Interval for B
B Std. Beta Lower Upper
Error Bound Bound
1 (Constant) 20.894 3.420 6.110 .000 14.151 27.637
age .004 .058 .003 .070 .944 -.110 .119
Systolic BP .490 .022 .836 22.559 .000 .447 .532
Sex -2.179 .916 -.090 -2.380 .018 -3.985 -.374
Islam .162 1.369 .007 .119 .906 -2.537 2.862
Hindu -.271 1.492 -.010 -.182 .856 -3.212 2.670
a. Dependent Variable: Diastolic BP
15.2.6 Interpretation:
Table 15.6 (model summary) shows the values for R (0.852), R-square (0.725) and
adjusted R-square (0.719) [adjusted for better population estimation]. In multiple
regression, the R measures the correlation between the observed value of the
dependent variable and the predicted value based on the regression model. The
R-square may overestimate the population value if the sample size is small; the adjusted R-square gives a better population estimate. The R-square value of 0.725 indicates that the independent variables (age, systolic BP, sex and religion) together in the model explain 72.5% of the variation in diastolic BP,
which is statistically significant (p=0.000), as shown in the ANOVA table (table
15.7).
The Coefficients table (table 15.8) shows regression coefficients (unstandard-
ized and standardized), p-values (Sig.) and 95% confidence intervals (CI) for
regression coefficients of all the explanatory variables in the model along with the
constant. This is the most important table for interpretation of results. The unstan-
dardized regression coefficients (B) are shown in the table for age (0.004;
p=0.944), systolic BP (0.490; p<0.001), sex (-2.179; p=0.018 for males compared
to females), Islam (0.162; p=0.906 compared to Christian) and Hindu (-0.271;
p=0.856 compared to Christian).
From this output (table 15.8), we conclude that the systolic BP and sex are the
factors significantly influencing the diastolic BP (since the p-values are <0.05).
The other variables in the model (age and religion) do not have any significant
influence in explaining the diastolic BP. The unstandardized coefficient (B) [also
called multiple regression coefficient] for systolic BP, in this example, is 0.490
(95% CI: 0.45 to 0.53). This means that the average increase (or decrease) in
diastolic BP is 0.49 mmHg, if the systolic BP increases (or decreases) by 1 mmHg
after adjusting for all other variables (age, sex and religion) in the model. On the
other hand, the unstandardized coefficient (B) for sex is -2.179 (95% CI: -3.985 to
-0.374), which means that, on average, the diastolic BP of males is 2.2 mmHg lower than that of females (the coefficient is negative; a positive coefficient would mean higher), holding the other variables constant.
The standardized coefficients (Beta) (table 15.8) indicate which independent
variables have more influence on the dependent variable (diastolic BP): the bigger the absolute value, the greater the influence. We can see in table 15.8 that the standardized coefficients for systolic BP and sex are 0.836 and -0.090, respectively. This means that systolic BP has a greater influence than sex in explaining the diastolic BP.
Table 15.9. Correlations among the independent variables
Correlations
Systolic BP Sex Reli_1 Reli_2 age
Systolic BP Pearson Correlation 1 -.125 .026 -.011 -.042
Sig. (2-tailed) .071 .710 .870 .542
N 210 210 210 210 210
Sex Pearson Correlation -.125 1 .077 .038 -.066
Sig. (2-tailed) .071 .269 .581 .344
N 210 210 210 210 210
Reli_1 Pearson Correlation .026 .077 1 -.757* .073
Sig. (2-tailed) .710 .269 .000 .292
N 210 210 210 210 210
Reli_2 Pearson Correlation -.011 .038 -.757* 1 -.020
Sig. (2-tailed) .870 .581 .000 .776
N 210 210 210 210 210
Age Pearson Correlation -.042 -.066 .073 -.020 1
Sig. (2-tailed) .542 .344 .292 .776
N 210 210 210 210 210
**. Correlation is significant at the 0.01 level (2-tailed).
We can see in the table (15.9) that there is a moderately strong correlation
between reli_1 and reli_2 (r = -0.757), while the correlation coefficients for other
variables are low. However, the correlation between reli_1 and reli_2 did not affect
our regression analysis.
Pearson’s correlation can only check collinearity between any two variables.
Sometimes a variable may be multicollinear with a combination of variables. In
such a situation, it is better to use the tolerance measure, which gives the strength
of the linear relationships among the independent variables (usually the dummy
variables have higher correlation). To get the tolerance measure (another measure
for multicollinearity), use the following commands:
Analyze > Regression > Linear > Select “dbp” for “Dependent" box and “age,
sbp, sex_1, reli_1 and reli_2” for "Independent(s)” box > Method “Enter” >
Statistics > Select “Estimates, Confidence interval, Model fit, and Collineari-
ty diagnostics” > Continue > Ok
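The corresponding syntax sketch only adds the collinearity keywords to the /STATISTICS subcommand:

* Multiple regression with tolerance/VIF and collinearity diagnostics.
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA COLLIN TOL
  /DEPENDENT dbp
  /METHOD=ENTER age sbp sex_1 reli_1 reli_2.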
This would provide the collinearity statistics in the coefficients table as shown
in table 15.10.
Table 15.10. Collinearity statistics for multicollinearity diagnosis
Coefficientsa
Model Unstandardized Standardized t Sig. 95% Confidence Collinearity
Coefficients Coefficients Interval for B Statistics
B Std. Beta Lower Upper Tolerance VIF
Error Bound Bound
1 (Constant) 20.894 3.420 6.110 .000 14.151 27.637
age .004 .058 .003 .070 .944 -.110 .119 .983 1.018
Systolic BP .490 .022 .836 22.559 .000 .447 .532 .981 1.019
Sex -2.179 .916 -.090 -2.380 .018 -3.985 -.374 .950 1.053
Islam .162 1.369 .007 .119 .906 -2.537 2.862 .411 2.432
Hindu -.271 1.492 -.010 -.182 .856 -3.212 2.670 .416 2.404
a. Dependent Variable: Diastolic BP
The tolerance value ranges from 0 to 1. A value close to “zero” indicates that
the variable is almost in a linear combination (i.e., has strong correlation) with
other independent variables. In our example (table 15.10), the tolerance values for
age, systolic BP, and sex are more than 0.95. However, the tolerance values of
Islam (reli_1) and Hindu (reli_2) [the dummy variables] are a little more than 0.4.
The recommended tolerance level is more than 0.6 before we put the variable in
the multiple regression model. However, a tolerance of 0.4 and above is accept-
able, especially if it is a dummy variable. The other statistics provided in the last
column of the table are the VIF (Variance Inflation Factor). This is the inverse of
the tolerance value.
If there are variables that are highly correlated (tolerance value is <0.4), one
way to solve the problem is to exclude one of the correlated variables from the
model. The other way is to combine the explanatory variables together (e.g., taking
their sum).
Finally, when developing a multiple regression model, we should first check for multicollinearity and then the residual assumptions (see below). Only if these requirements are fulfilled can we finalize the regression model.
Table 15.11. Durbin-Watson statistics for checking data points are independent
Model Summaryb
Model R R Square Adjusted R Std. Error of the Durbin-Watson
Square Estimate
1 .852a .725 .719 6.233 1.701
a. Predictors: (Constant), Religion: dummy var_2, Systolic BP, age, Sex:
numeric code, Religion: dummy var_1
b. Dependent Variable: Diastolic BP
Table 15.12. Casewise diagnostics
Casewise Diagnosticsa
Case Number Std. Residual Diastolic BP Predicted Value Residual
204 -3.625 62 85.11 -23.115
a. Dependent Variable: Diastolic BP
Table 15.13. Residuals statistics
Residuals Statisticsa
Minimum Maximum Mean Std. Deviation N
Predicted Value 65.46 115.85 82.77 9.917 210
Residual -23.115 18.678 .000 6.301 210
Std. Predicted Value -1.745 3.336 .000 1.000 210
Std. Residual -3.625 2.929 .000 .988 210
Table 15.14. Residuals statistics without any outliers in the data set
Residuals Statisticsa
Minimum Maximum Mean Std. Deviation N
Predicted Value 65.32 116.16 82.77 10.006 210
Residual -15.446 18.452 .000 6.158 210
Std. Predicted Value -1.743 3.338 .000 1.000 210
Std. Residual -2.478 2.960 .000 .988 210
Look at the residuals statistics table (table 15.13). Our interest is in the Std.
Residual value. The “minimum” and “maximum” values should not exceed “+3”
or “–3”. Table 15.13 shows that the minimum value is “–3.625” (exceeded –3).
This means that there are outliers. Now, look at the casewise diagnostics table
(table 15.12). The table shows that there is an outlier in the diastolic BP, the value
of which is 62 and the case number (ID number) is 204 (if there is no outlier in the
data, this table will not be provided). If no outlier is present in the data, we shall
get a Residuals Statistics table like table 15.14.
The Durbin-Watson test is done to check whether data points are independent.
The Model Summary table (table 15.11) shows the Durbin-Watson statistics
results in the last column. The Durbin-Watson estimate ranges from 0 to 4. Values
around 2 indicate that the data points are independent. Values near zero indicate a
strong positive correlation and values near 4 indicate a strong negative correlation.
The table shows that the value of the Durbin-Watson statistics, in our example, is
1.701, which is close to 2 (i.e., the data points are independent).
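The residual diagnostics shown above (the Durbin-Watson statistic, the casewise diagnostics for outliers, and the normal probability plot in fig 15.2) can be requested together in syntax; a rough sketch:

* Residual diagnostics: Durbin-Watson, normal P-P plot, and outliers beyond 3 SD.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT dbp
  /METHOD=ENTER age sbp sex_1 reli_1 reli_2
  /RESIDUALS DURBIN NORMPROB(ZRESID)
  /CASEWISE PLOT(ZRESID) OUTLIERS(3).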
Figure 15.2. Normal probability plot
Let us see how to use the “Stepwise” method (a commonly used method in multiple regression analysis) for modelling. To do this, use the following commands (the only change is in “Method”):
Analyze > Regression > Linear > Select “dbp” for “Dependent" box and “age,
sbp, sex_1, reli_1 and reli_2” for "Independent(s)” box > Method “Stepwise”
(fig 15.4) > Statistics > Select “Estimates, Confidence interval, and Model fit”
> Continue > Ok
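The stepwise run differs from the earlier syntax sketch only in the /METHOD subcommand (the PIN/POUT values shown are the SPSS defaults mentioned below):

* Stepwise selection with entry p=0.05 and removal p=0.10.
REGRESSION
  /STATISTICS COEFF OUTS CI(95) R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT dbp
  /METHOD=STEPWISE age sbp sex_1 reli_1 reli_2.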
Table 15.16. Adjusted regression coefficients and models with the stepwise method
Coefficientsa
Model Unstandardized Standardized t Sig. 95% Confidence
Coefficients Coefficients Interval for B
B Std. Beta Lower Upper
Error Bound Bound
1 (Constant) 19.407 2.793 6.948 .000 13.900 24.913
Systolic BP .496 .022 .847 22.961 .000 .453 .539
2 (Constant) 21.016 2.838 7.405 .000 15.421 26.611
Systolic BP .490 .022 .836 22.768 .000 .447 .532
Sex: numeric -2.180 .893 -.090 -2.441 .015 -3.941 -.419
a. Dependent Variable: Diastolic BP
15.2.10.2 Interpretation:
During the analysis, we included 5 independent variables (age, systolic BP, sex, and two dummy variables for religion) in the model. The SPSS has provided table 15.16, which shows the adjusted regression coefficients and the models. Let us compare the outputs in table 15.16 with those in table 15.8, where we used the "enter" method. In table 15.8, the SPSS retained all the independent variables that were included in the model, and only systolic BP and sex were found to be significantly associated with the dependent variable (diastolic BP). With the "stepwise" method, the SPSS has provided two models – model 1 and model 2. In model 1, there is only one independent variable (systolic BP); in model 2, there are two independent variables (systolic BP and sex; the others are automatically removed). We consider the last model as the final model.
Sometimes you may need to include certain variable(s) in the model for theoretical or practical reasons. In such a situation, after deriving the model with the "stepwise" method, add the additional variable(s) of your choice and re-run the model using the "enter" method.
For automatic selection method, you can specify the inclusion (entry) and
exclusion (removal) criteria of the variables. Usually, the inclusion and exclusion
criteria, set as default in SPSS, are 0.05 and 0.10, respectively (fig 15.5). You can,
however, change the criteria based on your requirements. Finally, for model build-
ing, the researcher should decide the variables to be included in the final model
based on theoretical understanding and empirical findings.
Section 16
Logistic Regression
Logistic regression is a commonly used statistical method for health data analysis.
Logistic regression is done when the outcome variable is a dichotomous variable,
such as diabetes (present or absent), vaccinated (yes or no) and an outcome (died
or did not die). The purposes of logistic regression analysis (and other multivari-
able analysis) are to: a) Adjust the estimate of risk for a number of confounding
factors set in the model; b) Determine the relative contribution of factors to a
single outcome; c) Predict the probability of an outcome for a number of indepen-
dent variables in the model; and d) Assess interaction of multiple variables for the
outcome. Use the data file <Data_4.sav> for practice.
Assumptions:
Logistic regression does not make any assumptions about the distribution of the predictor (independent) variables. However, it is sensitive to high correlation among the independent variables (multicollinearity). Outliers may also affect the results of logistic regression.
16.1.1 Commands:
Analyze > Regression > Binary logistic > Put “diabetes” in the “Dependent” box > Put “sex_1”, “age”, “pepticulcer” and “f_history” in the “Covariates” box > Categorical > Push “sex_1, pepticulcer and f_history” into the “Categorical covariates” box > Select “f_history” > Select “first” under “Change contrast” (we are doing this because 0 is our comparison group) > Click on “Change” > (do the same for all the variables in the “Categorical covariates” box) > Continue >
Options > Select “Classification plots, Hosmer-Lemeshow goodness-of-fit,
Casewise listing of residuals, Correlations of estimates, and CI for exp(B)” >
Continue > OK
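For readers who like syntax, the menu steps above paste approximately the following commands (a sketch; details may vary slightly across SPSS versions):

* Binary logistic regression of diabetes on sex, age, peptic ulcer and family history.
* Indicator(1) makes the first category (coded 0) the reference group.
LOGISTIC REGRESSION VARIABLES diabetes
  /METHOD=ENTER sex_1 age pepticulcer f_history
  /CONTRAST (sex_1)=Indicator(1)
  /CONTRAST (pepticulcer)=Indicator(1)
  /CONTRAST (f_history)=Indicator(1)
  /CLASSPLOT
  /CASEWISE OUTLIER(2)
  /PRINT=GOODFIT CORR CI(95)
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).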
16.1.2 Outputs:
SPSS provides many tables while doing the logistic regression analysis. Only the
useful tables are discussed here. After the basic tables (tables 16.1 to 16.3), the
outputs of logistic regression are provided under Block 0 and Block 1.
A. Basic tables:
Table 16.3. Categorical Variables Codings

Table 16.4. Classification Table (Block 0)
Observed                               Predicted
                                       DIABETES MELLITUS     Percentage Correct
                                       No         Yes
Step 0   DIABETES MELLITUS    No       165        0           100.0
                              Yes      45         0           .0
         Overall Percentage                                   78.6
16.1.4 Interpretation: Outputs under Block 0
Analysis of data without any independent variable in the model is provided under
Block 0. The results indicate the baseline information that can be compared with
the results when independent variables are put into the model (provided under
Block 1).
Look at the classification table (table 16.4). The table indicates the overall
percentage of correctly classified cases (78.6%). We will see whether this value
increases with the introduction of the independent variables into the model under
Block 1 (given in table 16.8). If the value remains the same, the independent
variables in the model do not contribute to predicting diabetes (the dependent
variable). In our example, the overall percentage has increased after inclusion of
the independent variables in the model (90.5%; table 16.8 under Block 1)
compared to the value under Block 0 (78.6%). This means that adding the
independent variables improved the ability of the model to predict the
dependent variable.
Table 16.5. Omnibus Tests of Model Coefficients
                   Chi-square    df    Sig.
Step 1   Step      101.589       4     .000
         Block     101.589       4     .000
         Model     101.589       4     .000
Table 16.8. Classification Tablea
Observed Predicted
Diabetes mellitus Percentage
No Yes Correct
Step 1 Diabetes mellitus No 159 6 96.4
Yes 14 31 68.9
Overall Percentage 90.5
a. The cut value is .500
16.1.5 Interpretation: Outputs under Block 1
Omnibus tests of Model Coefficients (table 16.5): This table indicates whether
the overall performance of the model is better if independent variables are includ-
ed in the model compared to the model without any independent variables (given
under Block 0). We want this test to be significant (p-value < 0.05). In this exam-
ple, the p-value of the Omnibus test is 0.000, which indicates that the proposed
model is better than the model without any predictor (independent) variables.
Model summary table (table 16.6): This table indicates usefulness of the model.
The Cox & Snell R-square and Nagelkerke R-square (called pseudo R-square)
values provide indication about the amount of variation in the outcome variable
that can be explained by the independent variables in the model. In this example,
the values of the pseudo R-square are 0.384 (Cox & Snell R-square) and 0.593
(Nagelkerke R-square), respectively. This means that between 38.4% and 59.3%
of the variation in the outcome variable can be explained by the independent
variables in the model. This information is not needed if the objective of the
analysis is only to obtain adjusted Odds Ratios.
Classification table (table 16.8): This table indicates how well the model is able
to predict the correct category in each case (have or do not have the disease). This
table shows that the overall accuracy of this model to predict diabetes (with a
predicted probability of 0.5 or greater) is 90.5%. This table also shows the Sensi-
tivity and Specificity of the model as 68.9% (31 ÷ 45) and 96.4% (159 ÷ 165),
respectively. Positive and negative predictive values can also be calculated from
the table, which are 83.8% (31÷37) and 91.9% (159÷173), respectively. Interpreta-
tion of the findings of this table is somewhat involved, since it requires an
understanding of sensitivity, specificity, and positive and negative predictive
values.
However, the key information to check is the overall percentage. Compare this
value with the value under the Block 0 outputs. We expect this value (overall
percentage) to increase; otherwise, adding the independent variables to the
model has no impact on prediction. We can see that the overall percentage of
cases correctly classified by the model is 90.5% under Block 1 (table 16.8).
This value has improved compared to the value we have seen under Block 0
(78.6%; table 16.4). This means that adding the independent variables to the model
improved the ability of the model to predict the dependent variable. This informa-
tion is needed if the intention of the analysis is prediction. If the objective is
adjustment for confounding factors, we can ignore this information.
Variables in the equation (table 16.9): This is the most important table to look at.
This table shows the results of logistic regression analysis. This table indicates
how much each of the independent variables contributes to predict/explain the
outcome variable. This table also indicates the adjusted Odds Ratio (OR) and its
95% confidence interval (CI). The B values (column 3) are the logistic regres-
sion coefficients for the variables in the model. These values are used to calculate
the probability that an individual has the outcome. A positive value indicates an
increased likelihood of the outcome, while a negative value indicates a decreased
likelihood of the outcome. The exponential of B [Exp(B)] is the adjusted OR.
Let us see how to interpret the results. There are 4 independent (explanatory)
variables in the model – age (as a continuous variable), sex, family history of
diabetes and peptic ulcer. The table shows the adjusted OR [Exp(B)], the 95% CI for
the adjusted OR and the p-value (Sig.). The adjusted OR for sex is 4.891 (95% CI:
1.737-13.774), which is statistically significant (p=0.003). Here, our comparison
group is female (see table 16.3). This indicates that males are 4.9 times more likely
to have diabetes compared to females after adjusting (or controlling) for age, fami-
ly history of diabetes and peptic ulcer. Similarly, persons with a family
history of diabetes are 3.1 times more likely (OR: 3.06; 95% CI: 1.11-8.43;
p=0.03) to have diabetes compared to those without a family history,
after adjusting for age, sex, and peptic ulcer. Interpretation of Exp(B) for age is a
little different since the variable was entered as a continuous variable. Here,
Exp(B) for age is 1.259. This means that the odds of having diabetes
increase by 25.9% [Exp(B) – 1; i.e., 1.259 – 1] (95% CI: 16.6-35.9) with each
one-year increase in age, which is statistically significant (p<0.001).
If we want to know which variable contributed most to the model, look at
the Wald statistics. The higher the Wald value, the greater the importance. Age
is the most important variable in the model since it has the highest
Wald value (34.6).
Casewise list (table 16.11): This table provides information about the cases for
which the model does not fit well. Look at the ZResid values (last column). Values
above 2.5 indicate outliers that do not fit the model well. The case numbers
are shown in column 1. If such cases are present, they all need to be examined
closely. Under the “Predicted Group” column, you may see “Y (means yes)” or
“N (means no)”. If it is “Y”, the model predicts that the case (case no. 137, in our
example) should have diabetes, but in reality (in the data) the subject does not
have diabetes (see the observed column, where it is “N”). Similarly, if it is “N”
under the “Predicted Group”, the model predicts that the case should not have
diabetes, but in reality the subject has diabetes.
A large area under the ROC curve indicates a good model for prediction. To
generate the ROC curve, use the following commands:
Analyze > Regression > Binary logistic > Put “diabetes” in the “dependent”
box > Put “sex_1”, age, pepticulcer, and f_history in the “Covariate” box >
Categorical > Push “sex_1, pepticulcer and f_history” into the “Categorical covari-
ates” box > Select “f_history” > Select “First” under “Change contrast” (we are
doing this because 0 is our comparison group) > Click on “Change” > (do the
same thing for all the variables in “Categorical covariates” box) > Continue >
Options > Select “Classification plots, Hosmer-Lemeshow goodness-of-fit,
Casewise listing of residuals, Correlations of estimates, and CI for exp(B)” >
Continue > Save > Select “Probabilities” under “Predicted values” > Continue
> OK
This will create a new variable, Pre_1 (predicted probability) (look at the
bottom of the variable view). Now, to get the ROC curve, use the following com-
mands:
Analyze > ROC curve > Select “Pre_1” for the “Test variable” box and “diabe-
tes” for the “State variable” box and put “1” for the value of the state variable
(since code 1 indicates individuals with diabetes) > Select “ROC curve” and
“Standard error and Confidence interval” under “Display” > OK
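The same two steps can be run with syntax like the following sketch (SPSS itself names the saved variable PRE_1):

* Step 1: refit the logistic model and save the predicted probabilities.
LOGISTIC REGRESSION VARIABLES diabetes
  /METHOD=ENTER sex_1 age pepticulcer f_history
  /CONTRAST (sex_1)=Indicator(1)
  /CONTRAST (pepticulcer)=Indicator(1)
  /CONTRAST (f_history)=Indicator(1)
  /SAVE=PRED.
* Step 2: plot the ROC curve for the saved probabilities (1 = has diabetes).
ROC PRE_1 BY diabetes (1)
  /PLOT=CURVE
  /PRINT=SE
  /CRITERIA=CUTOFF(INCLUDE) TESTPOS(LARGE) DISTRIBUTION(FREE) CI(95).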
Table 16.12. Area Under the Curve
The test result variable(s): Predicted probability has at least one tie between the positive
actual state group and the negative actual state group. Statistics may be biased.
a. Under the nonparametric assumption
b. Null hypothesis: true area = 0.5
Figure 16.2. ROC Curve
Commands for automatic selection of independent variables (use the data file
<Data_3.sav>):
Analyze > Regression > Binary logistic > Put “diabetes” in the “Dependent”
box > Put “sex_1”, age, pepticulcer, and f_history in the “Covariate” box >
Select “Backward LR” from “Method” (fig 16.3) > Categorical > Push “sex_1,
pepticulcer and f_history” into the “Categorical covariates” box > Select “f_his-
tory” > Select “First” under “Change contrast” (we are doing this because 0 is
our comparison group) > Click on “Change” > (do the same thing for all the
variables in “Categorical covariates” box) > Continue > Options > Select
“Classification plots, Hosmer-Lemeshow goodness-of-fit, Casewise listing of
residuals, Correlations of estimates, and CI for exp(B)” > Continue > OK
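A syntax sketch of the backward stepwise (likelihood ratio) selection, assuming the same variable names:

* Backward LR selection; PIN and POUT are the default entry/removal criteria.
LOGISTIC REGRESSION VARIABLES diabetes
  /METHOD=BSTEP(LR) sex_1 age pepticulcer f_history
  /CONTRAST (sex_1)=Indicator(1)
  /CONTRAST (pepticulcer)=Indicator(1)
  /CONTRAST (f_history)=Indicator(1)
  /PRINT=GOODFIT CI(95)
  /CRITERIA=PIN(.05) POUT(.10).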
Figure 16.3
SPSS will give the following table (table 16.13) along with others (not shown
here as they are not relevant). We can see that the analysis is done in 4 steps (Steps
1 to 4; the first column of the table). In the first step, all the independent variables are
in the model. Gradually, SPSS has removed the variables that are not significantly
associated with the outcome. Finally, SPSS has provided the final model (Step 4)
with a single variable (sex) in it, which is significantly associated with the
outcome. Had the “Enter” method been used, SPSS would have given us only
step 1 (see table 16.9).
The inclusion (entry) and exclusion (removal) criteria, set as defaults in SPSS,
are 0.05 and 0.10, respectively. As discussed in section 15, you can change the
“Entry” and “Removal” criteria from the “Options” dialogue box (under “Prob-
ability for Stepwise”). Finally, for model building, you should decide which variables
to include in the final model based on theoretical understanding and empirical
findings.
Section 17
Survival Analysis
On the other hand, the event time is the amount of time contributed by the
patients who developed the outcome of interest during the study period.
If we have the above information, it is possible to estimate the median survival
times and cumulative survival probabilities for two or more treatment groups for
comparison. Such a comparison allows us to answer the question “which treatment
delays the time of occurrence of the event”. The method commonly used to
analyze the survival-time data is the Kaplan-Meier method, and SPSS can be used
to analyze such data. Use the data file <Data_survival_4.sav> for practice.
Assumptions:
• The probability of the outcome is similar among the censored and under-ob-
servation individuals;
• There is no secular trend over the calendar period;
• The risk is uniform during the interval;
• Losses are uniform over the interval.
17.1.1 Commands:
Analyze > Survival > Kaplan Meier > Push the variable “time” to “Time” box
> Push the variable “outcome” in the “Status” box > Click “Define event” >
Select “Single value” and type “1” (here 1 is the event) in the box > Continue
> Push “treatment” in the “Factor” box > Click Options… > Select “Survival
table(s), Mean and Median survival” under statistics > Select “Survival” under
“Plots” > Continue > Click “Compare Factor…” > Select “Log rank” under
“Test statistics” > Continue > OK
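A syntax sketch of the Kaplan-Meier commands above (subcommand details may vary slightly by SPSS version):

* Kaplan-Meier analysis of survival time by treatment group; 1 = event (death).
KM time BY treatment
  /STATUS=outcome(1)
  /PRINT TABLE MEAN
  /PLOT SURVIVAL
  /TEST LOGRANK
  /COMPARE OVERALL POOLED.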
17.1.2 Outputs:
The SPSS will give the following outputs.
Table 17.3. Survival Table
Table 17.4. Overall Comparisons
                          Chi-square   df   Sig.
Log Rank (Mantel-Cox)     4.660        1    .031
Test of equality of survival distributions for the different levels of Treatment group.

Table 17.5. Overall Comparisons
                                  Chi-square   df   Sig.
Log Rank (Mantel-Cox)             4.660        1    .031
Breslow (Generalized Wilcoxon)    6.543        1    .011
Tarone-Ware                       6.066        1    .014
Test of equality of survival distributions for the different levels of Treatment group.
17.1.3 Interpretation:
Table 17.1 is the summary table indicating the number of study subjects in each
group (22 in the placebo and 22 in the new treatment group) and the number of
events (deaths) that occurred in each group, including the number censored. The table
shows that in the treatment group, 11 patients died and 11 were censored, while in
the placebo group, 16 died and 6 were censored.
Table 17.2 shows the mean and median survival times for both the placebo and
new treatment groups. We do not consider the mean survival time for reporting.
We consider the median survival time. The median survival time is the time when
the cumulative survival probability is 50%. The table indicates that the median
survival time, if the patient is in the placebo group, is 40 days (95% CI:
14.71-65.28), while it is 146 days (95% CI: 89.5-202.42), if the patient is in the
new treatment group. This means that the new treatment increases the survival
time, i.e., the new treatment is associated with longer time to event (and placebo is
associated with shorter time to event). Thus, we conclude that the person lives
longer if s/he receives the new treatment compared to the placebo.
Table 17.3 shows the survival probability (Cumulative Proportion Surviving at
the Time) at different points of time in the placebo and treatment group. From the
table, we can see that the cumulative survival probability at the end of 71 days, in
the placebo group, is 0.273 (27.3%). Since there is no death after that, the cumula-
tive survival probability at the end of 182 days will be the same (27.3%).
On the other hand, the cumulative survival probability is 0.304 (30.4%) at the
end of 168 days, if the patient is in the new treatment group. As there is no death
after that, the cumulative survival probability at the end of 181 days will be the
same (30.4%). In the new treatment group, the cumulative survival probability at
the end of 71 days is about 0.722 (72.2%), which is much higher than in the place-
bo group (27.3%). This indicates that the probability of survival at the end of 71
days is higher among the patients who received new treatment compared to place-
bo. This may indicate the benefit of the new treatment (i.e., the new treatment is
better than the placebo).
However, if we consider the cumulative survival probabilities of patients in both
these groups at the end of 180 days, the difference is small – 27.3% in
the placebo group and 30.4% in the treatment group. Even so, the survival
probability is still slightly higher if the person is on the new treatment than
on the placebo.
We can also estimate the median survival time (it is the time when the cumula-
tive survival probability is 50%) in both these groups from this table. The median
survival time for placebo group is 40 days and that of the treatment group is 146
days (see table 17.3). Now, the question is whether the survival experiences of
these two groups in the population are different. To answer this question,
we have to use a statistical test (the Log Rank test), as given in table 17.4.
Table 17.4 shows the Log Rank test results. For an objective comparison of the
survival experience of two groups, it is desirable to use some statistical methods
that would tell us whether the difference of the survival experiences in the popula-
tion is statistically significant or not. Here, the null hypothesis is “there is no
difference in the survival experience of these two groups (new treatment and
placebo) in the population”. Such a null hypothesis is tested by the Log Rank test. The
Log Rank test results show that the p-value is 0.031, which is <0.05. This means
that the survival experience of these two groups in the population is not the same. In
other words, it indicates that the survival probability is better if the patient receives
the new treatment (i.e., the new treatment is more effective than the placebo
in improving the patients’ survival).
Note that there are alternative procedures for testing the null hypothesis that
the two survival curves are identical. They are Breslow test, Tarone-Ware test and
Peto test (table 17.5). The Log Rank test ranks all the deaths equally, while the
other tests give more weight to early deaths. The options are available in SPSS
under the “Compare Factor” tab.
Section 18
Cox Regression
The Cox Regression is also called Proportional Hazards Analysis. In the previous
section (section 17), I have discussed the survival analysis using the Kaplan Meier
method. Like other regression methods (e.g., multiple linear regression and logis-
tic regression), Cox Regression is a multivariable analysis technique where the
dependent measure is a mixture of time-to-event and censored time observations.
Use the data file <Data_survival_4.sav> for practice.
18.1.1 Commands:
Let us use the previous example and data for Cox Regression analysis along with
the variables sex and age for adjustment. Note that the variable “treatment” has
two categories – placebo (coded as “0”) and new treatment (coded as “1”).
Analyze > Survival > Cox Regression > Push “time” into the “Time” box >
Push “outcome” into the “Status” box > Click on “Define event” > In “Single
value” box write 1 (since 1 is the code no. of the event) > Continue > Push
“treatment, age and sex” into the “Covariate” box > Click on “Categorical” >
Push “treatment and sex” into the “Categorical Covariates” box > Select
“Last” from “Reference category” (usually the default) under “Change
contrast” > Continue > Click on “Options” > Select “CI for Exp(B) and
Correlation of estimates” > Continue > Click on “Plot” > Select “Survival and
Log minus Log” > Select the variable “treatment” from the “Covariate Values
Plotted at” and push into the “Separate Line for” box > Continue > Ok
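A syntax sketch of this Cox regression, assuming the variable names used in the example:

* Cox regression with treatment and sex as categorical covariates (last
* category as reference, the default) and age as a continuous covariate.
COXREG time
  /STATUS=outcome(1)
  /PATTERN BY treatment
  /CONTRAST (treatment)=Indicator
  /CONTRAST (sex)=Indicator
  /METHOD=ENTER treatment age sex
  /PLOT SURVIVAL LML
  /PRINT=CI(95) CORR
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20).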
18.1.2 Outputs:
Only relevant tables are provided below.
Table 18.1. Case Processing Summary
                                                            N     Percent
Cases available in analysis   Eventa                        27    61.4%
                              Censored                      17    38.6%
                              Total                         44    100.0%
Cases dropped                 Cases with missing values     0     .0%
                              Cases with negative time      0     .0%
                              Censored cases before the
                                earliest event in a stratum 0     .0%
                              Total                         0     .0%
Total                                                       44    100.0%
a. Dependent Variable: Survival time in days

Table 18.2. Categorical Variable Codings
                                  Frequency    (1)b
Treatmentc   0=Placebo            22           1
             1=New treatment      22           0
Sexa,d       0=Male               21           1
             1=Female             23           0
a. Indicator Parameter Coding
b. The (0,1) variable has been recoded, so its coefficients will not be the same as for
indicator (0,1) coding.
c. Category variable: treatment (Treatment gr)
d. Category variable: sex (Sex)
Table 18.3. Variables in the Equation
18.1.3 Interpretation:
Table 18.1 shows the number of cases that are analyzed. Table 18.2 is very import-
ant for interpretation. This table indicates which category of the categorical vari-
ables is the comparison group. Look at the last column [(1)b]. The value “0” in this
column indicates the comparison group. In the table “New treatment” is indicated
as “0” in the last column. Therefore, in the analysis, “new treatment” is the com-
parison group (though “new treatment” is actually coded as “1”). Similarly,
“females” are the comparison group in this analysis since the value of being
“female” is “0” in the last column. We shall, therefore, get the Hazard Ratio for the
“placebo” group compared to the “new treatment” group and for the “males” com-
pared to the “females”, as shown in table 18.3 (Variables in the Equation).
Our main interest is in table 18.3 (Variables in the Equation). The table indi-
cates the Hazard Ratio [Exp(B)], p-value (Sig.) and 95% confidence interval (CI)
for the Hazard Ratio [95% CI for Exp(B)]. The Hazard Ratio for the variable
“treatment” is 2.72 (95% CI: 1.11-6.69) and the p-value is 0.028. This indicates
that compared to the “new treatment”, patients in the “placebo” group are 2.72
times more likely to have shorter time to event after controlling for “age” and
“sex”, which is statistically significant (p=0.028). On the other hand, males are
more likely (2.43 times) to have shorter time to event compared to the females
after controlling for the variables “treatment” and “age” (p=0.041). Age, inde-
pendently, does not have any significant effect on the survival time, since the
p-value is 0.698.
Figure 18.1 shows the survival plot of the heart failure patients by treatment
group. The upper line is for the new treatment group and the lower one is for the
placebo group. The figure shows the outcome difference between the new treat-
ment and placebo. The group represented by the upper line has the better survival
probability.
However, before we conclude the results, we have to check if: a) there is multi-
collinearity among the independent variables; and b) relative hazards over the time
are proportional (also called the proportionality assumption of the proportional
hazards analysis). Look at the SE of the variables in the model (table 18.3). There
is no value which is very small (<0.001) or very large (>5.0) (refer to the logistic
regression analysis in section 16), indicating that there is no problem of multicol-
linearity in the model.
For the second assumption, we need to check the log-minus-log survival plot
(fig 18.2). If there is a constant vertical difference between the two curves (i.e.,
curves are parallel to each other), it means that the relative hazards over time are
proportional. If the curves cross each other, or are much closer together at some
points in time and much further apart at other points in time, then the assumption
is violated. In our example, the two lines are more or less parallel, indicating that the
assumption is not violated. When the proportional hazards assumption is violated,
it is recommended to use Cox regression with time-dependent covariates to
analyze the data.
Section 19
Non-parametric Methods
Non-parametric tests, in general, are done when the quantitative dependent vari-
able is not normally distributed. Non-parametric tests are also used when the data
are measured in nominal and ordinal scales. Table 19.1 shows the
non-parametric methods recommended in place of the parametric tests when the
dependent variable is not normally distributed in the population. Note that
non-parametric tests are less sensitive than parametric tests and may,
therefore, fail to detect differences between groups that actually exist. Use the data
file <Data_3.sav> for practice.
19.1.1 Commands:
Analyze > Nonparametric tests > 2 Independent samples > Select “sbp” and
push into the “Test Variable List” box > Select “sex_1” and push into the
“Grouping Variable” box > Click on “Define Groups” > Write 0 in “Group1”
box and 1 in “Group 2” box (note: our code nos. are 0 for female and 1 for
male) > Continue > Select “Mann-Whitney” under “Test Type” > Ok
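In syntax form, this test can be run as follows (a sketch):

* Mann-Whitney U test of systolic BP by sex (groups coded 0 = female, 1 = male).
NPAR TESTS
  /M-W= sbp BY sex_1(0 1)
  /MISSING ANALYSIS.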
19.1.2 Outputs:
Table 19.2. Mann-Whitney: Ranks

Table 19.3. Test Statisticsa
                          Systolic BP
Mann-Whitney U            4535.500
Wilcoxon W                7538.500
Z                         -1.379
Asymp. Sig. (2-tailed)    .168
a. Grouping Variable: Sex: numeric
19.1.3 Interpretation:
Our interest is in table 19.3. Just look at the p-value of the test. Here, the p-value
is 0.168, which is >0.05. This indicates that the distribution of systolic BP among
males and females is not different (or the median systolic BP of males and females
is not different). However, when reporting this test result, the median systolic BP of
females and males should also be reported. To get the medians, use the following commands.
Analyze > Compare means > Means > Select “sbp” and push into the “Depen-
dent List” box > Select “sex_1” and push it into the “Independent List” box >
Remove “Mean, Number of cases and Standard deviation” from the “Cell
Statistics” box > Select “Median” from “Statistics” box and push it into the
“Cell Statistics” box > Continue > OK
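A syntax sketch for obtaining the medians:

* Report the median systolic BP separately for females and males.
MEANS TABLES=sbp BY sex_1
  /CELLS=MEDIAN.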
You will get the following tables (19.4 & 19.5). Table 19.5 shows the median
systolic BP of males and females.
Table 19.4. Case Processing Summary
                             Cases
                             Included        Excluded        Total
                             N     Percent   N     Percent   N     Percent
Systolic BP * Sex: numeric   210   100.0%    0     .0%       210   100.0%
Table 19.5. Median systolic BP by sex
Sex: numeric    Median Systolic BP
Female          124.00
Male            122.00
Total           123.00
19.2.1 Commands:
Analyze > Nonparametric tests > 2 Related Samples > Select “post-test and
pre-test” together and push into the “Test Pairs” box > Options > Select “De-
scriptive” and “Quartile” > Continue > Select “Wilcoxon” under “Test Type”
> Ok
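A syntax sketch of the Wilcoxon signed ranks test:

* Wilcoxon signed ranks test comparing the paired pre- and post-test scores.
NPAR TESTS
  /WILCOXON=post_test WITH pre_test (PAIRED)
  /STATISTICS DESCRIPTIVES QUARTILES
  /MISSING ANALYSIS.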
19.2.2 Outputs:
Table 19.6. Descriptive Statistics
Table 19.7. Test Statisticsb
19.2.3 Interpretation:
Table 19.6 shows the descriptive statistics of pre- and post-test scores. The median
(50th percentile) score of pre-test is 52.0, while the median score is 92.5 for the
post-test. The difference between these scores is quite big. Look at table 19.7. The
p-value of the Wilcoxon Signed Ranks test is 0.000, which is highly significant.
This indicates that the pre- and post-test scores (medians) are significantly differ-
ent. We, therefore, conclude that the training has significantly improved the
knowledge of the participants (since the median of the post-test score is signifi-
cantly higher than that of the pre-test score).
19.3.1 Commands:
Analyze > Nonparametric Tests > K Independent Samples > Select “sbp” and
push into “Test Variable List” box > Select “religion” and push it into “Group-
ing Variable” box > Click “Define Range” > Write 1 in “Minimum” box and
3 in “Maximum” box (the religion has code numbers from 1 to 3) > Continue
> Options > Select “Quartile” > Continue > Select “Kruskal Wallis H” > Ok
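A syntax sketch of the Kruskal-Wallis test:

* Kruskal-Wallis H test of systolic BP across the religion groups (codes 1 to 3).
NPAR TESTS
  /K-W=sbp BY religion(1 3)
  /STATISTICS QUARTILES
  /MISSING ANALYSIS.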
19.3.2 Outputs:
Table 19.8. Ranks

Table 19.9. Test Statisticsa,b
               Systolic BP
Chi-Square     .054
df             2
Asymp. Sig.    .973
a. Kruskal Wallis Test
b. Grouping Variable: religion
19.3.3 Interpretation:
Table 19.9 shows the Kruskal Wallis test results (dependent variable is systolic BP
and grouping variable is religion with 3 levels – Muslim, Hindu and Christian as
shown in table 19.8). The p-value (Asymp. Sig.) of the Chi-square test is 0.973,
which is >0.05. Therefore, we are unable to reject the null hypothesis. We
conclude that the median of the systolic BP among religious groups is not signifi-
cantly different. You can get the median of the systolic BP using the commands as
mentioned under Mann-Whitney U test. The medians of systolic BP in different
religious groups are provided in table 19.10.
Table 19.10. Median systolic BP by religion
Religion     Median Systolic BP
MUSLIM       122.00
HINDU        126.00
Christian    121.50
Total        123.00
Suppose we want to assess whether blood sugar levels differ at four time points
(hour 0, hour 7, hour 14 and hour 24) after administration of a drug. To conduct this
study, we have selected 15 individuals randomly from a population and measured
their blood sugar levels at the baseline (hour 0). All the individuals were then given
the drug, and their blood sugar levels were measured again at hour 7, hour 14 and
hour 24. The blood sugar levels at hour 0, hour 7, hour 14, and hour 24 are named
in SPSS as Sugar_0, Sugar_7, Sugar_14 and Sugar_24, respectively. Use the data
file <Data_Repeat_anova_2.sav> for exercise.
19.4.1 Commands:
Analyze > Nonparametric Tests > K Related Samples > Select “sugar_0,
sugar_7, sugar_14 and sugar_24” and push them into “Test Variables” box >
Statistics > Select “Quartile” > Continue > Select “Friedman” > Ok
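A syntax sketch of the Friedman test:

* Friedman test for the four repeated blood sugar measurements.
NPAR TESTS
  /FRIEDMAN=sugar_0 sugar_7 sugar_14 sugar_24
  /STATISTICS QUARTILES
  /MISSING LISTWISE.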
19.4.2 Outputs:
Table 19.11. Descriptive Statistics
N Percentiles
25th 50th (Median) 75th
Blood sugar at hour 0 15 106.0000 110.0000 115.0000
Blood sugar at hour 7 15 100.0000 105.0000 110.0000
Blood sugar at hour 14 15 96.0000 100.0000 107.0000
Blood sugar at hour 24 15 95.0000 98.0000 110.0000
Table 19.12. Ranks
                          Mean Rank
Blood sugar at hour 0     3.80
Blood sugar at hour 7     2.73
Blood sugar at hour 14    1.63
Blood sugar at hour 24    1.83

Table 19.13. Test Statisticsa
N             15
Chi-Square    27.562
df            3
Asymp. Sig.   .000
a. Friedman Test
19.4.3 Interpretation:
Outputs are provided in tables 19.11-19.13. Table 19.11 shows the median blood
sugar levels at 4 different time periods. Look at the Friedman test results as provid-
ed in table 19.13. The Chi-square value is 27.56 and the p-value (Asymp. Sig.) is
0.000, which is <0.05. This indicates that there is a significant difference in blood
sugar levels across the 4 time periods (p<0.001). The findings indicate that the
drug is effective in reducing the blood sugar levels.
19.5.1 Commands:
Analyze > Nonparametric tests > Chi-square > Move the variable “diabetes”
into the “Test Variable List” box > Select “Values” under “Expected Values” >
Write “0.18” in the box > Add > Again write “0.82” (1 minus 0.18) in the box
> Add > Click on “Options” > Select “Descriptive” > Continue > Ok
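A syntax sketch of this goodness-of-fit test (the expected proportions are listed in ascending order of the category codes):

* One-sample chi-square test: does the prevalence of diabetes differ from 18%?
NPAR TESTS
  /CHISQUARE=diabetes
  /EXPECTED=0.18 0.82
  /STATISTICS DESCRIPTIVES
  /MISSING ANALYSIS.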
19.5.2 Outputs:
Table 19.15. Test Statistics
Diabetes mellitus
Chi-Square .268a
df 1
Asymp. Sig. .605
a. 0 cells (.0%) have expected frequencies less than 5.
The minimum expected cell frequency is 42.0.
19.5.3 Interpretation:
Table 19.14 provides the observed and expected frequencies for those who have
diabetes (as “Yes”) and those who do not have diabetes (as “No”). These are the
descriptive information, and you do not need to report them. Table 19.15 is the
main table for interpreting the results. Our interest is in the p-value. The Chi-square
value is 0.268 and the p-value is 0.605. Since the p-value is >0.05, we cannot reject
the null hypothesis. This means that the prevalence of diabetes in the population
may not be different from 18%.
Section 20
Checking Reliability of Scale: Cronbach’s Alpha
When the researchers select a scale (e.g., a scale to measure depression) in their
study, it is important to check that the scale is reliable. One of the ways to check
the internal consistency (reliability) of the scale is to calculate the Cronbach’s
alpha coefficient. Cronbach’s alpha indicates the degree to which the items in the
scale correlate with each other in the group.
Ideally, the Cronbach’s alpha value should be above 0.7. However, this value
is sensitive to the number of items in the scale. If the number of items in the scale
is less than 10, the Cronbach’s alpha coefficient tends to be low. In such a situation,
it is more appropriate to use the “mean inter-item correlations”. The optimum
range of the mean inter-item correlation is between 0.2 and 0.4. Use the data file
<Data_cronb.sav> for practice.
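The menu path for this analysis is Analyze > Scale > Reliability Analysis. A syntax sketch for a four-item scale (q1 to q4, as in the example below):

* Cronbach's alpha with the inter-item correlation matrix, item-total
* statistics and the mean inter-item correlation.
RELIABILITY
  /VARIABLES=q1 q2 q3 q4
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /STATISTICS=CORR
  /SUMMARY=TOTAL MEANS.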
20.1.1 Outputs:
Table 20.1. Case Processing Summary
N %
Cases Valid 60 100.0
Excludeda 0 .0
Total 60 100.0
a. Listwise deletion based on all variables in the procedure.
Table 20.4. Inter-Item Correlation Matrix
      q1      q2      q3      q4
q1    1.000   .512    .491    .548
q2    .512    1.000   .635    .630
q3    .491    .635    1.000   .589
q4    .548    .630    .589    1.000
Table 20.7. Scale Statistics
20.1.2 Interpretation:
The Reliability Statistics table (table 20.2) shows the Cronbach’s alpha value. In
this example, the value is 0.839. This indicates very good correlation among the
items in the scale (scale is reliable).
However, before looking at the value of Cronbach’s alpha, look at the table
“Inter-item Correlation Matrix” (table 20.4). All the values in the table should be
positive (all are positive in our example). One or more negative values (if there is
any) indicate that some of the items have not been “reverse scored” correctly. This
information is also provided in the table “Item-Total Statistics” (table 20.6). All the
values under “Corrected Item - Total Correlation” should be positive (there should
not be any negative values).
The corrected item-total correlation (table 20.6) indicates the degree to which
each item correlates with the total score. In our example, the values are 0.60, 0.71,
0.67 and 0.70. A low value for any item (<0.3) could be a problem. If both the
Cronbach’s alpha value is low (<0.7; table 20.2) and an item’s corrected item-total
correlation is low (<0.3), one may consider omitting that item from the scale. In our
example, there is no such problem.
However, if the number of items is small in the scale (fewer than 10), it may be
difficult to get a reasonable Cronbach’s alpha value. In such a situation, report the
Mean Inter-Item Correlation value (Summary Item Statistics table; table 20.5). In
this example, the inter-item correlation values range from 0.491 to 0.635, and the
mean is 0.567 (the suggested optimum range of the mean is 0.2 to 0.4). This indicates
a strong relationship among the items.
Section 21
Analysis of Covariance (ANCOVA): One-way ANCOVA
Hypothesis:
Suppose you want to assess whether the mean systolic BP (dependent variable) is the same
among males and females (independent variable) after controlling for diastolic BP
(covariate).
H0: There is no difference in the mean systolic BP between males and females
in the population (after controlling for diastolic BP).
HA: The mean systolic BP of males and females is different in the population
(after controlling for diastolic BP).
Assumptions:
1. The dependent variable is normally distributed at each level of the indepen-
dent variable;
2. The variances of the dependent variable for each level of the independent
variable are the same (homogeneity of variance);
3. The covariates (if more than one) are not strongly correlated with each other
(r<0.8);
4. There is a linear relationship between the dependent variable and the covari-
ates at each level of the independent variable;
5. There is no interaction between the covariate (diastolic BP) and the inde-
pendent variable (sex) [called homogeneity of regression slopes].
21.1.1 Commands:
A. Homogeneity of regression slopes (Assumption 5): First, we have to
check the homogeneity of regression slopes, using the following commands. Note
that the SPSS variable name for sex is “sex_1” (0= female; 1= male), for systolic BP
is “sbp” and for diastolic BP is “dbp”.
Analyze > General linear model > Univariate > Push “sbp” into the “Depen-
dent Variables” box > Push “sex_1” into the “Fixed Factor” box > Push “dbp”
in the “Covariate box” > Click Model > Select “Custom” under “Specify mod-
el” > Confirm that interaction option is showing in the “Build Terms” box >
Push “sex_1” and “dbp” into the “Model” box > Click on “sex_1” in the “Factors
& Covariates” box > While holding down the Ctrl key, click on “dbp” in the “Factors
& Covariates” box > Push them into the “Model” box (you will see
“dbp*sex_1” in the Model box) > Continue > Ok
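A syntax sketch of this check (the interaction term sex_1*dbp is the part we are testing):

* Custom model including the factor-by-covariate interaction, to test the
* homogeneity of regression slopes.
UNIANOVA sbp BY sex_1 WITH dbp
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /CRITERIA=ALPHA(.05)
  /DESIGN=sex_1 dbp sex_1*dbp.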
Value Label N
Sex_1 0 Female 133
1 Male 77
Tests of Between-Subjects Effects
Dependent Variable: SYSTOLIC BP
Source            Type III Sum of Squares   df    Mean Square   F         Sig.
Corrected Model   61902.160a                3     20634.053     194.296   .000
Intercept         385.931                   1     385.931       3.634     .058
Sex_1             7.019                     1     7.019         .066      .797
Dbp               41714.418                 1     41714.418     392.795   .000
Sex_1 * dbp       17.964                    1     17.964        .169      .681
Error             21877.006                 206   106.199
Total             3515465.000               210
Corrected Total   83779.167                 209
a. R Squared = .739 (Adjusted R Squared = .735)
The p-value (Sig.) for the interaction term (Sex_1 * dbp) is 0.681, which is >0.05.
This indicates that the homogeneity of regression slopes assumption is not violated.
A p-value of <0.05 would indicate that the regression slopes are not homogeneous
and that the ANCOVA test is inappropriate.
B. One-way ANCOVA:
To perform the one-way ANCOVA, use the following commands:
Analyze > General linear model > Univariate > Push “sbp” into the “Depen-
dent Variables” box > Push “sex_1” into the “Fixed Factor” box > Push “dbp”
in the “Covariate” box > Click “Model” > Select “Full Factorial” > Continue
> Options > Select “sex_1” and push it into the “Display Means for” box (this
would provide the adjusted means) > Select “Compare main effects” > Select
“Bonferroni” from “Confidence interval adjustment” > Select “Descriptive
Statistics, Estimates of effect size, and Homogeneity tests” under “Display” >
Continue > Ok
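A syntax sketch of the one-way ANCOVA:

* One-way ANCOVA of systolic BP by sex with diastolic BP as covariate;
* EMMEANS gives the adjusted means with Bonferroni-adjusted comparisons.
UNIANOVA sbp BY sex_1 WITH dbp
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /EMMEANS=TABLES(sex_1) WITH(dbp=MEAN) COMPARE ADJ(BONFERRONI)
  /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
  /CRITERIA=ALPHA(.05)
  /DESIGN=dbp sex_1.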
Value Label N
Sex 0 Female 133
1 Male 77
Table 21.6. Tests of Between-Subjects Effects
Dependent Variable: SYSTOLIC BP
Source               Type III Sum of Squares   df    Mean Square   F         Sig.   Partial Eta Squared
Corrected Model      61884.196a                2     30942.098     292.534   .000   .739
Intercept            688.574                   1     688.574       6.510     .011   .030
Dbp (diastolic BP)   60580.272                 1     60580.272     572.740   .000   .735
Sex_1                143.945                   1     143.945       1.361     .245   .007
Error                21894.971                 207   105.773
Total                3515465.000               210
Corrected Total      83779.167                 209
a. R Squared = .739 (Adjusted R Squared = .736)
Table 21.7. Estimated Marginal Means: Sex
Dependent Variable: SYSTOLIC BP
Sex      Mean       Std. Error   95% Confidence Interval
                                 Lower Bound   Upper Bound
Female   127.191a   .898         125.421       128.962
Male     128.942a   1.186        126.604       131.281
a. Covariates appearing in the model are evaluated at the following values:
DIASTOLIC BP = 83.04.
Table 21.8. Pairwise Comparisons
Dependent Variable: Systolic BP
(I) Sex: numeric   (J) Sex: numeric   Mean Difference (I-J)   Std. Error   Sig.a   95% Confidence Interval for Differencea
                                                                                   Lower Bound   Upper Bound
Female             Male               -1.751                  1.559        .263    -4.825        1.322
Male               Female             1.751                   1.559        .263    -1.322        4.825
Based on estimated marginal means
a. Adjustment for multiple comparisons: Bonferroni.
Table 21.9. Pairwise Comparisons
Dependent Variable: Systolic BP
(I) Religion   (J) Religion   Mean Difference (I-J)   Std. Error   Sig.a   95% Confidence Interval for Differencea
                                                                           Lower Bound   Upper Bound
MUSLIM         HINDU          -.448                   1.705        1.000   -4.562        3.666
               CHRISTIAN      .948                    2.313        1.000   -4.635        6.532
HINDU          MUSLIM         .448                    1.705        1.000   -3.666        4.562
               CHRISTIAN      1.397                   2.535        1.000   -4.721        7.514
CHRISTIAN      MUSLIM         -.948                   2.313        1.000   -6.532        4.635
               HINDU          -1.397                  2.535        1.000   -7.514        4.721
Based on estimated marginal means
a. Adjustment for multiple comparisons: Bonferroni.
About 73.5% of the variance in systolic BP can be explained by the diastolic BP, after controlling for sex (Partial Eta Squared = 0.735; table 21.6).
Table 21.7 (estimated marginal) shows the adjusted (adjusted for diastolic BP)
means of the dependent variable (systolic BP) at different levels of the indepen-
dent variable (sex). We can see that the mean systolic BP of females is 127.19
mmHg and that of males is 128.94 mmHg, after adjusting for diastolic BP (note
that the adjusted means are different from the unadjusted means as shown in table
21.4).
Table 21.8 is the table for pairwise comparison. This table is not necessary in
this example, since the independent variable (sex) has two levels. If the indepen-
dent variable has more than two levels, then the table for pairwise comparison is
important to look at, especially if there is a significant association between the
dependent and independent variable. Look at table 21.9 [this is an additional table
I have provided where the independent variable (religion) has three categories],
which shows the pairwise comparison of mean systolic BP by religious groups.
The results indicate that there is no significant difference of the mean systolic BP
among different religious groups after controlling for diastolic BP, since all the
p-values are >0.05.
Section 22
Two-way ANCOVA
In two-way ANCOVA, there are two independent categorical variables with two or
more levels/categories, while in one-way ANCOVA, there is only one independent
categorical variable with two or more levels. Therefore, in two-way ANCOVA,
four variables are involved. They are:
• One continuous dependent variable (e.g., diastolic BP, blood sugar, post-test
score, etc.);
• Two categorical independent variables (with two or more levels) [e.g., occu-
pation, diabetes, type of drug, etc.]; and
• One or more continuous covariates (e.g., age, systolic BP, income, etc.).
Use the data file <Data_3.sav> for practice.
The example used in this section assesses the effects of two independent variables
(occupation and diabetes), and their interaction, on the dependent variable (diastolic
BP). For the analysis, we shall use the data file <Data_3.sav>. Note that the SPSS
variable name for diastolic BP is “dbp”, for occupation is “occupation”, for diabetes
is “diabetes” and for age is “age”.
Assumptions:
All the assumptions mentioned under one-way ANCOVA are applicable for
two-way ANCOVA. Look at one-way ANCOVA for the assumptions and how to
check them.
22.1.1 Commands:
To perform the two-way ANCOVA, use the following commands:
Analyze > General linear model > Univariate > Push “dbp” into the “Depen-
dent Variables” box > Push “occupation” and “diabetes” into the “Fixed
Factor” box > Push “age” into the “Covariate” box > Click “Model” > Select
“Full Factorial” > Continue > Options > Push “occupation, diabetes and occu-
pation*diabetes” into the “Display Means for” box (this would provide the
adjusted means of the diastolic BP for occupation and diabetes) > Select
“Compare main effects” > Select “Bonferroni” from “Confidence interval
adjustment” > Select “Descriptive Statistics, Estimates of effect size, and
Homogeneity tests” > Continue > Plots > Select “occupation” and push into
the “Horizontal” box > Select “diabetes” and push it into the “Separate Lines”
box > Click “Add” > Continue > Ok
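A syntax sketch of the two-way ANCOVA:

* Two-way ANCOVA of diastolic BP by occupation and diabetes, with age
* as covariate, adjusted means and a profile plot of the interaction.
UNIANOVA dbp BY occupation diabetes WITH age
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /PLOT=PROFILE(occupation*diabetes)
  /EMMEANS=TABLES(occupation) WITH(age=MEAN) COMPARE ADJ(BONFERRONI)
  /EMMEANS=TABLES(diabetes) WITH(age=MEAN) COMPARE ADJ(BONFERRONI)
  /EMMEANS=TABLES(occupation*diabetes) WITH(age=MEAN)
  /PRINT=DESCRIPTIVE ETASQ HOMOGENEITY
  /CRITERIA=ALPHA(.05)
  /DESIGN=age occupation diabetes occupation*diabetes.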
22.1.2 Outputs:
Table 22.1. Between-Subjects Factors
                              Value Label    N
OCCUPATION          1         GOVT JOB       60
                    2         PRIVATE JOB    49
                    3         BUSINESS       49
                    4         OTHERS         52
DIABETES MELLITUS   0         No             165
                    1         yes            45
Table 22.2. Descriptive Statistics (unadjusted)

Table 22.4. Tests of Between-Subjects Effects
Dependent Variable: DIASTOLIC BP
Source                  Type III Sum of Squares   df    Mean Square   F         Sig.   Partial Eta Squared
Corrected Model         495.245a                  8     61.906        .390      .925   .015
Intercept               101042.759                1     101042.759    636.198   .000   .760
age                     34.661                    1     34.661        .218      .641   .001
occupation              229.135                   3     76.378        .481      .696   .007
diabetes                12.293                    1     12.293        .077      .781   .000
occupation * diabetes   124.301                   3     41.434        .261      .854   .004
Error                   31923.370                 201   158.823
Total                   1480603.000               210
Corrected Total         32418.614                 209
a. R Squared = .015 (Adjusted R Squared = -.024)
Table 22.5. Estimated Marginal for occupation (adjusted)
1. OCCUPATION
Dependent Variable: DIASTOLIC BP
OCCUPATION Mean Std. Error 95% Confidence Interval
Lower Bound Upper Bound
GOVT JOB 83.605a 2.183 79.300 87.910
PRIVATE JOB 81.253a 2.436 76.449 86.056
BUSINESS 84.374a 2.041 80.350 88.399
OTHERS 81.708a 1.974 77.815 85.601
a. Covariates appearing in the model are evaluated at the
following values: age = 26.5143.
Table 22.6. Estimated Marginal Means for diabetes (adjusted)
2. DIABETES MELLITUS
Dependent Variable: DIASTOLIC BP
DIABETES MELLITUS   Mean      Std. Error   95% Confidence Interval
                                           Lower Bound   Upper Bound
No                  83.037a   .990         81.086        84.988
yes                 82.433a   1.930        78.627        86.239
a. Covariates appearing in the model are evaluated at the following values: age = 26.5143.
Table 22.8. Pairwise Comparisons (Multiple Comparisons)
Dependent Variable: Diastolic BP
(I) Occupation   (J) Occupation   Mean Difference (I-J)   Std. Error   Sig.a   95% Confidence Interval for Differencea
                                                                               Lower Bound   Upper Bound
GOVT JOB PRIVATE JOB 2.139 3.089 1.000 -6.093 10.371
BUSINESS -.379 2.821 1.000 -7.896 7.138
OTHERS 1.778 2.778 1.000 -5.625 9.180
PRIVATE JOB GOVT JOB -2.139 3.089 1.000 -10.371 6.093
BUSINESS -2.518 3.002 1.000 -10.519 5.482
OTHERS -.361 2.963 1.000 -8.257 7.534
BUSINESS GOVT JOB .379 2.821 1.000 -7.138 7.896
PRIVATE JOB 2.518 3.002 1.000 -5.482 10.519
OTHERS 2.157 2.678 1.000 -4.978 9.291
OTHERS GOVT JOB -1.778 2.778 1.000 -9.180 5.625
PRIVATE JOB .361 2.963 1.000 -7.534 8.257
BUSINESS -2.157 2.678 1.000 -9.291 4.978
Based on estimated marginal means
a. Adjustment for multiple comparisons: Bonferroni.
Figure 22.1 Mean diastolic BP of different occupation groups by diabetes after adjustment for age
22.1.3 Interpretation:
Tables 22.1 and 22.2 show the descriptive statistics. All the means provided in
table 22.2 are crude (unadjusted) means, i.e., not adjusted for age.
Table 22.3 shows the results of Levene’s test of Equality of Error Variances. This
is the test for homogeneity of variances. We expect the p-value (sig.) to be >0.05
to meet the assumption. In this example, the p-value is 0.459, which is more than
0.05. This means that the variances of the dependent variable (diastolic BP) are
the same for each level of the independent variables (occupation and diabetes).
Table 22.4 (tests of between-subjects effects) is the main table showing the
results of the two-way ANCOVA test. We tested the hypothesis whether:
• Mean diastolic BP (in the population) in different occupation groups is same
after controlling for age;
• Mean diastolic BP (in the population) among diabetics and non-diabetics is
same after controlling for age; and
• Is there any interaction between occupation and diabetes after controlling for
age?
Look at the p-values for occupation, diabetes and occupation*diabetes in table
22.4. They are 0.696, 0.781 and 0.854, respectively, indicating that none of them
are statistically significant. This means that occupation and diabetes do not have
any influence on the diastolic BP after controlling for age. There is also no interac-
tion between occupation and diabetes after controlling for age. However, we
should always check the p-value of interaction first. If the interaction is significant
(p-value <0.05), then the main effects (of occupation and diabetes) are not import-
ant, because effect of one independent variable is dependent on the level of the
other independent variable.
The effect size is indicated by the value of Partial Eta Squared, which estimates
the amount of variance in the dependent variable that is explained by the indepen-
dent variable. We can see that the effect sizes are very small both for occupation
(0.007) and diabetes (0.000) (table 22.4).
We can also have information about the influence of the covariate (age) on the
dependent variable (diastolic BP). We can see (table 22.4) that the p-value for age
is 0.641, which is not statistically significant. This indicates that there is no signifi-
cant association between age and diastolic BP after controlling for occupation and
diabetes. The value of Partial Eta Squared for age is 0.001 (0.1%). This means that
less than 1% variance in diastolic BP can be explained by age, after controlling for
occupation and diabetes.
Tables 22.5 and 22.6 (estimated marginal) show the adjusted means of the
diastolic BP (dependent variable) at different levels of the independent variables
(occupation and diabetes) after controlling for age. In this example, the adjusted
mean of diastolic BP of government job holders is 83.6 mmHg and that of the
diabetics (diabetes mellitus: yes) is 82.4 mmHg, after controlling for age. Similar-
ly, the last table (table 22.7) shows the adjusted mean of diastolic BP of different
occupation groups by diabetes.
Table 22.8 is the table of pairwise comparison of the mean diastolic BP in
different occupation groups. This table is necessary when the independent variable
has more than two levels, and there is a significant association between the depen-
dent and independent variable. Look at the p-values (Sig.) in table 22.8. Since all
the p-values are >0.05, there is no significant difference of the mean diastolic BP
in the population between different occupation groups after controlling for age.
Figure 22.1 plots the mean diastolic BP of different occupation groups disag-
gregated by diabetes. Finally, from the data, we conclude that diastolic BP is
not influenced by occupation or diabetes (there is no association) after
controlling for age.
Annex
Table A.1. Codebook of data file <Data_3.sav>
SPSS variable name   Actual variable name               Variable code
ID_no                Identification number              Actual value
age                  Age in years                       Actual value
sex                  Sex: string                        m= Male; f= Female
sex_1                Sex: numeric                       0= Female; 1= Male
religion             Religion                           1= Islam; 2= Hindu; 3= Others
religion_2           Religion 2                         1= Islam; 2= Hindu; 3= Christian; 4= Buddha
occupation           Occupation                         1= Government job; 2= Private job; 3= Business; 4= Others
income               Monthly family income in Tk.       Actual value
sbp                  Systolic blood pressure in mmHg    Actual value
dbp                  Diastolic blood pressure in mmHg   Actual value
f_history            Family history of diabetes         0= No; 1= Yes
pepticulcer          Have peptic ulcer                  1= Yes; 2= No
diabetes             Have diabetes mellitus             1= Yes; 2= No
post_test            Post-test score                    Actual value
pre_test             Pre-test score                     Actual value
date_ad              Date of hospital admission         Actual value
date_dis             Date of discharge                  Actual value