
Basic SPSS Guidance 1

1. Descriptive analysis of student exam marks: - The mean mark was 64.75% with a standard deviation of 9.28% - The marks were normally distributed based on Kolmogorov-Smirnov test 2. One sample t-test to compare water pH to safe range: - The mean pH of 7.07 was not significantly different than the safe range of 6.5-7.5. - The 95% CI was between 6.6 to 7.4, including the safe range. - The water pH was concluded to be safe for consumption. 3. Paired t-test example to compare BP before and after treatment: - This test

Uploaded by

Lim Kaishi
Copyright
© © All Rights Reserved

SPSS Basic Guidance

1
Descriptive analysis & Parametric analysis

Content:
1. Descriptive analysis
2. Parametric analysis
2.1 One sample T-test
2.2 Dependent T-test
2.3 Independent T-test
2.4 ANOVA
3. Extra
- Extra 1: Degree of freedom
- Extra 2: Standard error of the mean (SEM)

Disclaimer: This is a student's work. If there is any mistake, please contact the author.
1. Descriptive analysis

A lecturer recorded the final marks obtained by his students in biostatistics exam.
The individual marks for 60 students are given below.

75 62 71 63 79 60 57 70 67 81 53 70
70 67 66 65 56 68 65 72 65 69 68 83
63 63 67 58 77 45 57 64 75 62 60 75
65 59 54 51 55 85 67 55 66 48 72 50
76 87 62 50 48 56 60 65 67 67 68 64
a. How to key in data?

Analyze > Descriptive Statistics > Explore > tick any options that you need for your study

b. What to write? / Interpretation

• There is a total of 60 data points.
• The mean mark obtained by the students is 64.75% with a standard deviation of 9.28%.
• The maximum mark is 87% and the minimum mark is 45%.
• The range is 42% (range = max – min).
• The median is 65%, which means half of the students scored 65% or below and half scored 65% or above.
• Since the mean and median marks are close to each other, we can assume that the data is symmetrically distributed.
If they are far from each other, the data may not be symmetrically distributed.

• The skewness value is 0.148, which is within ± 1, thus the data is assumed to be symmetrical.
Note: If > 1 = positively skewed / if < -1 = negatively skewed

• The kurtosis is -0.042, which is within ± 1, thus the data is mesokurtic.
Note: If > 1 = leptokurtic / if < -1 = platykurtic

• Since the sample size is more than 50, the Kolmogorov-Smirnov test is used.
If the sample size is < 50, use the Shapiro-Wilk test.

• The p-value (Sig.) is 0.2, which is greater than 0.05, thus the data is assumed to be normally distributed.
If the p-value < 0.05, the data is NOT normally distributed, and we need to consider 2 things:
1. If there is a significant outlier, report it and remove it; if the data then becomes normal, it is OK.
2. If there is no significant outlier, then move on to a non-parametric test (normally it is this one).

• The histogram is symmetric.
Note: sometimes it can be negatively skewed / positively skewed.

• Stem-and-leaf plot (check if there are any extreme cases)
• From the plot, no extreme cases are present.
If there are extreme cases, report how many there are and which values they are.

• There are no outliers or extreme cases present in the box plot.
• The whiskers are the same length.
• The median is located at the center of the box.
If there is a ‘*’, it is an extreme case; if there is an ‘o’, it is an outlier.
• Report how many cases there are and which cases they are.

Note:
Extreme = those that do not belong to the study population

Outliers = they are part of the study population but on the outskirts
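The descriptive figures reported above can be cross-checked outside SPSS. The following is a minimal sketch using only Python's standard library; the variable names are chosen here for illustration and are not part of the SPSS output. It covers the count, mean, standard deviation, median and range only (not skewness, kurtosis or the normality test).

```python
import statistics

# Exam marks of the 60 students, copied from the table in section 1.
marks = [
    75, 62, 71, 63, 79, 60, 57, 70, 67, 81, 53, 70,
    70, 67, 66, 65, 56, 68, 65, 72, 65, 69, 68, 83,
    63, 63, 67, 58, 77, 45, 57, 64, 75, 62, 60, 75,
    65, 59, 54, 51, 55, 85, 67, 55, 66, 48, 72, 50,
    76, 87, 62, 50, 48, 56, 60, 65, 67, 67, 68, 64,
]

n = len(marks)
mean = statistics.mean(marks)
sd = statistics.stdev(marks)            # sample SD (divides by n - 1)
median = statistics.median(marks)
data_range = max(marks) - min(marks)    # range = max - min

print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}")
print(f"median = {median}, min = {min(marks)}, max = {max(marks)}, range = {data_range}")
```

If the mean and median printed here are close to each other, that supports the symmetry argument made in the interpretation.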

2. T-test and ANOVA


2.1 One sample t-test

Uses: To compare mean of SAMPLE to a PREDEFINED MEAN VALUE.

The pH range of water that is safe to consume is from 6.5 to 7.5. To determine if the
water is safe to drink, 20 samples have been collected from different places as below:

7.45 7.15 7.47 8.37 5.74 6.40 8.05 7.51 8.45 6.39
5.55 7.91 7.35 6.54 7.63 5.85 7.62 6.98 6.14 6.92

Objective: To test if the water is safe to consume

a. How to key in data?

1. Key in the data vertically in one column.
2. Go to Variable View and change the label; make sure the MEASURE is set to SCALE.
3. Analyze > Compare Means > One-Sample T Test
4. Put the test value as the mean that you want to compare with.

b. What to write? / Interpretation
Ho = The mean pH of the water is 7
H1 = The mean pH of the water is not 7

1. Always check for normality first!!!

Since the sample size is < 50, the Shapiro-Wilk test is used. The p-value is 0.56, which is > 0.05, so the data is normally distributed.
Note: if it is not normal, we need to carry out data transformation.

Understanding
1. A small t-value shows that the sample mean and the test value are similar.
2. Here it means the sample mean is only 0.381 standard errors away from the test value, so the two are close to each other.

• The mean of the 20 data points is 7.07 with an SD of 0.86.
• The standard error is 0.19.
• The standardized difference, t, is 0.381 and the mean difference is 0.07.
• The 95% confidence interval is [(-0.33 + 7.0), (0.48 + 7.0)] = [6.67, 7.48] *from test value. Remember to add (+) the test value to get the CI!!
• The two-tailed p-value is 0.707, which is greater than 0.05. This means that if the true mean pH were 7, there would be a 70.7% probability of getting a difference at least this large by chance.

Conclusion
• The p-value of the test is greater than 0.05, thus we do not reject (accept) the null hypothesis.
• The mean pH of the water is around 7 (the sample mean is not significantly different from the pre-determined mean).
• We are 95% confident that the mean pH of the water is between 6.67 and 7.48.
If the p-value < 0.05, then reject the null hypothesis: the sample mean is different from the pre-determined mean.

Note: if the normality test shows the data is not normal → check for outliers → if there are any, remove them → check again whether the data has become normal.
- If it is still not normal, then a transformation is needed.
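The arithmetic behind the one-sample t-test can be reproduced by hand. The sketch below uses only Python's standard library; the critical value 2.093 (two-tailed, 5%, df = 19) is an assumption taken from a standard t-table rather than computed, and the p-value is not calculated here because the standard library has no t-distribution.

```python
import math
import statistics

# The 20 pH readings from the table in section 2.1.
ph = [
    7.45, 7.15, 7.47, 8.37, 5.74, 6.40, 8.05, 7.51, 8.45, 6.39,
    5.55, 7.91, 7.35, 6.54, 7.63, 5.85, 7.62, 6.98, 6.14, 6.92,
]

test_value = 7.0       # the pre-defined mean we compare against
T_CRIT_DF19 = 2.093    # assumed: two-tailed 5% critical t for df = 19 (t-table)

n = len(ph)
mean = statistics.mean(ph)
sd = statistics.stdev(ph)              # sample SD
se = sd / math.sqrt(n)                 # standard error of the mean
t = (mean - test_value) / se           # standardized difference
ci_lower = mean - T_CRIT_DF19 * se     # 95% CI of the mean
ci_upper = mean + T_CRIT_DF19 * se

print(f"mean = {mean:.2f}, SD = {sd:.2f}, SE = {se:.2f}, t = {t:.3f}")
print(f"95% CI = [{ci_lower:.2f}, {ci_upper:.2f}]")
```

Because |t| is far below the critical value, the test value 7 lies inside the interval, which matches the decision not to reject Ho.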

2.2 Paired sample T tests / Dependent T test
Uses: To compare the mean of two related variables (same things / persons)

A study on the effectiveness of a treatment for hypertension. The BP (in mmHg) before the treatment and after the treatment is measured. The readings are as below:

Patient   BP before   BP after
1         180         170
2         160         155
3         160         161
4         170         172
5         150         145
6         165         160
7         155         150
8         135         133
9         140         140
10        167         167
11        127         120
12        145         147
13        166         150
14        145         125
15        149         140
16        150         150

a. How to key in the data?

Transform > Compute Variable
(Before – after) or vice versa
By doing this, SPSS will give you the differences, which you will need for the descriptive analysis (test for normality).
Note: we are not using the before and after columns for the descriptive analysis.

Analyze > Compare Means > Paired-Samples T Test

Ho = The BP before and after is the same
H1 = The BP before and after is not the same
b. What to write? / Interpretation
The descriptive analysis can be checked for the differences (before – after).
First, always check for normality!!

• The number of data points is 16, which is < 50.
• The Shapiro-Wilk test is used.
P-value = 0.053, which is > 0.05, thus normality can be assumed.
We can continue with the parametric test.
If it shows not normal, then go directly to the Wilcoxon Signed Rank test (non-parametric test).

• The mean BP before is 154 mmHg with SD 14.01
• The mean BP after is 149 mmHg with SD 15.17
This part is important for comparison (which is larger/smaller), but only if they are significantly different.

• The mean difference is 4.938 mmHg with an SD of 6.37 mmHg
• The t value is 3.1 with 15 degrees of freedom
• The p-value is 0.007, which is smaller than 0.05.
• The 95% confidence interval of the difference is [1.54, 8.33]
• The p-value is less than 0.05, so reject the null hypothesis; there is a significant change in the mean BP, in which the BP after is smaller than the BP before
• Thus, the treatment is effective
• We are 95% confident that the reduction in BP is between 1.54 mmHg and 8.33 mmHg.
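A paired t-test is a one-sample t-test on the differences, which is why the guide computes (before – after) first. The following standard-library sketch shows the calculation; the critical value 2.131 (two-tailed, 5%, df = 15) is an assumption read from a t-table, and the p-value itself is not computed.

```python
import math
import statistics

# BP readings (mmHg) for patients 1 to 16, from the table in section 2.2.
before = [180, 160, 160, 170, 150, 165, 155, 135,
          140, 167, 127, 145, 166, 145, 149, 150]
after = [170, 155, 161, 172, 145, 160, 150, 133,
         140, 167, 120, 147, 150, 125, 140, 150]

T_CRIT_DF15 = 2.131   # assumed: two-tailed 5% critical t for df = 15 (t-table)

# Same step as Transform > Compute Variable: one difference per patient.
diffs = [b - a for b, a in zip(before, after)]

n = len(diffs)
df = n - 1
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)
se_diff = sd_diff / math.sqrt(n)
t = mean_diff / se_diff                       # Ho: mean difference = 0
ci_lower = mean_diff - T_CRIT_DF15 * se_diff
ci_upper = mean_diff + T_CRIT_DF15 * se_diff

print(f"mean difference = {mean_diff:.3f} mmHg, SD = {sd_diff:.2f} mmHg")
print(f"t = {t:.1f}, df = {df}, 95% CI = [{ci_lower:.2f}, {ci_upper:.2f}]")
```

The interval does not contain 0 and t exceeds the critical value, which agrees with rejecting Ho.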

2.3 Independent sample T-test


Uses: To test if there is a difference in means between two groups (2 different / unrelated groups only!!)

A researcher wanted to know whether there is a difference in mean BMI between male students from School A and School B. 20 samples from A and 24 samples from B were chosen. Their BMI are as below:

School A              School B
32 31 34 28 21 28     20 24 29 23 24 26
22 34 32 30 26 22     23 25 24 22 26 24
25 26 26 32 24 24     34 21 18 21 28 24
29 32                 24 19 26 28 23 20

Objective: To determine if there is a difference in BMI between the groups.

a. How to key in the data?

• Key in the data in one row, no need to separate into different rows
• In the ‘Variable View’, label the schools as A and B
• Analyze > Compare Means > Independent-Samples T Test

b. What to write? / Interpretation

Ho = There is no difference in BMI between A and B
H1 = There is a difference in BMI between A and B

1st step: always check for normality (use the whole data; there is no need to separate into A and B while performing the descriptive analysis).

• The number of samples is 44, which is < 50
• Use the Shapiro-Wilk test
• The p-value is 0.092, which is > 0.05
• Thus, the data is normally distributed
Note: If not normal, then move directly to the Mann-Whitney test (non-parametric)

• The mean BMI of the 20 males in A is 27.90 with an SD of 4.128
• The mean BMI of the 24 males in B is 24.00 with an SD of 3.539

Levene’s test
- Used to determine the equality of variance
- Ho = same variance. Ha = different variance
- If p < 0.05, the variances are not the same (Ha). Then, move on to a non-parametric test! (does not meet one of the requirements for a parametric test)
- Mann-Whitney test

- From Levene’s test the p-value is 0.148, which is greater than 0.05. Thus, accept the null hypothesis. The equality of variance is assumed.
Only refer to the ‘Equal variances assumed’ part
• The two-tailed p-value is 0.002, which is less than 0.05
• The 95% confidence interval for the mean difference is [1.568, 6.232]

Interpret
• The p-value is less than 0.05, thus reject the null hypothesis; there is a significant difference between the mean BMI of School A and School B
• The mean BMI of School A (27.9) is higher than that of School B (24) (compare which is higher and which is lower)
• We are 95% confident that the difference is between 1.568
and 6.232
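The ‘equal variances assumed’ row corresponds to the pooled-variance t-test, which can be sketched with the standard library as below. Two assumptions are made here: the flattened data table is read as six School A columns followed by six School B columns (this reading reproduces the reported means and SDs), and the critical value 2.018 (two-tailed, 5%, df = 42) is taken from a t-table. Levene's test and the p-value are not computed.

```python
import math
import statistics

# Assumed split of the data table: first 6 values of each row are School A.
school_a = [32, 31, 34, 28, 21, 28, 22, 34, 32, 30,
            26, 22, 25, 26, 26, 32, 24, 24, 29, 32]
school_b = [20, 24, 29, 23, 24, 26, 23, 25, 24, 22, 26, 24,
            34, 21, 18, 21, 28, 24, 24, 19, 26, 28, 23, 20]

T_CRIT_DF42 = 2.018   # assumed: two-tailed 5% critical t for df = 42 (t-table)

n_a, n_b = len(school_a), len(school_b)
mean_a, mean_b = statistics.mean(school_a), statistics.mean(school_b)
sd_a, sd_b = statistics.stdev(school_a), statistics.stdev(school_b)

# Pooled variance: both groups are assumed to share one variance.
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
se_diff = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

mean_diff = mean_a - mean_b
t = mean_diff / se_diff
ci_lower = mean_diff - T_CRIT_DF42 * se_diff
ci_upper = mean_diff + T_CRIT_DF42 * se_diff

print(f"A: mean = {mean_a:.2f}, SD = {sd_a:.3f}; B: mean = {mean_b:.2f}, SD = {sd_b:.3f}")
print(f"t = {t:.3f}, df = {df}, 95% CI of difference = [{ci_lower:.3f}, {ci_upper:.3f}]")
```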

2.4 ANOVA

Uses: To test if there is a difference in means between more than two groups (3 or
more different / unrelated groups only!!)

A researcher wanted to test the effect of 4 different types of drugs (A, B, C, D) on
weight loss. 20 mice were randomly selected and randomly fed with the drugs. The
weight loss of the mice was recorded as below:

Drugs
A B C D
8 5 6 3
8 6 5 8
7 5 5 7
8 6 6 2
9 3 8 5
Objective: To test if there is a difference in weight loss between different drugs

a. How to key in data?

• Change the type of data in the ‘Drugs’ row into ‘Numeric’. The ‘Measure’ remains as Nominal.
• Key in the data in one row, no need to separate into 4 different rows
• If you want to do the descriptive analysis, take all the data; no need to separate
• Label the drugs as A, B, C and D

Analyze > Compare Means > One-Way ANOVA

At post hoc… At contrast…

At option…

Ho = There is no difference in weight loss between usage of different drugs
H1 = There is a difference in weight loss between usage of different drugs

b. What to write / interpretation

• Since the sample size is 20, which is less than 50, the Shapiro-Wilk test is used.
• The p-value is 0.170, which is greater than 0.05.
• Thus, the data can be assumed to be normally distributed
Note: if the p-value < 0.05, the data is not normally distributed; then move directly to the Kruskal-Wallis test (non-parametric test)

• The mean weight loss for A is 8 ± 0.707 kg
• The mean weight loss for B is 5 ± 1.225 kg
• The mean weight loss for C is 6 ± 1.225 kg
• The mean weight loss for D is 5 ± 2.550 kg

• The p-value from Levene’s test is 0.062, which is greater than 0.05
• Thus, accept the null hypothesis and reject the alternative hypothesis
• The equality of variances is assumed (meets the requirement for the parametric test, can continue with ANOVA)
Note:
• If p < 0.05, equality is not assumed; move on to the non-parametric Kruskal-Wallis test

• The F-value is 4, with 3 and 16 degrees of freedom between groups and within groups respectively.
• The p-value is 0.027, which is smaller than 0.05.

Interpret
• The p-value is less than 0.05. Thus, reject the null hypothesis and accept the alternative hypothesis.
• At least one pair of means differs significantly.
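The F-value in the ANOVA table comes from comparing the variation between the group means with the variation within the groups. The sketch below reproduces that calculation with plain Python; it is an illustration of the one-way ANOVA arithmetic only, and it does not compute the p-value or the post hoc comparisons.

```python
# Weight loss per drug, read column by column from the table in section 2.4.
groups = {
    "A": [8, 8, 7, 8, 9],
    "B": [5, 6, 5, 6, 3],
    "C": [6, 5, 5, 6, 8],
    "D": [3, 8, 7, 2, 5],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-groups sum of squares: how far each group mean is from the grand mean.
ss_between = sum(
    len(values) * (sum(values) / len(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-groups sum of squares: spread of each value around its own group mean.
ss_within = sum(
    (v - sum(values) / len(values)) ** 2
    for values in groups.values()
    for v in values
)

df_between = len(groups) - 1                 # k - 1
df_within = len(all_values) - len(groups)    # N - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_value = ms_between / ms_within

print(f"SS between = {ss_between}, SS within = {ss_within}")
print(f"F({df_between}, {df_within}) = {f_value}")
```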

B, D and C are in the same subset. Thus, there is no difference between these three.

C and A are in the same subset. Thus, there is no difference between these two.

• Thus we can say: C does not significantly differ from any of the other three drugs
• But there is a significant difference between A and both B and D (different subsets)
• Weight loss in A is significantly larger compared to both B and D (what is the difference?)

• The table further validates the statement that we can see from the columns.
• The p-value between A and both B and D is 0.038, which is less than 0.05. Thus, there is a significant difference between them
• Meanwhile, the p-value between A and C is 0.229, which is greater than 0.05, so there is no significant difference between them
3. Extra
Extra 1: Degree of freedom (not so important can skip)

What does degree of freedom mean?

Eg: I have a set of numbers 1, 2, 3, 4, 5, 6. The mean of this set of numbers is 3.5.

FYI (fact): the sum of all the differences between the numbers and the mean is equal to zero.
(Don't believe it? Let me show you)

1 – 3.5 = - 2.5
2 – 3.5 = - 1.5
3 – 3.5 = - 0.5
4 – 3.5 = 0.5
5 – 3.5 = 1.5
6 – 3.5 = 2.5
Sum = 0 (Tada, 0 right?!)
So, what does the degree of freedom mean?
It means that when you work out the calculation, we know that we must get 0 as the final
answer. So out of the 6 numbers here, only 5 numbers can vary freely, whatever the
numbers are, but the last number must be a fixed number in order for us to get 0.
Like in the equation below, X must be 2.5 for us to get 0!
- 2.5 - 1.5 - 0.5 + 0.5 + 1.5 + X = 0
Or
3.4 + 4.6 – 4.5 - 2 + 4 + y = 0

- I randomly chose these 5 numbers
- But how should I get 0 in the end?
I should fix y as -5.5 (because the first five numbers add up to 5.5)

So I have the freedom to choose 5 numbers.
Thus the degree of freedom is 5

Or by formula: df = n - 1
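The idea can be checked with a few lines of Python. This is only an illustration of the argument above: the deviations from the mean always sum to zero, so once n - 1 of them are chosen, the last one is forced.

```python
values = [1, 2, 3, 4, 5, 6]
mean = sum(values) / len(values)                 # 3.5
deviations = [v - mean for v in values]
print(deviations, "sum =", sum(deviations))      # the deviations cancel out

# Choose any 5 numbers freely; the 6th is then fixed by the "sum must be 0" rule.
free_choices = [3.4, 4.6, -4.5, -2, 4]
last_value = -sum(free_choices)
print("last value must be", round(last_value, 1))

df = len(values) - 1                             # df = n - 1
print("degrees of freedom =", df)
```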

Suggested video: https://youtu.be/VIlVWeUQ0vs

Extra 2: Standard error of the mean (SEM) (quite important, but hard; you can watch the videos at the
links on the last page for better understanding)

First we must understand what the standard deviation (SD) is:
- SD simply means how much the data deviate from the mean
- So, SD measures the dispersion of the data
- If the SD is very SMALL, all the data are quite close to the mean
- Eg: 3, 3, 3.2, 3.5, the mean is 3.175 (very close, so the SD must be very small)
- If the SD is very BIG, the data are far from the mean
- Eg: 1, 8, 10, 30, the mean is 12.25 (very far, so the SD must be very big)

However, the ‘standard error of the mean’ is the variation of multiple means taken from the same population.

For example, I want to measure the time taken by Ali in a 100 m sprint. In one set of the
experiment he ran 5 times, and the mean was obtained.
Then I told him to repeat a few more sets on subsequent days (set 2, 3, 4, ...), and the mean of
each set was obtained.

        Set
        1       2       3    …
        50      51      …    …
        47      54      …    …
        52      46      …    …
        50      50      …    …
        48      50      …    …
mean    49.4    50.2    …    …

So, SEM measures the variation / dispersion of the set means around the average of the whole experiment (set 1, 2, 3, ...).

Why is it important?
- Because the standard error allows us to calculate the confidence interval
- Thus, it allows us to use this data not only for a certain group of people, but also to estimate
the true population mean.
While the standard deviation just tells us that:
Eg: mean = 49, SD = 1
If I told Ali to run again in a single set of the experiment, the time taken by him would probably be 49 ± 1

• But repeating the experiment for many sets is time consuming and uses up a lot of energy, so
the statisticians came up with the equation:

$$SE = \frac{\text{Standard deviation}}{\sqrt{n}}$$
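The formula can be checked by simulation: generate many sets, take the mean of each set, and compare the SD of those means with SD / √n. The sketch below does this with made-up numbers; the population mean of 50, population SD of 2, the 5 runs per set and the 5000 sets are all hypothetical values chosen only for illustration.

```python
import math
import random
import statistics

random.seed(0)           # fixed seed so the run is repeatable

POP_MEAN = 50.0          # hypothetical "true" mean sprint time
POP_SD = 2.0             # hypothetical SD of a single run
RUNS_PER_SET = 5         # n: runs in one set
N_SETS = 5000            # number of repeated sets

# Mean of each simulated set of runs.
set_means = []
for _ in range(N_SETS):
    runs = [random.gauss(POP_MEAN, POP_SD) for _ in range(RUNS_PER_SET)]
    set_means.append(sum(runs) / RUNS_PER_SET)

sd_of_means = statistics.stdev(set_means)          # spread of the set means
formula_se = POP_SD / math.sqrt(RUNS_PER_SET)      # SE = SD / sqrt(n)

print(f"SD of the set means   = {sd_of_means:.3f}")
print(f"SD / sqrt(n) formula  = {formula_se:.3f}")
```

The two printed values should be close, which is why a single set plus the formula can stand in for repeating the experiment many times.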

What is the use of SE?

1. Calculate the confidence interval:

95% confidence interval (mean ± 2 SE):

Lower bound = mean – 2(standard error)

Upper bound = mean + 2(standard error)

eg:

SE = 1.198
Lower bound = 64.75 – 2 (1.198) = 62.35
Upper bound = 64.75 + 2 (1.198) = 67.15

So this means that we are 95% confident that the mean of…….. of the population will be
between 62.35 and 67.15
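The worked example above uses the exam marks from section 1 (mean 64.75, SD 9.28, n = 60). A minimal sketch of the same arithmetic is given below; the multiplier 2 is the rounded rule of thumb used in this guide, not the exact critical value SPSS would use.

```python
import math

# Summary values reported for the exam marks in section 1.
mean = 64.75
sd = 9.28
n = 60

se = sd / math.sqrt(n)        # SE = SD / sqrt(n)
lower = mean - 2 * se         # rule-of-thumb 95% CI: mean ± 2 SE
upper = mean + 2 * se

print(f"SE = {se:.3f}")
print(f"95% CI = [{lower:.2f}, {upper:.2f}]")
```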

Example: A researcher wanted to find out the effect of revision on the outcome of the final
exam. The data is as below:

          Do revision   Didn’t do revision
          69            47
          54            68
          80            52
mean      68            56
2x SEM    15            13
When we plot a graph of the MEAN we get:

[Bar chart: mean mark (y-axis, 0 to 80) for "Do revision" and "Didn’t do revision". The error bars represent the confidence interval that we have calculated using the SEM. The error bars overlap, and the overlapping region includes the mean of a sample.]

What does it mean?

Look at the red area.
It shows the overlapping area of the error bars. If the overlapping region includes the
mean of a sample, then:
- There is no evidence of a difference between the two populations
- Do not reject (accept) Ho.
- No difference is shown between doing and not doing revision on the final exam.

Hmmm, how about we increase the number of data?

Do revision Didn’t do revision


69 47
54 68
80 52
77 48
72 70
mean 70 57
2x SEM 9 10

23
[Bar chart: mean mark (y-axis, 0 to 80) for "Do revision" and "Didn’t do revision", with error bars of ± 2 × SEM]

What does it mean?

Look at the red area.
It shows the overlapping area of the error bars. If the overlapping region DOES NOT
include the mean of either sample, then:
- There is no strong evidence of either similarity or difference between the
populations
- The uncertainty is too high, so we cannot conclude anything from here

Let's increase the sample again for the last time.

          Do revision   Didn’t do revision
          69            47
          54            68
          80            52
          77            48
          72            70
          83            65
          93            45
          81            44
          85            59
          75            55
mean      77            55
2x SEM    7             6
[Bar chart: mean mark (y-axis, 0 to 90) for "Do revision" and "Didn’t do revision", with error bars of ± 2 × SEM]

What does it mean?

Look at the red area.
It shows that there is no overlapping of the error bars:
- Strong evidence that there is a difference between the populations
- Reject Ho, accept H1.
- There is a difference between doing and not doing revision on the final
exam.
- Students who do revision score higher marks than students who
did not do their revision
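The three cases above (overlap that includes a sample mean, overlap that excludes the means, no overlap) can be reproduced from the raw marks. The sketch below is one way to do it with the standard library; the function names `mean_and_bar` and `compare` are made up for this illustration, and the error bar is mean ± 2 × SEM as in the tables above.

```python
import math
import statistics

# Marks from the final (10 students per group) table; the earlier tables
# are the first 3 and the first 5 rows of the same data.
do_revision = [69, 54, 80, 77, 72, 83, 93, 81, 85, 75]
no_revision = [47, 68, 52, 48, 70, 65, 45, 44, 59, 55]


def mean_and_bar(sample):
    """Return (mean, half-width of the error bar), where the bar is 2 x SEM."""
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return statistics.mean(sample), 2 * sem


def compare(a, b):
    """Classify how the two error bars relate, following the three cases above."""
    (mean_a, bar_a), (mean_b, bar_b) = mean_and_bar(a), mean_and_bar(b)
    overlap_low = max(mean_a - bar_a, mean_b - bar_b)
    overlap_high = min(mean_a + bar_a, mean_b + bar_b)
    if overlap_low > overlap_high:
        return "no overlap"
    if overlap_low <= mean_a <= overlap_high or overlap_low <= mean_b <= overlap_high:
        return "overlap includes a sample mean"
    return "overlap excludes both sample means"


results = {n: compare(do_revision[:n], no_revision[:n]) for n in (3, 5, 10)}
for n, verdict in results.items():
    m_do, bar_do = mean_and_bar(do_revision[:n])
    m_no, bar_no = mean_and_bar(no_revision[:n])
    print(f"n = {n}: {m_do:.0f} ± {bar_do:.0f} vs {m_no:.0f} ± {bar_no:.0f} -> {verdict}")
```

As the sample size grows, the error bars shrink, which is why the same underlying difference only becomes visible in the largest sample.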

For better understanding:

Why does overlapping mean no difference, while no overlapping means there is a difference?


https://youtu.be/OoSsp16MFng
Start from 9:25

Standard deviation vs standard error of mean


https://youtu.be/3UPYpOLeRJg

How to put error bar in Excel graph


https://youtu.be/6BcccTH_2Ig
