41-47 Introductory Biostatistics Notes - Osmosis
Biostatistics
During course: statistical analysis
Exam: 30 MCQ, 30 min
MCQ with theoretical and practical questions (worth 50%). No ChatGPT.
https://www.ncbi.nlm.nih.gov/pubmed/
https://scholar.google.lv/
https://bestpractice.bmj.com/info/evidence-information/
https://www.cochrane.org
Lesson 1
Statistics: a branch of mathematics dealing with the collection, organization, analysis, interpretation and presentation of data.
Sample vs. population, e.g. number of people in a group, mean age.
o Dispersion analysis
o Factor analysis
o Covariance statistics
Types of data:
· Nominal: questions with predetermined answers, e.g. overweight subjects. Described by count, percentage.
· Ordinal: e.g. VAS, a rating scale used to measure opinions, attitudes, or behaviors; stage of cancer. No mean values, median value only.
· Numerical (quantitative): interval, ratio. The descriptive statistics parameters depend on the distribution.
SPSS notes:
· SPSS can only open one worksheet at a time.
· Names: G2 MTX A/B
· Colors aren't read by SPSS.
Preparing data:
· Raw data.
· Age: only divide into years, months, or age with decimals.
· Gestation: one column for weeks, one column for days.
· No words, e.g. F = 0, M = 1; codify worded data.
· Be consistent and accurate with data entry.
Descriptive statistics
Biostatistics
RSU, The Statistics Unit
Descriptive Statistics
Descriptive statistics involves summarizing and organizing the
data so they can be easily understood.
Most commonly used for categorical data, especially nominal data:
• Frequency distribution
• Percentage
• Relative frequency
https://www.spss-tutorials.com/spss-stacked-bar-charts-percentages/
Frequency Distribution
[Figure: three bar charts of frequency distributions; x-axis categories 1 to 5, y-axis 0 to 25]
Count or %?
N    Year       Count (male)   Count (female)   % (male)   % (female)
60   1st year   20             40               33         67
30   3rd year   20             10               67         33
[Figure: bar charts of male and female students in 1st year and 3rd year, shown as counts and as percentages]
Frequency Distribution
Age Group   Frequency (Count)   Relative frequency / Percentage
0-19        0                   0/30 = 0.0000 = 0.00%
20-29       2                   2/30 = 0.0667 = 6.67%
30-39       5                   5/30 = 0.1667 = 16.67%
40-49       6                   6/30 = 0.2000 = 20.00%
50-59       8                   8/30 = 0.2667 = 26.67%
60-69       5                   5/30 = 0.1667 = 16.67%
70-79       4                   4/30 = 0.1333 = 13.33%
Total:      30                  100%
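The relative frequencies in the table above can be reproduced with a few lines of Python. This is only a sketch: the counts are copied from the table, and the variable names are illustrative.

```python
# Age-group counts copied from the frequency table above.
counts = {
    "0-19": 0, "20-29": 2, "30-39": 5, "40-49": 6,
    "50-59": 8, "60-69": 5, "70-79": 4,
}

total = sum(counts.values())

# Relative frequency = count / total; percentage = relative frequency * 100.
percentages = {}
for group, count in counts.items():
    relative = count / total
    percentages[group] = round(100 * relative, 2)
    print(f"{group:>6}  {count:>2}  {relative:.4f}  {percentages[group]:.2f}%")

print(f"Total: {total}")
```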
Frequency Distribution
Mode - Most often patients stay for 10 days in the hospital after surgery.
Median - 50% of the patients stay for 15 days or less in the hospital after surgery.
Mean - On average patients stay for 20 days in the hospital after surgery.
In this example, the mode is useful since hospitals can estimate bed occupancy days. The median is useful for patients to evaluate the duration of a hospital stay. The mean is not useful because of extreme outliers!
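A short illustration of how a single outlier pulls the mean away from the mode and median. The lengths of stay below are hypothetical; they were chosen only so that the three summaries match the example (mode 10, median 15, mean 20).

```python
from statistics import mean, median, mode

# Hypothetical lengths of stay in days (not real data). The single
# extreme value (65) inflates the mean but not the mode or median.
stays = [10, 10, 10, 14, 15, 16, 18, 22, 65]

stay_mode = mode(stays)      # most frequent value
stay_median = median(stays)  # middle value of the sorted data
stay_mean = mean(stays)      # arithmetic average

print(f"mode = {stay_mode}, median = {stay_median}, mean = {stay_mean}")
```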
Measures of variability (spread)
• Range (Max - Min)
• Variance (how spread out the data are)
• Standard deviation (SD)
• Interquartile range
σ^2 = Σ(xi - μ)^2 / n
Where:
σ^2 represents the variance.
xi represents each individual data point.
μ represents the mean (average) of the dataset.
n is the total number of data points in the dataset.
Standard deviation
https://www.mathsisfun.com/data/standard-deviation.html
Standard deviation and Variance
xi = each data point, μ = mean, σ = sigma (standard deviation).
Variance = s^2; the standard deviation (SD) is its square root, s.
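As a sketch, the variance formula can be applied step by step and compared with Python's `statistics` module. The data are hypothetical. Note the assumption being illustrated: dividing by n gives the population variance (σ^2), whereas the sample variance (s^2) reported by most statistical software divides by n - 1.

```python
from statistics import pstdev, pvariance, variance

# Hypothetical data with mean 5, chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mu = sum(data) / n

# Population variance: average squared distance from the mean (divide by n).
manual_variance = sum((x - mu) ** 2 for x in data) / n
manual_sd = manual_variance ** 0.5

print(f"manual: variance = {manual_variance}, SD = {manual_sd}")
print(f"statistics: pvariance = {pvariance(data)}, pstdev = {pstdev(data)}")

# Sample variance divides by n - 1 instead, so it is slightly larger.
print(f"sample variance = {variance(data):.3f}")
```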
Percentiles
Percentiles split the data into 100 equal parts.
Quartiles
Quartiles split the data into quarters (four equal parts):
Q0 zero quartile; the same as the minimum value and 0th percentile;
Q1 first quartile; the middle number between the minimum value and the median of the data set, the same as the 25th percentile;
Q2 second quartile; the same as the median and 50th percentile;
Q3 third quartile; the middle value between the median and the maximum value of the data set, the same as the 75th percentile;
Q4 fourth quartile; the same as the maximum value and 100th percentile.
IQR = Q3 - Q1 = interquartile range.
Boxplot
Data sorted in ascending order: 1, 2, 2, 2, 4, 5, 6, 8, 8, 9, 9, 10, 12, 12
[Figure: box plot of the data marking Min, Q1, median, Q3 and Max; IQR = interquartile range]
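The quartile definitions can be tried on the sorted data from the boxplot example. The sketch below takes Q1 and Q3 as the medians of the lower and upper halves, which follows the wording of the definitions above. This is an assumption about the intended method: SPSS and other packages use interpolation rules that can give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd number of values the median itself is left out of both halves.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

data = [1, 2, 2, 2, 4, 5, 6, 8, 8, 9, 9, 10, 12, 12]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1

print(f"Min = {min(data)}, Q1 = {q1}, median = {q2}, Q3 = {q3}, Max = {max(data)}")
print(f"IQR = {iqr}")
```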
Box Plot & Histogram
https://openlab.citytech.cuny.edu/math1272statistics-fall2013-ganguli/2013/09/30/example-boxplots-olympic-athletes/
Summary
How to describe data?
Type           Example                       Describe with
Nominal        Males/Females                 Count, Percentage
Ordinal        Stage of the cancer (I-IV)    Count, Percentage
Ordinal        Pain scale (0 to 10)          Median, IQR
Quantitative   No normal distribution        Median, IQR
Quantitative   Normal distribution           Mean, SD or Median, IQR
How to write the result
Follow APA/AMA style or recommendations from publications on how to write the result correctly. Here are some examples.
Choose «Percentiles» in order to obtain values of quartiles.
Central tendencies and measures of spread II
Check the number of valid and missing values. When multiple variables are analyzed at the same time, only participants with valid values in all variables are included.
First quartile, median, third quartile.
Frequencies for multiple choice questions
Write in a new name for the whole variable set, click «Add» and «Close». Open «Multiple response» again, and now the option «Frequencies» is available.
Skewness measures the symmetry or lack of symmetry in a dataset.
Kurtosis measures the thickness or thinness of the tails of a distribution.
For both, the closer to zero the better; skewness is the better indicator of the two.
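A minimal sketch of how skewness and kurtosis can be computed from the moments of the data. It uses the simple moment-based formulas as an assumption; statistical packages such as SPSS report bias-adjusted versions, so their values can differ slightly for small samples. The data are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Moment-based skewness and excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]            # perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]     # long tail to the right

sym_skew, sym_kurt = skewness_and_kurtosis(symmetric)
right_skew, right_kurt = skewness_and_kurtosis(right_skewed)

print(f"symmetric:    skewness = {sym_skew:.2f}, kurtosis = {sym_kurt:.2f}")
print(f"right-skewed: skewness = {right_skew:.2f}, kurtosis = {right_kurt:.2f}")
```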
Hypothesis Testing and Qualitative Data Analysis
Hypothesis Testing
[Figure: bar chart of days to recovery for New drug and Placebo; bar values 12,4 and 11,5]
Research hypothesis: Use of new drugs for the treatment of the disease shortens the time to recovery.
Is this information enough to compare average time to recovery?
If the difference would be 7 days (with placebo 11.5 and with drugs 4.5 days)? Is it enough now?
Where to draw the line to tell the difference is significant?
Example:
White blood cell count is not different in patients with lung cancer treated with chemotherapy or radiation.
H0 – null hypothesis
Ha – alternative hypothesis
http://www.socialresearchmethods.net/kb/stat_t.htm
https://www.slideshare.net/NirajanBam/hypothesis-testing-38588210
Alternative (HA or H1) Hypothesis: One-tailed vs Two-tailed
Two sided/tailed (used where we don't know which direction the difference will take; mostly used in medicine)
White blood cell count is statistically different in patients with lung cancer treated with chemotherapy or radiation.
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-the-differences-between-one-tailed-and-two-tailed-tests/
One sided/tailed
White blood cell count in patients with lung cancer treated with chemotherapy is significantly higher than for patients treated with radiation.
Or
White blood cell count in patients with lung cancer treated with chemotherapy is significantly lower than for patients treated with radiation.
P-value
4.3. Two-tailed p-value (shaded area) is the probability of an observed (or more extreme) result assuming that the null hypothesis is true.
• If p-value > α, then do not reject the H0.
In this example the critical t-value is 2.06.
[Figure: probability density of the t-value, with observed t-values -1.67 and +1.67 and critical values -2.06 and +2.06; the tails beyond the critical values are labelled "very unlikely observation". Annotation: "... data) that shows the likelihood of the difference between two means being not significant".]
5. Reject or fail to reject
• Or we can say that the P-value is the probability of the difference being gained due to chance.
• If α = 0.05 and the P-value is between 0.05 and 1: H0 is not rejected. Conclusion – the difference is not statistically significant.
Type I and Type II Errors
• Choosing a 95% confidence level, there may be a 5% risk that the null hypothesis is rejected wrongly – type 1 error (α=0,05).
• There may be a situation where the null hypothesis is false, but it is not rejected – type 2 error (probability β).
• Decision to reject or fail to reject H0 is based only on α.
Power
• Statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down. Power = 1 - β.
• Typically used type II error rates:
β = 0.20
β = 0.10
β = 0.05
• Statistical power is affected chiefly by the size of the effect and the size of the sample used to detect it. Bigger effects are easier to detect than smaller effects, while large samples offer greater test sensitivity than small samples.
• Usually power is required to be not less than 0.8 (80%), and it is chosen before the study.
• Sometimes a power calculation is requested for results that are not statistically significant.
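The dependence of power on effect size and sample size can be illustrated with a rough calculation. The sketch below assumes a two-sided comparison of two group means with equal group sizes and uses a normal approximation. The formula, the function name and the example numbers are assumptions for illustration, not taken from the lecture, and dedicated software gives more exact values.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power for comparing two means (normal approximation).

    effect_size is the standardized difference (difference in means / SD).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value, about 1.96
    return z.cdf(effect_size * sqrt(n_per_group / 2) - z_crit)

power = approx_power(effect_size=0.5, n_per_group=64)
beta = 1 - power
print(f"power = {power:.2f}, beta = {beta:.2f}")

# Bigger samples and bigger effects both increase power.
print(f"n = 128:      {approx_power(0.5, 128):.2f}")
print(f"effect = 0.8: {approx_power(0.8, 64):.2f}")
```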
https://www.statisticsteacher.org/2017/09/15/what-is-power/
https://effectsizefaq.com/2010/05/31/what-is-statistical-power/
Clinical significance: other factors, removing effects.
The bigger the sample size, the smaller the p-value will be; the smaller the sample, the bigger the p-value.
Sample Size
How to Calculate Sample Size for Different Study Designs in Medical Research?
Jaykaran Charan and Tamoghna Biswas
Indian J Psychol Med. 2013 Apr-Jun; 35(2): 121–126.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3775042/
https://wnarifin.github.io/ssc_web.html
Tests of Normality
https://onlinecourses.science.psu.edu/stat100/book/export/html/698
“The normality tests are supplementary to the graphical assessment of normality. The main tests for the assessment of normality are Kolmogorov-Smirnov (K-S) test, Lilliefors corrected K-S test, Shapiro-Wilk test, Anderson-Darling test, Cramer-von Mises test, D’Agostino skewness test, Anscombe-Glynn kurtosis test, D’Agostino-Pearson omnibus test, and the Jarque-Bera test.”
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3693611/
Keep in mind that tests of normality are not perfect; use them together with other methods.
In SPSS you can find the Kolmogorov-Smirnov and Shapiro-Wilk tests. The Shapiro-Wilk test is mostly recommended as the best choice for testing the normality of data.
• H0 = The shape of the actual data distribution is the same as for the theoretical normal distribution. (Sample comes from a normally distributed population.)
• H1 = The shape of the actual data distribution is not the same as for the theoretical normal distribution.
Practical work
Difference between proportions: 2x2 and RxC tables
Independent samples: Pearson's chi-squared test / Fisher's exact test
Paired samples: McNemar's test
2x2 table (SPSS: Crosstabs), e.g. Male by Smoker (Yes / No)
RxC table, e.g. Male by Oral health (Poor / Fair / Good); exact test: Fisher-Freeman-Halton exact test
Chi-squared vs Fisher's Exact Test
You can use the Chi-squared test if:
• at least 80% of expected values are ≥ 5
• all expected values are ≥ 1
Otherwise use Fisher's Exact test.
In the SPSS output, check the minimum expected count: if it is ≥1, you can use the chi-square test results; if it is <1, you need to use Fisher's Exact Test results.
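The rule can be checked by computing the expected count for every cell (row total × column total / N). The sketch below assumes the usual reading of the rule: at least 80% of expected counts are 5 or more and none is below 1. Both tables are hypothetical.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared_allowed(table):
    """True if >= 80% of expected counts are >= 5 and all are >= 1."""
    cells = [e for row in expected_counts(table) for e in row]
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return share_at_least_5 >= 0.8 and min(cells) >= 1

small_table = [[3, 7], [2, 8]]      # hypothetical small sample
large_table = [[30, 20], [25, 25]]  # hypothetical larger sample

for name, table in [("small", small_table), ("large", large_table)]:
    test = "chi-squared test" if chi_squared_allowed(table) else "Fisher's exact test"
    print(f"{name}: expected = {expected_counts(table)} -> use {test}")
```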
Choose «Adjusted standardized» to gain additional information.
Chi-squared Test
Fisher's Exact test for R x C tables
H0: There is no association between gender and smoking.
HA: There is an association between gender and smoking.
Choose “Phi” and “Cramer’s V” (for ordinal data: Kendall's tau-b).
Strength of Association
Conclusion
Only 18,5% of those who eat breakfast have gastritis, as compared to 73,3% of those who don’t eat breakfast. This association between the habit of eating breakfast and having gastritis is statistically significant (chi-squared test, χ2(1,N=42)=12,286, p<0,001) and shows a large effect (Phi = 0,541).
http://www.real-statistics.com/chi-square-and-f-distributions/effect-size-chi-square/
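The reported chi-squared statistic and Phi can be reproduced with the shortcut formula for a 2x2 table. The cell counts below are not given in the notes; they are an assumption reconstructed from the reported percentages (5 of 27 = 18,5%, 11 of 15 = 73,3%, N = 42).

```python
from math import sqrt

# Assumed counts reconstructed from the reported percentages.
# Rows: the two breakfast groups; columns: gastritis yes / no.
table = [[5, 22],   # group with 18.5% gastritis (5 of 27)
         [11, 4]]   # group with 73.3% gastritis (11 of 15)

(a, b), (c, d) = table
n = a + b + c + d

# Pearson chi-squared for a 2x2 table (no continuity correction).
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Phi: effect size for a 2x2 table.
phi = sqrt(chi2 / n)

print(f"chi2(1, N={n}) = {chi2:.3f}, Phi = {phi:.3f}")
```

With these assumed counts the sketch returns the same values as the conclusion above, which supports the reconstruction but does not prove it.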
McNemar's Test
Test for qualitative data, repeated samples. Often used to estimate the effectiveness of the treatment.
• H0: Proportion of non-smokers before intervention = Proportion of non-smokers after intervention
• H1: Proportion of non-smokers before intervention ≠ Proportion of non-smokers after intervention
• 25 (53.2%) non-smokers and 22 (46.8%) smokers started the research.
• After the intervention the count of non-smokers reached 36 participants (76.6% of 47), the count of smokers decreased to 11 participants (23.4% of 47).
• 16 smokers quit smoking, but 5 non-smokers started to smoke.
• There was a statistically significant difference in the proportion of non-smokers pre- and post-intervention (p=0.027).
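McNemar's test uses only the participants who changed category (here 16 who quit and 5 who started). A minimal sketch of the exact (binomial) version follows. That SPSS used the exact version for this example is an assumption, although it reproduces the reported p-value.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under H0 a change in either direction is equally likely, so the smaller
    count follows a binomial distribution with n = b + c and p = 0.5.
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

quit_smoking = 16     # smokers who became non-smokers
started_smoking = 5   # non-smokers who became smokers

p_value = mcnemar_exact(quit_smoking, started_smoking)
print(f"p = {p_value:.3f}")

# Consistency check with the counts above: 25 - 5 + 16 non-smokers afterwards.
non_smokers_after = 25 - started_smoking + quit_smoking
print(f"non-smokers after intervention: {non_smokers_after}")
```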
Choose percentage from the total amount of participants.
Chart builder & Categorical Data I
Click «OK», if you have assigned proper measurement scales for your variables in the «Variable View».
Chart builder & Categorical Data II
Choose the type and subtype of the plot you want to use.
Drag and drop variables from the list to defined places.
Correlation
Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. When we have two variables we can ask about the degree to which they co-vary.
[Figure: correlation between Total Mental Toughness (MT) and Medicine Ball Throw (MBT) performance]
Nota Bene!
https://prometejs.wordpress.com/2012/11/15/sokolade-un-nobela-premija-korelacija-un-celonsakariba/
Correlation, not Causation
Correlation coefficient
• Computation of the correlation coefficient results in a number between -1.00 and 1.00.
• The number can be positive or negative.
• Picture represents perfect linear correlation.
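A small sketch of how the (Pearson) correlation coefficient is computed, using hypothetical data with a perfect positive and a perfect negative linear relationship. The function name is illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled to the range -1..1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # y rises as x rises
y_down = [10, 8, 6, 4, 2]   # y falls as x rises

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(f"perfect positive: r = {r_up:.2f}")
print(f"perfect negative: r = {r_down:.2f}")
```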
Positive and Negative Correlation
Negative Correlation
A negative correlation does not mean that the result is bad.
Strength of the linear relationship
The description of the strength of a correlation coefficient differs between fields of science. A more detailed description is better.
https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/#R4
Non-linear Correlation
https://opentextbc.ca/researchmethods/chapter/describing-statistical-relationships/
Subsets of data can show a different situation than the whole data set. Explore your data to avoid misleading conclusions (Simpson’s paradox).
Correlation and the Range of Data
For all ages r = −0.77, for 18- to 24-year-olds (in the blue box) r = 0.
If α = 0.05:
• P-value between 0.05 and 1: H0 is not rejected. Conclusion – the correlation is not statistically significant.
• P-value between 0 and 0.05: H0 is rejected and HA is accepted. Conclusion – the correlation is statistically significant.
P-value in Correlation
• If the sample size is small, correlation has to be stronger to be statistically significant (example: r=0,846, p=0,358).
• In large samples correlation can be weak and it still might be statistically significant (example: r=0,141, p=0,035).
Pearson's and Spearman's correlation coefficient
Pearson's correlation coefficient (linear correlation):
• Both variables are quantitative, and both are normally distributed. In the scatter-plot no non-linearity is observed, no outliers. N ≥ 30.
Spearman's correlation coefficient:
• Both variables are quantitative but one or both are not normally distributed, N < 30. In the scatter-plot the relationship is nonlinear or with outliers.
• One or both variables are ordinal.
Pearson Correlation Coefficient
• Two variables should be measured at the interval or ratio level.
• There is a linear relationship between your two variables.
• There should be no significant outliers.
• Variables should be approximately normally distributed.
• Random samples are used.
• Independent samples (no manipulation with one of the variables, and only one measurement from each subject) are used.
• Sample size > 30.
(If you want to assess how the same thing is measured using two different methods, look for intraclass correlation.)
Pearson Correlation Coefficient
[Figure: scatter plots with r=0,40 and r=0,70]
Choose the appropriate coefficient and click «OK».
The output shows:
• Correlation coefficient (0,163)
• P-value (0,147)
• Number of participants (sample size)
Spearman’s Rank Correlation Coefficient
• One or both variables are measured on an ordinal scale.
• One or both variables are not normally distributed.
• The sample size is small (<30).
• The relationship between both variables is non-linear.
If any of those statements is true, use Spearman's Rank Correlation Coefficient.
Plot your data! For a U-shaped relationship no coefficient will show the proper strength of the relationship.
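Spearman's coefficient is the Pearson coefficient computed on ranks instead of raw values. The sketch below uses hypothetical data to show two points from the list above: a non-linear but steadily increasing relationship gives a perfect Spearman coefficient, while a U-shaped relationship gives a coefficient of zero even though the variables are clearly related.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman_rho(xs, ys):
    return pearson_r(average_ranks(xs), average_ranks(ys))

x_curve = [1, 2, 3, 4, 5, 6]
y_curve = [x ** 3 for x in x_curve]   # non-linear, always increasing
x_u = [-3, -2, -1, 0, 1, 2, 3]
y_u = [x ** 2 for x in x_u]           # U-shaped

rho_curve = spearman_rho(x_curve, y_curve)
rho_u = spearman_rho(x_u, y_u)
print(f"monotonic curve: rho = {rho_curve:.2f}")
print(f"U-shape:         rho = {rho_u:.2f}")
```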
Spearman’s Rank Correlation Coefficient
First steps are the same as in calculation of Pearson correlation
coefficient.
Scatter plot
Choose the type of diagram and click «Define».
Choose variables you want to plot and click «OK».
Results
Correlation between time spent revising and exam performance is strong and statistically significant, r(41)=0,817; p<0,001.
(r = correlation coefficient; the number in brackets is the degrees of freedom, df = N - 2; p = p-value.)
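The reported r and degrees of freedom can be turned into the test statistic behind the p-value. This sketch assumes the standard t-test for a Pearson correlation coefficient. The p-value itself comes from the t distribution with df degrees of freedom and is left to statistical software here.

```python
from math import sqrt

r = 0.817   # reported correlation coefficient
df = 41     # reported degrees of freedom

# df = N - 2, so the sample size can be recovered from df.
n = df + 2

# t statistic for H0: the correlation in the population is zero.
t = r * sqrt(df) / sqrt(1 - r ** 2)

print(f"N = {n}, t({df}) = {t:.2f}")
```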
Linear Regression
Linear regression uses a mathematical model to describe the relationship between two variables.
Types of regression
1. By number of factors in the model:
• One-factor
• Multiple factor
2. By mathematical model of relationship:
• Linear
• Non-linear: logarithmic, exponential, Cox proportional hazards, etc.
Regression line
• If there is a relationship between x and y → we might want to
find the equation of a line that best approximates the data.
• This is called the regression line (also called best-fit line or
least-squares regression line).
• We can use this line to make predictions and evaluate the ratio
of change in y per change in x.
Regression
• One variable is regarded as the predictor,
explanatory, or independent variable (x).
• Other variable is regarded as the response,
outcome, or dependent variable (y).
Straight line equation
In algebraic terms, a straight line is defined by the following
formula:
Y = a + bX
Where:
• Y is the predicted variable.
• a is the intercept (where the straight line intercepts the
vertical Y-axis on a graph).
• b is the slope of the straight line.
• X is the value of the predictor variable.
Once we know the values for the intercept a and the slope b, we can insert various X values into the equation in order to predict values of Y (only within the range of known X values).
The same intercept, different slopes
http://core.ecu.edu/psyc/wuenschk/docs01/regr1.gif
The same slope, different intercepts
http://qcbs.ca/wiki/_media/fig_7_w5.png?w=600&tok=ab7bc2
Least square method (LSM)
Residuals (errors)
Example
x y
1,00 1,00
2,00 2,00
3,00 1,30
4,00 3,75
5,00 2,25
Example
x y ŷ y-ŷ (y-ŷ)^2
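The columns of this table can be filled in by fitting the least-squares line to the five example points. The sketch below uses the standard least-squares formulas for the slope b and intercept a of Y = a + bX; decimal commas from the table are written as decimal points.

```python
# Example data from the table above.
xs = [1.00, 2.00, 3.00, 4.00, 5.00]
ys = [1.00, 2.00, 1.30, 3.75, 2.25]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates: the line that minimizes the sum of squared residuals.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x
print(f"Y = {a:.3f} + {b:.3f} * X")

print(f"{'x':>5} {'y':>6} {'y_hat':>7} {'y-y_hat':>8} {'(y-y_hat)^2':>12}")
sum_squared_residuals = 0.0
for x, y in zip(xs, ys):
    y_hat = a + b * x          # predicted value on the regression line
    residual = y - y_hat       # vertical distance from the point to the line
    sum_squared_residuals += residual ** 2
    print(f"{x:>5.2f} {y:>6.2f} {y_hat:>7.3f} {residual:>8.3f} {residual ** 2:>12.4f}")

print(f"Sum of squared residuals = {sum_squared_residuals:.5f}")
```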
http://work.thaslwanter.at/BSA/html/_images/RegCoefficients.jpg
Regression and p value
H0: the slope of the regression line is zero (changes in the independent variable(s) do not predict changes in the dependent variable).
https://www.researchgate.net/publication/259983890_Quantitative_Assessment_of_Pancreatic_Fat_by_Using_Unenhanced_CT_Pathologic_Correlation_and_Clinical_Implications
Assumptions
• Both variables should be interval or ratio scale.
• There needs to be a linear relationship between the two
variables.
• There should be no significant outliers.
• The residuals (errors) of the regression line need to be
approximately normally distributed.
• The residuals should not be correlated with another variable.
• Your data need to show homoscedasticity (homogeneity of variance), which is where the variances along the line of best fit remain similar as you move along the line.
Linear regression in SPSS
Analyze → Regression → Linear
Select the dependent variable.
Select the independent variable.
Results I
Y= a+bX = 15,82+1,27*X
Exam performance = 15,82 + 1,27*Time spent revising
With every hour spent revising, the test score increases by 1,27 (95% CI = 0,99, 1,56) points. This relationship is statistically significant at the 95% level of confidence (t=9,075, p<0,001).
Plots of residuals - homoscedasticity
Homoscedasticity Heteroscedasticity
https://www.statisticssolutions.com/testing-assumptions-of-linear-regression-in-spss/
Plots of residuals - normality
Multiple linear regression
https://statisticspicturebook.wordpress.com/category/regression/
Multiple linear regression in SPSS
Additional assumption: covariates should not correlate strongly with each other (to avoid multicollinearity).
Select the dependent variable.
Select independent variables, not more than 5 if the chosen method is «Enter».