IM8.2.1.4 Distributions and
Probability
Dr. Nada El-Ekiaby
Lecturer of Molecular
Pharmacology
10/14/2024
Equations
• 95% of individual values lie within: mean ± 1.96 × SD
• Mean = sum of results / total number of results
• SE = SD / sqrt(n)
• Test statistic is used to obtain the p-value
• 95% CI boundaries = mean ± 1.96 × SE
Objectives
• By the end of this lecture you should be able to:
– Explain the term ‘normal distribution’ and its relevance
to medical data
– Explain the difference between a ‘sample’ and
‘population’
– Calculate the 95% confidence interval for a population
mean, given the mean, standard deviation and
sample size
– Explain what is meant by a 95% confidence interval
– Understand the concept of a 95% confidence interval
for any statistic
Why bother with statistics ?
• A tool to aid the practice of evidence-
based medicine (EBM)
– EBM is “the integration of best research evidence with
clinical expertise and patient values” (D L Sackett et al.
2001)
• Best research evidence:
– Patient based research on:
• Accuracy of diagnostic tests
• Power of prognostic markers
• Efficacy of therapeutic interventions
Steps in practicing EBM (Evidence-Based Medicine)
• Posing a clinical question
• Tracking down the research
evidence available
• Critically appraising evidence for its
methodology, results and applicability
• Integrating evidence with clinical expertise
and patient values
“Distributions and probability”
• STATISTICS: the study of variation
– Variation => variables: quantitative or
categorical
• Quantitative (number-based) - height, weight, blood pressure
• Categorical - gender, blood group, vital
status (dead/alive)
Example
(Study of cardiovascular health of 7735 British men aged 40-59)
Blood Group (categorical variable)
Blood Group   Frequency   Relative Frequency (%)
O             3572        46.6
A             3150        41.1
B             718         9.4
AB            222         2.9
TOTAL         7662        100.0
Example
(Study of cardiovascular health of 7735 British men aged 40-59)
Height (quantitative variable)
Height range (cm)   Frequency   Relative Frequency (%)
145-                20          0.3
155-                755         9.8
165-                3918        50.7
175-                2735        35.4
185-                299         3.9
195-204             5           0.1
Total               7732        100.0
Histogram of height (cm) I
[Figure: histogram of height in six bands; x-axis: Height (cm); y-axis: frequency; bar heights 20, 755, 3918, 2735, 299 and 5]
Histogram of height (cm) II
[Figure: histogram of height; x-axis: Height (cm), ticks from 147.5 to 197.5; y-axis: frequency]
Histogram of height (cm) III
[Figure: histogram of height; x-axis: Height (cm); y-axis: frequency]
Normal (“Gaussian”) distribution
curve
[Figure: histogram of height with a Normal curve; x-axis: Height (cm); y-axis: frequency]
Normal distribution
• Followed by many biological variables
• Characteristic symmetrical ‘bell’ shape
• Defined by mean and standard deviation
• Useful for formal statistics (hypothesis
testing & confidence intervals)
Normal distribution
Parameters – mean
Mean: “location”
• A and B have same mean
• C has mean greater than A or B
[Figure: three Normal curves A, B and C; x-axis: variable, e.g. height]
Normal distribution
Parameters – s.d.
Standard Deviation (s.d.): “dispersion”
• A and C have same s.d.
• B has s.d. greater than A or C (more dispersed)
[Figure: three Normal curves A, B and C; x-axis: variable, e.g. height]
Normal distribution
Standard normal distribution
[Figure: standard normal distribution curve]
Normal distribution
Tail areas
[Figure: Normal curve showing tail areas]
Normal distribution
95% of distribution
Normal distribution
95% of distribution (any mean/s.d.)
Predictions for Normal distribution
• 50% of values lie below, and 50% above
mean
• 68% of values lie between (mean-1xSD)
and (mean+1xSD)
• 95% of values lie between (mean-1.96xSD)
and (mean+1.96xSD)
• e.g. heights: mean=173.25cm, SD=6.63cm
– 95% of heights between 160.3cm and 186.2cm
[Figure: Normal curve of height with 160.3, 173.25 and 186.2 marked; SD=6.63]
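The 68% and 95% figures can be checked numerically. The sketch below is one possible way to do it, using Python's standard-library `NormalDist`; the helper name `share_within` is made up for illustration, and the mean and SD are the height values quoted above.

```python
from statistics import NormalDist

# Heights example: mean and SD in cm.
heights = NormalDist(mu=173.25, sigma=6.63)

def share_within(dist, k):
    """Fraction of values lying within k standard deviations of the mean."""
    lower = dist.mean - k * dist.stdev
    upper = dist.mean + k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

within_1_sd = share_within(heights, 1.0)
within_196_sd = share_within(heights, 1.96)

# 95% range for individual heights: mean +/- 1.96 x SD.
low = heights.mean - 1.96 * heights.stdev
high = heights.mean + 1.96 * heights.stdev

print(f"within 1 SD: {within_1_sd:.3f}, within 1.96 SD: {within_196_sd:.3f}")
print(f"95% of heights between {low:.1f} cm and {high:.1f} cm")
```

The same proportions come out whatever mean and SD are used, which is why the 1.96 rule applies to any Normal distribution.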
Sample and populations
• Each set of data is a sample from a wider
population
• Make inferences about population using sample
results (a reasonable guess)
• Sample mean is best available estimate of
population mean (but not completely reliable)
• Different samples of same size from same
population would produce different sample means
Lots of different samples would produce
distribution of sample means
Sample means (25 men per sample)   Sample means (350 men per sample)
173.48                             172.62
170.94                             172.99
172.34                             173.03
172.48                             172.30
173.57                             172.96
173.17                             172.96
169.66                             173.02
Mean of means=173.23               Mean of means=173.02
SD of means=1.45                   SD of means=0.28
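The effect of sample size on the spread of sample means can be imitated with a small simulation. This is only a sketch: the population here is hypothetical (7000 simulated heights using the mean and SD quoted earlier, 173.25 cm and 6.63 cm), not the study data, and the number of repeated samples (200) is an arbitrary choice.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated heights, NOT the real study data.
population = [random.gauss(173.25, 6.63) for _ in range(7000)]

def sample_means(sample_size, n_samples=200):
    """Draw repeated random samples and return the mean of each one."""
    return [mean(random.sample(population, sample_size))
            for _ in range(n_samples)]

means_25 = sample_means(25)
means_350 = sample_means(350)

for label, means in (("25", means_25), ("350", means_350)):
    print(f"{label} men per sample: mean of means={mean(means):.2f}, "
          f"SD of means={stdev(means):.2f}")
```

Both sets of sample means centre on the population mean, but the means of the larger samples vary much less from sample to sample.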
Distribution of sample means would
itself follow a Normal Curve
• Its Mean = true population mean
• Its SD = “standard error of the mean” is smaller than
the SD of individual values
• Standard error of mean (SEM) = SD/sqrt(number
of observations (n))
[Figure: distribution of sample means with Normal curve; x-axis: Sample mean height; y-axis: frequency]
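As a rough illustration of the SEM formula, the sketch below applies SD/sqrt(n) to the height SD quoted earlier (6.63 cm) for the two sample sizes used in the sample-means table. The function name `standard_error` is chosen here for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return sd / math.sqrt(n)

sd_heights = 6.63  # SD of individual heights (cm)

sem_25 = standard_error(sd_heights, 25)    # samples of 25 men
sem_350 = standard_error(sd_heights, 350)  # samples of 350 men
print(f"SEM for n=25: {sem_25:.2f} cm, SEM for n=350: {sem_350:.2f} cm")
```

These values are of the same order as the SD of means seen in the seven example samples of each size (1.45 and 0.28); exact agreement is not expected from so few samples.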
95% confidence interval (CI)
• Expect 95% of individual values to fall
within 1.96 SD of the population mean
• Also expect 95% of sample means to fall
within 1.96 SE of the population mean
• For single sample calculate:
– sample mean +/- 1.96xSE
• Two values give 95% confidence interval
for population mean
Example: 95% reference range
• Heights of 100 men
– mean=175cm SD=7.5cm
• Within what range would you expect 95% of men’s
heights to lie ? (reference range)
– Mean +/- 1.96 x SD
– 175 +/- 1.96x7.5
– 175+/- 14.7
– (160.3cm,189.7cm)
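The reference range arithmetic can be written as a short function. This is a minimal sketch; the name `reference_range_95` is illustrative only.

```python
def reference_range_95(mean, sd):
    """95% reference range for individual values: mean +/- 1.96 x SD."""
    half_width = 1.96 * sd
    return mean - half_width, mean + half_width

lower, upper = reference_range_95(mean=175.0, sd=7.5)
print(f"95% reference range: ({lower:.1f}cm, {upper:.1f}cm)")
```

Note that the sample size (100 men) does not enter the calculation: the reference range describes individual values, so it uses the SD rather than the SE.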
Example: 95% CI
• Heights of 100 men
– mean=175cm SD=7.5cm
• Within what interval would you expect the population
mean to lie, with 95% confidence ? (confidence interval)
– mean+/- 1.96xSE
– 175+/-1.96x(SD/sqrt(n))
– 175+/-1.96x(7.5/sqrt(100))
– 175+/-1.47
– (173.53cm, 176.47cm)
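The same steps for the confidence interval can be sketched in code. The function name `ci_95_for_mean` is an illustrative choice; the inputs are the values from the example.

```python
import math

def ci_95_for_mean(mean, sd, n):
    """95% CI for a population mean: mean +/- 1.96 x SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    half_width = 1.96 * se
    return mean - half_width, mean + half_width

lower, upper = ci_95_for_mean(mean=175.0, sd=7.5, n=100)
print(f"95% CI: ({lower:.2f}cm, {upper:.2f}cm)")
```

Compared with the reference range for the same data, the only change is dividing the SD by sqrt(n), which makes the interval ten times narrower when n=100.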
95% confidence interval
• The range within which we expect the true
population mean to lie, with 95%
confidence.
• If we repeated a study 100 times (and
calculated a 95% confidence interval each
time), we would expect about 95 of these
intervals to contain the true population mean
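The "repeat the study many times" idea can be tried out by simulation. The sketch below assumes a hypothetical population whose true mean and SD are set to 175 cm and 7.5 cm (borrowed from the heights example), repeats a study of 100 men 2000 times, and counts how often the 95% CI contains the true mean.

```python
import math
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

TRUE_MEAN, TRUE_SD = 175.0, 7.5   # assumed "true" population values
N_PER_STUDY, N_STUDIES = 100, 2000

covered = 0
for _ in range(N_STUDIES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N_PER_STUDY)]
    se = stdev(sample) / math.sqrt(N_PER_STUDY)
    lower = mean(sample) - 1.96 * se
    upper = mean(sample) + 1.96 * se
    if lower <= TRUE_MEAN <= upper:
        covered += 1

coverage = covered / N_STUDIES
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The proportion should come out close to 95%, though not exactly, because it is itself estimated from a finite number of simulated studies.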
Confidence intervals for all
statistics
• Concept does not only apply to the mean
• Can calculate CIs for e.g.
proportions/percentages, difference between
means, other measures (RR, RRR, ARR,
OR) etc.
• CI quantifies the reliability of an estimate
from a sample
Example: proportions
• Student lifestyle questionnaire
• “Do you smoke cigarettes at all nowadays?”
• Of 328 Students 54 ticked “yes”
• 54/328=0.165 (SAMPLE estimate)
– Within what range are we 95% certain that the
true value lies for the proportion of smokers in
the POPULATION of students?
(i.e. the 95% confidence interval)
Example: 95% CI for proportions (p)
• 54/328=0.165 (SAMPLE estimate)
– or equivalently as a percentage 16.5%
• Can calculate standard error for this Proportion
(p)
• SE(p) = sqrt((p x (1-p))/n)
• sqrt((0.165 x (1-0.165))/328) = 0.020
• Hence, 95% CI is:
• (0.165 – (1.96 x 0.020), 0.165 + (1.96 x 0.020))
• (0.126,0.204)
– or equivalently as percentages (12.6%,20.4%)
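The proportion calculation can be sketched as below. The function name `ci_95_for_proportion` is illustrative. One caveat: the worked example rounds the SE to 0.020 before multiplying by 1.96, whereas the code keeps full precision, so the limits can differ from (0.126, 0.204) in the third decimal place.

```python
import math

def ci_95_for_proportion(p, n):
    """95% CI for a proportion: p +/- 1.96 x sqrt(p(1-p)/n)."""
    se = math.sqrt(p * (1 - p) / n)
    return p - 1.96 * se, p + 1.96 * se, se

p_hat = round(54 / 328, 3)  # 0.165, rounded as in the worked example
lower, upper, se = ci_95_for_proportion(p_hat, 328)
print(f"SE={se:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```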
95% CI for proportion of
smokers
• We are 95% certain that the true value for
the proportion of smokers in the
POPULATION of students lies somewhere
between:
– 0.126 (12.6%) and 0.204 (20.4%)
Summary
• You should be able to:
– Explain the term ‘normal distribution’ and its relevance
to medical data
– Explain the difference between a ‘sample’ and
‘population’
– Calculate the 95% confidence interval for a population
mean, given the mean, standard deviation and sample
size
– Explain what is meant by a 95% confidence interval
– Understand the concept of a 95% confidence interval
for any statistic
Assignment
Open the link below and solve Sections 1 to 4 of
the exercises.
www.ucl.ac.uk/lapt?biom1
Click on the grey “Start” button towards the top of the
screen.
You will now be taken through the exercises.
Recommended Reading
Petrie A, Sabin C. Medical statistics at a glance.
3rd ed. Chichester: Wiley-Blackwell; 2009.
Significance Testing
Dr. Nada El-Ekiaby
Summary of previous session
• Learnt about two main classifications of
data
• Introduced to normal distribution
• Learnt difference between sample and
population
• Calculated and interpreted 95% CIs for a
mean and percentage
Objectives for this session
• Understand the concept and relevance of
hypothesis testing (significance testing)
• Define the ‘null hypothesis’ when comparing two
groups
• Be aware of the t-test (paired and unpaired) and
its role in hypothesis testing
• Interpret a p-value
• Understand the relationship between p-values and
confidence interval
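Key terms
• Paired t-test: related samples (e.g. data before and after)
• Unpaired t-test: independent samples (e.g. data between two groups)
• P-value: the probability of results at least as extreme as those observed, if H0 is correct
• Null hypothesis (H0): no significant difference
• Alternative hypothesis (H1 or Ha): there is a significant difference
• Two-tailed test: H0 is rejected in favour of Ha for a difference in either direction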
Stages of Hypothesis testing
1. Define the null and alternative hypothesis under study.
2. Collect relevant data from a sample of individuals
3. Calculate the value of the test statistic specific to the null
hypothesis
4. Compare the value of the test statistic to values from a
known probability distribution
5. Interpret the p-value and results
Null hypothesis (H0)
• Investigator may conduct a study to address a
certain theory (study hypothesis)
– e.g. FEV1 (forced expiratory volume in 1 second) in
young adults varies between males and females
• Null hypothesis (H0): starting point of statistical
analysis
• H0: precise statement about population of interest,
that there is no effect or no difference (usually
converse of study hypothesis!)
Null hypothesis (continued)
• E.g. The difference in mean FEV1 between the
population of young adult males and females is 0
litres
• Can’t look at whole population of young adult
males and females!!
• Use a sample to make inferences about wider
population
• Is there any evidence from our sample against the
null hypothesis?
Test statistic
• Assess the evidence against the null
hypothesis using the ‘test statistic’
• Test statistic calculated from our sample
data
• Type of test statistic depends upon type of
data (e.g. quantitative or categorical)
• Test statistic can then be ‘looked up’ in
tables and a p-value obtained
The unpaired (2 sample) t-test
• Quantitative (continuous) data e.g. height, blood
pressure, age, FEV1…
• Comparing continuous outcome (FEV1) between
two groups (males versus females)
• Assumes values follow normal distribution
– (in each of the two population subgroups - see lecture
one)
• Assumes standard deviation the same in the
population subgroups
Example
• FEV1 (litres)
– Sample: 39 males, 46 females
• Males: mean=4.651 s.d.=0.761
• Females: mean=3.311 s.d.=0.657
– Observed difference in mean FEV1 is 1.34
litres
• Is this difference due to chance or is it due
to a true difference in mean FEV1 in males
and females ?
Unpaired t-test: test statistic
• Difference in means is 1.34 litres
• Can calculate the standard error of this
difference (0.156 litres)
• Test statistic calculated:
– difference in means/SE of difference in means
– 1.34/0.156 = 8.6
• Look this up in appropriate statistical tables
to derive the p-value
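The slides quote the standard error of the difference (0.156 litres) without showing how it is obtained. The sketch below is one way to arrive at it, under a stated assumption: it uses the unpooled form sqrt(sd1²/n1 + sd2²/n2), chosen only because it reproduces 0.156. A pooled-SD version, in line with the equal-SD assumption of the unpaired t-test, would give a slightly smaller SE (about 0.154). The function name is illustrative.

```python
import math

def se_difference_in_means(sd1, n1, sd2, n2):
    """SE of a difference in means, without pooling the two SDs.

    Assumption: this unpooled form is used only because it reproduces
    the 0.156 litres quoted in the example.
    """
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# FEV1 (litres): 39 males and 46 females.
diff = 4.651 - 3.311
se_diff = se_difference_in_means(0.761, 39, 0.657, 46)
t_stat = diff / se_diff
print(f"difference={diff:.2f}, SE={se_diff:.3f}, test statistic={t_stat:.1f}")
```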
The p-value
• P-value is a probability (ranges between zero and
one)
• The smaller our p-value (obtained by looking up the
appropriate test statistic in tables), the LESS likely it is that
results like those observed would arise by chance alone
if the null hypothesis were true
• The smaller our p-value, the STRONGER the evidence
against the null hypothesis
• Use p-value to decide whether to REJECT the null
hypothesis
P-value (contd.)
• Often decision made to reject H0 if p-value less
than 0.05 (p < 0.05) (5% significance level)
• Less than 1 in 20 probability that observed results
due to chance
• Reject the null hypothesis (at the 5% significance
level) that states no difference or no effect
• Conclude results reflect a true difference in the
population of interest
P-value (contd.)
• P > 0.05 not enough evidence to reject the
null hypothesis
• Null hypothesis has not been proved
correct!
• Do NOT say we accept the null hypothesis
• We say: No evidence against the null
hypothesis
Back to example….
• Calculated our test statistic for unpaired t-test = 8.6
• Look test statistic up in tables: p < 0.00001
• Strong evidence to reject (against) the null
hypothesis
• Less than 1 in 100,000 probability that
observed difference (1.34 litres) is due to
chance
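Instead of printed tables, the test statistic can be converted to a p-value in code. The sketch below is an approximation, not the exact t-test: it treats the test statistic as a standard Normal value, which is reasonable here because the degrees of freedom (39 + 46 - 2 = 83) are large. The function name `two_sided_p_normal` is illustrative.

```python
import math

def two_sided_p_normal(test_statistic):
    """Two-sided p-value using a standard Normal approximation.

    Assumption: with 83 degrees of freedom the t distribution is close
    to Normal; exact t tables give a somewhat larger (but still tiny)
    p-value.
    """
    return math.erfc(abs(test_statistic) / math.sqrt(2))

p_value = two_sided_p_normal(8.6)
print(f"p = {p_value:.1e}")
print("reject H0 at the 5% level" if p_value < 0.05 else "do not reject H0")
```

As a sanity check, a test statistic of 1.96 gives a p-value of about 0.05 with this function, matching the 1.96 used for 95% confidence intervals.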
Conclusion
• From p-value we say:
– If the null hypothesis were true, there is less than 1 in
100,000 probability, that the difference we have seen
between sample means in our data, would occur by
chance. We reject the null hypothesis and say the
difference is significant at the 0.001% significance level
• Difference reflects a true difference in FEV1
between the population of young adult males and
females
Example: Confidence Interval
• FEV1 (litres)
– Sample: 39 males, 46 females
• Males: mean=4.651 s.d.=0.761
• Females: mean=3.311 s.d.=0.657
– Observed difference in mean FEV1 is 1.34
litres (also calculated S.E. = 0.156 litres)
• Can calculate a 95% confidence interval for
this difference in means
95% CI for difference in means
• Difference in means +/- (1.96 x SE of
difference)
• 1.34 +/- (1.96 x 0.156)
• 1.34 +/- 0.306
• (1.034 litres, 1.646 litres)
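The interval can be reproduced in a few lines. As a small extra, this sketch obtains the multiplier from the standard Normal distribution rather than typing 1.96, to show where that number comes from; the variable names are illustrative.

```python
from statistics import NormalDist

# 1.96 is the point of the standard Normal distribution that leaves
# 2.5% in each tail, i.e. 95% in the middle.
z = NormalDist().inv_cdf(0.975)

diff, se_diff = 1.34, 0.156   # litres, from the FEV1 example
lower = diff - z * se_diff
upper = diff + z * se_diff
print(f"z={z:.2f}, 95% CI=({lower:.3f} litres, {upper:.3f} litres)")
```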
Interpreting the CI
• 95% certain that the true difference in mean FEV1 between
the population of young adult males and females lies
somewhere between 1.034 litres and 1.646 litres
• Under the null hypothesis: if there was no difference in
mean FEV1 the difference would be zero
• We can see that the 95% CI does not contain the value zero
• Results from CI are incompatible with the null hypothesis
• Looking at the 95% CI we would reject the null hypothesis
at the 5% significance level
• Both the CI and the p-value from the hypothesis test lead to
the same conclusion - reject H0
General structure of hypothesis
test
• Define H0
• Collect data
• Calculate summary measure (e.g. difference in
means)
• Calculate value of test statistic
• Derive p-value and/or CI
• Decide whether enough evidence to reject H0
Example –
Paired (one sample) t-test
• Define H0:
– Mean change in FEV1 after use of
bronchodilator is zero
• (Collect) Data:
– 29 students: FEV1 measured before and after
use of bronchodilator (N.B. data are PAIRED
observations)
• Calculate summary measure:
• Mean change = 0.071 litres
• Standard deviation of change = 0.236 litres
Example (continued)
• Calculate test statistic (for paired t-test):
– Mean difference/SE of mean difference
– SE of mean difference = SD/sqrt(n) = 0.236/sqrt(29) = 0.0438
– 0.071/0.0438 = 1.62
• Derive p-value:
– p=0.12 (> 0.05)
• Decide whether or not to reject H0
– 12 in 100 probability that results occurred by
chance if null hypothesis true, not enough
evidence to reject H0 (P > 0.05)
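The paired example can be followed end to end in code. This is a sketch under assumptions: in place of statistical tables it integrates the density of Student's t distribution numerically (Simpson's rule) with n - 1 = 28 degrees of freedom, and the helper names `t_pdf` and `two_sided_p_t` are made up for illustration. A statistics package would normally be used for this step.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_t(t_stat, df, steps=2000):
    """Two-sided p-value by Simpson's rule over [0, |t|] (steps must be even)."""
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3          # area between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central_half)

n = 29
mean_change, sd_change = 0.071, 0.236     # litres
se_change = sd_change / math.sqrt(n)      # SE of the mean change
t_stat = mean_change / se_change
p_value = two_sided_p_t(t_stat, df=n - 1)
print(f"SE={se_change:.4f}, t={t_stat:.2f}, p={p_value:.2f}")
```

Because the data are paired, the test works on the 29 within-student changes as a single sample, which is why the SE is simply SD/sqrt(n).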
P-value and CI
• P-value 0.12 (do not reject H0)
• Can calculate 95% CI
– (-0.015 litres, 0.157 litres)
• Under null hypothesis we would expect mean
change to be zero
• Zero is contained in 95% CI
• Confidence interval is compatible with the null
hypothesis
• No evidence against H0
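The link between the p-value and the CI can be checked side by side for the two worked examples. In this sketch the estimates, standard errors and p-values are the ones quoted in the slides; for the unpaired example the p-value is only known to be below 0.00001, so that bound is used as a stand-in.

```python
def ci_95(estimate, se):
    """95% CI as estimate +/- 1.96 x SE."""
    return estimate - 1.96 * se, estimate + 1.96 * se

# (estimate in litres, SE in litres, p-value) from the two examples.
# 0.00001 is an upper bound standing in for "p < 0.00001".
examples = {
    "unpaired": (1.34, 0.156, 0.00001),
    "paired": (0.071, 0.0438, 0.12),
}

results = {}
for name, (estimate, se, p) in examples.items():
    lower, upper = ci_95(estimate, se)
    zero_in_ci = lower <= 0 <= upper
    reject_h0 = p < 0.05
    results[name] = (round(lower, 3), round(upper, 3), zero_in_ci, reject_h0)
    print(f"{name}: CI=({lower:.3f}, {upper:.3f}), "
          f"zero in CI={zero_in_ci}, reject H0={reject_h0}")
```

In both examples the two approaches agree: H0 is rejected at the 5% level exactly when zero lies outside the 95% CI.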
Summary - session objectives
• Understand the concept and relevance of
hypothesis testing (significance testing)
• Define the ‘null hypothesis’ when comparing two
groups
• Be aware of the t-test (paired and unpaired) and
its role in hypothesis testing
• Interpret a p-value
• Understand the relationship between p-values and
confidence interval
Worked example
• n=100, mean=64, SD=5
• SE = SD/sqrt(n) = 5/sqrt(100) = 0.5
• 95% CI = mean +/- 1.96xSE = 64 +/- (1.96 x 0.5) = (63.02, 64.98)