Statistics
(laboratory science)
Dr Sebastiaan Winkler
School of Pharmacy, University of Nottingham
Scope of this course (3 hours)
• Complements Introduction to statistics with Graphpad Prism by
Ian Withers (January)
• There is some overlap between the courses, but also topics only
covered in one or the other
• Focus on:
• quantitative data (numerical)
• Physical/life science
Terms:
Population: all possible values
Sample: (representative) subset of the population, consisting of n
observations (data points)
Take note:
In statistics: sample ≠ data point! Observation = data point!
Can statistics solve all your problems?
Accuracy & precision
Accuracy
The closeness of a result to the true value.
Precision
The extent to which results agree with one another.
Experimental design
The extent to which results agree with one another
and the closeness of the result to the true value.
The normal distribution
If you make many measurements, a
histogram can be used to display the
data.
The histogram will start to approach a
bell-shaped curve: the normal
distribution.
The normal distribution displays the
probability that a particular value occurs.
[Figure: bell-shaped curve; x-axis: value (mean µ); y-axis: probability of value occurring]
Karl Friedrich Gauss (1777–1855)
When doing experiments, you
determine the mean of the sample
($\bar{x}$), which is your (best) estimate of
the mean of the population (µ).
The mean
Example
Estimating the protein concentration in a cell lysate;
three samples; measured values 0.245, 0.218, 0.437 (ng/ml)
The standard deviation (SD) quantifies variability or scatter — how much the values
vary from one another; it is expressed in the same units as your data.
The formula to find the standard deviation of the sample
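The example values can be worked through in a few lines of Python. This is a minimal sketch that assumes the usual sample formulas, with n - 1 in the denominator of the standard deviation; the variable names are illustrative only.

```python
import math
import statistics

# Measured protein concentrations from the example (ng/ml)
values = [0.245, 0.218, 0.437]

n = len(values)
mean = sum(values) / n

# Sample standard deviation: sum of squared deviations divided by n - 1
sum_sq = sum((x - mean) ** 2 for x in values)
sd = math.sqrt(sum_sq / (n - 1))

print(f"mean = {mean:.3f} ng/ml")  # mean = 0.300 ng/ml
print(f"SD   = {sd:.3f} ng/ml")    # SD   = 0.119 ng/ml

# Cross-check against the standard library implementation
assert math.isclose(sd, statistics.stdev(values))
```

The SD (0.119 ng/ml) is large relative to the mean (0.300 ng/ml), which reflects the one value (0.437) that lies far from the other two.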
Example
If you know the mean and n = 3, then:
• The first measurement can be any number
• The second measurement can be any number
• The third measurement must be a given number
Bad news:
• A difficult concept; difficult to determine in many experiments
Good news:
• Software such as Graphpad Prism and Microsoft Excel can help
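The three-measurement example above can be checked numerically. The sketch below reads it as an illustration of degrees of freedom (n - 1 = 2 here); that reading is an assumption, and the two freely chosen values are simply borrowed from the protein example.

```python
# The mean of n = 3 measurements is known
n = 3
mean = 0.300

# The first two measurements can be any number (values chosen for illustration)
x1 = 0.245
x2 = 0.218

# The third is then fixed, because all three must add up to n * mean
x3 = n * mean - x1 - x2
print(f"x3 = {x3:.3f}")  # x3 = 0.437

degrees_of_freedom = n - 1  # number of values that were free to vary
```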
The standard error of the mean
Example
Estimating the protein concentration in a cell lysate;
three samples; measured values 0.245, 0.218, 0.437 (ng/ml)
The standard error of the mean (SEM) quantifies the precision of the mean: how
precisely you know the true mean of the population; it is expressed in the same units
as the data.
The formula to find the standard error of the sample mean
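For the same three measurements, the SEM can be computed as below. This sketch assumes the usual definition, SEM = SD / √n, with SD being the sample standard deviation.

```python
import math
import statistics

values = [0.245, 0.218, 0.437]  # ng/ml, from the example

n = len(values)
sd = statistics.stdev(values)   # sample SD (n - 1 in the denominator)
sem = sd / math.sqrt(n)         # standard error of the mean

print(f"SEM = {sem:.3f} ng/ml")  # SEM = 0.069 ng/ml
```

Because the SEM divides by √n, it shrinks as more observations are collected, whereas the SD does not.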
• E.g. if your data is in cells A1, A2, A3, and A4, then the range is A1:A4
2. Statistical testing
• Student t test
• One-tailed, two-tailed
• ANOVA – Analysis of variance
• one-way ANOVA, two-way ANOVA
• Post hoc tests
Student’s t test
The most commonly used statistical test?
Answers the question: if samples a and b come from the same population, how
probable is a difference between their means at least as large as the one observed?
‘The probable error of a mean’ published in 1908 by William Sealy Gosset, who
worked for the Guinness brewery in Dublin, Ireland
Scenario 2
p > 0.05: not significant
[Figure: population means µa and µb]
Student’s t-test
The test is based on the t-statistic.
• Step 1: Calculate the means $\bar{x}_1$ and $\bar{x}_2$ as well as the standard deviations
$s_1$ and $s_2$ of the two samples, with number of observations $n_1$ and $n_2$
• Step 2: Calculate the value of $t$
• Step 3: Using the degrees of freedom $df$, look up the p value in a table of
t-statistics: the probability of a difference at least this large if both
samples come from the same population
• It can be done without a computer (some supervisors may remember
how to do this)
Student’s t-test
Step 2: The t-statistic:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
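As a rough illustration, the t-statistic can be calculated directly from two lists of measurements. The data below are hypothetical and the function name `t_statistic` is an illustrative choice; turning t into a p value still requires the degrees of freedom and a t table or statistics software.

```python
import math
import statistics

def t_statistic(sample1, sample2):
    """Difference of the means divided by sqrt(s1^2/n1 + s2^2/n2)."""
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)  # s^2
    n1, n2 = len(sample1), len(sample2)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Hypothetical measurements, for illustration only
control = [1.0, 2.0, 3.0]
treated = [2.0, 3.0, 4.0]

t = t_statistic(treated, control)
print(f"t = {t:.3f}")  # t = 1.225
```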
Paired t-test:
• Use when you have measurements that are matched, e.g. data points
taken before and after treatment
• Heart rate before and after exercise
• Conductivity across membrane before and after treatment with compound
In a paired t-test, the sample size of both groups should be
identical: $n_1 = n_2$
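A paired test works on the difference within each matched pair. The sketch below assumes the standard formulation (mean difference divided by the SEM of the differences, with n - 1 degrees of freedom), which is not spelled out on the slide; the heart rates are hypothetical.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for matched measurements, based on per-pair differences."""
    if len(before) != len(after):
        raise ValueError("paired data need the same number of observations in both groups")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    sem_diff = statistics.stdev(diffs) / math.sqrt(n)  # SEM of the differences
    return statistics.mean(diffs) / sem_diff, n - 1

# Hypothetical heart rates (beats per minute) before and after exercise
before = [60, 62, 58, 65]
after = [70, 75, 66, 80]

t, df = paired_t(before, after)
print(f"t = {t:.2f}, df = {df}")
```

The length check mirrors the requirement that both groups have the same number of observations.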
Student’s t-test
Take home message:
Use a two-tailed t-test for simple comparison between experimental
values (treatment and control).
Use a paired t-test when you have matched measurements
When is a one-tailed t-test appropriate? Rarely
The t-test assumes normal distributions; the spread of values in both
groups should be comparable (homogeneity of variance).
Student’s t-test
Advantage:
Robust test even if assumptions are not completely valid
Limitation:
The t-test is designed for the comparison of the means of two
populations. So, what do you do when comparing the means of many
groups?
The risk of a ‘type 1 error’ (false positive) increases with each additional t-test.
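How fast the false-positive risk grows can be estimated with a short calculation. This sketch assumes every t-test uses a significance threshold of 0.05 and that the tests are independent, which is a simplification; the function name is illustrative.

```python
from math import comb

def family_wise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive in n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

for n_groups in (2, 3, 4, 5):
    n_tests = comb(n_groups, 2)  # number of pairwise comparisons between groups
    risk = family_wise_error(n_tests)
    print(f"{n_groups} groups: {n_tests} t-tests, risk of a false positive = {risk:.1%}")
```

With three groups (three pairwise t-tests) the risk is already about 14% rather than 5%.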
Tools for t testing
• Microsoft Excel
• Built-in formulas and tools in the Data Analysis pack (need to be
installed separately; not part of default installation in Excel)
• Graphpad Prism
• University license available; needs to be installed separately (license
key can be requested)
• Graphpad online tool
• Useful for quick results, simple data.
Analysis of variance (ANOVA)
• Test to compare the means of three or more groups
• Similar assumptions as for t test (normality, similar variance in
samples)
• F statistic
• Variance within samples and variance between samples
Analysis of variance (ANOVA)
• F statistic
• If F statistic is sufficiently large, you can conclude that the sample means are not (all)
derived from the same population
• So, which sample(s) are different?
• Post hoc testing
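The F statistic compares the variance between samples with the variance within samples. A minimal sketch for a one-way layout, using the textbook mean-square definitions and hypothetical data (the slides do not give the formula, so treat the details as an assumption):

```python
import statistics

def f_statistic(groups):
    """One-way ANOVA F: mean square between groups / mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical measurements for three groups
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value = f_statistic(groups)
print(f"F = {f_value:.2f}")  # F = 13.00
```

A large F only says that not all means are alike; it does not say which group differs.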
Analysis of variance (ANOVA)
• Post-hoc testing
• Carried out after calculating the F statistic
• Tests based on t statistic with correction to avoid false positives (type 1
errors)
• Recommended tests:
• Tukey test: use when you want to compare all means with all other
means
• Dunnett’s test: use when you want to compare all means with one
control group
One-way/two-way ANOVA
One-way ANOVA
Use one-way ANOVA when comparing many (n ≥ 3) means in one group
• Example 1: what is optimal pH for enzyme?
• Example 2: which formulation delivers best RNAi?
Two-way ANOVA
Use two-way ANOVA when comparing many (n ≥ 3) means in two groups
• Example 1: compare drug treatments in two cell lines
• Example 2: Determine optimal pH under high and low salt conditions
ANOVA
Take home message:
1. If you want to compare three or more means, decide
which ANOVA is suitable: one-way or two-way?
2. Calculate F statistic: is variance between samples greater than
variance within samples?
3. To identify which groups are different, use a post hoc test
Tukey test: compare every mean with every other mean (compare all
groups with each other)
Dunnett test: compare every mean with one mean (control)
Tools for ANOVA
• Microsoft Excel
• Built-in formulas and tools in the Data Analysis pack (need to be
installed separately; not part of default installation in Excel)
• Limitation: post hoc tests not built in
• Graphpad Prism
• University license available; needs to be installed separately (license
key can be requested)
3. Error propagation
Now it gets (even more) complicated:
What do you do when you do calculations with two or more numbers
that each have variability?
Examples:
• RT-qPCR: expression of the gene ESR1 relative to that of the housekeeping gene
GAPDH
• Determining the concentration of a solution using 5.00 g NaOH and 100.00 ml
water
• Conversion of an observed quantity using a formula (e.g. absorbance to
concentration using the Beer-Lambert law)
• Normalisation
Error propagation: multiplication
Multiplication and division:
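Say you have two values with a standard deviation: $\bar{a} \pm s_a$ and $\bar{b} \pm s_b$
If $x = a \times b$ then the standard deviation of x can be determined:
$$\left(\frac{s_x}{\bar{x}}\right)^2 = \left(\frac{s_a}{\bar{a}}\right)^2 + \left(\frac{s_b}{\bar{b}}\right)^2$$
$$\frac{s_x}{\bar{x}} = \sqrt{\left(\frac{s_a}{\bar{a}}\right)^2 + \left(\frac{s_b}{\bar{b}}\right)^2}$$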
The relative standard deviation is given by the square root of the sum
of the squares of all relative standard deviations.
Error propagation: multiplication
Multiplication and division:
If you have multiple values, then continue adding the squares
of the relative standard deviations.
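E.g. if $x = \dfrac{a \times b}{c}$ then:
$$\left(\frac{s_x}{\bar{x}}\right)^2 = \left(\frac{s_a}{\bar{a}}\right)^2 + \left(\frac{s_b}{\bar{b}}\right)^2 + \left(\frac{s_c}{\bar{c}}\right)^2$$
$$\frac{s_x}{\bar{x}} = \sqrt{\left(\frac{s_a}{\bar{a}}\right)^2 + \left(\frac{s_b}{\bar{b}}\right)^2 + \left(\frac{s_c}{\bar{c}}\right)^2}$$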
Bad news:
No built-in formulas in Microsoft Excel or Graphpad Prism!!
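The rule for multiplication and division is short enough to script yourself. A minimal sketch, assuming the errors of the individual values are independent; the helper name `relative_sd` and all numbers are hypothetical.

```python
import math

def relative_sd(*terms):
    """Relative SD of a product or quotient of (mean, sd) pairs:
    square root of the sum of the squared relative SDs."""
    return math.sqrt(sum((sd / mean) ** 2 for mean, sd in terms))

# Hypothetical values as (mean, sd)
a = (10.0, 0.3)   # relative SD 3%
b = (5.0, 0.2)    # relative SD 4%
c = (4.0, 0.48)   # relative SD 12%

# x = a * b
x_mean = a[0] * b[0]
x_rel = relative_sd(a, b)
x_sd = x_rel * x_mean
print(f"a*b   = {x_mean:.1f} ± {x_sd:.2f} (relative SD {x_rel:.1%})")

# y = a * b / c: division follows the same rule as multiplication
y_mean = a[0] * b[0] / c[0]
y_rel = relative_sd(a, b, c)
y_sd = y_rel * y_mean
print(f"a*b/c = {y_mean:.2f} ± {y_sd:.3f} (relative SD {y_rel:.1%})")
```

Note that the largest relative SD (here 12% for c) dominates the result.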
Error propagation: addition
Addition and subtraction:
If you have multiple values with standard deviation, then you add the
squares of the standard deviations.
E.g. if $x = a + b + c$ then: $s_x^2 = s_a^2 + s_b^2 + s_c^2$
$$s_x = \sqrt{s_a^2 + s_b^2 + s_c^2}$$
Let $d = a + b + c$ then: $s_d^2 = s_a^2 + s_b^2 + s_c^2$
$$s_d = \sqrt{s_a^2 + s_b^2 + s_c^2}$$
and, for the mean $x = d/3$: $s_x = \dfrac{s_d}{3}$
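The addition rule, and the step from the sum to the mean of three values, can be sketched in the same way. The standard deviations below are hypothetical, `sd_of_sum` is an illustrative name, and independent errors are assumed.

```python
import math

def sd_of_sum(*sds):
    """SD of a sum or difference: square root of the sum of the squared SDs."""
    return math.sqrt(sum(s ** 2 for s in sds))

# Hypothetical standard deviations of a, b and c (same units)
s_a, s_b, s_c = 0.3, 0.4, 1.2

s_d = sd_of_sum(s_a, s_b, s_c)  # d = a + b + c
s_x = s_d / 3                   # x = d / 3, the mean of the three values

print(f"s_d = {s_d:.2f}")  # s_d = 1.30
print(f"s_x = {s_x:.2f}")  # s_x = 0.43
```

Unlike multiplication, this rule uses the absolute standard deviations, so all values must be in the same units.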
Error propagation: final comments
Complex formulas
• combine the rules for multiplication/division and addition/subtraction
There are additional rules
• logarithms, power calculations, anti-logs etc
Error propagation is very common
• Even relatively simple experiments quickly become complex
Tools for error propagation
• Graphpad online tool
• Online error propagation calculator
• Graphpad Prism
• Enter data (x,y) as mean, s.d. (or s.e.m.) and n
• Microsoft Excel
• No built-in formulas
Final comments
Error propagation is very common
Even simple experiments quickly become quite complex; it is ok to ignore
some errors (as long as you know what you are doing)
Look at the data; be conservative.
For example, do not inflate n when you combine technical and biological
replicates unless you feel it is justified
Graphpad Prism and Microsoft Excel can be a great help,
but use them with caution; it is not always straightforward to calculate p
values
Talk to others; speak to specialist statisticians when necessary