
Statistics

This document outlines a course on statistics tailored for laboratory science, focusing on descriptive statistics, statistical testing, and error propagation. Key topics include the mean, standard deviation, t-tests, ANOVA, and the importance of sample size and data variability. It emphasizes a practical approach to statistical concepts, with resources available for further support.

Uploaded by

Farah Aina

Introduction to statistics

(laboratory science)
Dr Sebastiaan Winkler
School of Pharmacy, University of Nottingham
Scope of this course (3 hours)
• Complements Introduction to statistics with Graphpad Prism by
Ian Withers (January)
• There is some overlap between the courses, but also topics only
covered in one or the other

• Focus on:
• quantitative data (numerical)
• Physical/life science

• Disclaimer: I am not a statistician


• Practical approach; intended as a simple guide
Topics
1. Descriptive statistics
1. Source of errors, sampling, mean, standard error, standard deviation
2. Statistical testing
1. Student t-test (one-tailed, two-tailed)
2. ANOVA (one-way ANOVA, two-way ANOVA)
3. Error propagation
• What is the error in a value calculated using two measurements?
Online support and resources
• Online supporting material and further resources are available
here:
• https://www.nottingham.ac.uk/toolkits/play_21042
Assumptions (1)
The data …
• Continuous
Any value; often decimal places required to report values

Other types of data (not covered):


• Discrete, discontinuous
Whole numbers, counts (sequencing data)
• Categorical, nominal
Data divided into categories; no numerical values
1. Descriptive statistics - introduction
Why always do multiple measurements?
Statistical sampling: when you do one measurement, you do not know
how accurate it is.
Repeats: selection of a subset (sample) of a population of all possible
values to estimate the variability/scatter in the entire population

Not all measurements are the same.


• Intrinsic (physical) variability (height/weight of animal, heart rate)
• Experimental errors
Descriptive statistics - introduction
Why always do multiple measurements?
Statistical sampling: selection of a subset of a population to
estimate the variability/scatter in the entire population

Terms:
Population: all possible values
Sample: (representative) subset of the population, consisting of n
observations (data points)
Take note:
In statistics: sample ≠ data point! Observation = data point!
Can statistics solve all your problems?
Accuracy & precision
Accuracy
The closeness of a result to the true value.

Precision
The extent to which results agree with one another.

Experimental design
The extent to which results agree with one another
and the closeness of the result to the true value.
The normal distribution
If you make many measurements, a
histogram can be used to display the
data.
The histogram will start to approach a
bell-shaped curve: the normal
distribution.
The normal distribution displays the
probability that a particular value occurs.
[Figure: bell curve; x-axis: value (mean $\mu$); y-axis: probability of value occurring]
Karl Friedrich Gauss (1777–1855)
The normal distribution
When doing experiments, you determine the mean of the sample ($\bar{x}$),
which is your (best) estimate of the mean of the population ($\mu$).
The mean
Example
Estimating the protein concentration in a cell lysate;
three samples; measured values 0.245, 0.218, 0.437 (ng/ml)

The estimate of the ‘true’ protein concentration is the mean (arithmetic mean): 0.300 ng/ml

The formula to find the sample mean:

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
In a normal distribution, the mean = median = mode
The standard deviation
Example
Estimating the protein concentration in a cell lysate;
three samples; measured values 0.245, 0.218, 0.437 (ng/ml)

The standard deviation (SD) quantifies variability or scatter — how much the values
vary from one another; it is expressed in the same units as your data.
The formula to find the standard deviation of the sample:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Note: $\sigma$ is the standard deviation of the population.
The standard deviation is: 0.119 ng/ml.
Protein concentration: 0.300 ± 0.119 ng/ml (mean ± s.d.; n = 3)
Why n -1 ?
Degrees of freedom
Statistical concept: the number of values that are free to vary

Example
If you know the mean and n = 3, then:
• The first measurement can be any number
• The second measurement can be any number
• The third measurement is then fixed: it must be a given number

Bad news:
• A difficult concept; difficult to determine in many experiments
Good news:
• Software such as Graphpad Prism and Microsoft Excel can help
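The n = 3 example above can be sketched in a few lines of Python. This is only an illustration: the mean and the first two values are taken from the protein concentration example, and the variable names are not from the slides.

```python
# Known sample mean and sample size (values from the protein example)
mean = 0.300
n = 3

# The first two observations are free to take any value...
x1, x2 = 0.245, 0.218

# ...but the third is then fixed, because the values must sum to n * mean
x3 = n * mean - x1 - x2
print(f"x3 = {x3:.3f}")
```

Only two of the three values were free to vary, which is why the sample has n − 1 = 2 degrees of freedom.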
The standard error of the mean
Example
Estimating the protein concentration in a cell lysate;
three samples; measured values 0.245, 0.218, 0.437 (ng/ml)

The standard error of the mean (SEM) quantifies the precision of the mean: how
precisely you know the true mean of the population; it is expressed in the same units
as the data.
The formula to find the standard error of the sample mean:

$$SE = \frac{s}{\sqrt{n}}$$

Note: $\sigma / \sqrt{n}$ is the standard error of the mean of the population.
The standard error of the mean is: 0.069 ng/ml.
Protein concentration: 0.300 ± 0.069 ng/ml (mean ± s.e.m.; n = 3)
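As a cross-check, the mean, standard deviation and standard error of the three example values can be computed with Python's standard library. This is a minimal sketch; the variable names are illustrative and not from the slides. Note that `statistics.stdev` uses the n − 1 denominator, matching the sample formula.

```python
import math
import statistics

# Measured protein concentrations (ng/ml) from the example
values = [0.245, 0.218, 0.437]

n = len(values)
mean = statistics.mean(values)   # arithmetic mean
sd = statistics.stdev(values)    # sample s.d. (n - 1 denominator)
sem = sd / math.sqrt(n)          # standard error of the mean

print(f"mean   = {mean:.3f} ng/ml")
print(f"s.d.   = {sd:.3f} ng/ml")
print(f"s.e.m. = {sem:.3f} ng/ml (n = {n})")
```

With the n − 1 formula this gives s ≈ 0.119 ng/ml and SE ≈ 0.069 ng/ml for a mean of 0.300 ng/ml.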
Standard deviation or standard error?
Scenario 1
Blood pressure was monitored over 24 hours in 14 healthy volunteers.
Average (mean) blood pressure: 110/72.
Intrinsic/physical variation
Typically use standard deviation

Scenario 2
Protein concentration of cell lysate: 0.300 ng/ml (n = 3)
Experimental variation
Typically use standard error of the mean
Useful to know…
• The s.e.m. is always smaller than the s.d. (for n > 1)
• Increasing the sample size n:
• will not always decrease the s.d.
• will normally decrease s.e.m.
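The effect of sample size can be illustrated with a small simulation. This is a sketch under stated assumptions: the population mean (100) and standard deviation (10) are made-up values, and the function name is hypothetical.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

def sd_and_sem(n, mu=100.0, sigma=10.0):
    """Draw n values from a normal distribution; return (s.d., s.e.m.)."""
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sd = statistics.stdev(sample)
    return sd, sd / math.sqrt(n)

results = {n: sd_and_sem(n) for n in (10, 100, 1000, 10000)}
for n, (sd, sem) in results.items():
    print(f"n = {n:>5}: s.d. = {sd:6.2f}, s.e.m. = {sem:5.2f}")
```

As n grows, the s.d. settles near the population value rather than shrinking, while the s.e.m. keeps falling in proportion to 1/√n.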
Standard deviation or standard error?
• Comparison of two groups of mice fed high-calorie diet; one
group treated with compound NCE1 and one control group
• standard deviation
• Determining the melting point of NCE1
• standard error of the mean
• Determining the inhibitory concentration IC50 of Flap
endonuclease 1 by compound NCE2
• standard error of the mean
• Determining the yield (mg/litre) of purified protein per litre of
culture
• standard deviation
The normal distribution, the mean and
the standard deviation
$\mu$: mean
$\sigma$: standard deviation

You measure a value within 2 standard deviations from the mean with 95.4% probability
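The 95.4% figure can be checked numerically. For a normal distribution, the probability of a value falling within k standard deviations of the mean is erf(k/√2); the short sketch below evaluates it with Python's standard library (the function name is illustrative).

```python
import math

def prob_within(k):
    """Probability that a normally distributed value lies within
    k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} s.d.: {prob_within(k) * 100:.1f}%")
```

This gives 68.3%, 95.4% and 99.7% for 1, 2 and 3 standard deviations.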
The normal distribution, the mean and
the standard deviation
The standard deviation
determines the shape of the
bell curve.

When doing real lab experiments, how does this
relate to precision and accuracy?
Quick catch up
• Descriptive statistics
• Source of errors, sampling, mean, standard error, standard deviation
• You should now be able to:
• Understand the terms population, sample and observation in statistics
• Reflect on your sample size n
• Reflect on the sources of errors in your experiments
• Be confident about showing the standard error or standard deviation
• Not be confused when seeing
• $\bar{x}$ or $\mu$ used for the mean; s or $\sigma$ used for standard deviation
Descriptive stats with Microsoft Excel
• Microsoft Excel
• Built-in formulas for:
• Mean: =AVERAGE(range)
• Standard deviation: =STDEV(range)
• No built-in formula for standard error
• =STDEV(range)/SQRT(COUNT(range))
• You can also use the Data Analysis pack (needs to be installed
separately; not part of the default installation in Excel)

• E.g. if your data is in cells A1, A2, A3, and A4, then the range is A1:A4
2. Statistical testing
• Student t test
• One-tailed, two-tailed
• ANOVA – Analysis of variance
• one-way ANOVA, two-way ANOVA
• Post hoc tests
Student’s t test
The most commonly used statistical test?
Answers question: What is the probability that the means a and b are from the
same population?

‘The probable error of a mean’ published in 1908 by William Sealy Gosset, who
worked for the Guinness brewery in Dublin, Ireland

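One-tailed or two-tailed t test: tests for a difference in one direction only, or in either direction
Paired t test: matched samples
e.g. ‘before’ and ‘after’ treatment
Unpaired t test: independent samples
e.g. two groups, one receiving treatment, and a control group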
Student’s t-test

What is the probability that the means a and b are from the same
population?

Scenario 1
p < 0.05: generally accepted as significant
[Figure: means µa and µb]

Scenario 2
p > 0.05: not significant
[Figure: means µa and µb]
Student’s t-test
The test is based on the t-statistic.
• Step 1: Calculate means $\bar{x}_1$ and $\bar{x}_2$ as well as standard deviations
$s_1$ and $s_2$ of the two samples with number of observations $n_1$ and $n_2$
• Step 2: Calculate the value of 𝑡
• Step 3: Using the degrees of freedom 𝑑𝑓, you can find the probability
that the two means are from the same population using a table with t-
statistics
• It can be done without a computer (some supervisors may remember
how to do this)
Student’s t-test
Step 1: The t-statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Where $\bar{x}$: mean of the sample
$s$: standard deviation
$n$: number of observations

Step 2: Work out the degrees of freedom:

$$df = (n_1 - 1) + (n_2 - 1)$$
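The two formulas above can be sketched in Python using only the standard library. The function name and the measurements are hypothetical and serve only as an illustration; looking up the probability in a t table (or with statistics software) remains a separate step.

```python
import math
import statistics

def t_statistic(group1, group2):
    """Return (t, df) using the formulas for the unpaired t-test above."""
    n1, n2 = len(group1), len(group2)
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    # statistics.variance returns s squared (n - 1 denominator)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    t = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
    df = (n1 - 1) + (n2 - 1)
    return t, df

# Hypothetical measurements, for illustration only
control = [4.8, 5.1, 5.0, 4.9, 5.2]
treated = [5.6, 5.9, 5.7, 6.0, 5.8]

t, df = t_statistic(treated, control)
print(f"t = {t:.2f}, df = {df}")
```

For these made-up values t = 8.00 with df = 8. A standard t table lists a critical value of 2.306 for df = 8 at two-tailed p = 0.05, so this difference would be reported as p < 0.05.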
Student’s t-test
Step 3: the table
Student’s t-test
Reporting probability P
(written as P, p or italic p)
P < 0.05, or p < 0.05, or
p < 0.01

A computer can calculate the exact P value.
Some scientists feel you should
always report the precise value, but
there is no consensus.
One-tailed / two-tailed and paired t-test
Use a two-tailed t-test for simple comparison between experimental
values (observation and control).
Practical advice: use the two-tailed test by default.
When is a one-tailed t-test appropriate?
• When you expect the treatment to change the value in one direction only
• This is often not accepted as a good enough justification

Paired t-test:
• Use when you have measurements that are matched, e.g. data point
taken before and after treatment
• Heart rate before and after exercise
• Conductivity across membrane before and after treatment with compound
In a paired t-test, the sample size of both groups should be
identical: $n_1 = n_2$
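A paired t-test works on the differences between matched measurements: t is the mean difference divided by the standard error of the differences, with n − 1 degrees of freedom. The sketch below assumes this standard formulation (it is not spelled out on the slides), and the heart rate values are made up.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, df) for a paired t-test on matched measurements."""
    if len(before) != len(after):
        raise ValueError("paired data need n1 == n2")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(n)  # s.e.m. of the differences
    return statistics.mean(diffs) / sem, n - 1

# Hypothetical heart rates (beats/min) before and after exercise
before = [62, 70, 66, 74, 68]
after = [75, 86, 77, 90, 82]

t, df = paired_t(before, after)
print(f"t = {t:.2f}, df = {df}")
```

Because each subject acts as its own control, the variation between subjects drops out and only the variation in the differences matters.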
Student’s t-test
Take home message:
Use a two-tailed t-test for simple comparison between experimental
values (treatment and control).
Use a paired t-test when you have matched measurements
When is a one-tailed t-test appropriate? Rarely
The t-test assumes normal distributions; the spread of values in both
groups should be comparable (homogeneity of variance).
Student’s t-test
Advantage:
Robust test even if assumptions are not completely valid

Limitation:
The t-test is designed for the comparison of the means of two
populations. So, what do you do when comparing the means of many
groups?
The risk of a ‘type 1 error’ (false positive) increases with each additional t-test.
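How quickly false positives accumulate can be estimated with a short calculation. The sketch below assumes a threshold of 0.05 per test and treats the tests as independent, which is only an approximation for pairwise comparisons that share groups; the function name is illustrative.

```python
from math import comb

alpha = 0.05  # significance threshold per test

def familywise_error(n_tests, alpha=alpha):
    """Chance of at least one false positive in n independent tests."""
    return 1 - (1 - alpha) ** n_tests

for groups in (2, 3, 4, 5):
    n_tests = comb(groups, 2)  # number of pairwise comparisons
    print(f"{groups} groups: {n_tests} t-tests, "
          f"risk of a false positive = {familywise_error(n_tests):.0%}")
```

With five groups there are already ten pairwise t-tests, and under these assumptions the chance of at least one false positive is about 40%.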
Tools for t testing
• Microsoft Excel
• Built-in formulas and tools in the Data Analysis pack (needs to be
installed separately; not part of the default installation in Excel)
• Graphpad Prism
• University license available; needs to be installed separately (license
key can be requested)
• Graphpad online tool
• Useful for quick results, simple data.
Analysis of variance (ANOVA)
• Test to compare the means of three or more groups
• Similar assumptions as for t test (normality, similar variance in
samples)
• F statistic
• Variance within samples and variance between samples
Analysis of variance (ANOVA)
• F statistic

• If F statistic is sufficiently large, you can conclude that the sample means are not (all)
derived from the same population
• So, which sample(s) are different?
• Post hoc testing
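The slides do not give the formula for the F statistic, so the following is a sketch of the standard one-way ANOVA calculation: the variance between the group means divided by the variance within the groups. The function name and the data are hypothetical.

```python
import statistics

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = statistics.mean([x for g in groups for x in g])
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical enzyme activities at three pH values
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
f, df_b, df_w = one_way_f(groups)
print(f"F = {f:.1f} (df = {df_b}, {df_w})")
```

For these made-up values F = 19.0 with 2 and 6 degrees of freedom. Whether that is "sufficiently large" is then read from an F table or reported by the software as a p value.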
Analysis of variance (ANOVA)
• Post-hoc testing
• Carried out after calculating the F statistic
• Tests based on t statistic with correction to avoid false positives (type 1
errors)
• Recommended tests:
• Tukey test: use when you want to compare all means with all other
means
• Dunnett’s test: use when you want to compare all means with one
control group
One-way/two-way ANOVA
One-way ANOVA
Use one-way ANOVA when comparing many (n ≥ 3) means in one group
• Example 1: what is optimal pH for enzyme?
• Example 2: which formulation delivers best RNAi?

Two-way ANOVA
Use two-way ANOVA when comparing many (n ≥ 3) means in two groups
• Example 1: compare drug treatments in two cell lines
• Example 2: Determine optimal pH under high and low salt conditions
ANOVA
Take home message:
1. If you want to compare the means of three or more groups, decide
which ANOVA is suitable; one-way or two-way?
2. Calculate F statistic: is variance between samples greater than
variance within samples?
3. To identify which groups are different, use a post hoc test
Tukey test: compare every mean with every other mean (compare all
groups with each other)
Dunnett test: compare every mean with one mean (control)
Tools for ANOVA
• Microsoft Excel
• Built-in formulas and tools in the Data Analysis pack (needs to be
installed separately; not part of the default installation in Excel)
• Limitation: post hoc tests not built in
• Graphpad Prism
• University license available; needs to be installed separately (license
key can be requested)
3. Error propagation
Now it gets (even more) complicated:
What do you do when you do calculations with two or more numbers
that each have variability?
Examples:
• RT-qPCR: expression of a gene ESR1 relative to that of the housekeeping gene
GAPDH
• Determining the concentration of a solution using 5.00 g NaOH and 100.00 ml
water
• Conversion of an observed quantity using a formula (e.g. absorbance to
concentration using the Beer-Lambert law)
• Normalisation
Error propagation: multiplication
Multiplication and division:
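Say you have two values with a standard deviation: $\bar{a} \pm s_a$ and $\bar{b} \pm s_b$
If $x = a \times b$ then the standard deviation of x can be determined:

$$\left(\frac{s_x}{\bar{x}}\right)^2 = \left(\frac{s_a}{\bar{a}}\right)^2 + \left(\frac{s_b}{\bar{b}}\right)^2$$

$$\frac{s_x}{\bar{x}} = \sqrt{\left(\frac{s_a}{\bar{a}}\right)^2 + \left(\frac{s_b}{\bar{b}}\right)^2}$$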

The relative standard deviation is given by the square root of the sum
of the squares of all relative standard deviations.
Error propagation: multiplication
Multiplication and division:
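If you have multiple values, then continue adding the squares
of the relative standard deviations.
E.g. if $x = \frac{a \times b}{c}$ then:

$$\left(\frac{s_x}{\bar{x}}\right)^2 = \left(\frac{s_a}{\bar{a}}\right)^2 + \left(\frac{s_b}{\bar{b}}\right)^2 + \left(\frac{s_c}{\bar{c}}\right)^2$$

$$\frac{s_x}{\bar{x}} = \sqrt{\left(\frac{s_a}{\bar{a}}\right)^2 + \left(\frac{s_b}{\bar{b}}\right)^2 + \left(\frac{s_c}{\bar{c}}\right)^2}$$

Bad news:
No built-in formulas in Microsoft Excel or Graphpad Prism!!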
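The rule for multiplication and division is short enough to write as a small helper. This is a sketch, not a tool mentioned in the slides: the function name and the values of a, b and c are hypothetical, and the rule assumes the errors in the individual values are independent.

```python
import math

def propagate_product(x, *terms):
    """Return the s.d. of x, where x is a product/quotient of the terms.

    Each term is a (mean, s.d.) pair; relative s.d.s add in quadrature.
    """
    rel = math.sqrt(sum((s / m) ** 2 for m, s in terms))
    return abs(x) * rel

# Hypothetical values: x = a * b / c
a, s_a = 10.0, 0.3
b, s_b = 5.0, 0.2
c, s_c = 2.0, 0.08

x = a * b / c
s_x = propagate_product(x, (a, s_a), (b, s_b), (c, s_c))
print(f"x = {x:.1f} ± {s_x:.1f}")
```

Here the relative standard deviations are 3%, 4% and 4%, which combine to about 6.4% of x = 25, i.e. roughly ± 1.6.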
Error propagation: addition
Addition and subtraction:
If you have multiple values with standard deviation, then you add the
squares of the standard deviations.
E.g. if $x = a + b + c$ then:

$$s_x^2 = s_a^2 + s_b^2 + s_c^2$$

$$s_x = \sqrt{s_a^2 + s_b^2 + s_c^2}$$

Bad news again!
No built-in formulas in Microsoft Excel or Graphpad Prism!!
Error propagation: exact number
Multiplication/division using an exact number:
Multiply/divide the standard deviation by the number.
E.g. You determine the mean using three biological replicates, each
with three technical replicates

$$x = \frac{a + b + c}{3}$$

Let $d = a + b + c$ then:

$$s_d^2 = s_a^2 + s_b^2 + s_c^2$$

$$s_d = \sqrt{s_a^2 + s_b^2 + s_c^2}$$

and

$$s_x = \frac{s_d}{3}$$
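The biological replicate example can be worked through in Python by first applying the addition rule and then dividing by the exact number 3. The replicate values and the function name below are hypothetical and only illustrate the calculation.

```python
import math

def sd_of_sum(*sds):
    """s.d. of a sum or difference: absolute s.d.s add in quadrature."""
    return math.sqrt(sum(s ** 2 for s in sds))

# Hypothetical biological replicates as (mean, s.d.) of technical replicates
a, s_a = 1.02, 0.03
b, s_b = 0.95, 0.04
c, s_c = 1.10, 0.12

d = a + b + c
s_d = sd_of_sum(s_a, s_b, s_c)

x = d / 3      # 3 is an exact number...
s_x = s_d / 3  # ...so the s.d. is simply divided by it
print(f"x = {x:.3f} ± {s_x:.3f}")
```

The largest individual standard deviation (0.12) dominates the result: s_d = 0.13, so s_x ≈ 0.043.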
Error propagation: final comments
Complex formulas
• combine the rules for multiplication/division and addition/subtraction
There are additional rules
• logarithms, power calculations, anti-logs etc
Error propagation is very common
• Even relatively simple experiments quickly become very complex
Tools for error propagation
• Graphpad online tool
• Online error propagation calculator
• Graphpad Prism
• Enter data (x,y) as mean, s.d. (or s.e.m.) and n
• Microsoft Excel
• No built-in formulas
Final comments
Error propagation is very common
Even simple experiments quickly become quite complex; it is ok to ignore
some errors (as long as you know what you are doing)
Look at the data; be conservative.
For example, do not inflate n when you combine technical and biological
replicates unless you feel it is justified
Graphpad Prism and Microsoft Excel can be a great help,
but use them with caution; it is not always straightforward to calculate p
values
Talk to others; speak to specialist statisticians when necessary
