Stats Lecture1
Lecture 1: Introduction
Dr Stephen Sawiak
These lectures aim to give an overview of the principles and tools for successful experiments without focusing on
mathematical details
It is important to understand the basis of how the methods work and the assumptions they make, so that they are used appropriately
These six lectures
Topics range from the basics for beginners to advanced modelling
Dedicated statistics packages exist for some methods (e.g. R, Matlab), and there are many web-based tutorials
Most important thing is to know what methods are available and when each is appropriate:
let software do the hard work
These lectures are not about software but I will point out commands / methods or go through some
standard output where it is helpful
This Lecture
Diet regime A
weightsA=[376.9,411.1,416.6,367.3,393.2,190.5,446.0,401.0,433.5,366.6,180.4,399.8];
Diet regime B
Imagine a huge supply of rats you can sample as much as you want, as many times as you want
$\mathrm{SEM} = \dfrac{s}{\sqrt{n}}$
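The standard error can be checked numerically. The following is a minimal Python sketch (the lecture's own snippets are Matlab) that computes the SEM of the diet A weights and then imitates the "huge supply of rats" by repeatedly drawing samples of the same size. The normal population, its parameters (taken from the sample), the seed and the number of repeats are all assumptions made for illustration.

```python
import math
import random
import statistics

# Diet regime A weights (grams), copied from the lecture
weights_a = [376.9, 411.1, 416.6, 367.3, 393.2, 190.5,
             446.0, 401.0, 433.5, 366.6, 180.4, 399.8]

n = len(weights_a)
s = statistics.stdev(weights_a)   # sample standard deviation (n - 1 denominator)
sem = s / math.sqrt(n)            # standard error of the mean

# Assumption: the "huge supply of rats" is a normal population with the
# same mean and standard deviation as the sample.
rng = random.Random(1)
population_mean = statistics.mean(weights_a)
sample_means = [
    statistics.fmean([rng.gauss(population_mean, s) for _ in range(n)])
    for _ in range(5000)
]
spread_of_means = statistics.stdev(sample_means)

print(f"SEM from the formula:        {sem:.1f} g")
print(f"SD of simulated sample means: {spread_of_means:.1f} g")
```

The two printed values should be close: the SEM describes how much the sample mean would vary if the experiment were repeated.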
Scatterplot
Median, percentiles and mode
Median: sort the data, the median is the value in the middle (here just 9 points)
[Figure: the sorted weights labelled by rank (0 to 9) and the corresponding percentile (0, 11, 22, 33, 44, 55, 67, 78, 89, 100)]
Mean: 343 g, biased downwards by the two unexpectedly light rats
Mode: most common value (367g)
The median is the 50th percentile. A value can be calculated at any position: the 0th and 100th percentiles are the
lowest and highest values, and here the 33rd percentile is approximately 367 g.
Percentiles can be more robust to outliers than the mean
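These summary statistics can be computed directly. The sketch below is in Python and uses all 12 diet A weights listed earlier, so the mean and median will not match the 9-point example on the slide. The `percentile` helper is a hypothetical function written for this illustration; it interpolates linearly between ranks, which is one of several common conventions.

```python
import statistics

# Diet regime A weights (grams), copied from the lecture
weights_a = [376.9, 411.1, 416.6, 367.3, 393.2, 190.5,
             446.0, 401.0, 433.5, 366.6, 180.4, 399.8]

def percentile(data, p):
    """Value at percentile p (0 to 100), interpolating linearly between ranks."""
    ordered = sorted(data)
    position = (len(ordered) - 1) * p / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

mean_a = statistics.mean(weights_a)
median_a = statistics.median(weights_a)
# The mode only makes sense for repeated values, so round to whole grams first
mode_a = statistics.mode([round(w) for w in weights_a])

print(f"mean   = {mean_a:.1f} g")
print(f"median = {median_a:.1f} g")
print(f"mode   = {mode_a} g (weights rounded to whole grams)")
for p in (0, 25, 50, 75, 100):
    print(f"{p:3d}th percentile = {percentile(weights_a, p):.1f} g")
```

The two light rats pull the mean well below the median, which is the robustness point made above.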
Boxplots
[Figure: annotated box plot. From top to bottom: max, 75th percentile, median, 25th percentile, min; a + marks an outlier]
boxplot([weightsA weightsB])
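The numbers behind a box plot can be sketched in Python as follows. The rule for flagging outliers (points more than 1.5 times the interquartile range beyond the box) is an assumption here: it is a common default, but the lecture does not state which rule its plot uses, and quartile conventions differ slightly between packages.

```python
import statistics

# Diet regime A weights (grams), copied from the lecture
weights_a = [376.9, 411.1, 416.6, 367.3, 393.2, 190.5,
             446.0, 401.0, 433.5, 366.6, 180.4, 399.8]

ordered = sorted(weights_a)
# 25th, 50th and 75th percentiles (the box and its centre line)
q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
iqr = q3 - q1

# Assumed outlier rule: beyond 1.5 * IQR from the edges of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [w for w in ordered if w < lower_fence or w > upper_fence]
inliers = [w for w in ordered if lower_fence <= w <= upper_fence]

# Whiskers reach the most extreme points that are not outliers
whisker_low, whisker_high = inliers[0], inliers[-1]

print(f"box: {q1:.1f} to {q3:.1f} g, median {median:.1f} g")
print(f"whiskers: {whisker_low:.1f} to {whisker_high:.1f} g")
print(f"outliers: {outliers}")
```

Under this rule the two light rats are drawn as individual outlier markers rather than stretching the lower whisker.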
Normal distribution
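[Figure: normal distribution curve with the standard deviation marked. 68.3% of values lie within one standard deviation of the mean, and 13.6% lie between one and two standard deviations on each side]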
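The areas under the normal curve can be checked with Python's standard library. This is a small sketch using `statistics.NormalDist`; the variable names are chosen for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Area within one standard deviation either side of the mean
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

# Area between one and two standard deviations, on one side only
between_one_and_two_sd = standard_normal.cdf(2) - standard_normal.cdf(1)

print(f"within 1 SD of the mean:         {within_one_sd:.1%}")
print(f"between 1 and 2 SD (each side):  {between_one_and_two_sd:.1%}")
```

The first value is about 68.3% and the second about 13.6%.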
Cumulative probability distribution
Cumulative probability    Standard deviations from the mean
90%                       +1.3
50%                        0
31%                       -0.5
10%                       -1.3
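The cumulative values above can be reproduced, and inverted, with a short Python sketch. `cdf` gives the probability of falling below a given number of standard deviations, and `inv_cdf` goes the other way.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Probability of a value lying below z standard deviations from the mean
for z in (1.3, 0.0, -0.5, -1.3):
    print(f"z = {z:+.1f}   cumulative probability = {standard_normal.cdf(z):.0%}")

# The inverse question: which z leaves 90% of values below it?
z_for_90_percent = standard_normal.inv_cdf(0.90)
print(f"90% of values lie below z = {z_for_90_percent:.2f}")
```

The exact z for 90% is about 1.28, which the figure rounds to 1.3.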
Is my data normally distributed?
qqplot(weightsA)
How bad does the plot have to be to panic?
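The idea behind a QQ-plot can be sketched in Python without any plotting. Each sorted weight is paired with the value a normal distribution would be expected to produce at the same rank; if the data were normal, the pairs would lie close to a straight line. The `qq_points` helper and the plotting positions (i + 0.5) / n are assumptions made for this illustration, and other packages may use slightly different conventions.

```python
from statistics import NormalDist, mean, stdev

# Diet regime A weights (grams), copied from the lecture
weights_a = [376.9, 411.1, 416.6, 367.3, 393.2, 190.5,
             446.0, 401.0, 433.5, 366.6, 180.4, 399.8]

def qq_points(data):
    """Pair each sorted value with the normal quantile expected at its rank."""
    ordered = sorted(data)
    n = len(ordered)
    # Normal distribution with the same mean and standard deviation as the data
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Assumed plotting positions: (i + 0.5) / n for ranks i = 0 .. n - 1
    expected = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, ordered))

for expected, observed in qq_points(weights_a):
    print(f"expected {expected:6.1f} g   observed {observed:6.1f} g")
```

Large gaps between the expected and observed columns, here at the light end, are what show up as departures from the straight line.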
If all the rats were the same, what is the probability of the mean of rat “A” weights
being greater than rat “B” weights by at least as much as observed?
Simulated null experiments
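Assume we draw 10 samples from the normal distribution closest to these data and call them “rats A”.
Draw 12 more samples from the same distribution and call them “rats B”. Calculate the difference in their means. Repeat 1 million times.
The fraction of repeats in which the difference is at least as large as the one observed gives p = 0.17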
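The simulated null experiment can be sketched in Python as below. The diet B weights are not listed in this text, so the numbers passed in at the bottom (`mu`, `sigma`, `observed_diff`) are hypothetical placeholders and the result is not expected to reproduce p = 0.17. The function name `simulate_null_p`, the seed and the reduced number of repeats (20,000 rather than 1 million, to keep it fast) are also assumptions.

```python
import random
import statistics

def simulate_null_p(mu, sigma, n_a, n_b, observed_diff, n_sim=20000, seed=0):
    """Fraction of null experiments where mean(A) - mean(B) >= observed_diff."""
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_sim):
        # Both groups come from the same normal distribution (the null hypothesis)
        mean_a = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n_a)])
        mean_b = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n_b)])
        if mean_a - mean_b >= observed_diff:
            at_least_as_large += 1
    return at_least_as_large / n_sim

# Hypothetical inputs for illustration only, not the lecture's actual values
p_value = simulate_null_p(mu=365.0, sigma=85.0, n_a=10, n_b=12, observed_diff=30.0)
print(f"simulated one-tailed p = {p_value:.3f}")
```

With the real group means and the fitted normal distribution in place of the placeholders, the same loop gives the kind of p-value quoted in the lecture.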
p-values
The p-value gives the probability of a result at least as extreme as the one observed, if the null hypothesis were true
One-tailed tests only consider one direction (an increase or decrease but not both), two-tailed tests consider
that the effect could have been in either direction
Widely used (with some controversy): best used as an indicator of whether your results are “worth another
look”
For the rat data seen in this lecture, p = 0.17. These data do not offer good evidence to ‘reject the null
hypothesis’ that the diet regimes are the same.
Conventionally, an arbitrary cut-off of p < 0.05 is often used to decide whether the null hypothesis is
reasonably consistent with the data.
Lower p-values suggest less confidence in the null hypothesis as an explanation for the data.
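The difference between one-tailed and two-tailed p-values can be illustrated with a short Python sketch. It treats the population standard deviation as known, so the difference in means under the null hypothesis is exactly normal; that simplification, the function name `null_p_values` and all the input numbers are assumptions for illustration and are not taken from the rat data.

```python
import math
from statistics import NormalDist

def null_p_values(observed_diff, sigma, n_a, n_b):
    """One- and two-tailed p-values for a difference in means, sigma assumed known."""
    # Standard error of mean(A) - mean(B) when both groups share the same sigma
    standard_error = sigma * math.sqrt(1 / n_a + 1 / n_b)
    null = NormalDist(0.0, standard_error)
    # One-tailed: A exceeds B by at least the observed amount
    one_tailed = 1.0 - null.cdf(observed_diff)
    # Two-tailed: the means differ by at least this much in either direction
    two_tailed = 2.0 * (1.0 - null.cdf(abs(observed_diff)))
    return one_tailed, two_tailed

# Hypothetical inputs for illustration only
p_one, p_two = null_p_values(observed_diff=40.0, sigma=80.0, n_a=10, n_b=12)
print(f"one-tailed p = {p_one:.3f}")
print(f"two-tailed p = {p_two:.3f}")
```

For a symmetric null distribution and a positive observed difference, the two-tailed value is twice the one-tailed value, so the choice of test should be made before looking at the data.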
Summary
Rat diet case study to explore data: the mean, median and mode
Introduction to the standard error and the normal distribution
Plots: bar graphs, histograms, scatter plots, error bars
Testing for normally distributed data with QQ-plots
Hypothesis testing and p-values
Next time: all about t-tests and confidence intervals, comparing means