2 Hypothesis testing
A project starts with a theory, which you then break down into testable hypotheses. Next you design an
experiment to generate the data that are required to test the hypotheses. After the data are collected,
statistics is used to test the hypotheses. In our Tet example, the theory is that Tet genes oxidize 5mC
to 5hmC, and we were wondering to what extent Tet1 and Tet2 can compensate for one another. Assuming
that 5hmC levels should be proportional to overall Tet catalytic activity, we can create the following
hypotheses:
1. If Tet1 and Tet2 both contribute to 5hmC levels, both single mutants should have lower 5hmC
levels than the wildtype.
2. If Tet1 and Tet2 can compensate for each other, the single mutants will have similar levels of
5hmC as the wildtype, but the double mutant will be lower.
3. If either Tet1 or Tet2 is mainly responsible for 5hmC levels, only one of the single mutants will
show a difference from the wildtype.
The experimental design involved single and double mutants of Tet1 and Tet2, multiple clones
per genotype as biological replicates, and many technical replicates, in that 5hmC
levels were measured for thousands of cells (Table 1.1). You already got to know the data set
hmC.measurements.txt and had a first look at it, based on which you could already exclude some
hypotheses. The more promising hypotheses need more formal statistical testing.
Usually, we formulate a null hypothesis H0 and then test whether the data allow us to reject H0
in favour of an alternative hypothesis H1. The p-value is the probability of observing data at least
as extreme as ours, given that H0 is true. In other words, our classic p-value threshold of < 0.05
means that, if H0 were true, there would be a less than 5% chance of obtaining data as extreme as ours.
Note that not being able to reject H0 does not mean that H0 is true. Often, we
cannot reject H0 because the power of our test is too low.
The power of a statistical test is the probability that it will yield statistically significant results. —
J.Cohen
Figure 2.1: The means and the difference in means (∆ = 5) are the same for all three plots, the only parameter
that changes is the standard deviation of the normal distributions.
Consider the three comparisons in Figure 2.1: in the first plot with low variability the blue and the
red distribution are rather distinct, less so in the second plot, and hardly at all in the third with high
variability. However, the means remain the same in all three plots: the mean of the blue distribution is
always 5 and the mean of the red is always 10. This illustrates the importance of the variance of the
distributions and why it is necessary to account for the variance in the significance test. The t-test
does this by calculating a signal-to-noise ratio.
The signal is the difference in means. This is also the parameter that we are testing and one of the
assumptions of the t-test is that the difference in means is normally distributed, which is definitely true
if the samples are normal. Hence, it is important to check that the sample measures are approximately
normal before using a t-test. The noise is a function of the variance estimates and the sample sizes.
Note that if we can assume that the groups only differ in their means but have the same variances,
which would be a justified assumption if the variance is mostly technical, then Equation 2.1 simplifies
and we gain power, because we can now estimate the variance by pooling the samples from both groups.
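Written out in the usual notation (a sketch for orientation; the exact typography of Equation 2.1 in the text may differ), the two-sample statistic and its pooled simplification are

t = (x̄1 − x̄2) / sqrt(s1²/n1 + s2²/n2)    (unequal variances)

t = (x̄1 − x̄2) / (sp · sqrt(1/n1 + 1/n2)),    with sp² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)    (pooled)

where x̄i, si² and ni denote the mean, variance estimate and sample size of group i. The pooled estimate sp² uses all n1 + n2 observations, which is where the gain in power comes from.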
In our Tet example, it is fair to assume that the variance for the comparison between two clones
should be the same. Let’s test whether hmC levels differ between the wildtype and the Tet1 mutant
clone G2. Furthermore, in your homework you should have noted that log-transformation makes the
hmC-intensities more normal.
# log transform hmC.intensity -- more normal
hmC_measurement <- hmC_measurement %>% mutate(log_hmC = log10(hmC.intensity))
# check
glimpse(hmC_measurement)
## Rows: 8,744
## Columns: 6
## $ row <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
## $ clone <chr> "Tet1 D7", "Tet1 D7", "Tet1 D7",...
## $ hmC.intensity <dbl> 4382.31, 5182.50, 6948.82, 8963....
## $ genotype <chr> "Tet1", "Tet1", "Tet1", "Tet1", ...
## $ DAPI.intensity <dbl> 607376.1, 1120480.2, 1084091.8, ...
## $ log_hmC <dbl> 3.641703, 3.714539, 3.841911, 3....
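The counts and t-test output below come from calls along these lines (a sketch: the subset wt_g2 of the wt and Tet1 G2 clones and the equal-variance option var.equal = TRUE are assumptions; the data here are a simulated stand-in, since hmC.measurements.txt is not reproduced in this text):

```r
# Simulated stand-in for the wt vs Tet1 G2 subset (group sizes and
# rough means/sds taken from the output below; purely illustrative).
set.seed(42)
wt_g2 <- data.frame(
  clone   = rep(c("Tet1 G2", "wt"), times = c(992, 1196)),
  log_hmC = c(rnorm(992, 3.83, 0.15), rnorm(1196, 3.87, 0.15))
)

table(wt_g2$clone)                                        # group sizes
t.test(log_hmC ~ clone, data = wt_g2, var.equal = TRUE)   # equal-variance t-test
```

With var.equal = TRUE, the degrees of freedom are n1 + n2 − 2 = 992 + 1196 − 2 = 2186, matching the output below.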
##
## Tet1 G2 wt
## 992 1196
##
## Two Sample t-test
##
## data: log_hmC by clone
## t = -5.5549, df = 2186, p-value = 3.116e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05036956 -0.02408489
## sample estimates:
## mean in group Tet1 G2 mean in group wt
## 3.827900 3.865127
Mean hmC-levels are higher in the wt ( x̄wt = 3.87) than in the Tet1 mutant (x̄tet1G2 = 3.83). The
difference in mean intensity is 0.037 ± 0.013, where the ± part gives the 95% confidence interval, which
excludes zero: thus we can confidently reject our null hypothesis with a p-value of 3.1e-08.
Next we can ask whether the hmC-levels differ between the Tet1-mutant genotype and the wild-
type. This comparison also includes some clonal variance, thus we need a t-test with unequal variances,
also called the Welch two-sample t-test.
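The Welch output below corresponds to a call like the following (a sketch: t.test() performs the Welch test by default, i.e. var.equal = FALSE; the data frame wt_tet1 with a genotype column is an assumption, illustrated here with simulated stand-in data):

```r
# Simulated stand-in for the wt vs Tet1 genotype comparison
# (group sizes and rough moments taken from the output below).
set.seed(1)
wt_tet1 <- data.frame(
  genotype = rep(c("Tet1", "wt"), times = c(1784, 1196)),
  log_hmC  = c(rnorm(1784, 3.83, 0.16), rnorm(1196, 3.87, 0.15))
)

# var.equal defaults to FALSE, i.e. the Welch two-sample t-test
t.test(log_hmC ~ genotype, data = wt_tet1)
```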
##
## Welch Two Sample t-test
##
## data: log_hmC by genotype
## t = -5.9925, df = 2692.4, p-value = 2.342e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04600180 -0.02331892
## sample estimates:
## mean in group Tet1 mean in group wt
## 3.830467 3.865127
All Tet-mutant genotypes have significantly lower hmC-levels than the wild-type.
So far we did two types of t-tests, but there are more:
Two-sample t-test Test whether the means of two samples differ by µ (usually µ = 0). Sample sizes can
differ, but the variance is assumed to be the same.
Paired t-test Test whether the mean difference between the two values of a pair differs from zero. If
the conditions under which the pairs were sampled differ, this will take care of the extra variance
and thus increase the power.
Wilcoxon test If the sample distribution is far from normal, the Wilcoxon test is a non-parametric
alternative to the t-test, i.e. a test that does not make a distribution assumption.
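As a quick illustration of the paired and non-parametric variants (a sketch on simulated data; the variable names before and after are made up):

```r
# Simulated paired measurements: each "after" value shares the
# pair-level baseline of its "before" value.
set.seed(7)
before <- rnorm(20, mean = 10, sd = 2)
after  <- before + rnorm(20, mean = 1, sd = 0.5)

t.test(after, before, paired = TRUE)       # removes the between-pair variance
wilcox.test(after, before, paired = TRUE)  # non-parametric alternative
```

An unpaired t-test on the same data would have to work against the large between-pair variance (sd = 2 instead of 0.5) and would therefore have much less power.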
Figure 2.2: Example ANOVA data, with four groups of 30 samples each, all drawn from normal distributions
with σ = 2. The true population mean is x̄1,2,4 = 5 for groups 1, 2 and 4, and x̄3 = 3 for group 3. The blue lines
are the estimates of the group means and the distances of the black dots from the blue lines are the residuals
(random errors) that measure the within-group variance. The between-group variance is measured as the
distance of the group means to the grand mean (dashed red line).
Analysis of variance (ANOVA) is a test for whether there is any difference in means among
an arbitrary number of groups (Ng ). In an ANOVA we compare the variance between group means
with the variance within groups. The variance between the group means x̄g is calculated from the differences
of the group means to the grand mean x̄tot , which is simply the mean across all samples n, ignoring
the group. If the samples of all groups came from the same population distribution, then the between
group variance should not be any different from the within group variance (Figure 2.2). Thus our
H0 here is that the ratio of variances should not be different from 1. Ratios of variances follow an
F-distribution, which is used to determine significance.
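The F-test can be reproduced on data simulated like Figure 2.2 (a sketch; group means and σ as in the figure caption, variable names are made up):

```r
# Four groups of 30 samples, sigma = 2; group 3 has a lower true mean.
set.seed(3)
anova_dat <- data.frame(
  group = factor(rep(1:4, each = 30)),
  y     = rnorm(120, mean = rep(c(5, 5, 3, 5), each = 30), sd = 2)
)

fit <- aov(y ~ group, data = anova_dat)
summary(fit)  # F = between-group mean square / within-group mean square
```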
Table 2.1: ANOVA table for the example data in Figure 2.2. An ANOVA table contains all the ingredients
to calculate within- and between-group variances. A sum of squares for the between-group variance (in regression
this is called the model SSM) and a sum of squares for the within-group variance (in regression the Residuals, or random
unexplained variance SSE) are calculated. For the sums of squares to be some kind of variance, we need to
consider the sample sizes; these are contained in the degrees of freedom (Df): we had four groups, so the between
Df is 3; each group has 30 samples, so the within Df is 4 × 30 − 4 = 116. SSM is the sum of squared differences between each group
mean x̄g and the grand mean x̄tot , which is the mean across all observations. SSE is the sum of squares between
the observations and the group means. The calculation of all the values in this table is explained in Equation
2.2.
Now, you understand how an ANOVA is calculated, but is this statistic valid for our data? There
are two major assumptions that need to be fulfilled in order to justify an ANOVA:
1. The residuals (i.e. SSE) should not be too far from a normal distribution: Normal Residuals
2. The variance of the residuals should not differ too much between groups: Homogeneity of
Variance
Let’s see whether the hmC-intensity measures fulfill these assumptions. We already had a look at
the distribution of the hmC-intensities for the t-test and found that log-transformation made them approx-
imately normal. If the samples are normal, the residuals are bound to be normal as well:
[Density plots of the residuals, coloured by clone: Tet1 Tet2 D7F12, G2B9, G2E3 and G2H9]
Figure 2.3: Checking the ANOVA assumptions of normal and homogeneous residuals. Aside from some outliers,
the distribution looks approximately normal, and the differences in the variance are only slight.
The residuals that we plotted in Figure 2.3 seem approximately normal, and the differences
in the variance among groups are not too big, hence an ANOVA is appropriate. There are also more
formal statistical tests for these assumptions. Normality can be tested using the Shapiro-Wilk test
shapiro.test(), and for homogeneity the Levene test can be used. Note that if the sample sizes are big, as
in our case, these tests are probably too sensitive, and already very slight deviations from H0 will
lead to a significant test statistic, in this example the rejection of the homogeneity assumption.
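Run more formally, the two assumption checks could look like this (a sketch on simulated data; note that base R offers bartlett.test() for variance homogeneity, while the Levene test itself lives in car::leveneTest()):

```r
# Sketch: formal assumption checks on a fitted one-way ANOVA
# (simulated data; four groups of 30, sigma = 2, as in Figure 2.2).
set.seed(5)
check_dat <- data.frame(
  group = factor(rep(1:4, each = 30)),
  y     = rnorm(120, mean = rep(c(5, 5, 3, 5), each = 30), sd = 2)
)
fit <- aov(y ~ group, data = check_dat)

shapiro.test(residuals(fit))                # H0: residuals are normal
bartlett.test(y ~ group, data = check_dat)  # H0: group variances are equal
```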
If you are worried about the skew in the distribution or the deviations from variance homogeneity
(= heteroscedasticity), there is the possibility to use robust tests that trim the 10 or 20% most
extreme values and base the test on the remaining ones. For more information on robust tests, grab
any statistics book and have a look at the R-package WRS2. A non-parametric alternative to an
ANOVA is the Kruskal-Wallis test kruskal.test().
The ANOVA tells us that the four Tet1/Tet2 clones differ in their hmC levels; this biological
variance between clones is important to consider in the experimental design. It would not have
been sufficient to assess the hmC levels of only one clone per genotype if we want to allow for a more
general interpretation.
summary(dm.anova)
Now that we have assessed that there is variation between clones, we would like to know which
ones differ. To this end we could go back to some post-hoc t-tests and adjust the p-values for the
FWER with a Bonferroni correction, i.e. lower the significance threshold α to α/m, where m is
the number of tests. An alternative post-hoc analysis is Tukey’s ’Honest Significant Difference’ test,
which is a t-test with a built-in p-value adjustment for the FWER.
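A single TukeyHSD() call on the fitted ANOVA object is enough (a sketch; dm.anova is the model fitted in the text, reproduced here with simulated stand-in data for four clones, one of which has a lower mean):

```r
# Simulated stand-in: four clones, one ("G2B9") with a lower true mean.
set.seed(9)
clone_dat <- data.frame(
  clone = factor(rep(c("D7F12", "G2B9", "G2E3", "G2H9"), each = 30)),
  y     = rnorm(120, mean = rep(c(5, 4, 5, 5), each = 30), sd = 1)
)
dm.anova <- aov(y ~ clone, data = clone_dat)

TukeyHSD(dm.anova)  # all 6 pairwise comparisons, FWER-adjusted p-values
```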
Looking at the TukeyHSD result, we notice that out of the 6 pair-wise comparisons 3 are significant
and all 3 significant comparisons contain the clone G2B9, which appears to have slightly lower hmC
levels than the other clones.
Significance threshold This is up to you: how certain do you need to be about your finding? What
is the number of False Positives that you can live with? This is of particular interest if you are
planning expensive follow-up experiments based on your statistic.
Effect size For a t-test or an ANOVA, the effect size is the difference in means that we are trying to
detect. We obviously stand a better chance of detecting a large difference than a small one.
Variance If the spread of a distribution is narrow or the measurement error is small, the power of
the test is higher than for large variance samples (Figure 2.1).
Sample Size The more samples we have the closer is the sample mean to the population mean, i.e.
our confidence in the measured mean value grows with the number of samples.
Figure 2.4: The factors that have an impact on the power of a statistical test. Power increases with effect and
sample size, while it decreases with increasing variance and the higher significance thresholds.
As you may have noticed, the only factor that you can freely manipulate is your sample size. The
effect size is an unknown parameter of interest; the largest part of the variance is usually of biological,
not of technical, nature and thus also out of your control; and finally, if you relax the significance
threshold too much, the whole test becomes meaningless due to an increase in the False Positive Rate.
Figure 2.5: Examples for an intermediate and a large effect size for the difference in the means of two normal
distributions with σ = 0.5.
Effect size is either measured as the difference between means (∆) or as a variance scaled effect
size. The most common effect size measure is Cohen’s d. You can think of Cohen’s d as the difference
in means in units of standard deviations (Figure 2.5).
d = |µ1 − µ2| / σ          (2.3)
where σ is the standard deviation that we would expect within one group. If we actually have
estimates for both groups, the best estimate for σ is the weighted mean of the two standard deviations.
If d ≈ 0.2 the effect is small, with d ≈ 0.5 we have a medium effect, and with d > 0.8 we can talk
about a large effect [1].
Let’s calculate Cohen’s d for the comparison of wildtype and Tet1-mutant hmC-levels:
wt_tet1.sum
## # A tibble: 2 x 4
## genotype mean sd n
## <chr> <dbl> <dbl> <int>
## 1 Tet1 3.83 0.162 1784
## 2 wt 3.87 0.150 1196
## [1] 0.2224025
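The printed d can be obtained from the group summaries above, for example like this (a sketch; using the pooled standard deviation as the common σ is one reasonable choice, and the small difference to the printed 0.2224025 comes from the rounded standard deviations):

```r
# Cohen's d from the group summaries (means, sds and ns as above).
m  <- c(Tet1 = 3.830467, wt = 3.865127)  # group means
s  <- c(0.162, 0.150)                    # group standard deviations
n  <- c(1784, 1196)                      # group sizes

sp <- sqrt(sum((n - 1) * s^2) / (sum(n) - 2))  # pooled sd
d  <- abs(diff(m)) / sp
d  # approximately 0.22
```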
With Cohen’s d of 0.22 the effect of a catalytic Tet1 mutant on hmC levels is small. Now let’s
see what power, i.e. what chance to detect the Tet1 effect, we would have had, if we had only 100
observations per group:
library(pwr)
ptest <- pwr.t.test(n = 100,
                    d = d,
                    sig.level = 0.05,
                    power = NULL)
ptest
##
## Two-sample t test power calculation
##
## n = 100
## d = 0.2224025
## sig.level = 0.05
## power = 0.3466453
## alternative = two.sided
##
## NOTE: n is number in *each* group
With only 100 nuclei per genotype, the power to detect the Tet1 effect on hmC-levels would have
been about 35%, i.e. lower than a coin flip.
How small a difference could we have detected in 99% of the cases with our actual sample sizes?
pwr.t2n.test( n1 = wt_tet1.sum$n[1],
n2 = wt_tet1.sum$n[2],
d = NULL,
sig.level = 0.05,
power = 0.99)
##
## t test power calculation
##
## n1 = 1784
## n2 = 1196
## d = 0.1602344
## sig.level = 0.05
## power = 0.99
## alternative = two.sided
For simplicity, we have focused on power analysis for a t-test (Equation 2.1), i.e. we have deter-
mined critical values for a t-distribution, given the significance level α, the effect size and the
sample size. Similar calculations are possible for the F-statistic of an ANOVA
(https://www.statmethods.net/stats/power.html).
[1] J. Cohen. Statistical power analysis for the behavioral sciences, 1988.
2.4 Questions
1. Test whether the effect of an inactivating mutation in Tet1 on hmC-levels differs from the effect
of an inactivating mutation in Tet2.
2. Test whether the genotypes analysed here show any variation with respect to hmC-levels. If you
find significant variation in hmC according to Tet-genotype, find out which genotypes have a
significant difference in hmC-levels to the wild-type. Hint: specify a number of digits to print
to avoid getting p-values of 0.
4. How would you describe the effect size of the Tet1 mutant on hmC-levels relative to the Tet2 mutant?
5. You are designing the experiment to test whether a Tet3 mutant also has lower hmC-levels than
the wt. You can re-use the intensity measures from the wild-type nuclei. Furthermore, we expect
a similar effect size as for the Tet1 mutant. How many nuclei of Tet3-mutant cells are sufficient
to detect this effect at the 0.01 significance level in 99% of the cases?