Chapter 8 Analytical Statistical Assessment - Data Analysis and Visualization in R (IN2339)
In the last chapter we discussed how trends in data can arise by chance, leading us to wrong
conclusions. We saw that statistical hypothesis testing can help to guard us from being fooled
by randomness in this way. We developed the permutation test as an empirical way to perform
hypothesis tests.
While permutation testing is very general and requires few assumptions, it has its limitations,
as we will see shortly. In this chapter, we will therefore discuss a more analytical approach to
testing. We will see several classical tests, such as the binomial test and the t-test. These
tests often make stronger assumptions about the underlying data. Thus, it is important to
understand when they can and cannot be used. We will discuss the quantile-quantile plot (Q-Q
plot) as a method to check some of these assumptions.
In the last chapter we also touched on the topic of confidence intervals, which help us quantify
the uncertainty of our estimates, and also developed an empirical way to compute them. In this
chapter, we will briefly describe how to compute them analytically.
We have already discussed how to test specific hypotheses, for instance considering the
association between a genetic marker and growth rate in the yeast dataset (See Chapter 7).
However, in the era of big data, we often do not restrict ourselves to testing just one single
hypothesis. Molecular biologists can nowadays measure RNA abundance of all genes of a cell
population. So what if we test the association of the RNA abundance of all ~8,000 yeast genes
with every single genetic marker? For 1,000 genetic markers, this means we will have to do
more than 8 million tests!
Doing this many tests can lead to misleading results. Let us assume, for the sake of argument,
that our null hypothesis is always true and there is never an association between RNA
abundance and markers. If we reject this null hypothesis every time we observe P ≤ 0.05 , we
will falsely reject the null hypothesis in roughly 5% of the tests we do. With 8 million tests, we
will then falsely reject the null hypothesis 400,000 times.
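To make this concrete, here is a small added simulation sketch (with an arbitrary choice of 10,000 t-tests on pure noise instead of the millions of tests above; the t-test itself is introduced later in this chapter): we repeatedly test data for which the null hypothesis holds and count how often P ≤ 0.05 occurs by chance.
set.seed(1)
n_tests <- 10000
p_values <- replicate(n_tests, t.test(rnorm(20), rnorm(20))$p.value)  # all null hypotheses are true
sum(p_values <= 0.05)    # roughly 5% of n_tests, i.e. about 500 false rejections
mean(p_values <= 0.05)   # close to 0.05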
This issue is called multiple testing and strategies to deal with this problem will be discussed
in detail in Chapter 9. For now, it suffices to say that when we do many tests, we will usually
require far lower P -values to reject the null hypothesis, to guard against the problem
described above.
With permutation testing, we estimated P-values using $P = \frac{r+1}{m+1}$, where $m$ is the number of permutations (Equation (7.1)). It follows that, with this method, the P-values we can compute will never be smaller than $\frac{1}{m+1}$. If we now say that we will only reject the null hypothesis if, for example, $P \le 0.001$, then we will need at least $m = 1000$ permutations, otherwise our test simply cannot reject the null hypothesis, regardless of the true associations in the data. Since we are doing 8 million tests, this means we will end up doing more than 8 billion permutations.
Hence, permutation testing can become very costly in terms of computing power and time. We
thus require more scalable ways to estimate P -values for large datasets. This Chapter
provides methods for which P -values are computed from the observed test statistics directly.
The first test we will look at is the binomial test. We use it when we want to test hypotheses
concerning one binary variable.
To develop the binomial test, we consider an abstract example, namely testing whether a coin
is biased.
As we will see again and again in this chapter, to develop a statistical test we require three ingredients:
- A null hypothesis $H_0$ (and an alternative hypothesis $H_1$)
- A test statistic $T$
- The distribution of this test statistic under the null hypothesis, $p(T \mid H_0)$
(Note that in permutation testing the sampling procedure simulated the distribution of our test
statistic under the null hypothesis.)
To test whether a coin is biased, our null hypothesis is that the coin is fair, i.e. that the probability of heads $\mu$ equals 0.5:
H0 : μ = 0.5
H1 : μ ≠ 0.5
Now assume, for the sake of argument, we toss the coin only once (n = 1 ) and get a head (
Tobs = 1 ). What is the two-sided P -value in this case?
In this scenario, there are of course only 2 possible outcomes: either we get one head or we get one tail. Under the null hypothesis, both outcomes are equally likely, so the distribution of the test statistic under the null hypothesis is given by $p(T = 0 \mid H_0) = p(T = 1 \mid H_0) = 0.5$. The two-sided P-value is therefore:

$$P = 2 \times 0.5 = 1$$
Thus, if we only performed a single coin toss, the data cannot provide sufficient evidence for
rejecting the null hypothesis in a two-sided test. This, of course, does not allow us to conclude
that the null hypothesis is correct. In particular, we cannot write that “
p(H0 ) = p(μ = 0.5) = 1 .” After all, we could have generated the same data with a coin that
has heads on both sides, for which the null hypothesis clearly does not hold. Another way to look at it is to say that the data does not provide sufficient evidence to conclude that the coin is biased.
Now assume we toss the coin n > 1 times and observe Tobs heads. What is the distribution of
the test statistic under the null hypothesis now?
We can easily simulate data under this assumption in R by sampling with replacement from a
vector c(0, 1) . The probability for each outcome can be provided with the prob argument.
Here is one such simulation of n = 10 trials under the null hypothesis μ = 0.5 .
# set.seed is optional
# we just pick an arbitrary seed of the random number generator to ensure reproducibility
# See https://en.wikipedia.org/wiki/Random_number_generation
set.seed(7)
n <- 10
x <- sample(c(0,1), n, replace=TRUE, prob=c(0.5,0.5))
x
## [1] 0 1 1 1 1 0 1 0 1 1
t <- sum(x)
t
## [1] 7
The probability of observing T heads after tossing a coin n times is given by the binomial
distribution, which is the binomial coefficient, i.e. the number of possible sequences of events
with the same total number of heads, times the probability of a given sequence, which is itself
the product of the probability of each individual realization (i.i.d. assumption):
$$p(T \mid n, \mu) = \binom{n}{T} \mu^T (1 - \mu)^{n - T}$$

which for a fair coin ($\mu = 0.5$) simplifies to:

$$p(T \mid n, \mu = 0.5) = \binom{n}{T} 0.5^n$$
This is implemented in R with the function dbinom.14 The probability of having observed exactly 7 heads is therefore:
dbinom(t, n, 0.5)
## [1] 0.1171875
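As a quick consistency check (an addition, not part of the original text), the same value follows directly from the binomial formula above:
choose(10, 7) * 0.5^10
## [1] 0.1171875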
We recall that the P-value is defined as the probability, under the null hypothesis, of observing a test statistic as extreme as, or more extreme than, the one we actually observed. Since we just want to know whether the coin is biased, and do not care in which direction, we need a two-sided P-value. This is given by (Figure 8.1):
$$P = 2 \min\left\{ \sum_{T \le T_{obs}} \binom{n}{T} 0.5^n,\ \sum_{T \ge T_{obs}} \binom{n}{T} 0.5^n \right\} \tag{8.1}$$
Figure 8.1: Two-sided P-value for the coin tossing example. The two-sided P-value equals the sum of the probabilities (total red area) under the null hypothesis of the realizations as extreme as or more extreme than the observed one (vertical line).
To apply formula (8.1), one can compute the smaller of the two terms, which is here $\sum_{T \ge 7} \binom{10}{T} 0.5^{10}$ and corresponds to the right tail marked in red in Figure 8.1. This equals $1 - \sum_{T \le 6} \binom{10}{T} 0.5^{10}$ and is obtained in R with:
1 - pbinom(6, n, 0.5)
## [1] 0.171875
2 * (1 - pbinom(6, n, 0.5))
## [1] 0.34375
Altogether we have P = 0.34375. We do not reject the null hypothesis that the coin is fair at a
significance level of 0.05.
In actual applications, we use the binom.test function of R. For the example above, we do:
binom.test(t, n, p = 0.5)
## p-value = 0.3438
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3475471 0.9332605
## sample estimates:
## probability of success
## 0.7
We see that the function has three parameters, which correspond to Tobs , n and the μ under
H0 respectively (R calls the last one p). Additionally, we can specify whether we want a two-
sided or one-sided test using the “alternative” option. The options are “two.sided,” “greater”
and “less.”
We also see that the function returns an object, which summarizes the test that was
performed. If we want to just get the P-value, we do:
binom.test(t, n, p = 0.5)$p.value
## [1] 0.34375
Note that binom.test also returns sample estimates of the probability of success and
confidence intervals. Section 8.8 provides explanations.
Suppose we are trying to determine whether people who smoke are more likely to develop
severe symptoms if they contract a respiratory virus than non-smokers. For this we collect
data from n = 110 randomly sampled patients.
Assume we receive the following table as a result:
             Severe   Mild
Smoker           10     20
Non-smoker       10     70
We see that 30 of the patients were smokers, whereas 80 were non-smokers. We further observe that only 1/8th of the non-smokers developed severe symptoms, whereas 1/3rd of the smokers did. The odds are 1:2 (10 severe versus 20 mild) for infected smokers to develop severe symptoms against 1:7 (10 severe versus 70 mild) for non-smokers. Hence, these data suggest that there is a relationship between smoking and developing severe symptoms, with odds about 3.5 times higher for smokers than for non-smokers.
Once again, we need to make sure that these results are statistically significant. We cannot
use the binomial test, because now we are not just considering one binary variable, but rather
we are investigating the relationship between two binary variables.15
It is enlightening to first approach this problem with permutation testing (See Chapter 7). To this end, one shall first consider the underlying, non-aggregated dataset of individual cases. Such a tidy dataset, where one row is one patient and each column one variable, would have the following structure:
patient      Smoker   Symptoms
patient_1    no       mild
patient_3    no       severe
…            …        …
For permutation testing, the null hypothesis is the independence of the Smoker and the
Symptoms variables. With permutation testing, data under the null hypothesis are simulated by
permuting values in one column (say “Symptoms”) keeping the order of the other column (say
“Smoker”) fixed. For each permutation, we get a different 2x2 contingency table which we will
denote as:
             Severe   Mild   Total
Smoker       a        b      a+b
Non-smoker   c        d      c+d
Total        a+c      b+d    n
Note that any such permutation keeps the size of the dataset, the total number of smokers as
well as the total number of patients with severe symptoms constant. We say that these
permutations keep the margins (row and column totals) of the contingency table constant.
Consequently, one cell in the 2x2 contingency table suffices to characterize the entire table
because all other counts can then be derived using the margins. Therefore, we can use any
cell in the table as test statistic, but we usually use a, i.e. the upper left corner.
The good news is that the distribution of a under the null hypothesis (i.e. its frequency among all possible distinct permutations) can be exactly computed. It is given by the hypergeometric distribution:16

$$p(a \mid H_0) = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}}$$
Using the hypergeometric distribution, we can now derive P-values in the same way as we did before, namely by summing the probabilities of observing a test statistic as extreme as, or more extreme than, the one we observed. So, to compute the one-sided P-value, we would use:

$$P = \sum_{i \ge a} p(k = i \mid H_0)$$
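As an added sketch (not in the original text), the same one-sided P-value can be obtained directly from R's hypergeometric functions, treating the 20 severe cases as draws from the 30 smokers and 80 non-smokers:
a <- 10                                    # observed number of severe smokers
sum(dhyper(a:20, m = 30, n = 80, k = 20))  # P(a or more severe smokers | fixed margins)
1 - phyper(a - 1, m = 30, n = 80, k = 20)  # equivalent formulation
Both expressions equal the one-sided P-value reported by fisher.test below.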
For our application purposes, we do not need to know the formula of the hypergeometric
distribution, nor how it is derived. However, it is important to know that a formula exists and
what the underlying assumptions are, i.e. that the margins of the 2x2 contingency table are considered to be fixed. That means that we do not consider $n$, $a+b$, $c+d$, $a+c$ and $b+d$ as random variables, but instead take these quantities as given (we “condition” on them). Note that this assumption is often violated in practice, as in the example above where we randomly sampled patients rather than sampling a fixed number from each subgroup. But Fisher's exact test is nevertheless applied as an exact instance of permutation testing.
There are alternatives to Fisher's exact test that do not require the assumption of fixed margins.
One is the formerly popular Chi-squared test, which is based on large number approximations.
It is rarely needed nowadays, as Fisher’s exact test is exact and fast to compute. Another
approach is based on logistic regression and will be addressed in a later Chapter.
In R, we can perform Fisher’s exact test using fisher.test . This requires a contingency table
as input (See the base R function table to create them). For our contingency table, we get:
tbl = data.table(
severe = c(10, 10),
mild = c(20, 70)
)
tst <- fisher.test(tbl, alternative = "greater")
tst
##
##  Fisher's Exact Test for Count Data
##
## data:  tbl
## p-value = 0.01481
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  1.316358      Inf
## sample estimates:
## odds ratio
## 3.453224
The one-sided P -value is 0.0148095. At the level α = 0.05 , one would therefore reject the
null hypothesis of independence of symptom severity and smoking status.
As for the binomial test, the P-value can be extracted with $p.value . The function fisher.test also returns an estimate of the odds ratio and its confidence interval. The estimate of the odds ratio is based on an estimation procedure that is robust to low counts, giving close yet slightly different estimates than the sample odds ratio (we obtained 3.5 at the start of Section 8.3).
We asked earlier on the yeast dataset (Section 7.1) whether the genotype at marker 5211
significantly associates with growth rates in Maltose media. We saw that yeast strains which
have the wild isolate genotype seemed to generally grow faster than those which had the lab
strain genotype at this particular marker:
dt <- merge(
growth[media == 'YPMalt'],
genotype[marker == mk, .(strain, genotype)],
by = 'strain'
)
p <- dt %>%
  ggplot(aes(genotype, growth_rate)) +
  geom_boxplot()
p
Here we are evaluating the association between a binary variable (the genotype at marker
5211) and a quantitative variable (the growth rate in Maltose media). This scenario does not fit
the tests we have seen previously. We have to develop a new one.
To formalize this problem, we first note that the binary variable splits the quantitative data into two groups. Let $X = x_1, \ldots, x_{n_x}$ be the quantitative data of the first group (i.e. the growth rates of yeast strains with lab strain genotype), and $Y = y_1, \ldots, y_{n_y}$ be the quantitative data of the second group (i.e. the growth rates of yeast strains with wild isolate genotype).
To develop a test, we again need a null hypothesis, a test statistic and a distribution of the test
statistic under the null hypothesis. For this problem, we will consider two different tests.
$$t = c\,\frac{\bar{x} - \bar{y}}{s} \tag{8.2}$$

where $\bar{x}$ and $\bar{y}$ are the sample means of $X$ and $Y$ respectively, $s$ is the pooled standard deviation, and $c$ is a constant that depends on the sample size of each group. In detail:

$$\bar{x} = \frac{1}{n_x} \sum_i x_i, \qquad \bar{y} = \frac{1}{n_y} \sum_i y_i$$

$$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2 + \sum_i (y_i - \bar{y})^2}{n_x + n_y - 2}}$$

$$c = \sqrt{\frac{n_x n_y}{n_x + n_y}}$$
While the details can always be looked up, understanding Equation (8.2) is useful. Intuitively,
the t-statistic compares, up to the constant c, the “signal” of group difference, namely the
estimated difference of the means of the two groups, to the “noise,” i.e. how uncertain we are
about our estimate of this difference. This “noise” in our estimate is itself the ratio of the typical
variations within the groups (s) over a term capturing the sample size (c). One can thus
interpret it as a signal-to-noise ratio. If the t-statistic is large, then we see a clear difference in
means. By contrast, if the t-statistic is small, then the difference in means is not large
compared to the noise. Larger sample size (more data) or larger between-group differences
compared to within-group differences lead to larger t-statistics.
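To make Equation (8.2) concrete, here is a small added sketch on simulated data (the vectors gx and gy are illustrative, not the yeast data) computing the t-statistic by hand and checking it against t.test:
set.seed(1)
gx <- rnorm(20, mean = 1)                     # first group
gy <- rnorm(15, mean = 0)                     # second group
nx <- length(gx); ny <- length(gy)
s <- sqrt((sum((gx - mean(gx))^2) + sum((gy - mean(gy))^2)) / (nx + ny - 2))  # pooled standard deviation
c_const <- sqrt(nx * ny / (nx + ny))
c_const * (mean(gx) - mean(gy)) / s           # t-statistic of Equation (8.2)
t.test(gx, gy, var.equal = TRUE)$statistic    # same value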
8.4.1.1 Student’s t-distribution
Before we can derive the distribution of this test statistic under the null hypothesis, we need to
make some additional assumptions about the data, namely:
- All observations $x_1, \ldots, x_{n_x}, y_1, \ldots, y_{n_y}$ are independent of each other
- $X$ and $Y$ both follow Gaussian distributions
- $X$ and $Y$ have the same unknown variance
A consequence of these assumptions is that our null hypothesis simplifies. If both X and Y
are Gaussian with the same variance, the only way the two groups can differ is if the
Gaussians have different means. Therefore, the null hypothesis is that the expectations are
equal:
H0 : E(X) = E(Y )
Under these assumptions, the t-statistic defined in Equation (8.2) follows, under $H_0$, a Student's t-distribution with $n_x + n_y - 2$ degrees of freedom. We can make two observations. Firstly, the distribution of the t-statistic under $H_0$ does not
depend on the (unknown) variance. Secondly, Student’s t-distribution has heavier tails than
the Gaussian. This intuitively comes from the fact that, while the numerator of the t-statistic is
normally distributed, the estimate of the standard deviation in the denominator is noisy. The
smaller the sample size n , the noisier the estimate. The degrees of freedom capture just this.
They are equal to the number of data points (here nx + ny ) minus the number of parameters
of the model (here minus 2 for the means of the two groups). The smaller the degrees of freedom, the less accurate our estimate of the standard deviation is, and the more variation the denominator of the t-statistic brings. Hence, the smaller the degrees of freedom, the heavier the tails. For infinite degrees of freedom, Student's t-distribution equals the normal distribution.
In R we can perform a t-test using the t.test function. Since in the basic Student's t-test we assume equal variances, we have to set the argument var.equal to TRUE. One could extract the values for each group manually and perform the test; rather than doing so, we use the formula syntax ( growth_rate ~ genotype ) and let the t.test function do it for us:
t.test(growth_rate ~ genotype, data = dt, var.equal = TRUE)
##
## Two Sample t-test
##
Note that the function reports the t-statistic and the degrees of freedom, the confidence
intervals for the difference of the means, in addition to the p-value. Note also that the function
helpfully reminds us what null hypothesis we are testing against.
In practice, we generally do not assume equal variances. The corresponding test is called Welch's t-test and uses slightly different degrees of freedom. It is performed in R by default if we do not set var.equal to TRUE.
t.test(growth_rate ~ genotype, data=dt)
##
## 95 percent confidence interval:
##  -2.405189 -1.661599
## sample estimates:
## mean in group Lab strain mean in group Wild isolate
## 5.763086 7.796480
8.4.2.1 Assumptions
As we saw, the t-test assumes the data follows a specific distribution, namely a Gaussian.
There are many situations where this is reasonable, but in general we cannot guarantee that
this assumption holds. Using the t-test if the data is not normal can lead to wrong conclusions. For instance, distributions that have heavier tails than the Gaussian can lead to false null hypothesis rejections, as sample mean differences can seem to be surprisingly large (See Section 8.4.3).
The Wilcoxon Rank-Sum test is a popular alternative to the t-test. It makes very few
assumptions about the data, namely that:
- All observations $x_1, \ldots, x_{n_x}, y_1, \ldots, y_{n_y}$ are independent of each other
- The responses are ordinal, i.e. we can rank them
Specifically, we assume that under the null hypothesis H0 , the probability of an observation
from the population X exceeding an observation from the second population Y equals the
probability of an observation from Y exceeding an observation from X:
H0 : p(X > Y ) = p(Y > X)
A stronger null hypothesis commonly used is “The distributions of both populations are equal”
which implies the previous hypothesis.
For a two-sided test, the alternative hypothesis is “the probability of an observation from the
population X exceeding an observation from the second population Y is different from the
probability of an observation from Y exceeding an observation from X:
p(X > Y ) ≠ p(Y > X) .” The alternative may also be stated in terms of a one-sided test, for
example: p(X > Y ) > p(Y > X) . This would mean that if we randomly take observations
x ∈ X and y ∈ Y , we would expect that x > y occurs more often than y > x .
Consider first that we rank all observed values (and ignore ties), e.g.:
The idea of the Wilcoxon rank-sum test is that under the null hypothesis, the $x_i$'s and $y_i$'s should be well interleaved in this ranking. In contrast, if say $X$ tends to be smaller than $Y$, then the $x_i$'s will get lower ranks. The test statistic is therefore based on the sum of the ranks of the realizations of one of the two variables.
$$U_x = R_x - \frac{n_x(n_x + 1)}{2}$$

where $R_x$ is the sum of the ranks of the $x_i$'s (in the example above, $R_x = 1 + 5 + 6 + \ldots$), and $n_x$ is the number of observations of set $X$. The term $\frac{n_x(n_x+1)}{2}$ (this is the famous Gauss sum) is a constant ensuring that $U_x = 0$ when the $x_i$'s occupy the lowest ranks.
$U_y$ is defined analogously. The Mann-Whitney U statistic is then

$$U = \min\{U_x, U_y\}$$

P-values are then based on the distribution of the Mann-Whitney U statistic under the null hypothesis.17 It combines tabulated values for small sample sizes and a Central Limit Theorem approximation for large sample sizes (exploiting that the expectation and the variance of $U$ under the null hypothesis have closed-form expressions).
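As an added sketch on simulated data (gx and gy are illustrative), the rank-sum quantities can be computed by hand and compared with wilcox.test, whose reported W corresponds to the U of its first argument:
set.seed(1)
gx <- rnorm(8)
gy <- rnorm(6, mean = 1)
ranks <- rank(c(gx, gy))                      # joint ranking of all observations
Rx <- sum(ranks[seq_along(gx)])               # sum of the ranks of the gx values
Ux <- Rx - length(gx) * (length(gx) + 1) / 2
Uy <- length(gx) * length(gy) - Ux            # using Ux + Uy = nx * ny
c(Ux = Ux, Uy = Uy, U = min(Ux, Uy))
wilcox.test(gx, gy)$statistic                 # W equals Ux here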
In R we can perform the Wilcoxon rank-sum test using the wilcox.test function, whose usage is analogous to the usage of t.test:
wilcox.test(growth_rate ~ genotype, data = dt)
##
## Wilcoxon rank sum test with continuity correction
##
## data: growth_rate by genotype
## W = 690, p-value = 2.264e-16
The Wilcoxon rank-sum test makes fewer assumptions than the t-test; specifically, it does not require that the data follows a Gaussian distribution. We will now see an example to illustrate this.
set.seed(10)
x1 <- rnorm(100, 10)
x2 <- c(rnorm(99, 5), 500)
grp_tbl <- data.table(
group = rep(c("group1", "group2"), each=100),
x = c(x1,x2)
)
ggplot(data = grp_tbl, aes(x=x, color=group)) + geom_boxplot() + scale_x_log10()
In this example, the groups are sampled from normal distributions with a different mean. However, we add a non-normal outlier to the second group, which ensures that the overall means look the same.
Recall that, as a consequence of assuming that the data is Gaussian, the null hypothesis of the t-test is that the difference in means is zero. There is no difference in means here, so the t-test does not reject:
t.test(x1, x2)
##
## Welch Two Sample t-test
##
## data: x1 and x2
## t = -0.00052129, df = 99.072, p-value = 0.9996
But clearly, these groups are overall quite different, and their means only appear similar due to
this one outlier. The null hypothesis of the Wilcoxon rank-sum test is not about the means.
Instead, the Wilcoxon rank-sum test uses the rank distribution, and in our example most
observations of the second group will rank above the observations of the first. The one outlier
will not affect the ranking much. Thus, the Wilcoxon rank-sum test will reject here:
wilcox.test(x1, x2)
##
## Wilcoxon rank sum test with continuity correction
##
## data: x1 and x2
## W = 9900, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
This is a rather synthetic example, but the underlying point is very general: if we are unsure
whether the distributional assumption is met, a test like the Wilcoxon rank-sum test will
generally be more robust than a test making distributional assumptions like the t-test. But do
note that there is a flip side to this: if the data is indeed Gaussian, then the t-test will be more
powerful (i.e. more sensitive in detecting violations of the null hypothesis) than the more
generally applicable Wilcoxon rank-sum test.
8.5 Association between two quantitative variables
The last scenario we will consider in this chapter concerns the dependence between two
quantitative variables. That is, we assume we have quantitative data in the form of tuples
(X, Y ) : (x1 , y1 ), . . . , (xn , yn ) and we want to see if knowing one of the values in such a
tuple gives us information about the other one.
Looking at the plot above, it sure seems that there is a positive relationship between X and Y
in this data. Specifically, if we know that xi is relatively high, it seems that we can usually
assume that yi will be high too (and vice-versa). But once again, we need a test to prevent us
from being fooled by randomness.
This means we again need null and alternative hypotheses, a test statistic and a distribution of
the test statistic under the null hypothesis.
We will consider two different tests which are based on different notions of the concept of
correlation.
An important property is that when two variables (X, Y) form a bivariate Gaussian distribution,19 their independence is equivalent to their population Pearson correlation coefficient $\rho_{X,Y}$ being equal to 0 (See Appendix D). This motivates a hypothesis test, called the Pearson correlation coefficient test.
8.5.1.1 Pearson’s correlation coefficient
The Pearson correlation coefficient test is based on the sample estimate of the population
Pearson correlation coefficient, defined as:
$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\ \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \tag{8.3}$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ is the sample mean, and analogously for $\bar{y}$.
Let us look at the components of this. The numerator compares the deviations of the $x_i$ and $y_i$ to their respective means. Terms of the sum are positive if both $x_i$ and $y_i$ vary in the same direction (larger or smaller) compared to their mean, and negative otherwise. Hence, the numerator is largely positive when deviations from the means agree in direction, largely negative when they are opposite, and about 0 when deviations are independent of each other. More formally, the numerator is proportional to the sample covariance (See Appendix D). The terms in the denominator are proportional to the individual sample standard deviations of X and Y (See Appendix D). Hence, r compares how much X and Y vary together to the product of how much they vary individually.
- $r = 1$ implies that x and y are perfectly linearly related with a positive slope
- $r = -1$ implies that x and y are perfectly linearly related with a negative slope
The null hypothesis is that the two variables are statistically independent, which under the above assumptions amounts to stating that:

$$H_0: \rho_{X,Y} = 0$$

The test statistic is

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

Under $H_0$, this test statistic follows a Student's t-distribution with $n - 2$ degrees of freedom.
##
## Pearson's product-moment correlation
##
## sample estimates:
## cor
## 0.8164205
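As an added sketch on simulated data (u and v are illustrative variables), Equation (8.3) and the t-statistic above can be computed by hand and checked against cor and cor.test:
set.seed(2)
u <- rnorm(50)
v <- 0.8 * u + rnorm(50)
r <- sum((u - mean(u)) * (v - mean(v))) /
  (sqrt(sum((u - mean(u))^2)) * sqrt(sum((v - mean(v))^2)))
c(manual = r, builtin = cor(u, v))            # identical values
r * sqrt((length(u) - 2) / (1 - r^2))         # t-statistic
unname(cor.test(u, v)$statistic)              # same value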
8.5.2.1 Motivation
Pearson’s correlation captures linear relationship between variables, which is quite restrictive.
For instance, if one of the variables is in log-scale or quadratic scale, then the linear
relationship is lost. Here is a constructed example:
## [1] 0.5643079
cor.test(anscombe$x4, anscombe$y4)
##
## Pearson's product-moment correlation
##
## data: anscombe$x4 and anscombe$y4
## t = 4.243, df = 9, p-value = 0.002165
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4246394 0.9507224
## sample estimates:
## cor
## 0.8165214
We rank the values of $X$ and the values of $Y$ separately. We then compute the position of each data point in its ranking, yielding the transformed dataset:

$$(\mathrm{rank}_x(x_1), \mathrm{rank}_y(y_1)), \ldots, (\mathrm{rank}_x(x_n), \mathrm{rank}_y(y_n))$$
For the rankings above, we would have for example that (rankx (x7 ), ranky (y7 )) = (1, 3) .
The Spearman correlation coefficient is then simply the Pearson correlation coefficient computed on these ranks:
## [1] 0.5
## [1] 0.5
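As a small added illustration of why ranks help: for a monotonic but non-linear relationship, the Spearman correlation equals 1 while the Pearson correlation does not.
u <- 1:20
v <- exp(u)                        # monotonically increasing, but strongly non-linear
cor(u, v)                          # Pearson correlation is well below 1
cor(u, v, method = "spearman")     # Spearman correlation equals exactly 1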
Based on the Spearman correlation, we can also define a test for the relationship between two
variables.
R implements a statistical test based on tabulated exact permutations for small sample sizes and approximations for larger sample sizes. For the Anscombe example:
cor.test(anscombe$x4, anscombe$y4, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: anscombe$x4 and anscombe$y4
We see that the Spearman test would not reject, which makes sense, as the rank is less likely to be misled by the outlier data point.22 Generally, the Spearman test is less powerful than the Pearson test when the data is actually Gaussian, but it is more robust to outliers and captures monotonic, yet non-linear, relationships. In practice, the Spearman test is often used.
Figure 8.3 summarizes the different tests we have seen for the association of two variables,
together with the typical companion plots:
Figure 8.3: Overview of two-variable tests
As we saw in this chapter, several tests assume that the data follows a particular distribution.
We will now explore a plot which we can use to check whether such an assumption is
reasonable.
We already know a plot which can be used to visualize distributions, namely the histogram.
We might think that it could be used to check distributional assumptions. However, this is
somewhat complicated by the difficulty of choosing the right bin size. Consider, for example, the following histogram, visualizing a sample taken from a uniform distribution on the interval 0 to 1:
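A sample like this can be generated as follows (a minimal sketch; the sample size and seed are assumptions, so the exact values differ from those shown in this section):
set.seed(3)
x <- runif(1000)                   # draws from the uniform distribution on [0, 1]
ggplot(data.table(x = x), aes(x)) +
  geom_histogram(bins = 30)        # with many bins, the sample looks ragged rather than flat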
Just looking at the histogram, it is hard to see that the underlying data comes from the uniform
distribution.
What could be a better approach here? One thing we can do is look at the quantiles.
The basic idea here is as follows: if the data actually follows a uniform distribution on the
interval 0 to 1, then we expect 10% of the data in the interval [0,0.1], 20% in the interval
[0,0.2], and so on…
We can now compute the deciles of our data and check whether they conform to this expectation:
dec <- quantile(x, probs = seq(0, 1, 0.1))
dec
## 0.99981636
Here we implicitly chose to always make jumps of 10%. These quantiles are therefore called deciles.
We can make a scatter plot which compares the observed deciles of the data with the theoretical deciles:
ggplot(
  data.table(
    x = seq(0, 1, 0.1),
    y = dec
  ),
  aes(x, y)
) + geom_point() +
  xlim(c(0,1)) + ylim(c(0,1)) +
  geom_abline(intercept=0, slope=1) ## diagonal y=x
For a finite sample we can estimate the quantile for every data point. One way is to use as
expected quantile (r − 0.5)/N (Hazen, 1914), where r is the rank of the data point. The R
function ppoints gives more accurate values.
ggplot(
data.table(
x=ppoints(length(x)),
y=sort(x)
),
aes(x,y)
) + geom_point() +
xlim(c(0,1)) + ylim(c(0,1)) +
xlab("Quantiles of the uniform distribution") +
ylab("Quantiles of the dataset") +
geom_abline(intercept=0,slope=1) ## diagonal y=x
This is called a Q-Q plot, which is short for quantile-quantile plot. When the distribution matches the data, as above, the points should be close to the diagonal.
Let us now recall the example we used to justify the Wilcoxon test. There we added an extreme outlier to a Gaussian, which misled the t-test. Can we discover, using a Q-Q plot, that this data violates the assumption of normality?
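One way to check this (an added sketch using ggplot2's stat_qq) is a normal Q-Q plot of the second group, x2, from that example:
ggplot(data.table(x = x2), aes(sample = x)) +
  stat_qq() +                      # quantiles of the data against theoretical Normal quantiles
  stat_qq_line()                   # reference line through the first and third quartiles
The single extreme value lies far off the line, revealing the departure from normality in the upper tail.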
Figure 8.4 gives more examples. We assume here the Normal distribution (Gaussian with mean 0 and variance 1) as the reference theoretical distribution. These plots show how different
violations of the distributional assumption translate to different deviations from the diagonal in
a Q-Q plot.
Figure 8.4: Examples of Q-Q plots. The theoretical distribution is in each case the Normal
distribution (Gaussian with mean 0 and variance 1). The upper row shows histograms of some
observations, the lower row shows the matching Q-Q plots. The vertical red dashed line marks
the theoretical mean (0, top row) and the red lines the y=x diagonal (bottom row).
The middle three plots show what happens when one particular aspect of the distributional
assumption is incorrect. The second from the left shows what happens if the data has a mean
higher than we expected, but otherwise follows the distribution. The middle one shows what
happens if the data has fatter tails (i.e. more outliers) than we expected - this occurs frequently
in practice. The second from the right shows what happens if the distribution is narrower than
expected. The last plot shows a combination of these phenomena. There the data come from a
non-negative asymmetric distribution.23 The Q-Q plot shows a lack of low values (capped at 0)
and an excess of high values.
For instance, a 95% confidence interval for the expectation $\mu$ would be an interval $[a, b]$ such that:

$$p(a \le \mu \le b) = 0.95$$
We have seen how to approximate confidence intervals using the case-resampling bootstrap
in the previous chapter. But, confidence intervals can also be computed analytically under
some assumptions. We will see this in detail for the binomial case first.
Assume we observe $N$ independent 0/1 outcomes $X_1, \ldots, X_N$, each with success probability $\mu$. We estimate $\mu$ with the sample mean

$$\bar{X} = \frac{1}{N} \sum_i X_i$$

and estimate its standard error with

$$\widehat{SE}(\bar{X}) = \sqrt{\frac{\bar{X}(1 - \bar{X})}{N}}$$
To determine the probability that the interval $[\bar{X} - \widehat{SE}(\bar{X}),\ \bar{X} + \widehat{SE}(\bar{X})]$ includes $\mu$, we need to compute:

$$p\left(\bar{X} - \widehat{SE}(\bar{X}) \le \mu \le \bar{X} + \widehat{SE}(\bar{X})\right)$$

By subtracting and dividing the same quantities in all parts of the inequality, we get that the above is equivalent to:

$$p\left(-1 \le \frac{\bar{X} - \mu}{\widehat{SE}(\bar{X})} \le 1\right)$$
The Central Limit Theorem implies that, for large $N$, the sample mean is approximately Normally distributed:

$$\bar{X} \sim N\left(E(X), \frac{Var(X)}{N}\right)$$

It is known that $Var(X_i) = \mu(1 - \mu)$ (this is because the underlying data is Bernoulli). Hence, $\widehat{SE}(\bar{X})$ estimates the standard deviation of $\bar{X}$, and

$$p\left(-1 \le \frac{\bar{X} - \mu}{\widehat{SE}(\bar{X})} \le 1\right) = p(-1 \le Z \le 1)$$

where $Z$ denotes a standard Normal random variable.
This can be computed in R using the cumulative distribution function of the normal distribution:
pnorm(1) - pnorm(-1)
## [1] 0.6826895
If we want to have a larger probability, say 99%, we need to multiply the standard error by whatever $z$ satisfies the following:

$$Pr(-z \le Z \le z) = 0.99$$
This is obtained using the quantile function of the normal distribution. In R, using:
z <- qnorm(0.995)
z
## [1] 2.575829
pnorm(z) - pnorm(-z)
## [1] 0.99
For $\alpha = 0.05$, $1 - \alpha/2 = 0.975$, and we get the typically used 1.96 factor:
qnorm(0.975)
## [1] 1.959964
Most statistical tests in R provide confidence intervals for the relevant statistics. This is
reported as part of the returned test object. For example, for the binomial test we get:
mu <- 0.45
N <- 1000
x <- sample(c(0,1), N, replace=TRUE, prob=c(1-mu, mu))  # N draws with success probability mu
binom.test(sum(x), length(x))
## p-value = 6.537e-07
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3901707 0.4522958
## sample estimates:
## probability of success
## 0.421
You can see that the binom.test function automatically gives us a 95 percent confidence
interval. It is reported in the conf.int slot. So we can extract it using:
binom.test(sum(x), length(x))$conf.int
## [1] 0.3901707 0.4522958
## attr(,"conf.level")
## [1] 0.95
We can set the confidence level with the conf.level parameter. So if we want a 99% interval, we do:
binom.test(sum(x), length(x), conf.level = 0.99)$conf.int
For some tests, you first need to set conf.int to TRUE to receive a confidence interval:
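For example (an added illustration reusing the yeast data from above), wilcox.test only returns a confidence interval for the location shift if we ask for it:
wilcox.test(growth_rate ~ genotype, data = dt, conf.int = TRUE)$conf.int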
Sometimes R will use more accurate estimations than the Normal approximation we have just
described. Details can usually be found in the documentation.
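As an added sketch, the Normal-approximation interval described above can be computed by hand for the 0/1 vector x and compared with the interval returned by binom.test:
x_bar <- mean(x)
se_hat <- sqrt(x_bar * (1 - x_bar) / length(x))                   # estimated standard error
c(x_bar - qnorm(0.975) * se_hat, x_bar + qnorm(0.975) * se_hat)   # approximate 95% interval
binom.test(sum(x), length(x))$conf.int                            # exact interval, for comparison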
Consider again a scenario where we are comparing two groups, $X$ and $Y$, in terms of their means, $\bar{x}$ and $\bar{y}$. Assume, for simplicity, that:

$$X \sim N(\mu_x, \sigma^2), \qquad Y \sim N(\mu_y, \sigma^2)$$

with $\sigma^2$ known. Assume further that we have samples of each group of size $n_y = n_x = n$. We then know that:

$$\bar{x} \sim N\left(\mu_x, \frac{\sigma^2}{n}\right), \qquad \bar{y} \sim N\left(\mu_y, \frac{\sigma^2}{n}\right)$$
We can now construct two analytical 95% confidence intervals, one for each mean. We use the same procedure as previously. We set up an interval:

$$Pr\left(\bar{x} - z\frac{\sigma}{\sqrt{n}} \le \mu_x \le \bar{x} + z\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

Using $z \approx 2$, the two 95% confidence intervals are approximately:

$$\bar{x} \pm 2\frac{\sigma}{\sqrt{n}} \qquad \text{and} \qquad \bar{y} \pm 2\frac{\sigma}{\sqrt{n}}$$
Now assume we want to test the null hypothesis that the true mean difference is zero. In the
literature it is quite common practice to say that we reject this null hypothesis if and only if the
two confidence intervals do not overlap.
It is important to note that this is not the same as constructing a 95% confidence interval for the difference in means $\bar{x} - \bar{y}$ and rejecting if and only if that interval does not include zero. The difference comes from how we add standard errors.

In our overlap test, we would reject whenever $\bar{y} + 2\frac{\sigma}{\sqrt{n}} < \bar{x} - 2\frac{\sigma}{\sqrt{n}}$ (assuming $\bar{y} < \bar{x}$). We therefore reject whenever:

$$4\frac{\sigma}{\sqrt{n}} < \bar{x} - \bar{y}$$
Now let us construct a confidence interval for $\bar{x} - \bar{y}$. A basic property of normal random variables tells us that:

$$\bar{x} - \bar{y} \sim N\left(\mu_x - \mu_y, 2\frac{\sigma^2}{n}\right)$$

The 95% confidence interval for the difference is therefore approximately:

$$(\bar{x} - \bar{y}) \pm 2\sqrt{2}\frac{\sigma}{\sqrt{n}}$$

and we reject the null hypothesis of zero difference whenever:

$$2\sqrt{2}\frac{\sigma}{\sqrt{n}} < \bar{x} - \bar{y}$$

Now $2\sqrt{2} < 4$, thus this will reject more often than the "overlap" test. In other words, the overlap test is too conservative.
Nevertheless, this sort of overlap test is very often used in the literature. When you see it
being used, or when you use it yourself, keep in mind that it generally will fail to reject more
often than the confidence level indicates.
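A small added simulation sketch (with sigma known and arbitrary parameter choices) illustrates that the overlap rule rejects less often than the rule based on the confidence interval for the difference:
set.seed(42)
n <- 30; sigma <- 1; delta <- 0.6             # true difference in means
se <- sigma / sqrt(n)
n_sim <- 10000
reject_overlap <- reject_diff <- logical(n_sim)
for (i in seq_len(n_sim)) {
  xbar <- mean(rnorm(n, mean = delta, sd = sigma))
  ybar <- mean(rnorm(n, mean = 0,     sd = sigma))
  reject_overlap[i] <- (xbar - 2 * se) > (ybar + 2 * se)   # the two intervals do not overlap
  reject_diff[i]    <- (xbar - ybar) > 2 * sqrt(2) * se    # interval for the difference excludes 0
}
mean(reject_overlap)   # rejects less often ...
mean(reject_diff)      # ... than the test based on the difference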
8.9 Discussion
In the last chapter, we saw permutation-based testing, which is very general: we can use it to test any ad hoc statistic such as the mean difference, the median difference and so on. However, at the beginning of this chapter, we saw that permutation-based testing is computationally intensive and not always appropriate in the context of big data. This is why we discussed a
number of analytical tests which can serve as alternatives, to compute P -values and
confidence intervals.
Some of the tests are parametric, i.e. they assume some parameterized family of distributions for the data, so that the null hypothesis can be stated in terms of the parameters (for instance that two groups distribute according to the Gaussian distribution and that the means of the two groups are equal). We saw that for many scenarios, non-parametric tests exist that make few assumptions about the distribution functions. Examples of such non-parametric tests are Fisher's exact test, the Wilcoxon rank-sum test, and the Spearman rank-correlation test. In general, these
should be preferred to their parametric counterparts, unless we have good reason to believe
that the more restrictive assumptions of the parametric test are met. We can check
distributional assumptions using Q-Q plots.
The two-variable tests we discussed assess in some ways the dependencies between
variables. By themselves, they cannot tell whether these relationships are causal and, if so,
what the direction of causality is (See Chapter 6). For instance, we may be tempted to
conclude from the viral infection example that smoking is a cause of severe symptoms, but
that study design unfortunately cannot guarantee this. People who smoke could, for example,
on average be older, have a less healthy diet, or have other risk-factors, compared to non-
smokers. In other words, there could be confounding variables not taken into account here.
8.10 Conclusion
8.11 Resources
The Testing Chapter of Modern Statistics for Modern Biology, by Holmes and Huber.
14. R typically provides 4 functions per distribution starting with the letters r, d, p, and q and
standing for random draws (rbinom, rnorm,…), density or probability mass (dbinom,
dnorm,…), cumulative distribution (pbinom, pnorm,…), and quantile (qbinom, qnorm,…)↩︎
15. It would also not be ideal to use the binomial test on the smoker data by fixing the
probability under the null to the probability estimated on the non-smokers, because that
probability would be a noisy estimate↩︎
17. https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test↩︎
18. https://en.wikipedia.org/wiki/Anscombe%27s_quartet↩︎
19. https://en.wikipedia.org/wiki/Multivariate_normal_distribution↩︎
20. In some cases, we may also be interested in non-monotonic data (e.g. a U-shaped
relationship). In this case, both Pearson and Spearman correlations will fail and we have
to use more complex measures, such as the distance correlation
(https://en.wikipedia.org/wiki/Distance_correlation) or an information-theoretic measure
such as mutual information.↩︎
22. This being said, the test rightfully warns of the large number of ties, so in this case the
Spearman P -value may not be the last word on the matter either. With data like this, we
should always take a step back and think carefully whether it can really answer the
questions we are trying to ask of it.↩︎
23. simulated with the Negative binomial distribution rnbinom(n, mu=100, size=2)/100 .↩︎