
Chapter 8 Analytical Statistical Assessment
In the last chapter we discussed how trends in data can arise by chance, leading us to wrong
conclusions. We saw that statistical hypothesis testing can help to guard us from being fooled
by randomness in this way. We developed the permutation test as an empirical way to perform
hypothesis tests.

While permutation testing is very general and requires few assumptions, it has its limitations,
as we will see shortly. In this chapter, we will therefore discuss a more analytical approach to
testing. We will see several classical tests, such as the binomial test and the t-test. These
tests often make stronger assumptions about the underlying data. Thus, it is important to
understand when they can and cannot be used. We will discuss the quantile-quantile plot (Q-Q
plot) as a method to check some of these assumptions.

In the last chapter we also touched on the topic of confidence intervals, which help us quantify
the uncertainty of our estimates, and also developed an empirical way to compute them. In this
chapter, we will briefly describe how to compute them analytically.

8.1 Motivation: Hypothesis testing in large datasets

We have already discussed how to test specific hypotheses, for instance considering the
association between a genetic marker and growth rate in the yeast dataset (See Chapter 7).
However, in the era of big data, we often do not restrict ourselves to testing just one single
hypothesis. Molecular biologists can nowadays measure RNA abundance of all genes of a cell
population. So what if we test the association of the RNA abundance of all ~8,000 yeast genes
with every single genetic marker? For 1,000 genetic markers, this means we will have to do
more than 8 million tests!
Doing this many tests can lead to misleading results. Let us assume, for the sake of argument,
that our null hypothesis is always true and there is never an association between RNA
abundance and markers. If we reject this null hypothesis every time we observe P ≤ 0.05 , we
will falsely reject the null hypothesis in roughly 5% of the tests we do. With 8 million tests, we
will then falsely reject the null hypothesis 400,000 times.
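
To make this concrete, here is a minimal simulation sketch (not from the original text; the number of tests is scaled down to 10,000 to keep it fast): we repeatedly test data in which the null hypothesis holds and count how often P ≤ 0.05.

# Sketch: many tests under a true null hypothesis, counting false rejections
set.seed(1)
n_tests <- 10000
pvals <- replicate(n_tests, t.test(rnorm(20), rnorm(20))$p.value)  # no true association
mean(pvals <= 0.05)  # close to 0.05, i.e. roughly 500 false rejections out of 10,000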

This issue is called multiple testing and strategies to deal with this problem will be discussed
in detail in Chapter 9. For now, it suffices to say that when we do many tests, we will usually
require far lower P -values to reject the null hypothesis, to guard against the problem
described above.
With permutation testing, we estimated P-values using P = (r + 1)/(m + 1), where m is the number of permutations and r is the number of permutations yielding a test statistic at least as extreme as the observed one (Equation (7.1)). It follows that, with this method, the P-values we can compute will never be smaller than 1/(m + 1). If we now say that we will only reject the null hypothesis if, for example, P ≤ 0.001, then we will need at least m = 1,000 permutations, otherwise our test simply cannot reject the null hypothesis, regardless of the true associations in the data. Since we are doing 8 million tests, this means we will end up doing more than 8 billion permutations.
Hence, permutation testing can become very costly in terms of computing power and time. We
thus require more scalable ways to estimate P -values for large datasets. This Chapter
provides methods for which P -values are computed from the observed test statistics directly.

8.2 The Binomial Test: testing hypotheses for a single binary variable

The first test we will look at is the binomial test. We use it when we want to test hypotheses
concerning one binary variable.

8.2.1 Abstraction: Tossing a coin

To develop the binomial test, we consider an abstract example, namely testing whether a coin
is biased.

We first introduce some notation. Let:

n : the total number of independent random tosses of the coin.
Xi : the value of the i-th toss, with Xi = 1 if the result is heads and Xi = 0 if it is tails.
μ = E(Xi) = p(Xi = 1) : the probability of getting a head.

We assume the Xi to be i.i.d.

As we will see again and again in this chapter, to develop a statistical test we require three
ingredients:

A null hypothesis H0 (and a suitable alternative hypothesis H1 , either one or two-sided)


A test statistic T

The distribution of this test statistic under the null hypothesis, p(T |H0 )

(Note that in permutation testing the sampling procedure simulated the distribution of our test
statistic under the null hypothesis.)

To test whether a coin is biased, our null hypothesis is that the coin is fair:

H0 : μ = 0.5

And our alternative is that it is biased (either towards heads or tails):

H1 : μ ≠ 0.5

As test statistic, we will use the total number of heads, i.e. T = ∑i Xi.

8.2.1.1 A single coin toss

Now assume, for the sake of argument, we toss the coin only once (n = 1 ) and get a head (
Tobs = 1 ). What is the two-sided P -value in this case?

In this scenario, there are of course only 2 possible outcomes. Either we get one head or we
get one tail. Under the null hypothesis, both outcomes are equally likely. Therefore, the
distribution of the test statistic under the null hypothesis is given by:

p(T = 0|H0 ) = 0.5 = p(T = 1|H0 )

The two-sided P -value is then given by:

P = 2 min{p(T ≤ Tobs |H0 ), p(T ≥ Tobs |H0 )}

= 2 × 0.5

= 1

Thus, if we only performed a single coin toss, the data cannot provide sufficient evidence for
rejecting the null hypothesis in a two-sided test. This, of course, does not allow us to conclude
that the null hypothesis is correct. In particular, we cannot write that “
p(H0 ) = p(μ = 0.5) = 1 .” After all, we could have generated the same data with a coin that
has heads on both sides, for which the null hypothesis clearly does not hold. Another way to look at it is to state that the data does not provide sufficient evidence to conclude that the coin is biased.
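
As a quick sanity check (a sketch, not part of the original text), the binom.test function introduced in Section 8.2.2 reproduces this result for a single toss:

# One toss (n = 1), one head (Tobs = 1): the two-sided P-value is 1
binom.test(x = 1, n = 1, p = 0.5, alternative = "two.sided")$p.value

## [1] 1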

8.2.1.2 Tossing the coin several times

Now assume we toss the coin n > 1 times and observe Tobs heads. What is the distribution of
the test statistic under the null hypothesis now?

We can easily simulate data under this assumption in R by sampling with replacement from a
vector c(0, 1) . The probability for each outcome can be provided with the prob argument.
Here is one such simulation of n = 10 trials under the null hypothesis μ = 0.5 .

# set.seed is optional
# we just pick an arbitrary seed of the random number generator to ensure reproducibility
# See https://en.wikipedia.org/wiki/Random_number_generation
set.seed(7)

n <- 10
x <- sample(c(0,1), n, replace=TRUE, prob=c(0.5,0.5))
x

## [1] 0 1 1 1 1 0 1 0 1 1

Our test statistic T is the sum of heads:

t <- sum(x)
t

## [1] 7

The probability of observing T heads after tossing a coin n times is given by the binomial
distribution, which is the binomial coefficient, i.e. the number of possible sequences of events
with the same total number of heads, times the probability of a given sequence, which is itself
the product of the probability of each individual realization (i.i.d. assumption):
p(T | n, μ) = (n choose T) μ^T (1 − μ)^(n−T)

Hence, assuming the null hypothesis is true, we get:

p(T | n, μ = 0.5) = (n choose T) 0.5^n

This is implemented in R with the function dbinom 14. The probability to have observed exactly
7 heads is therefore:

dbinom(x=t, size=n, prob=0.5)

## [1] 0.1171875

We recall that the P-value is defined as the probability, under the null hypothesis, of observing a test statistic as or more extreme than the one we actually observed. Since we just want to know whether the coin is biased, and do not care in which direction, we need a two-sided P-value. This is given by (Figure 8.1):

P = 2 min{p(T ≤ Tobs | H0), p(T ≥ Tobs | H0)}

  = 2 min{ ∑_{T ≤ Tobs} (n choose T) 0.5^n , ∑_{T ≥ Tobs} (n choose T) 0.5^n }   (8.1)
Figure 8.1: Two-sided P-value for the coin tossing example. The two-sided P-value equals the sum of the probabilities (total red area) under the null hypothesis of the realizations equal to or more extreme than the observed one (vertical line).
To apply formula (8.1), one can compute the smaller of the two terms, which is here ∑_{T ≥ 7} (10 choose T) 0.5^10, and corresponds to the right tail marked in red in Figure 8.1. This equals 1 − ∑_{T ≤ 6} (10 choose T) 0.5^10 and is obtained in R with:

1-pbinom(q=t-1, size=n, prob=0.5)

## [1] 0.171875

Hence the two-sided P -value is twice this value, yielding:

2*(1-pbinom(q=t-1, size=n, prob=0.5))

## [1] 0.34375

Altogether we have P = 0.34375. We do not reject the null hypothesis that the coin is fair at a
significance level of 0.05.

8.2.2 Computing a binomial test with R

In actual applications, we use the binom.test function of R. For the example above, we do:

binom.test(t, n, p = 0.5, alternative = c("two.sided") )


##
## Exact binomial test
##
## data: t and n
## number of successes = 7, number of trials = 10,

## p-value = 0.3438
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3475471 0.9332605
## sample estimates:
## probability of success

## 0.7

We see that the function has three parameters, which correspond to Tobs , n and the μ under
H0 respectively (R calls the last one p). Additionally, we can specify whether we want a two-
sided or one-sided test using the “alternative” option. The options are “two.sided,” “greater”
and “less.”

We also see that the function returns an object, which summarizes the test that was
performed. If we want to just get the P -value, we do:

tst <- binom.test(t, n, p = 0.5, alternative = c("two.sided") )


tst$p.value

## [1] 0.34375

Note that binom.test also returns sample estimates of the probability of success and
confidence intervals. Section 8.8 provides explanations.

8.3 Fisher’s exact test: Testing the association between two binary variables

Suppose we are trying to determine whether people who smoke are more likely to develop
severe symptoms if they contract a respiratory virus than non-smokers. For this we collect
data from n = 110 randomly sampled patients.
Assume we receive the following table as a result:

Severe Mild

Smoker 10 20

Non-smoker 10 70

We see that 30 of the patients were smokers, whereas 80 were non-smokers. We further observe that only 1/8th of non-smokers developed severe symptoms, whereas 1/3rd of the smokers did. The odds are 1:2 (10 severe versus 20 mild) for infected smokers to develop severe symptoms against 1:7 (10 severe versus 70 mild) for non-smokers. Hence, these data suggest that there is a relationship between smoking and developing severe symptoms, with odds about 3.5 times higher for smokers than for non-smokers.
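
As a small sketch (not from the original text), the sample odds ratio can be computed directly from the table counts:

# Sample odds ratio from the 2x2 table above
odds_smoker     <- 10 / 20   # severe : mild among smokers
odds_non_smoker <- 10 / 70   # severe : mild among non-smokers
odds_smoker / odds_non_smoker

## [1] 3.5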

Once again, we need to make sure that these results are statistically significant. We cannot
use the binomial test, because now we are not just considering one binary variable, but rather
we are investigating the relationship between two binary variables.15

8.3.1 Permutation testing and the hypergeometric distribution

It is enlightening to first approach this problem with permutation testing (See Chapter 7). To this end, one should first consider the underlying, non-aggregated dataset of individual cases. Such a tidy dataset, where one row is one patient and each column is one variable, would have the following structure:

Patient Smoker Symptoms

patient_1 no mild

patient_2 yes severe

patient_3 no severe

patient_4 yes severe

patient_5 yes mild

… … …
For permutation testing, the null hypothesis is the independence of the Smoker and the
Symptoms variables. With permutation testing, data under the null hypothesis are simulated by
permuting values in one column (say “Symptoms”) keeping the order of the other column (say
“Smoker”) fixed. For each permutation, we get a different 2x2 contingency table which we will
denote as:

Severe Mild Row total

Smoker a b a+b

Non-smoker c d c+d

Column total a+c b+d n=a+b+c+d

Note that any such permutation keeps the size of the dataset, the total number of smokers as
well as the total number of patients with severe symptoms constant. We say that these
permutations keep the margins (row and column totals) of the contingency table constant.

Consequently, one cell in the 2x2 contingency table suffices to characterize the entire table
because all other counts can then be derived using the margins. Therefore, we can use any
cell in the table as test statistic, but we usually use a, i.e. the upper left corner.

The good news is that the distribution of a under the null hypothesis (i.e. its frequency among
all possible distinct permutations) can be exactly computed. It is given by the hypergeometric
distribution:16

p(k = a | H0) = (a + b)! (c + d)! (a + c)! (b + d)! / (a! b! c! d! n!)

8.3.2 Fisher’s exact test

Using the hypergeometric distribution, we can now derive P-values in the same way as we did before, namely by summing the probabilities of observing a test statistic as or more extreme than the one we observed. So, to compute the one-sided P-value, we would use:

P = ∑_{i ≥ a} p(k = i | H0)

This is called Fisher’s exact test.
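
As a sketch (assuming the margins of the smoking table above: 30 smokers, 80 non-smokers, 20 severe cases), this one-sided P-value can be computed directly from the hypergeometric distribution in R; in practice one uses the fisher.test function shown in Section 8.3.3:

# One-sided P-value: probability of a >= 10 under the hypergeometric null
# (30 smokers, 80 non-smokers, 20 patients with severe symptoms)
sum(dhyper(10:20, m = 30, n = 80, k = 20))
# equivalently: 1 - phyper(9, m = 30, n = 80, k = 20)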

For our application purposes, we do not need to know the formula of the hypergeometric
distribution, nor how it is derived. However, it is important to know that a formula exists and
what the underlying assumptions are, i.e. that the margins of the 2x2 contingency table are
considered to be fixed. That means that we do not consider n, a + b, b + d, a + c and c + d as random variables, but instead take these quantities as given (we "condition" on them). Note
that this assumption is often violated in practice, as in the example above where we randomly sampled patients rather than sampling a fixed number from each subgroup. But Fisher’s exact test is nevertheless applied as an exact instance of permutation testing.

There are alternatives to Fisher’s exact test that do not require the assumption of fixed margins. One is the formerly popular Chi-squared test, which is based on large-sample approximations. It is rarely needed nowadays, as Fisher’s exact test is exact and fast to compute. Another approach is based on logistic regression and will be addressed in a later chapter.

8.3.3 Fisher’s exact test in R

In R, we can perform Fisher’s exact test using fisher.test . This requires a contingency table
as input (See the base R function table to create them). For our contingency table, we get:

tbl = data.table(
severe = c(10, 10),
mild = c(20, 70)
)
tst <- fisher.test(tbl, alternative = "greater")

tst

##

## Fisher's Exact Test for Count Data


##
## data: tbl
## p-value = 0.01481
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:

## 1.316358 Inf
## sample estimates:
## odds ratio
## 3.453224

The one-sided P -value is 0.0148095. At the level α = 0.05 , one would therefore reject the
null hypothesis of independence of symptom severity and smoking status.
As for the binomial test, the P-value can be extracted with $p.value. The function fisher.test also returns an estimate of the odds ratio and its confidence interval. The estimate of the odds ratio is based on an estimation procedure that is robust for low counts, giving estimates close to, yet slightly different from, the sample odds ratio (we obtained 3.5 at the start of Section 8.3).

8.4 Testing the association between one quantitative and one binary variable

We asked earlier on the yeast dataset (Section 7.1) whether the genotype at marker 5211
significantly associates with growth rates in Maltose media. We saw that yeast strains which
have the wild isolate genotype seemed to generally grow faster than those which had the lab
strain genotype at this particular marker:

genotype <- fread("extdata/eqtl/genotype.txt")
genotype <- genotype %>%
  melt(id.vars = 'strain', variable.name = 'marker', value.name = 'genotype')
marker <- fread("extdata/eqtl/marker.txt")
growth <- fread("extdata/eqtl/growth.txt")
growth <- growth %>%
  melt(id.vars = "strain", variable.name = 'media', value.name = 'growth_rate')
mk <- marker[chrom == "chr07" & start == 1069229, id]

dt <- merge(
  growth[media == 'YPMalt'],
  genotype[marker == mk, .(strain, genotype)],
  by = 'strain'
)

p <- dt %>%
  ggplot(., aes(genotype, growth_rate)) +
  geom_boxplot() +
  xlab(paste0("Genotype at ", mk)) +
  ylab("Growth rate in Maltose [Generations/day]") +
  mytheme

Here we are evaluating the association between a binary variable (the genotype at marker
5211) and a quantitative variable (the growth rate in Maltose media). This scenario does not fit
the tests we have seen previously. We have to develop a new one.
To formalize this problem, we first note that the binary variable splits the quantitative data into two groups. Let X = x1, ..., x_{nx} be the quantitative data of the first group (i.e. the growth rates of yeast strains with the lab strain genotype), and Y = y1, ..., y_{ny} be the quantitative data of the second group (i.e. the growth rates of yeast strains with the wild isolate genotype).

To develop a test, we again need a null hypothesis, a test statistic and a distribution of the test
statistic under the null hypothesis. For this problem, we will consider two different tests.

8.4.1 The t-test

The first test statistic we will look at is Student’s t, defined as:

t = c (x̄ − ȳ) / s   (8.2)

where x̄ and ȳ are the sample means of X and Y respectively, s is the pooled standard deviation, and c is a constant that depends on the sample size of each group. In detail:

x̄ = (1/nx) ∑i xi

ȳ = (1/ny) ∑i yi

s = √[ ( ∑i (xi − x̄)² + ∑i (yi − ȳ)² ) / (nx + ny − 2) ]

c = √[ nx ny / (nx + ny) ]

While the details can always be looked up, understanding Equation (8.2) is useful. Intuitively,
the t-statistic compares, up to the constant c, the “signal” of group difference, namely the
estimated difference of the means of the two groups, to the “noise,” i.e. how uncertain we are
about our estimate of this difference. This “noise” in our estimate is itself the ratio of the typical
variations within the groups (s) over a term capturing the sample size (c). One can thus
interpret it as a signal-to-noise ratio. If the t-statistic is large, then we see a clear difference in
means. By contrast, if the t-statistic is small, then the difference in means is not large
compared to the noise. Larger sample size (more data) or larger between-group differences
compared to within-group differences lead to larger t-statistics.
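
As a sketch (assuming the dt table with growth_rate and genotype columns constructed in the code above, and dropping missing values), the t-statistic of Equation (8.2) can be computed by hand and compared to the output of t.test shown below:

# Manual computation of Student's t (sketch; assumes dt from the yeast example above)
x <- dt[genotype == "Lab strain"   & !is.na(growth_rate), growth_rate]
y <- dt[genotype == "Wild isolate" & !is.na(growth_rate), growth_rate]
nx <- length(x); ny <- length(y)
s <- sqrt((sum((x - mean(x))^2) + sum((y - mean(y))^2)) / (nx + ny - 2))  # pooled sd
c_const <- sqrt(nx * ny / (nx + ny))
c_const * (mean(x) - mean(y)) / s   # should match the t reported by t.test(..., var.equal=TRUE)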
8.4.1.1 Student’s t-distribution

Before we can derive the distribution of this test statistic under the null hypothesis, we need to
make some additional assumptions about the data, namely:

All observations x1, ..., x_{nx}, y1, ..., y_{ny} are independent of each other
X and Y both follow Gaussian distributions
X and Y have the same unknown variance

A consequence of these assumptions is that our null hypothesis simplifies. If both X and Y are Gaussian with the same variance, the only way the two groups can differ is if the Gaussians have different means. Therefore, the null hypothesis is that the expectations are equal:

H0 : E(X) = E(Y)

Under H0, our test statistic t follows a Student’s t-distribution with ν = nx + ny − 2 degrees of freedom (the degrees of freedom ν is the parameter of Student’s t-distribution). Figure 8.2 shows the shape of Student’s t-distribution for different degrees of freedom ν.

Figure 8.2: Student’s t-distribution for various degrees of freedom. Source: https://en.wikipedia.org/wiki/Student%27s_t-distribution

We can make two observations. Firstly, the distribution of the t-statistic under H0 does not
depend on the (unknown) variance. Secondly, Student’s t-distribution has heavier tails than
the Gaussian. This intuitively comes from the fact that, while the numerator of the t-statistic is
normally distributed, the estimate of the standard deviation in the denominator is noisy. The
smaller the sample size n , the noisier the estimate. The degrees of freedom capture just this.
They are equal to the number of data points (here nx + ny ) minus the number of parameters
of the model (here minus 2 for the means of the two groups). The smaller the degrees of freedom, the less accurate our estimate of the standard deviation is, and the more variation the denominator of the t-statistic brings. Hence, the smaller the degrees of freedom, the heavier the tails. For infinite degrees of freedom, Student’s t-distribution equals the normal distribution.
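
To see the heavier tails numerically, here is a small sketch comparing the probability of exceeding ±3 under Student’s t-distribution and under the standard normal:

# Two-sided tail probability beyond +/- 3
2 * (1 - pt(3, df = 5))     # heavy-tailed: t with 5 degrees of freedom
2 * (1 - pt(3, df = 100))   # with many degrees of freedom, close to the normal
2 * (1 - pnorm(3))          # standard normal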

8.4.1.2 Student’s t-test in R

In R we can perform a t-test using the t.test function. Since in the basic Student’s t-test we assume equal variances, we have to set the argument var.equal to TRUE. One could extract the values for each group manually and perform the test. Rather than manually extracting the two groups, we use the formula syntax ( growth_rate ~ genotype ) and let the t.test function do it for us:

t.test(growth_rate ~ genotype, data=dt, var.equal=TRUE)

##
## Two Sample t-test
##

## data: growth_rate by genotype


## t = -10.77, df = 152, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Lab strain and group Wild isolate is not equal to 0
## 95 percent confidence interval:
## -2.406400 -1.660387
## sample estimates:

## mean in group Lab strain mean in group Wild isolate


## 5.763086 7.796480

Note that the function reports the t-statistic, the degrees of freedom and the confidence interval for the difference of the means, in addition to the p-value. Note also that the function helpfully reminds us which null hypothesis we are testing against.

8.4.1.3 Unequal variance (Welch’s test) in R

In practice, we generally do not assume equal variances. This is called Welch’s test and slightly changes the degrees of freedom. This test is performed in R by default if we do not set var.equal to TRUE.
t.test(growth_rate ~ genotype, data=dt)

##

## Welch Two Sample t-test


##
## data: growth_rate by genotype
## t = -10.805, df = 152, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Lab strain and group Wild isolate is not equal to 0
## 95 percent confidence interval:

## -2.405189 -1.661599
## sample estimates:
## mean in group Lab strain mean in group Wild isolate
## 5.763086 7.796480

8.4.2 Wilcoxon rank-sum test: An alternative to the t-test for non-Gaussian data

8.4.2.1 Assumptions

As we saw, the t-test assumes the data follows a specific distribution, namely a Gaussian. There are many situations where this is reasonable, but in general we cannot guarantee that this assumption holds. Using the t-test when the data is not normal can lead to wrong conclusions. For instance, distributions that have heavier tails than the Gaussian can lead to false null hypothesis rejections, as sample mean differences can seem to be surprisingly large (See section 8.4.3).

The Wilcoxon Rank-Sum test is a popular alternative to the t-test. It makes very few
assumptions about the data, namely that:

All observations x1, ..., x_{nx}, y1, ..., y_{ny} are independent of each other
The responses are ordinal, i.e. we can rank them

Specifically, we assume that under the null hypothesis H0 , the probability of an observation
from the population X exceeding an observation from the second population Y equals the
probability of an observation from Y exceeding an observation from X:
H0 : p(X > Y ) = p(Y > X)

In other words, if we randomly take observations x ∈ X and y ∈ Y, we would expect that x > y occurs as often as y > x (ignoring ties).

A stronger null hypothesis commonly used is “The distributions of both populations are equal”
which implies the previous hypothesis.

For a two-sided test, the alternative hypothesis is “the probability of an observation from the
population X exceeding an observation from the second population Y is different from the
probability of an observation from Y exceeding an observation from X:
p(X > Y ) ≠ p(Y > X) .” The alternative may also be stated in terms of a one-sided test, for
example: p(X > Y ) > p(Y > X) . This would mean that if we randomly take observations
x ∈ X and y ∈ Y , we would expect that x > y occurs more often than y > x .

8.4.2.2 The Mann-Whitney U statistic and the Wilcoxon rank-sum test

Consider first that we rank all observed values (and ignore ties), e.g.:

x5 < y10 < y12 < y3 < x4 < x17 <. . .

The idea of the Wilcoxon rank-sum test is that, under the null hypothesis, the xi’s and yi’s should be well interleaved in this ranking. In contrast, if, say, X tends to be smaller than Y, then the xi’s will get lower ranks. The test statistic is therefore based on the sum of the ranks of the realizations of one of the two variables.

Specifically, we define the quantity Ux as:

Ux = Rx − nx(nx + 1)/2

where Rx is the sum of the ranks of the xi’s (in the example above, Rx = 1 + 5 + 6 + ...) and nx is the number of observations of set X. The term nx(nx + 1)/2 (this is the famous Gauss sum) is a constant chosen so that Ux = 0 when all the lowest ranks belong to values from X.

Uy is defined analogously.

The Mann-Whitney U statistic is defined as:

U = min{Ux, Uy}
P-values are then based on the distribution of the Mann-Whitney U statistic under the null hypothesis. In practice, this combines tabulated values for small sample sizes with a Central Limit Theorem approximation for large sample sizes (exploiting the fact that the expectation and the variance of U under the null can be derived analytically).17
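
As a sketch (not from the original text), Ux and Uy can be computed by hand from the joint ranking of any two numeric vectors; note that R’s wilcox.test reports W, which corresponds to the U of its first group rather than min{Ux, Uy}:

# Manual Mann-Whitney U statistics (sketch) for two numeric vectors x and y
u_stats <- function(x, y) {
  r  <- rank(c(x, y))                          # joint ranking (midranks for ties)
  Rx <- sum(r[seq_along(x)])                   # sum of the ranks of the x's
  Ux <- Rx - length(x) * (length(x) + 1) / 2
  Uy <- length(x) * length(y) - Ux             # since Ux + Uy = nx * ny
  c(Ux = Ux, Uy = Uy, U = min(Ux, Uy))
}
# example usage: u_stats(rnorm(10), rnorm(12))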

8.4.2.3 Wilcoxon rank-sum test in R

In R we can perform the Wilcoxon rank-sum test using the wilcox.test function, whose
usage is analogous to the usage of t.test :

wilcox.test(growth_rate ~ genotype, data=dt)

##
## Wilcoxon rank sum test with continuity correction
##
## data: growth_rate by genotype
## W = 690, p-value = 2.264e-16

## alternative hypothesis: true location shift is not equal to 0

8.4.3 Why bother with the Wilcoxon rank-sum test?

The Wilcoxon rank-sum test makes fewer assumptions than the t-test, specifically because it does not require that the data follows a Gaussian distribution. We will now see an example to illustrate this.

We construct a highly pathological example:

set.seed(10)
x1 <- rnorm(100, 10)
x2 <- c(rnorm(99, 5), 500)
grp_tbl <- data.table(
  group = rep(c("group1", "group2"), each=100),
  x = c(x1, x2)
)
ggplot(data = grp_tbl, aes(x=x, color=group)) + geom_boxplot() + scale_x_log10()
In this example, the groups are sampled from normal distributions with different means. However, we add a non-normal outlier to the second group, which makes the means of the two groups look the same.

Recall that, as a consequence of assuming that the data is Gaussian, the null hypothesis of the t-test is that the difference in means is zero. There is no difference in means here, so the t-test cannot reject the null hypothesis:

t.test(x1, x2)
##
## Welch Two Sample t-test
##
## data: x1 and x2
## t = -0.00052129, df = 99.072, p-value = 0.9996

## alternative hypothesis: true difference in means is not equal to 0


## 95 percent confidence interval:
## -9.829688 9.824525
## sample estimates:
## mean of x mean of y
## 9.863451 9.866033

But clearly, these groups are overall quite different, and their means only appear similar due to this one outlier. The null hypothesis of the Wilcoxon rank-sum test is not about the means. Instead, the Wilcoxon rank-sum test uses the rank distribution, and in our example most observations of the second group will rank below the observations of the first. The one outlier will not affect the ranking much. Thus, the Wilcoxon rank-sum test will reject here:

wilcox.test(x1, x2)

##
## Wilcoxon rank sum test with continuity correction
##
## data: x1 and x2
## W = 9900, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

This is a rather synthetic example, but the underlying point is very general: if we are unsure
whether the distributional assumption is met, a test like the Wilcoxon rank-sum test will
generally be more robust than a test making distributional assumptions like the t-test. But do
note that there is a flip side to this: if the data is indeed Gaussian, then the t-test will be more
powerful (i.e. more sensitive in detecting violations of the null hypothesis) than the more
generally applicable Wilcoxon rank-sum test.
8.5 Association between two quantitative variables

The last scenario we will consider in this chapter concerns the dependence between two
quantitative variables. That is, we assume we have quantitative data in the form of tuples
(X, Y ) : (x1 , y1 ), . . . , (xn , yn ) and we want to see if knowing one of the values in such a
tuple gives us information about the other one.

As a visual example, we consider a synthetic dataset, namely Anscombe’s first dataset18:

Looking at the plot above, it sure seems that there is a positive relationship between X and Y in this data. Specifically, if we know that xi is relatively high, it seems that we can usually assume that yi will be high too (and vice-versa). But once again, we need a test to prevent us from being fooled by randomness.

This means we again need null and alternative hypotheses, a test statistic and a distribution of
the test statistic under the null hypothesis.

We will consider two different tests which are based on different notions of the concept of
correlation.

8.5.1 The Pearson correlation test

An important property is that when two variables (X, Y) form a bivariate Gaussian distribution,19 their independence is equivalent to their population Pearson correlation coefficient ρX,Y being equal to 0 (See Appendix D). This motivates a hypothesis test, called the Pearson correlation coefficient test.
8.5.1.1 Pearson’s correlation coefficient

The Pearson correlation coefficient test is based on the sample estimate of the population Pearson correlation coefficient, defined as:

r = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ( √(∑_{i=1}^{n} (xi − x̄)²) √(∑_{i=1}^{n} (yi − ȳ)²) )   (8.3)

where x̄ = (1/n) ∑_{i=1}^{n} xi is the sample mean, and analogously for ȳ.

Let us look at the components of this. The numerator compares the deviations of the xi and yi from their respective means. Terms of the sum are positive if both xi and yi vary in the same direction (larger or smaller) compared to their mean, and negative otherwise. Hence, the numerator is largely positive when deviations from the means agree in direction, largely negative when they are opposite, and about 0 when deviations are independent of each other. More formally, the numerator is proportional to the sample covariance (See Appendix D). The terms in the denominator are proportional to the individual sample standard deviations of X and Y (See Appendix D). Hence, r compares how much X and Y vary together to the product of how much they vary individually.

The Pearson correlation coefficient is symmetric. Moreover, it is invariant to affine transformations of the variables. It ranges from -1 to 1, where:

r = 1 implies that x and y are perfectly linearly related with a positive slope
r = −1 implies that x and y are perfectly linearly related with a negative slope

8.5.1.2 The test

The assumptions of the Pearson correlation test are:

(X, Y ) is a bivariate Gaussian distribution


The observations (Xi , Yi ) are i.i.d.

The null hypothesis is that the two variables are statistically independent, which under the above assumptions amounts to stating that:

H0 : ρ(X,Y) = 0

The test statistic is given by:

t = r √( (n − 2) / (1 − r²) )
Under H0, the test statistic t defined above follows a Student’s t-distribution with n − 2 degrees of freedom.
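
A minimal sketch of this computation for Anscombe’s first dataset (the anscombe data frame ships with base R), to be compared with the cor.test output below:

# Manual Pearson correlation test (sketch) on Anscombe's first dataset
x <- anscombe$x1; y <- anscombe$y1
n <- length(x)
r <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))
t_stat <- r * sqrt((n - 2) / (1 - r^2))
2 * (1 - pt(abs(t_stat), df = n - 2))   # two-sided P-value, matching cor.test below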

In R, we can use cor.test with method=“pearson” to perform a Pearson correlation test.

cor.test(anscombe$x1, anscombe$y1, method="pearson")

##
## Pearson's product-moment correlation
##

## data: anscombe$x1 and anscombe$y1


## t = 4.2415, df = 9, p-value = 0.00217
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4243912 0.9506933

## sample estimates:
## cor
## 0.8164205

8.5.2 The Spearman rank correlation test

8.5.2.1 Motivation

Pearson’s correlation captures linear relationships between variables, which is quite restrictive. For instance, if one of the variables is on a log scale or quadratic scale, then the linear relationship is lost. Here is a constructed example:

x_vec <- seq(-1.55, 1.55, 0.05)
tan_x <- tan(x_vec)
dt_tan <- data.table(x = x_vec, tan_x = tan_x)
ggplot(data=dt_tan, aes(x=x, y=tan_x)) + geom_point() + mytheme
These two constructed variables relate exactly to each other by a monotonic relationship (the
tangent function).20 However their Pearson correlation is modest:

cor(x_vec, tan_x, method="pearson")

## [1] 0.5643079

Conversely, Pearson’s correlation can be excessively large in the presence of outliers. Anscombe’s quartet provides an example:
cor.test(anscombe$x4, anscombe$y4, method="pearson")

##
## Pearson's product-moment correlation

##
## data: anscombe$x4 and anscombe$y4
## t = 4.243, df = 9, p-value = 0.002165
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4246394 0.9507224

## sample estimates:
## cor
## 0.8165214

We see there is a high Pearson correlation between X and Y. Furthermore, if we use a significance level α = 0.05, we reject the null hypothesis H0 : ρ = 0 and conclude that there is a statistically significant association between X and Y. The plot, however, tells us this is driven by a single outlier.21 The data is probably not Gaussian.
8.5.2.2 Spearman’s correlation coefficient

Spearman’s correlation (or rank-correlation, denoted ρ) addresses those issues by computing the correlation not on the original scale but on rank-transformed values. To compute ρ, we rank the variables X and Y separately, yielding rankings such as:

x7 < x3 < x5 < x1 <. . .

and

y3 < y5 < y7 < y2 <. . .

We then compute the position of each data point in the ranking, yielding the transformed
dataset:

rankx (X), ranky (Y ) = (rankx (x1 ), ranky (y1 )), . . . , (rankx (xn ), ranky (yn ))

For the rankings above, we would have for example that (rankx (x7 ), ranky (y7 )) = (1, 3) .

Spearman’s ρ is then computed as the Pearson correlation of the rank-transformed data.

In R we can compute it using:

cor(rank(anscombe$x4), rank(anscombe$y4), method="pearson")

## [1] 0.5

Or, more directly, by specifying method="spearman":

cor(anscombe$x4, anscombe$y4, method="spearman")

## [1] 0.5

8.5.2.3 The test

Based on the Spearman correlation, we can also define a test for the relationship between two
variables.

This test does not make distributional assumptions.

The null hypothesis is:


H0 : The population rank-correlation is 0.

R implements a statistical test based on tabulated exact permutations for small sample sizes
and approximations for larger sample sizes.

Applied to the Anscombe’s dataset we get:

cor.test(anscombe$x4, anscombe$y4, method="spearman")

## Warning in cor.test.default(anscombe$x4, anscombe$y4, method


## = "spearman"): Cannot compute exact p-value with ties

##
## Spearman's rank correlation rho
##
## data: anscombe$x4 and anscombe$y4

## S = 110, p-value = 0.1173


## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5

We see that the Spearman test would not reject, which makes sense, as the ranks are less likely to be misled by the outlier data point.22 Generally, the Spearman test is less powerful than Pearson’s when the data is actually Gaussian, but it is more robust to outliers and captures monotonic, yet non-linear, relationships. In practice, the Spearman test is often used.

8.6 Testing associations of two variables: Overview

Figure 8.3 summarizes the different tests we have seen for the association of two variables,
together with the typical companion plots:
Figure 8.3: Overview of two-variable tests

8.7 Assessing distributional assumptions with Q-Q Plots

As we saw in this chapter, several tests assume that the data follows a particular distribution.
We will now explore a plot which we can use to check whether such an assumption is
reasonable.

8.7.1 Limitations of Histograms

We already know a plot which can be used to visualize distributions, namely the histogram.
We might think that it could be used to check distributional assumptions. However, this is
somewhat complicated by the difficulty of choosing the right bin size. Consider, for example,
the following histogram, visualizing a sample taken from a uniform distribution on the interval 0
to 1:

x <- runif(50) ## uniformly distributed data points
# qplot is a quick plotting function
qplot(x, geom="histogram")
## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

Just looking at the histogram, it is hard to see that the underlying data comes from the uniform
distribution.

8.7.2 Q-Q plots: Comparing empirical to theoretical quantiles

What could be a better approach here? One thing we can do is look at the quantiles.

The basic idea here is as follows: if the data actually follows a uniform distribution on the
interval 0 to 1, then we expect 10% of the data in the interval [0,0.1], 20% in the interval
[0,0.2], and so on…

We can now compute whether our data conforms to this expectation. We get that:

dec <- quantile(x, seq(0,1,0.1))
dec
## 0% 10% 20% 30% 40%
## 0.02503686 0.09677631 0.14369892 0.29974460 0.36662386
## 50% 60% 70% 80% 90%
## 0.50953583 0.63289259 0.68685499 0.85694255 0.90948354
## 100%

## 0.99981636

Here we implicitly chose to always make jumps of 10% . These quantiles are therefore called
deciles.

We can make a scatter plot which compares the observed deciles of the dataset with the theoretical deciles of the uniform distribution:

ggplot(
  data.table(
    x=seq(0,1,0.1),
    y=dec
  ),
  aes(x,y)
) + geom_point() +
  xlim(c(0,1)) + ylim(c(0,1)) +
  xlab("Deciles of the uniform distribution") +
  ylab("Deciles of the dataset") +
  geom_abline(intercept=0, slope=1) ## diagonal y=x
We see that they match quite well.

For a finite sample we can estimate the quantile for every data point. One way is to use (r − 0.5)/N as the expected quantile (Hazen, 1914), where r is the rank of the data point and N the sample size. The R function ppoints gives more accurate values.

ggplot(
  data.table(
    x=ppoints(length(x)),
    y=sort(x)
  ),
  aes(x,y)
) + geom_point() +
  xlim(c(0,1)) + ylim(c(0,1)) +
  xlab("Quantiles of the uniform distribution") +
  ylab("Quantiles of the dataset") +
  geom_abline(intercept=0, slope=1) ## diagonal y=x
This is called a Q-Q plot, which is short for Quantile-Quantile plot. When the distribution matches the data, as above, the points should be close to the diagonal.

Let us now recall the example we used to justify the Wilcoxon test. There we added an extreme outlier to a Gaussian, which misled the t-test. Can we discover, using a Q-Q plot, that this data violates an assumption of normality?

group_qq <- c(rnorm(99, 0), 100)
qq_tbl <- data.table(sample = group_qq)
ggplot(data = qq_tbl, aes(sample = sample)) + geom_qq() + stat_qq_line()
Our artificially injected outlier shows up very clearly in the Q-Q plot as a strong deviation from
what we expect from a normal distribution.

8.7.3 Typical Q-Q plots

Figure 8.4 gives more examples. We assume here the Normal distribution (Gaussian with mean 0 and variance 1) as the reference theoretical distribution. These plots show how different violations of the distributional assumption translate to different deviations from the diagonal in a Q-Q plot.
Figure 8.4: Examples of Q-Q plots. The theoretical distribution is in each case the Normal
distribution (Gaussian with mean 0 and variance 1). The upper row shows histograms of some
observations, the lower row shows the matching Q-Q plots. The vertical red dashed line marks
the theoretical mean (0, top row) and the red lines the y=x diagonal (bottom row).
The middle three plots show what happens when one particular aspect of the distributional
assumption is incorrect. The second from the left shows what happens if the data has a mean
higher than we expected, but otherwise follows the distribution. The middle one shows what
happens if the data has fatter tails (i.e. more outliers) than we expected - this occurs frequently
in practice. The second from the right shows what happens if the distribution is narrower than
expected. The last plot shows a combination of these phenomena. There the data come from a
non-negative asymmetric distribution.23 The Q-Q plot shows a lack of low values (capped at 0)
and an excess of high values.

8.8 Analytical Confidence intervals

Remember the definition of confidence intervals: a confidence interval of confidence level 1 − α for a parameter θ is an interval which, were the data generation process to be repeated, would contain the parameter with probability 1 − α.

For instance, a 95% confidence interval for the expectation μ would be an interval [a, b] such that:

p(a ≤ μ ≤ b) = 0.95

Remember also that here, a and b are random, μ is not!

We have seen how to approximate confidence intervals using the case-resampling bootstrap
in the previous chapter. But, confidence intervals can also be computed analytically under
some assumptions. We will see this in detail for the binomial case first.

8.8.1 Binomial case

We use the same setup as previously:

We make N independent random tosses of a coin.
Xi : the value of the i-th toss, Xi = 1 for heads, Xi = 0 for tails.
μ = E(Xi) = p(Xi = 1) : the probability of getting a head (same for all tosses).

The sample mean is then given by:

X̄ = (1/N) ∑i Xi

And the estimated standard error of X̄ is given by:

SE^(X̄) = √( X̄(1 − X̄) / N )

We want to know the probability that the interval [X̄ − SE^(X̄), X̄ + SE^(X̄)] contains the true proportion μ. (And do not forget that X̄ and SE^ are random variables, whereas μ is not!)

To determine the probability that the interval includes μ, we need to compute:

p( X̄ − SE^(X̄) ≤ μ ≤ X̄ + SE^(X̄) )

By subtracting and dividing by the same quantities in all parts of the inequality, we get that the above is equivalent to:

p( −1 ≤ (X̄ − μ) / SE^(X̄) ≤ 1 )

8.8.1.1 Normal approximation interval using the Central Limit Theorem

The Central Limit Theorem implies that, for large N, the sample mean follows approximately a Normal distribution with mean E(X) and variance Var(X)/N:

p(X̄) = N(μ, Var(X)/N)

It is known that Var(Xi) = μ(1 − μ) (this is because the underlying data is Bernoulli distributed).

Hence,

p( −1 ≤ (X̄ − μ) / SE^(X̄) ≤ 1 ) = p( −1 ≤ Z ≤ 1 )

where Z is normally distributed with mean 0 and variance 1.

where Z is normally distributed with mean 0 and variance 1.

This can be computed in R using the cumulative distribution function of the normal distribution:

pnorm(1) - pnorm(-1)

## [1] 0.6826895

showing that the probability is approximately 68%.

8.8.1.2 Defining the interval for a predefined confidence level

If we want to have a larger probability, say 99%, we need to multiply the estimated standard error by whatever z satisfies the following:

Pr (−z ≤ Z ≤ z) = 0.99

This is obtained using the quantile function of the normal distribution. In R, using:

z <- qnorm(0.995)
z

## [1] 2.575829

will achieve this because by definition pnorm(qnorm(0.995)) is 0.995 and by symmetry pnorm(-qnorm(0.995)) is 1 - 0.995. As a consequence, we have that:
pnorm(z) - pnorm(-z)

## [1] 0.99

is 0.995 - 0.005 = 0.99 .

8.8.1.3 Boundaries of equi-tailed 95% confidence interval

We can use this approach for any confidence level 1 − α .

To obtain an equi-tailed confidence interval of level 1 − α, we set z = qnorm(1 - alpha/2) because (1 − α/2) − α/2 = 1 − α.

For α = 0.05, 1 − α/2 = 0.975 and we get the typically used 1.96 factor:

qnorm(0.975)

## [1] 1.959964
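
Putting the pieces together, here is a sketch of the normal-approximation interval for the coin example of Section 8.2 (7 heads out of 10 tosses); note that binom.test reports an exact interval, so the two will differ somewhat for such a small n:

# Normal-approximation 95% confidence interval for a proportion (sketch)
t_obs <- 7; n <- 10
x_bar  <- t_obs / n
se_hat <- sqrt(x_bar * (1 - x_bar) / n)
z <- qnorm(0.975)
c(lower = x_bar - z * se_hat, upper = x_bar + z * se_hat)
# compare with binom.test(t_obs, n)$conf.int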

8.8.2 Confidence intervals in R

Most statistical tests in R provide confidence intervals for the relevant statistics. This is
reported as part of the returned test object. For example, for the binomial test we get:

mu <- 0.45
N <- 1000
x <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1-mu, mu))
binom.test(sum(x), length(x))
##
## Exact binomial test
##
## data: sum(x) and length(x)
## number of successes = 421, number of trials = 1000,

## p-value = 6.537e-07
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3901707 0.4522958
## sample estimates:
## probability of success

## 0.421

You can see that the binom.test function automatically gives us a 95 percent confidence
interval. It is reported in the conf.int slot. So we can extract it using:

binom.test(sum(x), length(x))$conf.int

## [1] 0.3901707 0.4522958


## attr(,"conf.level")

## [1] 0.95

We can set the confidence level with the conf.level parameter. So if we want a 99% interval, we do:

binom.test(sum(x), length(x), conf.level=0.99)$conf.int

## [1] 0.3807435 0.4620192


## attr(,"conf.level")
## [1] 0.99

For some tests, you first need to set conf.int to TRUE to receive a confidence interval:

wilcox.test(growth_rate ~ genotype, data=dt, conf.int=TRUE)


##
## Wilcoxon rank sum test with continuity correction
##
## data: growth_rate by genotype
## W = 690, p-value = 2.264e-16

## alternative hypothesis: true location shift is not equal to 0


## 95 percent confidence interval:
## -2.537024 -1.753245
## sample estimates:
## difference in location
## -2.134131

Sometimes R will use more accurate estimations than the Normal approximation we have just
described. Details can usually be found in the documentation.

8.8.3 Advanced: A note on overlapping confidence intervals

Consider again a scenario where we are comparing two groups, X and Y, in terms of their means, x̄ and ȳ. Assume, for simplicity, that:

X ∼ N(μx, σ²)

Y ∼ N(μy, σ²)

with σ² known. Assume further that we have samples of each group of size ny = nx = n. We then know that:

x̄ ∼ N(μx, σ²/n)

ȳ ∼ N(μy, σ²/n)

We can now construct two analytical 95% confidence intervals, one for each mean. We use the same procedure as previously. We set up an interval:

Pr( x̄ − z σ/√n ≤ μx ≤ x̄ + z σ/√n ) = 0.95

And rearrange to get:

Pr( −z ≤ (x̄ − μx) / (σ/√n) ≤ z ) = Pr( −z ≤ Z ≤ z ) = 0.95

As before, Z ∼ N(0, 1) and thus we get z ≈ 1.96.

We round 1.96 ≈ 2 to make the math nicer, yielding the intervals:

x̄ ± 2 σ/√n

ȳ ± 2 σ/√n

Now assume we want to test the null hypothesis that the true mean difference is zero. In the
literature it is quite common practice to say that we reject this null hypothesis if and only if the
two confidence intervals do not overlap.

It is important to note that this is not the same as constructing a 95% confidence interval for the difference in means x̄ − ȳ and rejecting if and only if that interval does not include zero. The difference comes from how the standard errors add up.

In our overlap test, we would reject whenever ȳ + 2 σ/√n < x̄ − 2 σ/√n (assuming ȳ < x̄). We can rearrange to get that we reject whenever:

4 σ/√n < x̄ − ȳ

Now let us construct a confidence interval for x̄ − ȳ. A basic property of normal random variables tells us that:

x̄ − ȳ ∼ N(μx − μy, 2 σ²/n)

Thus the correct confidence interval for x̄ − ȳ is (using again 1.96 ≈ 2):

(x̄ − ȳ) ± 2√2 σ/√n

Thus we reject whenever:

2√2 σ/√n < x̄ − ȳ

Now 2√2 < 4, thus this will reject more often than the "overlap" test. In other words, the overlap test is too conservative.
Nevertheless, this sort of overlap test is very often used in the literature. When you see it
being used, or when you use it yourself, keep in mind that it generally will fail to reject more
often than the confidence level indicates.
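
A small simulation sketch (assumed parameters, not from the original text) illustrates this conservativeness under the null hypothesis of equal means:

# Under H0 (equal means), compare rejection rates of the overlap rule and of the
# confidence interval for the difference in means (both built with the factor 2 ~ 1.96)
set.seed(42)
n <- 30; sigma <- 1; se <- sigma / sqrt(n)
sim <- replicate(10000, {
  d <- mean(rnorm(n, 0, sigma)) - mean(rnorm(n, 0, sigma))
  c(overlap = abs(d) > 4 * se,            # the two CIs do not overlap
    diff_ci = abs(d) > 2 * sqrt(2) * se)  # CI for the difference excludes 0
})
rowMeans(sim)   # overlap rule rejects in well under 5% of cases; diff_ci is close to 5%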

8.9 Discussion

In the last chapter, we saw permutation-based testing, which is very general: we can use it to test any ad hoc statistic, such as a mean difference, a median difference and so on. However, at the beginning of this chapter, we saw that permutation-based testing is computationally intensive and not always appropriate in the context of big data. This is why we discussed a number of analytical tests which can serve as alternatives to compute P-values and confidence intervals.

Some of the tests are parametric, i.e. they assume some parameterized function for the data distribution, with the null hypothesis formulated in terms of the parameters (for instance that the two groups follow Gaussian distributions and that the means of the two groups are equal). We saw that for many scenarios, non-parametric tests exist that make few assumptions about the distribution functions. Examples of such non-parametric tests are Fisher's exact test, the Wilcoxon rank-sum test, and the Spearman rank-correlation test. In general, these should be preferred to their parametric counterparts, unless we have good reason to believe that the more restrictive assumptions of the parametric test are met. We can check distributional assumptions using Q-Q plots.

The two-variable tests we discussed assess, in different ways, the dependencies between variables. By themselves, they cannot tell whether these relationships are causal and, if so, what the direction of causality is (See Chapter 6). For instance, we may be tempted to
conclude from the viral infection example that smoking is a cause of severe symptoms, but
that study design unfortunately cannot guarantee this. People who smoke could, for example,
on average be older, have a less healthy diet, or have other risk-factors, compared to non-
smokers. In other words, there could be confounding variables not taken into account here.

8.10 Conclusion

By now you should know:

the assumptions of the binomial test and how to apply it
how to recognize and apply all the two-variable tests in Figure 8.3
how to interpret a Q-Q plot

In applications, always report:

a significance assessment (P-value or confidence interval)
a plot showing the effect assessed
(if relevant) a Q-Q plot showing that the distributional assumptions of the test are met

8.11 Resources

The Testing Chapter of Modern Statistics for Modern Biology, by Holmes and Huber.

An online version is available at: https://www.huber.embl.de/msmb/

14. R typically provides 4 functions per distribution starting with the letters r, d, p, and q and
standing for random draws (rbinom, rnorm,…), density or probability mass (dbinom,
dnorm,…), cumulative distribution (pbinom, pnorm,…), and quantile (qbinom, qnorm,…)↩︎

15. It would also not be ideal to use the binomial test on the smoker data by fixing the
probability under the null to the probability estimated on the non-smokers, because that
probability would be a noisy estimate↩︎

16. More details at https://en.wikipedia.org/wiki/Hypergeometric_distribution↩︎

17. https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test↩︎

18. https://en.wikipedia.org/wiki/Anscombe%27s_quartet↩︎

19. https://en.wikipedia.org/wiki/Multivariate_normal_distribution↩︎

20. In some cases, we may also be interested in non-monotonic data (e.g. a U-shaped
relationship). In this case, both Pearson and Spearman correlations will fail and we have
to use more complex measures, such as the distance correlation
(https://en.wikipedia.org/wiki/Distance_correlation) or an information-theoretic measure
such as mutual information.↩︎

21. always give a plot along with your stats!↩︎

22. This being said, the test rightfully warns of the large number of ties, so in this case the
Spearman P -value may not be the last word on the matter either. With data like this, we
should always take a step back and think carefully whether it can really answer the
questions we are trying to ask of it.↩︎
23. simulated with the Negative binomial distribution rnbinom(n, mu=100, size=2)/100 .↩︎
