Stat Proof Book
DOI: 10.5281/zenodo.4305949
https://statproofbook.github.io/
[email protected]
2024-02-09, 15:20
Contents
I General Theorems 1
1 Probability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Random experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Random experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Sample space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Event space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.4 Probability space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Random event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Random variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.3 Random vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.4 Random matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.5 Constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.6 Discrete vs. continuous . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.7 Univariate vs. multivariate . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.2 Joint probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Marginal probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.4 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.5 Exceedance probability . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.6 Statistical independence . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.7 Conditional independence . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.8 Probability under independence . . . . . . . . . . . . . . . . . . 8
1.3.9 Mutual exclusivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.10 Probability under exclusivity . . . . . . . . . . . . . . . . . . . . 9
1.4 Probability axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.1 Axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.2 Monotonicity of probability . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Probability of the empty set . . . . . . . . . . . . . . . . . . . . 11
1.4.4 Probability of the complement . . . . . . . . . . . . . . . . . . . 11
1.4.5 Range of probability . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.6 Addition law of probability . . . . . . . . . . . . . . . . . . . . . 13
1.4.7 Law of total probability . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.8 Probability of exhaustive events . . . . . . . . . . . . . . . . . . 14
1.4.9 Probability of exhaustive events . . . . . . . . . . . . . . . . . . 15
1.5 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5.1 Probability distribution . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.10.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.10.2 Sample mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
1.10.3 Non-negative random variable . . . . . . . . . . . . . . . . . . . 41
1.10.4 Non-negativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
1.10.5 Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.10.6 Monotonicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
1.10.7 (Non-)Multiplicativity . . . . . . . . . . . . . . . . . . . . . . . . 45
1.10.8 Expectation of a trace . . . . . . . . . . . . . . . . . . . . . . . . 47
1.10.9 Expectation of a quadratic form . . . . . . . . . . . . . . . . . . 48
1.10.10 Squared expectation of a product . . . . . . . . . . . . . . . . . 49
1.10.11 Law of total expectation . . . . . . . . . . . . . . . . . . . . . . . 51
1.10.12 Law of the unconscious statistician . . . . . . . . . . . . . . . . 51
1.10.13 Expected value of a random vector . . . . . . . . . . . . . . . . . . . 53
1.10.14 Expected value of a random matrix . . . . . . . . . . . . . . . . . . . 54
1.11 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.11.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.11.2 Sample variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
1.11.3 Partition into expected values . . . . . . . . . . . . . . . . . . . 55
1.11.4 Non-negativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
1.11.5 Variance of a constant . . . . . . . . . . . . . . . . . . . . . . . . 56
1.11.6 Invariance under addition . . . . . . . . . . . . . . . . . . . . . . 57
1.11.7 Scaling upon multiplication . . . . . . . . . . . . . . . . . . . . . 57
1.11.8 Variance of a sum . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.11.9 Variance of linear combination . . . . . . . . . . . . . . . . . . . 58
1.11.10 Additivity under independence . . . . . . . . . . . . . . . . . . 59
1.11.11 Law of total variance . . . . . . . . . . . . . . . . . . . . . . . . . 59
1.11.12 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
1.12 Skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.12.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.12.2 Sample skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
1.12.3 Partition into expected values . . . . . . . . . . . . . . . . . . . 61
1.13 Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.13.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.13.2 Sample covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
1.13.3 Partition into expected values . . . . . . . . . . . . . . . . . . . 63
1.13.4 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
1.13.5 Self-covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
1.13.6 Covariance under independence . . . . . . . . . . . . . . . . . . 64
1.13.7 Relationship to correlation . . . . . . . . . . . . . . . . . . . . . 65
1.13.8 Law of total covariance . . . . . . . . . . . . . . . . . . . . . . . 65
1.13.9 Covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
1.13.10 Sample covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . 66
1.13.11 Covariance matrix and expected values . . . . . . . . . . . . . 67
1.13.12 Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
1.13.13 Positive semi-definiteness . . . . . . . . . . . . . . . . . . . . . . 68
1.13.14 Invariance under addition of vector . . . . . . . . . . . . . . . . 69
1.13.15 Scaling upon multiplication with matrix . . . . . . . . . . . . . 70
1.13.16 Cross-covariance matrix . . . . . . . . . . . . . . . . . . . . . . . . . 70
V Appendix 595
1 Proof by Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
2 Definition by Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
3 Proof by Topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
4 Definition by Topic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Chapter I
General Theorems
1 Probability theory
1.1 Random experiments
1.1.1 Random experiment
Definition: A random experiment is any repeatable procedure that results in one (→ I/1.2.2) out
of a well-defined set of possible outcomes.
• The set of possible outcomes is called sample space (→ I/1.1.2).
• A set of zero or more outcomes is called a random event (→ I/1.2.1).
• A function that maps from events to probabilities is called a probability function (→ I/1.5.1).
Together, sample space (→ I/1.1.2), event space (→ I/1.1.3) and probability function (→ I/1.1.4)
characterize a random experiment.
Sources:
• Wikipedia (2020): “Experiment (probability theory)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-19; URL: https://en.wikipedia.org/wiki/Experiment_(probability_theory).
1.1.2 Sample space

Sources:
• Wikipedia (2021): “Sample space”; in: Wikipedia, the free encyclopedia, retrieved on 2021-11-26;
URL: https://en.wikipedia.org/wiki/Sample_space.
1.1.3 Event space

Sources:
• Wikipedia (2021): “Event (probability theory)”; in: Wikipedia, the free encyclopedia, retrieved on
2021-11-26; URL: https://en.wikipedia.org/wiki/Event_(probability_theory).
1.1.4 Probability space

Sources:
• Wikipedia (2021): “Probability space”; in: Wikipedia, the free encyclopedia, retrieved on 2021-11-
26; URL: https://en.wikipedia.org/wiki/Probability_space#Definition.
1.2.1 Random event

Sources:
• Wikipedia (2020): “Event (probability theory)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-11-19; URL: https://en.wikipedia.org/wiki/Event_(probability_theory).
1.2.2 Random variable

Sources:
• Wikipedia (2020): “Random variable”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-
27; URL: https://en.wikipedia.org/wiki/Random_variable#Definition.
1.2.3 Random vector

Sources:
• Wikipedia (2020): “Multivariate random variable”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-05-27; URL: https://en.wikipedia.org/wiki/Multivariate_random_variable.
1.2.4 Random matrix

Sources:
• Wikipedia (2020): “Random matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-27;
URL: https://en.wikipedia.org/wiki/Random_matrix.
1.2.5 Constant
Definition: A constant is a quantity which does not change and thus always has the same value.
From a statistical perspective, a constant is a random variable (→ I/1.2.2) which is equal to its
expected value (→ I/1.10.1)
X = E(X) (1)
or equivalently, whose variance (→ I/1.11.1) is zero
Var(X) = 0 . (2)
Sources:
• ProofWiki (2020): “Definition: Constant”; in: ProofWiki, retrieved on 2020-09-09; URL: https:
//proofwiki.org/wiki/Definition:Constant#Definition.
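As a quick numerical sanity check (a Python sketch for illustration, not part of the book), a constant can be modeled as a degenerate random variable with a one-point probability mass function; its expected value then equals the constant and its variance is zero:

```python
# Model a constant X = c as a degenerate random variable:
# a one-point PMF that puts all probability mass on c.
c = 5.0
pmf = {c: 1.0}

mean = sum(x * p for x, p in pmf.items())               # E(X)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X)

assert mean == c    # X = E(X), equation (1)
assert var == 0.0   # Var(X) = 0, equation (2)
```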
1.2.6 Discrete vs. continuous

Sources:
• Wikipedia (2020): “Random variable”; in: Wikipedia, the free encyclopedia, retrieved on 2020-10-
29; URL: https://en.wikipedia.org/wiki/Random_variable#Standard_case.
1.2.7 Univariate vs. multivariate

Sources:
• Wikipedia (2020): “Multivariate random variable”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-11-06; URL: https://en.wikipedia.org/wiki/Multivariate_random_variable.
1.3 Probability
1.3.1 Probability
Definition: Let E be a statement about an arbitrary event such as the outcome of a random
experiment (→ I/1.1.1). Then, p(E) is called the probability of E and may be interpreted as
• (objectivist interpretation of probability:) some physical state of affairs, e.g. the relative frequency
of occurrence of E, when repeating the experiment (“Frequentist probability”); or
• (subjectivist interpretation of probability:) a degree of belief in E, e.g. the price at which someone
would buy or sell a bet that pays 1 unit of utility if E and 0 if not E (“Bayesian probability”).
Sources:
• Wikipedia (2020): “Probability”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-10;
URL: https://en.wikipedia.org/wiki/Probability#Interpretations.
1.3.2 Joint probability

Sources:
• Wikipedia (2020): “Joint probability distribution”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-05-10; URL: https://en.wikipedia.org/wiki/Joint_probability_distribution.
• Jason Brownlee (2019): “A Gentle Introduction to Joint, Marginal, and Conditional Probability”; in: Machine Learning Mastery, retrieved on 2021-08-01; URL: https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/.
1.3.3 Marginal probability

Sources:
• Wikipedia (2020): “Marginal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-10; URL: https://en.wikipedia.org/wiki/Marginal_distribution#Definition.
• Jason Brownlee (2019): “A Gentle Introduction to Joint, Marginal, and Conditional Probability”; in: Machine Learning Mastery, retrieved on 2021-08-01; URL: https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/.
1.3.4 Conditional probability

p(A|B) = p(A, B) / p(B)   (1)
where p(B) is the marginal probability (→ I/1.3.3) of B.
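As a numerical illustration (not part of the book; the joint probabilities below are arbitrary example values), the defining equation p(A|B) = p(A,B)/p(B) can be checked for two binary variables:

```python
# Joint probabilities p(A=a, B=b) of two dependent binary variables;
# the four values are arbitrary but sum to 1.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

p_B1 = sum(p for (a, b), p in joint.items() if b == 1)  # marginal p(B=1)
p_A1_given_B1 = joint[(1, 1)] / p_B1  # p(A=1|B=1) = p(A=1,B=1) / p(B=1)

assert abs(p_B1 - 0.6) < 1e-12
assert abs(p_A1_given_B1 - 2 / 3) < 1e-12
```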
Sources:
• Wikipedia (2020): “Conditional probability”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-10; URL: https://en.wikipedia.org/wiki/Conditional_probability#Definition.
• Jason Brownlee (2019): “A Gentle Introduction to Joint, Marginal, and Conditional Probability”; in: Machine Learning Mastery, retrieved on 2021-08-01; URL: https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/.
1.3.5 Exceedance probability

Sources:
• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009): “Bayesian model selection for
group studies”; in: NeuroImage, vol. 46, pp. 1004–1017, eq. 16; URL: https://www.sciencedirect.
com/science/article/abs/pii/S1053811909002638; DOI: 10.1016/j.neuroimage.2009.03.025.
• Soch J, Allefeld C (2016): “Exceedance Probabilities for the Dirichlet Distribution”; in: arXiv
stat.AP, 1611.01439; URL: https://arxiv.org/abs/1611.01439.
1.3.6 Statistical independence

p(X1 = x1 , . . . , Xn = xn ) = ∏_{i=1}^{n} p(Xi = xi )   for all xi ∈ Xi , i = 1, . . . , n   (1)
where p(x1 , . . . , xn ) are the joint probabilities (→ I/1.3.2) of X1 , . . . , Xn and p(xi ) are the marginal
probabilities (→ I/1.3.3) of Xi .
where F are the joint (→ I/1.5.2) or marginal (→ I/1.5.3) cumulative distribution functions (→
I/1.8.1) and f are the respective probability density functions (→ I/1.7.1).
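The factorization of the joint probability can be verified numerically; this Python sketch (an illustration, not part of the book) checks equation (1) for two independent fair dice:

```python
from itertools import product

# Joint PMF of two independent fair dice: 36 equally likely pairs.
pairs = list(product(range(1, 7), repeat=2))
joint = {pair: 1 / len(pairs) for pair in pairs}

# Marginal PMFs, obtained by summing the joint over the other variable.
p1 = {x: sum(p for (a, b), p in joint.items() if a == x) for x in range(1, 7)}
p2 = {y: sum(p for (a, b), p in joint.items() if b == y) for y in range(1, 7)}

# Check p(X1=x1, X2=x2) = p(X1=x1) * p(X2=x2) for all outcomes.
factorizes = all(abs(joint[(x, y)] - p1[x] * p2[y]) < 1e-12 for (x, y) in pairs)
assert factorizes
```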
Sources:
• Wikipedia (2020): “Independence (probability theory)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-06-06; URL: https://en.wikipedia.org/wiki/Independence_(probability_theory)
#Definition.
1.3.7 Conditional independence

p(X1 = x1 , . . . , Xn = xn |Y = y) = ∏_{i=1}^{n} p(Xi = xi |Y = y)   (1)

where p(x1 , . . . , xn |y) are the joint (conditional) probabilities (→ I/1.3.2) of X1 , . . . , Xn given Y and p(xi |y) are the marginal (conditional) probabilities (→ I/1.3.3) of Xi given Y .
where F are the joint (conditional) (→ I/1.5.2) or marginal (conditional) (→ I/1.5.3) cumulative
distribution functions (→ I/1.8.1) and f are the respective probability density functions (→ I/1.7.1).
Sources:
• Wikipedia (2020): “Conditional independence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-11-19; URL: https://en.wikipedia.org/wiki/Conditional_independence#Conditional_independence_
of_random_variables.
1.3.8 Probability under independence

p(A) = p(A|B)   and   p(B) = p(B|A) .   (1)
Proof: If A and B are independent (→ I/1.3.6), then the joint probability (→ I/1.3.2) is equal to the product of the marginal probabilities (→ I/1.3.3):

p(A, B) = p(A) · p(B) .   (2)

The law of conditional probability (→ I/1.3.4) states that

p(A|B) = p(A, B) / p(B) .   (3)
Combining (2) and (3), we have:
p(A|B) = p(A) · p(B) / p(B) = p(A) .   (4)
Equivalently, we can write:

p(B|A) = p(A, B) / p(A) = p(A) · p(B) / p(A) = p(B) .   (5)

■
Sources:
• Wikipedia (2021): “Independence (probability theory)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-07-23; URL: https://en.wikipedia.org/wiki/Independence_(probability_theory)
#Definition.
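The steps (2) to (4) of the proof can be retraced numerically (illustrative sketch, not part of the book; p(A) = 0.3 and p(B) = 0.5 are arbitrary example values):

```python
# Independent events A and B with assumed marginal probabilities.
p_A, p_B = 0.3, 0.5
p_AB = p_A * p_B          # independence: p(A,B) = p(A) * p(B)

p_A_given_B = p_AB / p_B  # law of conditional probability
p_B_given_A = p_AB / p_A

assert abs(p_A_given_B - p_A) < 1e-12  # p(A|B) = p(A)
assert abs(p_B_given_A - p_B) < 1e-12  # p(B|A) = p(B)
```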
1.3.9 Mutual exclusivity

p(A1 , . . . , An ) = 0   (1)

where p(A1 , . . . , An ) is the joint probability (→ I/1.3.2) of the statements A1 , . . . , An .
Sources:
• Wikipedia (2021): “Mutual exclusivity”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-23; URL: https://en.wikipedia.org/wiki/Mutual_exclusivity#Probability.
1.3.10 Probability under exclusivity

Proof: If A and B are mutually exclusive (→ I/1.3.9), then their joint probability (→ I/1.3.2) is zero:
p(A, B) = 0 . (2)
The addition law of probability (→ I/1.4.6) states that

p(A ∪ B) = p(A) + p(B) − p(A, B) ,   (3)

such that, with (2), we obtain:

p(A ∪ B) = p(A) + p(B) .   (4)

■
Sources:
• Wikipedia (2021): “Mutual exclusivity”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-23; URL: https://en.wikipedia.org/wiki/Mutual_exclusivity#Probability.
1.4 Probability axioms

1.4.1 Axioms of probability

• First axiom: The probability of an event is a non-negative real number:

P (E) ≥ 0 .   (1)

• Second axiom: The probability that at least one elementary event in the sample space will occur is one:

P (Ω) = 1 .   (2)
• Third axiom: The probability of any countable sequence of disjoint (i.e. mutually exclusive (→
I/1.3.9)) events E1 , E2 , E3 , . . . is equal to the sum of the probabilities of the individual events:
P (∪_{i=1}^{∞} Ei ) = ∑_{i=1}^{∞} P (Ei ) .   (3)
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 2; URL: https://archive.org/details/foundationsofthe00kolm/page/2/mode/2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, ch. 8.6, p. 288, eqs. 8.2-8.4; URL: https://www.
wiley.com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+
6th+Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#Axioms.
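The three axioms can be verified for a concrete probability space; this Python sketch (illustrative only, not part of the book) uses the uniform probability measure on a fair-die sample space:

```python
from itertools import chain, combinations

omega = frozenset(range(1, 7))       # sample space of a fair die
events = list(chain.from_iterable(   # event space: the power set
    combinations(omega, r) for r in range(len(omega) + 1)))

def P(E):
    return len(E) / len(omega)       # uniform probability measure

assert all(P(E) >= 0 for E in events)             # first axiom: non-negativity
assert P(omega) == 1                              # second axiom: P(Omega) = 1
E1, E2 = {1, 2}, {5}                              # disjoint events
assert abs(P(E1 | E2) - (P(E1) + P(E2))) < 1e-12  # third axiom (finite case)
```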
1.4.2 Monotonicity of probability

Proof: Set E1 = A, E2 = B \ A and Ei = ∅ for i ≥ 3. Then, the sets Ei are pairwise disjoint and
E1 ∪ E2 ∪ . . . = B, because A ⊆ B. Thus, from the third axiom of probability (→ I/1.4.1), we have:
P (B) = P (A) + P (B \ A) + ∑_{i=3}^{∞} P (Ei ) .   (2)
Since, by the first axiom of probability (→ I/1.4.1), the right-hand side is a series of non-negative numbers converging to P (B) on the left-hand side, it follows that

P (A) ≤ P (B) .   (3)

■
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 6; URL: https://archive.org/details/foundationsofthe00kolm/page/6/mode/2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced Theory of Statistics, Vol. 1: Distribution Theory, pp. 288-289; URL: https://www.wiley.com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#Monotonicity.
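The monotonicity statement A ⊆ B ⇒ P(A) ≤ P(B), and the role of P(B \ A) in the proof, can be checked on a fair die (illustrative sketch, not part of the book):

```python
omega = set(range(1, 7))   # fair-die sample space
A = {2, 4}
B = {2, 4, 6}              # A is a subset of B

def P(E):
    return len(E) / len(omega)  # uniform probability measure

assert A <= B                               # A is a subset of B
assert P(A) <= P(B)                         # monotonicity
assert abs(P(B) - P(A) - P(B - A)) < 1e-12  # the gap is P(B \ A), as in the proof
```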
1.4.3 Probability of the empty set

P (∅) = 0 .   (1)

Proof: Let Ei = ∅ for all i. Then, the sets Ei are pairwise disjoint, such that the third axiom of probability (→ I/1.4.1) gives

P (∅) = ∑_{i=1}^{∞} P (∅) .   (2)

Assume that the probability of the empty set is not zero, i.e. P (∅) > 0. Then, the right-hand side of (2) would be infinite. However, by the first axiom of probability (→ I/1.4.1), the left-hand side must be finite. This is a contradiction. Therefore, P (∅) = 0.
■
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 6, eq. 3; URL: https://archive.org/details/foundationsofthe00kolm/page/6/mode/
2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, ch. 8.6, p. 288, eq. (b); URL: https://www.wiley.
com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+
Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-07-
30; URL: https://en.wikipedia.org/wiki/Probability_axioms#The_probability_of_the_empty_
set.
1.4.4 Probability of the complement

P (Ac ) = 1 − P (A) .   (1)

Proof: Since A and Ac are mutually exclusive (→ I/1.3.9) and A ∪ Ac = Ω, the third axiom of probability (→ I/1.4.1) implies:
P (A ∪ Ac ) = P (A) + P (Ac )
P (Ω) = P (A) + P (Ac ) (2)
P (Ac ) = P (Ω) − P (A) .
The second axiom of probability (→ I/1.4.1) states that P (Ω) = 1, such that we obtain:

P (Ac ) = 1 − P (A) .   (3)

■
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 6, eq. 2; URL: https://archive.org/details/foundationsofthe00kolm/page/6/mode/
2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, ch. 8.6, p. 288, eq. (c); URL: https://www.wiley.
com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+
Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#The_complement_rule.
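The complement rule can be checked on a fair die (illustrative sketch, not part of the book):

```python
omega = set(range(1, 7))   # fair-die sample space
A = {6}                    # rolling a six
A_c = omega - A            # complement: not rolling a six

def P(E):
    return len(E) / len(omega)  # uniform probability measure

assert abs(P(A_c) - (1 - P(A))) < 1e-12  # P(A^c) = 1 - P(A)
```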
1.4.5 Range of probability

0 ≤ P (E) ≤ 1 .   (1)
Proof: By the first axiom of probability (→ I/1.4.1), the probability of E is non-negative:

P (E) ≥ 0 .   (2)
By combining the first axiom of probability (→ I/1.4.1) and the probability of the complement (→
I/1.4.4), we obtain:
1 − P (E) = P (E c ) ≥ 0
1 − P (E) ≥ 0 (3)
P (E) ≤ 1 .
Together, (2) and (3) imply

0 ≤ P (E) ≤ 1 .   (4)

■
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 6; URL: https://archive.org/details/foundationsofthe00kolm/page/6/mode/2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced Theory of Statistics, Vol. 1: Distribution Theory, pp. 288-289; URL: https://www.wiley.com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#The_numeric_bound.
1.4.6 Addition law of probability

P (A ∪ B) = P (A) + P (B \ A)
P (A ∪ B) = P (A) + P (B \ [A ∩ B]) .   (2)
Then, let E1 = B \ [A ∩ B] and E2 = A ∩ B, such that E1 ∪ E2 = B. Again, from the third axiom of
probability (→ I/1.4.1), we obtain:
P (B) = P (B \ [A ∩ B]) + P (A ∩ B)
P (B \ [A ∩ B]) = P (B) − P (A ∩ B) .   (3)

Combining (2) and (3), we obtain:

P (A ∪ B) = P (A) + P (B) − P (A ∩ B) .   (4)

■
Sources:
• A.N. Kolmogorov (1950): “Elementary Theory of Probability”; in: Foundations of the Theory of
Probability, p. 2; URL: https://archive.org/details/foundationsofthe00kolm/page/2/mode/2up.
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, ch. 8.6, p. 288, eq. (a); URL: https://www.wiley.
com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+
Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-30; URL: https://en.wikipedia.org/wiki/Probability_axioms#Further_consequences.
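The addition law can be checked by enumerating die outcomes (illustrative sketch, not part of the book; the events are arbitrary examples):

```python
omega = set(range(1, 7))   # fair-die sample space
A = {2, 4, 6}              # even number
B = {4, 5, 6}              # number greater than 3

def P(E):
    return len(E) / len(omega)  # uniform probability measure

lhs = P(A | B)                # P(A or B)
rhs = P(A) + P(B) - P(A & B)  # addition law
assert abs(lhs - rhs) < 1e-12
```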
1.4.7 Law of total probability

Bi ∩ Bj = ∅ ⇒ (A ∩ Bi ) ∩ (A ∩ Bj ) = A ∩ (Bi ∩ Bj ) = A ∩ ∅ = ∅ . (2)
∪i Bi = Ω ⇒ ∪i (A ∩ Bi ) = A ∩ (∪i Bi ) = A ∩ Ω = A . (3)
Thus, the third axiom of probability (→ I/1.4.1) implies that
P (A) = ∑_{i} P (A ∩ Bi ) .   (4)

■
Sources:
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced
Theory of Statistics, Vol. 1: Distribution Theory, p. 288, eq. (d); p. 289, eq. 8.7; URL: https://www.
wiley.com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+
6th+Edition-p-9780470669549.
• Wikipedia (2021): “Law of total probability”; in: Wikipedia, the free encyclopedia, retrieved on
2021-08-08; URL: https://en.wikipedia.org/wiki/Law_of_total_probability#Statement.
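The law of total probability can be checked on a fair die with a three-set partition (illustrative sketch, not part of the book):

```python
omega = set(range(1, 7))              # fair-die sample space
A = {2, 4, 6}                         # even number
partition = [{1, 2}, {3, 4}, {5, 6}]  # pairwise disjoint, union is omega

def P(E):
    return len(E) / len(omega)        # uniform probability measure

total = sum(P(A & B_i) for B_i in partition)  # sum of P(A and B_i)
assert abs(total - P(A)) < 1e-12              # equals P(A)
```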
1.4.8 Probability of exhaustive events

∪i Bi = Ω . (3)
Thus, the third axiom of probability (→ I/1.4.1) implies that
∑_{i} P (Bi ) = P (Ω) . (4)
and the second axiom of probability (→ I/1.4.1) implies that
∑_{i} P (Bi ) = 1 .   (5)

■
Sources:
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced Theory of Statistics, Vol. 1: Distribution Theory, pp. 288-289; URL: https://www.wiley.com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+Edition-p-9780470669549.
• Wikipedia (2021): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
08-08; URL: https://en.wikipedia.org/wiki/Probability_axioms#Axioms.
1.4.9 Probability of exhaustive events

Proof: The addition law of probability (→ I/1.4.6) states that for two events (→ I/1.2.1) A and B, the probability (→ I/1.3.1) of at least one of them occurring is:
∪i Bi = Ω . (6)
Since the probability of the sample space is one (→ I/1.4.1), this means that the left-hand side of
(5) becomes equal to one:
1 = ∑_{i=1}^{n} P (Bi ) .   (7)
■
Sources:
• Alan Stuart & J. Keith Ord (1994): “Probability and Statistical Inference”; in: Kendall’s Advanced Theory of Statistics, Vol. 1: Distribution Theory, pp. 288-289; URL: https://www.wiley.com/en-us/Kendall%27s+Advanced+Theory+of+Statistics%2C+3+Volumes%2C+Set%2C+6th+Edition-p-9780470669549.
• Wikipedia (2022): “Probability axioms”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
03-27; URL: https://en.wikipedia.org/wiki/Probability_axioms#Consequences.
1.5 Probability distributions

1.5.1 Probability distribution

Sources:
• Wikipedia (2020): “Probability distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-17; URL: https://en.wikipedia.org/wiki/Probability_distribution.
1.5.2 Joint distribution

Sources:
• Wikipedia (2020): “Joint probability distribution”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-05-17; URL: https://en.wikipedia.org/wiki/Joint_probability_distribution.
1.5.3 Marginal distribution

Sources:
• Wikipedia (2020): “Marginal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-17; URL: https://en.wikipedia.org/wiki/Marginal_distribution.
(→ I/1.5.2) of X and Y and the marginal distribution (→ I/1.5.3) of Y using the law of conditional
probability (→ I/1.3.4).
Sources:
• Wikipedia (2020): “Conditional probability distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-05-17; URL: https://en.wikipedia.org/wiki/Conditional_probability_distribution.
Sources:
• Wikipedia (2021): “Sampling distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
03-31; URL: https://en.wikipedia.org/wiki/Sampling_distribution.
1.6.1 Probability mass function

fX (x) = 0   (1)

for all x ∉ X ,
Sources:
• Wikipedia (2020): “Probability mass function”; in: Wikipedia, the free encyclopedia, retrieved on
2020-02-13; URL: https://en.wikipedia.org/wiki/Probability_mass_function.
fZ (z) = ∑_{y ∈ Y} fX (z − y) fY (y)   or   fZ (z) = ∑_{x ∈ X} fY (z − x) fX (x)   (1)
where fX (x), fY (y) and fZ (z) are the probability mass functions (→ I/1.6.1) of X, Y and Z.
Proof: Using the definition of the probability mass function (→ I/1.6.1) and the expected value (→
I/1.10.1), the first equation can be derived as follows:
fZ (z) = Pr(Z = z)
= Pr(X + Y = z)
= Pr(X = z − Y )
= E [Pr(X = z − Y |Y = y)]
= E [Pr(X = z − Y )]
= E [fX (z − Y )]
= ∑_{y ∈ Y} fX (z − y) fY (y) .   (2)
Note that the third-last transition is justified by the fact that X and Y are independent (→ I/1.3.6), such that conditional probabilities are equal to marginal probabilities (→ I/1.3.8). The second equation can be derived by switching X and Y .

■
Sources:
• Taboga, Marco (2017): “Sums of independent random variables”; in: Lectures on probability and mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/fundamentals-of-probability/sums-of-independent-random-variables.
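The convolution formula can be checked against direct enumeration for the sum of two fair dice (illustrative Python sketch, not part of the book):

```python
from itertools import product

pmf = {x: 1 / 6 for x in range(1, 7)}  # PMF of a fair die

def conv(fX, fY, z):
    # f_Z(z) = sum over y of f_X(z - y) * f_Y(y)
    return sum(fX.get(z - y, 0) * p_y for y, p_y in fY.items())

fZ = {z: conv(pmf, pmf, z) for z in range(2, 13)}

# Direct enumeration of all 36 equally likely ordered pairs.
direct = {z: 0.0 for z in range(2, 13)}
for a, b in product(range(1, 7), repeat=2):
    direct[a + b] += 1 / 36

assert all(abs(fZ[z] - direct[z]) < 1e-12 for z in fZ)
```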
Y = {y = g(x) : x ∈ X } . (2)
Proof: Because a strictly increasing function is invertible, the probability mass function (→ I/1.6.1)
of Y can be derived as follows:
fY (y) = Pr(Y = y)
= Pr(g(X) = y)
= Pr(X = g⁻¹(y))
= fX (g⁻¹(y)) .   (3)

■
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-10-29; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid3.
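The theorem can be checked for a fair die and the strictly increasing function g(x) = 2x + 1 (illustrative sketch, not part of the book; the function is an arbitrary example):

```python
fX = {x: 1 / 6 for x in range(1, 7)}  # PMF of a fair die

def g(x):
    return 2 * x + 1     # strictly increasing on the support

def g_inv(y):
    return (y - 1) // 2  # inverse of g

# f_Y(y) = f_X(g^{-1}(y)) on the support of Y
support_Y = [g(x) for x in fX]
fY = {y: fX[g_inv(y)] for y in support_Y}

# Direct computation: push each outcome of X through g.
direct = {}
for x, p in fX.items():
    direct[g(x)] = direct.get(g(x), 0) + p

assert fY == direct
```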
Y = {y = g(x) : x ∈ X } . (2)
Proof: Because a strictly decreasing function is invertible, the probability mass function (→ I/1.6.1)
of Y can be derived as follows:
fY (y) = Pr(Y = y)
= Pr(g(X) = y)
= Pr(X = g⁻¹(y))
= fX (g⁻¹(y)) .   (3)

■
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-11-06; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid6.
Y = {y = g(x) : x ∈ X } . (2)
Proof: Because an invertible function is a one-to-one mapping, the probability mass function (→
I/1.6.1) of Y can be derived as follows:
fY (y) = Pr(Y = y)
= Pr(g(X) = y)
= Pr(X = g⁻¹(y))
= fX (g⁻¹(y)) .   (3)

■
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
1.7.1 Probability density function

fX (x) ≥ 0   (1)

for all x ∈ R,
Pr(X ∈ A) = ∫_{A} fX (x) dx   (2)

for any A ⊂ X and

∫_{X} fX (x) dx = 1 .   (3)
Sources:
• Wikipedia (2020): “Probability density function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-02-13; URL: https://en.wikipedia.org/wiki/Probability_density_function.
fZ (z) = ∫_{−∞}^{+∞} fX (z − y) fY (y) dy   or   fZ (z) = ∫_{−∞}^{+∞} fY (z − x) fX (x) dx   (1)
where fX (x), fY (y) and fZ (z) are the probability density functions (→ I/1.7.1) of X, Y and Z.
Proof: The cumulative distribution function of a sum of independent random variables (→ I/1.8.2) is

FZ (z) = E [FX (z − Y )] .   (2)

Because the probability density function is the first derivative of the cumulative distribution function (→ I/1.7.7), it follows that

fZ (z) = d/dz FZ (z)
= d/dz E [FX (z − Y )]
= E [d/dz FX (z − Y )]
= E [fX (z − Y )]
= ∫_{−∞}^{+∞} fX (z − y) fY (y) dy .   (3)

■
Sources:
• Taboga, Marco (2017): “Sums of independent random variables”; in: Lectures on probability and mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/fundamentals-of-probability/sums-of-independent-random-variables.
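The convolution integral can be approximated numerically. For independent X, Y ~ Uniform(0,1), the density of Z = X + Y is the triangular density, which this sketch recovers with a Riemann sum (illustrative only, not part of the book; the grid size is an implementation choice):

```python
def f_uniform(x):
    # density of Uniform(0,1)
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def f_Z(z, n=10_000):
    # Riemann-sum approximation of f_Z(z) = integral of f_X(z - y) f_Y(y) dy
    dy = 1.0 / n
    total = 0.0
    for i in range(n):
        y = (i + 0.5) * dy
        total += f_uniform(z - y) * f_uniform(y) * dy
    return total

# Triangular density: f_Z(z) = z on [0,1] and 2 - z on [1,2].
assert abs(f_Z(0.5) - 0.5) < 1e-3
assert abs(f_Z(1.0) - 1.0) < 1e-3
assert abs(f_Z(1.5) - 0.5) < 1e-3
```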
Y = {y = g(x) : x ∈ X } . (2)
fX (x) = dFX (x) / dx ,   (4)
the probability density function (→ I/1.7.1) of Y can be derived as follows:
1) If y does not belong to the support of Y , FY (y) is constant, such that
fY (y) = 0 , if y ∉ Y .   (5)
2) If y belongs to the support of Y , then fY (y) can be derived using the chain rule:
fY (y) = d/dy FY (y)   (by 4)
= d/dy FX (g⁻¹(y))   (by 3)
= fX (g⁻¹(y)) · dg⁻¹(y)/dy .   (6)
Taking (5) and (6) together proves (1).
■
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-10-29; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid4.
Y = {y = g(x) : x ∈ X } . (2)
FY (y) = 1 , if y > max(Y)
FY (y) = 1 − FX (g⁻¹(y)) + Pr(X = g⁻¹(y)) , if y ∈ Y   (3)
FY (y) = 0 , if y < min(Y)
Note that for continuous random variables, the probability (→ I/1.7.1) of point events is
Pr(X = a) = ∫_{a}^{a} fX (x) dx = 0 .   (4)
Because the probability density function is the first derivative of the cumulative distribution function
(→ I/1.7.7)
fX (x) = dFX (x) / dx ,   (5)
the probability density function (→ I/1.7.1) of Y can be derived as follows:
1) If y does not belong to the support of Y , FY (y) is constant, such that
fY (y) = 0 , if y ∉ Y .   (6)
2) If y belongs to the support of Y , then fY (y) can be derived using the chain rule:
fY (y) = d/dy FY (y)   (by 5)
= d/dy [1 − FX (g⁻¹(y)) + Pr(X = g⁻¹(y))]   (by 3)
= d/dy [1 − FX (g⁻¹(y))]   (by 4)
= − d/dy FX (g⁻¹(y))
= −fX (g⁻¹(y)) · dg⁻¹(y)/dy .   (7)

■
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-11-06; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid7.
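The theorem can be checked numerically for X ~ Exp(1) and the strictly decreasing function g(x) = exp(−x), for which Y = g(X) is known to be Uniform(0,1) (illustrative sketch, not part of the book):

```python
import math

def f_X(x):
    return math.exp(-x) if x >= 0 else 0.0  # density of Exp(1)

def g_inv(y):
    return -math.log(y)  # inverse of g(x) = exp(-x)

def d_g_inv(y):
    return -1.0 / y      # derivative of g^{-1}; negative, since g decreases

def f_Y(y):
    # f_Y(y) = -f_X(g^{-1}(y)) * dg^{-1}(y)/dy, as in the theorem
    return -f_X(g_inv(y)) * d_g_inv(y)

# f_Y should equal the Uniform(0,1) density, i.e. 1 on (0,1).
for y in (0.1, 0.5, 0.9):
    assert abs(f_Y(y) - 1.0) < 1e-9
```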
$$f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \, |J_{g^{-1}}(y)| \; , & \text{if} \; y \in \mathcal{Y} \\ 0 \; , & \text{if} \; y \notin \mathcal{Y} \end{cases} \quad (1)$$
if the Jacobian determinant satisfies
Y = {y = g(x) : x ∈ X } . (4)
Proof:
1) First, we obtain the cumulative distribution function (→ I/1.8.1) of Y = g(X). The joint CDF
(→ I/1.8.10) is given by
$$\begin{split}
F_Y(y) &= \Pr(Y_1 \le y_1, \ldots, Y_n \le y_n) \\
&= \Pr(g_1(X) \le y_1, \ldots, g_n(X) \le y_n) \\
&= \int_{A(y)} f_X(x) \, dx
\end{split} \quad (5)$$
$$\begin{split}
F_Y(z) &= \int_{B(z)} f_X(g^{-1}(y)) \, dg^{-1}(y) \\
&= \int_{-\infty}^{z_n} \cdots \int_{-\infty}^{z_1} f_X(g^{-1}(y)) \, dg^{-1}(y) \; ,
\end{split} \quad (7)$$
where B(z) denotes the accordingly modified integration region.
$$\begin{split}
F_Y(z) &= \int_{-\infty}^{z_n} \cdots \int_{-\infty}^{z_1} f_X(g^{-1}(y)) \, |J_{g^{-1}}(y)| \, dy \\
&= \int_{-\infty}^{z_n} \cdots \int_{-\infty}^{z_1} f_X(g^{-1}(y)) \, |J_{g^{-1}}(y)| \, dy_1 \ldots dy_n \; .
\end{split} \quad (10)$$
4) Finally, we obtain the probability density function (→ I/1.7.1) of Y = g(X). Because the PDF is
the derivative of the CDF (→ I/1.7.7), we can differentiate the joint CDF to get
$$\begin{split}
f_Y(z) &= \frac{d^n}{dz_1 \ldots dz_n} F_Y(z) \\
&= \frac{d^n}{dz_1 \ldots dz_n} \int_{-\infty}^{z_n} \cdots \int_{-\infty}^{z_1} f_X(g^{-1}(y)) \, |J_{g^{-1}}(y)| \, dy_1 \ldots dy_n \\
&= f_X(g^{-1}(z)) \, |J_{g^{-1}}(z)| \; .
\end{split} \quad (11)$$
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
• Lebanon, Guy (2017): “Functions of a Random Vector”; in: Probability: The Analysis of Data,
Vol. 1, retrieved on 2021-08-30; URL: http://theanalysisofdata.com/probability/4_4.html.
• Poirier, Dale J. (1995): “Distributions of Functions of Random Variables”; in: Intermediate Statis-
tics and Econometrics: A Comparative Approach, ch. 4, pp. 149ff.; URL: https://books.google.de/
books?id=K52_YvD1YNwC&hl=de&source=gbs_navlinks_s.
• Devore, Jay L.; Berk, Kennth N. (2011): “Conditional Distributions”; in: Modern Mathemat-
ical Statistics with Applications, ch. 5.2, pp. 253ff.; URL: https://books.google.de/books?id=
5PRLUho-YYgC&hl=de&source=gbs_navlinks_s.
• peek-a-boo (2019): “How to come up with the Jacobian in the change of variables formula”; in:
StackExchange Mathematics, retrieved on 2021-08-30; URL: https://math.stackexchange.com/a/
3239222.
• Bazett, Trefor (2019): “Change of Variables & The Jacobian | Multi-variable Integration”; in:
YouTube, retrieved on 2021-08-30; URL: https://www.youtube.com/watch?v=wUF-lyyWpUc.
$$f_Y(y) = \begin{cases} \frac{1}{|\Sigma|} f_X(\Sigma^{-1}(y - \mu)) \; , & \text{if} \; y \in \mathcal{Y} \\ 0 \; , & \text{if} \; y \notin \mathcal{Y} \end{cases} \quad (1)$$
where |Σ| is the determinant of Σ and Y is the set of possible outcomes of Y :
Y = {y = Σx + µ : x ∈ X } . (2)
Proof: Because the linear function g(X) = ΣX + µ is invertible and differentiable, we can determine
the probability density function of an invertible function of a continuous random vector (→ I/1.7.5)
using the relation
$$f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \, |J_{g^{-1}}(y)| \; , & \text{if} \; y \in \mathcal{Y} \\ 0 \; , & \text{if} \; y \notin \mathcal{Y} \end{cases} \quad (3)$$
The inverse function is

$$g^{-1}(y) = \Sigma^{-1}(y - \mu) \quad (4)$$

and the Jacobian matrix of this inverse function is

$$J_{g^{-1}}(y) = \Sigma^{-1} \; . \quad (5)$$
Plugging (4) and (5) into (3) and applying the determinant property |A−1 | = |A|−1 , we obtain
$$f_Y(y) = \frac{1}{|\Sigma|} f_X(\Sigma^{-1}(y - \mu)) \; . \quad (6)$$
■
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
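As a numerical sanity check (not part of the original proof), relation (6) can be verified for a standard bivariate normal X: for Y = ΣX + µ, the density f_X(Σ⁻¹(y − µ))/|Σ| must coincide with the known N(µ, ΣΣᵀ) density of Y. The matrix Σ, offset µ and test point below are arbitrary illustrative choices.

```python
import numpy as np

# arbitrary invertible matrix and offset (illustrative choices)
Sigma = np.array([[2.0, 0.5],
                  [0.0, 1.0]])
mu = np.array([1.0, -1.0])

def f_X(x):
    """PDF of a standard bivariate normal, X ~ N(0, I_2)."""
    return np.exp(-0.5 * x @ x) / (2 * np.pi)

def f_Y(y):
    """Claimed PDF of Y = Sigma X + mu: f_X(Sigma^{-1}(y - mu)) / |det Sigma|."""
    x = np.linalg.solve(Sigma, y - mu)
    return f_X(x) / abs(np.linalg.det(Sigma))

# Y = Sigma X + mu with X ~ N(0, I) is known to be N(mu, S), S = Sigma Sigma^T;
# compare the theorem's density against this closed form at a test point
S = Sigma @ Sigma.T
y = np.array([0.5, 0.2])
d = y - mu
reference = np.exp(-0.5 * d @ np.linalg.solve(S, d)) / (2 * np.pi * np.sqrt(np.linalg.det(S)))
assert abs(f_Y(y) - reference) < 1e-12
```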
$$f_X(x) = \frac{dF_X(x)}{dx} \; . \quad (1)$$
Proof: The cumulative distribution function in terms of the probability density function of a con-
tinuous random variable (→ I/1.8.6) is given by:
$$F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt, \quad x \in \mathbb{R} \; . \quad (2)$$
The fundamental theorem of calculus states that, if f (x) is a continuous real-valued function defined
on the interval [a, b], then it holds that
$$F(x) = \int_a^x f(t) \, dt \quad \Rightarrow \quad F'(x) = f(x) \quad \text{for all} \; x \in (a, b) \; . \quad (4)$$
Sources:
• Wikipedia (2020): “Fundamental theorem of calculus”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-12; URL: https://en.wikipedia.org/wiki/Fundamental_theorem_of_calculus#
Formal_statements.
1) If X is a discrete (→ I/1.2.6) random variable (→ I/1.2.2) with possible outcomes X and the
probability mass function (→ I/1.6.1) fX (x), then the cumulative distribution function is the function
(→ I/1.8.5) FX (x) : R → [0, 1] with
$$F_X(x) = \sum_{t \in \mathcal{X}, \, t \le x} f_X(t) \; . \quad (2)$$
Sources:
• Wikipedia (2020): “Cumulative distribution function”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-02-17; URL: https://en.wikipedia.org/wiki/Cumulative_distribution_function#
Definition.
$$F_Z(z) = \mathrm{E}\left[ F_X(z - Y) \right] \quad \text{or} \quad F_Z(z) = \mathrm{E}\left[ F_Y(z - X) \right] \quad (1)$$
where FX (x), FY (y) and FZ (z) are the cumulative distribution functions (→ I/1.8.1) of X, Y and Z
and E [·] denotes the expected value (→ I/1.10.1).
Proof: Using the definition of the cumulative distribution function (→ I/1.8.1), the first equation
can be derived as follows:
$$\begin{split}
F_Z(z) &= \Pr(Z \le z) \\
&= \Pr(X + Y \le z) \\
&= \Pr(X \le z - Y) \\
&= \mathrm{E}\left[ \Pr(X \le z - Y \,|\, Y = y) \right] \\
&= \mathrm{E}\left[ \Pr(X \le z - Y) \right] \\
&= \mathrm{E}\left[ F_X(z - Y) \right] \; .
\end{split} \quad (2)$$
Note that the second-last transition is justified by the fact that X and Y are independent (→
I/1.3.6), such that conditional probabilities are equal to marginal probabilities (→ I/1.3.8). The
second equation can be derived by switching X and Y .
Sources:
• Taboga, Marco (2017): “Sums of independent random variables”; in: Lectures on probability and
mathematical statistics, retrieved on 2021-08-30; URL: https://www.statlect.com/fundamentals-of-probability/sums-of-independent-random-variables.
where g^{-1}(y) is the inverse function of g(x) and Y is the set of possible outcomes of Y :
Y = {y = g(x) : x ∈ X } . (2)
Proof: The support of Y is determined by g(x) and by the set of possible outcomes of X. Moreover, if g(x) is strictly increasing, then g^{-1}(y) is also strictly increasing. Therefore, the cumulative distribution function (→ I/1.8.1) of Y can be derived as follows:
1) If y is lower than the lowest value (→ I/1.17.1) Y can take, then Pr(Y ≤ y) = 0, so F_Y(y) = 0, if y < min(Y).
2) If y belongs to the support of Y , then

$$\begin{split}
F_Y(y) &= \Pr(Y \le y) \\
&= \Pr(g(X) \le y) \\
&= \Pr(X \le g^{-1}(y)) \\
&= F_X(g^{-1}(y)) \; .
\end{split} \quad (4)$$
3) If y is higher than the highest value (→ I/1.17.2) Y can take, then Pr(Y ≤ y) = 1, so F_Y(y) = 1, if y > max(Y).
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-10-29; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid2.
Y = {y = g(x) : x ∈ X } . (2)
Proof: The support of Y is determined by g(x) and by the set of possible outcomes of X. Moreover, if g(x) is strictly decreasing, then g^{-1}(y) is also strictly decreasing. Therefore, the cumulative distribution function (→ I/1.8.1) of Y can be derived as follows:
1) If y is higher than the highest value (→ I/1.17.2) Y can take, then Pr(Y ≤ y) = 1, so F_Y(y) = 1, if y > max(Y).
2) If y belongs to the support of Y , then

$$\begin{split}
F_Y(y) &= \Pr(Y \le y) \\
&= 1 - \Pr(Y > y) \\
&= 1 - \Pr(g(X) > y) \\
&= 1 - \Pr(X < g^{-1}(y)) \\
&= 1 - \left[ \Pr(X < g^{-1}(y)) + \Pr(X = g^{-1}(y)) \right] + \Pr(X = g^{-1}(y)) \\
&= 1 - \Pr(X \le g^{-1}(y)) + \Pr(X = g^{-1}(y)) \\
&= 1 - F_X(g^{-1}(y)) + \Pr(X = g^{-1}(y)) \; .
\end{split} \quad (4)$$
3) If y is lower than the lowest value (→ I/1.17.1) Y can take, then Pr(Y ≤ y) = 0, so F_Y(y) = 0, if y < min(Y).
Sources:
• Taboga, Marco (2017): “Functions of random variables and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2020-11-06; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-variables-and-their-distribution#hid5.
$$F_X(x) \overset{(2)}{=} \sum_{t \in \mathcal{X}, \, t \le x} \Pr(X = t) \overset{(3)}{=} \sum_{t \in \mathcal{X}, \, t \le x} f_X(t) \; . \quad (4)$$
$$F_X(x) \overset{(2)}{=} \Pr(X \in (-\infty, x]) \overset{(3)}{=} \int_{-\infty}^{x} f_X(t) \, dt \; . \quad (4)$$
Y = FX (X) (1)
has a standard uniform distribution (→ II/3.1.2).
$$\begin{split}
F_Y(y) &= \Pr(Y \le y) \\
&= \Pr(F_X(X) \le y) \\
&= \Pr(X \le F_X^{-1}(y)) \\
&= F_X(F_X^{-1}(y)) \\
&= y \; ,
\end{split} \quad (2)$$
which is the cumulative distribution function of a continuous uniform distribution (→ II/3.1.4) with
a = 0 and b = 1, i.e. the cumulative distribution function (→ I/1.8.1) of the standard uniform
distribution (→ II/3.1.2) U(0, 1).
Sources:
• Wikipedia (2021): “Probability integral transform”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-04-07; URL: https://en.wikipedia.org/wiki/Probability_integral_transform#Proof.
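The probability integral transform can be illustrated empirically, here with an exponential distribution as an arbitrary example (not prescribed by the source): applying the CDF F_X(x) = 1 − e^{−λx} to exponential draws should yield approximately U(0, 1) samples.

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 1.5                                        # arbitrary rate parameter
x = rng.exponential(scale=1/lam, size=100_000)   # X ~ Exp(lam)

# probability integral transform: Y = F_X(X) with F_X(x) = 1 - exp(-lam * x)
y = 1.0 - np.exp(-lam * x)

# Y should be approximately standard uniform: mean 1/2, variance 1/12
assert abs(y.mean() - 0.5) < 0.01
assert abs(y.var() - 1/12) < 0.01
assert y.min() >= 0.0 and y.max() <= 1.0
```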
X = FX−1 (U ) (1)
has a probability distribution (→ I/1.5.1) characterized by the invertible (→ I/1.9.1) cumulative
distribution function (→ I/1.8.1) FX (x).
Proof: The cumulative distribution function (→ I/1.8.1) of the transformation X = FX−1 (U ) can be
derived as
$$\begin{split}
\Pr(X \le x) &= \Pr(F_X^{-1}(U) \le x) \\
&= \Pr(U \le F_X(x)) \\
&= F_X(x) \; ,
\end{split} \quad (2)$$
because the cumulative distribution function (→ I/1.8.1) of the standard uniform distribution (→ II/3.1.2) U(0, 1) is

$$F_U(u) = u \quad \text{for all} \; u \in [0, 1] \; . \quad (3)$$
Sources:
• Wikipedia (2021): “Inverse transform sampling”; in: Wikipedia, the free encyclopedia, retrieved on
2021-04-07; URL: https://en.wikipedia.org/wiki/Inverse_transform_sampling#Proof_of_correctness.
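Inverse transform sampling can be sketched for the exponential distribution, whose CDF F_X(x) = 1 − e^{−λx} inverts to F_X^{-1}(u) = −ln(1 − u)/λ (the distribution is an arbitrary example, not prescribed by the source):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                                  # arbitrary rate parameter
u = rng.uniform(size=200_000)              # U ~ U(0, 1)

# inverse transform sampling: X = F_X^{-1}(U) = -ln(1 - U) / lam
x = -np.log(1.0 - u) / lam

# X should follow Exp(lam): mean 1/lam, variance 1/lam^2
assert abs(x.mean() - 1/lam) < 0.01
assert abs(x.var() - 1/lam**2) < 0.01
```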
Proof: The cumulative distribution function (→ I/1.8.1) of the transformation X̃ = FY−1 (FX (X))
can be derived as
$$\begin{split}
F_{\tilde{X}}(y) &= \Pr\left( \tilde{X} \le y \right) \\
&= \Pr\left( F_Y^{-1}(F_X(X)) \le y \right) \\
&= \Pr\left( F_X(X) \le F_Y(y) \right) \\
&= \Pr\left( X \le F_X^{-1}(F_Y(y)) \right) \\
&= F_X\left( F_X^{-1}(F_Y(y)) \right) \\
&= F_Y(y) \; ,
\end{split} \quad (2)$$
which shows that X̃ and Y have the same cumulative distribution function (→ I/1.8.1) and are thus
identically distributed (→ I/1.5.1).
Sources:
• Soch, Joram (2020): “Distributional Transformation Improves Decoding Accuracy When Predict-
ing Chronological Age From Structural MRI”; in: Frontiers in Psychiatry, vol. 11, art. 604268;
URL: https://www.frontiersin.org/articles/10.3389/fpsyt.2020.604268/full; DOI: 10.3389/fpsyt.2020.604268.
Sources:
• Wikipedia (2021): “Cumulative distribution function”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-04-07; URL: https://en.wikipedia.org/wiki/Cumulative_distribution_function#
Definition_for_more_than_two_random_variables.
Sources:
• Wikipedia (2020): “Probability density function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-02-17; URL: https://en.wikipedia.org/wiki/Quantile_function#Definition.
Proof: The quantile function (→ I/1.9.1) QX (p) is defined as the function that, for a given quantile
p ∈ [0, 1], returns the smallest x for which FX (x) = p:
Sources:
• Wikipedia (2020): “Quantile function”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
12; URL: https://en.wikipedia.org/wiki/Quantile_function#Definition.
Sources:
• Wikipedia (2021): “Characteristic function (probability theory)”; in: Wikipedia, the free ency-
clopedia, retrieved on 2021-09-22; URL: https://en.wikipedia.org/wiki/Characteristic_function_
(probability_theory)#Definition.
• Taboga, Marco (2017): “Joint characteristic function”; in: Lectures on probability and mathematical
statistics, retrieved on 2021-10-07; URL: https://www.statlect.com/fundamentals-of-probability/
joint-characteristic-function.
$$\mathrm{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x) f_X(x) \quad \text{and} \quad \mathrm{E}[g(X)] = \int_{\mathcal{X}} g(x) f_X(x) \, dx \; , \quad (3)$$
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-09-22; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
$$M_X(t) = \mathrm{E}\left[ e^{t^{\mathrm{T}} X} \right] , \quad t \in \mathbb{R}^n \; . \quad (2)$$
Sources:
• Wikipedia (2020): “Moment-generating function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-01-22; URL: https://en.wikipedia.org/wiki/Moment-generating_function#Definition.
• Taboga, Marco (2017): “Joint moment generating function”; in: Lectures on probability and mathe-
matical statistics, retrieved on 2021-10-07; URL: https://www.statlect.com/fundamentals-of-probability/
joint-moment-generating-function.
$$\mathrm{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x) f_X(x) \quad \text{and} \quad \mathrm{E}[g(X)] = \int_{\mathcal{X}} g(x) f_X(x) \, dx \; , \quad (3)$$
Sources:
• Taboga, Marco (2017): “Functions of random vectors and their distribution”; in: Lectures on
probability and mathematical statistics, retrieved on 2021-09-22; URL: https://www.statlect.com/
fundamentals-of-probability/functions-of-random-vectors.
$$\begin{split}
M_Y(t) &= \mathrm{E}\left[ \exp\left( t^{\mathrm{T}} (AX + b) \right) \right] \\
&= \mathrm{E}\left[ \exp\left( t^{\mathrm{T}} A X \right) \cdot \exp\left( t^{\mathrm{T}} b \right) \right] \\
&= \exp\left( t^{\mathrm{T}} b \right) \cdot \mathrm{E}\left[ \exp\left( (A^{\mathrm{T}} t)^{\mathrm{T}} X \right) \right] \\
&= \exp\left( t^{\mathrm{T}} b \right) \cdot M_X(A^{\mathrm{T}} t) \; .
\end{split} \quad (3)$$
■
Sources:
• ProofWiki (2020): “Moment Generating Function of Linear Transformation of Random Variable”;
in: ProofWiki, retrieved on 2020-08-19; URL: https://proofwiki.org/wiki/Moment_Generating_
Function_of_Linear_Transformation_of_Random_Variable.
Because the expected value is multiplicative for independent random variables (→ I/1.10.7), we have
$$\begin{split}
M_X(t) &= \prod_{i=1}^{n} \mathrm{E}\left( \exp\left[ (a_i t) X_i \right] \right) \\
&= \prod_{i=1}^{n} M_{X_i}(a_i t) \; .
\end{split} \quad (4)$$
Sources:
• ProofWiki (2020): “Moment Generating Function of Linear Combination of Independent Random
Variables”; in: ProofWiki, retrieved on 2020-08-19; URL: https://proofwiki.org/wiki/Moment_
Generating_Function_of_Linear_Combination_of_Independent_Random_Variables.
Sources:
• Wikipedia (2020): “Probability-generating function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-05-31; URL: https://en.wikipedia.org/wiki/Probability-generating_function#Definition.
where f_X(x) is the probability mass function (→ I/1.6.1) of X. Here, we have g(X) = z^X, such that

$$\mathrm{E}\left[ z^X \right] = \sum_{x \in \mathcal{X}} z^x f_X(x) \; . \quad (3)$$

Since the set of possible outcomes is X = {0, 1, 2, . . .}, this becomes

$$\mathrm{E}\left[ z^X \right] = \sum_{x=0}^{\infty} f_X(x) \, z^x \; . \quad (4)$$
Sources:
• ProofWiki (2022): “Probability Generating Function as Expectation”; in: ProofWiki, retrieved on
2022-10-11; URL: https://proofwiki.org/wiki/Probability_Generating_Function_as_Expectation.
$$\begin{split}
G_X(0) &= \sum_{x=0}^{\infty} f_X(x) \cdot 0^x \\
&= f_X(0) + 0^1 \cdot f_X(1) + 0^2 \cdot f_X(2) + \ldots \\
&= f_X(0) + 0 + 0 + \ldots \\
&= f_X(0) \; .
\end{split} \quad (3)$$
■
Sources:
• ProofWiki (2022): “Probability Generating Function of Zero”; in: ProofWiki, retrieved on 2022-
10-11; URL: https://proofwiki.org/wiki/Probability_Generating_Function_of_Zero.
GX (1) = 1 . (1)
$$\begin{split}
G_X(1) &= \sum_{x=0}^{\infty} f_X(x) \cdot 1^x \\
&= \sum_{x=0}^{\infty} f_X(x) \cdot 1 \\
&= \sum_{x=0}^{\infty} f_X(x) \; .
\end{split} \quad (3)$$
Because the probability mass function (→ I/1.6.1) sums up to one, this becomes:
$$G_X(1) = \sum_{x \in \mathcal{X}} f_X(x) = 1 \; . \quad (4)$$
Sources:
• ProofWiki (2022): “Probability Generating Function of One”; in: ProofWiki, retrieved on 2022-
10-11; URL: https://proofwiki.org/wiki/Probability_Generating_Function_of_One.
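The two probability-generating-function properties above, G_X(0) = f_X(0) and G_X(1) = 1, can be checked numerically, here for a Poisson(λ) PMF (an arbitrary example; the infinite series is truncated where the remaining tail is negligible):

```python
import math

lam = 3.0      # arbitrary Poisson rate
N = 100        # truncation point; the Poisson(3) tail beyond this is negligible

def pmf(x):
    """Poisson(lam) probability mass function."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def G(z):
    """Probability-generating function G_X(z) = sum_x f_X(x) z^x (truncated series)."""
    return sum(pmf(x) * z**x for x in range(N))

assert abs(G(0) - pmf(0)) < 1e-12    # G_X(0) = f_X(0)  (using 0**0 == 1)
assert abs(G(1) - 1.0) < 1e-9        # G_X(1) = 1
```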
Sources:
• Wikipedia (2020): “Cumulant”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-31; URL:
https://en.wikipedia.org/wiki/Cumulant#Definition.
2) The expected value (or mean) of a continuous random variable (→ I/1.2.2) X with domain X is

$$\mathrm{E}(X) = \int_{\mathcal{X}} x \cdot f_X(x) \, dx \quad (2)$$
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-13;
URL: https://en.wikipedia.org/wiki/Expected_value#Definition.
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \; . \quad (1)$$
Sources:
• Wikipedia (2021): “Sample mean and covariance”; in: Wikipedia, the free encyclopedia, retrieved on
2020-04-16; URL: https://en.wikipedia.org/wiki/Sample_mean_and_covariance#Definition_of_
the_sample_mean.
Proof: Because the cumulative distribution function gives the probability of a random variable being
smaller than a given value (→ I/1.8.1),
$$\begin{split}
\int_0^\infty (1 - F_X(x)) \, dx &= \int_0^\infty \int_x^\infty f_X(z) \, dz \, dx \\
&= \int_0^\infty \int_0^z f_X(z) \, dx \, dz \\
&= \int_0^\infty f_X(z) \int_0^z 1 \, dx \, dz \\
&= \int_0^\infty [x]_0^z \cdot f_X(z) \, dz \\
&= \int_0^\infty z \cdot f_X(z) \, dz
\end{split} \quad (5)$$
and by applying the definition of the expected value (→ I/1.10.1), we see that
$$\int_0^\infty (1 - F_X(x)) \, dx = \int_0^\infty z \cdot f_X(z) \, dz = \mathrm{E}(X) \quad (6)$$
which proves the identity given above.
Sources:
• Kemp, Graham (2014): “Expected value of a non-negative random variable”; in: StackExchange
Mathematics, retrieved on 2020-05-18; URL: https://math.stackexchange.com/questions/958472/
expected-value-of-a-non-negative-random-variable.
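The identity E(X) = ∫₀^∞ (1 − F_X(x)) dx can be illustrated numerically for an exponential distribution (an arbitrary choice), where F_X(x) = 1 − e^{−λx} and E(X) = 1/λ, by integrating the survival function on a grid:

```python
import numpy as np

lam = 0.5                              # arbitrary rate; E(X) = 1/lam = 2
x = np.linspace(0.0, 80.0, 400_001)    # grid covering effectively all mass
survival = np.exp(-lam * x)            # 1 - F_X(x) for Exp(lam)

# trapezoidal integral of the survival function should equal the mean
integral = np.sum((survival[:-1] + survival[1:]) * np.diff(x)) / 2
assert abs(integral - 1/lam) < 1e-4
```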
1.10.4 Non-negativity
Theorem: If a random variable (→ I/1.2.2) is strictly non-negative, its expected value (→ I/1.10.1)
is also non-negative, i.e.
E(X) ≥ 0, if X ≥ 0 . (1)
Proof:
1) If X ≥ 0 is a discrete random variable, then, because the probability mass function (→ I/1.6.1)
is always non-negative, all the addends in
$$\mathrm{E}(X) = \sum_{x \in \mathcal{X}} x \cdot f_X(x) \quad (2)$$

are non-negative, so that the whole sum, and thus the expected value, must also be non-negative.
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-13;
URL: https://en.wikipedia.org/wiki/Expected_value#Basic_properties.
1.10.5 Linearity
Theorem: The expected value (→ I/1.10.1) is a linear operator, i.e.

$$\mathrm{E}(X + Y) = \mathrm{E}(X) + \mathrm{E}(Y) \quad \text{and} \quad \mathrm{E}(a\,X) = a \, \mathrm{E}(X) \quad (1)$$

for random variables (→ I/1.2.2) X and Y and a constant (→ I/1.2.5) a.
Proof:
1) If X and Y are discrete random variables (→ I/1.2.6), the expected value (→ I/1.10.1) is
$$\mathrm{E}(X) = \sum_{x \in \mathcal{X}} x \cdot f_X(x) \quad (2)$$

and the law of marginal probability (→ I/1.3.3) states that

$$f_X(x) = \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \; . \quad (3)$$

Applying this, we have
$$\begin{split}
\mathrm{E}(X + Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x + y) \cdot f_{X,Y}(x, y) \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} x \cdot f_{X,Y}(x, y) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} y \cdot f_{X,Y}(x, y) \\
&= \sum_{x \in \mathcal{X}} x \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) + \sum_{y \in \mathcal{Y}} y \sum_{x \in \mathcal{X}} f_{X,Y}(x, y) \\
&\overset{(3)}{=} \sum_{x \in \mathcal{X}} x \cdot f_X(x) + \sum_{y \in \mathcal{Y}} y \cdot f_Y(y) \\
&\overset{(2)}{=} \mathrm{E}(X) + \mathrm{E}(Y)
\end{split} \quad (4)$$
as well as
$$\begin{split}
\mathrm{E}(a\,X) &= \sum_{x \in \mathcal{X}} a\,x \cdot f_X(x) \\
&= a \sum_{x \in \mathcal{X}} x \cdot f_X(x) \\
&\overset{(2)}{=} a \, \mathrm{E}(X) \; .
\end{split} \quad (5)$$
2) If X and Y are continuous random variables (→ I/1.2.6), the expected value (→ I/1.10.1) is
$$\mathrm{E}(X) = \int_{\mathcal{X}} x \cdot f_X(x) \, dx \quad (6)$$
and the law of marginal probability (→ I/1.3.3) states that
$$p(x) = \int_{\mathcal{Y}} p(x, y) \, dy \; . \quad (7)$$
Applying this, we have
$$\begin{split}
\mathrm{E}(X + Y) &= \int_{\mathcal{X}} \int_{\mathcal{Y}} (x + y) \cdot f_{X,Y}(x, y) \, dy \, dx \\
&= \int_{\mathcal{X}} \int_{\mathcal{Y}} x \cdot f_{X,Y}(x, y) \, dy \, dx + \int_{\mathcal{X}} \int_{\mathcal{Y}} y \cdot f_{X,Y}(x, y) \, dy \, dx \\
&= \int_{\mathcal{X}} x \int_{\mathcal{Y}} f_{X,Y}(x, y) \, dy \, dx + \int_{\mathcal{Y}} y \int_{\mathcal{X}} f_{X,Y}(x, y) \, dx \, dy \\
&\overset{(7)}{=} \int_{\mathcal{X}} x \cdot f_X(x) \, dx + \int_{\mathcal{Y}} y \cdot f_Y(y) \, dy \\
&\overset{(6)}{=} \mathrm{E}(X) + \mathrm{E}(Y)
\end{split} \quad (8)$$
as well as
$$\begin{split}
\mathrm{E}(a\,X) &= \int_{\mathcal{X}} a\,x \cdot f_X(x) \, dx \\
&= a \int_{\mathcal{X}} x \cdot f_X(x) \, dx \\
&\overset{(6)}{=} a \, \mathrm{E}(X) \; .
\end{split} \quad (9)$$
Collectively, this shows that both requirements for linearity are fulfilled for the expected value (→
I/1.10.1), for discrete (→ I/1.2.6) as well as for continuous (→ I/1.2.6) random variables.
■
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-13;
URL: https://en.wikipedia.org/wiki/Expected_value#Basic_properties.
• Michael B, Kuldeep Guha Mazumder, Geoff Pilling et al. (2020): “Linearity of Expectation”; in:
brilliant.org, retrieved on 2020-02-13; URL: https://brilliant.org/wiki/linearity-of-expectation/.
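Linearity can be illustrated with a short Monte Carlo sketch on two deliberately dependent random variables (the distributions are arbitrary choices; note that the identities also hold exactly for sample means, up to floating-point error):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x = rng.gamma(shape=2.0, scale=1.0, size=n)   # X ~ Gamma(2, 1)
y = 0.5 * x + rng.normal(size=n)              # Y deliberately correlated with X
a = 3.0

# E(X + Y) = E(X) + E(Y) and E(aX) = a E(X), regardless of dependence
assert abs((x + y).mean() - (x.mean() + y.mean())) < 1e-9
assert abs((a * x).mean() - a * x.mean()) < 1e-9
```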
1.10.6 Monotonicity
Theorem: The expected value (→ I/1.10.1) is monotonic, i.e.

$$\mathrm{E}(X) \le \mathrm{E}(Y) \; , \quad \text{if} \; X \le Y \; . \quad (1)$$

Proof: Let Z = Y − X, such that Z ≥ 0. Due to the linearity of the expected value (→ I/1.10.5), we have

$$\mathrm{E}(Z) = \mathrm{E}(Y - X) = \mathrm{E}(Y) - \mathrm{E}(X) \; . \quad (2)$$

Because the expected value of a non-negative random variable is non-negative (→ I/1.10.4), it holds that E(Z) ≥ 0, such that E(Y) − E(X) ≥ 0 and thus E(X) ≤ E(Y).
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-17;
URL: https://en.wikipedia.org/wiki/Expected_value#Basic_properties.
1.10.7 (Non-)Multiplicativity
Theorem:
1) If two random variables (→ I/1.2.2) X and Y are independent (→ I/1.3.6), the expected value (→ I/1.10.1) is multiplicative, i.e.

$$\mathrm{E}(X\,Y) = \mathrm{E}(X) \, \mathrm{E}(Y) \; . \quad (1)$$
Proof:
1) If X and Y are independent (→ I/1.3.6), the joint density or mass function factorizes, i.e. it holds that

$$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \; . \quad (3)$$

Applying this to the expected value for discrete random variables (→ I/1.10.1), we have
$$\begin{split}
\mathrm{E}(X\,Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x \cdot y) \cdot f_{X,Y}(x, y) \\
&\overset{(3)}{=} \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} (x \cdot y) \cdot (f_X(x) \cdot f_Y(y)) \\
&= \sum_{x \in \mathcal{X}} x \cdot f_X(x) \sum_{y \in \mathcal{Y}} y \cdot f_Y(y) \\
&= \sum_{x \in \mathcal{X}} x \cdot f_X(x) \cdot \mathrm{E}(Y) \\
&= \mathrm{E}(X) \, \mathrm{E}(Y) \; .
\end{split} \quad (4)$$
And applying it to the expected value for continuous random variables (→ I/1.10.1), we have
$$\begin{split}
\mathrm{E}(X\,Y) &= \int_{\mathcal{X}} \int_{\mathcal{Y}} (x \cdot y) \cdot f_{X,Y}(x, y) \, dy \, dx \\
&\overset{(3)}{=} \int_{\mathcal{X}} \int_{\mathcal{Y}} (x \cdot y) \cdot (f_X(x) \cdot f_Y(y)) \, dy \, dx \\
&= \int_{\mathcal{X}} x \cdot f_X(x) \int_{\mathcal{Y}} y \cdot f_Y(y) \, dy \, dx \\
&= \int_{\mathcal{X}} x \cdot f_X(x) \cdot \mathrm{E}(Y) \, dx \\
&= \mathrm{E}(X) \, \mathrm{E}(Y) \; .
\end{split} \quad (5)$$
2) Let X and Y be Bernoulli random variables (→ II/1.2.1) with the following joint probability (→
I/1.3.2) mass function (→ I/1.6.1)
$$\begin{split}
p(X = 0, Y = 0) &= 1/2 \\
p(X = 0, Y = 1) &= 0 \\
p(X = 1, Y = 0) &= 0 \\
p(X = 1, Y = 1) &= 1/2 \; .
\end{split} \quad (6)$$

The corresponding marginal distributions (→ I/1.3.3) are

$$p(X = 0) = p(X = 1) = p(Y = 0) = p(Y = 1) = 1/2 \; . \quad (7)$$

Then, the expected value (→ I/1.10.1) of the product is
$$\begin{split}
\mathrm{E}(X\,Y) &= \sum_{x \in \{0,1\}} \sum_{y \in \{0,1\}} (x \cdot y) \cdot p(x, y) \\
&= (1 \cdot 1) \cdot p(X = 1, Y = 1) \\
&\overset{(6)}{=} \frac{1}{2} \; ,
\end{split} \quad (9)$$
while the product of their expected values is
$$\begin{split}
\mathrm{E}(X) \, \mathrm{E}(Y) &= \left( \sum_{x \in \{0,1\}} x \cdot p(x) \right) \cdot \left( \sum_{y \in \{0,1\}} y \cdot p(y) \right) \\
&= (1 \cdot p(X = 1)) \cdot (1 \cdot p(Y = 1)) \\
&\overset{(7)}{=} \frac{1}{4} \; ,
\end{split} \quad (10)$$
and thus, E(XY) = 1/2 ≠ 1/4 = E(X) E(Y), which shows that the expected value is not multiplicative for these two dependent random variables.
■
Sources:
• Wikipedia (2020): “Expected value”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-17;
URL: https://en.wikipedia.org/wiki/Expected_value#Basic_properties.
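The counterexample above can be reproduced exactly by enumerating the four outcomes of the joint PMF in (6):

```python
# joint PMF of the counterexample: P(0,0) = P(1,1) = 1/2, off-diagonal 0
p = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

E_XY = sum(x * y * pr for (x, y), pr in p.items())
E_X = sum(x * pr for (x, y), pr in p.items())
E_Y = sum(y * pr for (x, y), pr in p.items())

assert E_XY == 0.5          # E(XY) = 1/2
assert E_X * E_Y == 0.25    # E(X) E(Y) = 1/4
assert E_XY != E_X * E_Y    # the expected value is not multiplicative here
```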
Using this definition of the trace, the linearity of the expected value (→ I/1.10.5) and the expected
value of a random matrix (→ I/1.10.14), we have:
$$\begin{split}
\mathrm{E}\left[ \mathrm{tr}(A) \right] &= \mathrm{E}\left[ \sum_{i=1}^{n} a_{ii} \right] \\
&= \sum_{i=1}^{n} \mathrm{E}\left[ a_{ii} \right] \\
&= \mathrm{tr} \begin{bmatrix} \mathrm{E}[a_{11}] & \ldots & \mathrm{E}[a_{1n}] \\ \vdots & \ddots & \vdots \\ \mathrm{E}[a_{n1}] & \ldots & \mathrm{E}[a_{nn}] \end{bmatrix} \\
&= \mathrm{tr}\left( \mathrm{E}[A] \right) \; .
\end{split} \quad (3)$$
Sources:
• drerD (2018): “’Trace trick’ for expectations of quadratic forms”; in: StackExchange Mathematics,
retrieved on 2021-12-07; URL: https://math.stackexchange.com/a/3004034/480910.
$$\begin{split}
\mathrm{E}\left[ X^{\mathrm{T}} A X \right] &= \mathrm{tr}\left( A \left( \Sigma + \mu \mu^{\mathrm{T}} \right) \right) \\
&= \mathrm{tr}\left( A \Sigma + A \mu \mu^{\mathrm{T}} \right) \\
&= \mathrm{tr}(A \Sigma) + \mathrm{tr}(A \mu \mu^{\mathrm{T}}) \\
&= \mathrm{tr}(A \Sigma) + \mathrm{tr}(\mu^{\mathrm{T}} A \mu) \\
&= \mu^{\mathrm{T}} A \mu + \mathrm{tr}(A \Sigma) \; .
\end{split} \quad (7)$$
Sources:
• Kendrick, David (1981): “Expectation of a quadratic form”; in: Stochastic Control for Economic
Models, pp. 170-171.
• Wikipedia (2020): “Multivariate random variable”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-07-13; URL: https://en.wikipedia.org/wiki/Multivariate_random_variable#Expectation_
of_a_quadratic_form.
• Halvorsen, Kjetil B. (2012): “Expected value and variance of trace function”; in: StackExchange
CrossValidated, retrieved on 2020-07-13; URL: https://stats.stackexchange.com/questions/34477/
expected-value-and-variance-of-trace-function.
• Sarwate, Dilip (2013): “Expected Value of Quadratic Form”; in: StackExchange CrossValidated, re-
trieved on 2020-07-13; URL: https://stats.stackexchange.com/questions/48066/expected-value-of-quadratic-form.
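The identity E(XᵀAX) = µᵀAµ + tr(AΣ) can be spot-checked by Monte Carlo for a multivariate normal X (the particular A, µ and Σ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])          # covariance matrix of X
A = np.array([[1.0, 0.5],
              [0.2, 3.0]])              # arbitrary square matrix

n = 200_000
X = rng.multivariate_normal(mu, Sigma, size=n)

mc = np.einsum('ni,ij,nj->n', X, A, X).mean()   # Monte Carlo estimate of E[X^T A X]
exact = mu @ A @ mu + np.trace(A @ Sigma)       # = 11.6 + 5.21 for these choices

assert abs(mc - exact) < 0.2
```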
Proof: Note that Y² is a non-negative random variable (→ I/1.2.2) whose expected value is also non-negative (→ I/1.10.4):

$$\mathrm{E}\left( Y^2 \right) \ge 0 \; . \quad (2)$$
1) First, consider the case that E(Y²) > 0. Define a new random variable Z as

$$Z = X - Y \, \frac{\mathrm{E}(XY)}{\mathrm{E}(Y^2)} \; . \quad (3)$$
Once again, because Z² is always non-negative, its expected value is non-negative:

$$\mathrm{E}\left( Z^2 \right) \ge 0 \; . \quad (4)$$
Thus, using the linearity of the expected value (→ I/1.10.5), we have
$$\begin{split}
0 \le \mathrm{E}\left( Z^2 \right) &= \mathrm{E}\left[ \left( X - Y \, \frac{\mathrm{E}(XY)}{\mathrm{E}(Y^2)} \right)^2 \right] \\
&= \mathrm{E}\left[ X^2 - 2\,XY \, \frac{\mathrm{E}(XY)}{\mathrm{E}(Y^2)} + Y^2 \, \frac{[\mathrm{E}(XY)]^2}{[\mathrm{E}(Y^2)]^2} \right] \\
&= \mathrm{E}\left( X^2 \right) - 2 \, \mathrm{E}(XY) \, \frac{\mathrm{E}(XY)}{\mathrm{E}(Y^2)} + \mathrm{E}\left( Y^2 \right) \frac{[\mathrm{E}(XY)]^2}{[\mathrm{E}(Y^2)]^2} \\
&= \mathrm{E}\left( X^2 \right) - 2 \, \frac{[\mathrm{E}(XY)]^2}{\mathrm{E}(Y^2)} + \frac{[\mathrm{E}(XY)]^2}{\mathrm{E}(Y^2)} \\
&= \mathrm{E}\left( X^2 \right) - \frac{[\mathrm{E}(XY)]^2}{\mathrm{E}(Y^2)} \; ,
\end{split} \quad (5)$$
giving

$$[\mathrm{E}(XY)]^2 \le \mathrm{E}\left( X^2 \right) \mathrm{E}\left( Y^2 \right) \quad (6)$$

as required.
2) Next, consider the case that E(Y²) = 0. In this case, Y must be a constant (→ I/1.2.5) with mean (→ I/1.10.1) E(Y) = 0 and variance (→ I/1.11.1) Var(Y) = 0, thus we have
Pr(Y = 0) = 1 . (7)
This implies
Pr(XY = 0) = 1 , (8)
such that
E(XY ) = 0 . (9)
Thus, we can write

$$[\mathrm{E}(XY)]^2 = 0 = \mathrm{E}\left( X^2 \right) \mathrm{E}\left( Y^2 \right) \; , \quad (10)$$
giving

$$[\mathrm{E}(XY)]^2 \le \mathrm{E}\left( X^2 \right) \mathrm{E}\left( Y^2 \right) \quad (11)$$

as required.
■
Sources:
• ProofWiki (2022): “Square of Expectation of Product is Less Than or Equal to Product of Ex-
pectation of Squares”; in: ProofWiki, retrieved on 2022-10-11; URL: https://proofwiki.org/wiki/
Square_of_Expectation_of_Product_is_Less_Than_or_Equal_to_Product_of_Expectation_
of_Squares.
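The inequality [E(XY)]² ≤ E(X²) E(Y²) can be spot-checked on random samples; treating the empirical distribution of a finite sample as the distribution itself makes the sample means exact expectations for that discrete distribution, so the inequality must hold exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
# treat a finite sample as a discrete distribution with equal weights
x = rng.normal(size=10_000)
y = 2.0 * x + rng.standard_t(df=5, size=10_000)   # arbitrary, correlated with x

lhs = np.mean(x * y) ** 2                  # [E(XY)]^2 under the empirical measure
rhs = np.mean(x ** 2) * np.mean(y ** 2)    # E(X^2) E(Y^2)
assert lhs <= rhs
```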
Proof: Let X and Y be discrete random variables (→ I/1.2.6) with sets of possible outcomes X and Y. Then, the expectation of the conditional expectation can be rewritten as:

$$\begin{split}
\mathrm{E}[\mathrm{E}(X|Y)] &= \mathrm{E}\left[ \sum_{x \in \mathcal{X}} x \cdot \Pr(X = x \,|\, Y) \right] \\
&= \sum_{y \in \mathcal{Y}} \left[ \sum_{x \in \mathcal{X}} x \cdot \Pr(X = x \,|\, Y = y) \right] \cdot \Pr(Y = y) \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} x \cdot \Pr(X = x \,|\, Y = y) \cdot \Pr(Y = y) \; .
\end{split} \quad (2)$$
Using the law of conditional probability (→ I/1.3.4), i.e. Pr(X = x|Y = y) · Pr(Y = y) = Pr(X = x, Y = y), this becomes:

$$\begin{split}
\mathrm{E}[\mathrm{E}(X|Y)] &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} x \cdot \Pr(X = x, Y = y) \\
&= \sum_{x \in \mathcal{X}} x \sum_{y \in \mathcal{Y}} \Pr(X = x, Y = y) \; .
\end{split} \quad (3)$$
Finally, using the law of marginal probability (→ I/1.3.3), i.e. Σ_y Pr(X = x, Y = y) = Pr(X = x), we get:

$$\begin{split}
\mathrm{E}[\mathrm{E}(X|Y)] &= \sum_{x \in \mathcal{X}} x \cdot \Pr(X = x) \\
&= \mathrm{E}(X) \; .
\end{split} \quad (4)$$

■
Sources:
• Wikipedia (2021): “Law of total expectation”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-11-26; URL: https://en.wikipedia.org/wiki/Law_of_total_expectation#Proof_in_the_
finite_and_countable_cases.
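The law of total expectation, E[E(X|Y)] = E(X), can be verified exactly on a small discrete joint distribution (the probabilities below are arbitrary and sum to 1):

```python
# arbitrary joint PMF over (x, y) pairs
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.1, (2, 0): 0.1, (2, 1): 0.2}

E_X = sum(x * pr for (x, y), pr in p.items())

# E(X | Y = y) for each y, then average over the marginal distribution of Y
p_Y = {}
for (x, y), pr in p.items():
    p_Y[y] = p_Y.get(y, 0.0) + pr
E_X_given_Y = {y0: sum(x * pr for (x, y), pr in p.items() if y == y0) / p_Y[y0]
               for y0 in p_Y}
E_E = sum(E_X_given_Y[y0] * p_Y[y0] for y0 in p_Y)

assert abs(E_E - E_X) < 1e-12   # E[E(X|Y)] equals E(X)
```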
$$\mathrm{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x) f_X(x) \; . \quad (1)$$
2) If X is a continuous random variable with possible outcomes X and probability density function
(→ I/1.7.1) fX (x), the expected value (→ I/1.10.1) of g(X) is
$$\mathrm{E}[g(X)] = \int_{\mathcal{X}} g(x) f_X(x) \, dx \; . \quad (2)$$
$$\begin{split}
\mathrm{E}[g(X)] &= \sum_{y \in \mathcal{Y}} y \, \Pr(g(x) = y) \\
&= \sum_{y \in \mathcal{Y}} y \, \Pr(x = g^{-1}(y)) \\
&= \sum_{y \in \mathcal{Y}} y \sum_{x = g^{-1}(y)} f_X(x) \\
&= \sum_{y \in \mathcal{Y}} \sum_{x = g^{-1}(y)} y \, f_X(x) \\
&= \sum_{y \in \mathcal{Y}} \sum_{x = g^{-1}(y)} g(x) \, f_X(x) \; .
\end{split} \quad (4)$$
Finally, noting that “for all y, then for all x = g^{-1}(y)” is equivalent to “for all x” if g^{-1} is a monotonic function, we can conclude that
$$\mathrm{E}[g(X)] = \sum_{x \in \mathcal{X}} g(x) f_X(x) \; . \quad (5)$$
$$\begin{split}
F_Y(y) &= \Pr(Y \le y) \\
&= \Pr(g(X) \le y) \\
&= \Pr(X \le g^{-1}(y)) \\
&= F_X(g^{-1}(y)) \; .
\end{split} \quad (9)$$
Differentiating to get the probability density function (→ I/1.7.1) of Y , the result is:
$$\begin{split}
f_Y(y) &= \frac{d}{dy} F_Y(y) \\
&\overset{(9)}{=} \frac{d}{dy} F_X(g^{-1}(y)) \\
&= f_X(g^{-1}(y)) \, \frac{d}{dy}\left( g^{-1}(y) \right) \\
&\overset{(6)}{=} f_X(g^{-1}(y)) \, \frac{1}{g'(g^{-1}(y))} \; .
\end{split} \quad (10)$$
Finally, substituting (10) into (8), we have:

$$\int_{\mathcal{X}} g(x) f_X(x) \, dx = \int_{\mathcal{Y}} y \, f_Y(y) \, dy = \mathrm{E}[Y] = \mathrm{E}[g(X)] \; . \quad (11)$$
■
Sources:
• Wikipedia (2020): “Law of the unconscious statistician”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-07-22; URL: https://en.wikipedia.org/wiki/Law_of_the_unconscious_statistician#
Proof.
• Taboga, Marco (2017): “Transformation theorem”; in: Lectures on probability and mathematical
statistics, retrieved on 2021-09-22; URL: https://www.statlect.com/glossary/transformation-theorem.
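The law of the unconscious statistician can be sketched numerically: E[g(X)] computed by integrating g(x) f_X(x) should match the mean of g over samples of X. Here X ~ N(0, 1) and g(x) = x² are arbitrary choices, for which E[g(X)] = Var(X) = 1.

```python
import numpy as np

rng = np.random.default_rng(5)
g = lambda x: x ** 2

# Monte Carlo: average g over draws of X ~ N(0, 1)
x = rng.normal(size=500_000)
mc = g(x).mean()

# LOTUS: integrate g(x) f_X(x) dx on a grid, f_X the standard normal pdf
grid = np.linspace(-10, 10, 200_001)
f = np.exp(-0.5 * grid ** 2) / np.sqrt(2 * np.pi)
integrand = g(grid) * f
lotus = np.sum((integrand[:-1] + integrand[1:]) * np.diff(grid)) / 2

assert abs(lotus - 1.0) < 1e-6   # E[X^2] = 1 for N(0, 1)
assert abs(mc - lotus) < 0.02    # Monte Carlo agrees within sampling error
```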
Sources:
• Taboga, Marco (2017): “Expected value”; in: Lectures on probability theory and mathematical
statistics, retrieved on 2021-07-08; URL: https://www.statlect.com/fundamentals-of-probability/
expected-value#hid12.
• Wikipedia (2021): “Multivariate random variable”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-07-08; URL: https://en.wikipedia.org/wiki/Multivariate_random_variable#Expected_
value.
Sources:
• Taboga, Marco (2017): “Expected value”; in: Lectures on probability theory and mathematical
statistics, retrieved on 2021-07-08; URL: https://www.statlect.com/fundamentals-of-probability/
expected-value#hid13.
1.11 Variance
1.11.1 Definition
Definition: The variance of a random variable (→ I/1.2.2) X is defined as the expected value (→
I/1.10.1) of the squared deviation from its expected value (→ I/1.10.1):
$$\mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \; . \quad (1)$$
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-13; URL:
https://en.wikipedia.org/wiki/Variance#Definition.
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad (1)$$
and the unbiased sample variance of x is given by
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad (2)$$
where x̄ is the sample mean (→ I/1.10.2).
Sources:
• Wikipedia (2021): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-04-16; URL:
https://en.wikipedia.org/wiki/Variance#Sample_variance.
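In code, the two estimators differ only in the divisor; with NumPy this is selected by the ddof ("delta degrees of freedom") argument. A small sketch with an arbitrary sample:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)
xbar = x.mean()                               # sample mean = 5

biased = np.sum((x - xbar) ** 2) / n          # sigma-hat^2, divisor n
unbiased = np.sum((x - xbar) ** 2) / (n - 1)  # s^2, divisor n - 1

# NumPy's ddof selects the divisor n - ddof
assert np.isclose(biased, np.var(x, ddof=0))
assert np.isclose(unbiased, np.var(x, ddof=1))
assert np.isclose(biased, 4.0)   # sum of squared deviations is 32, 32/8 = 4
```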
$$\begin{split}
\mathrm{Var}(X) &= \mathrm{E}\left[ (X - \mathrm{E}[X])^2 \right] \\
&= \mathrm{E}\left[ X^2 - 2\,X\,\mathrm{E}(X) + \mathrm{E}(X)^2 \right] \\
&= \mathrm{E}(X^2) - 2\,\mathrm{E}(X)\,\mathrm{E}(X) + \mathrm{E}(X)^2 \\
&= \mathrm{E}(X^2) - \mathrm{E}(X)^2 \; .
\end{split} \quad (3)$$
■
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-19; URL:
https://en.wikipedia.org/wiki/Variance#Definition.
1.11.4 Non-negativity
Theorem: The variance (→ I/1.11.1) is always non-negative, i.e.
Var(X) ≥ 0 . (1)
1) If X is a discrete random variable (→ I/1.2.2), then, because squares and probabilities are strictly non-negative, all the addends in

$$\mathrm{Var}(X) = \sum_{x \in \mathcal{X}} (x - \mathrm{E}(X))^2 \cdot f_X(x) \quad (3)$$

are non-negative, so that the whole sum must also be non-negative.
2) If X is a continuous random variable (→ I/1.2.2), then, because squares and probability densities are strictly non-negative, the integrand in

$$\mathrm{Var}(X) = \int_{\mathcal{X}} (x - \mathrm{E}(X))^2 \cdot f_X(x) \, dx \quad (4)$$

is always non-negative, so that the integral on the right-hand side, and thus the variance on the left-hand side, must be non-negative.
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-06; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Proof:
1) A constant (→ I/1.2.5) is defined as a quantity that always has the same value. Thus, if understood
as a random variable (→ I/1.2.2), the expected value (→ I/1.10.1) of a constant is equal to itself:
E(a) = a . (3)
Plugged into the formula of the variance (→ I/1.11.1), we have
$$\begin{split}
\mathrm{Var}(a) &= \mathrm{E}\left[ (a - \mathrm{E}(a))^2 \right] \\
&= \mathrm{E}\left[ (a - a)^2 \right] \\
&= \mathrm{E}(0) = 0 \; .
\end{split} \quad (4)$$
$$(X - \mathrm{E}(X))^2 = 0 \; . \quad (7)$$
This, in turn, requires that X is equal to its expected value (→ I/1.10.1)
X = E(X) (8)
which can only be the case, if X always has the same value (→ I/1.2.5):
X = const. (9)
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-27; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Proof: The variance (→ I/1.11.1) is defined in terms of the expected value (→ I/1.10.1) as
$$\mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \; . \quad (2)$$
Using this and the linearity of the expected value (→ I/1.10.5), we can derive (1) as follows:
$$\begin{split}
\mathrm{Var}(X + a) &\overset{(2)}{=} \mathrm{E}\left[ ((X + a) - \mathrm{E}(X + a))^2 \right] \\
&= \mathrm{E}\left[ (X + a - \mathrm{E}(X) - a)^2 \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \\
&\overset{(2)}{=} \mathrm{Var}(X) \; .
\end{split} \quad (3)$$
■
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Proof: The variance (→ I/1.11.1) is defined in terms of the expected value (→ I/1.10.1) as
$$\mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \; . \quad (2)$$
Using this and the linearity of the expected value (→ I/1.10.5), we can derive (1) as follows:
$$\begin{split}
\mathrm{Var}(aX) &\overset{(2)}{=} \mathrm{E}\left[ ((aX) - \mathrm{E}(aX))^2 \right] \\
&= \mathrm{E}\left[ (aX - a\,\mathrm{E}(X))^2 \right] \\
&= \mathrm{E}\left[ (a[X - \mathrm{E}(X)])^2 \right] \\
&= \mathrm{E}\left[ a^2 (X - \mathrm{E}(X))^2 \right] \\
&= a^2 \, \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \\
&\overset{(2)}{=} a^2 \, \mathrm{Var}(X) \; .
\end{split} \quad (3)$$
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Proof: The variance (→ I/1.11.1) is defined in terms of the expected value (→ I/1.10.1) as
$$\mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \; . \quad (2)$$
Using this and the linearity of the expected value (→ I/1.10.5), we can derive (1) as follows:
$$\begin{split}
\mathrm{Var}(X + Y) &\overset{(2)}{=} \mathrm{E}\left[ ((X + Y) - \mathrm{E}(X + Y))^2 \right] \\
&= \mathrm{E}\left[ ([X - \mathrm{E}(X)] + [Y - \mathrm{E}(Y)])^2 \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}(X))^2 + (Y - \mathrm{E}(Y))^2 + 2\,(X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\
&= \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] + \mathrm{E}\left[ (Y - \mathrm{E}(Y))^2 \right] + \mathrm{E}\left[ 2\,(X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\
&\overset{(2)}{=} \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y) \; .
\end{split} \quad (3)$$
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Proof: The variance (→ I/1.11.1) is defined in terms of the expected value (→ I/1.10.1) as
$$\mathrm{Var}(X) = \mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right] \; . \quad (2)$$
Using this and the linearity of the expected value (→ I/1.10.5), we can derive (1) as follows:
$$\begin{split}
\mathrm{Var}(aX + bY) &\overset{(2)}{=} \mathrm{E}\left[ ((aX + bY) - \mathrm{E}(aX + bY))^2 \right] \\
&= \mathrm{E}\left[ (a[X - \mathrm{E}(X)] + b[Y - \mathrm{E}(Y)])^2 \right] \\
&= \mathrm{E}\left[ a^2 (X - \mathrm{E}(X))^2 + b^2 (Y - \mathrm{E}(Y))^2 + 2ab\,(X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\
&= \mathrm{E}\left[ a^2 (X - \mathrm{E}(X))^2 \right] + \mathrm{E}\left[ b^2 (Y - \mathrm{E}(Y))^2 \right] + \mathrm{E}\left[ 2ab\,(X - \mathrm{E}(X))(Y - \mathrm{E}(Y)) \right] \\
&\overset{(2)}{=} a^2 \, \mathrm{Var}(X) + b^2 \, \mathrm{Var}(Y) + 2ab \, \mathrm{Cov}(X, Y) \; .
\end{split} \quad (3)$$
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
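The decomposition Var(aX + bY) = a² Var(X) + b² Var(Y) + 2ab Cov(X, Y) holds exactly under the empirical measure of a finite sample, provided all moments use the same divisor n, which makes for a simple check:

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=1000)
y = 0.7 * x + rng.normal(size=1000)   # deliberately correlated with x
a, b = 2.0, -3.0

lhs = np.var(a * x + b * y)           # Var(aX + bY), divisor n (ddof=0)
cov = np.mean((x - x.mean()) * (y - y.mean()))   # Cov with divisor n
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * cov

assert np.isclose(lhs, rhs)
```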
Proof: The variance of the sum of two random variables (→ I/1.11.8) is given by Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y). When X and Y are independent (→ I/1.3.6), their covariance (→ I/1.13.1) is zero, such that Var(X + Y) = Var(X) + Var(Y).
Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-07; URL:
https://en.wikipedia.org/wiki/Variance#Basic_properties.
Proof: The variance can be decomposed into expected values (→ I/1.11.3) as follows:
Sources:
• Wikipedia (2021): “Law of total variance”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
11-26; URL: https://en.wikipedia.org/wiki/Law_of_total_variance#Proof.
1.11.12 Precision
Definition: The precision of a random variable (→ I/1.2.2) X is defined as the inverse of the variance
(→ I/1.11.1), i.e. one divided by the expected value (→ I/1.10.1) of the squared deviation from its
expected value (→ I/1.10.1):
$$\mathrm{Prec}(X) = \mathrm{Var}(X)^{-1} = \frac{1}{\mathrm{E}\left[ (X - \mathrm{E}(X))^2 \right]} \; . \quad (1)$$
Sources:
• Wikipedia (2020): “Precision (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
04-21; URL: https://en.wikipedia.org/wiki/Precision_(statistics).
1.12 Skewness
1.12.1 Definition
Definition: Let X be a random variable (→ I/1.2.2) with expected value (→ I/1.10.1) µ and standard
deviation (→ I/1.16.1) σ. Then, the skewness of X is defined as the third standardized moment (→
I/1.18.9) of X:
Skew(X) = E[(X − µ)³] / σ³ . (1)
Sources:
• Wikipedia (2023): “Skewness”; in: Wikipedia, the free encyclopedia, retrieved on 2023-04-20; URL:
https://en.wikipedia.org/wiki/Skewness.
Sources:
• Joanes, D. N. and Gill, C. A. (1998): “Comparing measures of sample skewness and kurtosis”; in:
The Statistician, vol. 47, part 1, pp. 183-189; URL: https://www.jstor.org/stable/2988433.
Skew(X) = [E(X³) − 3µσ² − µ³] / σ³ . (1)
Proof:
The skewness (→ I/1.12.1) of X is defined as
Skew(X) = E[(X − µ)³] / σ³ . (2)
Because the expected value is a linear operator (→ I/1.10.5), we can rewrite (2) as
62 CHAPTER I. GENERAL THEOREMS
Skew(X) = E[(X − µ)³] / σ³
        = E[X³ − 3X²µ + 3Xµ² − µ³] / σ³
        = [E(X³) − 3E(X²)µ + 3E(X)µ² − µ³] / σ³     (3)
        = [E(X³) − 3µ(E(X²) − E(X)µ) − µ³] / σ³
        = [E(X³) − 3µ(E(X²) − E(X)²) − µ³] / σ³ .
Because the variance can be written in terms of expected values (→ I/1.11.3) as

σ² = Var(X) = E(X²) − E(X)² , (4)

equation (3) becomes

Skew(X) = [E(X³) − 3µσ² − µ³] / σ³ . (5)
This finishes the proof of (1).
■
Sources:
• Wikipedia (2023): “Skewness”; in: Wikipedia, the free encyclopedia, retrieved on 2023-04-20; URL:
https://en.wikipedia.org/wiki/Skewness.
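As an illustrative sanity check of the two skewness expressions (a numerical sketch, not part of the proof; NumPy and the example PMF are assumptions), both formulas can be evaluated on a discrete distribution and compared:

```python
import numpy as np

# example PMF, chosen to be clearly right-skewed
x = np.array([0.0, 1.0, 2.0, 5.0])
p = np.array([0.4, 0.3, 0.2, 0.1])

mu = p @ x                                   # mean
sigma2 = p @ (x - mu) ** 2                   # variance
sigma = np.sqrt(sigma2)

skew_central = p @ (x - mu) ** 3 / sigma**3                     # E[(X - mu)^3] / sigma^3
skew_raw = (p @ x**3 - 3 * mu * sigma2 - mu**3) / sigma**3      # (E(X^3) - 3*mu*sigma^2 - mu^3) / sigma^3
```

The two values coincide up to floating-point error, as the theorem asserts.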
1.13 Covariance
1.13.1 Definition
Definition: The covariance of two random variables (→ I/1.2.2) X and Y is defined as the expected
value (→ I/1.10.1) of the product of their deviations from their individual expected values (→
I/1.10.1):
Sources:
• Wikipedia (2020): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-06;
URL: https://en.wikipedia.org/wiki/Covariance#Definition.
σ̂_xy = (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) (1)

s_xy = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) (2)
where x̄ and ȳ are the sample means (→ I/1.10.2).
Sources:
• Wikipedia (2021): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-20;
URL: https://en.wikipedia.org/wiki/Covariance#Calculating_the_sample_covariance.
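As a numerical illustration (not part of the Proof Book; NumPy is assumed), the biased estimate (1) and the unbiased estimate (2) differ exactly by the factor n/(n−1), and (2) matches NumPy's default sample covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)    # y correlated with x
n = len(x)

cross = ((x - x.mean()) * (y - y.mean())).sum()
s_biased = cross / n                 # eq. (1), normalization 1/n
s_unbiased = cross / (n - 1)         # eq. (2), normalization 1/(n-1)
```

np.cov uses the unbiased (n − 1) normalization by default; passing bias=True switches to 1/n.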
Sources:
• Wikipedia (2020): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-02;
URL: https://en.wikipedia.org/wiki/Covariance#Definition.
1.13.4 Symmetry
Theorem: The covariance (→ I/1.13.1) of two random variables (→ I/1.2.2) is a symmetric function:

Cov(X, Y) = Cov(Y, X) . (1)
Proof: The covariance (→ I/1.13.1) of random variables (→ I/1.2.2) X and Y is defined as:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] . (2)
Cov(Y, X) = E[(Y − E[Y])(X − E[X])]
          = E[(X − E[X])(Y − E[Y])]     (3)
          = Cov(X, Y) .
■
Sources:
• Wikipedia (2022): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-09-26;
URL: https://en.wikipedia.org/wiki/Covariance#Covariance_of_linear_combinations.
1.13.5 Self-covariance
Theorem: The covariance (→ I/1.13.1) of a random variable (→ I/1.2.2) with itself is equal to the variance (→ I/1.11.1):

Cov(X, X) = Var(X) . (1)
Proof: The covariance (→ I/1.13.1) of random variables (→ I/1.2.2) X and Y is defined as:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])] . (2)
Cov(X, X) = E[(X − E[X])(X − E[X])]
          = E[(X − E[X])²]     (3)
          = Var(X) .
■
Sources:
• Wikipedia (2022): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-09-26;
URL: https://en.wikipedia.org/wiki/Covariance#Covariance_with_itself.
Cov(X, Y) = E(X Y) − E(X) E(Y)
          = E(X) E(Y) − E(X) E(Y)     (4)
          = 0 .
Sources:
• Wikipedia (2020): “Covariance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-09-03;
URL: https://en.wikipedia.org/wiki/Covariance#Uncorrelatedness_and_independence.
Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y) , (2)
which can be rearranged for the covariance (→ I/1.13.1) to give
Proof: The covariance can be decomposed into expected values (→ I/1.13.3) as follows:
Sources:
• Wikipedia (2021): “Law of total covariance”; in: Wikipedia, the free encyclopedia, retrieved on
2021-11-26; URL: https://en.wikipedia.org/wiki/Law_of_total_covariance#Proof.
        | Cov(X_1,X_1) ... Cov(X_1,X_n) |   | E[(X_1−E[X_1])(X_1−E[X_1])] ... E[(X_1−E[X_1])(X_n−E[X_n])] |
Σ_XX =  |     ...      ...     ...      | = |             ...             ...             ...             |   (1)
        | Cov(X_n,X_1) ... Cov(X_n,X_n) |   | E[(X_n−E[X_n])(X_1−E[X_1])] ... E[(X_n−E[X_n])(X_n−E[X_n])] |
Sources:
• Wikipedia (2020): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
06-06; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Definition.
Σ̂ = (1/n) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T (1)

and the unbiased sample covariance matrix of x is given by

S = (1/(n−1)) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T . (2)
Sources:
• Wikipedia (2021): “Sample mean and covariance”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-20; URL: https://en.wikipedia.org/wiki/Sample_mean_and_covariance#Definition_of_
sample_covariance.
Σ_XX = E[(X − E[X])(X − E[X])^T]
     = E[XX^T − X E(X)^T − E(X) X^T + E(X)E(X)^T]
     = E(XX^T) − E(X)E(X)^T − E(X)E(X)^T + E(X)E(X)^T     (4)
     = E(XX^T) − E(X)E(X)^T .
Sources:
• Taboga, Marco (2010): “Covariance matrix”; in: Lectures on probability and statistics, retrieved
on 2020-06-06; URL: https://www.statlect.com/fundamentals-of-probability/covariance-matrix.
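As a numerical sketch of this identity (not part of the proof; NumPy is assumed, and the finite sample below is treated as a uniform distribution over its rows), Σ_XX = E(XX^T) − E(X)E(X)^T agrees with the direct definition E[(X − E[X])(X − E[X])^T]:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))     # 200 equally likely outcomes of a 3-dim random vector

E_X = X.mean(axis=0)                                    # E(X)
E_XXt = (X[:, :, None] * X[:, None, :]).mean(axis=0)    # E(X X^T)
Sigma = E_XXt - np.outer(E_X, E_X)                      # E(XX^T) - E(X)E(X)^T

D = X - E_X                                             # centered outcomes
Sigma_def = (D[:, :, None] * D[:, None, :]).mean(axis=0)  # E[(X-E[X])(X-E[X])^T]
```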
1.13.12 Symmetry
Theorem: Each covariance matrix (→ I/1.13.9) is symmetric:
Σ_XX^T = Σ_XX . (1)
Proof: The covariance matrix (→ I/1.13.9) of X is given by

        | Cov(X_1,X_1) ... Cov(X_1,X_n) |
Σ_XX =  |     ...      ...     ...      |   (2)
        | Cov(X_n,X_1) ... Cov(X_n,X_n) |
A symmetric matrix is a matrix whose transpose is equal to itself. The transpose of ΣXX is
          | Cov(X_1,X_1) ... Cov(X_n,X_1) |
Σ_XX^T =  |     ...      ...     ...      |   (3)
          | Cov(X_1,X_n) ... Cov(X_n,X_n) |
Because the covariance is a symmetric function (→ I/1.13.4), i.e. Cov(X, Y ) = Cov(Y, X), this matrix
is equal to
          | Cov(X_1,X_1) ... Cov(X_1,X_n) |
Σ_XX^T =  |     ...      ...     ...      |   (4)
          | Cov(X_n,X_1) ... Cov(X_n,X_n) |
which is equivalent to our original definition in (2).
■
Sources:
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-26; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
Proof: The covariance matrix (→ I/1.13.9) of X can be expressed (→ I/1.13.11) in terms of expected
values (→ I/1.10.1) as follows
Σ_XX = Σ(X) = E[(X − E[X])(X − E[X])^T] . (2)

A positive semi-definite matrix is a matrix whose eigenvalues are all non-negative or, equivalently, a matrix M for which a^T M a ≥ 0 holds for every vector a. Consider the scalar random variable (→ I/1.2.2)
Y = a^T(X − µ_X) . (6)
where µX = E[X] and note that
a^T(X − µ_X) = (X − µ_X)^T a . (7)
Thus, combining (5) with (6), we have:

a^T Σ_XX a = E[Y²] . (8)
Because Y 2 is a random variable that cannot become negative and the expected value of a strictly
non-negative random variable is also non-negative (→ I/1.10.4), we finally have
a^T Σ_XX a ≥ 0 (9)
for any a ∈ Rn .
Sources:
• hkBattousai (2013): “What is the proof that covariance matrices are always semi-definite?”; in:
StackExchange Mathematics, retrieved on 2022-09-26; URL: https://math.stackexchange.com/a/
327872.
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-26; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
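A numerical illustration of positive semi-definiteness (not part of the proof; NumPy is assumed): the eigenvalues of a sample covariance matrix are non-negative up to rounding, and a^T Σ_XX a equals the sample variance of the projected data a^T x, which cannot be negative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated 4-dim data
Sigma = np.cov(X, rowvar=False)                          # sample covariance matrix

eigvals = np.linalg.eigvalsh(Sigma)   # eigvalsh: eigenvalues of a symmetric matrix
a = rng.normal(size=4)                # an arbitrary direction
quad = a @ Sigma @ a                  # quadratic form a^T Sigma a
```

Here quad coincides with np.var(X @ a, ddof=1), the (non-negative) variance of Y = a^T X, mirroring equation (8).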
Proof: The covariance matrix (→ I/1.13.9) of X can be expressed (→ I/1.13.11) in terms of expected
values (→ I/1.10.1) as follows:
Σ_XX = Σ(X) = E[(X − E[X])(X − E[X])^T] . (2)
Using this and the linearity of the expected value (→ I/1.10.5), we can derive (1) as follows:
Σ(X + a) = E[([X + a] − E[X + a])([X + a] − E[X + a])^T]
         = E[(X + a − E[X] − a)(X + a − E[X] − a)^T]
         = E[(X − E[X])(X − E[X])^T]     (3)
         = Σ(X) .
■
Sources:
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-22; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
Proof: The covariance matrix (→ I/1.13.9) of X can be expressed (→ I/1.13.11) in terms of expected
values (→ I/1.10.1) as follows:
Σ_XX = Σ(X) = E[(X − E[X])(X − E[X])^T] . (2)
Using this and the linearity of the expected value (→ I/1.10.5), we can derive (1) as follows:
Σ(AX) = E[([AX] − E[AX])([AX] − E[AX])^T]
      = E[(A[X − E[X]])(A[X − E[X]])^T]
      = E[A(X − E[X])(X − E[X])^T A^T]     (3)
      = A E[(X − E[X])(X − E[X])^T] A^T
      = A Σ(X) A^T .
Sources:
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-22; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
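As a quick numerical check of Σ(AX) = A Σ(X) A^T (a sketch outside the proof; NumPy is assumed), the identity also holds exactly for sample covariance matrices, because the sample covariance is bilinear in the centered data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))        # rows: equally likely outcomes of a 3-dim X
A = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])     # linear map from R^3 to R^2

Sigma_X = np.cov(X, rowvar=False)            # Sigma(X)
Sigma_AX = np.cov(X @ A.T, rowvar=False)     # Sigma(AX), covariance of Y = AX
```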
        | Cov(X_1,Y_1) ... Cov(X_1,Y_m) |   | E[(X_1−E[X_1])(Y_1−E[Y_1])] ... E[(X_1−E[X_1])(Y_m−E[Y_m])] |
Σ_XY =  |     ...      ...     ...      | = |             ...             ...             ...             |   (1)
        | Cov(X_n,Y_1) ... Cov(X_n,Y_m) |   | E[(X_n−E[X_n])(Y_1−E[Y_1])] ... E[(X_n−E[X_n])(Y_m−E[Y_m])] |
Sources:
• Wikipedia (2022): “Cross-covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on
2022-09-26; URL: https://en.wikipedia.org/wiki/Cross-covariance_matrix#Definition.
Proof: The covariance matrix (→ I/1.13.9) of X can be expressed (→ I/1.13.11) in terms of expected
values (→ I/1.10.1) as follows
Σ_XX = Σ(X) = E[(X − E[X])(X − E[X])^T] (2)
and the cross-covariance matrix (→ I/1.13.16) of X and Y can similarly be written as
Σ_XY = Σ(X, Y) = E[(X − E[X])(Y − E[Y])^T] . (3)
Using this and the linearity of the expected value (→ I/1.10.5) as well as the definitions of covariance
matrix (→ I/1.13.9) and cross-covariance matrix (→ I/1.13.16), we can derive (1) as follows:
Σ(X + Y) = E[([X + Y] − E[X + Y])([X + Y] − E[X + Y])^T]
         = E[([X − E(X)] + [Y − E(Y)])([X − E(X)] + [Y − E(Y)])^T]
         = E[(X − E[X])(X − E[X])^T + (X − E[X])(Y − E[Y])^T + (Y − E[Y])(X − E[X])^T + (Y − E[Y])(Y − E[Y])^T]
         = E[(X − E[X])(X − E[X])^T] + E[(X − E[X])(Y − E[Y])^T] + E[(Y − E[Y])(X − E[X])^T] + E[(Y − E[Y])(Y − E[Y])^T]
         = Σ_XX + Σ_YY + E[(X − E[X])(Y − E[Y])^T] + E[(Y − E[Y])(X − E[X])^T]
         = Σ_XX + Σ_YY + Σ_XY + Σ_YX .     (4)
■
Sources:
• Wikipedia (2022): “Covariance matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-26; URL: https://en.wikipedia.org/wiki/Covariance_matrix#Basic_properties.
        | σ_X1 ...  0   |   | E[(X_1−E[X_1])(X_1−E[X_1])]/(σ_X1 σ_X1) ... E[(X_1−E[X_1])(X_n−E[X_n])]/(σ_X1 σ_Xn) |   | σ_X1 ...  0   |
Σ_XX =  |  ... ... ...  | · |                   ...                   ...                   ...                    | · |  ... ... ...  |
        |  0   ... σ_Xn |   | E[(X_n−E[X_n])(X_1−E[X_1])]/(σ_Xn σ_X1) ... E[(X_n−E[X_n])(X_n−E[X_n])]/(σ_Xn σ_Xn) |   |  0   ... σ_Xn |

        | σ_X1·E[(X_1−E[X_1])(X_1−E[X_1])]/(σ_X1 σ_X1) ... σ_X1·E[(X_1−E[X_1])(X_n−E[X_n])]/(σ_X1 σ_Xn) |   | σ_X1 ...  0   |
      = |                     ...                      ...                     ...                       | · |  ... ... ...  |
        | σ_Xn·E[(X_n−E[X_n])(X_1−E[X_1])]/(σ_Xn σ_X1) ... σ_Xn·E[(X_n−E[X_n])(X_n−E[X_n])]/(σ_Xn σ_Xn) |   |  0   ... σ_Xn |

        | σ_X1·E[(X_1−E[X_1])(X_1−E[X_1])]·σ_X1/(σ_X1 σ_X1) ... σ_X1·E[(X_1−E[X_1])(X_n−E[X_n])]·σ_Xn/(σ_X1 σ_Xn) |
      = |                        ...                         ...                        ...                        |
        | σ_Xn·E[(X_n−E[X_n])(X_1−E[X_1])]·σ_X1/(σ_Xn σ_X1) ... σ_Xn·E[(X_n−E[X_n])(X_n−E[X_n])]·σ_Xn/(σ_Xn σ_Xn) |

        | E[(X_1−E[X_1])(X_1−E[X_1])] ... E[(X_1−E[X_1])(X_n−E[X_n])] |
      = |             ...             ...             ...             |   (4)
        | E[(X_n−E[X_n])(X_1−E[X_1])] ... E[(X_n−E[X_n])(X_n−E[X_n])] |
which is nothing else than the definition of the covariance matrix (→ I/1.13.9).
■
Sources:
• Penny, William (2006): “The correlation matrix”; in: Mathematics for Brain Imaging, ch. 1.4.5,
p. 28, eq. 1.60; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
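As a numerical illustration of Σ_XX = D_X · C_XX · D_X (a sketch outside the proof; NumPy is assumed), a sample covariance matrix can be rebuilt exactly from the correlation matrix and the standard deviations:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))  # correlated 3-dim data

Sigma = np.cov(X, rowvar=False)          # covariance matrix Sigma_XX
C = np.corrcoef(X, rowvar=False)         # correlation matrix C_XX
D = np.diag(np.sqrt(np.diag(Sigma)))     # D_X = diag(sigma_1, ..., sigma_n)
```

D @ C @ D reproduces Sigma up to floating-point error.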
Sources:
• Wikipedia (2020): “Precision (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
06-06; URL: https://en.wikipedia.org/wiki/Precision_(statistics).
Λ_XX = D_X⁻¹ · C_XX⁻¹ · D_X⁻¹ , (1)

where D_X⁻¹ is a diagonal matrix with the inverse standard deviations (→ I/1.16.1) of X_1, ..., X_n as entries on the diagonal:
D_X⁻¹ = diag(1/σ_X1, ..., 1/σ_Xn) = | 1/σ_X1 ...    0    |
                                    |   ...  ...   ...   |   (2)
                                    |   0    ... 1/σ_Xn  |
Proof: The precision matrix (→ I/1.13.19) is defined as the inverse of the covariance matrix (→
I/1.13.9)
Λ_XX = Σ_XX⁻¹ (3)

and the relation between covariance matrix and correlation matrix (→ I/1.13.18) is given by

Σ_XX = D_X · C_XX · D_X . (4)

Combining (3) with (4), we obtain
Λ_XX = Σ_XX⁻¹
     = (D_X · C_XX · D_X)⁻¹
     = D_X⁻¹ · C_XX⁻¹ · D_X⁻¹

       | 1/σ_X1 ...    0    |            | 1/σ_X1 ...    0    |
     = |   ...  ...   ...   | · C_XX⁻¹ · |   ...  ...   ...   |
       |   0    ... 1/σ_Xn  |            |   0    ... 1/σ_Xn  |
1.14 Correlation
1.14.1 Definition
Definition: The correlation of two random variables (→ I/1.2.2) X and Y, also called Pearson product-moment correlation coefficient (PPMCC), is defined as the ratio of the covariance (→ I/1.13.1) of X and Y relative to the product of their standard deviations (→ I/1.16.1):

Corr(X, Y) = Cov(X, Y) / (σ_X σ_Y) . (1)
Sources:
• Wikipedia (2020): “Correlation and dependence”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-06; URL: https://en.wikipedia.org/wiki/Correlation_and_dependence#Pearson's_product-moment_coefficient.
1.14.2 Range
Theorem: Let X and Y be two random variables (→ I/1.2.2). Then, the correlation of X and Y is
between and including −1 and +1:
−1 ≤ Corr(X, Y ) ≤ +1 . (1)
Proof: Consider the variance (→ I/1.11.1) of X plus or minus Y , each divided by their standard
deviations (→ I/1.16.1):
Var(X/σ_X ± Y/σ_Y) . (2)
Because the variance is non-negative (→ I/1.11.4), this term is larger than or equal to zero:
0 ≤ Var(X/σ_X ± Y/σ_Y) . (3)
Using the variance of a linear combination (→ I/1.11.9), it can also be written as:
Var(X/σ_X ± Y/σ_Y) = Var(X/σ_X) + Var(Y/σ_Y) ± 2 Cov(X/σ_X, Y/σ_Y)
                   = (1/σ_X²) Var(X) + (1/σ_Y²) Var(Y) ± 2 · [1/(σ_X σ_Y)] Cov(X, Y)     (4)
                   = σ_X²/σ_X² + σ_Y²/σ_Y² ± 2 · σ_XY/(σ_X σ_Y) .
With Corr(X, Y) = σ_XY/(σ_X σ_Y), inequality (3) therefore becomes

0 ≤ 2 ± 2 Corr(X, Y) (6)
which is equivalent to
−1 ≤ Corr(X, Y ) ≤ +1 . (7)
Sources:
• Dor Leventer (2021): “How can I simply prove that the pearson correlation coefficient is be-
tween -1 and 1?”; in: StackExchange Mathematics, retrieved on 2021-12-14; URL: https://math.
stackexchange.com/a/4260655/480910.
Sources:
• Wikipedia (2021): “Pearson correlation coefficient”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-12-14; URL: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#For_a_sample.
where x̄ and ȳ are the sample means (→ I/1.10.2) and s_x and s_y are the sample standard deviations (→ I/1.16.1).
r_xy = [1/((n−1) s_x s_y)] Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) . (4)
Further simplifying, the result is:
r_xy = (1/(n−1)) Σ_{i=1}^{n} [(x_i − x̄)/s_x] · [(y_i − ȳ)/s_y] . (5)
Sources:
• Wikipedia (2021): “Pearson correlation coefficient”; in: Wikipedia, the free encyclopedia, retrieved on 2021-12-14; URL: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#For_a_sample.
        | Corr(X_1,X_1) ... Corr(X_1,X_n) |   | E[(X_1−E[X_1])(X_1−E[X_1])]/(σ_X1 σ_X1) ... E[(X_1−E[X_1])(X_n−E[X_n])]/(σ_X1 σ_Xn) |
C_XX =  |      ...      ...      ...      | = |                   ...                   ...                   ...                    |   (1)
        | Corr(X_n,X_1) ... Corr(X_n,X_n) |   | E[(X_n−E[X_n])(X_1−E[X_1])]/(σ_Xn σ_X1) ... E[(X_n−E[X_n])(X_n−E[X_n])]/(σ_Xn σ_Xn) |
Sources:
• Wikipedia (2020): “Correlation and dependence”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-06-06; URL: https://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_
matrices.
x̄_(j) = (1/n) Σ_{i=1}^{n} x_ij  and  x̄_(k) = (1/n) Σ_{i=1}^{n} x_ik . (3)
1) Let x = {x1 , . . . , xn } be a sample from a random variable (→ I/1.2.2) X. Then, the median of x
is
median(x) = x_((n+1)/2) ,                  if n is odd
            (1/2)(x_(n/2) + x_(n/2+1)) ,   if n is even ,     (1)
i.e. the median is the “middle” number when all numbers are sorted from smallest to largest.
Sources:
• Wikipedia (2020): “Median”; in: Wikipedia, the free encyclopedia, retrieved on 2020-10-15; URL:
https://en.wikipedia.org/wiki/Median.
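The two cases of definition (1) can be illustrated with NumPy (an illustration outside the text; NumPy is assumed):

```python
import numpy as np

odd = np.array([7, 1, 5])        # sorted: 1, 5, 7 -> middle value is 5
even = np.array([4, 1, 3, 2])    # sorted: 1, 2, 3, 4 -> (2 + 3)/2 = 2.5
```

np.median sorts internally and applies exactly the odd/even rule of (1).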
1.15.2 Mode
Definition: The mode of a sample or random variable is the value which occurs most often or with
largest probability among all its values.
1) Let x = {x1 , . . . , xn } be a sample from a random variable (→ I/1.2.2) X. Then, the mode of x is
the value which occurs most often in the list x1 , . . . , xn .
2) Let X be a random variable (→ I/1.2.2) with probability mass function (→ I/1.6.1) or probability density function (→ I/1.7.1) f_X(x). Then, the mode of X is the value which maximizes the PMF or PDF:

mode(X) = arg max_x f_X(x) .
Sources:
• Wikipedia (2020): “Mode (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2020-10-
15; URL: https://en.wikipedia.org/wiki/Mode_(statistics).
Sources:
• Wikipedia (2020): “Standard deviation”; in: Wikipedia, the free encyclopedia, retrieved on 2020-09-
03; URL: https://en.wikipedia.org/wiki/Standard_deviation#Definition_of_population_values.
FWHM(X) = ∆x = x_2 − x_1 (1)
where x1 and x2 are specified, such that
f_X(x_1) = f_X(x_2) = (1/2) f_X(x_M)  and  x_1 < x_M < x_2 . (2)
Sources:
• Wikipedia (2020): “Full width at half maximum”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-08-19; URL: https://en.wikipedia.org/wiki/Full_width_at_half_maximum.
1) Let x = {x1 , . . . , xn } be a sample from a random variable (→ I/1.2.2) X. Then, the minimum of
x is
2) Let X be a random variable (→ I/1.2.2) with possible values X . Then, the minimum of X is
Sources:
• Wikipedia (2020): “Sample maximum and minimum”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-12; URL: https://en.wikipedia.org/wiki/Sample_maximum_and_minimum.
1.17.2 Maximum
Definition: The maximum of a sample or random variable is its highest observed or possible value.
1) Let x = {x1 , . . . , xn } be a sample from a random variable (→ I/1.2.2) X. Then, the maximum of
x is
2) Let X be a random variable (→ I/1.2.2) with possible values X . Then, the maximum of X is
Sources:
• Wikipedia (2020): “Sample maximum and minimum”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-12; URL: https://en.wikipedia.org/wiki/Sample_maximum_and_minimum.
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-08-19; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Significance_of_the_
moments.
Proof: Using the definition of the moment-generating function (→ I/1.9.5), we can write:
M_X^(n)(t) = (d^n/dt^n) E(e^(tX)) . (2)
Using the power series expansion of the exponential function
e^x = Σ_{n=0}^{∞} x^n/n! , (3)
equation (2) becomes
M_X^(n)(t) = (d^n/dt^n) E( Σ_{m=0}^{∞} t^m X^m / m! ) . (4)
Because the expected value is a linear operator (→ I/1.10.5), we have:
M_X^(n)(t) = (d^n/dt^n) Σ_{m=0}^{∞} (t^m/m!) E(X^m)
           = Σ_{m=0}^{∞} [(d^n/dt^n) (t^m/m!)] E(X^m) .     (5)
Because the n-th derivative of t^m vanishes for m < n and, for m ≥ n, is (d^n/dt^n) t^m = m^(n) t^(m−n) with the falling factorial

m^(n) = Π_{i=0}^{n−1} (m − i) = m!/(m − n)! , (7)
equation (5) becomes
M_X^(n)(t) = Σ_{m=n}^{∞} [m^(n) t^(m−n) / m!] E(X^m)
           = Σ_{m=n}^{∞} [m!/(m−n)!] · [t^(m−n)/m!] E(X^m)
           = Σ_{m=n}^{∞} [t^(m−n)/(m−n)!] E(X^m)
           = [t^(n−n)/(n−n)!] E(X^n) + Σ_{m=n+1}^{∞} [t^(m−n)/(m−n)!] E(X^m)     (8)
           = (t^0/0!) E(X^n) + Σ_{m=n+1}^{∞} [t^(m−n)/(m−n)!] E(X^m)
           = E(X^n) + Σ_{m=n+1}^{∞} [t^(m−n)/(m−n)!] E(X^m) .
Setting t = 0 in (8) yields

M_X^(n)(0) = E(X^n) + Σ_{m=n+1}^{∞} [0^(m−n)/(m−n)!] E(X^m)     (9)
           = E(X^n)

which conforms to equation (1).
■
Sources:
• ProofWiki (2020): “Moment in terms of Moment Generating Function”; in: ProofWiki, retrieved
on 2020-08-19; URL: https://proofwiki.org/wiki/Moment_in_terms_of_Moment_Generating_
Function.
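As a numerical illustration of this theorem (a sketch outside the proof; NumPy and the Bernoulli example are assumptions), the derivatives of the MGF at t = 0 can be approximated by finite differences and compared to the moments. For X ~ Bernoulli(q), every raw moment E(X^n) with n ≥ 1 equals q, and M(t) = 1 − q + q·e^t:

```python
import numpy as np

q = 0.3                                    # X ~ Bernoulli(q): E(X^n) = q for every n >= 1
M = lambda t: 1 - q + q * np.exp(t)        # its moment-generating function E(e^(tX))

h = 1e-4
M1 = (M(h) - M(-h)) / (2 * h)              # central difference for M'(0)  -> E(X)
M2 = (M(h) - 2 * M(0.0) + M(-h)) / h**2    # central difference for M''(0) -> E(X^2)
```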
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-08; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Significance_of_the_
moments.
µ′_1 = µ . (1)

Proof: The first raw moment (→ I/1.18.3) of a random variable (→ I/1.2.2) X is defined as

µ′_1 = E[(X − 0)¹] (2)

which is equal to the expected value (→ I/1.10.1) of X:

µ′_1 = E(X) = µ . (3)
Proof: The second raw moment (→ I/1.18.3) of a random variable (→ I/1.2.2) X is defined as
µ′_2 = E[(X − 0)²] . (2)
Using the partition of variance into expected values (→ I/1.11.3), i.e. E(X²) = Var(X) + E(X)², this becomes

µ′_2 = µ² + σ² .
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-08; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Significance_of_the_
moments.
µ_1 = 0 . (1)
Proof: The first central moment (→ I/1.18.6) of a random variable (→ I/1.2.2) X with mean (→
I/1.10.1) µ is defined as
µ_1 = E[(X − µ)¹] . (2)
Due to the linearity of the expected value (→ I/1.10.5) and by plugging in µ = E(X), we have
µ_1 = E[X − µ]
    = E(X) − µ
    = E(X) − E(X)     (3)
    = 0 .
Sources:
• ProofWiki (2020): “First Central Moment is Zero”; in: ProofWiki, retrieved on 2020-09-09; URL:
https://proofwiki.org/wiki/First_Central_Moment_is_Zero.
µ_2 = Var(X) . (1)
Proof: The second central moment (→ I/1.18.6) of a random variable (→ I/1.2.2) X with mean (→
I/1.10.1) µ is defined as
µ_2 = E[(X − µ)²] (2)

which is equivalent to the definition of the variance (→ I/1.11.1):

µ_2 = E[(X − E(X))²] = Var(X) . (3)
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-08; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Significance_of_the_
moments.
µ*_n = µ_n/σ^n = E[(X − µ)^n] / σ^n . (1)
Sources:
• Wikipedia (2020): “Moment (mathematics)”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-08; URL: https://en.wikipedia.org/wiki/Moment_(mathematics)#Standardized_moments.
2. INFORMATION THEORY 85
2 Information theory
2.1 Shannon entropy
2.1.1 Definition
Definition: Let X be a discrete random variable (→ I/1.2.2) with possible outcomes X and the
(observed or assumed) probability mass function (→ I/1.6.1) p(x) = fX (x). Then, the entropy (also
referred to as “Shannon entropy”) of X is defined as
H(X) = − Σ_{x∈X} p(x) · log_b p(x) (1)
where b is the base of the logarithm specifying in which unit the entropy is determined. By convention,
0 · log 0 is taken to be zero when calculating the entropy of X.
Sources:
• Shannon CE (1948): “A Mathematical Theory of Communication”; in: Bell System Technical
Journal, vol. 27, iss. 3, pp. 379-423; URL: https://ieeexplore.ieee.org/document/6773024; DOI:
10.1002/j.1538-7305.1948.tb01338.x.
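Definition (1), including the 0 · log 0 = 0 convention, can be implemented in a few lines (an illustrative sketch, not part of the text; NumPy is assumed):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a PMF, with the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                            # drop zero-probability outcomes
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

H_coin = entropy([0.5, 0.5])              # fair coin: 1 bit
H_det = entropy([1.0, 0.0])               # deterministic outcome: 0 bits
H_uni8 = entropy(np.full(8, 1/8))         # uniform over 8 outcomes: log2(8) = 3 bits
```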
2.1.2 Non-negativity
Theorem: The entropy of a discrete random variable (→ I/1.2.2) is a non-negative number:
H(X) ≥ 0 . (1)
Because the co-domain of probability mass functions (→ I/1.6.1) is [0, 1], we can deduce:
0 ≤ p(x) ≤ 1
−∞ ≤ log_b p(x) ≤ 0
0 ≤ − log_b p(x) ≤ +∞     (4)
0 ≤ p(x) · (− log_b p(x)) ≤ +∞ .
By convention, 0 · log_b(0) is taken to be 0 when calculating entropy, consistent with the limit lim_{p→0+} p · log_b p = 0.
Taking this together, each addend in (3) is positive or zero and thus, the entire sum must also be
non-negative.
Sources:
• Cover TM, Thomas JA (1991): “Elements of Information Theory”, p. 15; URL: https://www.
wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959.
2.1.3 Concavity
Theorem: The entropy (→ I/2.1.1) is concave in the probability mass function (→ I/1.6.1) p, i.e.

H[λ p_1 + (1−λ) p_2] ≥ λ H[p_1] + (1−λ) H[p_2] (1)

where p_1 and p_2 are any two probability mass functions (→ I/1.6.1) and 0 ≤ λ ≤ 1.
Proof: Let X be a discrete random variable (→ I/1.2.2) with possible outcomes X and let u(x) be
the probability mass function (→ I/1.6.1) of a discrete uniform distribution (→ II/1.1.1) on X ∈ X .
Then, the entropy (→ I/2.1.1) of an arbitrary probability mass function (→ I/1.6.1) p(x) can be
rewritten as
H[p] = − Σ_{x∈X} p(x) · log p(x)
     = − Σ_{x∈X} p(x) · log( [p(x)/u(x)] · u(x) )
     = − Σ_{x∈X} p(x) · log( p(x)/u(x) ) − Σ_{x∈X} p(x) · log u(x)     (2)
     = −KL[p||u] − log(1/|X|) · Σ_{x∈X} p(x)
     = log |X| − KL[p||u] ,

such that

log |X| − H[p] = KL[p||u] ,
where we have applied the definition of the Kullback-Leibler divergence (→ I/2.5.1), the probability
mass function of the discrete uniform distribution (→ II/1.1.2) and the total sum over the probability
mass function (→ I/1.6.1).
Note that the KL divergence is convex (→ I/2.5.5) in the pair of probability distributions (→ I/1.5.1) (p, q):

KL[λ p_1 + (1−λ) p_2 || λ q_1 + (1−λ) q_2] ≤ λ KL[p_1||q_1] + (1−λ) KL[p_2||q_2] .
Sources:
• Wikipedia (2020): “Entropy (information theory)”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-08-11; URL: https://en.wikipedia.org/wiki/Entropy_(information_theory)#Further_properties.
• Cover TM, Thomas JA (1991): “Elements of Information Theory”, p. 30; URL: https://www.
wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+Edition-p-9780471241959.
• Xie, Yao (2012): “Chain Rules and Inequalities”; in: ECE587: Information Theory, Lecture 3,
Slide 25; URL: https://www2.isye.gatech.edu/~yxie77/ece587/Lecture3.pdf.
• Goh, Siong Thye (2016): “Understanding the proof of the concavity of entropy”; in: StackEx-
change Mathematics, retrieved on 2020-11-08; URL: https://math.stackexchange.com/questions/
2000194/understanding-the-proof-of-the-concavity-of-entropy.
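The concavity of entropy can be spot-checked numerically (a sketch outside the proof; NumPy and the example PMFs are assumptions):

```python
import numpy as np

def H(p):
    return -np.sum(p * np.log(p))     # natural-log entropy; strictly positive p assumed

p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.1, 0.8])
lam = 0.3

mix = H(lam * p1 + (1 - lam) * p2)                 # entropy of the mixture PMF
bound = lam * H(p1) + (1 - lam) * H(p2)            # mixture of the entropies
```

Concavity asserts mix ≥ bound for every valid choice of p1, p2 and λ ∈ [0, 1].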
Sources:
• Cover TM, Thomas JA (1991): “Joint Entropy and Conditional Entropy”; in: Elements of Infor-
mation Theory, ch. 2.2, p. 15; URL: https://www.wiley.com/en-us/Elements+of+Information+
Theory%2C+2nd+Edition-p-9780471241959.
where b is the base of the logarithm specifying in which unit the entropy is determined.
Sources:
• Cover TM, Thomas JA (1991): “Joint Entropy and Conditional Entropy”; in: Elements of Infor-
mation Theory, ch. 2.2, p. 16; URL: https://www.wiley.com/en-us/Elements+of+Information+
Theory%2C+2nd+Edition-p-9780471241959.
2.1.6 Cross-entropy
Definition: Let X be a discrete random variable (→ I/1.2.2) with possible outcomes X and let P
and Q be two probability distributions (→ I/1.5.1) on X with the probability mass functions (→
I/1.6.1) p(x) and q(x). Then, the cross-entropy of Q relative to P is defined as
H(P, Q) = − Σ_{x∈X} p(x) · log_b q(x) (1)
where b is the base of the logarithm specifying in which unit the cross-entropy is determined.
Sources:
• Wikipedia (2020): “Cross entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-28;
URL: https://en.wikipedia.org/wiki/Cross_entropy#Definition.
Proof: The relationship between Kullback-Leibler divergence, entropy and cross-entropy (→ I/2.5.8) is:

KL[P||Q] = H(P, Q) − H(P) .
Sources:
• Wikipedia (2020): “Cross entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-08-11;
URL: https://en.wikipedia.org/wiki/Cross_entropy#Definition.
• gunes (2019): “Convexity of cross entropy”; in: StackExchange CrossValidated, retrieved on 2020-
11-08; URL: https://stats.stackexchange.com/questions/394463/convexity-of-cross-entropy.
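The relationship between cross-entropy, entropy and KL divergence can be verified numerically (an illustrative sketch, not part of the text; NumPy and the example PMFs are assumptions):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])             # PMF of P
q = np.array([0.2, 0.4, 0.4])             # PMF of Q

H_p = -np.sum(p * np.log(p))              # entropy H(P)
H_pq = -np.sum(p * np.log(q))             # cross-entropy H(P, Q)
KL_pq = np.sum(p * np.log(p / q))         # Kullback-Leibler divergence KL(P||Q)
```

The decomposition H(P, Q) = H(P) + KL(P||Q) holds exactly.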
Proof: Without loss of generality, we will use the natural logarithm, because a change in the base
of the logarithm only implies multiplication by a constant:
log_b a = ln a / ln b . (2)
Let I be the set of all x for which p(x) is non-zero. Then, proving (1) requires to show that
Σ_{x∈I} p(x) · ln( p(x)/q(x) ) ≥ 0 . (3)
For all x > 0, it holds that ln x ≤ x − 1, with equality only if x = 1. Multiplying this with −1, we have ln(1/x) ≥ 1 − x. Applying this to (3), we can say about the left-hand side that
Σ_{x∈I} p(x) · ln( p(x)/q(x) ) ≥ Σ_{x∈I} p(x) · (1 − q(x)/p(x))
                                = Σ_{x∈I} p(x) − Σ_{x∈I} q(x) .     (4)
Finally, since p(x) and q(x) are probability mass functions (→ I/1.6.1), we have
0 ≤ p(x) ≤ 1,  Σ_{x∈I} p(x) = 1  and
0 ≤ q(x) ≤ 1,  Σ_{x∈I} q(x) ≤ 1 ,     (5)
such that

Σ_{x∈I} p(x) · ln( p(x)/q(x) ) ≥ Σ_{x∈I} p(x) − Σ_{x∈I} q(x)
                                = 1 − Σ_{x∈I} q(x) ≥ 0 .     (6)
Sources:
• Wikipedia (2020): “Gibbs’ inequality”; in: Wikipedia, the free encyclopedia, retrieved on 2020-09-
09; URL: https://en.wikipedia.org/wiki/Gibbs%27_inequality#Proof.
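Gibbs' inequality, in the equivalent form H(P) ≤ H(P, Q), can be stress-tested on random PMFs (a numerical sketch outside the proof; NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
p = rng.dirichlet(np.ones(5), size=100)   # 100 random PMFs on 5 outcomes
q = rng.dirichlet(np.ones(5), size=100)

H_p = -np.sum(p * np.log(p), axis=1)      # entropies H(P)
H_pq = -np.sum(p * np.log(q), axis=1)     # cross-entropies H(P, Q)
gap = H_pq - H_p                          # equals KL(P || Q), which is >= 0
```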
Proof: Without loss of generality, we will use the natural logarithm, because a change in the base
of the logarithm only implies multiplication by a constant:
log_c a = ln a / ln c . (2)
Let f (x) = x ln x. Then, the left-hand side of (1) can be rewritten as
Σ_{i=1}^{n} a_i ln(a_i/b_i) = Σ_{i=1}^{n} b_i f(a_i/b_i)
                            = b Σ_{i=1}^{n} (b_i/b) f(a_i/b_i) .     (3)
Because f is a convex function and the coefficients satisfy

b_i/b ≥ 0  and  Σ_{i=1}^{n} b_i/b = 1 ,     (4)
b Σ_{i=1}^{n} (b_i/b) f(a_i/b_i) ≥ b f( Σ_{i=1}^{n} (b_i/b)(a_i/b_i) )
                                 = b f( (1/b) Σ_{i=1}^{n} a_i )     (5)
                                 = b f(a/b)
                                 = a ln(a/b) .
Finally, combining (3) and (5), this demonstrates (1).
Sources:
• Wikipedia (2020): “Log sum inequality”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
09-09; URL: https://en.wikipedia.org/wiki/Log_sum_inequality#Proof.
• Wikipedia (2020): “Jensen’s inequality”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
09-09; URL: https://en.wikipedia.org/wiki/Jensen%27s_inequality#Statements.
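The log sum inequality can be checked on random positive numbers (an illustration outside the proof; NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.uniform(0.1, 2.0, size=10)        # positive numbers a_1, ..., a_n
b = rng.uniform(0.1, 2.0, size=10)        # positive numbers b_1, ..., b_n

lhs = np.sum(a * np.log(a / b))           # sum of a_i * ln(a_i / b_i)
rhs = a.sum() * np.log(a.sum() / b.sum()) # a * ln(a / b) with a = sum(a_i), b = sum(b_i)
```

The inequality lhs ≥ rhs holds for any positive a_i, b_i.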
Sources:
• Cover TM, Thomas JA (1991): “Differential Entropy”; in: Elements of Information Theory, ch.
8.1, p. 243; URL: https://www.wiley.com/en-us/Elements+of+Information+Theory%2C+2nd+
Edition-p-9780471241959.
2.2.2 Negativity
Theorem: Unlike its discrete analogue (→ I/2.1.2), the differential entropy (→ I/2.2.1) can become
negative.
Sources:
• Wikipedia (2020): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-02; URL: https://en.wikipedia.org/wiki/Differential_entropy#Definition.
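A concrete example of negative differential entropy (a numerical sketch outside the text; NumPy is assumed): for X ~ Uniform(0, a), the differential entropy is h(X) = log a, which is negative whenever a < 1:

```python
import numpy as np

a = 0.5                                   # X ~ Uniform(0, a) with a < 1
x = np.linspace(0.0, a, 100001)
f = np.full_like(x, 1.0 / a)              # constant density f(x) = 1/a on [0, a]
dx = x[1] - x[0]
h = -np.sum(f * np.log(f)) * dx           # Riemann sum for -∫ f log f dx = log(a)
```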
Y = g(X) = X + c ⇔ X = g⁻¹(Y) = Y − c . (3)
Note that g(X) is a strictly increasing function, such that the probability density function (→ I/1.7.3)
of Y is
f_Y(y) = f_X(g⁻¹(y)) · dg⁻¹(y)/dy = f_X(y − c) . (4)
Writing down the differential entropy for Y , we have:
h(Y) = − ∫_Y f_Y(y) log f_Y(y) dy
     = − ∫_Y f_X(y − c) log f_X(y − c) dy     (5)
Sources:
• Wikipedia (2020): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-12; URL: https://en.wikipedia.org/wiki/Differential_entropy#Properties_of_differential_entropy.
If a > 0, then g(X) is a strictly increasing function, such that the probability density function (→ I/1.7.3) of Y is

f_Y(y) = f_X(g⁻¹(y)) · dg⁻¹(y)/dy = (1/a) f_X(y/a) ; (4)

if a < 0, then g(X) is a strictly decreasing function, such that the probability density function (→ I/1.7.4) of Y is

f_Y(y) = −f_X(g⁻¹(y)) · dg⁻¹(y)/dy = −(1/a) f_X(y/a) ; (5)

thus, we can write

f_Y(y) = (1/|a|) f_X(y/a) . (6)
Writing down the differential entropy for Y , we have:
Z
h(Y ) = − fY (y) log fY (y) dy
Y
Z y y (7)
(6) 1 1
=− fX log fX dy
Y |a| a |a| a
Substituting x = y/a, such that y = ax, this yields:

h(Y) = − ∫_{{y/a | y∈Y}} (1/|a|) f_X(ax/a) · log[ (1/|a|) f_X(ax/a) ] d(ax)
     = − ∫_X f_X(x) · log[ (1/|a|) f_X(x) ] dx
     = − ∫_X f_X(x) [log f_X(x) − log |a|] dx     (8)
     = − ∫_X f_X(x) log f_X(x) dx + log |a| ∫_X f_X(x) dx
     = h(X) + log |a| .
Sources:
• Wikipedia (2020): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-12; URL: https://en.wikipedia.org/wiki/Differential_entropy#Properties_of_differential_entropy.
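The scaling property h(aX) = h(X) + log |a| can be illustrated analytically with uniform distributions (a sketch outside the proof; NumPy is assumed): if X ~ Uniform(0, 1), then aX ~ Uniform(0, a), and the differential entropy of Uniform(0, w) is log w:

```python
import numpy as np

def h_uniform(width):
    """Differential entropy of Uniform(0, width), computed from its constant density."""
    f = 1.0 / width
    return -f * np.log(f) * width     # -∫ f log f dx over [0, width] = log(width)

a = 3.0   # scaling factor: h(aX) should equal h(X) + log|a| = 0 + log(3)
```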
where fX (x) is the probability density function (→ I/1.7.1) of X and X is the set of possible values
of X.
The probability density function of a linear function of a continuous random vector (→ I/1.7.6) Y = g(X) = ΣX + µ is

f_Y(y) = (1/|Σ|) f_X(Σ⁻¹(y − µ)) , if y ∈ Y
         0 ,                       if y ∉ Y     (3)

where Y = {y = Σx + µ : x ∈ X} is the set of possible outcomes of Y.
Therefore, with Y = g(X) = AX, i.e. Σ = A and µ = 0_n, the probability density function (→ I/1.7.1) of Y is given by

f_Y(y) = (1/|A|) f_X(A⁻¹y) , if y ∈ Y
         0 ,                 if y ∉ Y     (4)

where Y = {y = Ax : x ∈ X}.
Thus, the differential entropy (→ I/2.2.1) of Y is

h(Y) = − ∫_Y f_Y(y) log f_Y(y) dy
     = − ∫_Y (1/|A|) f_X(A⁻¹y) · log[ (1/|A|) f_X(A⁻¹y) ] dy .     (5)
Substituting y = Ax into the integral, this yields:

h(Y) = − ∫_X (1/|A|) f_X(A⁻¹Ax) · log[ (1/|A|) f_X(A⁻¹Ax) ] d(Ax)
     = − (1/|A|) ∫_X f_X(x) · log[ (1/|A|) f_X(x) ] d(Ax) .     (6)
Using the relation d(Ax) = |A| dx, this becomes:

h(Y) = − (|A|/|A|) ∫_X f_X(x) · log[ (1/|A|) f_X(x) ] dx
     = − ∫_X f_X(x) log f_X(x) dx − ∫_X f_X(x) · log(1/|A|) dx .     (7)
Finally, employing the fact (→ I/1.7.1) that ∫_X f_X(x) dx = 1, we can derive the differential entropy (→ I/2.2.1) of Y as

h(Y) = − ∫_X f_X(x) log f_X(x) dx + log |A| ∫_X f_X(x) dx
     = h(X) + log |A| .     (8)
■
Sources:
• Cover, Thomas M. & Thomas, Joy A. (1991): “Properties of Differential Entropy, Relative Entropy,
and Mutual Information”; in: Elements of Information Theory, sect. 8.6, p. 253; URL: https:
//www.google.de/books/edition/Elements_of_Information_Theory/j0DBDwAAQBAJ.
• Wikipedia (2021): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
10-07; URL: https://en.wikipedia.org/wiki/Differential_entropy#Properties_of_differential_entropy.
where Y = {y = g(x) : x ∈ X} is the set of possible outcomes of Y and J_g⁻¹(y) is the Jacobian matrix of g⁻¹(y)

J_g⁻¹(y) = | dx_1/dy_1 ... dx_1/dy_n |
           |    ...    ...    ...    |   (5)
           | dx_n/dy_1 ... dx_n/dy_n |
h(Y) \overset{(3)}{=} - \int_{\mathcal{Y}} f_Y(y) \log f_Y(y) \, dy
\overset{(4)}{=} - \int_{\mathcal{Y}} f_X(g^{-1}(y)) \, |J_{g^{-1}}(y)| \log \left[ f_X(g^{-1}(y)) \, |J_{g^{-1}}(y)| \right] dy \; . \quad (6)
Substituting y = g(x) into the integral and applying J_{g^{-1}}(y) = J_g^{-1}(x), we obtain:

h(Y) = - \int_{\mathcal{X}} f_X(g^{-1}(g(x))) \, |J_{g^{-1}}(y)| \log \left[ f_X(g^{-1}(g(x))) \, |J_{g^{-1}}(y)| \right] d[g(x)]
= - \int_{\mathcal{X}} f_X(x) \, |J_g^{-1}(x)| \log \left[ f_X(x) \, |J_g^{-1}(x)| \right] d[g(x)] \; . \quad (7)
Using the relations y = g(x) \Rightarrow dy = |J_g(x)| \, dx and |A| \cdot |B| = |AB|, this becomes:

h(Y) = - \int_{\mathcal{X}} f_X(x) \, |J_g^{-1}(x)| \, |J_g(x)| \log \left[ f_X(x) \, |J_g^{-1}(x)| \right] dx
= - \int_{\mathcal{X}} f_X(x) \log f_X(x) \, dx - \int_{\mathcal{X}} f_X(x) \log |J_g^{-1}(x)| \, dx \; . \quad (8)
Finally, employing the fact (→ I/1.7.1) that \int_{\mathcal{X}} f_X(x) \, dx = 1 and the determinant property |A^{-1}| = 1/|A|, we can derive the differential entropy (→ I/2.2.1) of Y as

h(Y) = - \int_{\mathcal{X}} f_X(x) \log f_X(x) \, dx - \int_{\mathcal{X}} f_X(x) \log \frac{1}{|J_g(x)|} \, dx
\overset{(3)}{=} h(X) + \int_{\mathcal{X}} f_X(x) \log |J_g(x)| \, dx \; . \quad (9)
Because there exist X and Y for which the integral term in (9) is non-zero, this also demonstrates that there exist X and Y for which (1) is fulfilled.
■
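The correction term in (9) can be checked numerically in one dimension. As an illustrative assumption (not part of the proof), take X ~ U(0, 1), so that h(X) = 0, and g(x) = exp(x) with |J_g(x)| = exp(x); the theorem then predicts h(Y) = E[X] = 1/2, which matches a direct quadrature of h(Y) for the density f_Y(y) = 1/y on [1, e]:

```python
import numpy as np

def trapezoid(f_vals, x_vals):
    """Composite trapezoidal rule for a sampled integrand."""
    dx = np.diff(x_vals)
    return float(np.sum(dx * (f_vals[:-1] + f_vals[1:]) / 2))

# Correction term: integral of f_X(x) * log|g'(x)| = integral of x over [0, 1]
x = np.linspace(0.0, 1.0, 100001)
correction = trapezoid(x, x)                      # = E[X] = 0.5

# Direct computation: Y = exp(X) has density f_Y(y) = 1/y on [1, e]
y = np.linspace(1.0, np.e, 100001)
f_y = 1.0 / y
h_Y = -trapezoid(f_y * np.log(f_y), y)

assert abs(correction - 0.5) < 1e-6
assert abs(h_Y - 0.5) < 1e-6                      # h(Y) = h(X) + correction
```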
Sources:
• Wikipedia (2021): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
10-07; URL: https://en.wikipedia.org/wiki/Differential_entropy#Properties_of_differential_entropy.
• Bernhard (2016): “proof of upper bound on differential entropy of f(X)”; in: StackExchange Math-
ematics, retrieved on 2021-10-07; URL: https://math.stackexchange.com/a/1759531.
• peek-a-boo (2019): “How to come up with the Jacobian in the change of variables formula”; in:
StackExchange Mathematics, retrieved on 2021-08-30; URL: https://math.stackexchange.com/a/
3239222.
• Wikipedia (2021): “Jacobian matrix and determinant”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-10-07; URL: https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant#
Inverse.
• Wikipedia (2021): “Inverse function theorem”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-07; URL: https://en.wikipedia.org/wiki/Inverse_function_theorem#Statement.
• Wikipedia (2021): “Determinant”; in: Wikipedia, the free encyclopedia, retrieved on 2021-10-07;
URL: https://en.wikipedia.org/wiki/Determinant#Properties_of_the_determinant.
where b is the base of the logarithm specifying in which unit the differential entropy is determined.
Sources:
• Wikipedia (2020): “Cross entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-28;
URL: https://en.wikipedia.org/wiki/Cross_entropy#Definition.
where p(x) and p(y) are the probability mass functions (→ I/1.6.1) of X and Y and p(x, y) is the
joint probability (→ I/1.3.2) mass function of X and Y .
2) The mutual information of two continuous random variables (→ I/1.2.2) X and Y is defined as
I(X, Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \cdot \log \frac{p(x, y)}{p(x) \cdot p(y)} \, dy \, dx \quad (2)
where p(x) and p(y) are the probability density functions (→ I/1.7.1) of X and Y and p(x, y) is the joint probability (→ I/1.3.2) density function of X and Y .
Sources:
• Cover TM, Thomas JA (1991): “Relative Entropy and Mutual Information”; in: Elements of
Information Theory, ch. 2.3/8.5, p. 20/251; URL: https://www.wiley.com/en-us/Elements+of+
Information+Theory%2C+2nd+Edition-p-9780471241959.
where H(X) and H(Y ) are the marginal entropies (→ I/2.1.1) of X and Y and H(X|Y ) and H(Y |X)
are the conditional entropies (→ I/2.1.4).
Applying the law of conditional probability (→ I/1.3.4), i.e. p(x, y) = p(x|y) p(y), we get:
I(X, Y) = \sum_x \sum_y p(x|y) \, p(y) \log p(x|y) - \sum_x \sum_y p(x, y) \log p(x) \; . \quad (4)
Now considering the definitions of marginal (→ I/2.1.1) and conditional (→ I/2.1.4) entropy
H(X) = - \sum_{x \in \mathcal{X}} p(x) \log p(x)
H(X|Y) = \sum_{y \in \mathcal{Y}} p(y) \, H(X|Y = y) \; , \quad (7)
The conditioning of X on Y in this proof is without loss of generality. Thus, the proof for the
expression using the reverse conditional entropy of Y given X is obtained by simply switching x and
y in the derivation.
■
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-13; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
I(X, Y) = \sum_x \sum_y p(x, y) \log p(x, y) - \sum_x \sum_y p(x, y) \log p(x) - \sum_x \sum_y p(x, y) \log p(y) \; . \quad (3)

Regrouping the variables, this reads:

I(X, Y) = \sum_x \sum_y p(x, y) \log p(x, y) - \sum_x \left( \sum_y p(x, y) \right) \log p(x) - \sum_y \left( \sum_x p(x, y) \right) \log p(y) \; . \quad (4)

Applying the law of marginal probability (→ I/1.3.3), i.e. p(x) = \sum_y p(x, y), we get:

I(X, Y) = \sum_x \sum_y p(x, y) \log p(x, y) - \sum_x p(x) \log p(x) - \sum_y p(y) \log p(y) \; . \quad (5)
Now considering the definitions of marginal (→ I/2.1.1) and joint (→ I/2.1.5) entropy
H(X) = - \sum_{x \in \mathcal{X}} p(x) \log p(x)
H(X, Y) = - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y) \; , \quad (6)
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-13; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
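The entropy decomposition of the mutual information can be confirmed numerically for a small joint probability mass function. The 2×2 table below is a made-up example, not taken from the text:

```python
import numpy as np

p_xy = np.array([[0.30, 0.20],       # hypothetical joint pmf p(x, y)
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1)               # marginal p(x), law of marginal probability
p_y = p_xy.sum(axis=0)               # marginal p(y)

H_x  = -np.sum(p_x  * np.log(p_x))
H_y  = -np.sum(p_y  * np.log(p_y))
H_xy = -np.sum(p_xy * np.log(p_xy))

# Mutual information computed directly from its definition
I = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

assert np.isclose(I, H_x + H_y - H_xy)   # I(X,Y) = H(X) + H(Y) - H(X,Y)
assert I >= 0.0
```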
Proof: The existence of the joint probability mass function (→ I/1.6.1) ensures that the mutual
information (→ I/2.4.1) is defined:
I(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)} \; . \quad (2)
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-13; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
Sources:
• Cover TM, Thomas JA (1991): “Relative Entropy and Mutual Information”; in: Elements of
Information Theory, ch. 2.3/8.5, p. 20/251; URL: https://www.wiley.com/en-us/Elements+of+
Information+Theory%2C+2nd+Edition-p-9780471241959.
where h(X) and h(Y ) are the marginal differential entropies (→ I/2.2.1) of X and Y and h(X|Y )
and h(Y |X) are the conditional differential entropies (→ I/2.2.7).
Applying the law of conditional probability (→ I/1.3.4), i.e. p(x, y) = p(x|y) p(y), we get:
I(X, Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x|y) \, p(y) \log p(x|y) \, dy \, dx - \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log p(x) \, dy \, dx \; . \quad (4)
Now considering the definitions of marginal (→ I/2.2.1) and conditional (→ I/2.2.7) differential
entropy
h(X) = - \int_{\mathcal{X}} p(x) \log p(x) \, dx
h(X|Y) = \int_{\mathcal{Y}} p(y) \, h(X|Y = y) \, dy \; , \quad (7)
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-21; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
I(X, Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log p(x, y) \, dy \, dx - \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log p(x) \, dy \, dx - \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log p(y) \, dy \, dx \; . \quad (3)
Regrouping the variables, this reads:
I(X, Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log p(x, y) \, dy \, dx - \int_{\mathcal{X}} \left( \int_{\mathcal{Y}} p(x, y) \, dy \right) \log p(x) \, dx - \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} p(x, y) \, dx \right) \log p(y) \, dy \; . \quad (4)

Applying the law of marginal probability (→ I/1.3.3), i.e. p(x) = \int_{\mathcal{Y}} p(x, y) \, dy, we get:

I(X, Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log p(x, y) \, dy \, dx - \int_{\mathcal{X}} p(x) \log p(x) \, dx - \int_{\mathcal{Y}} p(y) \log p(y) \, dy \; . \quad (5)
Now considering the definitions of marginal (→ I/2.2.1) and joint (→ I/2.2.8) differential entropy
h(X) = - \int_{\mathcal{X}} p(x) \log p(x) \, dx
h(X, Y) = - \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log p(x, y) \, dy \, dx \; , \quad (6)
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-21; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
Proof: The existence of the joint probability density function (→ I/1.7.1) ensures that the mutual
information (→ I/2.4.1) is defined:
I(X, Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)} \, dy \, dx \; . \quad (2)
The relation of mutual information to conditional differential entropy (→ I/2.4.2) is:
Sources:
• Wikipedia (2020): “Mutual information”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-21; URL: https://en.wikipedia.org/wiki/Mutual_information#Relation_to_conditional_and_
joint_entropy.
where p(x) and q(x) are the probability mass functions (→ I/1.6.1) of P and Q.
2) The Kullback-Leibler divergence of P from Q for a continuous random variable X is defined as
KL[P \,||\, Q] = \int_{\mathcal{X}} p(x) \cdot \log \frac{p(x)}{q(x)} \, dx \quad (2)
where p(x) and q(x) are the probability density functions (→ I/1.7.1) of P and Q.
By convention (→ I/2.1.1), 0 · log 0 is taken to be zero when calculating the divergence between P
and Q.
Sources:
• MacKay, David J.C. (2003): “Probability, Entropy, and Inference”; in: Information Theory, In-
ference, and Learning Algorithms, ch. 2.6, eq. 2.45, p. 34; URL: https://www.inference.org.uk/
itprnn/book.pdf.
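For the discrete case, a direct implementation with the 0 · log 0 convention might look as follows; the two distributions are placeholder examples:

```python
import numpy as np

def kl_divergence(p, q):
    """KL[P||Q] = sum_x p(x) * log(p(x) / q(x)), with 0 * log(0) taken as zero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # implements the 0 * log(0) = 0 convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.25, 0.50, 0.25, 0.00]          # hypothetical pmf with a zero entry
q = [0.25, 0.25, 0.25, 0.25]          # uniform reference pmf

assert kl_divergence(p, p) == 0.0     # a distribution diverges from itself by 0
assert kl_divergence(p, q) > 0.0      # non-negativity for p != q
```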
2.5.2 Non-negativity
Theorem: The Kullback-Leibler divergence (→ I/2.5.1) is always non-negative:

KL[P \,||\, Q] \ge 0 \; . \quad (1)
Gibbs' inequality (→ I/2.1.8) states that the entropy (→ I/2.1.1) of a probability distribution is always less than or equal to the cross-entropy (→ I/2.1.6) with another probability distribution, with equality only if the distributions are identical:

- \sum_{i=1}^{n} p(x_i) \log p(x_i) \le - \sum_{i=1}^{n} p(x_i) \log q(x_i) \; , \quad (4)

which can be rearranged to

\sum_{i=1}^{n} p(x_i) \log p(x_i) - \sum_{i=1}^{n} p(x_i) \log q(x_i) \ge 0 \; . \quad (5)
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-31; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Properties.
2.5.3 Non-negativity
Theorem: The Kullback-Leibler divergence (→ I/2.5.1) is always non-negative:

KL[P \,||\, Q] \ge 0 \; . \quad (1)
p(x) \ge 0, \quad \sum_{x \in \mathcal{X}} p(x) = 1 \quad \text{and} \quad q(x) \ge 0, \quad \sum_{x \in \mathcal{X}} q(x) = 1 \; , \quad (4)
Sources:
• Wikipedia (2020): “Log sum inequality”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
09-09; URL: https://en.wikipedia.org/wiki/Log_sum_inequality#Applications.
2.5.4 Non-symmetry
Theorem: The Kullback-Leibler divergence (→ I/2.5.1) is non-symmetric, i.e.

KL[P \,||\, Q] \ne KL[Q \,||\, P] \quad (1)

for some probability distributions P and Q.
Proof: Let X ∈ X = {0, 1, 2} be a discrete random variable (→ I/1.2.2) and consider the two
probability distributions (→ I/1.5.1)
P : X ∼ Bin(2, 0.5)
(2)
Q : X ∼ U (0, 2)
where Bin(n, p) indicates a binomial distribution (→ II/1.3.1) and U(a, b) indicates a discrete uniform
distribution (→ II/1.1.1).
Then, the probability mass function of the binomial distribution (→ II/1.3.2) entails that
p(x) = \begin{cases} 1/4 \; , & \text{if } x = 0 \\ 1/2 \; , & \text{if } x = 1 \\ 1/4 \; , & \text{if } x = 2 \end{cases} \quad (3)
and the probability mass function of the discrete uniform distribution (→ II/1.1.2) entails that
q(x) = \frac{1}{3} \; , \quad (4)
such that the Kullback-Leibler divergence (→ I/2.5.1) of P from Q is
KL[P \,||\, Q] = \sum_{x \in \mathcal{X}} p(x) \cdot \log \frac{p(x)}{q(x)}
= \frac{1}{4} \log \frac{3}{4} + \frac{1}{2} \log \frac{3}{2} + \frac{1}{4} \log \frac{3}{4}
= \frac{1}{2} \log \frac{3}{4} + \frac{1}{2} \log \frac{3}{2}
= \frac{1}{2} \left( \log \frac{3}{4} + \log \frac{3}{2} \right)
= \frac{1}{2} \log \left( \frac{3}{4} \cdot \frac{3}{2} \right)
= \frac{1}{2} \log \frac{9}{8} = 0.0589 \quad (5)
and the Kullback-Leibler divergence (→ I/2.5.1) of Q from P is
KL[Q \,||\, P] = \sum_{x \in \mathcal{X}} q(x) \cdot \log \frac{q(x)}{p(x)}
= \frac{1}{3} \log \frac{4}{3} + \frac{1}{3} \log \frac{2}{3} + \frac{1}{3} \log \frac{4}{3}
= \frac{1}{3} \left( \log \frac{4}{3} + \log \frac{2}{3} + \log \frac{4}{3} \right)
= \frac{1}{3} \log \left( \frac{4}{3} \cdot \frac{2}{3} \cdot \frac{4}{3} \right)
= \frac{1}{3} \log \frac{32}{27} = 0.0566 \quad (6)
which provides an example for KL[P \,||\, Q] \ne KL[Q \,||\, P] and thus demonstrates the non-symmetry of the Kullback-Leibler divergence.
■
Sources:
• Kullback, Solomon (1959): “Divergence”; in: Information Theory and Statistics, ch. 1.3, pp. 6ff.;
URL: http://index-of.co.uk/Information-Theory/Information%20theory%20and%20statistics%20-%
20Solomon%20Kullback.pdf.
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-08-11; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Basic_
example.
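The two divergence values from (5) and (6) are easy to reproduce; the numbers confirm the non-symmetry (natural logarithm throughout):

```python
from math import log

p = [1/4, 1/2, 1/4]    # pmf of Bin(2, 0.5)
q = [1/3, 1/3, 1/3]    # pmf of U(0, 2)

kl_pq = sum(pi * log(pi / qi) for pi, qi in zip(p, q))
kl_qp = sum(qi * log(qi / pi) for pi, qi in zip(p, q))

assert round(kl_pq, 4) == 0.0589    # KL[P||Q] = (1/2) log(9/8)
assert round(kl_qp, 4) == 0.0566    # KL[Q||P] = (1/3) log(32/27)
assert kl_pq != kl_qp               # the divergence is not symmetric
```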
2.5.5 Convexity
Theorem: The Kullback-Leibler divergence (→ I/2.5.1) is convex in the pair of probability distributions (→ I/1.5.1) (p, q), i.e.

KL[\lambda p_1 + (1-\lambda) p_2 \,||\, \lambda q_1 + (1-\lambda) q_2] \le \lambda \, KL[p_1 \,||\, q_1] + (1-\lambda) \, KL[p_2 \,||\, q_2] \quad (1)

for two pairs of probability distributions (p_1, q_1) and (p_2, q_2) and all 0 \le \lambda \le 1.
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-08-11; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Properties.
• Xie, Yao (2012): “Chain Rules and Inequalities”; in: ECE587: Information Theory, Lecture 3,
Slides 22/24; URL: https://www2.isye.gatech.edu/~yxie77/ece587/Lecture3.pdf.
KL[P \,||\, Q] = \int_{\mathcal{X}} \int_{\mathcal{Y}} p_1(x) \, p_2(y) \cdot \left[ \log \frac{p_1(x)}{q_1(x)} + \log \frac{p_2(y)}{q_2(y)} \right] dy \, dx
= \int_{\mathcal{X}} \int_{\mathcal{Y}} p_1(x) \, p_2(y) \cdot \log \frac{p_1(x)}{q_1(x)} \, dy \, dx + \int_{\mathcal{X}} \int_{\mathcal{Y}} p_1(x) \, p_2(y) \cdot \log \frac{p_2(y)}{q_2(y)} \, dy \, dx
= \int_{\mathcal{X}} p_1(x) \cdot \log \frac{p_1(x)}{q_1(x)} \left( \int_{\mathcal{Y}} p_2(y) \, dy \right) dx + \int_{\mathcal{Y}} p_2(y) \cdot \log \frac{p_2(y)}{q_2(y)} \left( \int_{\mathcal{X}} p_1(x) \, dx \right) dy
= \int_{\mathcal{X}} p_1(x) \cdot \log \frac{p_1(x)}{q_1(x)} \, dx + \int_{\mathcal{Y}} p_2(y) \cdot \log \frac{p_2(y)}{q_2(y)} \, dy
\overset{(2)}{=} KL[P_1 \,||\, Q_1] + KL[P_2 \,||\, Q_2] \; . \quad (5)
■
■
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-31; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Properties.
p(x) \, dx = p(y) \, dy
q(x) \, dx = q(y) \, dy \quad (3)

p(x) = p(y) \, \frac{dy}{dx}
q(x) = q(y) \, \frac{dy}{dx} \; , \quad (4)
the KL divergence can be evaluated as follows:
KL[p(x) \,||\, q(x)] = \int_a^b p(x) \cdot \log \frac{p(x)}{q(x)} \, dx
= \int_{y(a)}^{y(b)} p(y) \frac{dy}{dx} \cdot \log \frac{p(y) \frac{dy}{dx}}{q(y) \frac{dy}{dx}} \, dx
= \int_{y(a)}^{y(b)} p(y) \cdot \log \frac{p(y)}{q(y)} \, dy
= KL[p(y) \,||\, q(y)] \; . \quad (5)
■
■
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-27; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Properties.
• shimao (2018): “KL divergence invariant to affine transformation?”; in: StackExchange CrossVali-
dated, retrieved on 2020-05-28; URL: https://stats.stackexchange.com/questions/341922/kl-divergence-inv
Now considering the definitions of marginal entropy (→ I/2.1.1) and cross-entropy (→ I/2.1.6)
H(P) = - \sum_{x \in \mathcal{X}} p(x) \log p(x)
H(P, Q) = - \sum_{x \in \mathcal{X}} p(x) \log q(x) \; , \quad (4)
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-27; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Motivation.
Now considering the definitions of marginal differential entropy (→ I/2.2.1) and differential cross-
entropy (→ I/2.2.9)
h(P) = - \int_{\mathcal{X}} p(x) \log p(x) \, dx
h(P, Q) = - \int_{\mathcal{X}} p(x) \log q(x) \, dx \; , \quad (4)
Sources:
• Wikipedia (2020): “Kullback-Leibler divergence”; in: Wikipedia, the free encyclopedia, retrieved on
2020-05-27; URL: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Motivation.
3. ESTIMATION THEORY 113
3 Estimation theory
3.1 Point estimates
3.1.1 Mean squared error
Definition: Let θ̂ be an estimator of an unknown parameter θ based on measured data y. Then, the mean squared error is defined as the expected value (→ I/1.10.1) of the squared difference between the estimated value and the true value of the parameter:

\mathrm{MSE} = E_{\hat{\theta}} \left[ (\hat{\theta} - \theta)^2 \right] \; , \quad (1)

where E_{\hat{\theta}}[\cdot] denotes the expectation calculated over all possible samples y leading to values of θ̂.
Sources:
• Wikipedia (2022): “Estimator”; in: Wikipedia, the free encyclopedia, retrieved on 2022-03-27; URL:
https://en.wikipedia.org/wiki/Estimator#Mean_squared_error.
3.1.2 Partition of the mean squared error into bias and variance
Theorem: The mean squared error (→ I/3.1.1) can be partitioned into variance and squared bias:

\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta}, \theta)^2 \; . \quad (1)
Proof: The mean squared error (→ I/3.1.1) (MSE) is defined as the expected value (→ I/1.10.1) of
the squared deviation of the estimated value θ̂ from the true value θ of a parameter, over all values
θ̂:
\mathrm{MSE}(\hat{\theta}) = E_{\hat{\theta}} \left[ (\hat{\theta} - \theta)^2 \right] \; . \quad (4)
Adding and subtracting E_{\hat{\theta}}(\hat{\theta}) inside the square, this can be expanded as

\mathrm{MSE}(\hat{\theta}) = E_{\hat{\theta}} \left[ (\hat{\theta} - \theta)^2 \right]
= E_{\hat{\theta}} \left[ \left( \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) + E_{\hat{\theta}}(\hat{\theta}) - \theta \right)^2 \right]
= E_{\hat{\theta}} \left[ \left( \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) \right)^2 + 2 \left( \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) \right) \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right) + \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right)^2 \right]
= E_{\hat{\theta}} \left[ \left( \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) \right)^2 \right] + E_{\hat{\theta}} \left[ 2 \left( \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) \right) \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right) \right] + E_{\hat{\theta}} \left[ \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right)^2 \right] \; . \quad (5)
\mathrm{MSE}(\hat{\theta}) = E_{\hat{\theta}} \left[ \left( \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) \right)^2 \right] + 2 \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right) E_{\hat{\theta}} \left[ \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) \right] + \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right)^2
= E_{\hat{\theta}} \left[ \left( \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) \right)^2 \right] + 2 \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right) \left( E_{\hat{\theta}}(\hat{\theta}) - E_{\hat{\theta}}(\hat{\theta}) \right) + \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right)^2
= E_{\hat{\theta}} \left[ \left( \hat{\theta} - E_{\hat{\theta}}(\hat{\theta}) \right)^2 \right] + \left( E_{\hat{\theta}}(\hat{\theta}) - \theta \right)^2
= \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta}, \theta)^2 \; . \quad (6)
■
Sources:
• Wikipedia (2019): “Mean squared error”; in: Wikipedia, the free encyclopedia, retrieved on 2019-11-
27; URL: https://en.wikipedia.org/wiki/Mean_squared_error#Proof_of_variance_and_bias_
relationship.
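The partition can be illustrated by simulation. Here, the ML variance estimator of a normal sample serves as a deliberately biased θ̂; the distribution, sample size and seed below are illustrative choices, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma2, n = 0.0, 4.0, 10

# Sampling distribution of the ML variance estimator over many repeated samples
samples = rng.normal(mu, np.sqrt(sigma2), size=(100000, n))
theta_hat = samples.var(axis=1)            # np.var divides by n (ML estimator)

mse  = np.mean((theta_hat - sigma2) ** 2)
var  = np.var(theta_hat)
bias = np.mean(theta_hat) - sigma2

# MSE(theta_hat) = Var(theta_hat) + Bias(theta_hat, theta)^2
assert np.isclose(mse, var + bias ** 2)
```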
Sources:
• Wikipedia (2022): “Confidence interval”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
03-27; URL: https://en.wikipedia.org/wiki/Confidence_interval#Definition.
where 1 − α is the confidence level and \chi^2_{1, 1-\alpha} is the (1 − α)-quantile of the chi-squared distribution (→ II/3.7.1) with 1 degree of freedom.
Proof: The confidence interval (→ I/3.2.1) is defined as the interval that, under infinitely repeated
random experiments (→ I/1.1.1), contains the true parameter value with a certain probability.
Let us define the likelihood ratio
\Lambda(\phi) = \frac{p(y | \phi, \hat{\lambda})}{p(y | \hat{\phi}, \hat{\lambda})} \quad \text{for all} \quad \phi \in \Phi \quad (4)
and compute the log-likelihood ratio

\log \Lambda(\phi) = \log p(y | \phi, \hat{\lambda}) - \log p(y | \hat{\phi}, \hat{\lambda}) \; . \quad (5)

Wilks' theorem states that, as the sample size n approaches infinity, the likelihood-ratio statistic follows a chi-squared distribution under the null hypothesis:

H_0 : \theta \in \Theta_0 \;\Rightarrow\; -2 \log \frac{\max_{\theta \in \Theta_0} p(y|\theta)}{\max_{\theta \in \Theta_1} p(y|\theta)} \sim \chi^2_{\Delta k} \quad \text{as} \quad n \to \infty \quad (6)
where \Delta k is the difference in dimensionality between \Theta_0 and \Theta_1. Applied to our example in (5), we note that \Theta_1 = \{\phi, \hat{\phi}\} and \Theta_0 = \{\phi\}, such that \Delta k = 1 and Wilks' theorem implies:
-2 \left[ \log p(y | \phi, \hat{\lambda}) - \log p(y | \hat{\phi}, \hat{\lambda}) \right] \le \chi^2_{1, 1-\alpha}
\log p(y | \phi, \hat{\lambda}) - \log p(y | \hat{\phi}, \hat{\lambda}) \ge - \frac{1}{2} \chi^2_{1, 1-\alpha}
\log p(y | \phi, \hat{\lambda}) \ge \log p(y | \hat{\phi}, \hat{\lambda}) - \frac{1}{2} \chi^2_{1, 1-\alpha} \quad (9)
which is equivalent to the confidence interval given by (3).
Sources:
• Wikipedia (2020): “Confidence interval”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-19; URL: https://en.wikipedia.org/wiki/Confidence_interval#Methods_of_derivation.
• Wikipedia (2020): “Likelihood-ratio test”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
02-19; URL: https://en.wikipedia.org/wiki/Likelihood-ratio_test#Definition.
• Wikipedia (2020): “Wilks’ theorem”; in: Wikipedia, the free encyclopedia, retrieved on 2020-02-19;
URL: https://en.wikipedia.org/wiki/Wilks%27_theorem.
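The construction can be sketched for the mean of a normal distribution with known unit variance, where the likelihood-based interval can be compared against the familiar z-interval. The data vector, seed-free grid resolution and use of scipy are illustrative assumptions, not taken from the text:

```python
import numpy as np
from scipy import stats

alpha = 0.05
chi2_crit = stats.chi2.ppf(1 - alpha, df=1)          # (1-alpha)-quantile, ~3.841

y = np.array([1.2, 0.4, 2.1, 1.7, 0.9])              # made-up data, sigma = 1 known
mu_hat = y.mean()                                    # ML estimate of the mean

def log_lik(mu):
    return float(np.sum(stats.norm.logpdf(y, loc=mu, scale=1.0)))

# Collect all mu with log p(y|mu) >= log p(y|mu_hat) - chi2_crit / 2
grid = np.linspace(mu_hat - 3.0, mu_hat + 3.0, 6001)
inside = grid[[log_lik(m) >= log_lik(mu_hat) - chi2_crit / 2 for m in grid]]
lo, hi = inside.min(), inside.max()

# For known sigma this matches mu_hat +/- sqrt(chi2_crit / n), the 95% z-interval
half = np.sqrt(chi2_crit / len(y))
assert abs(lo - (mu_hat - half)) < 1e-2 and abs(hi - (mu_hat + half)) < 1e-2
```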
4 Frequentist statistics
4.1 Likelihood theory
4.1.1 Likelihood function
Definition: Let there be a generative model (→ I/5.1.1) m describing measured data y using model
parameters θ. Then, the probability density function (→ I/1.7.1) of the distribution of y given θ is
called the likelihood function of m:

\mathcal{L}_m(\theta) = p(y | \theta, m) \; . \quad (1)
The process of calculating θ̂ is called “maximum likelihood estimation” and the functional form lead-
ing from y to θ̂ given m is called “maximum likelihood estimator”. Maximum likelihood estimation,
estimator and estimates may all be abbreviated as “MLE”.
Proof: Consider a set of independent and identical normally distributed (→ II/3.2.1) observations
x = {x1 , . . . , xn } with unknown mean (→ I/1.10.1) µ and variance (→ I/1.11.1) σ 2 :
x_i \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2), \quad i = 1, \ldots, n \; . \quad (2)
Then, we know that the maximum likelihood estimator (→ I/4.1.3) for the variance (→ I/1.11.1) σ 2
is underestimating the true variance of the data distribution (→ IV/1.1.2):
E\left[ \hat{\sigma}^2_{\mathrm{MLE}} \right] = \frac{n-1}{n} \sigma^2 \ne \sigma^2 \; . \quad (3)
n
This proves the existence of cases such as those stated by the theorem.
■
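The bias in (3) is easy to see in a simulation; the distribution parameters and seed below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma2 = 5, 1.0

# 200,000 samples of size n; np.var with default ddof=0 is the ML estimator
x = rng.normal(0.0, np.sqrt(sigma2), size=(200000, n))
sigma2_mle = x.var(axis=1)

# Theory: E[sigma2_mle] = (n-1)/n * sigma2 = 0.8
assert abs(sigma2_mle.mean() - (n - 1) / n * sigma2) < 0.01
```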
4. FREQUENTIST STATISTICS 117
The maximum log-likelihood can be obtained by plugging the maximum likelihood estimates (→
I/4.1.3) into the log-likelihood function (→ I/4.1.2).
\mu_1 = f_1(\theta_1, \ldots, \theta_k)
\vdots
\mu_k = f_k(\theta_1, \ldots, \theta_k) \; , \quad (1)
Sources:
• Wikipedia (2021): “Method of moments (statistics)”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-04-29; URL: https://en.wikipedia.org/wiki/Method_of_moments_(statistics)#Method.
H : θ ∈ Θ∗ where Θ∗ ⊂ Θ . (1)
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_
of_terms.
Sources:
• Wikipedia (2021): “Exclusion of the null hypothesis”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-03-19; URL: https://en.wikipedia.org/wiki/Exclusion_of_the_null_hypothesis#
Terminology.
H : θ = θ∗ ; (1)
• H is called a set hypothesis or inexact hypothesis, if it specifies a set of possible values with more
than one element for the parameter value (e.g. a range or an interval):
H : θ ∈ Θ∗ . (2)
Sources:
• Wikipedia (2021): “Exclusion of the null hypothesis”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-03-19; URL: https://en.wikipedia.org/wiki/Exclusion_of_the_null_hypothesis#
Terminology.
H0 : θ = θ0 (1)
and consider a set (→ I/4.2.3) alternative hypothesis (→ I/4.3.3) H1 . Then,
• H1 is called a left-sided one-tailed hypothesis, if θ is assumed to be smaller than θ0 :
H1 : θ < θ0 ; (2)
• H1 is called a right-sided one-tailed hypothesis, if θ is assumed to be larger than θ0 :
H1 : θ > θ0 ; (3)
• H1 is called a two-tailed hypothesis, if θ is assumed to be unequal to θ0 :
H1 : θ ̸= θ0 . (4)
Sources:
• Wikipedia (2021): “One- and two-tailed tests”; in: Wikipedia, the free encyclopedia, retrieved on
2021-03-31; URL: https://en.wikipedia.org/wiki/One-_and_two-tailed_tests.
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#The_testing_
process.
H0 : θ ∈ Θ0 where Θ0 ⊂ Θ . (1)
Sources:
• Wikipedia (2021): “Exclusion of the null hypothesis”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-12; URL: https://en.wikipedia.org/wiki/Exclusion_of_the_null_hypothesis#Basic_
definitions.
H0 : θ ∈ Θ0 where Θ0 ⊂ Θ
(1)
H1 : θ ∈ Θ1 where Θ1 = Θ \ Θ0 .
Sources:
• Wikipedia (2021): “Exclusion of the null hypothesis”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-12; URL: https://en.wikipedia.org/wiki/Exclusion_of_the_null_hypothesis#Basic_
definitions.
Sources:
• Wikipedia (2021): “One- and two-tailed tests”; in: Wikipedia, the free encyclopedia, retrieved on
2021-03-31; URL: https://en.wikipedia.org/wiki/One-_and_two-tailed_tests.
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#The_testing_
process.
Sources:
• Wikipedia (2021): “Size (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-19;
URL: https://en.wikipedia.org/wiki/Size_(statistics).
Sources:
• Wikipedia (2021): “Power of a test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-31;
URL: https://en.wikipedia.org/wiki/Power_of_a_test#Description.
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_
of_terms.
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_
of_terms.
4.3.10 p-value
Definition: Let there be a statistical test (→ I/4.3.1) of the null hypothesis (→ I/4.3.2) H0 and
the alternative hypothesis (→ I/4.3.3) H1 using the test statistic (→ I/4.3.5) T (Y ). Let y be the
measured data and let tobs = T (y) be the observed test statistic computed from y. Moreover, assume
that FT (t) is the cumulative distribution function (→ I/1.8.1) (CDF) of the distribution of T (Y )
under H0 .
Then, the p-value is the probability of obtaining a test statistic more extreme than or as extreme as
tobs , given that the null hypothesis H0 is true:
• p = FT (tobs ), if H1 is a left-sided one-tailed hypothesis (→ I/4.2.4);
• p = 1 − FT (tobs ), if H1 is a right-sided one-tailed hypothesis (→ I/4.2.4);
• p = 2 · min ([FT (tobs ), 1 − FT (tobs )]), if H1 is a two-tailed hypothesis (→ I/4.2.4).
Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_
of_terms.
p ∼ U (0, 1) . (1)
Proof: Without loss of generality, consider a left-sided one-tailed hypothesis test (→ I/4.2.4). Then,
the p-value is a function of the test statistic (→ I/4.3.10)
P = F_T(T)
p = F_T(t_{\mathrm{obs}}) \quad (2)
where tobs is the observed test statistic (→ I/4.3.5) and FT (t) is the cumulative distribution function
(→ I/1.8.1) of the test statistic (→ I/4.3.5) under the null hypothesis (→ I/4.3.2).
Then, we can obtain the cumulative distribution function (→ I/1.8.1) of the p-value (→ I/4.3.10) as

F_P(x) = \Pr(P \le x) = \Pr(F_T(T) \le x) = \Pr(T \le F_T^{-1}(x)) = F_T(F_T^{-1}(x)) = x \; , \quad (3)
which is the cumulative distribution function of a continuous uniform distribution (→ II/3.1.4) over
the interval [0, 1]:
F_X(x) = \int_{-\infty}^{x} \mathcal{U}(z; 0, 1) \, dz = x \quad \text{where} \quad 0 \le x \le 1 \; . \quad (4)
■
Sources:
• jll (2018): “Why are p-values uniformly distributed under the null hypothesis?”; in: StackExchange
CrossValidated, retrieved on 2022-03-18; URL: https://stats.stackexchange.com/a/345763/270304.
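A simulation makes the result tangible: generate data under H0 many times, compute a left-sided p-value each time, and check that the empirical CDF of p is close to the identity. The z-test setup, sample sizes and seed are illustrative assumptions; scipy supplies the normal CDF:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps = 20, 100000

# Data under H0: x_i ~ N(0, 1); test statistic z = sqrt(n) * sample mean
z = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1) * np.sqrt(n)

# Left-sided p-value: p = F_T(t_obs) with F_T the standard normal CDF
p = stats.norm.cdf(z)

# Uniformity: P(p <= c) should be close to c for every c in [0, 1]
for c in (0.05, 0.25, 0.50, 0.90):
    assert abs(np.mean(p <= c) - c) < 0.01
```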
5 Bayesian statistics
5.1 Probabilistic modeling
5.1.1 Generative model
Definition: Consider measured data y and some unknown latent parameters θ. A statement about
the distribution (→ I/1.5.1) of y given θ is called a generative model m
m : y ∼ D(θ) , (1)
where D denotes an arbitrary probability distribution (→ I/1.5.1) and θ are the parameters of this
distribution.
Sources:
• Friston et al. (2008): “Bayesian decoding of brain images”; in: NeuroImage, vol. 39, pp. 181-205;
URL: https://www.sciencedirect.com/science/article/abs/pii/S1053811907007203; DOI: 10.1016/j.neuroim
θ ∼ D(λ) . (1)
The parameters λ of this distribution are called the prior hyperparameters and the probability density
function (→ I/1.7.1) is called the prior density:
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Probability and
inference”; in: Bayesian Data Analysis, ch. 1, p. 3; URL: http://www.stat.columbia.edu/~gelman/
book/.
5. BAYESIAN STATISTICS 125
Proof: The joint likelihood (→ I/5.1.5) is defined as the joint probability (→ I/1.3.2) distribution
(→ I/1.5.1) of data y and parameters θ:
p(y | \theta, m) = \frac{p(y, \theta | m)}{p(\theta | m)} \quad \Leftrightarrow \quad p(y, \theta | m) = p(y | \theta, m) \, p(\theta | m) \; . \quad (3)
Then, the value of θ at which the posterior density (→ I/5.1.7) attains its maximum is called the
“maximum-a-posteriori estimate”, “MAP estimate” or “posterior mode” of θ:
Sources:
• Wikipedia (2023): “Maximum a posteriori estimation”; in: Wikipedia, the free encyclopedia, re-
trieved on 2023-12-01; URL: https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation#
Description.
Proof: In a full probability model (→ I/5.1.4), the posterior distribution (→ I/5.1.7) can be expressed
using Bayes’ theorem (→ I/5.3.1):
p(\theta | y, m) = \frac{p(y | \theta, m) \, p(\theta | m)}{p(y | m)} \; . \quad (2)
Applying the law of conditional probability (→ I/1.3.4) to the numerator, we have:
p(\theta | y, m) = \frac{p(y, \theta | m)}{p(y | m)} \; . \quad (3)
Because the denominator does not depend on θ, it is constant in θ and thus acts as a proportionality factor between the posterior distribution and the joint likelihood:

p(\theta | y, m) \propto p(y, \theta | m) \; .
■
p(\theta | y_1, y_2) \propto \frac{p(\theta | y_1) \cdot p(\theta | y_2)}{p(\theta)} \; . \quad (2)
Proof: Since p(θ|y1 ) and p(θ|y2 ) are posterior distributions (→ I/5.1.7), Bayes’ theorem (→ I/5.3.1)
holds for them:
Moreover, Bayes’ theorem must also hold for the combined posterior distribution (→ I/5.1.9):
Note that the second fraction does not depend on θ and thus, the posterior distribution over θ is
proportional to the first fraction:
p(\theta | y_1, y_2) \propto \frac{p(\theta | y_1) \cdot p(\theta | y_2)}{p(\theta)} \; . \quad (6)
■
and related to likelihood function (→ I/5.1.2) and prior distribution (→ I/5.1.3) as follows:
p(y | m) = \int_{\Theta} p(y | \theta, m) \, p(\theta | m) \, d\theta \; . \quad (2)
Proof: In a full probability model (→ I/5.1.4), the marginal likelihood (→ I/5.1.11) is defined as
the marginal probability of the data y, given only the model m:
p(y|m) . (3)
Using the law of marginal probability (→ I/1.3.3), this can be obtained by integrating the joint likelihood (→ I/5.1.5) function over the entire parameter space:
p(y | m) = \int_{\Theta} p(y, \theta | m) \, d\theta \; . \quad (4)
Applying the law of conditional probability (→ I/1.3.4), the integrand can also be written as the
product of likelihood function (→ I/5.1.2) and prior density (→ I/5.1.3):
p(y | m) = \int_{\Theta} p(y | \theta, m) \, p(\theta | m) \, d\theta \; . \quad (5)
■
Sources:
• Friston et al. (2002): “Classical and Bayesian Inference in Neuroimaging: Theory”; in: NeuroIm-
age, vol. 16, iss. 2, pp. 465-483, fn. 1; URL: https://www.sciencedirect.com/science/article/pii/
S1053811902910906; DOI: 10.1006/nimg.2002.1090.
• Friston et al. (2002): “Classical and Bayesian Inference in Neuroimaging: Applications”; in: Neu-
roImage, vol. 16, iss. 2, pp. 484-512, fn. 10; URL: https://www.sciencedirect.com/science/article/
pii/S1053811902910918; DOI: 10.1006/nimg.2002.1091.
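Equation (5) can be evaluated by numerical integration for a simple conjugate example. Assume, purely for illustration, a binomial likelihood with k successes in n trials and a flat prior on the success probability; the Beta-Binomial closed form then gives p(y|m) = 1/(n+1) for every k:

```python
import numpy as np
from math import comb

n, k = 10, 7                                   # hypothetical data: 7 of 10 successes
theta = np.linspace(0.0, 1.0, 100001)
likelihood = comb(n, k) * theta**k * (1.0 - theta)**(n - k)
prior = np.ones_like(theta)                    # uniform prior density on [0, 1]

# p(y|m) = integral of p(y|theta, m) * p(theta|m) d(theta), trapezoidal rule
dx = np.diff(theta)
integrand = likelihood * prior
marginal = float(np.sum(dx * (integrand[:-1] + integrand[1:]) / 2.0))

assert abs(marginal - 1.0 / (n + 1)) < 1e-7    # closed form: 1 / (n + 1)
```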
• the distribution is called a “non-uniform prior”, if its density (→ I/1.7.1) or mass (→ I/1.6.1) is
not constant over Θ.
Sources:
• Wikipedia (2020): “Lindley’s paradox”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
25; URL: https://en.wikipedia.org/wiki/Lindley%27s_paradox#Bayesian_approach.
Sources:
• Soch J, Allefeld C, Haynes JD (2016): “How to avoid mismodelling in GLM-based fMRI data
analysis: cross-validated Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469-489, eq.
15, p. 473; URL: https://www.sciencedirect.com/science/article/pii/S1053811916303615; DOI:
10.1016/j.neuroimage.2016.07.047.
Sources:
• Soch J, Allefeld C, Haynes JD (2016): “How to avoid mismodelling in GLM-based fMRI data
analysis: cross-validated Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469-489, eq.
13, p. 473; URL: https://www.sciencedirect.com/science/article/pii/S1053811916303615; DOI:
10.1016/j.neuroimage.2016.07.047.
Sources:
• Wikipedia (2020): “Conjugate prior”; in: Wikipedia, the free encyclopedia, retrieved on 2020-12-02;
URL: https://en.wikipedia.org/wiki/Conjugate_prior.
Sources:
• Wikipedia (2020): “Prior probability”; in: Wikipedia, the free encyclopedia, retrieved on 2020-12-
02; URL: https://en.wikipedia.org/wiki/Prior_probability#Uninformative_priors.
Sources:
• Wikipedia (2020): “Empirical Bayes method”; in: Wikipedia, the free encyclopedia, retrieved on
2020-12-02; URL: https://en.wikipedia.org/wiki/Empirical_Bayes_method#Introduction.
Sources:
• Wikipedia (2020): “Prior probability”; in: Wikipedia, the free encyclopedia, retrieved on 2020-12-
02; URL: https://en.wikipedia.org/wiki/Prior_probability#Uninformative_priors.
p(A|B) = \frac{p(B|A) \, p(A)}{p(B)} \; . \quad (1)
Proof: The conditional probability (→ I/1.3.4) is defined as the ratio of joint probability (→ I/1.3.2),
i.e. the probability of both statements being true, and marginal probability (→ I/1.3.3), i.e. the
probability of only the second one being true:
p(A|B) = \frac{p(A, B)}{p(B)} \; . \quad (2)
It can also be written down for the reverse situation, i.e. to calculate the probability that B is true,
given that A is true:
p(B|A) = \frac{p(A, B)}{p(A)} \; . \quad (3)
Both equations can be rearranged for the joint probability
p(A|B) \, p(B) \overset{(2)}{=} p(A, B) \overset{(3)}{=} p(B|A) \, p(A) \quad (4)
from which Bayes' theorem can be directly derived:

p(A|B) = \frac{p(B|A) \, p(A)}{p(B)} \; . \quad (5)
■
Sources:
• Koch, Karl-Rudolf (2007): “Rules of Probability”; in: Introduction to Bayesian Statistics, Springer,
Berlin/Heidelberg, 2007, pp. 6/13, eqs. 2.12/2.38; URL: https://www.springer.com/de/book/
9783540727231; DOI: 10.1007/978-3-540-72726-2.
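A classic worked example applies the theorem to a diagnostic test. All three input probabilities below are made-up illustrative assumptions, with A = "condition present" and B = "test positive":

```python
p_A = 0.01                    # prior p(A): base rate of the condition
p_B_given_A = 0.95            # p(B|A): sensitivity of the test
p_B_given_notA = 0.10         # p(B|not A): false-positive rate

# Law of marginal probability: p(B) = p(B|A) p(A) + p(B|not A) p(not A)
p_B = p_B_given_A * p_A + p_B_given_notA * (1.0 - p_A)

# Bayes' theorem: p(A|B) = p(B|A) p(A) / p(B)
p_A_given_B = p_B_given_A * p_A / p_B

assert abs(p_A_given_B - 0.0876) < 1e-3   # posterior is only about 8.8%
```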
Proof: Using Bayes’ theorem (→ I/5.3.1), the conditional probabilities (→ I/1.3.4) on the left are
given by
p(A_1 | B) = \frac{p(B | A_1) \cdot p(A_1)}{p(B)} \quad (2)
p(A_2 | B) = \frac{p(B | A_2) \cdot p(A_2)}{p(B)} \; . \quad (3)
Dividing the two conditional probabilities by each other, the marginal probability p(B) cancels out:

\frac{p(A_1 | B)}{p(A_2 | B)} = \frac{p(B | A_1) \cdot p(A_1)}{p(B | A_2) \cdot p(A_2)} \; . \quad (4)
■
Sources:
• Wikipedia (2019): “Bayes’ theorem”; in: Wikipedia, the free encyclopedia, retrieved on 2020-01-06;
URL: https://en.wikipedia.org/wiki/Bayes%27_theorem#Bayes%E2%80%99_rule.
Sources:
• Wikipedia (2021): “Empirical Bayes method”; in: Wikipedia, the free encyclopedia, retrieved on
2021-04-29; URL: https://en.wikipedia.org/wiki/Empirical_Bayes_method#Introduction.
• Bishop CM (2006): “The Evidence Approximation”; in: Pattern Recognition and Machine Learning, ch. 3.5, pp. 165-172; URL: https://www.springer.com/gp/book/9780387310732.
for Bayesian inference (→ I/5.3.1), i.e. obtaining the posterior distribution (→ I/5.1.7) (from eq. (3))
and approximating the marginal likelihood (→ I/5.1.11) (by plugging eq. (3) into eq. (2)).
Sources:
• Wikipedia (2021): “Variational Bayesian methods”; in: Wikipedia, the free encyclopedia, retrieved
on 2021-04-29; URL: https://en.wikipedia.org/wiki/Variational_Bayesian_methods#Evidence_
lower_bound.
• Penny W, Flandin G, Trujillo-Barreto N (2007): “Bayesian Comparison of Spatially Regularised
General Linear Models”; in: Human Brain Mapping, vol. 28, pp. 275–293, eqs. 2-9; URL: https:
//onlinelibrary.wiley.com/doi/full/10.1002/hbm.20327; DOI: 10.1002/hbm.20327.
Chapter II
Probability Distributions
X ∼ U (a, b) , (1)
if and only if each integer between and including a and b occurs with the same probability.
Sources:
• Wikipedia (2020): “Discrete uniform distribution”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-07-28; URL: https://en.wikipedia.org/wiki/Discrete_uniform_distribution.
X ∼ U (a, b) . (1)
Then, the probability mass function (→ I/1.6.1) of X is
fX (x) = 1/(b − a + 1) where x ∈ {a, a + 1, . . . , b − 1, b} . (2)
Proof: A discrete uniform variable is defined as (→ II/1.1.1) having the same probability for each
integer between and including a and b. The number of integers between and including a and b is
n=b−a+1 (3)
and because the sum across all probabilities (→ I/1.6.1) is
∑_{x=a}^{b} fX (x) = 1 , (4)
we have
fX (x) = 1/n = 1/(b − a + 1) . (5)
■
X ∼ U (a, b) . (1)
Then, the cumulative distribution function (→ I/1.8.1) of X is
FX (x) = { 0 , if x < a ; (⌊x⌋ − a + 1)/(b − a + 1) , if a ≤ x ≤ b ; 1 , if x > b } . (2)
Proof: The probability mass function of the discrete uniform distribution (→ II/1.1.2) is
U(x; a, b) = 1/(b − a + 1) where x ∈ {a, a + 1, . . . , b − 1, b} . (3)
Thus, the cumulative distribution function (→ I/1.8.1) is:
FX (x) = ∫_{−∞}^{x} U(z; a, b) dz (4)
From (3), it follows that the cumulative probability increases step-wise by 1/n at each integer between
and including a and b where
n=b−a+1 (5)
is the number of integers between and including a and b. This can be expressed by noting that
FX (x) = (⌊x⌋ − a + 1)/n , if a ≤ x ≤ b . (6)
Also, because Pr(X < a) = 0, we have
FX (x) = ∫_{−∞}^{x} 0 dz = 0 , if x < a (7)

and, for x > b, we have

FX (x) = ∫_{−∞}^{x} U(z; a, b) dz
       = ∫_{−∞}^{b} U(z; a, b) dz + ∫_{b}^{x} U(z; a, b) dz
       = FX (b) + ∫_{b}^{x} 0 dz = 1 + 0      (8)
       = 1 , if x > b .
X ∼ U (a, b) . (1)
Proof: The cumulative distribution function of the discrete uniform distribution (→ II/1.1.3) is:
FX (x) = { 0 , if x < a ; (⌊x⌋ − a + 1)/(b − a + 1) , if a ≤ x ≤ b ; 1 , if x > b } . (3)
The quantile function (→ I/1.9.1) QX (p) is defined as the smallest x, such that FX (x) = p:
QX (p) = a + p · (b − a + 1) − 1
= a + pb − pa + p − 1 (7)
= a(1 − p) + (b + 1)p − 1 .
X ∼ U (a, b) . (1)
Then, the (Shannon) entropy (→ I/2.1.1) of X in nats is

H(X) = ln(b − a + 1) . (2)
Proof: The entropy (→ I/2.1.1) is defined as the probability-weighted average of the logarithmized
probabilities for all possible values:
H(X) = − ∑_{x∈X} p(x) · log_b p(x) . (3)
Entropy is measured in nats by setting b = e. Then, with the probability mass function of the discrete
uniform distribution (→ II/1.1.2), we have:
H(X) = − ∑_{x∈X} p(x) · log_e p(x)
     = − ∑_{x=a}^{b} p(x) · ln p(x)
     = − ∑_{x=a}^{b} 1/(b − a + 1) · ln [1/(b − a + 1)]      (4)
     = −(b − a + 1) · 1/(b − a + 1) · ln [1/(b − a + 1)]
     = − ln [1/(b − a + 1)]
     = ln(b − a + 1) .
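The closed form can be checked by direct summation; a minimal Python sketch with hypothetical parameters a and b:

```python
import math

# Entropy of the discrete uniform distribution U(a, b) in nats:
# direct probability-weighted sum vs. the closed form ln(b - a + 1).
a, b = 3, 10                     # toy parameters
n = b - a + 1                    # number of possible values
H_direct = -sum((1 / n) * math.log(1 / n) for _ in range(a, b + 1))
assert math.isclose(H_direct, math.log(n))
```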
P : X ∼ U(a1 , b1 )
Q : X ∼ U(a2 , b2 ) .      (1)
KL[P || Q] = ∑_{x=−∞}^{a1−1} p(x) ln [p(x)/q(x)] + ∑_{x=a1}^{b1} p(x) ln [p(x)/q(x)] + ∑_{x=b1+1}^{+∞} p(x) ln [p(x)/q(x)] (4)
and because p(x) = 0 for any x < a1 and any x > b1 , we have
KL[P || Q] = ∑_{x=−∞}^{a1−1} 0 · ln [0/q(x)] + ∑_{x=a1}^{b1} p(x) ln [p(x)/q(x)] + ∑_{x=b1+1}^{+∞} 0 · ln [0/q(x)] . (5)
KL[P || Q] = ∑_{x=a1}^{b1} p(x) ln [p(x)/q(x)] (6)
and we can use the probability mass function of the discrete uniform distribution (→ II/1.1.2) to
evaluate:
KL[P || Q] = ∑_{x=a1}^{b1} 1/(b1 − a1 + 1) · ln { [1/(b1 − a1 + 1)] / [1/(b2 − a2 + 1)] }
           = 1/(b1 − a1 + 1) · ln [(b2 − a2 + 1)/(b1 − a1 + 1)] · ∑_{x=a1}^{b1} 1      (7)
           = 1/(b1 − a1 + 1) · ln [(b2 − a2 + 1)/(b1 − a1 + 1)] · (b1 − a1 + 1)
           = ln [(b2 − a2 + 1)/(b1 − a1 + 1)] .
■
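The closed form in (7) can be checked by direct summation; a Python sketch with hypothetical parameters satisfying [a1 , b1 ] ⊆ [a2 , b2 ]:

```python
import math

# KL divergence between U(a1, b1) and U(a2, b2), with the support of P
# contained in the support of Q (toy parameters).
a1, b1, a2, b2 = 2, 5, 0, 9
p = 1 / (b1 - a1 + 1)            # pmf of P on its support
q = 1 / (b2 - a2 + 1)            # pmf of Q on the support of P
kl_direct = sum(p * math.log(p / q) for _ in range(a1, b1 + 1))
kl_closed = math.log((b2 - a2 + 1) / (b1 - a1 + 1))
assert math.isclose(kl_direct, kl_closed)
```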
Proof: A random variable with finite support is a discrete random variable (→ I/1.2.6). Let X be such a random variable. Without loss of generality, we can assume that the possible values of X can be enumerated from 1 to n.
Let g(x) be the discrete uniform distribution with minimum a = 1 and maximum b = n, which assigns equal probability to all n possible values, and let f (x) be an arbitrary discrete (→ I/1.2.6) probability distribution (→ I/1.5.1) on the set {1, 2, . . . , n − 1, n}.
For a discrete random variable (→ I/1.2.6) X with set of possible values X and probability mass
function (→ I/1.6.1) p(x), the Shannon entropy (→ I/2.1.1) is defined as:
H(X) = − ∑_{x∈X} p(x) log p(x) (1)
Consider the Kullback-Leibler divergence (→ I/2.5.1) of distribution f (x) from distribution g(x)
which is non-negative (→ I/2.5.2):
1. UNIVARIATE DISCRETE DISTRIBUTIONS 141
0 ≤ KL[f ||g] = ∑_{x∈X} f (x) log [f (x)/g(x)]
             = ∑_{x∈X} f (x) log f (x) − ∑_{x∈X} f (x) log g(x)      (2)
             = −H[f (x)] − ∑_{x∈X} f (x) log g(x) .
x∈X
By plugging the probability mass function of the discrete uniform distribution (→ II/1.1.2) into the second term, we obtain:
∑_{x∈X} f (x) log g(x) = ∑_{x=1}^{n} f (x) log [1/(n − 1 + 1)]
                       = log(1/n) · ∑_{x=1}^{n} f (x)      (3)
                       = − log(n) .
This is actually the negative of the entropy of the discrete uniform distribution (→ II/1.1.5), such
that:
∑_{x∈X} f (x) log g(x) = −H[U(1, n)] = −H[g(x)] . (4)
Combining (2) with (4), we get:

0 ≤ KL[f ||g]
0 ≤ −H[f (x)] − (−H[g(x)])      (5)
H[g(x)] ≥ H[f (x)]
which means that the entropy (→ I/2.1.1) of the discrete uniform distribution (→ II/1.1.1) U(a, b)
will be larger than or equal to any other distribution (→ I/1.5.1) defined on the same set of values
{a, . . . , b}.
Sources:
• Probability Fact (2023): “The entropy of a distribution with finite domain”; in: Twitter, retrieved
on 2023-08-18; URL: https://twitter.com/ProbFact/status/1673787091610750980.
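The inequality can be illustrated numerically; a Python sketch comparing the entropy of a hypothetical pmf f with that of the uniform g on the same support:

```python
import math

def entropy(p):
    """Shannon entropy in nats of a pmf given as a list of probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

n = 4
g = [1 / n] * n                     # discrete uniform on {1, ..., n}
f = [0.5, 0.25, 0.15, 0.10]         # arbitrary toy pmf on the same support
assert math.isclose(sum(f), 1.0)
assert entropy(f) <= entropy(g)     # H[f] <= H[g] = ln(n)
```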
X ∼ Bern(p) , (1)
if X = 1 with probability (→ I/1.3.1) p and X = 0 with probability (→ I/1.3.1) q = 1 − p.
Sources:
• Wikipedia (2020): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution.
X ∼ Bern(p) . (1)
Then, the probability mass function (→ I/1.6.1) of X is
fX (x) = { p , if x = 1 ; 1 − p , if x = 0 } . (2)
Proof: This follows directly from the definition of the Bernoulli distribution (→ II/1.2.1).
■
1.2.3 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following a Bernoulli distribution (→ II/1.2.1):
X ∼ Bern(p) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
E(X) = p . (2)
Proof: The expected value (→ I/1.10.1) is the probability-weighted average of all possible values:
E(X) = ∑_{x∈X} x · Pr(X = x) . (3)
Since there are only two possible outcomes for a Bernoulli random variable (→ II/1.2.2), we have:

E(X) = 1 · p + 0 · (1 − p) = p . (4)
Sources:
• Wikipedia (2020): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-16; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution#Mean.
1.2.4 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following a Bernoulli distribution (→ II/1.2.1):
X ∼ Bern(p) . (1)
Then, the variance (→ I/1.11.1) of X is
Var(X) = p (1 − p) . (2)
Proof: The variance (→ I/1.11.1) is the probability-weighted average of the squared deviation from
the expected value (→ I/1.10.1) across all possible values
Var(X) = ∑_{x∈X} (x − E(X))² · Pr(X = x) . (3)

Since there are only two possible outcomes for a Bernoulli random variable (→ II/1.2.2) and its expected value is p (→ II/1.2.3), we have:

Var(X) = (1 − p)² · p + (0 − p)² · (1 − p) = p − p² = p (1 − p) . (7)
Sources:
• Wikipedia (2022): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
01-20; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution#Variance.
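Both moments can be recovered by direct probability-weighted sums; a Python sketch with a hypothetical success probability:

```python
# Mean and variance of Bern(p) by direct summation over the two outcomes.
p = 0.3                                              # toy parameter
mean = 1 * p + 0 * (1 - p)                           # E(X) = p
var = (1 - mean) ** 2 * p + (0 - mean) ** 2 * (1 - p)  # Var(X) = p (1 - p)
assert abs(mean - p) < 1e-12
assert abs(var - p * (1 - p)) < 1e-12
```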
X ∼ Bern(p) . (1)
Then, the variance (→ I/1.11.1) of X is necessarily between 0 and 1/4:
0 ≤ Var(X) ≤ 1/4 . (2)
Proof: The variance of the Bernoulli distribution (→ II/1.2.4) as a function of p is Var(p) = p (1 − p) = −p² + p. Taking the derivative with respect to p

dVar(p)/dp = −2 p + 1 (5)
and setting this derivative to zero
dVar(pM )/dp = 0
0 = −2 pM + 1      (6)
pM = 1/2 ,
we obtain the maximum possible variance
max [Var(X)] = Var(pM ) = −(1/2)² + 1/2 = 1/4 . (7)
The function Var(p) is monotonically increasing for 0 < p < pM as dVar(p)/dp > 0 in this interval
and it is monotonically decreasing for pM < p < 1 as dVar(p)/dp < 0 in this interval. Moreover, as
variance is always non-negative (→ I/1.11.4), the minimum variance is

min [Var(X)] = 0 , attained at p = 0 and p = 1 . (8)
Sources:
• Wikipedia (2022): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
01-27; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution#Variance.
X ∼ Bern(p) . (1)
Then, the (Shannon) entropy (→ I/2.1.1) of X in bits is

H(X) = −p · log2 p − (1 − p) · log2 (1 − p) . (2)
Proof: The entropy (→ I/2.1.1) is defined as the probability-weighted average of the logarithmized
probabilities for all possible values:
1. UNIVARIATE DISCRETE DISTRIBUTIONS 145
H(X) = − ∑_{x∈X} p(x) · log_b p(x) . (3)
Entropy is measured in bits by setting b = 2. Since there are only two possible outcomes for a Bernoulli random variable (→ II/1.2.2), we have:

H(X) = −p(X = 1) · log2 p(X = 1) − p(X = 0) · log2 p(X = 0)
     = −p · log2 p − (1 − p) · log2 (1 − p) .      (4)
Sources:
• Wikipedia (2022): “Bernoulli distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
09-02; URL: https://en.wikipedia.org/wiki/Bernoulli_distribution.
• Wikipedia (2022): “Binary entropy function”; in: Wikipedia, the free encyclopedia, retrieved on
2022-09-02; URL: https://en.wikipedia.org/wiki/Binary_entropy_function.
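The binary entropy function can be sketched in Python; the parameter values below are hypothetical:

```python
import math

def binary_entropy(p):
    """Entropy of Bern(p) in bits; by convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Maximal at p = 1/2 (one full bit), smaller for less uncertain outcomes.
assert math.isclose(binary_entropy(0.5), 1.0)
assert binary_entropy(0.1) < binary_entropy(0.5)
```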
P : X ∼ Bern(p1 )
Q : X ∼ Bern(p2 ) .      (1)

Then, the Kullback-Leibler divergence (→ I/2.5.1) of P from Q is given by

KL[P || Q] = ln [(1 − p1 )/(1 − p2 )] + p1 · ln { [p1 (1 − p2 )] / [p2 (1 − p1 )] } . (2)
KL[P || Q] = ∑_{x∈{0,1}} p(x) ln [p(x)/q(x)]
           = p(X = 0) · ln [p(X = 0)/q(X = 0)] + p(X = 1) · ln [p(X = 1)/q(X = 1)] .      (4)
Using the probability mass function of the Bernoulli distribution (→ II/1.2.2), this becomes:
KL[P || Q] = (1 − p1 ) · ln [(1 − p1 )/(1 − p2 )] + p1 · ln (p1 /p2 )
           = ln [(1 − p1 )/(1 − p2 )] + p1 · ln (p1 /p2 ) − p1 · ln [(1 − p1 )/(1 − p2 )]
           = ln [(1 − p1 )/(1 − p2 )] + p1 · [ln (p1 /p2 ) + ln ((1 − p2 )/(1 − p1 ))]      (5)
           = ln [(1 − p1 )/(1 − p2 )] + p1 · ln { [p1 (1 − p2 )] / [p2 (1 − p1 )] }
■
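The rearranged form in (5) can be checked against the direct two-term sum; a Python sketch with hypothetical p1 and p2:

```python
import math

# KL[Bern(p1) || Bern(p2)]: direct sum over {0, 1} vs. the closed form (2).
p1, p2 = 0.3, 0.6                                     # toy parameters
kl_direct = ((1 - p1) * math.log((1 - p1) / (1 - p2))
             + p1 * math.log(p1 / p2))
kl_closed = (math.log((1 - p1) / (1 - p2))
             + p1 * math.log(p1 * (1 - p2) / (p2 * (1 - p1))))
assert math.isclose(kl_direct, kl_closed)
```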
X ∼ Bin(n, p) , (1)
if X is the number of successes observed in n independent (→ I/1.3.6) trials, where each trial has two possible outcomes (→ II/1.2.1) (success/failure) and the probabilities of success and failure are identical across trials (p and q = 1 − p).
Sources:
• Wikipedia (2020): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Binomial_distribution.
X ∼ Bin(n, p) . (1)
Then, the probability mass function (→ I/1.6.1) of X is
fX (x) = (n choose x) p^x (1 − p)^(n−x) . (2)
Proof: A binomial variable (→ II/1.3.1) is defined as the number of successes observed in n independent (→ I/1.3.6) trials, where each trial has two possible outcomes (→ II/1.2.1) (success/failure) and the probabilities (→ I/1.3.1) of success and failure are identical across trials (p and q = 1 − p).
If one has obtained x successes in n trials, one has also obtained (n − x) failures. The probability of
a particular series of x successes and (n − x) failures, when order does matter, is
p^x (1 − p)^(n−x) . (3)
When order does not matter, there is a number of series consisting of x successes and (n − x) failures.
This number is equal to the number of possibilities in which x objects can be chosen from n objects
which is given by the binomial coefficient:
(n choose x) . (4)
In order to obtain the probability of x successes and (n − x) failures, when order does not matter,
the probability in (3) has to be multiplied with the number of possibilities in (4) which gives
p(X = x|n, p) = (n choose x) p^x (1 − p)^(n−x) (5)
which is equivalent to the expression above.
■
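The pmf in (5) can be evaluated directly in Python; the parameters below are hypothetical, and normalization follows from the binomial theorem:

```python
from math import comb

# Binomial pmf f_X(x) = C(n, x) p^x (1 - p)^(n - x); the probabilities
# across x = 0, ..., n must sum to one.
n, p = 10, 0.4                                         # toy parameters
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
assert abs(sum(pmf) - 1.0) < 1e-12
```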
X ∼ Bin(n, p) . (1)
Then, the probability-generating function (→ I/1.9.9) of X is

GX (z) = (1 − p + pz)^n . (2)

Proof: With the probability mass function of the binomial distribution (→ II/1.3.2), the probability-generating function E(z^X) is

GX (z) = ∑_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) z^x
       = ∑_{x=0}^{n} (n choose x) (pz)^x (1 − p)^(n−x)      (5)

which, by the binomial theorem, equals ((1 − p) + pz)^n, as stated in (2).
Sources:
• ProofWiki (2022): “Probability Generating Function of Binomial Distribution”; in: ProofWiki, re-
trieved on 2022-10-11; URL: https://proofwiki.org/wiki/Probability_Generating_Function_of_
Binomial_Distribution.
1.3.4 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following a binomial distribution (→ II/1.3.1):
X ∼ Bin(n, p) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
E(X) = np . (2)
Proof: By definition, a binomial random variable (→ II/1.3.1) is the sum of n independent and identical Bernoulli trials (→ II/1.2.1) with success probability p. Therefore, the expected value is

E(X) = E(X1 + . . . + Xn ) = E(X1 ) + . . . + E(Xn ) = np ,

using the expected value of the Bernoulli distribution (→ II/1.2.3).
Sources:
• Wikipedia (2020): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-01-16; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Expected_value_and_
variance.
1.3.5 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following a binomial distribution (→ II/1.3.1):
X ∼ Bin(n, p) . (1)
Then, the variance (→ I/1.11.1) of X is
Var(X) = np (1 − p) . (2)
Proof: By definition, a binomial random variable (→ II/1.3.1) is the sum of n independent and identical Bernoulli trials (→ II/1.2.1) with success probability p. Therefore, the variance is

Var(X) = Var(X1 + . . . + Xn ) = Var(X1 ) + . . . + Var(Xn ) = np (1 − p) ,

using the variance of the Bernoulli distribution (→ II/1.2.4) and the additivity of variance for independent random variables.
Sources:
• Wikipedia (2022): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-01-20; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Expected_value_and_
variance.
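Both moments can be checked by direct summation over the pmf; a Python sketch with hypothetical parameters:

```python
from math import comb

# Mean and variance of Bin(n, p) by direct probability-weighted sums,
# compared with the closed forms np and np(1 - p).
n, p = 12, 0.25                                        # toy parameters
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
mean = sum(x * f for x, f in enumerate(pmf))
var = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))
assert abs(mean - n * p) < 1e-9
assert abs(var - n * p * (1 - p)) < 1e-9
```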
X ∼ Bin(n, p) . (1)
Then, the variance (→ I/1.11.1) of X is necessarily between 0 and n/4:
0 ≤ Var(X) ≤ n/4 . (2)
Proof: By definition, a binomial random variable (→ II/1.3.1) is the sum of n independent and identical Bernoulli trials (→ II/1.2.1) with success probability p. Therefore, the variance is

Var(X) = Var(X1 ) + . . . + Var(Xn ) . (4)
As the variance of a Bernoulli random variable is always between 0 and 1/4 (→ II/1.2.5)
0 ≤ Var(Xi ) ≤ 1/4 for all i = 1, . . . , n , (5)
the minimum variance of X is

min [Var(X)] = n · 0 = 0 (6)

and the maximum variance of X is

max [Var(X)] = n · 1/4 = n/4 . (7)
Thus, we have:
Var(X) ∈ [0, n/4] . (8)
■
X ∼ Bin(n, p) . (1)
Then, the (Shannon) entropy (→ I/2.1.1) of X in bits is
Proof: The entropy (→ I/2.1.1) is defined as the probability-weighted average of the logarithmized
probabilities for all possible values:
H(X) = − ∑_{x∈X} p(x) · log_b p(x) . (5)
Entropy is measured in bits by setting b = 2. Then, with the probability mass function of the binomial
distribution (→ II/1.3.2), we have:
H(X) = − ∑_{x∈X} fX (x) · log2 fX (x)
     = − ∑_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) · log2 [(n choose x) p^x (1 − p)^(n−x)]
     = − ∑_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) · [log2 (n choose x) + x · log2 p + (n − x) · log2 (1 − p)]      (6)
     = − ∑_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x) · [log2 (n choose x) + x · log2 p + n · log2 (1 − p) − x · log2 (1 − p)] .
Since the first factor in the sum corresponds to the probability mass (→ I/1.6.1) of X = x, we can
rewrite this as the sum of the expected values (→ I/1.10.1) of the functions (→ I/1.10.12) of the
discrete random variable (→ I/1.2.6) x in the square bracket:
H(X) = − ⟨log2 (n choose x)⟩_p(x) − ⟨x · log2 p⟩_p(x) − ⟨n · log2 (1 − p)⟩_p(x) + ⟨x · log2 (1 − p)⟩_p(x)
     = − ⟨log2 (n choose x)⟩_p(x) − log2 p · ⟨x⟩_p(x) − n · log2 (1 − p) + log2 (1 − p) · ⟨x⟩_p(x) .      (7)
Using the expected value of the binomial distribution (→ II/1.3.4), i.e. X ∼ Bin(n, p) ⇒ ⟨x⟩ = np,
this gives:
H(X) = − ⟨log2 (n choose x)⟩_p(x) − np · log2 p − n · log2 (1 − p) + np · log2 (1 − p)
     = − ⟨log2 (n choose x)⟩_p(x) + n [−p · log2 p − (1 − p) · log2 (1 − p)] .      (8)
Finally, we note that the first term is the negative expected value (→ I/1.10.1) of the logarithm of a
binomial coefficient (→ II/1.3.2) and that the term in square brackets is the entropy of the Bernoulli
distribution (→ II/1.3.7), such that we finally get:
P : X ∼ Bin(n, p1 )
Q : X ∼ Bin(n, p2 ) .      (1)
KL[P || Q] = ∑_{x=0}^{n} p(x) ln [p(x)/q(x)]
           = p(X = 0) · ln [p(X = 0)/q(X = 0)] + . . . + p(X = n) · ln [p(X = n)/q(X = n)] .      (4)
Using the probability mass function of the binomial distribution (→ II/1.3.2), this becomes:
KL[P || Q] = ∑_{x=0}^{n} (n choose x) p1^x (1 − p1 )^(n−x) · ln { [(n choose x) p1^x (1 − p1 )^(n−x)] / [(n choose x) p2^x (1 − p2 )^(n−x)] }
           = ∑_{x=0}^{n} (n choose x) p1^x (1 − p1 )^(n−x) · [x · ln (p1 /p2 ) + (n − x) · ln ((1 − p1 )/(1 − p2 ))]      (5)
           = ln (p1 /p2 ) · ∑_{x=0}^{n} (n choose x) p1^x (1 − p1 )^(n−x) x + ln ((1 − p1 )/(1 − p2 )) · ∑_{x=0}^{n} (n choose x) p1^x (1 − p1 )^(n−x) (n − x) .
We can now see that some terms in this sum are expected values (→ I/1.10.1) with respect to
binomial distributions (→ II/1.3.1):
∑_{x=0}^{n} (n choose x) p1^x (1 − p1 )^(n−x) x = E[x]_Bin(n,p1)
∑_{x=0}^{n} (n choose x) p1^x (1 − p1 )^(n−x) (n − x) = E[n − x]_Bin(n,p1) .      (6)
Using the expected value of the binomial distribution (→ II/1.3.4), these can be simplified to
E[x]_Bin(n,p1) = np1
E[n − x]_Bin(n,p1) = n − np1 = n (1 − p1 ) ,      (7)

such that the Kullback-Leibler divergence finally becomes

KL[P || Q] = np1 · ln (p1 /p2 ) + n (1 − p1 ) · ln [(1 − p1 )/(1 − p2 )] . (8)
Sources:
• PSPACEhard (2017): “Kullback-Leibler divergence for binomial distributions P and Q”; in: Stack-
Exchange Mathematics, retrieved on 2023-10-20; URL: https://math.stackexchange.com/a/2215384/
480910.
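The closed form can be checked by direct summation over the support; a Python sketch with hypothetical parameters:

```python
from math import comb, log

# KL[Bin(n, p1) || Bin(n, p2)]: direct sum over x = 0..n vs. the
# closed form n [p1 ln(p1/p2) + (1 - p1) ln((1 - p1)/(1 - p2))].
n, p1, p2 = 8, 0.3, 0.5                                # toy parameters

def pmf(x, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

kl_direct = sum(pmf(x, p1) * log(pmf(x, p1) / pmf(x, p2))
                for x in range(n + 1))
kl_closed = n * (p1 * log(p1 / p2) + (1 - p1) * log((1 - p1) / (1 - p2)))
assert abs(kl_direct - kl_closed) < 1e-9
```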
Y |X ∼ Bin(X, q) (1)
and X also follows a binomial distribution (→ II/1.3.1), but with different success frequency (→
II/1.3.1):
X ∼ Bin(n, p) . (2)
Then, the marginal distribution (→ I/1.5.3) of Y unconditional on X is again a binomial distribution (→ II/1.3.1):

Y ∼ Bin(n, p · q) . (3)
Proof: We are interested in the probability that Y equals a number m. According to the law of
marginal probability (→ I/1.3.3) or the law of total probability (→ I/1.4.7), this probability can be
expressed as:
Pr(Y = m) = ∑_{k=0}^{∞} Pr(Y = m|X = k) · Pr(X = k) . (4)
Since, by definitions (2) and (1), Pr(X = k) = 0 when k > n and Pr(Y = m|X = k) = 0 when
k < m, we have:
Pr(Y = m) = ∑_{k=m}^{n} Pr(Y = m|X = k) · Pr(X = k) . (5)
Now we can take the probability mass function of the binomial distribution (→ II/1.3.2) and plug it
in for the terms in the sum of (5) to get:
Pr(Y = m) = ∑_{k=m}^{n} (k choose m) q^m (1 − q)^(k−m) · (n choose k) p^k (1 − p)^(n−k) . (6)
Applying the binomial coefficient identity (n choose k) (k choose m) = (n choose m) (n−m choose k−m) and rearranging the terms, we have:
Pr(Y = m) = ∑_{k=m}^{n} (n choose m) (n−m choose k−m) p^k q^m (1 − p)^(n−k) (1 − q)^(k−m) . (7)
Now we partition p^k = p^m · p^(k−m) and pull all terms independent of k out of the sum:
Pr(Y = m) = (n choose m) p^m q^m ∑_{k=m}^{n} (n−m choose k−m) p^(k−m) (1 − p)^(n−k) (1 − q)^(k−m)
          = (n choose m) (pq)^m ∑_{k=m}^{n} (n−m choose k−m) (p(1 − q))^(k−m) (1 − p)^(n−k) .      (8)
Substituting i = k − m, this becomes:

Pr(Y = m) = (n choose m) (pq)^m ∑_{i=0}^{n−m} (n−m choose i) (p − pq)^i (1 − p)^(n−m−i) . (9)
According to the binomial theorem
(x + y)^n = ∑_{k=0}^{n} (n choose k) x^(n−k) y^k , (10)
the sum in equation (9) is equal to
∑_{i=0}^{n−m} (n−m choose i) (p − pq)^i (1 − p)^(n−m−i) = ((p − pq) + (1 − p))^(n−m) . (11)
Thus, (9) develops into
Pr(Y = m) = (n choose m) (pq)^m (p − pq + 1 − p)^(n−m)
          = (n choose m) (pq)^m (1 − pq)^(n−m)      (12)
which is the probability mass function of the binomial distribution (→ II/1.3.2) with parameters n and pq, such that

Y ∼ Bin(n, p · q) . (13)

■
Sources:
• Wikipedia (2022): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
10-07; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Conditional_binomials.
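The compounding result can be checked exhaustively for small n; a Python sketch with hypothetical parameters:

```python
from math import comb

# Compounding Bin(X, q) over X ~ Bin(n, p): the marginal of Y computed
# via the sum in eq. (5)-(6) must equal the Bin(n, pq) pmf of eq. (12).
n, p, q = 6, 0.5, 0.4                                  # toy parameters

def binom_pmf(x, n_, p_):
    return comb(n_, x) * p_ ** x * (1 - p_) ** (n_ - x)

for m in range(n + 1):
    marginal = sum(binom_pmf(m, k, q) * binom_pmf(k, n, p)
                   for k in range(m, n + 1))
    assert abs(marginal - binom_pmf(m, n, p * q)) < 1e-12
```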
p ∼ Bet(α, β) (1)
and let X be a random variable (→ I/1.2.2) following a binomial distribution (→ II/1.3.1) conditional
on p
X | p ∼ Bin(n, p) . (2)
Then, the marginal distribution (→ I/1.5.3) of X is called a beta-binomial distribution
X ∼ BetBin(n, α, β) (3)
with number of trials (→ II/1.3.1) n and shape parameters (→ II/3.9.1) α and β.
Sources:
• Wikipedia (2022): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-10-20; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#Motivation_and_
derivation.
X ∼ BetBin(n, α, β) . (1)
Then, the probability mass function (→ I/1.6.1) of X is
fX (x) = (n choose x) · B(α + x, β + n − x) / B(α, β) (2)
where B(x, y) is the beta function.
X | p ∼ Bin(n, p)
p ∼ Bet(α, β) .      (3)
Thus, we can combine the law of marginal probability (→ I/1.3.3) and the law of conditional prob-
ability (→ I/1.3.4) to derive the probability (→ I/1.3.1) of X as
p(x) = ∫_P p(x, p) dp
     = ∫_P p(x|p) p(p) dp .      (4)
Now, we can plug in the probability mass function of the binomial distribution (→ II/1.3.2) and the
probability density function of the beta distribution (→ II/3.9.3) to get
p(x) = ∫_0^1 (n choose x) p^x (1 − p)^(n−x) · 1/B(α, β) · p^(α−1) (1 − p)^(β−1) dp
     = (n choose x) · 1/B(α, β) · ∫_0^1 p^(α+x−1) (1 − p)^(β+n−x−1) dp      (5)
     = (n choose x) · B(α + x, β + n − x)/B(α, β) · ∫_0^1 1/B(α + x, β + n − x) · p^(α+x−1) (1 − p)^(β+n−x−1) dp .
Finally, we recognize that the integrand is equal to the probability density function of a beta distri-
bution (→ II/3.9.3) and because probability density integrates to one (→ I/1.7.1), we have
p(x) = (n choose x) · B(α + x, β + n − x) / B(α, β) = fX (x) . (6)
This completes the proof.
Sources:
• Wikipedia (2022): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-10-20; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#As_a_compound_
distribution.
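The resulting pmf can be evaluated numerically; a Python sketch with hypothetical parameters, using B(x, y) = Γ(x) Γ(y)/Γ(x + y):

```python
from math import comb, gamma

# Beta-binomial pmf f_X(x) = C(n, x) B(alpha + x, beta + n - x) / B(alpha, beta);
# the probabilities across x = 0, ..., n must sum to one.
def beta_fn(x, y):
    return gamma(x) * gamma(y) / gamma(x + y)

n, alpha, beta = 10, 2.0, 3.0                          # toy parameters
pmf = [comb(n, x) * beta_fn(alpha + x, beta + n - x) / beta_fn(alpha, beta)
       for x in range(n + 1)]
assert abs(sum(pmf) - 1.0) < 1e-9
```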
X ∼ BetBin(n, α, β) . (1)
Then, the probability mass function (→ I/1.6.1) of X can be expressed as
Proof: The probability mass function of the beta-binomial distribution (→ II/1.4.2) is given by
fX (x) = (n choose x) · B(α + x, β + n − x) / B(α, β) . (3)
Note that the binomial coefficient can be expressed in terms of factorials
(n choose x) = n! / (x! (n − x)!) , (4)
that factorials are related to the gamma function via n! = Γ(n + 1)
n! / (x! (n − x)!) = Γ(n + 1) / (Γ(x + 1) Γ(n − x + 1)) (5)
and that the beta function is related to the gamma function via
B(α, β) = Γ(α) Γ(β) / Γ(α + β) . (6)
Applying (4), (5) and (6) to (3), we get

fX (x) = Γ(n + 1) / (Γ(x + 1) Γ(n − x + 1)) · [Γ(α + x) Γ(β + n − x) / Γ(α + β + n)] · [Γ(α + β) / (Γ(α) Γ(β))] .
Sources:
• Wikipedia (2022): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-10-20; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#As_a_compound_
distribution.
X ∼ BetBin(n, α, β) . (1)
Then, the cumulative distribution function (→ I/1.8.1) of X is
FX (x) = 1/B(α, β) · Γ(n + 1)/Γ(α + β + n) · ∑_{i=0}^{x} [Γ(α + i) · Γ(β + n − i)] / [Γ(i + 1) · Γ(n − i + 1)] (2)
where B(x, y) is the beta function and Γ(x) is the gamma function.
With the probability mass function of the beta-binomial distribution (→ II/1.4.2), this becomes
FX (x) = ∑_{i=0}^{x} (n choose i) · B(α + i, β + n − i) / B(α, β) . (5)
Using the expression of binomial coefficients in terms of factorials
(n choose k) = n! / (k! (n − k)!) , (6)
the relationship between factorials and the gamma function
n! = Γ(n + 1) (7)
and the link between gamma function and beta function
B(α, β) = Γ(α) Γ(β) / Γ(α + β) , (8)
equation (5) can be further developed as follows:
FX (x) = 1/B(α, β) · ∑_{i=0}^{x} n!/(i! (n − i)!) · B(α + i, β + n − i)
       = 1/B(α, β) · ∑_{i=0}^{x} n!/(i! (n − i)!) · [Γ(α + i) · Γ(β + n − i)] / Γ(α + β + n)
       = 1/B(α, β) · n!/Γ(α + β + n) · ∑_{i=0}^{x} [Γ(α + i) · Γ(β + n − i)] / (i! (n − i)!)      (9)
       = 1/B(α, β) · Γ(n + 1)/Γ(α + β + n) · ∑_{i=0}^{x} [Γ(α + i) · Γ(β + n − i)] / [Γ(i + 1) · Γ(n − i + 1)] .
X ∼ Poiss(λ) , (1)
if and only if its probability mass function (→ I/1.6.1) is given by
Poiss(x; λ) = λ^x e^(−λ) / x! (2)
where x ∈ N0 and λ > 0.
Sources:
• Wikipedia (2020): “Poisson distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-25; URL: https://en.wikipedia.org/wiki/Poisson_distribution#Definitions.
X ∼ Poiss(λ) . (1)
Then, the probability mass function (→ I/1.6.1) of X is
fX (x) = λ^x e^(−λ) / x! , x ∈ N0 . (2)
Proof: This follows directly from the definition of the Poisson distribution (→ II/1.5.1).
1.5.3 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following a Poisson distribution (→ II/1.5.1):
X ∼ Poiss(λ) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
E(X) = λ . (2)
Proof: The expected value (→ I/1.10.1) is the probability-weighted average of all possible values

E(X) = ∑_{x∈X} x · fX (x) , (3)

such that, with the probability mass function of the Poisson distribution (→ II/1.5.2), we have:
E(X) = ∑_{x=0}^{∞} x · λ^x e^(−λ) / x!
     = ∑_{x=1}^{∞} x · λ^x e^(−λ) / x!
     = e^(−λ) · ∑_{x=1}^{∞} x · λ^x / x!      (4)
     = λ e^(−λ) · ∑_{x=1}^{∞} λ^(x−1) / (x − 1)! .
With the power series of the exponential function, ∑_{x=1}^{∞} λ^(x−1) / (x − 1)! = e^λ , this becomes:

E(X) = λ e^(−λ) · e^λ = λ . (7)
Sources:
• ProofWiki (2020): “Expectation of Poisson Distribution”; in: ProofWiki, retrieved on 2020-08-19;
URL: https://proofwiki.org/wiki/Expectation_of_Poisson_Distribution.
1.5.4 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following a Poisson distribution (→ II/1.5.1):
X ∼ Poiss(λ) . (1)
Then, the variance (→ I/1.11.1) of X is
Var(X) = λ . (2)
Proof: The variance (→ I/1.11.1) can be expressed in terms of expected values (→ I/1.11.3) as

Var(X) = E(X²) − E(X)² = E[X (X − 1)] + E(X) − E(X)² . (3)

The mean of the Poisson distribution (→ II/1.5.3) is

E(X) = λ . (4)
Let us now consider the expectation (→ I/1.10.1) of X (X − 1) which is defined as
E[X (X − 1)] = ∑_{x∈X} x (x − 1) · fX (x) , (5)
such that, with the probability mass function of the Poisson distribution (→ II/1.5.2), we have:
E[X (X − 1)] = ∑_{x=0}^{∞} x (x − 1) · λ^x e^(−λ) / x!
             = ∑_{x=2}^{∞} x (x − 1) · λ^x e^(−λ) / x!
             = e^(−λ) · ∑_{x=2}^{∞} x (x − 1) · λ^x / [x · (x − 1) · (x − 2)!]      (6)
             = λ² e^(−λ) · ∑_{x=2}^{∞} λ^(x−2) / (x − 2)! .
The remaining sum is again the power series of e^λ , such that E[X (X − 1)] = λ² . Together with E(X) = λ, the variance becomes

Var(X) = λ² + λ − λ² = λ . (12)
■
Sources:
• jbstatistics (2013): “The Poisson Distribution: Mathematically Deriving the Mean and Variance”;
in: YouTube, retrieved on 2021-04-29; URL: https://www.youtube.com/watch?v=65n_v92JZeE.
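Both results can be checked by truncated summation of the Poisson pmf; a Python sketch with a hypothetical rate, where the truncation at x = 60 leaves a negligible tail for the chosen λ:

```python
from math import exp, factorial

# Mean and variance of Poiss(lambda) by (truncated) probability-weighted
# sums, compared with the closed forms E(X) = Var(X) = lambda.
lam = 3.5                                              # toy rate parameter
pmf = [lam ** x * exp(-lam) / factorial(x) for x in range(60)]
mean = sum(x * f for x, f in enumerate(pmf))
var = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))
assert abs(mean - lam) < 1e-9
assert abs(var - lam) < 1e-9
```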
X ∼ Cat([p1 , . . . , pk ]) , (1)
if X = ei with probability (→ I/1.3.1) pi for all i = 1, . . . , k, where ei is the i-th elementary row
vector, i.e. a 1 × k vector of zeros with a one in i-th position.
Sources:
• Wikipedia (2020): “Categorical distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-03-22; URL: https://en.wikipedia.org/wiki/Categorical_distribution.
X ∼ Cat([p1 , . . . , pk ]) . (1)
Then, the probability mass function (→ I/1.6.1) of X is
fX (x) = { p1 , if x = e1 ; . . . ; pk , if x = ek } . (2)
Proof: This follows directly from the definition of the categorical distribution (→ II/2.1.1).
2.1.3 Mean
Theorem: Let X be a random vector (→ I/1.2.3) following a categorical distribution (→ II/2.1.1):
X ∼ Cat([p1 , . . . , pk ]) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is

E(X) = [p1 , . . . , pk ] . (2)

Proof: The expected value (→ I/1.10.1) is the probability-weighted average of all possible values:
E(X) = ∑_{x∈X} x · Pr(X = x)
     = ∑_{i=1}^{k} ei · Pr(X = ei )
     = ∑_{i=1}^{k} ei · pi      (3)
     = [p1 , . . . , pk ] .
■
2.1.4 Covariance
Theorem: Let X be a random vector (→ I/1.2.3) following a categorical distribution (→ II/2.1.1):
X ∼ Cat([p1 , . . . , pk ]) . (1)
Then, the covariance matrix (→ I/1.13.9) of X is

Cov(X) = diag(p) − p^T p (2)

where p is the 1 × k vector [p1 , . . . , pk ].
Proof: The categorical distribution (→ II/2.1.1) is a special case of the multinomial distribution (→
II/2.2.1) in which n = 1:
X ∼ Cat(p) . (1)
Then, the (Shannon) entropy (→ I/2.1.1) of X is
H(X) = − ∑_{i=1}^{k} pi · log pi . (2)
Proof: The entropy (→ I/2.1.1) is defined as the probability-weighted average of the logarithmized
probabilities for all possible values:
H(X) = − ∑_{x∈X} p(x) · log_b p(x) . (3)
Since there are k possible values for a categorical random vector (→ II/2.1.1) with probabilities given by the entries (→ II/2.1.2) of the 1 × k vector p, we have:

H(X) = − ∑_{i=1}^{k} Pr(X = ei ) · log Pr(X = ei ) = − ∑_{i=1}^{k} pi · log pi . (4)
Sources:
• Wikipedia (2020): “Multinomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-03-22; URL: https://en.wikipedia.org/wiki/Multinomial_distribution.
∏_{i=1}^{k} pi^xi . (3)
When order does not matter, there is a number of series consisting of x1 observations for category
1, ..., xk observations for category k. This number is equal to the number of possibilities in which x1
category 1 objects, ..., xk category k objects can be distributed in a sequence of n objects which is
given by the multinomial coefficient that can be expressed in terms of factorials:
(n choose x1 , . . . , xk ) = n! / (x1 ! · . . . · xk !) . (4)
In order to obtain the probability of x1 observations for category 1, ..., xk observations for category
k, when order does not matter, the probability in (3) has to be multiplied with the number of
possibilities in (4) which gives
p(X = x|n, [p1 , . . . , pk ]) = (n choose x1 , . . . , xk ) ∏_{i=1}^{k} pi^xi (5)
2.2.3 Mean
Theorem: Let X be a random vector (→ I/1.2.3) following a multinomial distribution (→ II/2.2.1):

X ∼ Mult(n, [p1 , . . . , pk ]) . (1)

Then, the mean or expected value (→ I/1.10.1) of X is

E(X) = [np1 , . . . , npk ] . (2)
Proof: By definition, a multinomial random variable (→ II/2.2.1) is the sum of n independent and identical categorical trials (→ II/2.1.1) with category probabilities p1 , . . . , pk . Therefore, the expected value is

E(X) = E(X1 + . . . + Xn ) = E(X1 ) + . . . + E(Xn ) = n · [p1 , . . . , pk ] = [np1 , . . . , npk ] ,

using the expected value of the categorical distribution (→ II/2.1.3).
■
2.2.4 Covariance
Theorem: Let X be a random vector (→ I/1.2.3) following a multinomial distribution (→ II/2.2.1):

X ∼ Mult(n, [p1 , . . . , pk ]) . (1)

Then, the entries of the covariance matrix (→ I/1.13.9) of X are

Cov(Xi , Xi ) = npi (1 − pi ) and Cov(Xi , Xj ) = −npi pj for i ̸= j . (2)
Proof: We first observe that the sample space (→ I/1.1.2) of each coordinate Xi is {0, 1, . . . , n} and
Xi is the sum of independent draws of category i, which is drawn with probability pi . Thus each
coordinate follows a binomial distribution (→ II/1.3.1):
Xi ∼ Bin(n, pi ), i = 1, . . . , k , (3)
which has the variance (→ II/1.3.5) Var(Xi ) = npi (1 − pi ) = n(pi − pi ²), constituting the elements of the main diagonal in Cov(X) in (2). To prove Cov(Xi , Xj ) = −npi pj for i ̸= j (which constitutes the off-diagonal elements of the covariance matrix), we first recognize that
Xi = ∑_{k=1}^{n} Ii (k), with Ii (k) = { 1 if k-th draw was of category i ; 0 otherwise } , (4)
where the indicator function Ii is a Bernoulli-distributed (→ II/1.2.1) random variable with the
expected value (→ II/1.2.3) pi . Then, we have
Cov(Xi , Xj ) = Cov(∑_{k=1}^{n} Ii (k), ∑_{l=1}^{n} Ij (l))
             = ∑_{k=1}^{n} ∑_{l=1}^{n} Cov(Ii (k), Ij (l))
             = ∑_{k=1}^{n} Cov(Ii (k), Ij (k)) + ∑_{k=1}^{n} ∑_{l≠k} Cov(Ii (k), Ij (l))      (5)
             = ∑_{k=1}^{n} [E(Ii (k) Ij (k)) − E(Ii (k)) E(Ij (k))]
             = − ∑_{k=1}^{n} E(Ii (k)) E(Ij (k))
             = −npi pj ,

where Cov(Ii (k), Ij (l)) = 0 for l ≠ k, because different draws are independent, and E(Ii (k) Ij (k)) = 0 for i ≠ j, because a single draw cannot belong to two categories at once,
as desired.
Sources:
• Tutz (2012): “Regression for Categorical Data”, pp. 209ff..
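Both the diagonal and off-diagonal entries can be checked exactly for a small n by enumerating the full support; a Python sketch with hypothetical parameters:

```python
from math import factorial
from itertools import product

# Exact check of E(X_i) = n p_i and Cov(X_0, X_1) = -n p_0 p_1 for a
# multinomial distribution, enumerating all count vectors (toy parameters).
n, p = 4, [0.2, 0.3, 0.5]
k = len(p)

def pmf(x):
    coef = factorial(n)
    for xi in x:
        coef //= factorial(xi)          # multinomial coefficient
    prob = float(coef)
    for xi, pi in zip(x, p):
        prob *= pi ** xi
    return prob

support = [x for x in product(range(n + 1), repeat=k) if sum(x) == n]
mean = [sum(x[i] * pmf(x) for x in support) for i in range(k)]
cov01 = sum((x[0] - mean[0]) * (x[1] - mean[1]) * pmf(x) for x in support)

assert all(abs(mean[i] - n * p[i]) < 1e-12 for i in range(k))
assert abs(cov01 - (-n * p[0] * p[1])) < 1e-12
```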
X ∼ Mult(n, p) . (1)
Then, the (Shannon) entropy (→ I/2.1.1) of X is

H(X) = −Elmc (n, p) + n · Hcat (p) (2)

where Hcat (p) is the entropy (→ II/2.1.5) of the categorical distribution (→ II/2.1.1)

Hcat (p) = − ∑_{i=1}^{k} pi · log pi (3)

and Elmc (n, p) is the expected value (→ I/1.10.1) of the logarithmized multinomial coefficient (→ II/2.2.2) with superset size n

Elmc (n, p) = E[log (n choose X1 , . . . , Xk )] where X ∼ Mult(n, p) . (4)
Proof: The entropy (→ I/2.1.1) is defined as the probability-weighted average of the logarithmized
probabilities for all possible values:
H(X) = − ∑_{x∈X} p(x) · log_b p(x) . (5)
H(X) = − ∑_{x∈X_{n,k}} fX (x) · log fX (x)
     = − ∑_{x∈X_{n,k}} fX (x) · log [(n choose x1 , . . . , xk ) ∏_{i=1}^{k} pi^xi]      (7)
     = − ∑_{x∈X_{n,k}} fX (x) · [log (n choose x1 , . . . , xk ) + ∑_{i=1}^{k} xi · log pi] .
Since the first factor in the sum corresponds to the probability mass (→ I/1.6.1) of X = x, we can
rewrite this as the sum of the expected values (→ I/1.10.1) of the functions (→ I/1.10.12) of the
discrete random variable (→ I/1.2.6) x in the square bracket:
H(X) = − ⟨log (n choose x1 , . . . , xk )⟩_p(x) − ⟨∑_{i=1}^{k} xi · log pi⟩_p(x)
     = − ⟨log (n choose x1 , . . . , xk )⟩_p(x) − ∑_{i=1}^{k} ⟨xi · log pi⟩_p(x) .      (8)
Using the expected value of the multinomial distribution (→ II/2.2.3), i.e. X ∼ Mult(n, p) ⇒ ⟨xi ⟩ =
npi , this gives:
H(X) = − ⟨log (n choose x1 , . . . , xk )⟩_p(x) − ∑_{i=1}^{k} npi · log pi
     = − ⟨log (n choose x1 , . . . , xk )⟩_p(x) − n ∑_{i=1}^{k} pi · log pi .      (9)
Finally, we note that the first term is the negative expected value (→ I/1.10.1) of the logarithm of
a multinomial coefficient (→ II/2.2.2) and that the second term is the entropy of the categorical
distribution (→ II/2.1.5), such that we finally get:
■
X ∼ U (a, b) , (1)
if and only if each value between and including a and b occurs with the same probability.
Sources:
• Wikipedia (2020): “Uniform distribution (continuous)”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-01-27; URL: https://en.wikipedia.org/wiki/Uniform_distribution_(continuous).
X ∼ U (0, 1) . (1)
Sources:
• Wikipedia (2021): “Continuous uniform distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2021-07-23; URL: https://en.wikipedia.org/wiki/Continuous_uniform_distribution#
Standard_uniform.
X ∼ U (a, b) . (1)
Then, the probability density function (→ I/1.7.1) of X is
f_X(x) = \begin{cases} \frac{1}{b-a} , & \text{if } a \le x \le b \\ 0 , & \text{otherwise} . \end{cases} \quad (2)
Proof: A continuous uniform variable is defined as (→ II/3.1.1) having a constant probability density
between minimum a and maximum b. Therefore,
f_X(x) = \frac{1}{c(a,b)} \quad \text{for all} \quad x \in [a, b] \quad (4)
where the normalization factor c(a, b) is specified, such that
\int_a^b \frac{1}{c(a,b)} \, \mathrm{d}x = 1 . \quad (5)
Solving this for c(a, b), we obtain:
\int_a^b 1 \, \mathrm{d}x = c(a,b)
[x]_a^b = c(a,b) \quad (6)
c(a,b) = b - a .
X ∼ U (a, b) . (1)
Then, the cumulative distribution function (→ I/1.8.1) of X is
F_X(x) = \begin{cases} 0 , & \text{if } x < a \\ \frac{x-a}{b-a} , & \text{if } a \le x \le b \\ 1 , & \text{if } x > b . \end{cases} \quad (2)
Proof: The probability density function of the continuous uniform distribution (→ II/3.1.3) is:
\mathcal{U}(x; a, b) = \begin{cases} \frac{1}{b-a} , & \text{if } a \le x \le b \\ 0 , & \text{otherwise} . \end{cases} \quad (3)
Thus, if a \le x \le b, we have:

F_X(x) = \int_{-\infty}^{a} \mathcal{U}(z; a, b) \, \mathrm{d}z + \int_{a}^{x} \mathcal{U}(z; a, b) \, \mathrm{d}z
= \int_{-\infty}^{a} 0 \, \mathrm{d}z + \int_{a}^{x} \frac{1}{b-a} \, \mathrm{d}z
= 0 + \frac{1}{b-a} \, [z]_a^x \quad (6)
= \frac{x-a}{b-a} .
Finally, if x > b, we have
F_X(x) = \int_{-\infty}^{b} \mathcal{U}(z; a, b) \, \mathrm{d}z + \int_{b}^{x} \mathcal{U}(z; a, b) \, \mathrm{d}z
= F_X(b) + \int_{b}^{x} 0 \, \mathrm{d}z
= \frac{b-a}{b-a} + 0 \quad (7)
= 1 .
X ∼ U (a, b) . (1)
Then, the quantile function (→ I/1.9.1) of X is
Q_X(p) = \begin{cases} -\infty , & \text{if } p = 0 \\ bp + a(1-p) , & \text{if } p > 0 . \end{cases} \quad (2)
Proof: The cumulative distribution function of the continuous uniform distribution (→ II/3.1.4) is:
F_X(x) = \begin{cases} 0 , & \text{if } x < a \\ \frac{x-a}{b-a} , & \text{if } a \le x \le b \\ 1 , & \text{if } x > b . \end{cases} \quad (3)
The quantile function (→ I/1.9.1) QX (p) is defined as the smallest x, such that FX (x) = p:
p = \frac{x-a}{b-a}
x = p(b-a) + a \quad (6)
x = bp + a(1-p) .
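The quantile function just derived doubles as an inverse-transform sampler: plugging standard uniform draws into Q_X yields U(a, b) samples. A minimal sketch in Python (the function names are my own, not from the text):

```python
import random

def uniform_quantile(p, a, b):
    # Q_X(p) = bp + a(1 - p): linear interpolation between a and b
    return b * p + a * (1 - p)

def uniform_cdf(x, a, b):
    # F_X(x) = (x - a) / (b - a) for a <= x <= b
    return (x - a) / (b - a)

# The quantile function inverts the CDF on (0, 1)
assert abs(uniform_cdf(uniform_quantile(0.3, 2.0, 5.0), 2.0, 5.0) - 0.3) < 1e-12

# Inverse transform sampling: Q_X(U) with U ~ U(0, 1) gives X ~ U(a, b)
random.seed(1)
draws = [uniform_quantile(random.random(), 2.0, 5.0) for _ in range(10000)]
assert 2.0 <= min(draws) and max(draws) <= 5.0
```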
3.1.6 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following a continuous uniform distribution (→
II/3.1.1):
X ∼ U (a, b) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
\mathrm{E}(X) = \frac{1}{2} (a + b) . \quad (2)
Proof: The expected value (→ I/1.10.1) is the probability-weighted average over all possible values:
\mathrm{E}(X) = \int_{\mathcal{X}} x \cdot f_X(x) \, \mathrm{d}x . \quad (3)
With the probability density function of the continuous uniform distribution (→ II/3.1.3), this becomes:
\mathrm{E}(X) = \int_a^b x \cdot \frac{1}{b-a} \, \mathrm{d}x
= \frac{1}{2} \left[ \frac{x^2}{b-a} \right]_a^b
= \frac{1}{2} \cdot \frac{b^2 - a^2}{b-a} \quad (4)
= \frac{1}{2} \cdot \frac{(b+a)(b-a)}{b-a}
= \frac{1}{2} (a + b) .
■
3.1.7 Median
Theorem: Let X be a random variable (→ I/1.2.2) following a continuous uniform distribution (→
II/3.1.1):
X ∼ U (a, b) . (1)
Then, the median (→ I/1.15.1) of X is
\mathrm{median}(X) = \frac{1}{2} (a + b) . \quad (2)
Proof: The median (→ I/1.15.1) is the value at which the cumulative distribution function (→
I/1.8.1) is 1/2:
F_X(\mathrm{median}(X)) = \frac{1}{2} . \quad (3)
The cumulative distribution function of the continuous uniform distribution (→ II/3.1.4) is
F_X(x) = \begin{cases} 0 , & \text{if } x < a \\ \frac{x-a}{b-a} , & \text{if } a \le x \le b \\ 1 , & \text{if } x > b . \end{cases} \quad (4)
Its inverse, the quantile function of the continuous uniform distribution (→ II/3.1.5), is

x = bp + a(1-p) . \quad (5)
Setting p = 1/2, we obtain:
\mathrm{median}(X) = b \cdot \frac{1}{2} + a \cdot \left( 1 - \frac{1}{2} \right) = \frac{1}{2} (a + b) . \quad (6)
■
3.1.8 Mode
Theorem: Let X be a random variable (→ I/1.2.2) following a continuous uniform distribution (→
II/3.1.1):
X ∼ U (a, b) . (1)
Then, the mode (→ I/1.15.2) of X is any value in the interval [a, b]:

\mathrm{mode}(X) \in [a, b] . \quad (2)

Proof: The mode (→ I/1.15.2) is the value which maximizes the probability density function (→ I/1.7.1).
The probability density function of the continuous uniform distribution (→ II/3.1.3) is:
3. UNIVARIATE CONTINUOUS DISTRIBUTIONS 173
f_X(x) = \begin{cases} \frac{1}{b-a} , & \text{if } a \le x \le b \\ 0 , & \text{otherwise} . \end{cases} \quad (4)
Since the PDF attains its only non-zero value whenever a ≤ x ≤ b,
\max_x f_X(x) = \frac{1}{b-a} , \quad (5)
any value in the interval [a, b] may be considered the mode of X.
3.1.9 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following a continuous uniform distribution (→
II/3.1.1):
X ∼ U (a, b) . (1)
Then, the variance (→ I/1.11.1) of X is
\mathrm{Var}(X) = \frac{1}{12} (b - a)^2 . \quad (2)
Proof: The variance (→ I/1.11.1) is the probability-weighted average of the squared deviation from
the mean (→ I/1.10.1):
\mathrm{Var}(X) = \int_{\mathbb{R}} (x - \mathrm{E}(X))^2 \cdot f_X(x) \, \mathrm{d}x . \quad (3)
With the expected value (→ II/3.1.6) and probability density function (→ II/3.1.3) of the continuous
uniform distribution, this reads:
\mathrm{Var}(X) = \int_a^b \left( x - \frac{1}{2}(a+b) \right)^2 \cdot \frac{1}{b-a} \, \mathrm{d}x
= \frac{1}{b-a} \int_a^b \left( x - \frac{a+b}{2} \right)^2 \mathrm{d}x
= \frac{1}{b-a} \left[ \frac{1}{3} \left( x - \frac{a+b}{2} \right)^3 \right]_a^b
= \frac{1}{3(b-a)} \left[ \left( \frac{2x - (a+b)}{2} \right)^3 \right]_a^b
= \frac{1}{3(b-a)} \cdot \frac{1}{8} \left[ (2x - a - b)^3 \right]_a^b \quad (4)
= \frac{1}{24(b-a)} \left( (2b - a - b)^3 - (2a - a - b)^3 \right)
= \frac{1}{24(b-a)} \left( (b-a)^3 - (a-b)^3 \right)
= \frac{1}{24(b-a)} \left( (b-a)^3 + (-1)^3 (a-b)^3 \right)
= \frac{1}{24(b-a)} \left( (b-a)^3 + (b-a)^3 \right)
= \frac{2}{24(b-a)} (b-a)^3
= \frac{1}{12} (b-a)^2 .
■
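Both closed forms — mean (a + b)/2 and variance (b − a)²/12 — are easy to confirm by simulation; a quick sketch using only the Python standard library:

```python
import random
import statistics

random.seed(42)
a, b = 2.0, 5.0
draws = [random.uniform(a, b) for _ in range(200000)]

mean_mc = statistics.fmean(draws)      # should approach (a + b) / 2 = 3.5
var_mc = statistics.pvariance(draws)   # should approach (b - a)^2 / 12 = 0.75

assert abs(mean_mc - (a + b) / 2) < 0.02
assert abs(var_mc - (b - a) ** 2 / 12) < 0.02
```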
X ∼ U (a, b) . (1)
Then, the differential entropy (→ I/2.2.1) of X in nats is

h(X) = \ln(b - a) . \quad (2)

Proof: The differential entropy (→ I/2.2.1) of a continuous random variable is defined as

h(X) = -\int_{\mathcal{X}} p(x) \ln p(x) \, \mathrm{d}x . \quad (4)
With the probability density function of the continuous uniform distribution (→ II/3.1.3), the differential entropy of X is:
h(X) = -\int_a^b \frac{1}{b-a} \ln \frac{1}{b-a} \, \mathrm{d}x
= \frac{1}{b-a} \int_a^b \ln(b-a) \, \mathrm{d}x
= \frac{1}{b-a} \left[ x \cdot \ln(b-a) \right]_a^b \quad (5)
= \frac{1}{b-a} \left[ b \cdot \ln(b-a) - a \cdot \ln(b-a) \right]
= \frac{1}{b-a} (b-a) \ln(b-a)
= \ln(b-a) .
P : X ∼ U (a1 , b1 )
(1)
Q : X ∼ U (a2 , b2 ) .
\mathrm{KL}[P \,||\, Q] = \int_{-\infty}^{a_1} 0 \cdot \ln \frac{0}{q(x)} \, \mathrm{d}x + \int_{a_1}^{b_1} p(x) \ln \frac{p(x)}{q(x)} \, \mathrm{d}x + \int_{b_1}^{+\infty} 0 \cdot \ln \frac{0}{q(x)} \, \mathrm{d}x . \quad (5)
Now, (0 · ln 0) is taken to be zero by convention (→ I/2.1.1), such that
\mathrm{KL}[P \,||\, Q] = \int_{a_1}^{b_1} p(x) \ln \frac{p(x)}{q(x)} \, \mathrm{d}x \quad (6)
and we can use the probability density function of the continuous uniform distribution (→ II/3.1.3)
to evaluate:
\mathrm{KL}[P \,||\, Q] = \int_{a_1}^{b_1} \frac{1}{b_1 - a_1} \ln \frac{1/(b_1 - a_1)}{1/(b_2 - a_2)} \, \mathrm{d}x
= \frac{1}{b_1 - a_1} \ln \frac{b_2 - a_2}{b_1 - a_1} \int_{a_1}^{b_1} \mathrm{d}x
= \frac{1}{b_1 - a_1} \ln \frac{b_2 - a_2}{b_1 - a_1} \, [x]_{a_1}^{b_1} \quad (7)
= \frac{1}{b_1 - a_1} \ln \frac{b_2 - a_2}{b_1 - a_1} \, (b_1 - a_1)
= \ln \frac{b_2 - a_2}{b_1 - a_1} .
■
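Because both densities are constant on the support of P, the log-ratio inside the KL integral is constant, and the closed form ln[(b₂ − a₂)/(b₁ − a₁)] is easy to confirm numerically. A sketch, assuming [a₁, b₁] ⊆ [a₂, b₂] as the derivation requires:

```python
import math
import random

a1, b1 = 1.0, 2.0   # P = U(1, 2)
a2, b2 = 0.0, 4.0   # Q = U(0, 4); [a1, b1] lies inside [a2, b2]

def pdf(x, a, b):
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Monte Carlo estimate of E_P[ln p(X)/q(X)]
random.seed(0)
xs = [random.uniform(a1, b1) for _ in range(1000)]
kl_mc = sum(math.log(pdf(x, a1, b1) / pdf(x, a2, b2)) for x in xs) / len(xs)

kl_closed = math.log((b2 - a2) / (b1 - a1))  # ln(4)
assert abs(kl_mc - kl_closed) < 1e-9  # exact up to rounding: the log-ratio is constant
```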
Proof: Without loss of generality, let us assume that the random variable X is in the following range:
a ≤ X ≤ b.
Let g(x) be the probability density function (→ I/1.7.1) of a continuous uniform distribution (→
II/3.1.1) with minimum a and maximum b and let f (x) be an arbitrary probability density function
(→ I/1.7.1) defined on the same support X = [a, b].
For a random variable (→ I/1.2.2) X with set of possible values X and probability density function
(→ I/1.7.1) p(x), the differential entropy (→ I/2.2.1) is defined as:
h(X) = -\int_{\mathcal{X}} p(x) \log p(x) \, \mathrm{d}x . \quad (1)
Consider the Kullback-Leibler divergence (→ I/2.5.1) of distribution f (x) from distribution g(x)
which is non-negative (→ I/2.5.2):
0 \le \mathrm{KL}[f \,||\, g] = \int_{\mathcal{X}} f(x) \log \frac{f(x)}{g(x)} \, \mathrm{d}x
= \int_{\mathcal{X}} f(x) \log f(x) \, \mathrm{d}x - \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x \quad (2)
\overset{(1)}{=} -h[f(x)] - \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x .
By plugging the probability density function of the continuous uniform distribution (→ II/3.1.3) into
the second term, we obtain:
\int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x = \int_{\mathcal{X}} f(x) \log \frac{1}{b-a} \, \mathrm{d}x
= \log \frac{1}{b-a} \int_{\mathcal{X}} f(x) \, \mathrm{d}x \quad (3)
= -\log(b-a) .
This is actually the negative of the differential entropy of the continuous uniform distribution (→
II/3.1.10), such that:
\int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x = -h[\mathcal{U}(a,b)] = -h[g(x)] . \quad (4)
Combining (2) with (4), we can show that
0 \le \mathrm{KL}[f \,||\, g]
0 \le -h[f(x)] - (-h[g(x)]) \quad (5)
h[g(x)] \ge h[f(x)]
which means that the differential entropy (→ I/2.2.1) of the continuous uniform distribution (→
II/3.1.1) U(a, b) will be larger than or equal to any other distribution (→ I/1.5.1) defined in the
same range.
■
Sources:
• Wikipedia (2023): “Maximum entropy probability distribution”; in: Wikipedia, the free encyclope-
dia, retrieved on 2023-08-25; URL: https://en.wikipedia.org/wiki/Maximum_entropy_probability_
distribution#Uniform_and_piecewise_uniform_distributions.
X ∼ N (µ, σ 2 ) , (1)
if and only if its probability density function (→ I/1.7.1) is given by

\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \quad (2)
where µ ∈ R and σ 2 > 0.
Sources:
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-27; URL: https://en.wikipedia.org/wiki/Normal_distribution.
Proof: The probability density function of the multivariate normal distribution (→ II/4.1.4) is
\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp \left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right] . \quad (1)
Setting n = 1, such that x, µ ∈ R, and Σ = σ 2 , we obtain
\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{(2\pi)^1 |\sigma^2|}} \cdot \exp \left[ -\frac{1}{2} (x-\mu)^\mathrm{T} (\sigma^2)^{-1} (x-\mu) \right]
= \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp \left[ -\frac{1}{2\sigma^2} (x-\mu)^2 \right] \quad (2)
= \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]
which is equivalent to the probability density function of the normal distribution (→ II/3.2.10).
Sources:
• Wikipedia (2022): “Multivariate normal distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2022-08-19; URL: https://en.wikipedia.org/wiki/Multivariate_normal_distribution.
X ∼ N (0, 1) . (1)
Sources:
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-26; URL: https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution.
X ∼ N (µ, σ 2 ) . (1)
Then, the quantity Z = (X − µ)/σ will have a standard normal distribution (→ II/3.2.3) with mean
0 and variance 1:
Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1) . \quad (2)
Proof: Note that Z = g(X) with the function g(x) = (x - \mu)/\sigma, whose inverse function is

X = g^{-1}(Z) = \sigma Z + \mu . \quad (4)
Because σ is positive, g(X) is strictly increasing and we can calculate the cumulative distribution
function of a strictly increasing function (→ I/1.8.3) as
F_Y(y) = \begin{cases} 0 , & \text{if } y < \min(\mathcal{Y}) \\ F_X(g^{-1}(y)) , & \text{if } y \in \mathcal{Y} \\ 1 , & \text{if } y > \max(\mathcal{Y}) . \end{cases} \quad (5)
The cumulative distribution function of the normally distributed (→ II/3.2.12) X is
F_X(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{t-\mu}{\sigma} \right)^2 \right] \mathrm{d}t . \quad (6)
Applying (5) to (6), we have:
F_Z(z) \overset{(5)}{=} F_X(g^{-1}(z)) \overset{(6)}{=} \int_{-\infty}^{\sigma z + \mu} \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{t-\mu}{\sigma} \right)^2 \right] \mathrm{d}t . \quad (7)
Substituting s = (t - \mu)/\sigma, such that t = \sigma s + \mu, we have:

F_Z(z) = \int_{(-\infty-\mu)/\sigma}^{([\sigma z + \mu]-\mu)/\sigma} \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{(\sigma s + \mu) - \mu}{\sigma} \right)^2 \right] \mathrm{d}(\sigma s + \mu)
= \frac{\sigma}{\sqrt{2\pi} \sigma} \int_{-\infty}^{z} \exp \left( -\frac{1}{2} s^2 \right) \mathrm{d}s \quad (8)
= \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \cdot \exp \left( -\frac{1}{2} s^2 \right) \mathrm{d}s
which is the cumulative distribution function (→ I/1.8.1) of the standard normal distribution (→
II/3.2.3).
■
X ∼ N (µ, σ 2 ) . (1)
Then, the quantity Z = (X − µ)/σ will have a standard normal distribution (→ II/3.2.3) with mean
0 and variance 1:
Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1) . \quad (2)

Proof: Note that Z = g(X) with the function g(x) = (x - \mu)/\sigma, whose inverse function is

X = g^{-1}(Z) = \sigma Z + \mu . \quad (4)
Because σ is positive, g(X) is strictly increasing and we can calculate the probability density function
of a strictly increasing function (→ I/1.7.3) as
f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \cdot \frac{\mathrm{d}g^{-1}(y)}{\mathrm{d}y} , & \text{if } y \in \mathcal{Y} \\ 0 , & \text{if } y \notin \mathcal{Y} \end{cases} \quad (5)
where \mathcal{Y} = \{ y = g(x) : x \in \mathcal{X} \}. With the probability density function of the normal distribution (→ II/3.2.10), we have

f_Z(z) = \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{g^{-1}(z)-\mu}{\sigma} \right)^2 \right] \cdot \frac{\mathrm{d}g^{-1}(z)}{\mathrm{d}z}
= \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{(\sigma z + \mu) - \mu}{\sigma} \right)^2 \right] \cdot \frac{\mathrm{d}(\sigma z + \mu)}{\mathrm{d}z} \quad (6)
= \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left( -\frac{1}{2} z^2 \right) \cdot \sigma
= \frac{1}{\sqrt{2\pi}} \cdot \exp \left( -\frac{1}{2} z^2 \right)
which is the probability density function (→ I/1.7.1) of the standard normal distribution (→ II/3.2.3).
■
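The standardization Z = (X − µ)/σ can also be checked empirically: transformed normal draws should have mean close to 0 and variance close to 1. A sketch:

```python
import random
import statistics

random.seed(7)
mu, sigma = 10.0, 3.0
xs = [random.gauss(mu, sigma) for _ in range(100000)]
zs = [(x - mu) / sigma for x in xs]

z_mean = statistics.fmean(zs)
z_var = statistics.pvariance(zs)

assert abs(z_mean) < 0.02       # mean of Z is approximately 0
assert abs(z_var - 1.0) < 0.02  # variance of Z is approximately 1
```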
X ∼ N (µ, σ 2 ) . (1)
Then, the quantity Z = (X − µ)/σ will have a standard normal distribution (→ II/3.2.3) with mean
0 and variance 1:
Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1) . \quad (2)
Proof: The linear transformation theorem for the multivariate normal distribution (→ II/4.1.12) states that, if x \sim \mathcal{N}(\mu, \Sigma), then y = Ax + b \sim \mathcal{N}(A\mu + b, A \Sigma A^\mathrm{T}). Applying this to X \sim \mathcal{N}(\mu, \sigma^2) with A = 1/\sigma and b = -\mu/\sigma, we obtain

Z = \frac{X}{\sigma} - \frac{\mu}{\sigma} \sim \mathcal{N} \left( \frac{\mu}{\sigma} - \frac{\mu}{\sigma}, \frac{\sigma^2}{\sigma^2} \right) = \mathcal{N}(0, 1) . \quad (6)
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad (2)
and the unbiased sample variance (→ I/1.11.2)
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 . \quad (3)
Then, the sampling distribution (→ I/1.5.5) of the sample variance is given by a chi-squared distribution (→ II/3.7.1) with n − 1 degrees of freedom:
V = (n-1) \frac{s^2}{\sigma^2} \sim \chi^2(n-1) . \quad (4)
Proof: Define the random variables

U_i = \frac{X_i - \mu}{\sigma} \quad (5)

which follow a standard normal distribution (→ II/3.2.4)

U_i \sim \mathcal{N}(0, 1) . \quad (6)
Then, the sum of squared random variables U_i can be rewritten as

\sum_{i=1}^{n} U_i^2 = \sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2
= \sum_{i=1}^{n} \left( \frac{(X_i - \bar{X}) + (\bar{X} - \mu)}{\sigma} \right)^2
= \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2} + \sum_{i=1}^{n} \frac{(\bar{X} - \mu)^2}{\sigma^2} + 2 \sum_{i=1}^{n} \frac{(X_i - \bar{X})(\bar{X} - \mu)}{\sigma^2} \quad (7)
= \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma} \right)^2 + \sum_{i=1}^{n} \left( \frac{\bar{X} - \mu}{\sigma} \right)^2 + 2 \, \frac{\bar{X} - \mu}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X}) .
Because the deviations from the sample mean sum to zero,

\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n \bar{X}
= \sum_{i=1}^{n} X_i - n \cdot \frac{1}{n} \sum_{i=1}^{n} X_i \quad (8)
= \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i
= 0 ,
the last term vanishes and we have

\sum_{i=1}^{n} U_i^2 = \sum_{i=1}^{n} \frac{\left( X_i - \bar{X} \right)^2}{\sigma^2} + \sum_{i=1}^{n} \frac{\left( \bar{X} - \mu \right)^2}{\sigma^2} . \quad (9)
Cochran’s theorem states that, if a sum of squared standard normal (→ II/3.2.3) random variables
(→ I/1.2.2) can be written as a sum of squared forms
\sum_{i=1}^{n} U_i^2 = \sum_{j=1}^{m} Q_j \quad \text{where} \quad Q_j = \sum_{k=1}^{n} \sum_{l=1}^{n} U_k B_{kl}^{(j)} U_l \quad \text{with} \quad \sum_{j=1}^{m} B^{(j)} = I_n , \quad (10)
then the terms Qj are independent (→ I/1.3.6) and each term Qj follows a chi-squared distribution
(→ II/3.7.1) with rj degrees of freedom:
Qj ∼ χ2 (rj ) . (11)
We observe that (9) can be represented as

\sum_{i=1}^{n} U_i^2 = \sum_{i=1}^{n} \frac{\left( X_i - \bar{X} \right)^2}{\sigma^2} + \sum_{i=1}^{n} \frac{\left( \bar{X} - \mu \right)^2}{\sigma^2}
= Q_1 + Q_2 = \sum_{i=1}^{n} \left( U_i - \frac{1}{n} \sum_{j=1}^{n} U_j \right)^2 + \frac{1}{n} \left( \sum_{i=1}^{n} U_i \right)^2 \quad (12)
where Q_1 and Q_2 are quadratic forms of rank n-1 and 1, respectively, such that Cochran's theorem implies

(n-1) \frac{s^2}{\sigma^2} = \sum_{i=1}^{n} \frac{\left( X_i - \bar{X} \right)^2}{\sigma^2} = Q_1 \sim \chi^2(n-1) . \quad (15)
■
Sources:
• Glen-b (2014): “Why is the sampling distribution of variance a chi-squared distribution?”; in:
StackExchange CrossValidated, retrieved on 2021-05-20; URL: https://stats.stackexchange.com/
questions/121662/why-is-the-sampling-distribution-of-variance-a-chi-squared-distribution.
• Wikipedia (2021): “Cochran’s theorem”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-
20; URL: https://en.wikipedia.org/wiki/Cochran%27s_theorem#Sample_mean_and_sample_
variance.
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad (2)
and the unbiased sample variance (→ I/1.11.2)

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 . \quad (3)
Then, subtracting \mu from the sample mean (→ I/1.10.1), dividing by the sample standard deviation (→ I/1.16.1) and multiplying with \sqrt{n} results in a quantity that follows a t-distribution (→ II/3.3.1) with n − 1 degrees of freedom:

t = \sqrt{n} \, \frac{\bar{X} - \mu}{s} \sim t(n-1) . \quad (4)
V = (n-1) \frac{s^2}{\sigma^2} \sim \chi^2(n-1) . \quad (8)
Observe that t is the ratio of a standard normal random variable (→ II/3.2.3) and the square root
of a chi-squared random variable (→ II/3.7.1), divided by its degrees of freedom:
t = \sqrt{n} \, \frac{\bar{X} - \mu}{s} = \frac{\sqrt{n} \, \frac{\bar{X} - \mu}{\sigma}}{\sqrt{s^2 / \sigma^2}} = \frac{Z}{\sqrt{V / (n-1)}} . \quad (9)
Thus, by definition of the t-distribution (→ II/3.3.1), this ratio follows a t-distribution with n − 1
degrees of freedom:
t ∼ t(n − 1) . (10)
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-05-27; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Characterization.
• Wikipedia (2021): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
05-27; URL: https://en.wikipedia.org/wiki/Normal_distribution#Operations_on_multiple_independent_
normal_variables.
Proof: Let
I = \int_0^\infty \exp(-x^2) \, \mathrm{d}x \quad (2)
and
I_P = \int_0^P \exp(-x^2) \, \mathrm{d}x = \int_0^P \exp(-y^2) \, \mathrm{d}y . \quad (3)
Then, we have
\lim_{P \to \infty} I_P = I \quad (4)
and
I_P^2 \overset{(3)}{=} \left( \int_0^P \exp(-x^2) \, \mathrm{d}x \right) \left( \int_0^P \exp(-y^2) \, \mathrm{d}y \right)
= \int_0^P \int_0^P \exp \left( -(x^2 + y^2) \right) \mathrm{d}x \, \mathrm{d}y \quad (6)
= \iint_{S_P} \exp \left( -(x^2 + y^2) \right) \mathrm{d}x \, \mathrm{d}y
where SP is the square with corners (0, 0), (0, P ), (P, P ) and (P, 0). For this integral, we can write
down the following inequality
\iint_{C_1} \exp \left( -(x^2 + y^2) \right) \mathrm{d}x \, \mathrm{d}y \le I_P^2 \le \iint_{C_2} \exp \left( -(x^2 + y^2) \right) \mathrm{d}x \, \mathrm{d}y \quad (7)
where C_1 and C_2 are the regions in the first quadrant bounded by circles with center at (0, 0) and going through the points (0, P) and (P, P), respectively. The radii of these two circles are r_1 = \sqrt{P^2} = P and r_2 = \sqrt{2P^2} = P\sqrt{2}, such that we can rewrite equation (7) using polar coordinates as
\int_0^{\pi/2} \int_0^{r_1} \exp(-r^2) \, r \, \mathrm{d}r \, \mathrm{d}\theta \le I_P^2 \le \int_0^{\pi/2} \int_0^{r_2} \exp(-r^2) \, r \, \mathrm{d}r \, \mathrm{d}\theta . \quad (8)
Solving the definite integrals yields:
\int_0^{\pi/2} \int_0^{r_1} \exp(-r^2) \, r \, \mathrm{d}r \, \mathrm{d}\theta \le I_P^2 \le \int_0^{\pi/2} \int_0^{r_2} \exp(-r^2) \, r \, \mathrm{d}r \, \mathrm{d}\theta
\int_0^{\pi/2} \left[ -\frac{1}{2} \exp(-r^2) \right]_0^{r_1} \mathrm{d}\theta \le I_P^2 \le \int_0^{\pi/2} \left[ -\frac{1}{2} \exp(-r^2) \right]_0^{r_2} \mathrm{d}\theta
-\frac{1}{2} \int_0^{\pi/2} \left( \exp(-r_1^2) - 1 \right) \mathrm{d}\theta \le I_P^2 \le -\frac{1}{2} \int_0^{\pi/2} \left( \exp(-r_2^2) - 1 \right) \mathrm{d}\theta \quad (9)
-\frac{1}{2} \left( \exp(-r_1^2) - 1 \right) \left[ \theta \right]_0^{\pi/2} \le I_P^2 \le -\frac{1}{2} \left( \exp(-r_2^2) - 1 \right) \left[ \theta \right]_0^{\pi/2}
\frac{\pi}{4} \left( 1 - \exp(-r_1^2) \right) \le I_P^2 \le \frac{\pi}{4} \left( 1 - \exp(-r_2^2) \right)
\frac{\pi}{4} \left( 1 - \exp(-P^2) \right) \le I_P^2 \le \frac{\pi}{4} \left( 1 - \exp(-2P^2) \right) .
Calculating the limit for P → ∞, we obtain
\lim_{P \to \infty} \frac{\pi}{4} \left( 1 - \exp(-P^2) \right) \le \lim_{P \to \infty} I_P^2 \le \lim_{P \to \infty} \frac{\pi}{4} \left( 1 - \exp(-2P^2) \right)
\frac{\pi}{4} \le I^2 \le \frac{\pi}{4} , \quad (10)
such that we have a preliminary result for I:
I^2 = \frac{\pi}{4} \quad \Rightarrow \quad I = \frac{\sqrt{\pi}}{2} . \quad (11)
Because the integrand in (1) is an even function, we can calculate the final result as follows:
\int_{-\infty}^{+\infty} \exp(-x^2) \, \mathrm{d}x = 2 \int_0^\infty \exp(-x^2) \, \mathrm{d}x \overset{(11)}{=} 2 \cdot \frac{\sqrt{\pi}}{2} = \sqrt{\pi} . \quad (12)
Sources:
• ProofWiki (2020): “Gaussian Integral”; in: ProofWiki, retrieved on 2020-11-25; URL: https://
proofwiki.org/wiki/Gaussian_Integral.
• ProofWiki (2020): “Integral to Infinity of Exponential of minus t squared”; in: ProofWiki, retrieved
on 2020-11-25; URL: https://proofwiki.org/wiki/Integral_to_Infinity_of_Exponential_of_-t%
5E2.
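The value √π can also be reproduced with a simple midpoint Riemann sum — the integrand decays so fast that truncating at |x| = 10 costs essentially nothing. A sketch:

```python
import math

# Midpoint rule for the Gaussian integral over [-10, 10]
n = 200000
lo, hi = -10.0, 10.0
dx = (hi - lo) / n
integral = sum(math.exp(-(lo + (i + 0.5) * dx) ** 2) for i in range(n)) * dx

assert abs(integral - math.sqrt(math.pi)) < 1e-6
```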
X ∼ N (µ, σ 2 ) . (1)
Proof: This follows directly from the definition of the normal distribution (→ II/3.2.1).
X ∼ N (µ, σ 2 ) . (1)
Then, the moment-generating function (→ I/1.9.5) of X is
M_X(t) = \exp \left( \mu t + \frac{1}{2} \sigma^2 t^2 \right) . \quad (2)
M_X(t) = \int_{-\infty}^{+\infty} \exp(tx) \cdot \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \mathrm{d}x
= \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty}^{+\infty} \exp \left[ tx - \frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \mathrm{d}x . \quad (5)
Substituting u = (x - \mu)/(\sqrt{2} \sigma), i.e. x = \sqrt{2} \sigma u + \mu, we have
M_X(t) = \frac{1}{\sqrt{2\pi} \sigma} \int_{(-\infty-\mu)/(\sqrt{2}\sigma)}^{(+\infty-\mu)/(\sqrt{2}\sigma)} \exp \left[ t \left( \sqrt{2} \sigma u + \mu \right) - \frac{1}{2} \left( \frac{\sqrt{2} \sigma u + \mu - \mu}{\sigma} \right)^2 \right] \mathrm{d} \left( \sqrt{2} \sigma u + \mu \right)
= \frac{\sqrt{2} \sigma}{\sqrt{2\pi} \sigma} \int_{-\infty}^{+\infty} \exp \left[ \left( \sqrt{2} \sigma u + \mu \right) t - u^2 \right] \mathrm{d}u
= \frac{\exp(\mu t)}{\sqrt{\pi}} \int_{-\infty}^{+\infty} \exp \left[ \sqrt{2} \sigma u t - u^2 \right] \mathrm{d}u
= \frac{\exp(\mu t)}{\sqrt{\pi}} \int_{-\infty}^{+\infty} \exp \left[ - \left( u^2 - \sqrt{2} \sigma u t \right) \right] \mathrm{d}u \quad (6)
= \frac{\exp(\mu t)}{\sqrt{\pi}} \int_{-\infty}^{+\infty} \exp \left[ - \left( u - \frac{\sqrt{2}}{2} \sigma t \right)^2 + \frac{1}{2} \sigma^2 t^2 \right] \mathrm{d}u
= \frac{\exp \left( \mu t + \frac{1}{2} \sigma^2 t^2 \right)}{\sqrt{\pi}} \int_{-\infty}^{+\infty} \exp \left[ - \left( u - \frac{\sqrt{2}}{2} \sigma t \right)^2 \right] \mathrm{d}u
Now substituting v = u - \frac{\sqrt{2}}{2} \sigma t, i.e. u = v + \frac{\sqrt{2}}{2} \sigma t, we have

M_X(t) = \frac{\exp \left( \mu t + \frac{1}{2} \sigma^2 t^2 \right)}{\sqrt{\pi}} \int_{-\infty - \frac{\sqrt{2}}{2}\sigma t}^{+\infty - \frac{\sqrt{2}}{2}\sigma t} \exp(-v^2) \, \mathrm{d} \left( v + \frac{\sqrt{2}}{2} \sigma t \right)
= \frac{\exp \left( \mu t + \frac{1}{2} \sigma^2 t^2 \right)}{\sqrt{\pi}} \int_{-\infty}^{+\infty} \exp(-v^2) \, \mathrm{d}v . \quad (7)

With the Gaussian integral \int_{-\infty}^{+\infty} \exp(-v^2) \, \mathrm{d}v = \sqrt{\pi}, this finally gives

M_X(t) = \exp \left( \mu t + \frac{1}{2} \sigma^2 t^2 \right) .
Sources:
• ProofWiki (2020): “Moment Generating Function of Gaussian Distribution”; in: ProofWiki, re-
trieved on 2020-03-03; URL: https://proofwiki.org/wiki/Moment_Generating_Function_of_Gaussian_
Distribution.
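The closed form can be compared against a direct Monte Carlo estimate of E[exp(tX)]; a sketch:

```python
import math
import random
import statistics

random.seed(11)
mu, sigma, t = 0.5, 1.0, 0.3
xs = [random.gauss(mu, sigma) for _ in range(200000)]

mgf_mc = statistics.fmean(math.exp(t * x) for x in xs)
mgf_closed = math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)

assert abs(mgf_mc - mgf_closed) < 0.01
```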
X \sim \mathcal{N}(\mu, \sigma^2) . \quad (1)

Then, the cumulative distribution function (→ I/1.8.1) of X is

F_X(x) = \frac{1}{2} \left[ 1 + \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right) \right] \quad (2)

where \mathrm{erf}(x) is the error function

\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2) \, \mathrm{d}t . \quad (3)
Proof: The probability density function of the normal distribution (→ II/3.2.10) is:

f_X(x) = \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] . \quad (4)
Thus, the cumulative distribution function (→ I/1.8.1) is:
F_X(x) = \int_{-\infty}^{x} \mathcal{N}(z; \mu, \sigma^2) \, \mathrm{d}z
= \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{z-\mu}{\sigma} \right)^2 \right] \mathrm{d}z \quad (5)
= \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty}^{x} \exp \left[ - \left( \frac{z-\mu}{\sqrt{2} \sigma} \right)^2 \right] \mathrm{d}z .
Substituting t = (z - \mu)/(\sqrt{2} \sigma), i.e. z = \sqrt{2} \sigma t + \mu, this becomes:
F_X(x) = \frac{1}{\sqrt{2\pi} \sigma} \int_{(-\infty-\mu)/(\sqrt{2}\sigma)}^{(x-\mu)/(\sqrt{2}\sigma)} \exp(-t^2) \, \mathrm{d} \left( \sqrt{2} \sigma t + \mu \right)
= \frac{\sqrt{2} \sigma}{\sqrt{2\pi} \sigma} \int_{-\infty}^{\frac{x-\mu}{\sqrt{2}\sigma}} \exp(-t^2) \, \mathrm{d}t
= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\frac{x-\mu}{\sqrt{2}\sigma}} \exp(-t^2) \, \mathrm{d}t \quad (6)
= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{0} \exp(-t^2) \, \mathrm{d}t + \frac{1}{\sqrt{\pi}} \int_{0}^{\frac{x-\mu}{\sqrt{2}\sigma}} \exp(-t^2) \, \mathrm{d}t
= \frac{1}{\sqrt{\pi}} \int_{0}^{\infty} \exp(-t^2) \, \mathrm{d}t + \frac{1}{\sqrt{\pi}} \int_{0}^{\frac{x-\mu}{\sqrt{2}\sigma}} \exp(-t^2) \, \mathrm{d}t .
Applying (3) to (6), we have:
F_X(x) = \frac{1}{2} \lim_{x \to \infty} \mathrm{erf}(x) + \frac{1}{2} \, \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right)
= \frac{1}{2} + \frac{1}{2} \, \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right) \quad (7)
= \frac{1}{2} \left[ 1 + \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right) \right] .
Sources:
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-20; URL: https://en.wikipedia.org/wiki/Normal_distribution#Cumulative_distribution_function.
• Wikipedia (2020): “Error function”; in: Wikipedia, the free encyclopedia, retrieved on 2020-03-20;
URL: https://en.wikipedia.org/wiki/Error_function.
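The erf-based form maps directly onto Python's math.erf; comparing it with the standard library's own normal CDF gives a quick consistency check:

```python
import math
from statistics import NormalDist

def normal_cdf(x, mu, sigma):
    # F_X(x) = 1/2 * [1 + erf((x - mu) / (sqrt(2) * sigma))]
    return 0.5 * (1 + math.erf((x - mu) / (math.sqrt(2) * sigma)))

mu, sigma = 1.0, 2.0
ref = NormalDist(mu, sigma)
for x in (-3.0, 0.0, 1.0, 2.5, 6.0):
    assert abs(normal_cdf(x, mu, sigma) - ref.cdf(x)) < 1e-12

assert abs(normal_cdf(mu, mu, sigma) - 0.5) < 1e-15  # the CDF is 1/2 at the mean
```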
X ∼ N (µ, σ 2 ) . (1)
Then, the cumulative distribution function (→ I/1.8.1) of X can be expressed as
F_X(x) = \Phi_{\mu,\sigma}(x) = \varphi \left( \frac{x-\mu}{\sigma} \right) \cdot \sum_{i=1}^{\infty} \frac{ \left( \frac{x-\mu}{\sigma} \right)^{2i-1} }{(2i-1)!!} + \frac{1}{2} \quad (2)
where φ(x) is the probability density function (→ I/1.7.1) of the standard normal distribution (→
II/3.2.3) and n!! is a double factorial.
Proof:
1) First, consider the standard normal distribution (→ II/3.2.3) N (0, 1) which has the probability
density function (→ II/3.2.10)
\varphi(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{1}{2} x^2} . \quad (3)
Let T (x) be the indefinite integral of this function. It can be obtained using infinitely repeated
integration by parts as follows:
T(x) = \int \varphi(x) \, \mathrm{d}x
= \int \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{1}{2} x^2} \, \mathrm{d}x
= \frac{1}{\sqrt{2\pi}} \int 1 \cdot e^{-\frac{1}{2} x^2} \, \mathrm{d}x
= \frac{1}{\sqrt{2\pi}} \cdot \left[ x \cdot e^{-\frac{1}{2} x^2} + \int x^2 \cdot e^{-\frac{1}{2} x^2} \, \mathrm{d}x \right]
= \frac{1}{\sqrt{2\pi}} \cdot \left[ x \cdot e^{-\frac{1}{2} x^2} + \frac{1}{3} x^3 \cdot e^{-\frac{1}{2} x^2} + \frac{1}{3} \int x^4 \cdot e^{-\frac{1}{2} x^2} \, \mathrm{d}x \right] \quad (4)
= \frac{1}{\sqrt{2\pi}} \cdot \left[ x \cdot e^{-\frac{1}{2} x^2} + \frac{1}{3} x^3 \cdot e^{-\frac{1}{2} x^2} + \frac{1}{15} x^5 \cdot e^{-\frac{1}{2} x^2} + \frac{1}{15} \int x^6 \cdot e^{-\frac{1}{2} x^2} \, \mathrm{d}x \right]
= \ldots
= \frac{1}{\sqrt{2\pi}} \cdot \left[ \sum_{i=1}^{n} \frac{x^{2i-1}}{(2i-1)!!} \cdot e^{-\frac{1}{2} x^2} + \int \frac{x^{2n}}{(2n-1)!!} \cdot e^{-\frac{1}{2} x^2} \, \mathrm{d}x \right]
= \frac{1}{\sqrt{2\pi}} \cdot \left[ \sum_{i=1}^{\infty} \frac{x^{2i-1}}{(2i-1)!!} \cdot e^{-\frac{1}{2} x^2} + \lim_{n \to \infty} \int \frac{x^{2n}}{(2n-1)!!} \cdot e^{-\frac{1}{2} x^2} \, \mathrm{d}x \right] .
Because the remainder integral vanishes as n \to \infty, the indefinite integral becomes

T(x) = \frac{1}{\sqrt{2\pi}} \cdot \sum_{i=1}^{\infty} \frac{x^{2i-1}}{(2i-1)!!} \cdot e^{-\frac{1}{2} x^2} + c
= \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{1}{2} x^2} \cdot \sum_{i=1}^{\infty} \frac{x^{2i-1}}{(2i-1)!!} + c \quad (6)
\overset{(3)}{=} \varphi(x) \cdot \sum_{i=1}^{\infty} \frac{x^{2i-1}}{(2i-1)!!} + c .
2) Next, let Φ(x) be the cumulative distribution function (→ I/1.8.1) of the standard normal distri-
bution (→ II/3.2.3):
\Phi(x) = \int_{-\infty}^{x} \varphi(z) \, \mathrm{d}z . \quad (7)
It can be obtained by matching T (0) to Φ(0) which is 1/2, because the standard normal distribution
is symmetric around zero:
T(0) = \varphi(0) \cdot \sum_{i=1}^{\infty} \frac{0^{2i-1}}{(2i-1)!!} + c = \frac{1}{2} = \Phi(0)
\Leftrightarrow c = \frac{1}{2} \quad (8)
\Rightarrow \Phi(x) = \varphi(x) \cdot \sum_{i=1}^{\infty} \frac{x^{2i-1}}{(2i-1)!!} + \frac{1}{2} .
3) Finally, the cumulative distribution functions (→ I/1.8.1) of the standard normal distribution (→
II/3.2.3) and the general normal distribution (→ II/3.2.1) are related to each other (→ II/3.2.4) as
\Phi_{\mu,\sigma}(x) = \Phi \left( \frac{x-\mu}{\sigma} \right) . \quad (9)
Combining (9) with (8), we have:
\Phi_{\mu,\sigma}(x) = \varphi \left( \frac{x-\mu}{\sigma} \right) \cdot \sum_{i=1}^{\infty} \frac{ \left( \frac{x-\mu}{\sigma} \right)^{2i-1} }{(2i-1)!!} + \frac{1}{2} . \quad (10)
Sources:
• Soch J (2015): “Solution for the Indefinite Integral of the Standard Normal Probability Density
Function”; in: arXiv stat.OT, 1512.04858; URL: https://arxiv.org/abs/1512.04858.
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-20; URL: https://en.wikipedia.org/wiki/Normal_distribution#Cumulative_distribution_function.
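The series converges quickly for moderate |x|, since each summand is the previous one multiplied by x²/(2i + 1). A sketch comparing a truncated version against the erf-based CDF (function names are my own):

```python
import math

def phi(x):
    # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def std_normal_cdf_series(x, terms=100):
    # Phi(x) = phi(x) * sum_{i=1..inf} x^(2i-1) / (2i-1)!! + 1/2, truncated
    total, term = 0.0, x  # first summand: x^1 / 1!!
    for i in range(1, terms + 1):
        total += term
        term *= x * x / (2 * i + 1)  # from x^(2i-1)/(2i-1)!! to x^(2i+1)/(2i+1)!!
    return phi(x) * total + 0.5

def std_normal_cdf_erf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(std_normal_cdf_series(x) - std_normal_cdf_erf(x)) < 1e-12
```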
Proof: The cumulative distribution function of a normally distributed (→ II/3.2.12) random variable
X is
F_X(x) = \frac{1}{2} \left[ 1 + \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right) \right] \quad (2)
where erf(x) is the error function defined as
\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2) \, \mathrm{d}t . \quad (3)
Sources:
• Wikipedia (2022): “68-95-99.7 rule”; in: Wikipedia, the free encyclopedia, retrieved on 2022-05.08;
URL: https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule.
X ∼ N (µ, σ 2 ) . (1)
Then, the quantile function (→ I/1.9.1) of X is
Q_X(p) = \sqrt{2} \sigma \cdot \mathrm{erf}^{-1}(2p - 1) + \mu \quad (2)

where \mathrm{erf}^{-1}(x) is the inverse error function.
Proof: The cumulative distribution function of the normal distribution (→ II/3.2.12) is:
F_X(x) = \frac{1}{2} \left[ 1 + \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right) \right] . \quad (3)
Because the cumulative distribution function (CDF) is strictly monotonically increasing, the quantile
function is equal to the inverse of the CDF (→ I/1.9.2):
p = \frac{1}{2} \left[ 1 + \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right) \right]
2p - 1 = \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right)
\mathrm{erf}^{-1}(2p - 1) = \frac{x-\mu}{\sqrt{2} \sigma} \quad (5)
x = \sqrt{2} \sigma \cdot \mathrm{erf}^{-1}(2p - 1) + \mu .
Sources:
• Wikipedia (2020): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-20; URL: https://en.wikipedia.org/wiki/Normal_distribution#Quantile_function.
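Python's standard library has no inverse error function, but since erf is strictly increasing, erf⁻¹ can be obtained by bisection, which makes the quantile formula directly testable against statistics.NormalDist.inv_cdf. A sketch (erfinv and normal_quantile are my own helper names):

```python
import math
from statistics import NormalDist

def erfinv(y, tol=1e-14):
    # bisect math.erf on [-6, 6]; erf(+-6) is +-1 to double precision
    lo, hi = -6.0, 6.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def normal_quantile(p, mu, sigma):
    # Q_X(p) = sqrt(2) * sigma * erfinv(2p - 1) + mu
    return math.sqrt(2) * sigma * erfinv(2 * p - 1) + mu

mu, sigma = 1.0, 2.0
ref = NormalDist(mu, sigma)
for p in (0.025, 0.5, 0.975):
    assert abs(normal_quantile(p, mu, sigma) - ref.inv_cdf(p)) < 1e-9
```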
3.2.16 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following a normal distribution (→ II/3.2.1):
X ∼ N (µ, σ 2 ) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
E(X) = µ . (2)
Proof: The expected value (→ I/1.10.1) is the probability-weighted average over all possible values:
\mathrm{E}(X) = \int_{\mathcal{X}} x \cdot f_X(x) \, \mathrm{d}x . \quad (3)
With the probability density function of the normal distribution (→ II/3.2.10), this reads:
Z " 2 #
+∞
1 1 x−µ
E(X) = x· √ · exp − dx
−∞ 2πσ 2 σ
Z +∞ " 2 # (4)
1 1 x−µ
=√ x · exp − dx .
2πσ −∞ 2 σ
Substituting z = x − µ, we have:
\mathrm{E}(X) = \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty-\mu}^{+\infty-\mu} (z + \mu) \cdot \exp \left[ -\frac{1}{2} \left( \frac{z}{\sigma} \right)^2 \right] \mathrm{d}(z + \mu)
= \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty}^{+\infty} (z + \mu) \cdot \exp \left[ -\frac{1}{2} \left( \frac{z}{\sigma} \right)^2 \right] \mathrm{d}z \quad (5)
= \frac{1}{\sqrt{2\pi} \sigma} \left[ \int_{-\infty}^{+\infty} z \cdot \exp \left( -\frac{z^2}{2\sigma^2} \right) \mathrm{d}z + \mu \int_{-\infty}^{+\infty} \exp \left( -\frac{z^2}{2\sigma^2} \right) \mathrm{d}z \right] .
The two integrals can be solved using the antiderivatives

\int x \cdot \exp(-a x^2) \, \mathrm{d}x = -\frac{1}{2a} \cdot \exp(-a x^2)
\int \exp(-a x^2) \, \mathrm{d}x = \frac{1}{2} \sqrt{\frac{\pi}{a}} \cdot \mathrm{erf} \left( \sqrt{a} \, x \right) \quad (6)
where erf(x) is the error function. Using this, the integrals can be calculated as:
\mathrm{E}(X) = \frac{1}{\sqrt{2\pi} \sigma} \left( \left[ -\sigma^2 \exp \left( -\frac{z^2}{2\sigma^2} \right) \right]_{-\infty}^{+\infty} + \mu \left[ \sqrt{\frac{\pi}{2}} \sigma \cdot \mathrm{erf} \left( \frac{z}{\sqrt{2} \sigma} \right) \right]_{-\infty}^{+\infty} \right)
= \frac{1}{\sqrt{2\pi} \sigma} \left( [0 - 0] + \mu \left[ \sqrt{\frac{\pi}{2}} \sigma - \left( -\sqrt{\frac{\pi}{2}} \sigma \right) \right] \right) \quad (7)
= \frac{1}{\sqrt{2\pi} \sigma} \cdot \mu \cdot 2 \sqrt{\frac{\pi}{2}} \sigma
= \mu .
Sources:
• Papadopoulos, Alecos (2013): “How to derive the mean and variance of Gaussian random vari-
able?”; in: StackExchange Mathematics, retrieved on 2020-01-09; URL: https://math.stackexchange.
com/questions/518281/how-to-derive-the-mean-and-variance-of-a-gaussian-random-variable.
3.2.17 Median
Theorem: Let X be a random variable (→ I/1.2.2) following a normal distribution (→ II/3.2.1):
X ∼ N (µ, σ 2 ) . (1)
Then, the median (→ I/1.15.1) of X is

\mathrm{median}(X) = \mu . \quad (2)
Proof: The median (→ I/1.15.1) is the value at which the cumulative distribution function (→
I/1.8.1) is 1/2:
F_X(\mathrm{median}(X)) = \frac{1}{2} . \quad (3)
The cumulative distribution function of the normal distribution (→ II/3.2.12) is
F_X(x) = \frac{1}{2} \left[ 1 + \mathrm{erf} \left( \frac{x-\mu}{\sqrt{2} \sigma} \right) \right] \quad (4)
where erf(x) is the error function. Thus, the inverse CDF is
x = \sqrt{2} \sigma \cdot \mathrm{erf}^{-1}(2p - 1) + \mu \quad (5)

where \mathrm{erf}^{-1}(x) is the inverse error function. Setting p = 1/2, we obtain:
\mathrm{median}(X) = \sqrt{2} \sigma \cdot \mathrm{erf}^{-1}(0) + \mu = \mu . \quad (6)
■
3.2.18 Mode
Theorem: Let X be a random variable (→ I/1.2.2) following a normal distribution (→ II/3.2.1):
X ∼ N (µ, σ 2 ) . (1)
Then, the mode (→ I/1.15.2) of X is
mode(X) = µ . (2)
Proof: The mode (→ I/1.15.2) is the value which maximizes the probability density function (→ I/1.7.1). Setting the first derivative of the probability density function to zero, we obtain:

f_X'(x) = 0 = \frac{1}{\sqrt{2\pi} \sigma^3} \cdot (-x + \mu) \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]
0 = -x + \mu \quad (7)
x = \mu .
Because the second derivative at this value is negative,

f_X''(\mu) = -\frac{1}{\sqrt{2\pi} \sigma^3} \cdot \exp(0) + \frac{1}{\sqrt{2\pi} \sigma^5} \cdot (0)^2 \cdot \exp(0) = -\frac{1}{\sqrt{2\pi} \sigma^3} < 0 , \quad (8)
we confirm that it is in fact a maximum which shows that
mode(X) = µ . (9)
3.2.19 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following a normal distribution (→ II/3.2.1):
X ∼ N (µ, σ 2 ) . (1)
Then, the variance (→ I/1.11.1) of X is
Var(X) = σ 2 . (2)
Proof: The variance (→ I/1.11.1) is the probability-weighted average of the squared deviation from
the mean (→ I/1.10.1):
\mathrm{Var}(X) = \int_{\mathbb{R}} (x - \mathrm{E}(X))^2 \cdot f_X(x) \, \mathrm{d}x . \quad (3)
With the expected value (→ II/3.2.16) and probability density function (→ II/3.2.10) of the normal
distribution, this reads:
\mathrm{Var}(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 \cdot \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \mathrm{d}x
= \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty}^{+\infty} (x - \mu)^2 \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \mathrm{d}x . \quad (4)
Substituting z = x − µ, we have:
\mathrm{Var}(X) = \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty-\mu}^{+\infty-\mu} z^2 \cdot \exp \left[ -\frac{1}{2} \left( \frac{z}{\sigma} \right)^2 \right] \mathrm{d}(z + \mu)
= \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty}^{+\infty} z^2 \cdot \exp \left[ -\frac{1}{2} \left( \frac{z}{\sigma} \right)^2 \right] \mathrm{d}z . \quad (5)
Now substituting z = \sqrt{2} \sigma x, we have:
\mathrm{Var}(X) = \frac{1}{\sqrt{2\pi} \sigma} \int_{-\infty}^{+\infty} \left( \sqrt{2} \sigma x \right)^2 \cdot \exp \left[ -\frac{1}{2} \left( \frac{\sqrt{2} \sigma x}{\sigma} \right)^2 \right] \mathrm{d} \left( \sqrt{2} \sigma x \right)
= \frac{1}{\sqrt{2\pi} \sigma} \cdot 2\sigma^2 \cdot \sqrt{2} \sigma \int_{-\infty}^{+\infty} x^2 \cdot \exp(-x^2) \, \mathrm{d}x \quad (6)
= \frac{2\sigma^2}{\sqrt{\pi}} \int_{-\infty}^{+\infty} x^2 \cdot e^{-x^2} \, \mathrm{d}x .
Since the integrand is symmetric with respect to x = 0, we can write:
\mathrm{Var}(X) = \frac{4\sigma^2}{\sqrt{\pi}} \int_0^\infty x^2 \cdot e^{-x^2} \, \mathrm{d}x . \quad (7)
If we define z = x^2, then x = \sqrt{z} and \mathrm{d}x = \frac{1}{2} z^{-1/2} \, \mathrm{d}z. Substituting this into the integral

\mathrm{Var}(X) = \frac{4\sigma^2}{\sqrt{\pi}} \int_0^\infty z \cdot e^{-z} \cdot \frac{1}{2} z^{-\frac{1}{2}} \, \mathrm{d}z = \frac{2\sigma^2}{\sqrt{\pi}} \int_0^\infty z^{\frac{3}{2}-1} \cdot e^{-z} \, \mathrm{d}z \quad (8)
and using the definition of the gamma function
\Gamma(x) = \int_0^\infty z^{x-1} \cdot e^{-z} \, \mathrm{d}z , \quad (9)
we can finally show that
\mathrm{Var}(X) = \frac{2\sigma^2}{\sqrt{\pi}} \cdot \Gamma \left( \frac{3}{2} \right) = \frac{2\sigma^2}{\sqrt{\pi}} \cdot \frac{\sqrt{\pi}}{2} = \sigma^2 . \quad (10)
■
Sources:
• Papadopoulos, Alecos (2013): “How to derive the mean and variance of Gaussian random vari-
able?”; in: StackExchange Mathematics, retrieved on 2020-01-09; URL: https://math.stackexchange.
com/questions/518281/how-to-derive-the-mean-and-variance-of-a-gaussian-random-variable.
X ∼ N (µ, σ 2 ) . (1)
Then, the full width at half maximum (→ I/1.16.2) (FWHM) of X is
\mathrm{FWHM}(X) = 2 \sqrt{2 \ln 2} \, \sigma . \quad (2)
Proof: The probability density function of the normal distribution (→ II/3.2.10) attains its maximum at the mode (→ II/3.2.18)

\mathrm{mode}(X) = \mu , \quad (4)

such that

f_{\max} = f_X(\mathrm{mode}(X)) = f_X(\mu) = \frac{1}{\sqrt{2\pi} \sigma} . \quad (5)
The FWHM bounds satisfy the equation (→ I/1.16.2)
f_X(x_{\mathrm{FWHM}}) = \frac{1}{2} f_{\max} \overset{(5)}{=} \frac{1}{2\sqrt{2\pi} \sigma} . \quad (6)
Using (3), we can develop this equation as follows:

\frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x_{\mathrm{FWHM}}-\mu}{\sigma} \right)^2 \right] = \frac{1}{2\sqrt{2\pi} \sigma}
\exp \left[ -\frac{1}{2} \left( \frac{x_{\mathrm{FWHM}}-\mu}{\sigma} \right)^2 \right] = \frac{1}{2}
-\frac{1}{2} \left( \frac{x_{\mathrm{FWHM}}-\mu}{\sigma} \right)^2 = \ln \frac{1}{2}
\left( \frac{x_{\mathrm{FWHM}}-\mu}{\sigma} \right)^2 = -2 \ln \frac{1}{2} = 2 \ln 2 \quad (7)
\frac{x_{\mathrm{FWHM}}-\mu}{\sigma} = \pm \sqrt{2 \ln 2}
x_{\mathrm{FWHM}} - \mu = \pm \sqrt{2 \ln 2} \, \sigma
x_{\mathrm{FWHM}} = \pm \sqrt{2 \ln 2} \, \sigma + \mu .
This implies the following two solutions for xFWHM
x_1 = \mu - \sqrt{2 \ln 2} \, \sigma
x_2 = \mu + \sqrt{2 \ln 2} \, \sigma , \quad (8)
such that the full width at half maximum (→ I/1.16.2) of X is
\mathrm{FWHM}(X) = \Delta x = x_2 - x_1 \overset{(8)}{=} \left( \mu + \sqrt{2 \ln 2} \, \sigma \right) - \left( \mu - \sqrt{2 \ln 2} \, \sigma \right) = 2 \sqrt{2 \ln 2} \, \sigma . \quad (9)
Sources:
• Wikipedia (2020): “Full width at half maximum”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-08-19; URL: https://en.wikipedia.org/wiki/Full_width_at_half_maximum.
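The result can be verified by solving f(x) = f_max/2 numerically: bisect on the right of µ, where the PDF is strictly decreasing, and use symmetry for the left bound. A sketch:

```python
import math

mu, sigma = 1.0, 2.0

def pdf(x):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

half_max = pdf(mu) / 2

def solve_right():
    # bisect pdf(x) = half_max on [mu, mu + 10*sigma], where pdf is decreasing
    lo, hi = mu, mu + 10 * sigma
    for _ in range(100):
        mid = (lo + hi) / 2
        if pdf(mid) > half_max:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x2 = solve_right()
x1 = 2 * mu - x2          # the PDF is symmetric around mu
fwhm_numeric = x2 - x1
fwhm_closed = 2 * math.sqrt(2 * math.log(2)) * sigma

assert abs(fwhm_numeric - fwhm_closed) < 1e-9
```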
Proof: The probability density function of the normal distribution (→ II/3.2.10) is:

f_X(x) = \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] . \quad (1)
The first two derivatives of this function (→ II/3.2.18) are:

f_X'(x) = \frac{\mathrm{d}f_X(x)}{\mathrm{d}x} = \frac{1}{\sqrt{2\pi} \sigma^3} \cdot (-x + \mu) \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \quad (2)

f_X''(x) = \frac{\mathrm{d}^2 f_X(x)}{\mathrm{d}x^2} = -\frac{1}{\sqrt{2\pi} \sigma^3} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] + \frac{1}{\sqrt{2\pi} \sigma^5} \cdot (-x + \mu)^2 \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] . \quad (3)
The first derivative is zero, if and only if

-x + \mu = 0 \quad \Leftrightarrow \quad x = \mu . \quad (4)
Since the second derivative is negative at this value
f_X''(\mu) = -\frac{1}{\sqrt{2\pi} \sigma^3} < 0 , \quad (5)
there is a maximum at x = µ. From (2), it can be seen that fX′ (x) is positive for x < µ and negative
for x > µ. Thus, there are no further extrema and N (µ, σ 2 ) is unimodal (→ II/3.2.18).
Sources:
• Wikipedia (2021): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
08-25; URL: https://en.wikipedia.org/wiki/Normal_distribution#Symmetries_and_derivatives.
Proof: The probability density function of the normal distribution (→ II/3.2.10) is:

f_X(x) = \frac{1}{\sqrt{2\pi} \sigma} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] . \quad (1)
The first three derivatives of this function are:

f_X'(x) = \frac{\mathrm{d}f_X(x)}{\mathrm{d}x} = \frac{1}{\sqrt{2\pi} \sigma} \cdot \left( -\frac{x-\mu}{\sigma^2} \right) \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \quad (2)

f_X''(x) = \frac{\mathrm{d}^2 f_X(x)}{\mathrm{d}x^2} = \frac{1}{\sqrt{2\pi} \sigma} \cdot \left( -\frac{1}{\sigma^2} \right) \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] + \frac{1}{\sqrt{2\pi} \sigma} \cdot \left( \frac{x-\mu}{\sigma^2} \right)^2 \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]
= \frac{1}{\sqrt{2\pi} \sigma} \cdot \left( \frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2} \right) \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \quad (3)

f_X'''(x) = \frac{\mathrm{d}^3 f_X(x)}{\mathrm{d}x^3} = \frac{1}{\sqrt{2\pi} \sigma} \cdot \frac{2(x-\mu)}{\sigma^4} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] + \frac{1}{\sqrt{2\pi} \sigma} \cdot \left( \frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2} \right) \cdot \left( -\frac{x-\mu}{\sigma^2} \right) \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]
= \frac{1}{\sqrt{2\pi} \sigma} \cdot \left( \frac{3(x-\mu)}{\sigma^4} - \frac{(x-\mu)^3}{\sigma^6} \right) \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] . \quad (4)

The second derivative is zero, if and only if:

0 = \frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2}
0 = \frac{x^2}{\sigma^4} - \frac{2\mu x}{\sigma^4} + \frac{\mu^2}{\sigma^4} - \frac{1}{\sigma^2}
0 = x^2 - 2\mu x + (\mu^2 - \sigma^2) \quad (5)
x_{1/2} = \mu \pm \sqrt{\mu^2 - (\mu^2 - \sigma^2)}
x_{1/2} = \mu \pm \sigma .

Because the third derivative is non-zero at these values,

f_X'''(\mu \pm \sigma) = \frac{1}{\sqrt{2\pi} \sigma} \cdot \left( \frac{\pm 3\sigma}{\sigma^4} - \frac{(\pm\sigma)^3}{\sigma^6} \right) \cdot \exp \left[ -\frac{1}{2} \left( \frac{\pm\sigma}{\sigma} \right)^2 \right] = \frac{1}{\sqrt{2\pi} \sigma} \cdot \left( \pm \frac{2}{\sigma^3} \right) \cdot \exp \left( -\frac{1}{2} \right) \neq 0 , \quad (6)
there are inflection points at x1/2 = µ ± σ. Because µ is the mean and σ 2 is the variance of a normal
distribution (→ II/3.2.1), these points are exactly one standard deviation (→ I/1.16.1) away from
the mean.
Sources:
• Wikipedia (2021): “Normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
08-25; URL: https://en.wikipedia.org/wiki/Normal_distribution#Symmetries_and_derivatives.
X ∼ N (µ, σ 2 ) . (1)
Then, the differential entropy (→ I/2.2.1) of X is
h(X) = \frac{1}{2} \ln \left( 2\pi\sigma^2 e \right) . \quad (2)
Note that E [(x − µ)2 ] corresponds to the variance (→ I/1.11.1) of X and the variance of the normal
distribution (→ II/3.2.19) is σ 2 . Thus, we can proceed:
h(X) = \frac{1}{2} \ln(2\pi\sigma^2) + \frac{1}{2} \cdot \frac{1}{\sigma^2} \cdot \sigma^2
= \frac{1}{2} \ln(2\pi\sigma^2) + \frac{1}{2} \quad (6)
= \frac{1}{2} \ln(2\pi\sigma^2 e) .
■
Sources:
• Wang, Peng-Hua (2012): “Differential Entropy”; in: National Taipei University; URL: https://
web.ntpu.edu.tw/~phwang/teaching/2012s/IT/slides/chap08.pdf.
P : X ∼ N (µ1 , σ12 )
(1)
Q : X ∼ N (µ2 , σ22 ) .
\mathrm{KL}[P \,||\, Q] = \int_{-\infty}^{+\infty} \mathcal{N}(x; \mu_1, \sigma_1^2) \, \ln \frac{\mathcal{N}(x; \mu_1, \sigma_1^2)}{\mathcal{N}(x; \mu_2, \sigma_2^2)} \, \mathrm{d}x = \left\langle \ln \frac{\mathcal{N}(x; \mu_1, \sigma_1^2)}{\mathcal{N}(x; \mu_2, \sigma_2^2)} \right\rangle_{p(x)} . \quad (4)
Using the probability density function of the normal distribution (→ II/3.2.10), this becomes:
\mathrm{KL}[P \,||\, Q] = \left\langle \ln \frac{ \frac{1}{\sqrt{2\pi} \sigma_1} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu_1}{\sigma_1} \right)^2 \right] }{ \frac{1}{\sqrt{2\pi} \sigma_2} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu_2}{\sigma_2} \right)^2 \right] } \right\rangle_{p(x)}
= \left\langle \ln \left( \sqrt{\frac{\sigma_2^2}{\sigma_1^2}} \cdot \exp \left[ -\frac{1}{2} \left( \frac{x-\mu_1}{\sigma_1} \right)^2 + \frac{1}{2} \left( \frac{x-\mu_2}{\sigma_2} \right)^2 \right] \right) \right\rangle_{p(x)}
= \left\langle \frac{1}{2} \ln \frac{\sigma_2^2}{\sigma_1^2} - \frac{1}{2} \left( \frac{x-\mu_1}{\sigma_1} \right)^2 + \frac{1}{2} \left( \frac{x-\mu_2}{\sigma_2} \right)^2 \right\rangle_{p(x)} \quad (5)
= \frac{1}{2} \left\langle - \left( \frac{x-\mu_1}{\sigma_1} \right)^2 + \left( \frac{x-\mu_2}{\sigma_2} \right)^2 - \ln \frac{\sigma_1^2}{\sigma_2^2} \right\rangle_{p(x)}
= \frac{1}{2} \left\langle - \frac{(x-\mu_1)^2}{\sigma_1^2} + \frac{x^2 - 2\mu_2 x + \mu_2^2}{\sigma_2^2} - \ln \frac{\sigma_1^2}{\sigma_2^2} \right\rangle_{p(x)} .
Because the expected value (→ I/1.10.1) is a linear operator (→ I/1.10.5), the expectation can be
moved into the sum:
\mathrm{KL}[P \,||\, Q] = \frac{1}{2} \left[ - \frac{\langle (x-\mu_1)^2 \rangle}{\sigma_1^2} + \frac{\langle x^2 - 2\mu_2 x + \mu_2^2 \rangle}{\sigma_2^2} - \ln \frac{\sigma_1^2}{\sigma_2^2} \right]
= \frac{1}{2} \left[ - \frac{\langle (x-\mu_1)^2 \rangle}{\sigma_1^2} + \frac{\langle x^2 \rangle - 2\mu_2 \langle x \rangle + \mu_2^2}{\sigma_2^2} - \ln \frac{\sigma_1^2}{\sigma_2^2} \right] . \quad (6)
The first expectation corresponds to the variance (→ I/1.11.1), \langle (x-\mu_1)^2 \rangle = \sigma_1^2, and the raw moments of P are \langle x \rangle = \mu_1 and \langle x^2 \rangle = \mu_1^2 + \sigma_1^2, such that:

\mathrm{KL}[P \,||\, Q] = \frac{1}{2} \left[ - \frac{\sigma_1^2}{\sigma_1^2} + \frac{\mu_1^2 + \sigma_1^2 - 2\mu_2\mu_1 + \mu_2^2}{\sigma_2^2} - \ln \frac{\sigma_1^2}{\sigma_2^2} \right]
= \frac{1}{2} \left[ \frac{\mu_1^2 - 2\mu_1\mu_2 + \mu_2^2}{\sigma_2^2} + \frac{\sigma_1^2}{\sigma_2^2} - \ln \frac{\sigma_1^2}{\sigma_2^2} - 1 \right] \quad (10)
= \frac{1}{2} \left[ \frac{(\mu_1-\mu_2)^2}{\sigma_2^2} + \frac{\sigma_1^2}{\sigma_2^2} - \ln \frac{\sigma_1^2}{\sigma_2^2} - 1 \right]
which is equivalent to (2).
■
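The closed-form result in (10) can be checked against a direct Monte Carlo estimate of the KL divergence, since KL[P||Q] = E_P[ln p(x) − ln q(x)]. This is a minimal stdlib-only sketch; the parameter values are illustrative assumptions.

```python
import math
import random

def kl_normal(mu1, s1, mu2, s2):
    """Closed-form KL divergence between N(mu1, s1^2) and N(mu2, s2^2), cf. equation (10)."""
    return 0.5 * ((mu1 - mu2)**2 / s2**2 + s1**2 / s2**2 - math.log(s1**2 / s2**2) - 1)

def kl_normal_mc(mu1, s1, mu2, s2, n=200_000, seed=42):
    """Monte Carlo estimate of E_P[ln p(x) - ln q(x)] with x ~ P = N(mu1, s1^2)."""
    rng = random.Random(seed)
    def log_pdf(x, mu, s):
        return -0.5 * math.log(2 * math.pi * s**2) - (x - mu)**2 / (2 * s**2)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu1, s1)
        total += log_pdf(x, mu1, s1) - log_pdf(x, mu2, s2)
    return total / n

print(kl_normal(1.0, 2.0, 0.0, 1.0), kl_normal_mc(1.0, 2.0, 0.0, 1.0))
```

Note that the estimate is non-negative, in line with the non-negativity of the KL divergence (→ I/2.5.2).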
3. UNIVARIATE CONTINUOUS DISTRIBUTIONS 205
Proof: For a random variable (→ I/1.2.2) X with set of possible values X and probability density
function (→ I/1.7.1) p(x), the differential entropy (→ I/2.2.1) is defined as:
$$ \mathrm{h}(X) = - \int_{\mathcal{X}} p(x) \, \log p(x) \, \mathrm{d}x \; . \quad (1) $$
Let g(x) be the probability density function (→ I/1.7.1) of a normal distribution (→ II/3.2.1) with
mean (→ I/1.10.1) µ and variance (→ I/1.11.1) σ 2 and let f (x) be an arbitrary probability density
function (→ I/1.7.1) with the same variance (→ I/1.11.1). Since differential entropy (→ I/2.2.1) is
translation-invariant (→ I/2.2.3), we can assume that f (x) has the same mean as g(x).
Consider the Kullback-Leibler divergence (→ I/2.5.1) of distribution f (x) from distribution g(x)
which is non-negative (→ I/2.5.2):
$$ \begin{split} 0 \leq \mathrm{KL}[f\,||\,g] &= \int_{\mathcal{X}} f(x) \log \frac{f(x)}{g(x)} \, \mathrm{d}x \\ &= \int_{\mathcal{X}} f(x) \log f(x) \, \mathrm{d}x - \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x \\ &\overset{(1)}{=} - \mathrm{h}[f(x)] - \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x \; . \end{split} \quad (2) $$
By plugging the probability density function of the normal distribution (→ II/3.2.10) into the second
term, we obtain:
$$ \begin{split} \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x &= \int_{\mathcal{X}} f(x) \log \left( \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right] \right) \mathrm{d}x \\ &= \int_{\mathcal{X}} f(x) \log \frac{1}{\sqrt{2\pi}\sigma} \, \mathrm{d}x + \int_{\mathcal{X}} f(x) \log \exp\left[ -\frac{(x-\mu)^2}{2\sigma^2} \right] \mathrm{d}x \\ &= -\frac{1}{2} \log(2\pi\sigma^2) \int_{\mathcal{X}} f(x) \, \mathrm{d}x - \frac{\log(e)}{2\sigma^2} \int_{\mathcal{X}} f(x)(x-\mu)^2 \, \mathrm{d}x \; . \end{split} \quad (3) $$
Because the entire integral over a probability density function is one (→ I/1.7.1) and the second
central moment is equal to the variance (→ I/1.18.8), we have:
$$ \begin{split} \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x &= -\frac{1}{2} \log(2\pi\sigma^2) - \frac{\log(e)\,\sigma^2}{2\sigma^2} \\ &= -\frac{1}{2} \left[ \log(2\pi\sigma^2) + \log(e) \right] \\ &= -\frac{1}{2} \log(2\pi\sigma^2 e) \; . \end{split} \quad (4) $$
This is actually the negative of the differential entropy of the normal distribution (→ II/3.2.23), such
that:
$$ \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x = - \mathrm{h}[\mathcal{N}(\mu, \sigma^2)] = - \mathrm{h}[g(x)] \; . \quad (5) $$
Plugging (5) into (2), we obtain
$$ \begin{split} 0 &\leq \mathrm{KL}[f\,||\,g] \\ 0 &\leq - \mathrm{h}[f(x)] - \left( - \mathrm{h}[g(x)] \right) \\ \mathrm{h}[g(x)] &\geq \mathrm{h}[f(x)] \end{split} \quad (6) $$
which means that the differential entropy (→ I/2.2.1) of the normal distribution (→ II/3.2.1) N (µ, σ 2 )
will be larger than or equal to any other distribution (→ I/1.5.1) with the same variance (→ I/1.11.1)
σ².
Sources:
• Wikipedia (2021): “Differential entropy”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
08-25; URL: https://en.wikipedia.org/wiki/Differential_entropy#Maximization_in_the_normal_
distribution.
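The maximum-entropy property can be illustrated with a concrete same-variance comparison. A uniform distribution U(0, c) has variance c²/12 and differential entropy ln(c), so matching the variance σ² requires c = σ√12; its entropy should then fall below that of N(µ, σ²). This stdlib-only sketch is ours, not from the book:

```python
import math

def h_normal(sigma):
    """Differential entropy of N(mu, sigma^2) (→ II/3.2.23)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma**2)

def h_uniform_matched(sigma):
    """Differential entropy of a uniform distribution with the same variance:
    U(0, c) has variance c^2/12 and entropy ln(c), so choose c = sigma*sqrt(12)."""
    return math.log(sigma * math.sqrt(12))

for sigma in (0.5, 1.0, 3.0):
    # the normal distribution attains the larger entropy at every variance
    assert h_normal(sigma) > h_uniform_matched(sigma)
print(h_normal(1.0) - h_uniform_matched(1.0))  # gap is constant in sigma, about 0.176 nats
```

The gap ½ ln(2πe/12) is independent of σ, as both entropies shift by ln σ.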
with mean and variance which are functions of the individual means and variances:
$$ \mu = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_n \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_n^2 \end{bmatrix} = \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right) \; . \quad (5) $$
Thus, we can apply the linear transformation theorem for the multivariate normal distribution (→
II/4.1.12)
$$ \begin{split} A\mu &= [a_1, \ldots, a_n] \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_n \end{bmatrix} = \sum_{i=1}^{n} a_i \mu_i \quad \text{and} \\ A \Sigma A^{\mathrm{T}} &= [a_1, \ldots, a_n] \begin{bmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_n^2 \end{bmatrix} \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} = \sum_{i=1}^{n} a_i^2 \sigma_i^2 \; . \end{split} \quad (9) $$
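The linear-combination result — that Σ aᵢXᵢ has mean Σ aᵢµᵢ and variance Σ aᵢ²σᵢ² for independent normals — is easy to verify by simulation. The coefficients and parameters below are arbitrary assumptions for illustration:

```python
import random
import statistics

# hypothetical coefficients a_i, means mu_i and standard deviations sigma_i
a = [2.0, -1.0, 0.5]
mu = [1.0, 3.0, -2.0]
sd = [1.0, 2.0, 0.5]

rng = random.Random(0)
n = 200_000
samples = [sum(ai * rng.gauss(mi, si) for ai, mi, si in zip(a, mu, sd))
           for _ in range(n)]

mean_pred = sum(ai * mi for ai, mi in zip(a, mu))        # sum of a_i * mu_i
var_pred = sum(ai**2 * si**2 for ai, si in zip(a, sd))   # sum of a_i^2 * sigma_i^2

print(statistics.fmean(samples), mean_pred)     # both ≈ -2.0
print(statistics.variance(samples), var_pred)   # both ≈ 8.0625
```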
3.3 t-distribution
3.3.1 Definition
Definition: Let Z and V be independent (→ I/1.3.6) random variables (→ I/1.2.2) following a
standard normal distribution (→ II/3.2.3) and a chi-squared distribution (→ II/3.7.1) with ν degrees
of freedom, respectively:
$$ Z \sim \mathcal{N}(0, 1) \\ V \sim \chi^2(\nu) \; . \quad (1) $$
Then, the ratio of Z to the square root of V divided by its degrees of freedom is said to be t-distributed with ν degrees of freedom:
$$ Y = \frac{Z}{\sqrt{V/\nu}} \sim t(\nu) \; . \quad (2) $$
The t-distribution is also called “Student’s t-distribution”, after William S. Gosset a.k.a. “Student”.
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-04-21; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Characterization.
$$ \begin{split} t(x; 0, 1, \nu) &= \sqrt{\frac{1}{(\nu\pi)^1 |1|}} \cdot \frac{\Gamma([\nu+1]/2)}{\Gamma(\nu/2)} \cdot \left[ 1 + \frac{1}{\nu} (x-0)^{\mathrm{T}} \, 1^{-1} \, (x-0) \right]^{-(\nu+1)/2} \\ &= \sqrt{\frac{1}{\nu\pi}} \cdot \frac{\Gamma([\nu+1]/2)}{\Gamma(\nu/2)} \cdot \left( 1 + \frac{x^2}{\nu} \right)^{-(\nu+1)/2} \\ &= \frac{1}{\sqrt{\nu\pi}} \cdot \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)} \cdot \left( 1 + \frac{x^2}{\nu} \right)^{-\frac{\nu+1}{2}} \; . \end{split} \quad (2) $$
Sources:
• Wikipedia (2022): “Multivariate t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-08-25; URL: https://en.wikipedia.org/wiki/Multivariate_t-distribution#Derivation.
Y = σX + µ (1)
is said to follow a non-standardized t-distribution with non-centrality µ, scale σ 2 and degrees of
freedom ν:
Y ∼ nst(µ, σ 2 , ν) . (2)
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-05-20; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Generalized_Student’
s_t-distribution.
X ∼ nst(µ, σ 2 , ν) . (1)
Then, subtracting the mean and dividing by the square root of the scale results in a random variable
(→ I/1.2.2) following a t-distribution (→ II/3.3.1) with degrees of freedom ν:
$$ Y = \frac{X - \mu}{\sigma} \sim t(\nu) \; . \quad (2) $$
$$ \begin{split} Y = \frac{X-\mu}{\sigma} &= \frac{X}{\sigma} - \frac{\mu}{\sigma} \\ &\sim t\left( \frac{\mu}{\sigma} - \frac{\mu}{\sigma}, \; \frac{1}{\sigma^2} \sigma^2, \; \nu \right) \\ &= t(0, 1, \nu) \; . \end{split} \quad (5) $$
■
T ∼ t(ν) . (1)
Then, the probability density function (→ I/1.7.1) of T is
$$ f_T(t) = \frac{\Gamma\left( \frac{\nu+1}{2} \right)}{\Gamma\left( \frac{\nu}{2} \right) \cdot \sqrt{\nu\pi}} \cdot \left( \frac{t^2}{\nu} + 1 \right)^{-\frac{\nu+1}{2}} \; . \quad (2) $$
Proof: A t-distributed random variable (→ II/3.3.1) is defined as the ratio of a standard normal
random variable (→ II/3.2.3) and the square root of a chi-squared random variable (→ II/3.7.1),
divided by its degrees of freedom
$$ X \sim \mathcal{N}(0,1), \; Y \sim \chi^2(\nu) \quad \Rightarrow \quad T = \frac{X}{\sqrt{Y/\nu}} \sim t(\nu) \quad (3) $$
where X and Y are independent of each other (→ I/1.3.6).
The probability density function (→ II/3.2.10) of the standard normal distribution (→ II/3.2.3) is
$$ f_X(x) = \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{x^2}{2}} \quad (4) $$
and the probability density function of the chi-squared distribution (→ II/3.7.3) is
$$ f_Y(y) = \frac{1}{\Gamma\left(\frac{\nu}{2}\right) \cdot 2^{\nu/2}} \cdot y^{\frac{\nu}{2}-1} \cdot e^{-\frac{y}{2}} \; . \quad (5) $$
Define the random variables T and W as functions of X and Y
$$ T = X \cdot \sqrt{\frac{\nu}{Y}} \; , \quad W = Y \quad (6) $$
with the inverse functions
$$ X = T \cdot \sqrt{\frac{W}{\nu}} \; , \quad Y = W \; . \quad (7) $$
The Jacobian matrix of this transformation is
$$ J = \begin{bmatrix} \frac{\mathrm{d}X}{\mathrm{d}T} & \frac{\mathrm{d}X}{\mathrm{d}W} \\ \frac{\mathrm{d}Y}{\mathrm{d}T} & \frac{\mathrm{d}Y}{\mathrm{d}W} \end{bmatrix} = \begin{bmatrix} \sqrt{\frac{W}{\nu}} & \frac{T}{2\sqrt{\nu W}} \\ 0 & 1 \end{bmatrix} \quad \text{with determinant} \quad |J| = \sqrt{\frac{W}{\nu}} \; . \quad (8) $$
Because X and Y are independent (→ I/1.3.6), the joint density (→ I/1.5.2) of X and Y is equal to
the product (→ I/1.3.8) of the marginal densities (→ I/1.5.3):
$$ \begin{split} f_{T,W}(t,w) &= f_X\!\left( t \cdot \sqrt{\frac{w}{\nu}} \right) \cdot f_Y(w) \cdot |J| \\ &= \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{1}{2} \left( t \sqrt{w/\nu} \right)^2} \cdot \frac{1}{\Gamma\left(\frac{\nu}{2}\right) \cdot 2^{\nu/2}} \cdot w^{\frac{\nu}{2}-1} \cdot e^{-\frac{w}{2}} \cdot \sqrt{\frac{w}{\nu}} \\ &= \frac{1}{\sqrt{2\pi\nu} \cdot \Gamma\left(\frac{\nu}{2}\right) \cdot 2^{\nu/2}} \cdot w^{\frac{\nu+1}{2}-1} \cdot e^{-\frac{w}{2} \left( \frac{t^2}{\nu} + 1 \right)} \; . \end{split} \quad (11) $$
The marginal density (→ I/1.5.3) of T can now be obtained by integrating out (→ I/1.3.3) W :
$$ \begin{split} f_T(t) &= \int_0^\infty f_{T,W}(t,w) \, \mathrm{d}w \\ &= \frac{1}{\sqrt{2\pi\nu} \cdot \Gamma\left(\frac{\nu}{2}\right) \cdot 2^{\nu/2}} \int_0^\infty w^{\frac{\nu+1}{2}-1} \exp\left[ -\frac{1}{2} \left( \frac{t^2}{\nu} + 1 \right) w \right] \mathrm{d}w \\ &= \frac{1}{\sqrt{2\pi\nu} \cdot \Gamma\left(\frac{\nu}{2}\right) \cdot 2^{\nu/2}} \cdot \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\left[ \frac{1}{2}\left( \frac{t^2}{\nu}+1 \right) \right]^{(\nu+1)/2}} \int_0^\infty \frac{\left[ \frac{1}{2}\left( \frac{t^2}{\nu}+1 \right) \right]^{(\nu+1)/2}}{\Gamma\left(\frac{\nu+1}{2}\right)} \, w^{\frac{\nu+1}{2}-1} \exp\left[ -\frac{1}{2} \left( \frac{t^2}{\nu} + 1 \right) w \right] \mathrm{d}w \; . \end{split} \quad (12) $$
At this point, we can recognize that the integrand is equal to the probability density function of a
gamma distribution (→ II/3.4.7) with
$$ a = \frac{\nu+1}{2} \quad \text{and} \quad b = \frac{1}{2} \left( \frac{t^2}{\nu} + 1 \right) \; , \quad (13) $$
and because a probability density function integrates to one (→ I/1.7.1), we finally have:
$$ \begin{split} f_T(t) &= \frac{1}{\sqrt{2\pi\nu} \cdot \Gamma\left(\frac{\nu}{2}\right) \cdot 2^{\nu/2}} \cdot \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\left[ \frac{1}{2}\left( \frac{t^2}{\nu}+1 \right) \right]^{(\nu+1)/2}} \\ &= \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) \cdot \sqrt{\nu\pi}} \cdot \left( \frac{t^2}{\nu} + 1 \right)^{-\frac{\nu+1}{2}} \; . \end{split} \quad (14) $$
■
Sources:
• Computation Empire (2021): “Student’s t Distribution: Derivation of PDF”; in: YouTube, retrieved
on 2021-10-11; URL: https://www.youtube.com/watch?v=6BraaGEVRY8.
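Both the defining construction T = Z/√(V/ν) and the derived density (14) can be checked by simulation: the fraction of simulated t-variates falling in a small interval around zero should approximate 2h·f_T(0). The following stdlib-only sketch uses a chi-squared draw built as a sum of ν squared standard normals; the parameter choices are illustrative assumptions.

```python
import math
import random

def t_pdf(t, nu):
    """pdf of Student's t-distribution, cf. equation (14), via log-gamma for stability."""
    log_c = math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2) - 0.5 * math.log(nu * math.pi)
    return math.exp(log_c) * (t**2 / nu + 1) ** (-(nu + 1) / 2)

def t_sample(nu, rng):
    """Draw T = Z / sqrt(V/nu) with Z ~ N(0,1) and V ~ chi^2(nu) as a sum of nu squared normals."""
    z = rng.gauss(0, 1)
    v = sum(rng.gauss(0, 1) ** 2 for _ in range(nu))
    return z / math.sqrt(v / nu)

rng = random.Random(7)
nu, n, h = 5, 200_000, 0.1
samples = [t_sample(nu, rng) for _ in range(n)]
# the fraction of samples in [-h, h] should approximate 2 * h * f_T(0)
frac = sum(-h <= s <= h for s in samples) / n
print(frac, 2 * h * t_pdf(0.0, nu))
```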
X ∼ Gam(a, b) , (1)
if and only if its probability density function (→ I/1.7.1) is given by
$$ \mathrm{Gam}(x; a, b) = \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \; , \quad x > 0 \; , \quad (2) $$
where a > 0 and b > 0, and the density is zero, if x ≤ 0.
Sources:
• Koch, Karl-Rudolf (2007): “Gamma Distribution”; in: Introduction to Bayesian Statistics, Springer,
Berlin/Heidelberg, 2007, p. 47, eq. 2.172; URL: https://www.springer.com/de/book/9783540727231;
DOI: 10.1007/978-3-540-72726-2.
Proof: Let X be a p×p positive-definite symmetric random matrix, such that X follows a Wishart distribution (→ II/5.2.1):
$$ X \sim \mathcal{W}(V, n) \; . \quad (1) $$
Then, X is described by the probability density function
$$ p(X) = \frac{1}{\Gamma_p\left(\frac{n}{2}\right)} \cdot \frac{1}{\sqrt{2^{np} \, |V|^n}} \cdot |X|^{(n-p-1)/2} \cdot \exp\left[ -\frac{1}{2} \mathrm{tr}\left( V^{-1} X \right) \right] \quad (2) $$
where |A| is a matrix determinant, A−1 is a matrix inverse and Γp (x) is the multivariate gamma
function of order p. If p = 1, then Γp (x) = Γ(x) is the ordinary gamma function, x = X and v = V
are real numbers. Thus, the probability density function (→ I/1.7.1) of x can be developed as
$$ \begin{split} p(x) &= \frac{1}{\Gamma\left(\frac{n}{2}\right)} \cdot \frac{1}{\sqrt{2^n v^n}} \cdot x^{(n-2)/2} \cdot \exp\left[ -\frac{1}{2} v^{-1} x \right] \\ &= \frac{(2v)^{-n/2}}{\Gamma\left(\frac{n}{2}\right)} \cdot x^{n/2-1} \cdot \exp\left[ -\frac{1}{2v} x \right] \; . \end{split} \quad (3) $$
Finally, substituting a = n/2 and b = 1/(2v), we get
$$ p(x) = \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \quad (4) $$
which is the probability density function of the gamma distribution (→ II/3.4.7).
■
X ∼ Gam(a, 1) . (1)
Sources:
• JoramSoch (2017): “Gamma-distributed random numbers”; in: MACS – a new SPM toolbox for
model assessment, comparison and selection, retrieved on 2020-05-26; URL: https://github.com/
JoramSoch/MACS/blob/master/MD_gamrnd.m; DOI: 10.5281/zenodo.845404.
• NIST/SEMATECH (2012): “Gamma distribution”; in: e-Handbook of Statistical Methods, ch.
1.3.6.6.11; URL: https://www.itl.nist.gov/div898/handbook/eda/section3/eda366b.htm; DOI: 10.18434/M
X ∼ Gam(a, b) . (1)
Then, the quantity Y = bX will have a standard gamma distribution (→ II/3.4.3) with shape a and
rate 1:
Y = bX ∼ Gam(a, 1) . (2)
Y = g(X) = bX (3)
with the inverse function
$$ X = g^{-1}(Y) = \frac{1}{b} Y \; . \quad (4) $$
Because b is positive, g(X) is strictly increasing and we can calculate the cumulative distribution
function of a strictly increasing function (→ I/1.8.3) as
$$ F_Y(y) = \begin{cases} 0 \; , & \text{if } y < \min(\mathcal{Y}) \\ F_X(g^{-1}(y)) \; , & \text{if } y \in \mathcal{Y} \\ 1 \; , & \text{if } y > \max(\mathcal{Y}) \; . \end{cases} \quad (5) $$
The cumulative distribution function of the gamma-distributed (→ II/3.4.8) X is
$$ F_X(x) = \int_{-\infty}^{x} \frac{b^a}{\Gamma(a)} \, t^{a-1} \exp[-bt] \, \mathrm{d}t \; . \quad (6) $$
Applying (5) to (6), we have:
$$ F_Y(y) \overset{(5)}{=} F_X(g^{-1}(y)) \overset{(6)}{=} \int_{-\infty}^{y/b} \frac{b^a}{\Gamma(a)} \, t^{a-1} \exp[-bt] \, \mathrm{d}t \; . \quad (7) $$
Substituting s = bt, this becomes:
$$ \begin{split} F_Y(y) &= \int_{-\infty}^{y} \frac{b^a}{\Gamma(a)} \left( \frac{s}{b} \right)^{a-1} \exp\left[ -b \frac{s}{b} \right] \mathrm{d}\left( \frac{s}{b} \right) \\ &= \frac{b^a}{\Gamma(a)} \cdot \frac{1}{b^{a-1}} \cdot \frac{1}{b} \int_{-\infty}^{y} s^{a-1} \exp[-s] \, \mathrm{d}s \\ &= \int_{-\infty}^{y} \frac{1}{\Gamma(a)} \, s^{a-1} \exp[-s] \, \mathrm{d}s \end{split} \quad (8) $$
which is the cumulative distribution function (→ I/1.8.1) of the standard gamma distribution (→
II/3.4.3).
X ∼ Gam(a, b) . (1)
Then, the quantity Y = bX will have a standard gamma distribution (→ II/3.4.3) with shape a and
rate 1:
Y = bX ∼ Gam(a, 1) . (2)
Y = g(X) = bX (3)
with the inverse function
$$ X = g^{-1}(Y) = \frac{1}{b} Y \; . \quad (4) $$
Because b is positive, g(X) is strictly increasing and we can calculate the probability density function
of a strictly increasing function (→ I/1.7.3) as
$$ f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \cdot \frac{\mathrm{d}g^{-1}(y)}{\mathrm{d}y} \; , & \text{if } y \in \mathcal{Y} \\ 0 \; , & \text{if } y \notin \mathcal{Y} \end{cases} \quad (5) $$
where Y = {y = g(x) : x ∈ X }. With the probability density function of the gamma distribution (→
II/3.4.7), we have
$$ \begin{split} f_Y(y) &= \frac{b^a}{\Gamma(a)} \left[ g^{-1}(y) \right]^{a-1} \exp\left[ -b \, g^{-1}(y) \right] \cdot \frac{\mathrm{d}g^{-1}(y)}{\mathrm{d}y} \\ &= \frac{b^a}{\Gamma(a)} \left( \frac{1}{b} y \right)^{a-1} \exp\left[ -b \frac{1}{b} y \right] \cdot \frac{\mathrm{d}\left( \frac{1}{b} y \right)}{\mathrm{d}y} \\ &= \frac{b^a}{\Gamma(a)} \cdot \frac{1}{b^{a-1}} \, y^{a-1} \exp[-y] \cdot \frac{1}{b} \\ &= \frac{1}{\Gamma(a)} \, y^{a-1} \exp[-y] \end{split} \quad (6) $$
which is the probability density function (→ I/1.7.1) of the standard gamma distribution (→ II/3.4.3).
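The standardization Y = bX ∼ Gam(a, 1) is straightforward to verify by simulation. Note that Python's `random.gammavariate` is parameterized by shape and *scale*, so a rate of b corresponds to a scale of 1/b; the parameter values below are illustrative assumptions.

```python
import random
import statistics

a, b = 3.0, 2.5          # assumed shape and rate
rng = random.Random(3)
n = 100_000

# random.gammavariate takes shape and SCALE, so rate b means scale 1/b
x = [rng.gammavariate(a, 1.0 / b) for _ in range(n)]
y = [b * xi for xi in x]  # should follow Gam(a, 1)

print(statistics.fmean(y), a)        # mean of Gam(a, 1) is a (→ II/3.4.10)
print(statistics.variance(y), a)     # variance of Gam(a, 1) is a (→ II/3.4.11)
```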
X ∼ Gam(a, b) . (1)
Then, the quantity Y = cX will also be gamma-distributed with shape a and rate b/c:
$$ Y = cX \sim \mathrm{Gam}\left( a, \frac{b}{c} \right) \; . \quad (2) $$
Y = g(X) = cX (3)
with the inverse function
$$ X = g^{-1}(Y) = \frac{1}{c} Y \; . \quad (4) $$
Because the parameters of a gamma distribution are positive (→ II/3.4.1), c must also be positive.
Thus, g(X) is strictly increasing and we can calculate the probability density function of a strictly
increasing function (→ I/1.7.3) as
$$ f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \cdot \frac{\mathrm{d}g^{-1}(y)}{\mathrm{d}y} \; , & \text{if } y \in \mathcal{Y} \\ 0 \; , & \text{if } y \notin \mathcal{Y} \; . \end{cases} \quad (5) $$
The probability density function of the gamma-distributed (→ II/3.4.7) X is
$$ f_X(x) = \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \; . \quad (6) $$
Applying (5) to (6), we have:
$$ \begin{split} f_Y(y) &= \frac{b^a}{\Gamma(a)} \left[ g^{-1}(y) \right]^{a-1} \exp\left[ -b \, g^{-1}(y) \right] \cdot \frac{\mathrm{d}g^{-1}(y)}{\mathrm{d}y} \\ &= \frac{b^a}{\Gamma(a)} \left( \frac{1}{c} y \right)^{a-1} \exp\left[ -b \frac{1}{c} y \right] \cdot \frac{\mathrm{d}\left( \frac{1}{c} y \right)}{\mathrm{d}y} \\ &= \frac{b^a}{\Gamma(a)} \cdot \frac{1}{c^{a-1}} \, y^{a-1} \exp\left[ -\frac{b}{c} y \right] \cdot \frac{1}{c} \\ &= \frac{(b/c)^a}{\Gamma(a)} \, y^{a-1} \exp\left[ -\frac{b}{c} y \right] \end{split} \quad (7) $$
which is the probability density function (→ I/1.7.1) of a gamma distribution (→ II/3.4.1) with
shape a and rate b/c.
X ∼ Gam(a, b) . (1)
Then, the probability density function (→ I/1.7.1) of X is
$$ f_X(x) = \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \; . \quad (2) $$
Proof: This follows directly from the definition of the gamma distribution (→ II/3.4.1).
X ∼ Gam(a, b) . (1)
Then, the cumulative distribution function (→ I/1.8.1) of X is
$$ F_X(x) = \frac{\gamma(a, bx)}{\Gamma(a)} \quad (2) $$
where Γ(x) is the gamma function and γ(s, x) is the lower incomplete gamma function.
Proof: The probability density function of the gamma distribution (→ II/3.4.7) is:
$$ f_X(x) = \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \; . \quad (3) $$
Thus, the cumulative distribution function (→ I/1.8.1) is:
$$ \begin{split} F_X(x) &= \int_{0}^{x} \mathrm{Gam}(z; a, b) \, \mathrm{d}z \\ &= \int_{0}^{x} \frac{b^a}{\Gamma(a)} \, z^{a-1} \exp[-bz] \, \mathrm{d}z \\ &= \frac{b^a}{\Gamma(a)} \int_{0}^{x} z^{a-1} \exp[-bz] \, \mathrm{d}z \; . \end{split} \quad (4) $$
Substituting t = bz, this becomes:
$$ \begin{split} F_X(x) &= \frac{b^a}{\Gamma(a)} \int_{b \cdot 0}^{bx} \left( \frac{t}{b} \right)^{a-1} \exp\left[ -b \frac{t}{b} \right] \mathrm{d}\left( \frac{t}{b} \right) \\ &= \frac{b^a}{\Gamma(a)} \cdot \frac{1}{b^{a-1}} \cdot \frac{1}{b} \int_{0}^{bx} t^{a-1} \exp[-t] \, \mathrm{d}t \\ &= \frac{1}{\Gamma(a)} \int_{0}^{bx} t^{a-1} \exp[-t] \, \mathrm{d}t \; . \end{split} \quad (5) $$
With the definition of the lower incomplete gamma function
$$ \gamma(s, x) = \int_{0}^{x} t^{s-1} \exp[-t] \, \mathrm{d}t \; , \quad (6) $$
we finally have:
$$ F_X(x) = \frac{\gamma(a, bx)}{\Gamma(a)} \; . \quad (7) $$
■
Sources:
• Wikipedia (2020): “Incomplete gamma function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-10-29; URL: https://en.wikipedia.org/wiki/Incomplete_gamma_function#Definition.
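Since the Python standard library has no incomplete gamma function, the CDF in equation (2) can be computed from the well-known series γ(a, x) = xᵃ e⁻ˣ Σₖ xᵏ / (a(a+1)⋯(a+k)) and compared against an empirical CDF from gamma draws. This is a sketch under that assumption; the parameter values are illustrative.

```python
import math
import random

def lower_incomplete_gamma(a, x, tol=1e-12):
    """Series expansion gamma(a, x) = x^a e^{-x} * sum_k x^k / (a (a+1) ... (a+k))."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    k = 0
    while term > tol * total:
        k += 1
        term *= x / (a + k)
        total += term
    return math.exp(a * math.log(x) - x) * total

def gamma_cdf(x, a, b):
    """F_X(x) = gamma(a, b*x) / Gamma(a), cf. equation (2)."""
    return lower_incomplete_gamma(a, b * x) / math.gamma(a)

# compare against an empirical CDF from gamma draws (rate b -> scale 1/b)
a, b, x0 = 2.0, 0.5, 3.0
rng = random.Random(11)
n = 100_000
emp = sum(rng.gammavariate(a, 1.0 / b) <= x0 for _ in range(n)) / n
print(gamma_cdf(x0, a, b), emp)
```

For a = 2 the CDF also has the closed form 1 − e⁻ᵇˣ(1 + bx), which the series reproduces.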
X ∼ Gam(a, b) . (1)
Then, the quantile function (→ I/1.9.1) of X is
$$ Q_X(p) = \begin{cases} -\infty \; , & \text{if } p = 0 \\ \gamma^{-1}(a, \Gamma(a) \cdot p) / b \; , & \text{if } p > 0 \end{cases} \quad (2) $$
where γ⁻¹(s, y) is the inverse of the lower incomplete gamma function γ(s, x).
Proof: The cumulative distribution function of the gamma distribution (→ II/3.4.8) is:
$$ F_X(x) = \begin{cases} 0 \; , & \text{if } x < 0 \\ \frac{\gamma(a, bx)}{\Gamma(a)} \; , & \text{if } x \geq 0 \; . \end{cases} \quad (3) $$
The quantile function (→ I/1.9.1) QX (p) is defined as the smallest x, such that FX (x) = p:
$$ \begin{split} p &= \frac{\gamma(a, bx)}{\Gamma(a)} \\ \Gamma(a) \cdot p &= \gamma(a, bx) \\ \gamma^{-1}(a, \Gamma(a) \cdot p) &= bx \\ x &= \frac{\gamma^{-1}(a, \Gamma(a) \cdot p)}{b} \; . \end{split} \quad (6) $$
■
Sources:
• Wikipedia (2020): “Incomplete gamma function”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-11-19; URL: https://en.wikipedia.org/wiki/Incomplete_gamma_function#Definition.
3.4.10 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following a gamma distribution (→ II/3.4.1):
X ∼ Gam(a, b) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
$$ \mathrm{E}(X) = \frac{a}{b} \; . \quad (2) $$
Proof: The expected value (→ I/1.10.1) is the probability-weighted average over all possible values:
$$ \mathrm{E}(X) = \int_{\mathcal{X}} x \cdot f_X(x) \, \mathrm{d}x \; . \quad (3) $$
With the probability density function of the gamma distribution (→ II/3.4.7), this reads:
$$ \begin{split} \mathrm{E}(X) &= \int_{0}^{\infty} x \cdot \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \, \mathrm{d}x \\ &= \int_{0}^{\infty} \frac{b^a}{\Gamma(a)} \, x^{(a+1)-1} \exp[-bx] \, \mathrm{d}x \\ &= \int_{0}^{\infty} \frac{1}{b} \cdot \frac{b^{a+1}}{\Gamma(a)} \, x^{(a+1)-1} \exp[-bx] \, \mathrm{d}x \; . \end{split} \quad (4) $$
Employing the relation Γ(x+1) = Γ(x) · x and the probability density function of the gamma distribution (→ II/3.4.7) with shape a + 1 and rate b, we get
$$ \begin{split} \mathrm{E}(X) &= \frac{a}{b} \int_{0}^{\infty} \mathrm{Gam}(x; a+1, b) \, \mathrm{d}x \\ &= \frac{a}{b} \; . \end{split} \quad (6) $$
■
Sources:
• Turlapaty, Anish (2013): “Gamma random variable: mean & variance”; in: YouTube, retrieved on
2020-05-19; URL: https://www.youtube.com/watch?v=Sy4wP-Y2dmA.
3.4.11 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following a gamma distribution (→ II/3.4.1):
X ∼ Gam(a, b) . (1)
Then, the variance (→ I/1.11.1) of X is
$$ \mathrm{Var}(X) = \frac{a}{b^2} \; . \quad (2) $$
Proof: The variance (→ I/1.11.1) can be expressed in terms of expected values (→ I/1.11.3) as
$$ \mathrm{Var}(X) = \mathrm{E}(X^2) - \mathrm{E}(X)^2 \; . \quad (3) $$
The mean of the gamma distribution (→ II/3.4.10) is
$$ \mathrm{E}(X) = \frac{a}{b} \; . \quad (4) $$
With the probability density function of the gamma distribution (→ II/3.4.7), the second moment is:
$$ \begin{split} \mathrm{E}(X^2) &= \int_{0}^{\infty} x^2 \cdot \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \, \mathrm{d}x \\ &= \int_{0}^{\infty} \frac{b^a}{\Gamma(a)} \, x^{(a+2)-1} \exp[-bx] \, \mathrm{d}x \\ &= \int_{0}^{\infty} \frac{1}{b^2} \cdot \frac{b^{a+2}}{\Gamma(a)} \, x^{(a+2)-1} \exp[-bx] \, \mathrm{d}x \; . \end{split} \quad (5) $$
Twice applying the relation Γ(x+1) = Γ(x) · x, we have
$$ \mathrm{E}(X^2) = \frac{a\,(a+1)}{b^2} \int_{0}^{\infty} \frac{b^{a+2}}{\Gamma(a+2)} \, x^{(a+2)-1} \exp[-bx] \, \mathrm{d}x \quad (6) $$
and again using the density of the gamma distribution (→ II/3.4.7), we get
$$ \begin{split} \mathrm{E}(X^2) &= \frac{a\,(a+1)}{b^2} \int_{0}^{\infty} \mathrm{Gam}(x; a+2, b) \, \mathrm{d}x \\ &= \frac{a^2 + a}{b^2} \; . \end{split} \quad (7) $$
Plugging (7) and (4) into (3), the variance of a gamma random variable finally becomes
$$ \begin{split} \mathrm{Var}(X) &= \frac{a^2 + a}{b^2} - \left( \frac{a}{b} \right)^2 \\ &= \frac{a}{b^2} \; . \end{split} \quad (8) $$
■
Sources:
• Turlapaty, Anish (2013): “Gamma random variable: mean & variance”; in: YouTube, retrieved on
2020-05-19; URL: https://www.youtube.com/watch?v=Sy4wP-Y2dmA.
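The mean a/b and variance a/b² derived above can be confirmed by simulation. This stdlib-only sketch again uses `random.gammavariate`, which takes shape and *scale* (= 1/rate); the parameter values are illustrative assumptions.

```python
import random
import statistics

a, b = 4.0, 1.5   # assumed shape and rate
rng = random.Random(5)
n = 200_000
# random.gammavariate takes shape and SCALE, so rate b corresponds to scale 1/b
x = [rng.gammavariate(a, 1.0 / b) for _ in range(n)]

print(statistics.fmean(x), a / b)          # mean a/b (→ II/3.4.10)
print(statistics.variance(x), a / b**2)    # variance a/b^2 (→ II/3.4.11)
```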
X ∼ Gam(a, b) . (1)
Then, the expectation (→ I/1.10.1) of the natural logarithm of X is
$$ \mathrm{E}(\ln X) = \psi(a) - \ln(b) \; , \quad (2) $$
where ψ(x) is the digamma function.
Proof: Let Y = ln(X), such that E(Y ) = E(ln X) and consider the special case that b = 1. In this
case, the probability density function of the gamma distribution (→ II/3.4.7) is
$$ f_X(x) = \frac{1}{\Gamma(a)} \, x^{a-1} \exp[-x] \; . \quad (3) $$
Multiplying this function with dx, we obtain
$$ f_X(x) \, \mathrm{d}x = \frac{1}{\Gamma(a)} \, x^{a} \exp[-x] \, \frac{\mathrm{d}x}{x} \; . \quad (4) $$
Substituting y = ln x, i.e. x = ey , such that dx/dy = x, i.e. dx/x = dy, we get
$$ \begin{split} f_Y(y) \, \mathrm{d}y &= \frac{1}{\Gamma(a)} \left( e^y \right)^a \exp\left[ -e^y \right] \mathrm{d}y \\ &= \frac{1}{\Gamma(a)} \exp\left[ ay - e^y \right] \mathrm{d}y \; . \end{split} \quad (5) $$
Because any probability density function integrates to one (→ I/1.7.1), we have:
$$ \begin{split} 1 &= \int_{\mathbb{R}} f_Y(y) \, \mathrm{d}y \\ 1 &= \int_{\mathbb{R}} \frac{1}{\Gamma(a)} \exp\left[ ay - e^y \right] \mathrm{d}y \\ \Gamma(a) &= \int_{\mathbb{R}} \exp\left[ ay - e^y \right] \mathrm{d}y \; . \end{split} \quad (6) $$
Differentiating the integrand with respect to a gives
$$ \frac{\mathrm{d}}{\mathrm{d}a} \exp\left[ ay - e^y \right] \mathrm{d}y = y \exp\left[ ay - e^y \right] \mathrm{d}y \overset{(5)}{=} \Gamma(a) \, y \, f_Y(y) \, \mathrm{d}y \; . \quad (7) $$
$$ \begin{split} \mathrm{E}(Y) &= \int_{\mathbb{R}} y \, f_Y(y) \, \mathrm{d}y \\ &\overset{(7)}{=} \frac{1}{\Gamma(a)} \int_{\mathbb{R}} \frac{\mathrm{d}}{\mathrm{d}a} \exp\left[ ay - e^y \right] \mathrm{d}y \\ &= \frac{1}{\Gamma(a)} \frac{\mathrm{d}}{\mathrm{d}a} \int_{\mathbb{R}} \exp\left[ ay - e^y \right] \mathrm{d}y \\ &\overset{(6)}{=} \frac{1}{\Gamma(a)} \frac{\mathrm{d}}{\mathrm{d}a} \Gamma(a) \\ &= \frac{\Gamma'(a)}{\Gamma(a)} \; . \end{split} \quad (8) $$
Using the derivative of a logarithmized function
$$ \frac{\mathrm{d}}{\mathrm{d}x} \ln f(x) = \frac{f'(x)}{f(x)} \quad (9) $$
and the definition of the digamma function
$$ \psi(x) = \frac{\mathrm{d}}{\mathrm{d}x} \ln \Gamma(x) \; , \quad (10) $$
we have E(Y) = ψ(a) for b = 1. For arbitrary b, note that X ∼ Gam(a, b) implies bX ∼ Gam(a, 1) (→ II/3.4.4), such that ln(bX) = ln b + ln X has expectation ψ(a). Thus, it follows that
$$ \mathrm{E}(\ln X) = \psi(a) - \ln(b) \; . $$
Sources:
• whuber (2018): “What is the expected value of the logarithm of Gamma distribution?”; in:
StackExchange CrossValidated, retrieved on 2020-05-25; URL: https://stats.stackexchange.com/
questions/370880/what-is-the-expected-value-of-the-logarithm-of-gamma-distribution.
3.4.13 Expectation of x ln x
Theorem: Let X be a random variable (→ I/1.2.2) following a gamma distribution (→ II/3.4.1):
X ∼ Gam(a, b) . (1)
Then, the mean or expected value (→ I/1.10.1) of (X · ln X) is
$$ \mathrm{E}(X \ln X) = \frac{a}{b} \left[ \psi(a+1) - \ln(b) \right] \; . \quad (2) $$
b
Proof: With the definition of the expected value (→ I/1.10.1), the law of the unconscious statistician
(→ I/1.10.12) and the probability density function of the gamma distribution (→ II/3.4.7), we have:
$$ \begin{split} \mathrm{E}(X \ln X) &= \int_{0}^{\infty} x \ln x \cdot \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \, \mathrm{d}x \\ &= \frac{1}{\Gamma(a)} \int_{0}^{\infty} \ln x \cdot \frac{b^{a+1}}{b} \, x^{a} \exp[-bx] \, \mathrm{d}x \\ &= \frac{\Gamma(a+1)}{\Gamma(a) \, b} \int_{0}^{\infty} \ln x \cdot \frac{b^{a+1}}{\Gamma(a+1)} \, x^{(a+1)-1} \exp[-bx] \, \mathrm{d}x \; . \end{split} \quad (3) $$
The integral now corresponds to the logarithmic expectation of a gamma distribution (→ II/3.4.12) with shape a + 1 and rate b, which equals ψ(a+1) − ln(b). Using this and the relation
$$ \Gamma(x+1) = \Gamma(x) \cdot x \quad \Leftrightarrow \quad \frac{\Gamma(x+1)}{\Gamma(x)} = x \; , \quad (6) $$
the expression in equation (3) develops into:
$$ \mathrm{E}(X \ln X) = \frac{a}{b} \left[ \psi(a+1) - \ln(b) \right] \; . \quad (7) $$
■
Sources:
• gunes (2020): “What is the expected value of x log(x) of the gamma distribution?”; in: StackEx-
change CrossValidated, retrieved on 2020-10-15; URL: https://stats.stackexchange.com/questions/
457357/what-is-the-expected-value-of-x-logx-of-the-gamma-distribution.
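Both logarithmic expectations, E(ln X) = ψ(a) − ln b and E(X ln X) = (a/b)[ψ(a+1) − ln b], can be checked by simulation. The Python standard library has no digamma function, so the sketch below approximates ψ(x) with a central difference on `math.lgamma`; the parameter values are illustrative assumptions.

```python
import math
import random
import statistics

def digamma(x, h=1e-5):
    """Numerical digamma psi(x) = d/dx ln Gamma(x), via a central difference on math.lgamma."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

a, b = 2.5, 3.0   # assumed shape and rate
rng = random.Random(9)
n = 200_000
x = [rng.gammavariate(a, 1.0 / b) for _ in range(n)]   # gammavariate uses scale = 1/rate

print(statistics.fmean(math.log(xi) for xi in x), digamma(a) - math.log(b))
print(statistics.fmean(xi * math.log(xi) for xi in x),
      a / b * (digamma(a + 1) - math.log(b)))
```

Each printed pair should agree up to Monte Carlo error.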
X ∼ Gam(a, b) (1)
Then, the differential entropy (→ I/2.2.1) of X in nats is
$$ \mathrm{h}(X) = a + \ln \Gamma(a) + (1-a) \cdot \psi(a) - \ln b \; . \quad (2) $$
Proof: The differential entropy is the negative expected value of the logarithm of the probability density function (→ II/3.4.7):
$$ \begin{split} \mathrm{h}(X) &= -\mathrm{E}\left[ \ln \left( \frac{b^a}{\Gamma(a)} \, x^{a-1} \exp[-bx] \right) \right] \\ &= -\mathrm{E}\left[ a \cdot \ln b - \ln \Gamma(a) + (a-1) \ln x - bx \right] \\ &= -a \cdot \ln b + \ln \Gamma(a) - (a-1) \cdot \mathrm{E}(\ln x) + b \cdot \mathrm{E}(x) \; . \end{split} \quad (5) $$
Using the mean (→ II/3.4.10) and logarithmic expectation (→ II/3.4.12) of the gamma distribution
(→ II/3.4.1)
$$ X \sim \mathrm{Gam}(a, b) \quad \Rightarrow \quad \mathrm{E}(X) = \frac{a}{b} \quad \text{and} \quad \mathrm{E}(\ln X) = \psi(a) - \ln(b) \; , \quad (6) $$
the differential entropy (→ I/2.2.1) of X becomes:
$$ \begin{split} \mathrm{h}(X) &= -a \cdot \ln b + \ln \Gamma(a) - (a-1) \cdot (\psi(a) - \ln b) + b \cdot \frac{a}{b} \\ &= -a \cdot \ln b + \ln \Gamma(a) + (1-a) \cdot \psi(a) + a \cdot \ln b - \ln b + a \\ &= a + \ln \Gamma(a) + (1-a) \cdot \psi(a) - \ln b \; . \end{split} \quad (7) $$
Sources:
• Wikipedia (2021): “Gamma distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
07-14; URL: https://en.wikipedia.org/wiki/Gamma_distribution#Information_entropy.
$$ P: \; X \sim \mathrm{Gam}(a_1, b_1) \\ Q: \; X \sim \mathrm{Gam}(a_2, b_2) \; . \quad (1) $$
Then, the Kullback-Leibler divergence (→ I/2.5.1) of P from Q is
$$ \mathrm{KL}[P\,||\,Q] = a_2 \ln \frac{b_1}{b_2} - \ln \frac{\Gamma(a_1)}{\Gamma(a_2)} + (a_1 - a_2) \, \psi(a_1) - (b_1 - b_2) \, \frac{a_1}{b_1} \; . \quad (2) $$
$$ \begin{split} \mathrm{KL}[P\,||\,Q] &= \int_{-\infty}^{+\infty} \mathrm{Gam}(x; a_1, b_1) \, \ln \frac{\mathrm{Gam}(x; a_1, b_1)}{\mathrm{Gam}(x; a_2, b_2)} \, \mathrm{d}x \\ &= \left\langle \ln \frac{\mathrm{Gam}(x; a_1, b_1)}{\mathrm{Gam}(x; a_2, b_2)} \right\rangle_{p(x)} \; . \end{split} \quad (4) $$
Using the probability density function of the gamma distribution (→ II/3.4.7), this becomes:
$$ \begin{split} \mathrm{KL}[P\,||\,Q] &= \left\langle \ln \frac{ \frac{b_1^{a_1}}{\Gamma(a_1)} \, x^{a_1-1} \exp[-b_1 x] }{ \frac{b_2^{a_2}}{\Gamma(a_2)} \, x^{a_2-1} \exp[-b_2 x] } \right\rangle_{p(x)} \\ &= \left\langle \ln \left( \frac{b_1^{a_1}}{b_2^{a_2}} \cdot \frac{\Gamma(a_2)}{\Gamma(a_1)} \cdot x^{a_1-a_2} \cdot \exp[-(b_1-b_2)x] \right) \right\rangle_{p(x)} \; . \end{split} \quad (5) $$
Using the mean of the gamma distribution (→ II/3.4.10) and the expected value of a logarithmized gamma variate (→ II/3.4.12)
$$ x \sim \mathrm{Gam}(a, b) \quad \Rightarrow \quad \langle x \rangle = \frac{a}{b} \quad \text{and} \quad \langle \ln x \rangle = \psi(a) - \ln(b) \; , \quad (6) $$
the Kullback-Leibler divergence becomes:
$$ \begin{split} \mathrm{KL}[P\,||\,Q] &= a_1 \cdot \ln b_1 - a_2 \cdot \ln b_2 - \ln \Gamma(a_1) + \ln \Gamma(a_2) + (a_1 - a_2) \cdot (\psi(a_1) - \ln(b_1)) - (b_1 - b_2) \cdot \frac{a_1}{b_1} \\ &= a_2 \cdot \ln b_1 - a_2 \cdot \ln b_2 - \ln \Gamma(a_1) + \ln \Gamma(a_2) + (a_1 - a_2) \cdot \psi(a_1) - (b_1 - b_2) \cdot \frac{a_1}{b_1} \; . \end{split} \quad (7) $$
This is equivalent to equation (2):
$$ \mathrm{KL}[P\,||\,Q] = a_2 \ln \frac{b_1}{b_2} - \ln \frac{\Gamma(a_1)}{\Gamma(a_2)} + (a_1 - a_2) \, \psi(a_1) - (b_1 - b_2) \, \frac{a_1}{b_1} \; . \quad (8) $$
■
Sources:
• Penny, William D. (2001): “KL-Divergences of Normal, Gamma, Dirichlet and Wishart densi-
ties”; in: University College, London; URL: https://www.fil.ion.ucl.ac.uk/~wpenny/publications/
densities.ps.
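The closed-form gamma KL divergence can be checked against a Monte Carlo estimate of E_P[ln p(x) − ln q(x)], using log-densities via `math.lgamma` and a finite-difference digamma (the stdlib has none). This is a stdlib-only sketch; the parameter values are illustrative assumptions.

```python
import math
import random

def kl_gamma(a1, b1, a2, b2):
    """Closed-form KL divergence between Gam(a1, b1) and Gam(a2, b2), cf. equation (2)."""
    # digamma via central difference on math.lgamma (stdlib has no digamma)
    h = 1e-5
    psi_a1 = (math.lgamma(a1 + h) - math.lgamma(a1 - h)) / (2 * h)
    return (a2 * math.log(b1 / b2) - math.lgamma(a1) + math.lgamma(a2)
            + (a1 - a2) * psi_a1 - (b1 - b2) * a1 / b1)

def kl_gamma_mc(a1, b1, a2, b2, n=200_000, seed=13):
    """Monte Carlo estimate of E_P[ln p(x) - ln q(x)] with x ~ Gam(a1, b1)."""
    rng = random.Random(seed)
    def log_pdf(x, a, b):
        return a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(x) - b * x
    total = 0.0
    for _ in range(n):
        x = rng.gammavariate(a1, 1.0 / b1)   # scale = 1/rate
        total += log_pdf(x, a1, b1) - log_pdf(x, a2, b2)
    return total / n

print(kl_gamma(2.0, 1.0, 3.0, 2.0), kl_gamma_mc(2.0, 1.0, 3.0, 2.0))
```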
X ∼ Exp(λ) , (1)
if and only if its probability density function (→ I/1.7.1) is given by
$$ \mathrm{Exp}(x; \lambda) = \lambda \exp[-\lambda x] \; , \quad x \geq 0 \; , \quad (2) $$
where λ > 0, and the density is zero, if x < 0.
Sources:
• Wikipedia (2020): “Exponential distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-02-08; URL: https://en.wikipedia.org/wiki/Exponential_distribution#Definitions.
$$ \begin{split} \mathrm{Gam}(x; 1, \lambda) &= \frac{\lambda^1}{\Gamma(1)} \, x^{1-1} \exp[-\lambda x] \\ &= \lambda \, \frac{x^0}{\Gamma(1)} \exp[-\lambda x] \\ &= \lambda \exp[-\lambda x] \end{split} \quad (2) $$
which is equivalent to the probability density function of the exponential distribution (→ II/3.5.3).
■
X ∼ Exp(λ) . (1)
Then, the probability density function (→ I/1.7.1) of X is
$$ f_X(x) = \lambda \exp[-\lambda x] \; , \quad x \geq 0 \; , $$
and zero otherwise.
Proof: This follows directly from the definition of the exponential distribution (→ II/3.5.1).
■
X ∼ Exp(λ) . (1)
Then, the moment generating function (→ I/1.9.5) of X is
$$ M_X(t) = \frac{\lambda}{\lambda - t} \quad (2) $$
which is well-defined for t < λ.
Proof: Suppose X follows an exponential distribution (→ II/3.5.1) with rate λ; that is, X ∼ Exp(λ).
Then, the probability density function (→ II/3.5.3) is given by fX(x) = λe^{−λx} for x ≥ 0, such that the moment-generating function (→ I/1.9.5) is
$$ \begin{split} M_X(t) &= \mathrm{E}\left[ e^{tX} \right] = \int_{0}^{\infty} e^{tx} \, \lambda e^{-\lambda x} \, \mathrm{d}x \\ &= \lambda \int_{0}^{\infty} e^{x(t-\lambda)} \, \mathrm{d}x = \frac{\lambda}{t-\lambda} \lim_{x \to \infty} \left[ e^{x(t-\lambda)} - 1 \right] \; . \end{split} \quad (5) $$
Note that t cannot be equal to λ, else MX (t) is undefined. Further, if t > λ, then limx→∞ ex(t−λ) = ∞,
which implies that MX (t) diverges for t ≥ λ. So, we must restrict the domain of MX (t) to t < λ.
Assuming this, we can further simplify (5):
$$ \begin{split} M_X(t) &= \frac{\lambda}{t-\lambda} \lim_{x \to \infty} \left[ e^{x(t-\lambda)} - 1 \right] \\ &= \frac{\lambda}{t-\lambda} \left[ 0 - 1 \right] \\ &= \frac{\lambda}{\lambda - t} \; . \end{split} \quad (6) $$
This completes the proof of (2).
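The MGF result can be verified by estimating E[e^{tX}] directly from exponential draws; the rate and the argument t < λ below are illustrative assumptions.

```python
import math
import random

lam, t = 2.0, 0.5   # assumed rate and an argument t < lambda
rng = random.Random(17)
n = 200_000

# Monte Carlo estimate of E[e^{tX}] with X ~ Exp(lambda)
mgf_mc = sum(math.exp(t * rng.expovariate(lam)) for _ in range(n)) / n
mgf_exact = lam / (lam - t)   # equation (2)

print(mgf_mc, mgf_exact)   # both should be close to 4/3
```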
X ∼ Exp(λ) . (1)
Then, the cumulative distribution function (→ I/1.8.1) of X is
$$ F_X(x) = \begin{cases} 0 \; , & \text{if } x < 0 \\ 1 - \exp[-\lambda x] \; , & \text{if } x \geq 0 \; . \end{cases} \quad (2) $$
Proof: The probability density function of the exponential distribution (→ II/3.5.3) is:
$$ \mathrm{Exp}(x; \lambda) = \begin{cases} 0 \; , & \text{if } x < 0 \\ \lambda \exp[-\lambda x] \; , & \text{if } x \geq 0 \; . \end{cases} \quad (3) $$
If x < 0, we have:
$$ F_X(x) = \int_{-\infty}^{x} 0 \, \mathrm{d}z = 0 \; . \quad (5) $$
If x ≥ 0, we have:
$$ \begin{split} F_X(x) &= \int_{-\infty}^{0} \mathrm{Exp}(z; \lambda) \, \mathrm{d}z + \int_{0}^{x} \mathrm{Exp}(z; \lambda) \, \mathrm{d}z \\ &= \int_{-\infty}^{0} 0 \, \mathrm{d}z + \int_{0}^{x} \lambda \exp[-\lambda z] \, \mathrm{d}z \\ &= 0 + \lambda \left[ -\frac{1}{\lambda} \exp[-\lambda z] \right]_{0}^{x} \\ &= \lambda \left[ \left( -\frac{1}{\lambda} \exp[-\lambda x] \right) - \left( -\frac{1}{\lambda} \exp[-\lambda \cdot 0] \right) \right] \\ &= 1 - \exp[-\lambda x] \; . \end{split} \quad (6) $$
X ∼ Exp(λ) . (1)
Then, the quantile function (→ I/1.9.1) of X is
$$ Q_X(p) = \begin{cases} -\infty \; , & \text{if } p = 0 \\ -\frac{\ln(1-p)}{\lambda} \; , & \text{if } p > 0 \; . \end{cases} \quad (2) $$
Proof: The cumulative distribution function of the exponential distribution (→ II/3.5.5) is:
$$ F_X(x) = \begin{cases} 0 \; , & \text{if } x < 0 \\ 1 - \exp[-\lambda x] \; , & \text{if } x \geq 0 \; . \end{cases} \quad (3) $$
The quantile function (→ I/1.9.1) QX (p) is defined as the smallest x, such that FX (x) = p:
$$ \begin{split} p &= 1 - \exp[-\lambda x] \\ \exp[-\lambda x] &= 1 - p \\ -\lambda x &= \ln(1-p) \\ x &= -\frac{\ln(1-p)}{\lambda} \; . \end{split} \quad (6) $$
■
3.5.7 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following an exponential distribution (→
II/3.5.1):
X ∼ Exp(λ) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
$$ \mathrm{E}(X) = \frac{1}{\lambda} \; . \quad (2) $$
Proof: The expected value (→ I/1.10.1) is the probability-weighted average over all possible values:
$$ \mathrm{E}(X) = \int_{\mathcal{X}} x \cdot f_X(x) \, \mathrm{d}x \; . \quad (3) $$
With the probability density function of the exponential distribution (→ II/3.5.3), this reads:
$$ \begin{split} \mathrm{E}(X) &= \int_{0}^{+\infty} x \cdot \lambda \exp(-\lambda x) \, \mathrm{d}x \\ &= \lambda \int_{0}^{+\infty} x \cdot \exp(-\lambda x) \, \mathrm{d}x \; . \end{split} \quad (4) $$
Since the antiderivative of x · exp(−λx) is −(x/λ + 1/λ²) · exp(−λx), this becomes:
$$ \begin{split} \mathrm{E}(X) &= \lambda \left[ \left( -\frac{x}{\lambda} - \frac{1}{\lambda^2} \right) \exp(-\lambda x) \right]_{0}^{+\infty} \\ &= \lambda \left( \lim_{x \to \infty} \left[ \left( -\frac{x}{\lambda} - \frac{1}{\lambda^2} \right) \exp(-\lambda x) \right] - \left( -\frac{0}{\lambda} - \frac{1}{\lambda^2} \right) \exp(-\lambda \cdot 0) \right) \\ &= \lambda \left( 0 + \frac{1}{\lambda^2} \right) \\ &= \frac{1}{\lambda} \; . \end{split} \quad (6) $$
■
Sources:
• Koch, Karl-Rudolf (2007): “Expected Value”; in: Introduction to Bayesian Statistics, Springer,
Berlin/Heidelberg, 2007, p. 39, eq. 2.142a; URL: https://www.springer.com/de/book/9783540727231;
DOI: 10.1007/978-3-540-72726-2.
3.5.8 Median
Theorem: Let X be a random variable (→ I/1.2.2) following an exponential distribution (→
II/3.5.1):
X ∼ Exp(λ) . (1)
Then, the median (→ I/1.15.1) of X is
$$ \mathrm{median}(X) = \frac{\ln 2}{\lambda} \; . \quad (2) $$
Proof: The median (→ I/1.15.1) is the value at which the cumulative distribution function (→
I/1.8.1) is 1/2:
$$ F_X(\mathrm{median}(X)) = \frac{1}{2} \; . \quad (3) $$
The cumulative distribution function of the exponential distribution (→ II/3.5.5) is
$$ F_X(x) = 1 - \exp[-\lambda x] \; , \quad x \geq 0 \; , \quad (4) $$
and the corresponding quantile function (→ II/3.5.6) is
$$ x = -\frac{\ln(1-p)}{\lambda} \; . \quad (5) $$
Setting p = 1/2, we obtain:
$$ \mathrm{median}(X) = -\frac{\ln\left(1-\frac{1}{2}\right)}{\lambda} = \frac{\ln 2}{\lambda} \; . \quad (6) $$
■
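The mean 1/λ (→ II/3.5.7) and the median ln(2)/λ just derived are easy to confirm by simulation with `random.expovariate`; the rate below is an illustrative assumption.

```python
import math
import random
import statistics

lam = 1.5   # assumed rate
rng = random.Random(21)
n = 200_000
x = [rng.expovariate(lam) for _ in range(n)]

print(statistics.fmean(x), 1 / lam)             # mean: 1/lambda
print(statistics.median(x), math.log(2) / lam)  # median: ln(2)/lambda
```

Note that the median lies below the mean, reflecting the right skew of the distribution.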
3.5.9 Mode
Theorem: Let X be a random variable (→ I/1.2.2) following an exponential distribution (→
II/3.5.1):
X ∼ Exp(λ) . (1)
Then, the mode (→ I/1.15.2) of X is
mode(X) = 0 . (2)
Proof: The mode (→ I/1.15.2) is the value which maximizes the probability density function (→
I/1.7.1):
Since
$$ f_X(0) = \lambda \quad (5) $$
and
$$ f_X(x) = \lambda \exp[-\lambda x] < \lambda \quad \text{for all} \quad x > 0 \; , \quad (6) $$
the probability density function is maximal at zero:
$$ \mathrm{mode}(X) = 0 \; . \quad (7) $$
3.5.10 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following an exponential distribution (→
II/3.5.1):
X ∼ Exp(λ). (1)
Then, the variance (→ I/1.11.1) of X is
$$ \mathrm{Var}(X) = \frac{1}{\lambda^2} \; . \quad (2) $$
Proof: The variance (→ I/1.11.1) can be expressed in terms of expected values (→ I/1.11.3) as
$$ \mathrm{Var}(X) = \mathrm{E}(X^2) - \mathrm{E}(X)^2 \; , \quad (4) $$
and the mean of the exponential distribution (→ II/3.5.7) is
$$ \mathrm{E}(X) = \frac{1}{\lambda} \; . \quad (5) $$
With the probability density function of the exponential distribution (→ II/3.5.3), the second moment is:
$$ \begin{split} \mathrm{E}[X^2] &= \int_{-\infty}^{+\infty} x^2 \cdot f_X(x) \, \mathrm{d}x \\ &= \int_{0}^{+\infty} x^2 \cdot \lambda \exp(-\lambda x) \, \mathrm{d}x \\ &= \lambda \int_{0}^{+\infty} x^2 \cdot \exp(-\lambda x) \, \mathrm{d}x \; . \end{split} \quad (6) $$
Using integration by parts twice, the integral evaluates to
$$ \begin{split} \int_{0}^{+\infty} x^2 \cdot \exp(-\lambda x) \, \mathrm{d}x &= \left[ -\frac{1}{\lambda} x^2 \cdot \exp(-\lambda x) \right]_{0}^{+\infty} + \frac{1}{\lambda} \int_{0}^{+\infty} 2x \cdot \exp(-\lambda x) \, \mathrm{d}x \\ &= \left[ -\frac{x^2}{\lambda} \cdot \exp(-\lambda x) \right]_{0}^{+\infty} + \left[ -\frac{2x}{\lambda^2} \cdot \exp(-\lambda x) \right]_{0}^{+\infty} + \frac{2}{\lambda^2} \int_{0}^{+\infty} \exp(-\lambda x) \, \mathrm{d}x \\ &= \left[ \left( -\frac{x^2}{\lambda} - \frac{2x}{\lambda^2} - \frac{2}{\lambda^3} \right) \exp(-\lambda x) \right]_{0}^{+\infty} \; , \end{split} \quad (7) $$
$$ \begin{split} \mathrm{E}[X^2] &= \lambda \left[ \left( -\frac{x^2}{\lambda} - \frac{2x}{\lambda^2} - \frac{2}{\lambda^3} \right) \exp(-\lambda x) \right]_{0}^{+\infty} \\ &= \lambda \left( \lim_{x \to \infty} \left[ \left( -\frac{x^2}{\lambda} - \frac{2x}{\lambda^2} - \frac{2}{\lambda^3} \right) \exp(-\lambda x) \right] - \left( -0 - 0 - \frac{2}{\lambda^3} \right) \exp(-\lambda \cdot 0) \right) \\ &= \lambda \left( 0 + \frac{2}{\lambda^3} \right) \\ &= \frac{2}{\lambda^2} \; . \end{split} \quad (8) $$
Plugging (8) and (5) into (4), we have:
$$ \begin{split} \mathrm{Var}(X) &= \mathrm{E}\left[ X^2 \right] - \mathrm{E}[X]^2 \\ &= \frac{2}{\lambda^2} - \left( \frac{1}{\lambda} \right)^2 \\ &= \frac{2}{\lambda^2} - \frac{1}{\lambda^2} \\ &= \frac{1}{\lambda^2} \; . \end{split} \quad (9) $$
■
Sources:
• Taboga, Marco (2023): “Exponential distribution”; in: Lectures on probability theory and mathe-
matical statistics, retrieved on 2023-01-23; URL: https://www.statlect.com/probability-distributions/
exponential-distribution.
• Wikipedia (2023): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2023-01-23; URL:
https://en.wikipedia.org/wiki/Variance#Definition.
3.5.11 Skewness
Theorem: Let X be a random variable (→ I/1.2.2) following an exponential distribution (→
II/3.5.1):
X ∼ Exp(λ) . (1)
Then the skewness (→ I/1.12.1) of X is
Skew(X) = 2 . (2)
Proof:
To compute the skewness of X, we partition the skewness into expected values (→ I/1.12.3):
$$ \mathrm{Skew}(X) = \frac{\mathrm{E}(X^3) - 3\mu\sigma^2 - \mu^3}{\sigma^3} \; , \quad (3) $$
where µ and σ are the mean and standard deviation of X, respectively. Since X follows an exponential
distribution (→ II/3.5.1), the mean (→ II/3.5.7) of X is given by
$$ \mu = \mathrm{E}(X) = \frac{1}{\lambda} \quad (4) $$
and the standard deviation (→ II/3.5.10) of X is given by
$$ \sigma = \sqrt{\mathrm{Var}(X)} = \sqrt{\frac{1}{\lambda^2}} = \frac{1}{\lambda} \; . \quad (5) $$
Substituting (4) and (5) into (3) gives:
$$ \begin{split} \mathrm{Skew}(X) &= \frac{\mathrm{E}(X^3) - 3\mu\sigma^2 - \mu^3}{\sigma^3} \\ &= \frac{\mathrm{E}(X^3)}{\sigma^3} - \frac{3\mu\sigma^2 + \mu^3}{\sigma^3} \\ &= \frac{\mathrm{E}(X^3)}{\left(\frac{1}{\lambda}\right)^3} - \frac{3 \cdot \frac{1}{\lambda} \cdot \left(\frac{1}{\lambda}\right)^2 + \left(\frac{1}{\lambda}\right)^3}{\left(\frac{1}{\lambda}\right)^3} \\ &= \lambda^3 \cdot \mathrm{E}(X^3) - \frac{\frac{3}{\lambda^3} + \frac{1}{\lambda^3}}{\frac{1}{\lambda^3}} \\ &= \lambda^3 \cdot \mathrm{E}(X^3) - 4 \; . \end{split} \quad (6) $$
Thus, the remaining work is to compute E(X³). To do this, we use the moment-generating function of the exponential distribution (→ II/3.5.4), MX(t) = λ/(λ−t). Repeatedly applying the chain rule gives
$$ M_X'(t) = \frac{\lambda}{(\lambda-t)^2} \; , \quad M_X''(t) = \frac{2\lambda}{(\lambda-t)^3} \; , \quad M_X'''(t) = \frac{6\lambda}{(\lambda-t)^4} \; , $$
such that the third moment is
$$ \mathrm{E}(X^3) = M_X'''(0) = \frac{6}{\lambda^3} \; . $$
Substituting this into (6) yields
$$ \begin{split} \mathrm{Skew}(X) &= \lambda^3 \cdot \mathrm{E}(X^3) - 4 \\ &= \lambda^3 \cdot \frac{6}{\lambda^3} - 4 \\ &= 6 - 4 \\ &= 2 \; . \end{split} \quad (13) $$
■
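Both the third moment E(X³) = 6/λ³ and the skewness of 2 can be checked by simulation; Monte Carlo estimates of higher moments converge slowly, so the tolerances are generous. The rate below is an illustrative assumption.

```python
import random
import statistics

lam = 2.0   # assumed rate
rng = random.Random(23)
n = 400_000
x = [rng.expovariate(lam) for _ in range(n)]

third_moment = statistics.fmean(xi**3 for xi in x)
print(third_moment, 6 / lam**3)   # E(X^3) = M'''(0) = 6/lambda^3

mu_hat = statistics.fmean(x)
sd_hat = statistics.pstdev(x)
skew_hat = statistics.fmean(((xi - mu_hat) / sd_hat)**3 for xi in x)
print(skew_hat)   # should be close to 2, independent of lambda
```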
Definition: Let Y be a random variable (→ I/1.2.2) following a normal distribution (→ II/3.2.1), Y ∼ N(µ, σ²). Then, the exponential function of Y is said to have a log-normal distribution with location parameter µ and scale parameter σ:
$$ X = \exp(Y) \sim \ln\mathcal{N}(\mu, \sigma^2) \; . $$
Sources:
• Wikipedia (2022): “Log-normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-02-07; URL: https://en.wikipedia.org/wiki/Log-normal_distribution.
X ∼ ln N (µ, σ 2 ) . (1)
Then, the probability density function (→ I/1.7.1) of X is given by:
$$ f_X(x) = \frac{1}{x \sigma \sqrt{2\pi}} \cdot \exp\left[ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right] \; . \quad (2) $$
Proof: A log-normally distributed random variable (→ II/3.6.1) is defined as the exponential function of a normal random variable (→ II/3.2.1): Y ∼ N(µ, σ²), X = g(Y) = exp(Y). Because the exponential function is strictly increasing, the probability density function of a strictly increasing function (→ I/1.7.3) applies with the inverse function g⁻¹(x) = ln x:
$$ \begin{split} f_X(x) &= f_Y(g^{-1}(x)) \cdot \frac{\mathrm{d}g^{-1}(x)}{\mathrm{d}x} \\ &= \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{1}{2} \left( \frac{g^{-1}(x) - \mu}{\sigma} \right)^2 \right] \cdot \frac{\mathrm{d}g^{-1}(x)}{\mathrm{d}x} \\ &= \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{1}{2} \left( \frac{\ln x - \mu}{\sigma} \right)^2 \right] \cdot \frac{\mathrm{d}(\ln x)}{\mathrm{d}x} \\ &= \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{1}{2} \left( \frac{\ln x - \mu}{\sigma} \right)^2 \right] \cdot \frac{1}{x} \\ &= \frac{1}{x \sigma \sqrt{2\pi}} \cdot \exp\left[ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right] \end{split} \quad (8) $$
which is the probability density function (→ I/1.7.1) of the log-normal distribution (→ II/3.6.1).
Sources:
• Taboga, Marco (2021): “Log-normal distribution”; in: Lectures on probability and statistics, re-
trieved on 2022-02-13; URL: https://www.statlect.com/probability-distributions/log-normal-distribution.
X ∼ ln N (µ, σ 2 ) . (1)
Then, the cumulative distribution function (→ I/1.8.1) of X is
$$ F_X(x) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2}\sigma} \right) \right] \quad (2) $$
where erf(x) is the error function defined as
$$ \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} \exp(-t^2) \, \mathrm{d}t \; . \quad (3) $$
Proof: The probability density function of the log-normal distribution (→ II/3.6.2) is:
$$ f_X(x) = \frac{1}{x \sigma \sqrt{2\pi}} \cdot \exp\left[ -\left( \frac{\ln x - \mu}{\sqrt{2}\sigma} \right)^2 \right] \; . \quad (4) $$
Thus, the cumulative distribution function (→ I/1.8.1) is:
$$ \begin{split} F_X(x) &= \int_{-\infty}^{x} \ln\mathcal{N}(z; \mu, \sigma^2) \, \mathrm{d}z \\ &= \int_{-\infty}^{x} \frac{1}{z \sigma \sqrt{2\pi}} \cdot \exp\left[ -\left( \frac{\ln z - \mu}{\sqrt{2}\sigma} \right)^2 \right] \mathrm{d}z \\ &= \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{x} \frac{1}{z} \cdot \exp\left[ -\left( \frac{\ln z - \mu}{\sqrt{2}\sigma} \right)^2 \right] \mathrm{d}z \; . \end{split} \quad (5) $$
From this point forward, the proof is similar to the derivation of the cumulative distribution function for the normal distribution (→ II/3.2.12). Substituting t = (ln z − µ)/(√2 σ), i.e. ln z = √2 σt + µ and z = exp(√2 σt + µ), this becomes:
$$ \begin{split} F_X(x) &= \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{\frac{\ln x - \mu}{\sqrt{2}\sigma}} \frac{1}{\exp\left( \sqrt{2}\sigma t + \mu \right)} \cdot \exp\left( -t^2 \right) \mathrm{d}\left[ \exp\left( \sqrt{2}\sigma t + \mu \right) \right] \\ &= \frac{\sqrt{2}\sigma}{\sigma \sqrt{2\pi}} \int_{-\infty}^{\frac{\ln x - \mu}{\sqrt{2}\sigma}} \frac{1}{\exp\left( \sqrt{2}\sigma t + \mu \right)} \cdot \exp(-t^2) \cdot \exp\left( \sqrt{2}\sigma t + \mu \right) \mathrm{d}t \\ &= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\frac{\ln x - \mu}{\sqrt{2}\sigma}} \exp(-t^2) \, \mathrm{d}t \\ &= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{0} \exp(-t^2) \, \mathrm{d}t + \frac{1}{\sqrt{\pi}} \int_{0}^{\frac{\ln x - \mu}{\sqrt{2}\sigma}} \exp(-t^2) \, \mathrm{d}t \\ &= \frac{1}{\sqrt{\pi}} \int_{0}^{\infty} \exp(-t^2) \, \mathrm{d}t + \frac{1}{\sqrt{\pi}} \int_{0}^{\frac{\ln x - \mu}{\sqrt{2}\sigma}} \exp(-t^2) \, \mathrm{d}t \; . \end{split} \quad (6) $$
With the definition of the error function (3), this becomes:
$$ \begin{split} F_X(x) &= \frac{1}{2} \lim_{x \to \infty} \mathrm{erf}(x) + \frac{1}{2} \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2}\sigma} \right) \\ &= \frac{1}{2} + \frac{1}{2} \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2}\sigma} \right) \\ &= \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2}\sigma} \right) \right] \; . \end{split} \quad (7) $$
■
Sources:
• skdhfgeq2134 (2015): “How to derive the cdf of a lognormal distribution from its pdf”; in: StackEx-
change, retrieved on 2022-06-29; URL: https://stats.stackexchange.com/questions/151398/how-to-derive-t
151404#151404.
X ∼ ln N (µ, σ 2 ) . (1)
Then, the quantile function (→ I/1.9.1) of X is
$$ Q_X(p) = \exp\left( \mu + \sqrt{2}\sigma \cdot \mathrm{erf}^{-1}(2p - 1) \right) \quad (2) $$
where erf⁻¹(x) is the inverse error function.
Proof: The cumulative distribution function of the log-normal distribution (→ II/3.6.3) is:
$$ F_X(x) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2}\sigma} \right) \right] \; . \quad (3) $$
From this point forward, the proof is similar to the derivation of the quantile function for the
normal distribution (→ II/3.2.15). Because the cumulative distribution function (CDF) is strictly
monotonically increasing, the quantile function is equal to the inverse of the CDF (→ I/1.9.2):
$$ \begin{split} p &= \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2}\sigma} \right) \right] \\ 2p - 1 &= \mathrm{erf}\left( \frac{\ln x - \mu}{\sqrt{2}\sigma} \right) \\ \mathrm{erf}^{-1}(2p - 1) &= \frac{\ln x - \mu}{\sqrt{2}\sigma} \\ x &= \exp\left( \mu + \sqrt{2}\sigma \cdot \mathrm{erf}^{-1}(2p - 1) \right) \; . \end{split} \quad (5) $$
Sources:
• Wikipedia (2022): “Log-normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-07-08; URL: https://en.wikipedia.org/wiki/Log-normal_distribution#Mode,_median,_quantiles.
3.6.5 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following a log-normal distribution (→ II/3.6.1):
X ∼ ln N (µ, σ 2 ) (1)
Then, the mean or expected value (→ I/1.10.1) of X is
$$ \mathrm{E}(X) = \exp\left( \mu + \frac{1}{2}\sigma^2 \right) \; . \quad (2) $$
Proof: The expected value (→ I/1.10.1) is the probability-weighted average over all possible values:
$$ \mathrm{E}(X) = \int_{\mathcal{X}} x \cdot f_X(x) \, \mathrm{d}x \; . \quad (3) $$
With the probability density function of the log-normal distribution (→ II/3.6.2), this is:
$$ \begin{split} \mathrm{E}(X) &= \int_{0}^{+\infty} x \cdot \frac{1}{x \sqrt{2\pi\sigma^2}} \cdot \exp\left[ -\frac{1}{2} \frac{(\ln x - \mu)^2}{\sigma^2} \right] \mathrm{d}x \\ &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{0}^{+\infty} \exp\left[ -\frac{1}{2} \frac{(\ln x - \mu)^2}{\sigma^2} \right] \mathrm{d}x \; . \end{split} \quad (4) $$
Substituting z = (ln x − µ)/σ, i.e. x = exp(µ + σz), we have:
$$ \begin{split} \mathrm{E}(X) &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} \exp\left( -\frac{1}{2} z^2 \right) \mathrm{d}\left[ \exp(\mu + \sigma z) \right] \\ &= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} \exp\left( -\frac{1}{2} z^2 \right) \sigma \exp(\mu + \sigma z) \, \mathrm{d}z \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left( -\frac{1}{2} z^2 + \sigma z + \mu \right) \mathrm{d}z \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left( -\frac{1}{2} \left[ z^2 - 2\sigma z - 2\mu \right] \right) \mathrm{d}z \; . \end{split} \quad (5) $$
Now multiplying by exp(½σ²) and exp(−½σ²), we have:
$$ \begin{split} \mathrm{E}(X) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left( -\frac{1}{2} \left[ z^2 - 2\sigma z + \sigma^2 - 2\mu - \sigma^2 \right] \right) \mathrm{d}z \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left( -\frac{1}{2} \left[ z^2 - 2\sigma z + \sigma^2 \right] \right) \exp\left( \mu + \frac{1}{2}\sigma^2 \right) \mathrm{d}z \\ &= \exp\left( \mu + \frac{1}{2}\sigma^2 \right) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (z - \sigma)^2 \right) \mathrm{d}z \; . \end{split} \quad (6) $$
The integrand is the probability density function of a normal distribution with mean σ and unit variance, so the integral equals one (→ I/1.7.1) and we obtain:
$$ \mathrm{E}(X) = \exp\left( \mu + \frac{1}{2}\sigma^2 \right) \; . $$
■
Sources:
• Taboga, Marco (2022): “Log-normal distribution”; in: Lectures on probability theory and mathe-
matical statistics, retrieved on 2022-10-01; URL: https://www.statlect.com/probability-distributions/
log-normal-distribution.
3.6.6 Median
Theorem: Let X be a random variable (→ I/1.2.2) following a log-normal distribution (→ II/3.6.1):
X ∼ ln N (µ, σ 2 ) . (1)
Then, the median (→ I/1.15.1) of X is
$$ \mathrm{median}(X) = e^{\mu} \; . \quad (2) $$
Proof: The median (→ I/1.15.1) is the value at which the cumulative distribution function is 1/2:
F_X(\mathrm{median}(X)) = \frac{1}{2} .          (3)
The cumulative distribution function of the log-normal distribution (→ II/3.6.3) is

F_X(x) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{\ln(x) - \mu}{\sigma \sqrt{2}} \right) \right]          (4)
where erf(x) is the error function. Thus, solving p = F_X(x) for x, the inverse CDF is

\ln(x) = \sigma \sqrt{2} \cdot \mathrm{erf}^{-1}(2p - 1) + \mu

x = \exp\left[ \sigma \sqrt{2} \cdot \mathrm{erf}^{-1}(2p - 1) + \mu \right]          (5)
where erf −1 (x) is the inverse error function. Setting p = 1/2, we obtain:
\ln\left[ \mathrm{median}(X) \right] = \sigma \sqrt{2} \cdot \mathrm{erf}^{-1}(0) + \mu          (6)

\mathrm{median}(X) = e^\mu .

■
3.6.7 Mode
Theorem: Let X be a random variable (→ I/1.2.2) following a log-normal distribution (→ II/3.6.1):
X ∼ ln N(µ, σ²) .          (1)

Then, the mode (→ I/1.15.2) of X is

\mathrm{mode}(X) = e^{\mu - \sigma^2} .          (2)
Proof: The mode (→ I/1.15.2) is the value which maximizes the probability density function (→
I/1.7.1). The probability density function of the log-normal distribution (→ II/3.6.2) is

f_X(x) = \frac{1}{x \sqrt{2\pi\sigma^2}} \cdot \exp\left[ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right]

and its first two derivatives with respect to x are

f_X'(x) = -\frac{1}{x^2 \sigma \sqrt{2\pi}} \cdot \exp\left[ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right] \cdot \left( 1 + \frac{\ln x - \mu}{\sigma^2} \right)

and

f_X''(x) = \frac{1}{\sqrt{2\pi\sigma^2} \, x^3} \exp\left[ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right] \cdot \frac{\ln x - \mu}{\sigma^2} \cdot \left( 1 + \frac{\ln x - \mu}{\sigma^2} \right)
         + \frac{\sqrt{2}}{\sqrt{\pi} \, \sigma x^3} \exp\left[ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right] \cdot \left( 1 + \frac{\ln x - \mu}{\sigma^2} \right)          (6)
         - \frac{1}{\sqrt{2\pi\sigma^2} \, \sigma^2 x^3} \exp\left[ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right] .

Setting the first derivative to zero yields the critical point:

f_X'(x) = 0 \quad \Rightarrow \quad 1 + \frac{\ln x - \mu}{\sigma^2} = 0 \quad \Rightarrow \quad \frac{\ln x - \mu}{\sigma^2} = -1 \quad \Rightarrow \quad x = e^{\mu - \sigma^2} .          (7)

At x = e^{\mu - \sigma^2}, we have \ln x - \mu = -\sigma^2, so the factor \left( 1 + (\ln x - \mu)/\sigma^2 \right) vanishes
and with it the first two terms of the second derivative:

f_X''(e^{\mu - \sigma^2}) = \frac{1}{\sqrt{2\pi\sigma^2} \, (e^{\mu - \sigma^2})^3} \exp\left[ -\frac{\sigma^2}{2} \right] \cdot (-1) \cdot (0)
                          + \frac{\sqrt{2}}{\sqrt{\pi} \, \sigma (e^{\mu - \sigma^2})^3} \exp\left[ -\frac{\sigma^2}{2} \right] \cdot (0)
                          - \frac{1}{\sqrt{2\pi\sigma^2} \, \sigma^2 (e^{\mu - \sigma^2})^3} \exp\left[ -\frac{\sigma^2}{2} \right]          (8)
                          = -\frac{1}{\sqrt{2\pi\sigma^2} \, \sigma^2 (e^{\mu - \sigma^2})^3} \exp\left[ -\frac{\sigma^2}{2} \right] < 0 ,

such that the critical point is a maximum and

\mathrm{mode}(X) = e^{\mu - \sigma^2} .          (9)

■
242 CHAPTER II. PROBABILITY DISTRIBUTIONS
Sources:
• Wikipedia (2022): “Log-normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-02-12; URL: https://en.wikipedia.org/wiki/Log-normal_distribution#Mode.
• Mdoc (2015): “Mode of lognormal distribution”; in: Mathematics Stack Exchange, retrieved on
2022-02-12; URL: https://math.stackexchange.com/questions/1321221/mode-of-lognormal-distribution/
1321626.
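The mode can also be located numerically by maximizing the density on a grid. This is a minimal stdlib-only sketch (not part of the original text):

```python
import math

mu, sigma = 0.0, 0.5

def lognormal_pdf(x):
    # f_X(x) = 1/(x*sqrt(2*pi*sigma^2)) * exp(-(ln x - mu)^2 / (2 sigma^2))
    return math.exp(-(math.log(x) - mu) ** 2 / (2.0 * sigma ** 2)) / (x * sigma * math.sqrt(2.0 * math.pi))

# Maximize the density on a fine grid and compare with exp(mu - sigma^2).
grid = [0.01 + 0.001 * i for i in range(3000)]
numeric_mode = max(grid, key=lognormal_pdf)
exact_mode = math.exp(mu - sigma ** 2)
print(numeric_mode, exact_mode)
```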
3.6.8 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following a log-normal distribution (→ II/3.6.1):
X ∼ ln N(µ, σ²) .          (1)

Then, the variance (→ I/1.11.1) of X is

\mathrm{Var}(X) = \exp\left( 2\mu + 2\sigma^2 \right) - \exp\left( 2\mu + \sigma^2 \right) .          (2)
Proof: The variance (→ I/1.11.1) can be expressed in terms of expected values (→ I/1.11.3) as

\mathrm{Var}(X) = \mathrm{E}(X^2) - \mathrm{E}(X)^2 .

Analogous to the derivation of the mean (→ II/3.6.5), substituting z = (\ln x - \mu)/\sigma in the integral for the second moment gives:

\mathrm{E}[X^2] = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} \exp(\mu + \sigma z) \exp\left[ -\frac{1}{2} z^2 \right] \mathrm{d}\left[ \exp(\mu + \sigma z) \right]

               = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} \exp\left[ -\frac{1}{2} z^2 \right] \sigma \exp(2\mu + 2\sigma z) \, \mathrm{d}z          (7)

               = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left[ -\frac{1}{2} \left( z^2 - 4 \sigma z - 4 \mu \right) \right] \mathrm{d}z .

Completing the square by adding and subtracting 4\sigma^2, we have:

\mathrm{E}[X^2] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left[ -\frac{1}{2} \left( z^2 - 4 \sigma z + 4 \sigma^2 - 4 \sigma^2 - 4 \mu \right) \right] \mathrm{d}z

               = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \exp\left[ -\frac{1}{2} \left( z^2 - 4 \sigma z + 4 \sigma^2 \right) \right] \exp\left( 2\sigma^2 + 2\mu \right) \mathrm{d}z          (8)

               = \exp\left( 2\sigma^2 + 2\mu \right) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2} (z - 2\sigma)^2 \right] \mathrm{d}z .

The remaining integrand is the density of a normal distribution with mean 2σ and unit variance, which integrates to one, such that \mathrm{E}[X^2] = \exp(2\sigma^2 + 2\mu). Together with the mean of the log-normal distribution (→ II/3.6.5), the variance becomes:

\mathrm{Var}(X) = \mathrm{E}\left[ X^2 \right] - \mathrm{E}[X]^2

               = \exp\left( 2\sigma^2 + 2\mu \right) - \left[ \exp\left( \mu + \frac{1}{2}\sigma^2 \right) \right]^2          (13)

               = \exp\left( 2\sigma^2 + 2\mu \right) - \exp\left( 2\mu + \sigma^2 \right) .

■
Sources:
• Taboga, Marco (2022): “Log-normal distribution”; in: Lectures on probability theory and mathe-
matical statistics, retrieved on 2022-10-01; URL: https://www.statlect.com/probability-distributions/
log-normal-distribution.
• Wikipedia (2022): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-10-01; URL:
https://en.wikipedia.org/wiki/Variance#Definition.
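The variance formula can be checked by simulation, in the same stdlib-only style as the mean check above (a sketch, not part of the original text):

```python
import math
import random

random.seed(2)
mu, sigma = 0.2, 0.5
n = 200_000

# Compare the unbiased sample variance of log-normal draws
# with Var(X) = exp(2 mu + 2 sigma^2) - exp(2 mu + sigma^2).
xs = [random.lognormvariate(mu, sigma) for _ in range(n)]
m = sum(xs) / n
v = sum((x - m) ** 2 for x in xs) / (n - 1)
exact_var = math.exp(2 * mu + 2 * sigma ** 2) - math.exp(2 * mu + sigma ** 2)
print(v, exact_var)
```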
Y = \sum_{i=1}^{k} X_i^2 \sim \chi^2(k) \quad \text{where} \quad k > 0 .          (2)

The probability density function of the chi-squared distribution (→ II/3.7.3) with k degrees of free-
dom is

\chi^2(x; k) = \frac{1}{2^{k/2} \, \Gamma(k/2)} \, x^{k/2 - 1} \, e^{-x/2}          (3)

where k > 0 and the density is zero if x ≤ 0.
Sources:
• Wikipedia (2020): “Chi-square distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-10-12; URL: https://en.wikipedia.org/wiki/Chi-square_distribution#Definitions.
• Robert V. Hogg, Joseph W. McKean, Allen T. Craig (2018): “The Chi-Squared-Distribution”;
in: Introduction to Mathematical Statistics, Pearson, Boston, 2019, p. 178, eq. 3.3.7; URL: https:
//www.pearson.com/store/p/introduction-to-mathematical-statistics/P100000843744.
Proof: The probability density function of the gamma distribution (→ II/3.4.7) for x > 0, where α
is the shape parameter and β is the rate parameter, is as follows:

\mathrm{Gam}(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \, x^{\alpha - 1} \, e^{-\beta x} .          (2)

If we let α = k/2 and β = 1/2, we obtain

\mathrm{Gam}\left( x; \frac{k}{2}, \frac{1}{2} \right) = \frac{\left( \frac{1}{2} \right)^{k/2}}{\Gamma(k/2)} \, x^{k/2 - 1} \, e^{-x/2} = \frac{1}{2^{k/2} \, \Gamma(k/2)} \, x^{k/2 - 1} \, e^{-x/2}          (3)
which is equivalent to the probability density function of the chi-squared distribution (→ II/3.7.3).
■
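The equivalence of the two densities can be verified pointwise. This is a minimal stdlib-only sketch (not part of the original proof):

```python
import math

def gamma_pdf(x, a, b):
    # Gam(x; a, b) = b^a / Gamma(a) * x^(a - 1) * exp(-b x)
    return b ** a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

def chi2_pdf(x, k):
    # chi^2(x; k) = x^(k/2 - 1) * exp(-x/2) / (2^(k/2) * Gamma(k/2))
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

# Gam(x; k/2, 1/2) and chi^2(x; k) should agree up to rounding error.
max_diff = max(
    abs(gamma_pdf(x, k / 2, 0.5) - chi2_pdf(x, k))
    for k in (1, 2, 5, 10)
    for x in (0.5, 1.0, 3.7, 9.2)
)
print(max_diff)
```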
Y ∼ χ²(k) .          (1)

Then, the probability density function (→ I/1.7.1) of Y is

f_Y(y) = \frac{1}{2^{k/2} \, \Gamma(k/2)} \, y^{k/2 - 1} \, e^{-y/2} .          (2)

Proof: A chi-squared distributed random variable is defined (→ II/3.7.1) as the sum of k squared standard normal (→ II/3.2.3) random variables:

X_1, \ldots, X_k \sim \mathcal{N}(0, 1) \quad \Rightarrow \quad Y = \sum_{i=1}^{k} X_i^2 \sim \chi^2(k) .          (3)

Define

y = \sum_{i=1}^{k} x_i^2          (4)

and let f_Y(y) and F_Y(y) be the probability density function (→ I/1.7.1) and cumulative distribution
function (→ I/1.8.1) of Y. Because the PDF is the first derivative of the CDF (→ I/1.7.7), we can
write:

\mathrm{d}F_Y(y) = \frac{\mathrm{d}F_Y(y)}{\mathrm{d}y} \, \mathrm{d}y = f_Y(y) \, \mathrm{d}y .          (5)
Then, the cumulative distribution function (→ I/1.8.1) of Y can be expressed as
f_Y(y) \, \mathrm{d}y = \int_V \prod_{i=1}^{k} \left( \mathcal{N}(x_i; 0, 1) \, \mathrm{d}x_i \right)          (6)
where N (xi ; 0, 1) is the probability density function (→ I/1.7.1) of the standard normal distribution
(→ II/3.2.3) and V is the elemental shell volume at y(x), which is proportional to the (k − 1)-
dimensional surface in k-space for which equation (4) is fulfilled. Using the probability density func-
tion of the normal distribution (→ II/3.2.10), equation (6) can be developed as follows:
f_Y(y) \, \mathrm{d}y = \int_V \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}} \cdot \exp\left[ -\frac{1}{2} x_i^2 \right] \mathrm{d}x_i

                     = \int_V \frac{\exp\left[ -\frac{1}{2} (x_1^2 + \ldots + x_k^2) \right]}{(2\pi)^{k/2}} \, \mathrm{d}x_1 \ldots \mathrm{d}x_k          (7)

                     = \frac{1}{(2\pi)^{k/2}} \int_V \exp\left[ -\frac{y}{2} \right] \mathrm{d}x_1 \ldots \mathrm{d}x_k .
Because y is constant within the set V , it can be moved out of the integral:
f_Y(y) \, \mathrm{d}y = \frac{\exp[-y/2]}{(2\pi)^{k/2}} \int_V \mathrm{d}x_1 \ldots \mathrm{d}x_k .          (8)
Now, the integral is simply the surface area of the (k − 1)-dimensional sphere with radius r = \sqrt{y},
which is

A = 2 r^{k-1} \frac{\pi^{k/2}}{\Gamma(k/2)} ,          (9)

times the infinitesimal thickness of the sphere, which is

\frac{\mathrm{d}r}{\mathrm{d}y} = \frac{1}{2} y^{-1/2} \quad \Leftrightarrow \quad \mathrm{d}r = \frac{\mathrm{d}y}{2 y^{1/2}} .          (10)
Substituting (9) and (10) into (8), we have:
f_Y(y) \, \mathrm{d}y = \frac{\exp[-y/2]}{(2\pi)^{k/2}} \cdot A \, \mathrm{d}r

                     = \frac{\exp[-y/2]}{(2\pi)^{k/2}} \cdot 2 r^{k-1} \frac{\pi^{k/2}}{\Gamma(k/2)} \cdot \frac{\mathrm{d}y}{2 y^{1/2}}

                     = \frac{1}{2^{k/2} \, \Gamma(k/2)} \cdot \frac{2 \left( \sqrt{y} \right)^{k-1}}{2 \sqrt{y}} \cdot \exp[-y/2] \, \mathrm{d}y          (11)

                     = \frac{1}{2^{k/2} \, \Gamma(k/2)} \cdot y^{\frac{k}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \right] \mathrm{d}y .

This is the probability density function stated by the theorem.

■
Sources:
• Wikipedia (2020): “Proofs related to chi-squared distribution”; in: Wikipedia, the free encyclopedia,
retrieved on 2020-11-25; URL: https://en.wikipedia.org/wiki/Proofs_related_to_chi-squared_
distribution#Derivation_of_the_pdf_for_k_degrees_of_freedom.
• Wikipedia (2020): “n-sphere”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-25; URL:
https://en.wikipedia.org/wiki/N-sphere#Volume_and_surface_area.
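The defining construction — a sum of k squared standard normals — can be checked against the known chi-squared moments (mean k, variance 2k). A minimal stdlib-only sketch (not part of the original text):

```python
import random

random.seed(3)
k, n = 4, 100_000

# Y = X_1^2 + ... + X_k^2 with X_i ~ N(0, 1) should have
# mean k and variance 2k under chi^2(k).
ys = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k)) for _ in range(n)]
m = sum(ys) / n
v = sum((y - m) ** 2 for y in ys) / (n - 1)
print(m, v)
```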
3.7.4 Moments
Theorem: Let X be a random variable (→ I/1.2.2) following a chi-squared distribution (→ II/3.7.1):
X ∼ χ2 (k) . (1)
If m > −k/2, then E(X m ) exists and is equal to:
\mathrm{E}(X^m) = \frac{2^m \, \Gamma\left( \frac{k}{2} + m \right)}{\Gamma\left( \frac{k}{2} \right)} .          (2)
3. UNIVARIATE CONTINUOUS DISTRIBUTIONS 247
Proof: Combining the definition of the m-th raw moment (→ I/1.18.3) with the probability density
function of the chi-squared distribution (→ II/3.7.3), we have:
\mathrm{E}(X^m) = \int_0^\infty x^m \cdot \frac{1}{\Gamma\left( \frac{k}{2} \right) 2^{k/2}} \, x^{(k/2) - 1} \, e^{-x/2} \, \mathrm{d}x = \frac{1}{\Gamma\left( \frac{k}{2} \right) 2^{k/2}} \int_0^\infty x^{(k/2) + m - 1} \, e^{-x/2} \, \mathrm{d}x .          (3)

Now define a new variable u = x/2, such that x = 2u and dx = 2 du. As a result, we obtain:

\mathrm{E}(X^m) = \frac{1}{\Gamma\left( \frac{k}{2} \right) 2^{(k/2) - 1}} \int_0^\infty 2^{(k/2) + m - 1} \, u^{(k/2) + m - 1} \, e^{-u} \, \mathrm{d}u = \frac{2^m}{\Gamma\left( \frac{k}{2} \right)} \int_0^\infty u^{(k/2) + m - 1} \, e^{-u} \, \mathrm{d}u = \frac{2^m \, \Gamma\left( \frac{k}{2} + m \right)}{\Gamma\left( \frac{k}{2} \right)} .          (4)
This leads to the desired result when m > −k/2. Observe that, if m is a nonnegative integer, then
m > −k/2 is always true. Therefore, all moments (→ I/1.18.1) of a chi-squared distribution (→
II/3.7.1) exist and the m-th raw moment is given by the foregoing equation.
■
Sources:
• Robert V. Hogg, Joseph W. McKean, Allen T. Craig (2018): “The χ²-Distribution”; in: Introduction
to Mathematical Statistics, Pearson, Boston, 2019, p. 179, eq. 3.3.8; URL: https://www.pearson.
com/store/p/introduction-to-mathematical-statistics/P100000843744.
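The moment formula can be checked by numerical integration against the chi-squared density. A minimal stdlib-only sketch (not part of the original proof):

```python
import math

def chi2_pdf(x, k):
    # chi^2(x; k) = x^(k/2 - 1) * exp(-x/2) / (2^(k/2) * Gamma(k/2))
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def raw_moment(k, m, upper=200.0, steps=200_000):
    # Midpoint-rule approximation of E(X^m) = integral of x^m * chi^2(x; k)
    h = upper / steps
    return sum(((i + 0.5) * h) ** m * chi2_pdf((i + 0.5) * h, k) for i in range(steps)) * h

k, m = 4, 2
exact = 2 ** m * math.gamma(k / 2 + m) / math.gamma(k / 2)
print(raw_moment(k, m), exact)
```

For k = 4 and m = 2 both values equal E(X²) = Var(X) + E(X)² = 8 + 16 = 24.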
3.8 F-distribution
3.8.1 Definition
Definition: Let X1 and X2 be independent (→ I/1.3.6) random variables (→ I/1.2.2) following a
chi-squared distribution (→ II/3.7.1) with d1 and d2 degrees of freedom, respectively:
X_1 \sim \chi^2(d_1)
X_2 \sim \chi^2(d_2) .          (1)

Then, the ratio of X_1 to X_2, divided by their respective degrees of freedom, is said to be F-distributed
with numerator degrees of freedom d_1 and denominator degrees of freedom d_2:

Y = \frac{X_1 / d_1}{X_2 / d_2} \sim \mathrm{F}(d_1, d_2) \quad \text{where} \quad d_1, d_2 > 0 .          (2)
The F -distribution is also called “Snedecor’s F -distribution” or “Fisher–Snedecor distribution”, after
Ronald A. Fisher and George W. Snedecor.
Sources:
• Wikipedia (2021): “F-distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-04-21;
URL: https://en.wikipedia.org/wiki/F-distribution#Characterization.
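The defining ratio can be simulated directly from sums of squared normals and compared with the known mean of the F-distribution, E(Y) = d₂/(d₂ − 2) for d₂ > 2. A minimal stdlib-only sketch (not part of the original text):

```python
import random

random.seed(4)
d1, d2, n = 3, 10, 100_000

def chi2_draw(k):
    # chi-squared draw as a sum of k squared standard normals
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

# Y = (X1/d1) / (X2/d2) should have mean d2/(d2 - 2).
fs = [(chi2_draw(d1) / d1) / (chi2_draw(d2) / d2) for _ in range(n)]
m = sum(fs) / n
print(m, d2 / (d2 - 2))
```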
Theorem: Let F be a random variable (→ I/1.2.2) following an F-distribution (→ II/3.8.1):

F \sim \mathrm{F}(u, v) .          (1)

Then, the probability density function (→ I/1.7.1) of F is
f_F(f) = \frac{\Gamma\left( \frac{u+v}{2} \right)}{\Gamma\left( \frac{u}{2} \right) \cdot \Gamma\left( \frac{v}{2} \right)} \cdot \left( \frac{u}{v} \right)^{u/2} \cdot f^{\frac{u}{2} - 1} \cdot \left( \frac{u}{v} f + 1 \right)^{-\frac{u+v}{2}} .          (2)
Proof: An F-distributed random variable (→ II/3.8.1) is defined as the ratio of two chi-squared
random variables (→ II/3.7.1), divided by their degrees of freedom
X \sim \chi^2(u), \; Y \sim \chi^2(v) \quad \Rightarrow \quad F = \frac{X/u}{Y/v} \sim \mathrm{F}(u, v)          (3)
where X and Y are independent of each other (→ I/1.3.6).
The probability density function of the chi-squared distribution (→ II/3.7.3) is
f_X(x) = \frac{1}{\Gamma\left( \frac{u}{2} \right) \cdot 2^{u/2}} \cdot x^{\frac{u}{2} - 1} \cdot e^{-\frac{x}{2}} .          (4)
Define the random variables F and W as functions of X and Y:

F = \frac{X/u}{Y/v}
W = Y ,          (5)

such that

X = \frac{u}{v} F W
Y = W .          (6)

The Jacobian matrix of this transformation and its determinant are

J = \begin{bmatrix} \frac{\mathrm{d}X}{\mathrm{d}F} & \frac{\mathrm{d}X}{\mathrm{d}W} \\ \frac{\mathrm{d}Y}{\mathrm{d}F} & \frac{\mathrm{d}Y}{\mathrm{d}W} \end{bmatrix} = \begin{bmatrix} \frac{u}{v} W & \frac{u}{v} F \\ 0 & 1 \end{bmatrix}          (7)

|J| = \frac{u}{v} W .
Because X and Y are independent (→ I/1.3.6), the joint density (→ I/1.5.2) of X and Y is equal to
the product (→ I/1.3.8) of the marginal densities (→ I/1.5.3):
f_{F,W}(f, w) = f_X\left( \frac{u}{v} f w \right) \cdot f_Y(w) \cdot |J|

= \frac{1}{\Gamma\left( \frac{u}{2} \right) \cdot 2^{u/2}} \cdot \left( \frac{u}{v} f w \right)^{\frac{u}{2} - 1} \cdot e^{-\frac{1}{2} \left( \frac{u}{v} f w \right)} \cdot \frac{1}{\Gamma\left( \frac{v}{2} \right) \cdot 2^{v/2}} \cdot w^{\frac{v}{2} - 1} \cdot e^{-\frac{w}{2}} \cdot \frac{u}{v} w          (10)

= \frac{\left( \frac{u}{v} \right)^{u/2} \cdot f^{\frac{u}{2} - 1}}{\Gamma\left( \frac{u}{2} \right) \cdot \Gamma\left( \frac{v}{2} \right) \cdot 2^{(u+v)/2}} \cdot w^{\frac{u+v}{2} - 1} \cdot e^{-\frac{w}{2} \left( \frac{u}{v} f + 1 \right)} .
The marginal density (→ I/1.5.3) of F can now be obtained by integrating out (→ I/1.3.3) W :
f_F(f) = \int_0^\infty f_{F,W}(f, w) \, \mathrm{d}w

= \frac{\left( \frac{u}{v} \right)^{u/2} \cdot f^{\frac{u}{2} - 1}}{\Gamma\left( \frac{u}{2} \right) \cdot \Gamma\left( \frac{v}{2} \right) \cdot 2^{(u+v)/2}} \int_0^\infty w^{\frac{u+v}{2} - 1} \cdot \exp\left[ -\frac{1}{2} \left( \frac{u}{v} f + 1 \right) w \right] \mathrm{d}w          (11)

= \frac{\left( \frac{u}{v} \right)^{u/2} \cdot f^{\frac{u}{2} - 1}}{\Gamma\left( \frac{u}{2} \right) \cdot \Gamma\left( \frac{v}{2} \right) \cdot 2^{(u+v)/2}} \cdot \frac{\Gamma\left( \frac{u+v}{2} \right)}{\left[ \frac{1}{2} \left( \frac{u}{v} f + 1 \right) \right]^{(u+v)/2}} \int_0^\infty \frac{\left[ \frac{1}{2} \left( \frac{u}{v} f + 1 \right) \right]^{(u+v)/2}}{\Gamma\left( \frac{u+v}{2} \right)} \, w^{\frac{u+v}{2} - 1} \cdot \exp\left[ -\frac{1}{2} \left( \frac{u}{v} f + 1 \right) w \right] \mathrm{d}w .
At this point, we can recognize that the integrand is equal to the probability density function of a
gamma distribution (→ II/3.4.7) with
a = \frac{u+v}{2} \quad \text{and} \quad b = \frac{1}{2} \left( \frac{u}{v} f + 1 \right) ,          (12)
and because a probability density function integrates to one (→ I/1.7.1), we finally have:
f_F(f) = \frac{\left( \frac{u}{v} \right)^{u/2} \cdot f^{\frac{u}{2} - 1}}{\Gamma\left( \frac{u}{2} \right) \cdot \Gamma\left( \frac{v}{2} \right) \cdot 2^{(u+v)/2}} \cdot \frac{\Gamma\left( \frac{u+v}{2} \right)}{\left[ \frac{1}{2} \left( \frac{u}{v} f + 1 \right) \right]^{(u+v)/2}}          (13)

       = \frac{\Gamma\left( \frac{u+v}{2} \right)}{\Gamma\left( \frac{u}{2} \right) \cdot \Gamma\left( \frac{v}{2} \right)} \cdot \left( \frac{u}{v} \right)^{u/2} \cdot f^{\frac{u}{2} - 1} \cdot \left( \frac{u}{v} f + 1 \right)^{-\frac{u+v}{2}} .

■
Sources:
• statisticsmatt (2018): “Statistical Distributions: Derive the F Distribution”; in: YouTube, retrieved
on 2021-10-11; URL: https://www.youtube.com/watch?v=AmHiOKYmHkI.
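A quick sanity check on the derived density is that it integrates to one. A minimal stdlib-only sketch (not part of the original proof) using the midpoint rule:

```python
import math

def f_pdf(f, u, v):
    # F-distribution density in the form derived above
    c = math.gamma((u + v) / 2) / (math.gamma(u / 2) * math.gamma(v / 2))
    return c * (u / v) ** (u / 2) * f ** (u / 2 - 1) * (u / v * f + 1) ** (-(u + v) / 2)

u, v = 4, 6
h, steps = 0.001, 200_000   # midpoint rule on (0, 200)
total = sum(f_pdf((i + 0.5) * h, u, v) for i in range(steps)) * h
print(total)
```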
Definition: A random variable X is said to follow a beta distribution with shape parameters α and β,

X ∼ Bet(α, β) ,          (1)
Sources:
• Wikipedia (2020): “Beta distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-05-
10; URL: https://en.wikipedia.org/wiki/Beta_distribution#Definitions.
Theorem: Let X ∼ χ²(m) and Y ∼ χ²(n) be independent (→ I/1.3.6) chi-squared distributed (→ II/3.7.1) random variables. Then, the quantity Z = X/(X + Y) follows a beta distribution (→ II/3.9.1): Z ∼ Bet(m/2, n/2).

Proof: Define the random variables Z and W as functions of X and Y:

Z = \frac{X}{X + Y}
W = Y ,          (4)

such that

X = \frac{Z W}{1 - Z}
Y = W .          (5)

The Jacobian matrix of this transformation and its determinant are

J = \begin{bmatrix} \frac{\mathrm{d}X}{\mathrm{d}Z} & \frac{\mathrm{d}X}{\mathrm{d}W} \\ \frac{\mathrm{d}Y}{\mathrm{d}Z} & \frac{\mathrm{d}Y}{\mathrm{d}W} \end{bmatrix} = \begin{bmatrix} \frac{W}{(1 - Z)^2} & \frac{Z}{1 - Z} \\ 0 & 1 \end{bmatrix}          (6)

|J| = \frac{W}{(1 - Z)^2} .
Because X and Y are independent (→ I/1.3.6), the joint density (→ I/1.5.2) of X and Y is equal to
the product (→ I/1.3.8) of the marginal densities (→ I/1.5.3):
f_{Z,W}(z, w) = f_X\left( \frac{z w}{1 - z} \right) \cdot f_Y(w) \cdot |J|

= \frac{1}{\Gamma\left( \frac{m}{2} \right) \cdot 2^{m/2}} \cdot \left( \frac{z w}{1 - z} \right)^{\frac{m}{2} - 1} \cdot e^{-\frac{1}{2} \left( \frac{z w}{1 - z} \right)} \cdot \frac{1}{\Gamma\left( \frac{n}{2} \right) \cdot 2^{n/2}} \cdot w^{\frac{n}{2} - 1} \cdot e^{-\frac{w}{2}} \cdot \frac{w}{(1 - z)^2}

= \frac{1}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right) \cdot 2^{(m+n)/2}} \cdot \left( \frac{z}{1 - z} \right)^{\frac{m}{2} - 1} \cdot \frac{w^{\frac{m}{2} + \frac{n}{2} - 1}}{(1 - z)^2} \cdot e^{-\frac{w}{2} \left( \frac{z}{1 - z} + 1 \right)}          (9)

= \frac{1}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right) \cdot 2^{(m+n)/2}} \cdot z^{\frac{m}{2} - 1} \cdot (1 - z)^{-\frac{m}{2} - 1} \cdot w^{\frac{m+n}{2} - 1} \cdot e^{-\frac{1}{2} \left( \frac{w}{1 - z} \right)} ,

where the last step uses \frac{z}{1 - z} + 1 = \frac{1}{1 - z}.
The marginal density (→ I/1.5.3) of Z can now be obtained by integrating out (→ I/1.3.3) W :
f_Z(z) = \int_0^\infty f_{Z,W}(z, w) \, \mathrm{d}w

= \frac{1}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right) \cdot 2^{(m+n)/2}} \cdot z^{\frac{m}{2} - 1} \cdot (1 - z)^{-\frac{m}{2} - 1} \int_0^\infty w^{\frac{m+n}{2} - 1} \cdot e^{-\frac{1}{2} \left( \frac{w}{1 - z} \right)} \, \mathrm{d}w

= \frac{1}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right) \cdot 2^{(m+n)/2}} \cdot z^{\frac{m}{2} - 1} \cdot (1 - z)^{-\frac{m}{2} - 1} \cdot \frac{\Gamma\left( \frac{m+n}{2} \right)}{\left[ \frac{1}{2(1 - z)} \right]^{\frac{m+n}{2}}} \int_0^\infty \frac{\left[ \frac{1}{2(1 - z)} \right]^{\frac{m+n}{2}}}{\Gamma\left( \frac{m+n}{2} \right)} \cdot w^{\frac{m+n}{2} - 1} \cdot e^{-\frac{w}{2(1 - z)}} \, \mathrm{d}w .          (10)
At this point, we can recognize that the integrand is equal to the probability density function of a
gamma distribution (→ II/3.4.7) with
a = \frac{m+n}{2} \quad \text{and} \quad b = \frac{1}{2(1 - z)} ,          (11)
and because a probability density function integrates to one (→ I/1.7.1), we have:
f_Z(z) = \frac{1}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right) \cdot 2^{(m+n)/2}} \cdot z^{\frac{m}{2} - 1} \cdot (1 - z)^{-\frac{m}{2} - 1} \cdot \frac{\Gamma\left( \frac{m+n}{2} \right)}{\left[ \frac{1}{2(1 - z)} \right]^{\frac{m+n}{2}}}

       = \frac{\Gamma\left( \frac{m+n}{2} \right) \cdot 2^{(m+n)/2}}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right) \cdot 2^{(m+n)/2}} \cdot z^{\frac{m}{2} - 1} \cdot (1 - z)^{-\frac{m}{2} - 1 + \frac{m+n}{2}}          (12)

       = \frac{\Gamma\left( \frac{m+n}{2} \right)}{\Gamma\left( \frac{m}{2} \right) \Gamma\left( \frac{n}{2} \right)} \cdot z^{\frac{m}{2} - 1} \cdot (1 - z)^{\frac{n}{2} - 1} ,
which is the probability density function of the beta distribution (→ II/3.9.3) with parameters
\alpha = \frac{m}{2} \quad \text{and} \quad \beta = \frac{n}{2} ,          (14)

such that

Z \sim \mathrm{Bet}\left( \frac{m}{2}, \frac{n}{2} \right) .          (15)
■
Sources:
• Probability Fact (2021): “If X chisq(m) and Y chisq(n) are independent”; in: Twitter, retrieved
on 2022-10-17; URL: https://twitter.com/ProbFact/status/1450492787854647300.
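The result can be checked by simulation: the ratio X/(X + Y) of independent chi-squared draws should match the mean of Bet(m/2, n/2), which is m/(m + n). A minimal stdlib-only sketch (not part of the original proof):

```python
import random

random.seed(5)
m, n, N = 4, 6, 100_000

def chi2_draw(k):
    # chi-squared draw as a sum of k squared standard normals
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(k))

# Z = X/(X + Y) should behave like Bet(m/2, n/2),
# whose mean is (m/2) / ((m + n)/2) = m/(m + n).
total = 0.0
for _ in range(N):
    x, y = chi2_draw(m), chi2_draw(n)
    total += x / (x + y)
mean_z = total / N
print(mean_z, m / (m + n))
```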
Theorem: Let X be a random variable (→ I/1.2.2) following a beta distribution (→ II/3.9.1):

X ∼ Bet(α, β) .          (1)

Then, the probability density function (→ I/1.7.1) of X is

f_X(x) = \frac{1}{\mathrm{B}(\alpha, \beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1} .          (2)
Proof: This follows directly from the definition of the beta distribution (→ II/3.9.1).
■
Theorem: Let X be a random variable (→ I/1.2.2) following a beta distribution (→ II/3.9.1):

X ∼ Bet(α, β) .          (1)

Then, the moment-generating function (→ I/1.9.5) of X is

M_X(t) = 1 + \sum_{n=1}^{\infty} \left( \prod_{m=0}^{n-1} \frac{\alpha + m}{\alpha + \beta + m} \right) \frac{t^n}{n!} .          (2)
Proof: The moment-generating function is defined as

M_X(t) = \mathrm{E}\left( e^{tX} \right) .          (4)
Using the expected value for continuous random variables (→ I/1.10.1), the moment-generating
function of X therefore is
M_X(t) = \int_0^1 \exp[t x] \cdot \frac{1}{\mathrm{B}(\alpha, \beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x

       = \frac{1}{\mathrm{B}(\alpha, \beta)} \int_0^1 e^{t x} \, x^{\alpha - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x .          (5)
Using the representation of the beta function in terms of gamma functions

\mathrm{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \, \Gamma(\beta)}{\Gamma(\alpha + \beta)}          (6)

and the integral representation of the confluent hypergeometric function (Kummer's function of the
first kind)

{}_1F_1(a, b, z) = \frac{\Gamma(b)}{\Gamma(a) \, \Gamma(b - a)} \int_0^1 e^{z u} \, u^{a - 1} (1 - u)^{(b - a) - 1} \, \mathrm{d}u ,          (7)

the moment-generating function can be written as

M_X(t) = {}_1F_1(\alpha, \alpha + \beta, t) .          (8)

Expanding Kummer's function into its power series

{}_1F_1(a, b, z) = \sum_{n=0}^{\infty} \frac{a^{(n)}}{b^{(n)}} \frac{z^n}{n!} ,          (9)

where m^{(n)} denotes the rising factorial

m^{(n)} = \prod_{i=0}^{n-1} (m + i) ,          (10)

and using m^{(0)} = z^0 = 0! = 1 for the n = 0 term, we finally have:

M_X(t) = 1 + \sum_{n=1}^{\infty} \left( \prod_{m=0}^{n-1} \frac{\alpha + m}{\alpha + \beta + m} \right) \frac{t^n}{n!} .          (12)
Sources:
254 CHAPTER II. PROBABILITY DISTRIBUTIONS
• Wikipedia (2020): “Beta distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
25; URL: https://en.wikipedia.org/wiki/Beta_distribution#Moment_generating_function.
• Wikipedia (2020): “Confluent hypergeometric function”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-11-25; URL: https://en.wikipedia.org/wiki/Confluent_hypergeometric_function#
Kummer’s_equation.
Theorem: Let X be a random variable (→ I/1.2.2) following a beta distribution (→ II/3.9.1):

X ∼ Bet(α, β) .          (1)

Then, the cumulative distribution function (→ I/1.8.1) of X is

F_X(x) = \frac{\mathrm{B}(x; \alpha, \beta)}{\mathrm{B}(\alpha, \beta)}          (2)

where B(a, b) is the beta function and B(x; a, b) is the incomplete beta function.
Proof: The probability density function of the beta distribution (→ II/3.9.3) is:
f_X(x) = \frac{1}{\mathrm{B}(\alpha, \beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1} .          (3)
Thus, the cumulative distribution function (→ I/1.8.1) is:
F_X(x) = \int_0^x \mathrm{Bet}(z; \alpha, \beta) \, \mathrm{d}z = \frac{1}{\mathrm{B}(\alpha, \beta)} \int_0^x z^{\alpha - 1} (1 - z)^{\beta - 1} \, \mathrm{d}z .          (4)

With the definition of the incomplete beta function

\mathrm{B}(x; \alpha, \beta) = \int_0^x z^{\alpha - 1} (1 - z)^{\beta - 1} \, \mathrm{d}z ,          (5)

we arrive at

F_X(x) = \frac{\mathrm{B}(x; \alpha, \beta)}{\mathrm{B}(\alpha, \beta)} .          (6)
■
Sources:
• Wikipedia (2020): “Beta distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
19; URL: https://en.wikipedia.org/wiki/Beta_distribution#Cumulative_distribution_function.
• Wikipedia (2020): “Beta function”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-19;
URL: https://en.wikipedia.org/wiki/Beta_function#Incomplete_beta_function.
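The CDF can be checked against cases with simple closed forms: Bet(1, 1) is uniform with CDF x, and Bet(2, 1) has CDF x². A minimal stdlib-only sketch (not part of the original proof), approximating the incomplete beta integral by the midpoint rule:

```python
import math

def beta_cdf(x, a, b, steps=100_000):
    # F_X(x) = B(x; a, b) / B(a, b), with the incomplete beta
    # integral approximated by the midpoint rule.
    h = x / steps
    num = sum(((i + 0.5) * h) ** (a - 1) * (1.0 - (i + 0.5) * h) ** (b - 1) for i in range(steps)) * h
    return num / (math.gamma(a) * math.gamma(b) / math.gamma(a + b))

print(beta_cdf(0.3, 1, 1), beta_cdf(0.3, 2, 1))
```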
3. UNIVARIATE CONTINUOUS DISTRIBUTIONS 255
3.9.6 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following a beta distribution (→ II/3.9.1):
X ∼ Bet(α, β) .          (1)

Then, the mean or expected value (→ I/1.10.1) of X is

\mathrm{E}(X) = \frac{\alpha}{\alpha + \beta} .          (2)

Proof: The expected value (→ I/1.10.1) is the probability-weighted average over all possible values:

\mathrm{E}(X) = \int_X x \cdot f_X(x) \, \mathrm{d}x .          (3)

The probability density function of the beta distribution (→ II/3.9.3) is

f_X(x) = \frac{1}{\mathrm{B}(\alpha, \beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1} , \quad 0 \le x \le 1          (4)

where the beta function is given by a ratio of gamma functions:

\mathrm{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \cdot \Gamma(\beta)}{\Gamma(\alpha + \beta)} .          (5)
Combining (3), (4) and (5), we have:
\mathrm{E}(X) = \int_0^1 x \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \cdot \Gamma(\beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x

             = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + 1)}{\Gamma(\alpha + 1 + \beta)} \int_0^1 \frac{\Gamma(\alpha + 1 + \beta)}{\Gamma(\alpha + 1) \cdot \Gamma(\beta)} \, x^{(\alpha + 1) - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x .          (6)

Employing the relation Γ(x + 1) = Γ(x) · x, we have

\mathrm{E}(X) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)} \cdot \frac{\alpha \cdot \Gamma(\alpha)}{(\alpha + \beta) \cdot \Gamma(\alpha + \beta)} \int_0^1 \frac{\Gamma(\alpha + 1 + \beta)}{\Gamma(\alpha + 1) \cdot \Gamma(\beta)} \, x^{(\alpha + 1) - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x

             = \frac{\alpha}{\alpha + \beta} \int_0^1 \frac{\Gamma(\alpha + 1 + \beta)}{\Gamma(\alpha + 1) \cdot \Gamma(\beta)} \, x^{(\alpha + 1) - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x          (7)

and again using the density of the beta distribution (→ II/3.9.3), we get

\mathrm{E}(X) = \frac{\alpha}{\alpha + \beta} \int_0^1 \mathrm{Bet}(x; \alpha + 1, \beta) \, \mathrm{d}x = \frac{\alpha}{\alpha + \beta} .          (8)
α+β
■
Sources:
• Boer Commander (2020): “Beta Distribution Mean and Variance Proof”; in: YouTube, retrieved
on 2021-04-29; URL: https://www.youtube.com/watch?v=3OgCcnpZtZ8.
256 CHAPTER II. PROBABILITY DISTRIBUTIONS
3.9.7 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following a beta distribution (→ II/3.9.1):
X ∼ Bet(α, β) .          (1)

Then, the variance (→ I/1.11.1) of X is

\mathrm{Var}(X) = \frac{\alpha \beta}{(\alpha + \beta + 1) \cdot (\alpha + \beta)^2} .          (2)

Proof: The variance (→ I/1.11.1) can be expressed in terms of expected values (→ I/1.11.3) as

\mathrm{Var}(X) = \mathrm{E}(X^2) - \mathrm{E}(X)^2 ,          (3)

where the mean of the beta distribution (→ II/3.9.6) is

\mathrm{E}(X) = \frac{\alpha}{\alpha + \beta}          (4)

and the beta function in its density is given by a ratio of gamma functions:

\mathrm{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \cdot \Gamma(\beta)}{\Gamma(\alpha + \beta)} .          (6)
Therefore, the expected value of a squared beta random variable becomes

\mathrm{E}(X^2) = \int_0^1 x^2 \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \cdot \Gamma(\beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x

               = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha + 2)}{\Gamma(\alpha + 2 + \beta)} \int_0^1 \frac{\Gamma(\alpha + 2 + \beta)}{\Gamma(\alpha + 2) \cdot \Gamma(\beta)} \, x^{(\alpha + 2) - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x .          (7)

Employing the relation Γ(x + 1) = Γ(x) · x twice, we have

\mathrm{E}(X^2) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)} \cdot \frac{(\alpha + 1) \cdot \alpha \cdot \Gamma(\alpha)}{(\alpha + \beta + 1) \cdot (\alpha + \beta) \cdot \Gamma(\alpha + \beta)} \int_0^1 \frac{\Gamma(\alpha + 2 + \beta)}{\Gamma(\alpha + 2) \cdot \Gamma(\beta)} \, x^{(\alpha + 2) - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x

               = \frac{(\alpha + 1) \cdot \alpha}{(\alpha + \beta + 1) \cdot (\alpha + \beta)} \int_0^1 \frac{\Gamma(\alpha + 2 + \beta)}{\Gamma(\alpha + 2) \cdot \Gamma(\beta)} \, x^{(\alpha + 2) - 1} (1 - x)^{\beta - 1} \, \mathrm{d}x          (8)

and again using the density of the beta distribution (→ II/3.9.3), we get

\mathrm{E}(X^2) = \frac{(\alpha + 1) \cdot \alpha}{(\alpha + \beta + 1) \cdot (\alpha + \beta)} \int_0^1 \mathrm{Bet}(x; \alpha + 2, \beta) \, \mathrm{d}x = \frac{(\alpha + 1) \cdot \alpha}{(\alpha + \beta + 1) \cdot (\alpha + \beta)} .          (9)
Plugging (9) and (4) into (3), the variance of a beta random variable finally becomes
\mathrm{Var}(X) = \frac{(\alpha + 1) \cdot \alpha}{(\alpha + \beta + 1) \cdot (\alpha + \beta)} - \left( \frac{\alpha}{\alpha + \beta} \right)^2

               = \frac{(\alpha + 1) \cdot \alpha \cdot (\alpha + \beta)}{(\alpha + \beta + 1) \cdot (\alpha + \beta)^2} - \frac{\alpha^2 \cdot (\alpha + \beta + 1)}{(\alpha + \beta + 1) \cdot (\alpha + \beta)^2}

               = \frac{(\alpha^3 + \alpha^2 \beta + \alpha^2 + \alpha \beta) - (\alpha^3 + \alpha^2 \beta + \alpha^2)}{(\alpha + \beta + 1) \cdot (\alpha + \beta)^2}          (10)

               = \frac{\alpha \beta}{(\alpha + \beta + 1) \cdot (\alpha + \beta)^2} .
■
Sources:
• Boer Commander (2020): “Beta Distribution Mean and Variance Proof”; in: YouTube, retrieved
on 2021-04-29; URL: https://www.youtube.com/watch?v=3OgCcnpZtZ8.
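Both beta moments derived above can be checked by numerical integration. A minimal stdlib-only sketch (not part of the original proofs), using the midpoint rule on (0, 1):

```python
import math

a, b = 2.5, 4.0
B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_moment(p, steps=200_000):
    # E(X^p) under Bet(a, b) by the midpoint rule
    h = 1.0 / steps
    return sum(
        ((i + 0.5) * h) ** (p + a - 1) * (1.0 - (i + 0.5) * h) ** (b - 1)
        for i in range(steps)
    ) * h / B

mean = beta_moment(1)
var = beta_moment(2) - mean ** 2
print(mean, a / (a + b))
print(var, a * b / ((a + b + 1) * (a + b) ** 2))
```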
Definition: Let X be a positive random variable (→ I/1.2.2). Then, X is said to follow a Wald distribution with parameters γ and α,

X ∼ Wald(γ, α) ,          (1)

if and only if its probability density function (→ I/1.7.1) is given by

\mathrm{Wald}(x; \gamma, \alpha) = \frac{\alpha}{\sqrt{2 \pi x^3}} \exp\left[ -\frac{(\alpha - \gamma x)^2}{2 x} \right]          (2)

where γ > 0, α > 0, and the density is zero if x ≤ 0.
Sources:
• Anders, R., Alario, F.-X., and van Maanen, L. (2016): “The Shifted Wald Distribution for Response
Time Data Analysis”; in: Psychological Methods, vol. 21, no. 3, pp. 309-327; URL: https://dx.doi.
org/10.1037/met0000066; DOI: 10.1037/met0000066.
Theorem: Let X be a positive random variable (→ I/1.2.2) following a Wald distribution (→ II/3.10.1):

X ∼ Wald(γ, α) .          (1)

Then, the probability density function (→ I/1.7.1) of X is

f_X(x) = \frac{\alpha}{\sqrt{2 \pi x^3}} \exp\left[ -\frac{(\alpha - \gamma x)^2}{2 x} \right] .          (2)
Proof: This follows directly from the definition of the Wald distribution (→ II/3.10.1).
258 CHAPTER II. PROBABILITY DISTRIBUTIONS
Theorem: Let X be a positive random variable (→ I/1.2.2) following a Wald distribution (→ II/3.10.1):

X ∼ Wald(γ, α) .          (1)

Then, the moment-generating function (→ I/1.9.5) of X is

M_X(t) = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] .          (2)

Proof: Applying the definition of the moment-generating function to the probability density function of the Wald distribution (→ II/3.10.2) gives

M_X(t) = \int_0^\infty e^{t x} \cdot \frac{\alpha}{\sqrt{2 \pi x^3}} \cdot \exp\left[ -\frac{(\alpha - \gamma x)^2}{2 x} \right] \mathrm{d}x

       = \frac{\alpha}{\sqrt{2 \pi}} \int_0^\infty x^{-3/2} \cdot \exp\left[ t x - \frac{(\alpha - \gamma x)^2}{2 x} \right] \mathrm{d}x .          (5)
To evaluate this integral, we will need two identities about modified Bessel functions of the second
kind¹, denoted K_p. The function K_p (for p ∈ R) is one of the two linearly independent solutions of
the differential equation
x^2 \frac{\mathrm{d}^2 y}{\mathrm{d}x^2} + x \frac{\mathrm{d}y}{\mathrm{d}x} - (x^2 + p^2) y = 0 .          (6)

The first of these identities² gives an explicit solution for K_{-1/2}:
K_{-1/2}(x) = \sqrt{\frac{\pi}{2x}} \, e^{-x} .          (7)

The second of these identities³ gives an integral representation of K_p:

K_p(\sqrt{a b}) = \frac{1}{2} \left( \frac{a}{b} \right)^{p/2} \int_0^\infty x^{p - 1} \cdot \exp\left[ -\frac{1}{2} \left( a x + \frac{b}{x} \right) \right] \mathrm{d}x .          (8)
Starting from (5), we can expand the binomial term and rearrange the moment generating function
into the following form:
¹ https://dlmf.nist.gov/10.25
² https://dlmf.nist.gov/10.39.2
³ https://dlmf.nist.gov/10.32.10
M_X(t) = \frac{\alpha}{\sqrt{2\pi}} \int_0^\infty x^{-3/2} \cdot \exp\left[ t x - \frac{\alpha^2}{2 x} + \alpha \gamma - \frac{\gamma^2 x}{2} \right] \mathrm{d}x

       = \frac{\alpha}{\sqrt{2\pi}} \cdot e^{\alpha \gamma} \int_0^\infty x^{-3/2} \cdot \exp\left[ \left( t - \frac{\gamma^2}{2} \right) x - \frac{\alpha^2}{2 x} \right] \mathrm{d}x          (9)

       = \frac{\alpha}{\sqrt{2\pi}} \cdot e^{\alpha \gamma} \int_0^\infty x^{-3/2} \cdot \exp\left[ -\frac{1}{2} \left( \gamma^2 - 2t \right) x - \frac{1}{2} \cdot \frac{\alpha^2}{x} \right] \mathrm{d}x .
The integral now has the form of the integral in (8) with p = −1/2, a = γ 2 − 2t, and b = α2 . This
allows us to write the moment-generating function in terms of the modified Bessel function K−1/2 :
M_X(t) = \frac{\alpha}{\sqrt{2\pi}} \cdot e^{\alpha \gamma} \cdot 2 \left( \frac{\gamma^2 - 2t}{\alpha^2} \right)^{1/4} \cdot K_{-1/2}\left( \sqrt{\alpha^2 (\gamma^2 - 2t)} \right) .          (10)
Combining with (7) and simplifying gives
M_X(t) = \frac{\alpha}{\sqrt{2\pi}} \cdot e^{\alpha \gamma} \cdot 2 \left( \frac{\gamma^2 - 2t}{\alpha^2} \right)^{1/4} \cdot \sqrt{\frac{\pi}{2 \sqrt{\alpha^2 (\gamma^2 - 2t)}}} \cdot \exp\left[ -\sqrt{\alpha^2 (\gamma^2 - 2t)} \right]

       = \frac{\alpha}{\sqrt{2} \cdot \sqrt{\pi}} \cdot e^{\alpha \gamma} \cdot 2 \cdot \frac{(\gamma^2 - 2t)^{1/4}}{\sqrt{\alpha}} \cdot \frac{\sqrt{\pi}}{\sqrt{2} \cdot \sqrt{\alpha} \cdot (\gamma^2 - 2t)^{1/4}} \cdot \exp\left[ -\sqrt{\alpha^2 (\gamma^2 - 2t)} \right]          (11)

       = e^{\alpha \gamma} \cdot \exp\left[ -\sqrt{\alpha^2 (\gamma^2 - 2t)} \right]

       = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] .

■
Sources:
• Siegrist, K. (2020): “The Wald Distribution”; in: Random: Probability, Mathematical Statistics,
Stochastic Processes, retrieved on 2020-09-13; URL: https://www.randomservices.org/random/
special/Wald.html.
• National Institute of Standards and Technology (2020): “NIST Digital Library of Mathematical
Functions”, retrieved on 2020-09-13; URL: https://dlmf.nist.gov.
3.10.4 Mean
Theorem: Let X be a positive random variable (→ I/1.2.2) following a Wald distribution (→
II/3.10.1):
X ∼ Wald(γ, α) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
\mathrm{E}(X) = \frac{\alpha}{\gamma} .          (2)
Proof: The mean or expected value E(X) is the first moment (→ I/1.18.1) of X, so we can use (→
I/1.18.2) the moment-generating function of the Wald distribution (→ II/3.10.3) to calculate
M_X'(t) = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \left( -\frac{1}{2} \right) \left[ \alpha^2 (\gamma^2 - 2t) \right]^{-1/2} \cdot \left( -2 \alpha^2 \right)

        = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \frac{\alpha^2}{\sqrt{\alpha^2 (\gamma^2 - 2t)}} .          (5)

Evaluating at t = 0 gives the desired result:

M_X'(0) = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2 \cdot 0)} \right] \cdot \frac{\alpha^2}{\sqrt{\alpha^2 (\gamma^2 - 2 \cdot 0)}}

        = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 \gamma^2} \right] \cdot \frac{\alpha^2}{\sqrt{\alpha^2 \gamma^2}}          (6)

        = \exp[0] \cdot \frac{\alpha^2}{\alpha \gamma}

        = \frac{\alpha}{\gamma} .
■
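The mean α/γ can be checked by numerically integrating x against the Wald density. A minimal stdlib-only sketch (not part of the original proof):

```python
import math

gamma_, alpha = 2.0, 3.0

def wald_pdf(x):
    # Wald(x; gamma, alpha) = alpha/sqrt(2 pi x^3) * exp(-(alpha - gamma x)^2 / (2 x))
    return alpha / math.sqrt(2.0 * math.pi * x ** 3) * math.exp(-(alpha - gamma_ * x) ** 2 / (2.0 * x))

h, steps = 0.0005, 100_000   # midpoint rule on (0, 50)
mean = sum((i + 0.5) * h * wald_pdf((i + 0.5) * h) for i in range(steps)) * h
print(mean, alpha / gamma_)
```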
3.10.5 Variance
Theorem: Let X be a positive random variable (→ I/1.2.2) following a Wald distribution (→
II/3.10.1):
X ∼ Wald(γ, α) . (1)
Then, the variance (→ I/1.11.1) of X is
\mathrm{Var}(X) = \frac{\alpha}{\gamma^3} .          (2)

Proof: To compute the variance of X, we partition the variance into expected values (→ I/1.11.3):

\mathrm{Var}(X) = \mathrm{E}(X^2) - \mathrm{E}(X)^2 .

The second moment E(X²) can be obtained (→ I/1.18.2) by differentiating the moment-generating function of the Wald distribution (→ II/3.10.3)

M_X(t) = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right]          (5)

with respect to t. Using the chain rule gives

M_X'(t) = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \left( -\frac{1}{2} \right) \left[ \alpha^2 (\gamma^2 - 2t) \right]^{-1/2} \cdot \left( -2 \alpha^2 \right)

        = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \frac{\alpha^2}{\sqrt{\alpha^2 (\gamma^2 - 2t)}}          (6)

        = \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot (\gamma^2 - 2t)^{-1/2} .

Applying product and chain rule once more gives the second derivative

M_X''(t) = \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot (\gamma^2 - 2t)^{-1/2} \cdot \left( -\frac{1}{2} \right) \left[ \alpha^2 (\gamma^2 - 2t) \right]^{-1/2} \cdot \left( -2 \alpha^2 \right)
         + \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \left( -\frac{1}{2} \right) (\gamma^2 - 2t)^{-3/2} \cdot (-2)

         = \alpha^2 \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot (\gamma^2 - 2t)^{-1}          (7)
         + \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot (\gamma^2 - 2t)^{-3/2}

         = \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \left[ \frac{\alpha}{\gamma^2 - 2t} + \frac{1}{\sqrt{(\gamma^2 - 2t)^3}} \right] .

Evaluating at t = 0 yields the second moment

\mathrm{E}(X^2) = M_X''(0) = \alpha \left[ \frac{\alpha}{\gamma^2} + \frac{1}{\gamma^3} \right] = \frac{\alpha^2}{\gamma^2} + \frac{\alpha}{\gamma^3} ,

such that

\mathrm{Var}(X) = \frac{\alpha^2}{\gamma^2} + \frac{\alpha}{\gamma^3} - \left( \frac{\alpha}{\gamma} \right)^2 = \frac{\alpha}{\gamma^3} .

■
3.10.6 Skewness
Theorem: Let X be a random variable (→ I/1.2.2) following a Wald distribution (→ II/3.10.1):
X ∼ Wald(γ, α) . (1)
Then the skewness (→ I/1.12.1) of X is
\mathrm{Skew}(X) = \frac{3}{\sqrt{\alpha \gamma}} .          (2)

Proof:
To compute the skewness of X, we partition the skewness into expected values (→ I/1.12.3):

\mathrm{Skew}(X) = \frac{\mathrm{E}(X^3) - 3 \mu \sigma^2 - \mu^3}{\sigma^3} ,          (3)
where µ and σ are the mean and standard deviation of X, respectively. Since X follows a Wald
distribution (→ II/3.10.1), the mean (→ II/3.10.4) of X is given by
\mu = \mathrm{E}(X) = \frac{\alpha}{\gamma}          (4)

and the standard deviation (→ II/3.10.5) of X is given by

\sigma = \sqrt{\mathrm{Var}(X)} = \sqrt{\frac{\alpha}{\gamma^3}} .          (5)
Substituting (4) and (5) into (3) gives:

\mathrm{Skew}(X) = \frac{\mathrm{E}(X^3) - 3 \mu \sigma^2 - \mu^3}{\sigma^3}

                = \frac{\mathrm{E}(X^3) - 3 \frac{\alpha}{\gamma} \cdot \frac{\alpha}{\gamma^3} - \left( \frac{\alpha}{\gamma} \right)^3}{\left( \sqrt{\frac{\alpha}{\gamma^3}} \right)^3}          (6)

                = \frac{\gamma^{9/2}}{\alpha^{3/2}} \left[ \mathrm{E}(X^3) - \frac{3 \alpha^2}{\gamma^4} - \frac{\alpha^3}{\gamma^3} \right] .
Thus, the remaining work is to compute E(X 3 ). To do this, we use the moment-generating function
of the Wald distribution (→ II/3.10.3) to calculate
M_X'(t) = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \left( -\frac{1}{2} \right) \left[ \alpha^2 (\gamma^2 - 2t) \right]^{-1/2} \cdot \left( -2 \alpha^2 \right)

        = \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \frac{\alpha^2}{\sqrt{\alpha^2 (\gamma^2 - 2t)}}          (9)

        = \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot (\gamma^2 - 2t)^{-1/2} .

M_X''(t) = \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot (\gamma^2 - 2t)^{-1/2} \cdot \left( -\frac{1}{2} \right) \left[ \alpha^2 (\gamma^2 - 2t) \right]^{-1/2} \cdot \left( -2 \alpha^2 \right)
         + \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \left( -\frac{1}{2} \right) (\gamma^2 - 2t)^{-3/2} \cdot (-2)

         = \alpha^2 \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot (\gamma^2 - 2t)^{-1}          (10)
         + \alpha \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot (\gamma^2 - 2t)^{-3/2}

         = \frac{\alpha^2}{\gamma^2 - 2t} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] + \frac{\alpha}{(\gamma^2 - 2t)^{3/2}} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] .
Finally, one more application of the chain rule will give us the third derivative. To start, we will
decompose the second derivative obtained in (10) as
f(t) = \frac{\alpha^2}{\gamma^2 - 2t} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right]          (12)

and

g(t) = \frac{\alpha}{(\gamma^2 - 2t)^{3/2}} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] .          (13)

With this decomposition, M_X'''(t) = f'(t) + g'(t). Applying the product rule to f gives:

f'(t) = 2 \alpha^2 (\gamma^2 - 2t)^{-2} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right]
      + \alpha^2 (\gamma^2 - 2t)^{-1} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \left( -\frac{1}{2} \right) \left[ \alpha^2 (\gamma^2 - 2t) \right]^{-1/2} \cdot \left( -2 \alpha^2 \right)

      = \frac{2 \alpha^2}{(\gamma^2 - 2t)^2} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right]          (14)
      + \frac{\alpha^3}{(\gamma^2 - 2t)^{3/2}} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] .

Similarly, applying the product rule to g gives:

g'(t) = -\frac{3}{2} \alpha (\gamma^2 - 2t)^{-5/2} \cdot (-2) \cdot \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right]
      + \alpha (\gamma^2 - 2t)^{-3/2} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] \cdot \left( -\frac{1}{2} \right) \left[ \alpha^2 (\gamma^2 - 2t) \right]^{-1/2} \cdot \left( -2 \alpha^2 \right)

      = \frac{3 \alpha}{(\gamma^2 - 2t)^{5/2}} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right]          (15)
      + \frac{\alpha^2}{(\gamma^2 - 2t)^2} \exp\left[ \alpha \gamma - \sqrt{\alpha^2 (\gamma^2 - 2t)} \right] .

Evaluating M_X'''(0) = f'(0) + g'(0) then gives the third moment:

\mathrm{E}(X^3) = \frac{2 \alpha^2}{\gamma^4} + \frac{\alpha^3}{\gamma^3} + \frac{3 \alpha}{\gamma^5} + \frac{\alpha^2}{\gamma^4} = \frac{\alpha^3}{\gamma^3} + \frac{3 \alpha^2}{\gamma^4} + \frac{3 \alpha}{\gamma^5} .          (16)
Substituting the third moment into (6), we finally obtain:

\mathrm{Skew}(X) = \frac{\gamma^{9/2}}{\alpha^{3/2}} \left[ \mathrm{E}(X^3) - \frac{3 \alpha^2}{\gamma^4} - \frac{\alpha^3}{\gamma^3} \right]

                = \frac{\gamma^{9/2}}{\alpha^{3/2}} \left[ \frac{3 \alpha^2}{\gamma^4} + \frac{\alpha^3}{\gamma^3} + \frac{3 \alpha}{\gamma^5} - \frac{3 \alpha^2}{\gamma^4} - \frac{\alpha^3}{\gamma^3} \right]

                = \frac{\gamma^{9/2}}{\alpha^{3/2}} \cdot \frac{3 \alpha}{\gamma^5}          (17)

                = \frac{3}{\alpha^{1/2} \cdot \gamma^{1/2}}

                = \frac{3}{\sqrt{\alpha \gamma}} .

■
Theorem: Let y = {y₁, …, yₙ} be a set of observed data sampled independently from a Wald distribution (→ II/3.10.1) with parameters γ and α. Then, the method-of-moments estimates of γ and α are given by

\hat{\gamma} = \sqrt{\frac{\bar{y}}{\bar{v}}}

\hat{\alpha} = \sqrt{\frac{\bar{y}^3}{\bar{v}}}          (2)

where ȳ is the sample mean (→ I/1.10.2) and v̄ is the unbiased sample variance (→ I/1.11.2):

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

\bar{v} = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 .          (3)
Proof: The mean (→ II/3.10.4) and variance (→ II/3.10.5) of the Wald distribution (→ II/3.10.1)
in terms of the parameters γ and α are given by
\mathrm{E}(X) = \frac{\alpha}{\gamma}

\mathrm{Var}(X) = \frac{\alpha}{\gamma^3} .          (4)

Thus, matching the moments (→ I/4.1.6) requires us to solve the following system of equations for
γ and α:

\bar{y} = \frac{\alpha}{\gamma}

\bar{v} = \frac{\alpha}{\gamma^3} .          (5)
To this end, our first step is to express the second equation of (5) as follows:

\bar{v} = \frac{\alpha}{\gamma^3} = \frac{\alpha}{\gamma} \cdot \gamma^{-2} = \bar{y} \cdot \gamma^{-2} .          (6)

Rearranging for γ, we have

\gamma^2 = \frac{\bar{y}}{\bar{v}}          (7)

and thus

\hat{\gamma} = \sqrt{\frac{\bar{y}}{\bar{v}}} .          (8)

Substituting this into the first equation of (5), we obtain

\hat{\alpha} = \bar{y} \cdot \hat{\gamma} = \bar{y} \cdot \sqrt{\frac{\bar{y}}{\bar{v}}} = \sqrt{\bar{y}^2 \cdot \frac{\bar{y}}{\bar{v}}} = \sqrt{\frac{\bar{y}^3}{\bar{v}}} .          (9)
Together, (8) and (9) constitute the method-of-moment estimates of γ and α.
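The estimators can be sanity-checked with a round trip: feeding them the population moments of a Wald distribution should return the original parameters exactly. A minimal stdlib-only sketch (not part of the original proof):

```python
import math

def wald_mom(ybar, vbar):
    # Method-of-moments estimates from sample mean and variance:
    # gamma_hat = sqrt(ybar/vbar), alpha_hat = sqrt(ybar^3/vbar)
    return math.sqrt(ybar / vbar), math.sqrt(ybar ** 3 / vbar)

gamma_, alpha = 1.5, 2.0
ybar = alpha / gamma_         # E(X)   = alpha/gamma
vbar = alpha / gamma_ ** 3    # Var(X) = alpha/gamma^3
g_hat, a_hat = wald_mom(ybar, vbar)
print(g_hat, a_hat)
```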
Definition: Let A ∼ N(µ, σ²) and B ∼ Exp(λ) be independent (→ I/1.3.6) random variables (→ I/1.2.2). Then, the sum X = A + B is said to follow an ex-Gaussian (exponentially modified Gaussian) distribution

X ∼ ex-Gaussian(µ, σ, λ) ,          (1)

where µ ∈ R, σ > 0, and λ > 0.
Sources:
• Luce, R. D. (1986): “Response Times: Their Role in Inferring Elementary Mental Organization”,
35-36; URL: https://global.oup.com/academic/product/response-times-9780195036428.
Theorem: Let X be a random variable (→ I/1.2.2) following an ex-Gaussian distribution (→ II/3.11.1):

X ∼ ex-Gaussian(µ, σ, λ) .          (1)

Then the probability density function (→ I/1.7.1) of X is

f_X(t) = \frac{\lambda}{\sqrt{2\pi}} \exp\left[ \frac{\lambda^2 \sigma^2}{2} - \lambda (t - \mu) \right] \cdot \int_{-\infty}^{\frac{t - \mu}{\sigma} - \lambda \sigma} \exp\left[ -\frac{1}{2} y^2 \right] \mathrm{d}y .          (2)

Proof: By definition (→ II/3.11.1), X = A + B with independent A ∼ N(µ, σ²) and B ∼ Exp(λ). The probability density function (→ II/3.2.10) of A is given by

f_A(t) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{t - \mu}{\sigma} \right)^2 \right] ,          (3)

and the probability density function (→ II/3.5.3) for B is given by

f_B(t) = \begin{cases} \lambda \exp[-\lambda t] & \text{if } t \ge 0 \\ 0 & \text{if } t < 0 . \end{cases}          (4)
Thus, the probability density function for the sum (→ I/1.7.2) X = A + B is given by taking the
convolution of fA and fB :
f_X(t) = \int_{-\infty}^{\infty} f_A(x) f_B(t - x) \, \mathrm{d}x

       = \int_{-\infty}^{t} f_A(x) f_B(t - x) \, \mathrm{d}x + \int_{t}^{\infty} f_A(x) f_B(t - x) \, \mathrm{d}x          (5)

       = \int_{-\infty}^{t} f_A(x) f_B(t - x) \, \mathrm{d}x ,
which follows from the fact that fB (t − x) = 0 for x > t. From here, we substitute the expressions
(3) and (4) for the probability density functions fA and fB in (5):
f_X(t) = \int_{-\infty}^{t} \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \cdot \lambda \exp[-\lambda (t - x)] \, \mathrm{d}x

       = \frac{\lambda}{\sigma \sqrt{2\pi}} \int_{-\infty}^{t} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \cdot \exp[-\lambda t + \lambda x] \, \mathrm{d}x

       = \frac{\lambda}{\sigma \sqrt{2\pi}} \int_{-\infty}^{t} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right] \cdot \exp[-\lambda t] \cdot \exp[\lambda x] \, \mathrm{d}x          (6)

       = \frac{\lambda \exp[-\lambda t]}{\sigma \sqrt{2\pi}} \int_{-\infty}^{t} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 + \lambda x \right] \mathrm{d}x .
We can further simplify the integrand with a substitution; to this end, let

y = g(x) = \frac{x - \mu}{\sigma} - \lambda \sigma .          (7)

This gives the following three identities:

\frac{\mathrm{d}y}{\mathrm{d}x} = \frac{1}{\sigma} , \quad \text{or equivalently,} \quad \mathrm{d}x = \sigma \, \mathrm{d}y ,          (8)

\frac{x - \mu}{\sigma} = y + \lambda \sigma , \quad \text{and}          (9)

x = y \sigma + \lambda \sigma^2 + \mu .          (10)
Substituting these identities into (6) gives
f_X(t) = \frac{\lambda \exp[-\lambda t]}{\sigma \sqrt{2\pi}} \int_{-\infty}^{g(t)} \exp\left[ -\frac{1}{2} (y + \lambda \sigma)^2 + \lambda (y \sigma + \lambda \sigma^2 + \mu) \right] \sigma \, \mathrm{d}y

       = \frac{\lambda \sigma \exp[-\lambda t]}{\sigma \sqrt{2\pi}} \int_{-\infty}^{\frac{t - \mu}{\sigma} - \lambda \sigma} \exp\left[ -\frac{1}{2} \left( y^2 + 2 y \lambda \sigma + \lambda^2 \sigma^2 \right) + \lambda y \sigma + \lambda^2 \sigma^2 + \lambda \mu \right] \mathrm{d}y

       = \frac{\lambda \exp[-\lambda t]}{\sqrt{2\pi}} \int_{-\infty}^{\frac{t - \mu}{\sigma} - \lambda \sigma} \exp\left[ -\frac{1}{2} y^2 - y \lambda \sigma - \frac{\lambda^2 \sigma^2}{2} + \lambda y \sigma + \lambda^2 \sigma^2 + \lambda \mu \right] \mathrm{d}y

       = \frac{\lambda \exp[-\lambda t]}{\sqrt{2\pi}} \int_{-\infty}^{\frac{t - \mu}{\sigma} - \lambda \sigma} \exp\left[ -\frac{1}{2} y^2 \right] \cdot \exp\left[ \frac{\lambda^2 \sigma^2}{2} + \lambda \mu \right] \mathrm{d}y          (11)

       = \frac{\lambda}{\sqrt{2\pi}} \cdot \exp\left[ -\lambda t + \frac{\lambda^2 \sigma^2}{2} + \lambda \mu \right] \int_{-\infty}^{\frac{t - \mu}{\sigma} - \lambda \sigma} \exp\left[ -\frac{1}{2} y^2 \right] \mathrm{d}y

       = \frac{\lambda}{\sqrt{2\pi}} \cdot \exp\left[ \frac{\lambda^2 \sigma^2}{2} - \lambda (t - \mu) \right] \int_{-\infty}^{\frac{t - \mu}{\sigma} - \lambda \sigma} \exp\left[ -\frac{1}{2} y^2 \right] \mathrm{d}y .

■
Theorem: Let X be a random variable (→ I/1.2.2) following an ex-Gaussian distribution (→ II/3.11.1):

X ∼ ex-Gaussian(µ, σ, λ) .          (1)

Then, the moment-generating function (→ I/1.9.5) of X is

M_X(t) = \frac{\lambda}{\lambda - t} \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] .          (2)
3.11.4 Mean
Theorem: Let X be a random variable (→ I/1.2.2) following an ex-Gaussian distribution (→
II/3.11.1):
X ∼ ex-Gaussian(µ, σ, λ) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
\mathrm{E}(X) = \mu + \frac{1}{\lambda} .          (2)
Proof: The mean or expected value E(X) is the first raw moment (→ I/1.18.1) of X, so we can
use (→ I/1.18.2) the moment-generating function of the ex-Gaussian distribution (→ II/3.11.3) to
calculate
M_X'(t) = \frac{\lambda}{(\lambda - t)^2} \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] + \frac{\lambda}{\lambda - t} \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] (\mu + \sigma^2 t)

        = \frac{\lambda}{\lambda - t} \cdot \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] \cdot \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right) .          (5)

Evaluating (5) at t = 0 gives the desired result:

M_X'(0) = \frac{\lambda}{\lambda - 0} \cdot \exp\left[ \mu \cdot 0 + \frac{1}{2} \sigma^2 \cdot 0^2 \right] \cdot \left( \frac{1}{\lambda - 0} + \mu + \sigma^2 \cdot 0 \right)

        = 1 \cdot 1 \cdot \left( \frac{1}{\lambda} + \mu \right)          (6)

        = \mu + \frac{1}{\lambda} .
■
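The mean µ + 1/λ can be checked by simulating the defining sum of a Gaussian and an exponential variable. A minimal stdlib-only sketch (not part of the original proof):

```python
import random

random.seed(6)
mu, sigma, lam = 1.0, 0.5, 2.0
n = 200_000

# X = A + B with A ~ N(mu, sigma^2) and B ~ Exp(lam);
# the sample mean should approach mu + 1/lambda.
xs = [random.gauss(mu, sigma) + random.expovariate(lam) for _ in range(n)]
m = sum(xs) / n
print(m, mu + 1.0 / lam)
```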
270 CHAPTER II. PROBABILITY DISTRIBUTIONS
3.11.5 Variance
Theorem: Let X be a random variable (→ I/1.2.2) following an ex-Gaussian distribution (→
II/3.11.1):
X ∼ ex-Gaussian(µ, σ, λ) . (1)
Then, the variance (→ I/1.11.1) of X is
\mathrm{Var}(X) = \sigma^2 + \frac{1}{\lambda^2} .          (2)

Proof: To compute the variance of X, we partition the variance into expected values (→ I/1.11.3):

\mathrm{Var}(X) = \mathrm{E}(X^2) - \mathrm{E}(X)^2 .

Differentiating the moment-generating function of the ex-Gaussian distribution (→ II/3.11.3) once gives

M_X'(t) = \frac{\lambda}{(\lambda - t)^2} \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] + \frac{\lambda}{\lambda - t} \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] (\mu + \sigma^2 t)

        = \frac{\lambda}{\lambda - t} \cdot \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] \cdot \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right)          (6)

        = M_X(t) \cdot \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right) .

Using the product rule, the second derivative is

M_X''(t) = M_X'(t) \cdot \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right) + M_X(t) \cdot \left( \frac{1}{(\lambda - t)^2} + \sigma^2 \right)

         = M_X(t) \cdot \left[ \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right)^2 + \frac{1}{(\lambda - t)^2} + \sigma^2 \right]          (7)

         = \frac{\lambda}{\lambda - t} \cdot \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] \cdot \left[ \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right)^2 + \frac{1}{(\lambda - t)^2} + \sigma^2 \right] .

Evaluating at t = 0 yields the second moment

\mathrm{E}(X^2) = M_X''(0) = \left( \frac{1}{\lambda} + \mu \right)^2 + \frac{1}{\lambda^2} + \sigma^2 ,

such that

\mathrm{Var}(X) = \left( \mu + \frac{1}{\lambda} \right)^2 + \frac{1}{\lambda^2} + \sigma^2 - \left( \mu + \frac{1}{\lambda} \right)^2 = \sigma^2 + \frac{1}{\lambda^2} .

■
3.11.6 Skewness
Theorem: Let X be a random variable (→ I/1.2.2) following an ex-Gaussian distribution (→
II/3.11.1):
X ∼ ex-Gaussian(µ, σ, λ) . (1)
Then the skewness (→ I/1.12.1) of X is
\mathrm{Skew}(X) = \frac{2}{\lambda^3 \left( \sigma^2 + \frac{1}{\lambda^2} \right)^{3/2}} .          (2)
Proof:
To compute the skewness of X, we partition the skewness into expected values (→ I/1.12.3):
\mathrm{Skew}(X) = \frac{\mathrm{E}(X^3) - 3 \mu_X \sigma_X^2 - \mu_X^3}{\sigma_X^3} ,          (3)

where µ_X and σ_X are the mean and standard deviation of X, respectively; the subscripts prevent
confusion with the ex-Gaussian parameters µ and σ in (1).
Differentiating the moment-generating function of the ex-Gaussian distribution (→ II/3.11.3) once gives

M_X'(t) = \frac{\lambda}{(\lambda - t)^2} \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] + \frac{\lambda}{\lambda - t} \exp\left[ \mu t + \frac{1}{2} \sigma^2 t^2 \right] (\mu + \sigma^2 t)

        = M_X(t) \cdot \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right) .          (9)

We then use the product rule to obtain the second derivative:

M_X''(t) = M_X'(t) \cdot \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right) + M_X(t) \cdot \left( \frac{1}{(\lambda - t)^2} + \sigma^2 \right)

         = M_X(t) \cdot \left[ \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right)^2 + \frac{1}{(\lambda - t)^2} + \sigma^2 \right] .          (10)

Finally, we use the product rule and chain rule to obtain the third derivative:

M_X'''(t) = M_X'(t) \cdot \left[ \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right)^2 + \frac{1}{(\lambda - t)^2} + \sigma^2 \right]
          + M_X(t) \cdot \left[ 2 \left( \frac{1}{\lambda - t} + \mu + \sigma^2 t \right) \left( \frac{1}{(\lambda - t)^2} + \sigma^2 \right) + \frac{2}{(\lambda - t)^3} \right] .          (11)
\mathrm{E}(X^3) - 3 \cdot \mathrm{E}(X) \cdot \mathrm{Var}(X) - \mathrm{E}(X)^3 = \frac{6\mu}{\lambda^2} + \frac{6}{\lambda^3} + \frac{3\mu^2 + 3\sigma^2}{\lambda} + 3\mu\sigma^2 + \mu^3 - 3 \left( \mu + \frac{1}{\lambda} \right) \left( \sigma^2 + \frac{1}{\lambda^2} \right) - \left( \mu + \frac{1}{\lambda} \right)^3

= \frac{6\mu}{\lambda^2} + \frac{6}{\lambda^3} + \frac{3\mu^2 + 3\sigma^2}{\lambda} + 3\mu\sigma^2 + \mu^3 - 3\mu\sigma^2 - \frac{3\mu}{\lambda^2} - \frac{3\sigma^2}{\lambda} - \frac{3}{\lambda^3} - \mu^3 - \frac{3\mu^2}{\lambda} - \frac{3\mu}{\lambda^2} - \frac{1}{\lambda^3}          (13)

= \frac{2}{\lambda^3} .
Thus, we have:

\mathrm{Skew}(X) = \frac{2/\lambda^3}{\mathrm{Var}(X)^{3/2}} = \frac{2}{\lambda^3 \left( \sigma^2 + \frac{1}{\lambda^2} \right)^{3/2}} .

■

Theorem: Let y = {y₁, …, yₙ} be a set of observed data sampled independently from an ex-Gaussian distribution (→ II/3.11.1) with parameters µ, σ and λ. Then, the method-of-moments estimates of these parameters are given by

\hat{\mu} = \bar{y} - \sqrt[3]{\frac{\bar{s} \cdot \bar{v}^{3/2}}{2}}

\hat{\sigma} = \sqrt{\bar{v} \cdot \left( 1 - \sqrt[3]{\frac{\bar{s}^2}{4}} \right)}          (2)

\hat{\lambda} = \sqrt[3]{\frac{2}{\bar{s} \cdot \bar{v}^{3/2}}} ,
where ȳ is the sample mean (→ I/1.10.2), v̄ is the sample variance (→ I/1.11.2) and s̄ is the sample
skewness (→ I/1.12.2)
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

\bar{v} = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2          (3)

\bar{s} = \frac{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^3}{\left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \right]^{3/2}} .
Proof: The mean (→ II/3.11.4), variance (→ II/3.11.5), and skewness (→ II/3.11.6) of the ex-
Gaussian distribution (→ II/3.11.1) in terms of the parameters µ, σ, and λ are given by
\mathrm{E}(X) = \mu + \frac{1}{\lambda}

\mathrm{Var}(X) = \sigma^2 + \frac{1}{\lambda^2}          (4)

\mathrm{Skew}(X) = \frac{2}{\lambda^3 \left( \sigma^2 + \frac{1}{\lambda^2} \right)^{3/2}} .

Thus, matching the moments (→ I/4.1.6) requires us to solve the following system of equations for
µ, σ, and λ:

\bar{y} = \mu + \frac{1}{\lambda}

\bar{v} = \sigma^2 + \frac{1}{\lambda^2}          (5)

\bar{s} = \frac{2}{\lambda^3 \left( \sigma^2 + \frac{1}{\lambda^2} \right)^{3/2}} .
To this end, our first step is to substitute the second equation of (5) into the third equation:
\bar{s} = \frac{2}{\lambda^3 \left( \sigma^2 + \frac{1}{\lambda^2} \right)^{3/2}}
= \frac{2}{\lambda^3 \cdot \bar{v}^{3/2}} . (6)
Rearranging this equation for λ yields

\hat{\lambda} = \sqrt[3]{\frac{2}{\bar{s} \cdot \bar{v}^{3/2}}} . (7)

Next, we solve the first equation of (5) for µ and substitute (7):

\mu = \bar{y} - \frac{1}{\lambda}
= \bar{y} - \sqrt[3]{\frac{\bar{s} \cdot \bar{v}^{3/2}}{2}} . (8)
Finally, we solve the second equation of (5) for σ:
\sigma^2 = \bar{v} - \frac{1}{\lambda^2} . (9)
Taking the square root and substituting (7) gives:
\sigma = \sqrt{\bar{v} - \frac{1}{\lambda^2}}

= \sqrt{\bar{v} - \left( \sqrt[3]{\frac{\bar{s} \cdot \bar{v}^{3/2}}{2}} \right)^2}

= \sqrt{\bar{v} - \bar{v} \cdot \sqrt[3]{\frac{\bar{s}^2}{4}}} (10)

= \sqrt{\bar{v} \cdot \left( 1 - \sqrt[3]{\frac{\bar{s}^2}{4}} \right)} .
Together, (8), (10), and (7) constitute the method-of-moment estimates of µ, σ, and λ.
■
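As a numerical sanity check (an editorial sketch, not part of the original proof), the estimates in (2) can be evaluated on simulated data. This assumes NumPy is available and uses the fact that an ex-Gaussian variate is the sum of a Gaussian and an independent exponential variate:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, lam = 1.0, 1.0, 2.0
n = 500_000

# An ex-Gaussian variate is a Gaussian plus an independent
# exponential random variable with rate lambda.
y = rng.normal(mu, sigma, n) + rng.exponential(1.0 / lam, n)

y_bar = y.mean()                               # sample mean
v_bar = y.var(ddof=1)                          # sample variance, as in (3)
d = y - y_bar
s_bar = (d**3).mean() / (d**2).mean()**1.5     # sample skewness, as in (3)

# Method-of-moments estimates from equation (2)
lam_hat = (2.0 / (s_bar * v_bar**1.5))**(1.0 / 3.0)
mu_hat = y_bar - (s_bar * v_bar**1.5 / 2.0)**(1.0 / 3.0)
sigma_hat = np.sqrt(v_bar * (1.0 - (s_bar**2 / 4.0)**(1.0 / 3.0)))
```

With half a million samples, the estimates land close to the true parameters (µ, σ, λ) = (1, 1, 2).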
X ∼ N (µ, Σ) , (1)
if and only if its probability density function (→ I/1.7.1) is given by
N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] (2)
where µ is an n × 1 real vector and Σ is an n × n positive definite matrix.
Sources:
• Koch KR (2007): “Multivariate Normal Distribution”; in: Introduction to Bayesian Statistics,
ch. 2.5.1, pp. 51-53, eq. 2.195; URL: https://www.springer.com/gp/book/9783540727231; DOI:
10.1007/978-3-540-72726-2.
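The density in (2) can be sketched in a few lines of NumPy (an illustrative implementation, not code from the source). For a diagonal covariance matrix the joint density should factorize into a product of univariate normal densities:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density from equation (2)."""
    n = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)        # (x-mu)^T Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi)**n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def norm_pdf(x, m, s2):
    """Univariate normal density with mean m and variance s2."""
    return np.exp(-0.5 * (x - m)**2 / s2) / np.sqrt(2 * np.pi * s2)

mu = np.array([1.0, -2.0, 0.5])
var = np.array([0.5, 2.0, 1.0])
x = np.array([0.8, -1.5, 0.0])

# With a diagonal covariance, the joint density factorizes
p_joint = mvn_pdf(x, mu, np.diag(var))
p_prod = np.prod([norm_pdf(x[i], mu[i], var[i]) for i in range(3)])
```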
MN(x; \mu, \Sigma, 1) = \frac{1}{\sqrt{(2\pi)^n |1|^n |\Sigma|^1}} \cdot \exp\left[ -\frac{1}{2} \mathrm{tr}\left( 1^{-1} (x-\mu)^T \Sigma^{-1} (x-\mu) \right) \right]

= \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] (2)
which is equivalent to the probability density function of the multivariate normal distribution (→
II/4.1.4).
Sources:
• Wikipedia (2022): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-07-31; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Definition.
x ∼ N (0, Σ) . (1)
Then, the quadratic form of x, weighted by Σ, follows a chi-squared distribution (→ II/3.7.1) with n degrees of freedom:

y = x^T \Sigma^{-1} x \sim \chi^2(n) . (2)

Proof: Define the transformed random vector

z = \Sigma^{-1/2} x , (3)
where Σ−1/2 is the matrix square root of Σ. This matrix must exist, because Σ is a covariance matrix
(→ I/1.13.9) and thus positive semi-definite (→ I/1.13.13). Due to the linear transformation theorem
(→ II/4.1.12), z is distributed as
z \sim N\left( \Sigma^{-1/2} 0, \; \Sigma^{-1/2} \Sigma \left( \Sigma^{-1/2} \right)^T \right)
\sim N\left( \Sigma^{-1/2} 0, \; \Sigma^{-1/2} \Sigma^{1/2} \Sigma^{1/2} \Sigma^{-1/2} \right) (4)
\sim N(0, I_n) ,
i.e. each entry of this vector follows (→ II/4.1.13) a standard normal distribution (→ II/3.2.3). Since y = x^T \Sigma^{-1} x = z^T z = \sum_{i=1}^{n} z_i^2 is the sum of n squared standard normal random variables, it follows that

y \sim \chi^2(n) . (8)
Sources:
• Koch KR (2007): “Chi-Squared Distribution”; in: Introduction to Bayesian Statistics, ch. 2.4.5, pp.
48-49, eq. 2.180; URL: https://www.springer.com/gp/book/9783540727231; DOI: 10.1007/978-3-
540-72726-2.
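A quick Monte Carlo illustration of this theorem (an editorial sketch assuming NumPy): the quadratic form should have the mean n and variance 2n of a chi-squared distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)       # an arbitrary positive definite covariance

# Draw x ~ N(0, Sigma) and form the quadratic y = x^T Sigma^{-1} x
x = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
y = np.einsum('ij,jk,ik->i', x, np.linalg.inv(Sigma), x)

# chi^2(n) has mean n and variance 2n
mean_y, var_y = y.mean(), y.var()
```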
X ∼ N (µ, Σ) . (1)
Then, the probability density function (→ I/1.7.1) of X is
f_X(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] . (2)
Proof: This follows directly from the definition of the multivariate normal distribution (→ II/4.1.1).
X ∼ N (µ, Σ) (1)
with means (→ I/1.10.1) µ1 and µ2 , variances (→ I/1.11.1) σ1² and σ2² and covariance (→ I/1.13.1) σ12 :

\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} . (2)
Sources:
• Wikipedia (2023): “Multivariate normal distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2023-09-22; URL: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#
Bivariate_case.
f_X(x) = \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2 - \sigma_{12}^2}} \cdot \exp\left[ -\frac{\sigma_2^2 (x_1-\mu_1)^2 - 2 \sigma_{12} (x_1-\mu_1)(x_2-\mu_2) + \sigma_1^2 (x_2-\mu_2)^2}{2 \left( \sigma_1^2 \sigma_2^2 - \sigma_{12}^2 \right)} \right] . (2)
Proof: The probability density function of the multivariate normal distribution (→ II/4.1.4) for an
n × 1 random vector (→ I/1.2.3) x is:
f_X(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] . (3)
Plugging in n = 2, \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} and \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}, we obtain:

f_X(x) = \frac{1}{\sqrt{(2\pi)^2 \begin{vmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{vmatrix}}} \cdot \exp\left[ -\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^T \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}^{-1} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right] . (4)

Applying the determinant and the inverse of a 2 × 2 matrix, this becomes:

f_X(x) = \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2 - \sigma_{12}^2}} \cdot \exp\left[ -\frac{1}{2 \left( \sigma_1^2 \sigma_2^2 - \sigma_{12}^2 \right)} \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix} \begin{bmatrix} \sigma_2^2 & -\sigma_{12} \\ -\sigma_{12} & \sigma_1^2 \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right]

= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2 - \sigma_{12}^2}} \cdot \exp\left[ -\frac{1}{2 \left( \sigma_1^2 \sigma_2^2 - \sigma_{12}^2 \right)} \begin{bmatrix} \sigma_2^2 (x_1-\mu_1) - \sigma_{12} (x_2-\mu_2) & \sigma_1^2 (x_2-\mu_2) - \sigma_{12} (x_1-\mu_1) \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right]

= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2 - \sigma_{12}^2}} \cdot \exp\left[ -\frac{\sigma_2^2 (x_1-\mu_1)^2 - 2 \sigma_{12} (x_1-\mu_1)(x_2-\mu_2) + \sigma_1^2 (x_2-\mu_2)^2}{2 \left( \sigma_1^2 \sigma_2^2 - \sigma_{12}^2 \right)} \right] . (7)
■
" 2 2 !#
1 1 x1 − µ 1 (x1 − µ1 )(x2 − µ2 ) x2 − µ 2
fX (x) = p · exp − − 2ρ +
2π σ1 σ2 1 − ρ2 2(1 − ρ2 ) σ1 σ1 σ2 σ2
(2)
where ρ is the correlation (→ I/1.14.1) between X1 and X2 .
Proof: Since X follows a special case of the multivariate normal distribution, its covariance matrix
is (→ II/4.1.9)
Cov(X) = \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} (3)
and the covariance matrix can be decomposed into correlation matrix and standard deviations (→
I/1.13.18):
\Sigma = \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix}
= \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} . (4)
Thus, the determinant of Σ is

|\Sigma| = \begin{vmatrix} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \end{vmatrix}
= \begin{vmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{vmatrix} \cdot \begin{vmatrix} 1 & \rho \\ \rho & 1 \end{vmatrix} \cdot \begin{vmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{vmatrix} (5)
= (\sigma_1 \sigma_2)(1-\rho^2)(\sigma_1 \sigma_2)
= \sigma_1^2 \sigma_2^2 (1-\rho^2)
and the inverse of Σ is

\Sigma^{-1} = \left( \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \right)^{-1}
= \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}^{-1} \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}^{-1} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix}^{-1} (6)
= \frac{1}{1-\rho^2} \begin{bmatrix} 1/\sigma_1 & 0 \\ 0 & 1/\sigma_2 \end{bmatrix} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \begin{bmatrix} 1/\sigma_1 & 0 \\ 0 & 1/\sigma_2 \end{bmatrix} .
The probability density function of the multivariate normal distribution (→ II/4.1.4) for an n × 1
random vector (→ I/1.2.3) x is:
f_X(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] . (7)
Plugging in n = 2, µ from (1) and Σ from (5) and (6), the probability density function becomes:
f_X(x) = \frac{1}{\sqrt{(2\pi)^2 \sigma_1^2 \sigma_2^2 (1-\rho^2)}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \frac{1}{1-\rho^2} \begin{bmatrix} 1/\sigma_1 & 0 \\ 0 & 1/\sigma_2 \end{bmatrix} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \begin{bmatrix} 1/\sigma_1 & 0 \\ 0 & 1/\sigma_2 \end{bmatrix} (x-\mu) \right]

= \frac{1}{2 \pi \sigma_1 \sigma_2 \sqrt{1-\rho^2}} \cdot \exp\left[ -\frac{1}{2(1-\rho^2)} \begin{bmatrix} \frac{x_1-\mu_1}{\sigma_1} & \frac{x_2-\mu_2}{\sigma_2} \end{bmatrix} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \begin{bmatrix} \frac{x_1-\mu_1}{\sigma_1} \\ \frac{x_2-\mu_2}{\sigma_2} \end{bmatrix} \right]

= \frac{1}{2 \pi \sigma_1 \sigma_2 \sqrt{1-\rho^2}} \cdot \exp\left[ -\frac{1}{2(1-\rho^2)} \begin{bmatrix} \frac{x_1-\mu_1}{\sigma_1} - \rho \frac{x_2-\mu_2}{\sigma_2} & \frac{x_2-\mu_2}{\sigma_2} - \rho \frac{x_1-\mu_1}{\sigma_1} \end{bmatrix} \begin{bmatrix} \frac{x_1-\mu_1}{\sigma_1} \\ \frac{x_2-\mu_2}{\sigma_2} \end{bmatrix} \right]

= \frac{1}{2 \pi \sigma_1 \sigma_2 \sqrt{1-\rho^2}} \cdot \exp\left[ -\frac{1}{2(1-\rho^2)} \left( \left( \frac{x_1-\mu_1}{\sigma_1} \right)^2 - 2 \rho \frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1 \sigma_2} + \left( \frac{x_2-\mu_2}{\sigma_2} \right)^2 \right) \right] . (8)
Sources:
• Wikipedia (2023): “Multivariate normal distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2023-09-29; URL: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#
Bivariate_case.
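The equivalence of the correlation form (8) and the general matrix form of the density can be checked numerically; this is a hedged NumPy sketch, not code from the source:

```python
import numpy as np

def bvn_pdf_rho(x1, x2, m1, m2, s1, s2, rho):
    """Bivariate normal density in the correlation form of equation (8)."""
    z = ((x1 - m1) / s1)**2 \
        - 2 * rho * (x1 - m1) * (x2 - m2) / (s1 * s2) \
        + ((x2 - m2) / s2)**2
    return np.exp(-z / (2 * (1 - rho**2))) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

def mvn_pdf(x, mu, Sigma):
    """General matrix form of the multivariate normal density."""
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi)**len(mu) * np.linalg.det(Sigma))

m1, m2, s1, s2, rho = 1.0, -1.0, 0.5, 2.0, 0.6
Sigma = np.array([[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]])
p_rho = bvn_pdf_rho(0.7, 0.2, m1, m2, s1, s2, rho)
p_mat = mvn_pdf(np.array([0.7, 0.2]), np.array([m1, m2]), Sigma)
```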
4.1.8 Mean
Theorem: Let x follow a multivariate normal distribution (→ II/4.1.1):
x ∼ N (µ, Σ) . (1)
Then, the mean or expected value (→ I/1.10.1) of x is

E(x) = \mu . (2)
Proof:
1) First, consider a set of independent (→ I/1.3.6) and standard normally (→ II/3.2.3) distributed
random variables (→ I/1.2.2):
i.i.d.
zi ∼ N (0, 1), i = 1, . . . , n . (3)
Then, these variables together form a multivariate normally (→ II/4.1.15) distributed random vector
(→ I/1.2.3):
z ∼ N (0n , In ) . (4)
By definition, the expected value of a random vector is equal to the vector of all expected values (→
I/1.10.13):
E(z) = E \begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} E(z_1) \\ \vdots \\ E(z_n) \end{bmatrix} . (5)
Because the expected value of all its entries is zero (→ II/3.2.16), the expected value of the random
vector is
E(z) = \begin{bmatrix} E(z_1) \\ \vdots \\ E(z_n) \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix} = 0_n . (6)
2) Next, consider an n × n matrix A solving the equation AAT = Σ. Such a matrix exists, because Σ
is defined to be positive definite (→ II/4.1.1). Then, x can be represented as a linear transformation
of (→ II/4.1.12) z:

x = Az + \mu . (7)

Thus, by the linearity of the expected value, we have

E(x) = E(Az + \mu)
= E(Az) + E(\mu)
= A \, E(z) + \mu (9)
\overset{(6)}{=} A \, 0_n + \mu
= \mu .
■
Sources:
• Taboga, Marco (2021): “Multivariate normal distribution”; in: Lectures on probability theory and
mathematical statistics, retrieved on 2022-09-15; URL: https://www.statlect.com/probability-distributions/
multivariate-normal-distribution.
4.1.9 Covariance
Theorem: Let x follow a multivariate normal distribution (→ II/4.1.1):
x ∼ N (µ, Σ) . (1)
Then, the covariance matrix (→ I/1.13.9) of x is
Cov(x) = Σ . (2)
Proof:
1) First, consider a set of independent (→ I/1.3.6) and standard normally (→ II/3.2.3) distributed
random variables (→ I/1.2.2):
i.i.d.
zi ∼ N (0, 1), i = 1, . . . , n . (3)
Then, these variables together form a multivariate normally (→ II/4.1.15) distributed random vector
(→ I/1.2.3):
z ∼ N (0n , In ) . (4)
Because the covariance is zero for independent random variables (→ I/1.13.6), we have
Cov(x) = Cov(Az + \mu)
= Cov(Az)
= A \, Cov(z) \, A^T
= A I_n A^T
= A A^T
= \Sigma .
Sources:
• Rosenfeld, Meni (2016): “Deriving the Covariance of Multivariate Gaussian”; in: StackExchange
Mathematics, retrieved on 2022-09-15; URL: https://math.stackexchange.com/questions/1905977/
deriving-the-covariance-of-multivariate-gaussian.
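The representation x = Az + µ with AAᵀ = Σ used in both proofs can be simulated directly (an illustrative NumPy sketch; the Cholesky factor is one valid choice of A):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
mu = np.array([1.0, -1.0])

# A with A A^T = Sigma (Cholesky factor), as in the proofs
A = np.linalg.cholesky(Sigma)
z = rng.standard_normal((500_000, 2))     # z ~ N(0, I)
x = z @ A.T + mu                          # x = A z + mu

# Empirical mean and covariance should match mu and Sigma
emp_mean = x.mean(axis=0)
emp_cov = np.cov(x.T)
err_mean = np.abs(emp_mean - mu).max()
err_cov = np.abs(emp_cov - Sigma).max()
```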
x ∼ N (µ, Σ) . (1)
Then, the differential entropy (→ I/2.2.1) of x in nats is
h(x) = \frac{n}{2} \ln(2\pi) + \frac{1}{2} \ln|\Sigma| + \frac{1}{2} n . (2)
" !#
1 1
h(x) = −E ln p · exp − (x − µ)T Σ−1 (x − µ)
(2π) |Σ|
n 2
n 1 1 T −1 (5)
= −E − ln(2π) − ln |Σ| − (x − µ) Σ (x − µ)
2 2 2
n 1 1
= ln(2π) + ln |Σ| + E (x − µ)T Σ−1 (x − µ) .
2 2 2
The last term can be evaluated as

E\left[ (x-\mu)^T \Sigma^{-1} (x-\mu) \right] = E\left[ \mathrm{tr}\left( (x-\mu)^T \Sigma^{-1} (x-\mu) \right) \right]
= E\left[ \mathrm{tr}\left( \Sigma^{-1} (x-\mu)(x-\mu)^T \right) \right]
= \mathrm{tr}\left( \Sigma^{-1} E\left[ (x-\mu)(x-\mu)^T \right] \right) (6)
= \mathrm{tr}\left( \Sigma^{-1} \Sigma \right)
= \mathrm{tr}(I_n)
= n ,

such that (5) reduces to the expression given by (2).
Sources:
• Kiuhnm (2018): “Entropy of the multivariate Gaussian”; in: StackExchange Mathematics, retrieved
on 2020-05-14; URL: https://math.stackexchange.com/questions/2029707/entropy-of-the-multivariate-gau
P : x ∼ N (µ1 , Σ1 )
(1)
Q : x ∼ N (µ2 , Σ2 ) .
KL[P \,||\, Q] = \int_{\mathbb{R}^n} N(x; \mu_1, \Sigma_1) \ln \frac{N(x; \mu_1, \Sigma_1)}{N(x; \mu_2, \Sigma_2)} \, dx
= \left\langle \ln \frac{N(x; \mu_1, \Sigma_1)}{N(x; \mu_2, \Sigma_2)} \right\rangle_{p(x)} . (4)
Using the probability density function of the multivariate normal distribution (→ II/4.1.4), this
becomes:
KL[P \,||\, Q] = \left\langle \ln \frac{ \frac{1}{\sqrt{(2\pi)^n |\Sigma_1|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) \right] }{ \frac{1}{\sqrt{(2\pi)^n |\Sigma_2|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) \right] } \right\rangle_{p(x)}

= \left\langle \frac{1}{2} \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + \frac{1}{2} (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) \right\rangle_{p(x)} (5)

= \frac{1}{2} \left\langle \ln \frac{|\Sigma_2|}{|\Sigma_1|} - (x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + (x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) \right\rangle_{p(x)} .
Now, using the fact that a = tr(a) for scalar a and the trace property tr(ABC) = tr(BCA), we
have:
KL[P \,||\, Q] = \frac{1}{2} \left\langle \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \mathrm{tr}\left( \Sigma_1^{-1} (x-\mu_1)(x-\mu_1)^T \right) + \mathrm{tr}\left( \Sigma_2^{-1} (x-\mu_2)(x-\mu_2)^T \right) \right\rangle_{p(x)}

= \frac{1}{2} \left\langle \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \mathrm{tr}\left( \Sigma_1^{-1} (x-\mu_1)(x-\mu_1)^T \right) + \mathrm{tr}\left( \Sigma_2^{-1} \left( x x^T - 2 \mu_2 x^T + \mu_2 \mu_2^T \right) \right) \right\rangle_{p(x)} . (6)
Because trace function and expected value are both linear operators (→ I/1.10.8), the expectation
can be moved inside the trace:
KL[P \,||\, Q] = \frac{1}{2} \left[ \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \mathrm{tr}\left( \Sigma_1^{-1} \left\langle (x-\mu_1)(x-\mu_1)^T \right\rangle_{p(x)} \right) + \mathrm{tr}\left( \Sigma_2^{-1} \left\langle x x^T - 2 \mu_2 x^T + \mu_2 \mu_2^T \right\rangle_{p(x)} \right) \right]

= \frac{1}{2} \left[ \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \mathrm{tr}\left( \Sigma_1^{-1} \left\langle (x-\mu_1)(x-\mu_1)^T \right\rangle_{p(x)} \right) + \mathrm{tr}\left( \Sigma_2^{-1} \left( \left\langle x x^T \right\rangle_{p(x)} - 2 \mu_2 \left\langle x \right\rangle_{p(x)}^T + \mu_2 \mu_2^T \right) \right) \right] . (7)
Using the expectation of a linear form for the multivariate normal distribution (→ II/4.1.12)
KL[P \,||\, Q] = \frac{1}{2} \left[ \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \mathrm{tr}\left( \Sigma_1^{-1} \Sigma_1 \right) + \mathrm{tr}\left( \Sigma_2^{-1} \left( \Sigma_1 + \mu_1 \mu_1^T - 2 \mu_2 \mu_1^T + \mu_2 \mu_2^T \right) \right) \right]

= \frac{1}{2} \left[ \ln \frac{|\Sigma_2|}{|\Sigma_1|} - \mathrm{tr}(I_n) + \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) + \mathrm{tr}\left( \Sigma_2^{-1} \left( \mu_1 \mu_1^T - 2 \mu_2 \mu_1^T + \mu_2 \mu_2^T \right) \right) \right]

= \frac{1}{2} \left[ \ln \frac{|\Sigma_2|}{|\Sigma_1|} - n + \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) + \mu_1^T \Sigma_2^{-1} \mu_1 - 2 \mu_1^T \Sigma_2^{-1} \mu_2 + \mu_2^T \Sigma_2^{-1} \mu_2 \right] (10)

= \frac{1}{2} \left[ \ln \frac{|\Sigma_2|}{|\Sigma_1|} - n + \mathrm{tr}\left( \Sigma_2^{-1} \Sigma_1 \right) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) \right] .
Finally, rearranging the terms, we get:
KL[P \,||\, Q] = \frac{1}{2} \left[ (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) + \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) - \ln \frac{|\Sigma_1|}{|\Sigma_2|} - n \right] . (11)
■
Sources:
• Duchi, John (2014): “Derivations for Linear Algebra and Optimization”; in: University of Cali-
fornia, Berkeley; URL: http://www.eecs.berkeley.edu/~jduchi/projects/general_notes.pdf.
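The closed form (11) can be validated against a Monte Carlo estimate of the KL divergence (an editorial NumPy sketch; the parameter values are arbitrary):

```python
import numpy as np

def kl_mvn(mu1, S1, mu2, S2):
    """Closed-form KL divergence from equation (11)."""
    n = len(mu1)
    d = mu2 - mu1
    S2inv = np.linalg.inv(S2)
    return 0.5 * (d @ S2inv @ d + np.trace(S2inv @ S1)
                  - np.log(np.linalg.det(S1) / np.linalg.det(S2)) - n)

def log_pdf(x, mu, S):
    """Row-wise multivariate normal log-density."""
    d = x - mu
    q = np.einsum('ij,jk,ik->i', d, np.linalg.inv(S), d)
    return -0.5 * (np.log((2 * np.pi)**x.shape[1] * np.linalg.det(S)) + q)

rng = np.random.default_rng(0)
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, -0.5])
S1 = np.array([[1.0, 0.3], [0.3, 0.8]])
S2 = np.array([[1.5, -0.2], [-0.2, 1.0]])

kl_exact = kl_mvn(mu1, S1, mu2, S2)

# KL[P || Q] is the expected log density ratio under P
x = rng.multivariate_normal(mu1, S1, size=200_000)
kl_mc = np.mean(log_pdf(x, mu1, S1) - log_pdf(x, mu2, S2))
```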
x ∼ N (µ, Σ) . (1)
Then, any linear transformation of x is also multivariate normally distributed:

y = Ax + b \sim N\left( A\mu + b, \; A \Sigma A^T \right) . (2)

Proof: The moment-generating function of y = Ax + b is

M_y(t) = E\left[ \exp\left( t^T (Ax + b) \right) \right]
= E\left[ \exp\left( t^T A x \right) \cdot \exp\left( t^T b \right) \right]
= \exp\left( t^T b \right) \cdot E\left[ \exp\left( t^T A x \right) \right] (4)
= \exp\left( t^T b \right) \cdot M_x(A^T t) .
M_y(t) = \exp\left( t^T b \right) \cdot M_x(A^T t)
= \exp\left( t^T b \right) \cdot \exp\left( t^T A \mu + \frac{1}{2} t^T A \Sigma A^T t \right) (6)
= \exp\left( t^T (A\mu + b) + \frac{1}{2} t^T A \Sigma A^T t \right) .
Because moment-generating function and probability density function of a random variable are equiv-
alent, this demonstrates that y is following a multivariate normal distribution with mean Aµ + b and
covariance AΣAT .
Sources:
• Taboga, Marco (2010): “Linear combinations of normal random variables”; in: Lectures on probabil-
ity and statistics, retrieved on 2019-08-27; URL: https://www.statlect.com/probability-distributions/
normal-distribution-linear-combinations.
x ∼ N (µ, Σ) . (1)
Then, the marginal distribution (→ I/1.5.3) of any subset vector xs is also a multivariate normal
distribution
xs ∼ N (µs , Σs ) (2)
where µs drops the irrelevant variables (the ones not in the subset, i.e. marginalized out) from the
mean vector µ and Σs drops the corresponding rows and columns from the covariance matrix Σ.
Proof: Define an m × n subset matrix S such that sij = 1, if the j-th element in xs corresponds to
the i-th element in x, and sij = 0 otherwise. Then,
xs = Sx (3)
and we can apply the linear transformation theorem (→ II/4.1.12) to give

x_s = Sx \sim N\left( S\mu, \; S \Sigma S^T \right) = N(\mu_s, \Sigma_s) .

■
x ∼ N (µ, Σ) . (1)
Then, the conditional distribution (→ I/1.5.4) of any subset vector x1 , given the complement vector x2 , is also a multivariate normal distribution

x_1 | x_2 \sim N(\mu_{1|2}, \Sigma_{1|2}) (2)

with conditional mean and covariance

\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} , (3)

where µ and Σ are partitioned block-wise as

\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} . (4)
x1 , x2 ∼ N (µ, Σ) . (6)
Moreover, the marginal distribution (→ I/1.5.3) of x2 follows from (→ II/4.1.13) (1) and (4) as

x_2 \sim N(\mu_2, \Sigma_{22}) . (7)
p(x_1 | x_2) = \frac{p(x_1, x_2)}{p(x_2)} . (8)
Applying (6) and (7) to (8), we have:
p(x_1 | x_2) = \frac{N(x; \mu, \Sigma)}{N(x_2; \mu_2, \Sigma_{22})} . (9)
Using the probability density function of the multivariate normal distribution (→ II/4.1.4), this
becomes:
p(x_1 | x_2) = \frac{ 1/\sqrt{(2\pi)^n |\Sigma|} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] }{ 1/\sqrt{(2\pi)^{n_2} |\Sigma_{22}|} \cdot \exp\left[ -\frac{1}{2} (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2) \right] }

= \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) + \frac{1}{2} (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2) \right] . (10)
Writing the inverse of Σ in block form

\Sigma^{-1} = \begin{bmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{bmatrix} , (11)

this becomes:

p(x_1 | x_2) = \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot \exp\left[ -\frac{1}{2} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix}^T \begin{bmatrix} \Sigma^{11} & \Sigma^{12} \\ \Sigma^{21} & \Sigma^{22} \end{bmatrix} \begin{bmatrix} x_1-\mu_1 \\ x_2-\mu_2 \end{bmatrix} + \frac{1}{2} (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2) \right] . (12)
Multiplying out the first quadratic form yields

p(x_1 | x_2) = \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot
\exp\left[ -\frac{1}{2} \left( (x_1-\mu_1)^T \Sigma^{11} (x_1-\mu_1) + 2 (x_1-\mu_1)^T \Sigma^{12} (x_2-\mu_2) + (x_2-\mu_2)^T \Sigma^{22} (x_2-\mu_2) \right) + \frac{1}{2} (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2) \right] , (13)

where we have used the fact that \Sigma^{21} = \left( \Sigma^{12} \right)^T , because \Sigma^{-1} is a symmetric matrix.
The blocks of the inverse can be expressed via the Schur complement of Σ22 :

\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}^{-1} = \begin{bmatrix} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} & -(\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} \\ -\Sigma_{22}^{-1} \Sigma_{21} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} & \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \Sigma_{21} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} \end{bmatrix} . (15)
Plugging this into (13), we have:
p(x_1 | x_2) = \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot
\exp\left[ -\frac{1}{2} \left( (x_1-\mu_1)^T (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} (x_1-\mu_1) - 2 (x_1-\mu_1)^T (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) + (x_2-\mu_2)^T \left( \Sigma_{22}^{-1} + \Sigma_{22}^{-1} \Sigma_{21} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} \right) (x_2-\mu_2) \right) + \frac{1}{2} (x_2-\mu_2)^T \Sigma_{22}^{-1} (x_2-\mu_2) \right] . (16)
Eliminating some terms, we have:
p(x_1 | x_2) = \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot
\exp\left[ -\frac{1}{2} \left( (x_1-\mu_1)^T (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} (x_1-\mu_1) - 2 (x_1-\mu_1)^T (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) + (x_2-\mu_2)^T \Sigma_{22}^{-1} \Sigma_{21} (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) \right) \right] . (17)
Rearranging the terms, we have
p(x_1 | x_2) = \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot \exp\left[ -\frac{1}{2} \left( (x_1-\mu_1) - \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) \right)^T (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \left( (x_1-\mu_1) - \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) \right) \right]

= \frac{1}{\sqrt{(2\pi)^{n-n_2}}} \cdot \sqrt{\frac{|\Sigma_{22}|}{|\Sigma|}} \cdot \exp\left[ -\frac{1}{2} \left( x_1 - \left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) \right) \right)^T (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \left( x_1 - \left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) \right) \right) \right] . (18)
The determinant of a block matrix satisfies

\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |D| \cdot |A - B D^{-1} C| , (19)

such that

\begin{vmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{vmatrix} = |\Sigma_{22}| \cdot |\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}| . (20)
With this and n − n2 = n1 , we finally arrive at
p(x_1 | x_2) = \frac{1}{\sqrt{(2\pi)^{n_1} |\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}|}} \cdot \exp\left[ -\frac{1}{2} \left( x_1 - \left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) \right) \right)^T (\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21})^{-1} \left( x_1 - \left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) \right) \right) \right] , (21)

which is the probability density function of a multivariate normal distribution with mean \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2-\mu_2) and covariance \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} .

■
Sources:
• Wang, Ruye (2006): “Marginal and conditional distributions of multivariate normal distribution”;
in: Computer Image Processing and Analysis; URL: http://fourier.eng.hmc.edu/e161/lectures/
gaussianprocess/node7.html.
• Wikipedia (2020): “Multivariate normal distribution”; in: Wikipedia, the free encyclopedia, re-
trieved on 2020-03-20; URL: https://en.wikipedia.org/wiki/Multivariate_normal_distribution#
Conditional_distributions.
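The block-inverse identity (15), which drives the whole derivation, is easy to verify numerically (an illustrative NumPy sketch with an arbitrary positive definite Σ):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 4 * np.eye(4)          # positive definite 4x4 covariance
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

# Schur complement of Sigma_22 -- the conditional covariance in (21)
schur = S11 - S12 @ np.linalg.inv(S22) @ S21

# Per the block-inverse identity (15), the upper-left block of Sigma^{-1}
# equals the inverse of the Schur complement
P = np.linalg.inv(Sigma)
err = np.abs(P[:2, :2] - np.linalg.inv(schur)).max()
```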
x ∼ N (µ, Σ) . (1)
Then, the components of x are statistically independent (→ I/1.3.6), if and only if the covariance
matrix (→ I/1.13.9) is a diagonal matrix:
p(x) = p(x_1) \cdot \ldots \cdot p(x_n) \quad \Leftrightarrow \quad \Sigma = \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right) . (2)
Proof: The marginal distribution of one entry from a multivariate normal random vector is a univariate normal distribution (→ II/4.1.13) where mean (→ I/1.10.1) and variance (→ I/1.11.1) are equal to the corresponding entries of the mean vector and covariance matrix:

x_i \sim N(\mu_i, \sigma_i^2) . (3)

The probability density function of the multivariate normal distribution (→ II/4.1.4) is

p(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] (4)

and the probability density function of the univariate normal distribution (→ II/3.2.10) is

p(x_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right] . (5)
1) Let p(x) = p(x_1) \cdot \ldots \cdot p(x_n) . Then, combining (4) and (5), we have

\frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right]

\frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] = \frac{1}{\sqrt{(2\pi)^n \prod_{i=1}^{n} \sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} (x_i - \mu_i) \frac{1}{\sigma_i^2} (x_i - \mu_i) \right]

-\frac{1}{2} \log|\Sigma| - \frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) = -\frac{1}{2} \sum_{i=1}^{n} \log \sigma_i^2 - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu_i) \frac{1}{\sigma_i^2} (x_i - \mu_i) , (7)

which holds for all x, if and only if \Sigma^{-1} = \mathrm{diag}\left( 1/\sigma_1^2, \ldots, 1/\sigma_n^2 \right), i.e. if and only if \Sigma = \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right).
2) Let

\Sigma = \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right) . (12)

Then, we have

p(x) \overset{(4)}{=} \frac{1}{\sqrt{(2\pi)^n \left| \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right) \right|}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \mathrm{diag}\left( \sigma_1^2, \ldots, \sigma_n^2 \right)^{-1} (x-\mu) \right]

= \frac{1}{\sqrt{(2\pi)^n \prod_{i=1}^{n} \sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T \mathrm{diag}\left( 1/\sigma_1^2, \ldots, 1/\sigma_n^2 \right) (x-\mu) \right]

= \frac{1}{\sqrt{(2\pi)^n \prod_{i=1}^{n} \sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2} \right] (13)

= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \cdot \exp\left[ -\frac{1}{2} \left( \frac{x_i - \mu_i}{\sigma_i} \right)^2 \right]

= p(x_1) \cdot \ldots \cdot p(x_n) .
X ∼ N (µ, Σ) (1)
and consider two matrices A ∈ R^{k×n} and B ∈ R^{l×n} . Then, AX and BX are independent (→ I/1.3.6), if and only if the cross-matrix product, weighted with the covariance matrix (→ II/4.1.9), is equal to the zero matrix:

A \Sigma B^T = 0_{k \times l} . (2)
Sources:
• jld (2018): “Understanding t-test for linear regression”; in: StackExchange CrossValidated, re-
trieved on 2022-12-13; URL: https://stats.stackexchange.com/a/344008.
X ∼ t(µ, Σ, ν) , (1)
if and only if its probability density function (→ I/1.7.1) is given by
t(x; \mu, \Sigma, \nu) = \sqrt{\frac{1}{(\nu\pi)^n |\Sigma|}} \cdot \frac{\Gamma([\nu+n]/2)}{\Gamma(\nu/2)} \cdot \left[ 1 + \frac{1}{\nu} (x-\mu)^T \Sigma^{-1} (x-\mu) \right]^{-(\nu+n)/2} (2)
where µ is an n × 1 real vector, Σ is an n × n positive definite matrix and ν > 0.
Sources:
• Koch KR (2007): “Multivariate t-Distribution”; in: Introduction to Bayesian Statistics, ch. 2.5.2,
pp. 53-55; URL: https://www.springer.com/gp/book/9783540727231; DOI: 10.1007/978-3-540-
72726-2.
X ∼ t(µ, Σ, ν) . (1)
Then, the probability density function (→ I/1.7.1) of X is
f_X(x) = \sqrt{\frac{1}{(\nu\pi)^n |\Sigma|}} \cdot \frac{\Gamma([\nu+n]/2)}{\Gamma(\nu/2)} \cdot \left[ 1 + \frac{1}{\nu} (x-\mu)^T \Sigma^{-1} (x-\mu) \right]^{-(\nu+n)/2} . (2)
Proof: This follows directly from the definition of the multivariate t-distribution (→ II/4.2.1).
■
X ∼ t(µ, Σ, ν) . (1)
Then, the centered, weighted and standardized quadratic form of X follows an F-distribution (→ II/3.8.1) with degrees of freedom n and ν:

Z = \frac{1}{n} (X-\mu)^T \Sigma^{-1} (X-\mu) \sim F(n, \nu) . (2)

Proof: The linear transformation theorem for the multivariate t-distribution states

X \sim t(\mu, \Sigma, \nu) \quad \Rightarrow \quad AX + b \sim t(A\mu + b, \; A \Sigma A^T, \; \nu) . (3)

Define

Y = \Sigma^{-1/2} (X - \mu) \quad \text{and} \quad Z = Y^T Y / n , (4)

where Σ−1/2 is a matrix square root of the inverse of Σ. Then, applying (3) to (4) with (1), one obtains the distribution of Y as

Y \sim t(0, I_n, \nu) , (5)

i.e. the marginal distributions (→ I/1.5.3) of the individual entries of Y are univariate t-distributions (→ II/3.3.1) with ν degrees of freedom:

Y_i \sim t(\nu), \quad i = 1, \ldots, n . (6)
Note that, when X follows a t-distribution with n degrees of freedom, this is equivalent to (→
II/3.3.1) an expression of X in terms of a standard normal (→ II/3.2.3) random variable Z and a
chi-squared (→ II/3.7.1) random variable V :
X \sim t(n) \quad \Leftrightarrow \quad X = \frac{Z}{\sqrt{V/n}} \quad \text{with independent} \quad Z \sim N(0, 1) \quad \text{and} \quad V \sim \chi^2(n) . (7)
With that, Z from (4) can be rewritten as follows:
Z \overset{(4)}{=} Y^T Y / n
= \frac{1}{n} \sum_{i=1}^{n} Y_i^2
\overset{(7)}{=} \frac{1}{n} \sum_{i=1}^{n} \left( \frac{Z_i}{\sqrt{V/\nu}} \right)^2 (8)
= \frac{\left( \sum_{i=1}^{n} Z_i^2 \right) / n}{V/\nu} .
Because by definition, the sum of squared standard normal random variables follows a chi-squared
distribution (→ II/3.7.1)
X_i \sim N(0, 1), \; i = 1, \ldots, n \quad \Rightarrow \quad \sum_{i=1}^{n} X_i^2 \sim \chi^2(n) , (9)
Z can be written as

Z = \frac{W/n}{V/\nu} \quad \text{with} \quad W \sim \chi^2(n) \quad \text{and} \quad V \sim \chi^2(\nu) , (10)
such that Z, by definition, follows an F-distribution (→ II/3.8.1):
Z = \frac{W/n}{V/\nu} \sim F(n, \nu) . (11)
■
Sources:
• Lin, Pi-Erh (1972): “Some Characterizations of the Multivariate t Distribution”; in: Journal of
Multivariate Analysis, vol. 2, pp. 339-344, Lemma 2; URL: https://core.ac.uk/download/pdf/
81139018.pdf; DOI: 10.1016/0047-259X(72)90021-8.
• Nadarajah, Saralees; Kotz, Samuel (2005): “Mathematical Properties of the Multivariate t Dis-
tribution”; in: Acta Applicandae Mathematicae, vol. 89, pp. 53-84, page 56; URL: https://link.
springer.com/content/pdf/10.1007/s10440-005-9003-4.pdf; DOI: 10.1007/s10440-005-9003-4.
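A Monte Carlo illustration of this theorem (an editorial sketch, assuming NumPy): sampling X via the normal/chi-squared construction in (7), the standardized quadratic form should have the mean ν/(ν − 2) of an F(n, ν) distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, nu = 3, 10
mu = np.array([1.0, 0.0, -1.0])
B = rng.standard_normal((n, n))
Sigma = B @ B.T + n * np.eye(n)      # arbitrary positive definite shape matrix
A = np.linalg.cholesky(Sigma)

# X ~ t(mu, Sigma, nu): a scaled normal divided by sqrt(chi2/nu)
N = 400_000
z = rng.standard_normal((N, n)) @ A.T
v = rng.chisquare(nu, size=N)
x = mu + z / np.sqrt(v / nu)[:, None]

# (X-mu)^T Sigma^{-1} (X-mu) / n should follow F(n, nu),
# whose mean is nu / (nu - 2)
d = x - mu
q = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d) / n
mean_q = q.mean()
```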
X, Y ∼ NG(µ, Λ, a, b) , (1)
if the distribution of X conditional on Y is a multivariate normal distribution (→ II/4.1.1) with
mean vector µ and covariance matrix (yΛ)−1 and Y follows a gamma distribution (→ II/3.4.1) with
shape parameter a and rate parameter b:

X | Y \sim N\left( \mu, (Y\Lambda)^{-1} \right)
Y \sim \mathrm{Gam}(a, b) .
Sources:
• Koch KR (2007): “Normal-Gamma Distribution”; in: Introduction to Bayesian Statistics, ch. 2.5.3,
pp. 55-56, eq. 2.212; URL: https://www.springer.com/gp/book/9783540727231; DOI: 10.1007/978-
3-540-72726-2.
Proof: Let X be an n × p real matrix and let Y be a p × p positive-definite symmetric matrix, such
that X and Y jointly follow a normal-Wishart distribution (→ II/5.3.1):
X, Y ∼ NW(M, U, V, ν) . (1)
Then, X and Y are described by the probability density function (→ II/5.3.2)
p(X, Y) = \sqrt{\frac{1}{(2\pi)^{np} |U|^p}} \cdot \sqrt{\frac{2^{-\nu p}}{|V|^\nu}} \cdot \frac{1}{\Gamma_p\!\left( \frac{\nu}{2} \right)} \cdot |Y|^{(\nu+n-p-1)/2} \cdot \exp\left[ -\frac{1}{2} \mathrm{tr}\left( Y \left[ (X-M)^T U^{-1} (X-M) + V^{-1} \right] \right) \right] (2)
2
where |A| is a matrix determinant, A−1 is a matrix inverse and Γp (x) is the multivariate gamma
function of order p. If p = 1, then Γp (x) = Γ(x) is the ordinary gamma function, x = X is a column
vector and y = Y is a real number. Thus, the probability density function (→ I/1.7.1) of x and y
can be developed as

p(x, y) = \sqrt{\frac{1}{(2\pi)^n |U|}} \cdot \sqrt{\frac{2^{-\nu}}{|V|^\nu}} \cdot \frac{1}{\Gamma\!\left( \frac{\nu}{2} \right)} \cdot y^{(\nu+n-2)/2} \cdot \exp\left[ -\frac{1}{2} y \left[ (x-M)^T U^{-1} (x-M) + V^{-1} \right] \right]

= \sqrt{\frac{|U^{-1}|}{(2\pi)^n}} \cdot \frac{(2|V|)^{-\nu/2}}{\Gamma\!\left( \frac{\nu}{2} \right)} \cdot y^{\frac{\nu}{2} + \frac{n}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \left[ (x-M)^T U^{-1} (x-M) + 2 \, (2V)^{-1} \right] \right] (3)

= \sqrt{\frac{|U^{-1}|}{(2\pi)^n}} \cdot \frac{\left( \frac{1}{2|V|} \right)^{\nu/2}}{\Gamma\!\left( \frac{\nu}{2} \right)} \cdot y^{\frac{\nu}{2} + \frac{n}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \left[ (x-M)^T U^{-1} (x-M) + 2 \cdot \frac{1}{2V} \right] \right] ,

which is the probability density function of a normal-gamma distribution (→ II/4.3.3) with parameters \mu = M , \Lambda = U^{-1} , a = \frac{\nu}{2} and b = \frac{1}{2V} .
x, y ∼ NG(µ, Λ, a, b) . (1)
Then, the joint probability (→ I/1.3.2) density function (→ I/1.7.1) of x and y is
p(x, y) = \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \cdot y^{a + \frac{n}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \left( (x-\mu)^T \Lambda (x-\mu) + 2b \right) \right] . (2)
Thus, using the probability density function of the multivariate normal distribution (→ II/4.1.4) and the probability density function of the gamma distribution (→ II/3.4.7), we have:

p(x, y) = p(x|y) \, p(y)
= N\left( x; \mu, (y\Lambda)^{-1} \right) \cdot \mathrm{Gam}(y; a, b)
= \sqrt{\frac{|y\Lambda|}{(2\pi)^n}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T (y\Lambda) (x-\mu) \right] \cdot \frac{b^a}{\Gamma(a)} \, y^{a-1} \exp[-by] .

Using the relation |yA| = y^n |A| for an n × n matrix A and rearranging the terms, we have:

p(x, y) = \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \cdot y^{a + \frac{n}{2} - 1} \cdot \exp\left[ -\frac{y}{2} \left( (x-\mu)^T \Lambda (x-\mu) + 2b \right) \right] . (7)
Sources:
• Koch KR (2007): “Normal-Gamma Distribution”; in: Introduction to Bayesian Statistics, ch. 2.5.3,
pp. 55-56, eq. 2.212; URL: https://www.springer.com/gp/book/9783540727231; DOI: 10.1007/978-
3-540-72726-2.
4.3.4 Mean
Theorem: Let x ∈ Rn and y > 0 follow a normal-gamma distribution (→ II/4.3.1):
x, y ∼ NG(µ, Λ, a, b) . (1)
Then, the expected value (→ I/1.10.1) of x and y is
E[(x, y)] = \left( \mu, \frac{a}{b} \right) . (2)
E(x) = \iint x \cdot p(x, y) \, dx \, dy
= \iint x \cdot p(x|y) \cdot p(y) \, dx \, dy
= \int p(y) \left[ \int x \cdot p(x|y) \, dx \right] dy
= \int p(y) \, \langle x \rangle_{N(\mu, (y\Lambda)^{-1})} \, dy (6)
= \int p(y) \cdot \mu \, dy
= \mu \int p(y) \, dy
= \mu ,
and with the expected value of the gamma distribution (→ II/3.4.10), E(y) becomes
E(y) = \int y \cdot p(y) \, dy
= \langle y \rangle_{\mathrm{Gam}(a, b)} (7)
= \frac{a}{b} .
Thus, the expectation of the random vector in equations (3) and (4) is
E \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \mu \\ a/b \end{bmatrix} , (8)
as indicated by equation (2).
4.3.5 Covariance
Theorem: Let x ∈ Rn and y > 0 follow a normal-gamma distribution (→ II/4.3.1):
x, y ∼ NG(µ, Λ, a, b) . (1)
Then,
1) the covariance (→ I/1.13.1) of x, conditional (→ I/1.5.4) on y is
Cov(x|y) = \frac{1}{y} \Lambda^{-1} ; (2)
y
2) the covariance (→ I/1.13.1) of x, unconditional (→ I/1.5.3) on y is
Cov(x) = \frac{b}{a-1} \Lambda^{-1} ; (3)
a−1
3) the variance (→ I/1.11.1) of y is

Var(y) = \frac{a}{b^2} . (4)
Proof:
1) According to the definition of the normal-gamma distribution (→ II/4.3.1), the distribution of x given y is a multivariate normal distribution (→ II/4.1.1):

x | y \sim N\left( \mu, (y\Lambda)^{-1} \right) ,

such that Cov(x|y) = (y\Lambda)^{-1} = \frac{1}{y} \Lambda^{-1} , which proves (2).

3) Likewise, the marginal distribution (→ I/1.5.3) of y is a gamma distribution:

y \sim \mathrm{Gam}(a, b) . (11)
The variance of the gamma distribution (→ II/3.4.11) is

x \sim \mathrm{Gam}(a, b) \quad \Rightarrow \quad Var(x) = \frac{a}{b^2} , (12)

such that we have:

Var(y) = \frac{a}{b^2} . (13)
■
Theorem: Let x and y follow a normal-gamma distribution (→ II/4.3.1). Then, the differential entropy (→ I/2.2.1) of x and y in nats is

h(x, y) = \frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Lambda| + \frac{1}{2} n
+ a + \ln \Gamma(a) - \frac{n - 2 + 2a}{2} \, \psi(a) + \frac{n-2}{2} \ln b . (2)
Using the law of conditional probability (→ I/1.3.4), this can be evaluated as follows:
h(x, y) = -\int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln\left[ p(x|y) \, p(y) \right] dx \, dy

= -\int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln p(x|y) \, dx \, dy - \int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln p(y) \, dx \, dy

= -\int_0^\infty p(y) \left[ \int_{\mathbb{R}^n} p(x|y) \ln p(x|y) \, dx \right] dy - \int_0^\infty p(y) \ln p(y) \left[ \int_{\mathbb{R}^n} p(x|y) \, dx \right] dy (8)

= \langle h(x|y) \rangle_{p(y)} + h(y) .
In other words, the differential entropy of the normal-gamma distribution over x and y is equal to the
sum of a multivariate normal entropy regarding x conditional on y, expected over y, and a univariate
gamma entropy regarding y.
\langle h(x|y) \rangle_{p(y)} = \left\langle \frac{n}{2} \ln(2\pi) + \frac{1}{2} \ln\left| (y\Lambda)^{-1} \right| + \frac{1}{2} n \right\rangle_{p(y)}

= \left\langle \frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|y\Lambda| + \frac{1}{2} n \right\rangle_{p(y)}

= \left\langle \frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln\left( y^n |\Lambda| \right) + \frac{1}{2} n \right\rangle_{p(y)} (9)

= \frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Lambda| + \frac{1}{2} n - \frac{n}{2} \left\langle \ln y \right\rangle_{p(y)}
and using the relation (→ II/3.4.12) y ∼ Gam(a, b) ⇒ ⟨ln y⟩ = ψ(a) − ln(b), we have
\langle h(x|y) \rangle_{p(y)} = \frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Lambda| + \frac{1}{2} n - \frac{n}{2} \psi(a) + \frac{n}{2} \ln b . (10)
By plugging (10) and (5) into (8), one arrives at the differential entropy given by (2).
Theorem: Let P : x, y ∼ NG(µ1 , Λ1 , a1 , b1 ) and Q : x, y ∼ NG(µ2 , Λ2 , a2 , b2 ) be two normal-gamma distributions (→ II/4.3.1). Then, the Kullback-Leibler divergence of P from Q is

KL[P \,||\, Q] = \frac{1}{2} \frac{a_1}{b_1} (\mu_2 - \mu_1)^T \Lambda_2 (\mu_2 - \mu_1) + \frac{1}{2} \mathrm{tr}(\Lambda_2 \Lambda_1^{-1}) - \frac{1}{2} \ln \frac{|\Lambda_2|}{|\Lambda_1|} - \frac{n}{2}
+ a_2 \ln \frac{b_1}{b_2} - \ln \frac{\Gamma(a_1)}{\Gamma(a_2)} + (a_1 - a_2) \, \psi(a_1) - (b_1 - b_2) \, \frac{a_1}{b_1} . (2)
and the Kullback-Leibler divergence for the gamma distribution is

KL[P \,||\, Q] = a_2 \ln \frac{b_1}{b_2} - \ln \frac{\Gamma(a_1)}{\Gamma(a_2)} + (a_1 - a_2) \, \psi(a_1) - (b_1 - b_2) \, \frac{a_1}{b_1} , (5)

where Γ(x) is the gamma function and ψ(x) is the digamma function.
KL[P \,||\, Q] = \int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln \frac{p(x|y) \, p(y)}{q(x|y) \, q(y)} \, dx \, dy

= \int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln \frac{p(x|y)}{q(x|y)} \, dx \, dy + \int_0^\infty \int_{\mathbb{R}^n} p(x|y) \, p(y) \ln \frac{p(y)}{q(y)} \, dx \, dy

= \int_0^\infty p(y) \left[ \int_{\mathbb{R}^n} p(x|y) \ln \frac{p(x|y)}{q(x|y)} \, dx \right] dy + \int_0^\infty p(y) \ln \frac{p(y)}{q(y)} \left[ \int_{\mathbb{R}^n} p(x|y) \, dx \right] dy (8)

= \langle KL[p(x|y) \,||\, q(x|y)] \rangle_{p(y)} + KL[p(y) \,||\, q(y)] .
In other words, the KL divergence between two normal-gamma distributions over x and y is equal
to the sum of a multivariate normal KL divergence regarding x conditional on y, expected over y,
and a univariate gamma KL divergence regarding y.
\langle KL[p(x|y) \,||\, q(x|y)] \rangle_{p(y)}
= \frac{1}{2} \left\langle (\mu_2 - \mu_1)^T (y\Lambda_2) (\mu_2 - \mu_1) + \mathrm{tr}\left( (y\Lambda_2)(y\Lambda_1)^{-1} \right) - \ln \frac{|(y\Lambda_1)^{-1}|}{|(y\Lambda_2)^{-1}|} - n \right\rangle_{p(y)} (9)
= \left\langle \frac{y}{2} (\mu_2 - \mu_1)^T \Lambda_2 (\mu_2 - \mu_1) + \frac{1}{2} \mathrm{tr}(\Lambda_2 \Lambda_1^{-1}) - \frac{1}{2} \ln \frac{|\Lambda_2|}{|\Lambda_1|} - \frac{n}{2} \right\rangle_{p(y)}

and using the relation (→ II/3.4.10) y \sim \mathrm{Gam}(a, b) \Rightarrow \langle y \rangle = a/b , we have

\langle KL[p(x|y) \,||\, q(x|y)] \rangle_{p(y)} = \frac{1}{2} \frac{a_1}{b_1} (\mu_2 - \mu_1)^T \Lambda_2 (\mu_2 - \mu_1) + \frac{1}{2} \mathrm{tr}(\Lambda_2 \Lambda_1^{-1}) - \frac{1}{2} \ln \frac{|\Lambda_2|}{|\Lambda_1|} - \frac{n}{2} . (10)
By plugging (10) and (5) into (8), one arrives at the KL divergence given by (2).
■
Sources:
• Soch J, Allefeld A (2016): “Kullback-Leibler Divergence for the Normal-Gamma Distribution”;
in: arXiv math.ST, 1611.01437; URL: https://arxiv.org/abs/1611.01437.
x, y ∼ NG(µ, Λ, a, b) . (1)
Then, the marginal distribution (→ I/1.5.3) of y is a gamma distribution (→ II/3.4.1)
y ∼ Gam(a, b) (2)
and the marginal distribution (→ I/1.5.3) of x is a multivariate t-distribution (→ II/4.2.1)
x \sim t\left( \mu, \left( \frac{a}{b} \Lambda \right)^{-1}, 2a \right) . (3)
b
Proof: The probability density function of the normal-gamma distribution (→ II/4.3.3) is given by

p(x, y) = N\left( x; \mu, (y\Lambda)^{-1} \right) \cdot \mathrm{Gam}(y; a, b) . (4)

Using the law of marginal probability (→ I/1.3.3), the marginal distribution of y can be derived as

p(y) = \int p(x, y) \, dx
= \int N\left( x; \mu, (y\Lambda)^{-1} \right) \mathrm{Gam}(y; a, b) \, dx
= \mathrm{Gam}(y; a, b) \int N\left( x; \mu, (y\Lambda)^{-1} \right) dx (5)
= \mathrm{Gam}(y; a, b) ,
which is the probability density function of the gamma distribution (→ II/3.4.7) with shape param-
eter a and rate parameter b.
Using the law of marginal probability (→ I/1.3.3), the marginal distribution of x can be derived as
p(x) = \int p(x, y) \, dy

= \int N\left( x; \mu, (y\Lambda)^{-1} \right) \mathrm{Gam}(y; a, b) \, dy

= \int \sqrt{\frac{|y\Lambda|}{(2\pi)^n}} \exp\left[ -\frac{1}{2} (x-\mu)^T (y\Lambda) (x-\mu) \right] \cdot \frac{b^a}{\Gamma(a)} \, y^{a-1} \exp[-by] \, dy

= \int \sqrt{\frac{y^n |\Lambda|}{(2\pi)^n}} \exp\left[ -\frac{1}{2} (x-\mu)^T (y\Lambda) (x-\mu) \right] \cdot \frac{b^a}{\Gamma(a)} \, y^{a-1} \exp[-by] \, dy

= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \int y^{a + \frac{n}{2} - 1} \cdot \exp\left[ -\left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right) y \right] dy

= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \cdot \frac{\Gamma\!\left( a + \frac{n}{2} \right)}{\left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right)^{a + \frac{n}{2}}} \int \mathrm{Gam}\left( y; \, a + \frac{n}{2}, \; b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right) dy

= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{b^a}{\Gamma(a)} \cdot \frac{\Gamma\!\left( a + \frac{n}{2} \right)}{\left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right)^{a + \frac{n}{2}}}

= \frac{\sqrt{|\Lambda|}}{(2\pi)^{n/2}} \cdot \frac{\Gamma\!\left( \frac{2a+n}{2} \right)}{\Gamma\!\left( \frac{2a}{2} \right)} \cdot b^a \cdot \left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right)^{-\left( a + \frac{n}{2} \right)}

= \frac{\sqrt{|\Lambda|}}{\pi^{n/2}} \cdot \frac{\Gamma\!\left( \frac{2a+n}{2} \right)}{\Gamma\!\left( \frac{2a}{2} \right)} \cdot \left( 1 + \frac{1}{2b} (x-\mu)^T \Lambda (x-\mu) \right)^{-a} \cdot \left( 2b + (x-\mu)^T \Lambda (x-\mu) \right)^{-\frac{n}{2}}

= \sqrt{\frac{\left| \frac{a}{b} \Lambda \right|}{(2a\pi)^n}} \cdot \frac{\Gamma\!\left( \frac{2a+n}{2} \right)}{\Gamma\!\left( \frac{2a}{2} \right)} \cdot \left( 1 + \frac{1}{2a} (x-\mu)^T \left( \frac{a}{b} \Lambda \right) (x-\mu) \right)^{-\frac{2a+n}{2}} , (6)

which is the probability density function of a multivariate t-distribution (→ II/4.2.2) with mean vector \mu , shape matrix \left( \frac{a}{b} \Lambda \right)^{-1} and 2a degrees of freedom.
x, y ∼ NG(µ, Λ, a, b) . (1)
Then,
1) the conditional distribution (→ I/1.5.4) of x given y is a multivariate normal distribution (→ II/4.1.1)

x | y \sim N\left( \mu, (y\Lambda)^{-1} \right) ; (2)

2) the conditional distribution (→ I/1.5.4) of a subset vector x1 , given the complement vector x2 and y, is also a multivariate normal distribution,

where µ1 , µ2 and Σ11 , Σ12 , Σ22 , Σ21 are block-wise components (→ II/4.1.14) of µ and Σ(y) = (yΛ)−1 ;
3) the conditional distribution (→ I/1.5.4) of y given x is a gamma distribution (→ II/3.4.1)
y | x \sim \mathrm{Gam}\left( a + \frac{n}{2}, \; b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right) , (5)
where n is the dimensionality of x.
Proof:
1) This follows from the definition of the normal-gamma distribution (→ II/4.3.1):
2) This follows from (2) and the conditional distributions of the multivariate normal distribution (→
II/4.1.14):
x \sim N(\mu, \Sigma)
\Rightarrow \quad x_1 | x_2 \sim N(\mu_{1|2}, \Sigma_{1|2})
\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2) (7)
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} .
3) The conditional distribution of y given x follows from Bayes' theorem:

p(y|x) = \frac{p(x|y) \cdot p(y)}{p(x)} . (8)
The marginal density of x is the multivariate t-density derived above (→ II/4.3.8):

p(x) = t\left( x; \mu, \left( \frac{a}{b} \Lambda \right)^{-1}, 2a \right)

= \sqrt{\frac{\left| \frac{a}{b} \Lambda \right|}{(2a\pi)^n}} \cdot \frac{\Gamma\!\left( \frac{2a+n}{2} \right)}{\Gamma\!\left( \frac{2a}{2} \right)} \cdot \left( 1 + \frac{1}{2a} (x-\mu)^T \left( \frac{a}{b} \Lambda \right) (x-\mu) \right)^{-\frac{2a+n}{2}} (11)

= \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{\Gamma\!\left( a + \frac{n}{2} \right)}{\Gamma(a)} \cdot b^a \cdot \left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right)^{-\left( a + \frac{n}{2} \right)} .
Thus, the conditional density of y given x becomes

p(y|x) = \frac{ \sqrt{\frac{|y\Lambda|}{(2\pi)^n}} \exp\left[ -\frac{1}{2} (x-\mu)^T (y\Lambda) (x-\mu) \right] \cdot \frac{b^a}{\Gamma(a)} \, y^{a-1} \exp[-by] }{ \sqrt{\frac{|\Lambda|}{(2\pi)^n}} \cdot \frac{\Gamma\left( a + \frac{n}{2} \right)}{\Gamma(a)} \cdot b^a \cdot \left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right)^{-\left( a + \frac{n}{2} \right)} }

= y^{\frac{n}{2}} \cdot \exp\left[ -\frac{1}{2} (x-\mu)^T (y\Lambda) (x-\mu) \right] \cdot y^{a-1} \cdot \exp[-by] \cdot \frac{1}{\Gamma\left( a + \frac{n}{2} \right)} \cdot \left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right)^{a + \frac{n}{2}}

= \frac{\left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right)^{a + \frac{n}{2}}}{\Gamma\left( a + \frac{n}{2} \right)} \cdot y^{a + \frac{n}{2} - 1} \cdot \exp\left[ -\left( b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right) y \right] , (12)
which is the probability density function of a gamma distribution (→ II/3.4.7) with shape and rate
parameters
a + \frac{n}{2} \quad \text{and} \quad b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) , (13)

such that

p(y|x) = \mathrm{Gam}\left( y; \, a + \frac{n}{2}, \; b + \frac{1}{2} (x-\mu)^T \Lambda (x-\mu) \right) . (14)
■
Proof: If all entries of Z1 are independent and standard normally distributed (→ II/3.2.3)
i.i.d.
z1i ∼ N (0, 1) for all i = 1, . . . , n , (2)
this implies a multivariate normal distribution with diagonal covariance matrix (→ II/4.1.15):
Z1 ∼ N (0n , In ) (3)
where 0n is an n × 1 matrix of zeros and In is the n × n identity matrix.
If the distribution of Z2 is a standard gamma distribution (→ II/3.4.3)
Z2 ∼ Gam(a, 1) , (4)
then due to the relationship between gamma and standard gamma distribution distribution (→
II/3.4.4), we have:
Y = \frac{Z_2}{b} \sim \mathrm{Gam}(a, b) . (5)
Moreover, using the linear transformation theorem for the multivariate normal distribution (→
II/4.1.12), it follows that:
Z_1 \sim N(0_n, I_n)

\Rightarrow \quad X = \mu + \frac{1}{\sqrt{Z_2/b}} A Z_1 \sim N\left( \mu + \frac{1}{\sqrt{Z_2/b}} A \, 0_n, \; \left( \frac{1}{\sqrt{Z_2/b}} A \right) I_n \left( \frac{1}{\sqrt{Z_2/b}} A \right)^T \right)

\Rightarrow \quad X \sim N\left( \mu + 0_n, \; \left( \frac{1}{\sqrt{Y}} \right)^2 A A^T \right) (6)

\Rightarrow \quad X \sim N\left( \mu, (Y\Lambda)^{-1} \right) ,

where the last step uses A A^T = \Lambda^{-1} and Y = Z_2 / b .
This means that, by definition (→ II/4.3.1), X and Y jointly follow a normal-gamma distribution
(→ II/4.3.1):
X, Y ∼ NG(µ, Λ, a, b) , (8)
Thus, given Z1 defined by (2) and Z2 defined by (4), X and Y defined by (1) are a sample from
NG(µ, Λ, a, b).
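The sampling construction above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `sample_normal_gamma` and the parameter values are illustrative, not part of the proof. The Cholesky factor of Λ⁻¹ plays the role of the matrix A with AAᵀ = Λ⁻¹.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_normal_gamma(mu, Lam, a, b, size):
    """Draw (X, Y) ~ NG(mu, Lam, a, b) from Z1 ~ N(0, I) and Z2 ~ Gam(a, 1)."""
    n = len(mu)
    A = np.linalg.cholesky(np.linalg.inv(Lam))   # A A^T = Lam^{-1}
    Z1 = rng.standard_normal((size, n))
    Z2 = rng.gamma(shape=a, scale=1.0, size=size)
    Y = Z2 / b                                   # eq. (5): Y ~ Gam(a, b)
    X = mu + (Z1 @ A.T) / np.sqrt(Y)[:, None]    # X | Y ~ N(mu, (Y Lam)^{-1})
    return X, Y

mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.3], [0.3, 1.0]])
a, b = 3.0, 2.0
X, Y = sample_normal_gamma(mu, Lam, a, b, 200_000)
print(Y.mean())        # close to a/b = 1.5
print(X.mean(axis=0))  # close to mu
```

The empirical mean of Y approaches a/b (the mean of Gam(a, b)) and the empirical mean of X approaches µ, consistent with the joint normal-gamma distribution.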
Sources:
• Wikipedia (2022): “Normal-gamma distribution”; in: Wikipedia, the free encyclopedia, retrieved
on 2022-09-22; URL: https://en.wikipedia.org/wiki/Normal-gamma_distribution#Generating_
normal-gamma_random_variates.
X ∼ Dir(α) , (1)
if and only if its probability density function (→ I/1.7.1) is given by
\[ \mathrm{Dir}(x; \alpha) = \frac{\Gamma\!\left( \sum_{i=1}^k \alpha_i \right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1} \tag{2} \]
where $\alpha_i > 0$ for all $i = 1, \ldots, k$, and the density is zero, if $x_i \notin [0,1]$ for any $i = 1, \ldots, k$ or $\sum_{i=1}^k x_i \neq 1$.
Sources:
• Wikipedia (2020): “Dirichlet distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
05-10; URL: https://en.wikipedia.org/wiki/Dirichlet_distribution#Probability_density_function.
X ∼ Dir(α) . (1)
Then, the probability density function (→ I/1.7.1) of X is
\[ f_X(x) = \frac{\Gamma\!\left( \sum_{i=1}^k \alpha_i \right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1} \; . \tag{2} \]
Proof: This follows directly from the definition of the Dirichlet distribution (→ II/4.4.1).
■
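As a numerical illustration of the density in (2), the following sketch implements it directly; it assumes NumPy is available, and `dirichlet_pdf` is a hypothetical helper name, not from the text.

```python
import math
import numpy as np

def dirichlet_pdf(x, alpha):
    """Density of Dir(alpha) at a point x on the simplex, following eq. (2)."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    if np.any(x < 0) or np.any(x > 1) or not math.isclose(x.sum(), 1.0):
        return 0.0
    log_norm = math.lgamma(alpha.sum()) - sum(map(math.lgamma, alpha))
    return math.exp(log_norm + ((alpha - 1) * np.log(x)).sum())

alpha = np.array([2.0, 3.0, 4.0])
val = dirichlet_pdf([0.2, 0.3, 0.5], alpha)
print(val)  # 3360 * 0.2 * 0.09 * 0.125 = 7.56

# sanity check: the empirical mean of Dir(alpha) samples is alpha / sum(alpha)
rng = np.random.default_rng(1)
samples = rng.dirichlet(alpha, size=100_000)
print(samples.mean(axis=0))  # close to [2/9, 3/9, 4/9]
```

For α = (2, 3, 4), the normalizing constant is Γ(9)/(Γ(2)Γ(3)Γ(4)) = 3360, which gives the hand-checkable value 7.56 at (0.2, 0.3, 0.5).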
P : x ∼ Dir(α1 )
(1)
Q : x ∼ Dir(α2 ) .
\[ \mathrm{KL}[P \,||\, Q] = \ln \frac{\Gamma\!\left( \sum_{i=1}^k \alpha_{1i} \right)}{\Gamma\!\left( \sum_{i=1}^k \alpha_{2i} \right)} + \sum_{i=1}^k \ln \frac{\Gamma(\alpha_{2i})}{\Gamma(\alpha_{1i})} + \sum_{i=1}^k (\alpha_{1i} - \alpha_{2i}) \left[ \psi(\alpha_{1i}) - \psi\!\left( \sum_{i=1}^k \alpha_{1i} \right) \right] \; . \tag{2} \]
\[ \mathrm{KL}[P \,||\, Q] = \int_{\mathcal{X}^k} \mathrm{Dir}(x; \alpha_1) \ln \frac{\mathrm{Dir}(x; \alpha_1)}{\mathrm{Dir}(x; \alpha_2)} \, \mathrm{d}x = \left\langle \ln \frac{\mathrm{Dir}(x; \alpha_1)}{\mathrm{Dir}(x; \alpha_2)} \right\rangle_{p(x)} \tag{4} \]
where $\mathcal{X}^k$ is the set $\left\{ x \in \mathbb{R}^k \,\middle|\, \sum_{i=1}^k x_i = 1, \; 0 \le x_i \le 1, \; i = 1, \ldots, k \right\}$.
Using the probability density function of the Dirichlet distribution (→ II/4.4.2), this becomes:
\[ \mathrm{KL}[P \,||\, Q] = \left\langle \ln \frac{ \frac{\Gamma\left(\sum_{i=1}^k \alpha_{1i}\right)}{\prod_{i=1}^k \Gamma(\alpha_{1i})} \prod_{i=1}^k x_i^{\alpha_{1i}-1} }{ \frac{\Gamma\left(\sum_{i=1}^k \alpha_{2i}\right)}{\prod_{i=1}^k \Gamma(\alpha_{2i})} \prod_{i=1}^k x_i^{\alpha_{2i}-1} } \right\rangle_{p(x)} \]
\[ = \left\langle \ln \left[ \frac{\Gamma\!\left(\sum_{i=1}^k \alpha_{1i}\right)}{\Gamma\!\left(\sum_{i=1}^k \alpha_{2i}\right)} \cdot \prod_{i=1}^k \frac{\Gamma(\alpha_{2i})}{\Gamma(\alpha_{1i})} \cdot \prod_{i=1}^k x_i^{\alpha_{1i}-\alpha_{2i}} \right] \right\rangle_{p(x)} \]
\[ = \left\langle \ln \frac{\Gamma\!\left(\sum_{i=1}^k \alpha_{1i}\right)}{\Gamma\!\left(\sum_{i=1}^k \alpha_{2i}\right)} + \sum_{i=1}^k \ln \frac{\Gamma(\alpha_{2i})}{\Gamma(\alpha_{1i})} + \sum_{i=1}^k (\alpha_{1i}-\alpha_{2i}) \cdot \ln(x_i) \right\rangle_{p(x)} \]
\[ = \ln \frac{\Gamma\!\left(\sum_{i=1}^k \alpha_{1i}\right)}{\Gamma\!\left(\sum_{i=1}^k \alpha_{2i}\right)} + \sum_{i=1}^k \ln \frac{\Gamma(\alpha_{2i})}{\Gamma(\alpha_{1i})} + \sum_{i=1}^k (\alpha_{1i}-\alpha_{2i}) \cdot \langle \ln x_i \rangle_{p(x)} \; . \tag{5} \]
The expected value of a logarithmized Dirichlet variate is
\[ x \sim \mathrm{Dir}(\alpha) \quad\Rightarrow\quad \langle \ln x_i \rangle = \psi(\alpha_i) - \psi\!\left( \sum_{i=1}^k \alpha_i \right) \; , \tag{6} \]
such that plugging (6) into (5) yields the expression in (2).
Sources:
• Penny, William D. (2001): “KL-Divergences of Normal, Gamma, Dirichlet and Wishart densi-
ties”; in: University College, London, p. 2, eqs. 8-9; URL: https://www.fil.ion.ucl.ac.uk/~wpenny/
publications/densities.ps.
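The closed-form divergence can be cross-checked against a Monte Carlo estimate of the mean log-density ratio. This is a sketch assuming NumPy; to stay within the standard library, the digamma function ψ is approximated by a central difference of `math.lgamma`, which is an assumption of this example rather than part of the proof.

```python
import math
import numpy as np

def psi(x, h=1e-5):
    # digamma via central difference of log-gamma (adequate for a numeric check)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def kl_dirichlet(a1, a2):
    """KL[Dir(a1) || Dir(a2)] from the closed-form expression in eq. (2)."""
    s1 = sum(a1)
    return (math.lgamma(s1) - math.lgamma(sum(a2))
            + sum(math.lgamma(b) - math.lgamma(a) for a, b in zip(a1, a2))
            + sum((a - b) * (psi(a) - psi(s1)) for a, b in zip(a1, a2)))

def dirichlet_logpdf(x, a):
    a = np.asarray(a)
    return (math.lgamma(a.sum()) - sum(map(math.lgamma, a))
            + ((a - 1) * np.log(x)).sum(axis=1))

a1, a2 = [2.0, 3.0, 4.0], [1.0, 1.0, 1.0]
kl = kl_dirichlet(a1, a2)

# Monte Carlo cross-check: mean log-density ratio under Dir(a1)
rng = np.random.default_rng(2)
x = rng.dirichlet(a1, size=200_000)
kl_mc = (dirichlet_logpdf(x, a1) - dirichlet_logpdf(x, a2)).mean()
print(kl, kl_mc)  # should agree to about two decimals
```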
r ∼ Dir(α) . (1)
where Γ(x) is the gamma function and γ(s, x) is the lower incomplete gamma function.
Proof: In the context of the Dirichlet distribution (→ II/4.4.1), the exceedance probability (→
I/1.3.10) for a particular ri is defined as:
\[ \varphi_i = p\!\left( \forall j \in \{ 1, \ldots, k \,|\, j \neq i \}: \; r_i > r_j \,\middle|\, \alpha \right) = p\!\left( \bigwedge_{j \neq i} r_i > r_j \,\middle|\, \alpha \right) \; . \tag{4} \]
j̸=i
The probability density function of the Dirichlet distribution (→ II/4.4.2) is given by:
\[ \mathrm{Dir}(r; \alpha) = \frac{\Gamma\!\left( \sum_{i=1}^k \alpha_i \right)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k r_i^{\alpha_i - 1} \; . \tag{5} \]
\[ r_i \in [0, 1] \;\text{ for }\; i = 1, \ldots, k \quad\text{and}\quad \sum_{i=1}^k r_i = 1 \; , \tag{6} \]
\[ p(r) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)} \, r_1^{\alpha_1 - 1} r_2^{\alpha_2 - 1} \tag{7} \]
which is equivalent to the probability density function of the beta distribution (→ II/3.9.3)
\[ p(r_1) = \frac{r_1^{\alpha_1 - 1} (1 - r_1)^{\alpha_2 - 1}}{B(\alpha_1, \alpha_2)} \tag{8} \]
with the beta function given by
\[ B(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x + y)} \; . \tag{9} \]
With (6), the exceedance probability for this bivariate case simplifies to
\[ \varphi_1 = p(r_1 > r_2) = p(r_1 > 1 - r_1) = p(r_1 > 1/2) = \int_{1/2}^{1} p(r_1) \, \mathrm{d}r_1 \; . \tag{10} \]
Using the cumulative distribution function of the beta distribution (→ II/3.9.5), it evaluates to
\[ \varphi_1 = 1 - \int_0^{1/2} p(r_1) \, \mathrm{d}r_1 = 1 - \frac{B\!\left(\frac{1}{2}; \alpha_1, \alpha_2\right)}{B(\alpha_1, \alpha_2)} \tag{11} \]
with the incomplete beta function
\[ B(x; a, b) = \int_0^x t^{a-1} (1 - t)^{b-1} \, \mathrm{d}t \; . \tag{12} \]
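For the bivariate case, equation (11) can be evaluated numerically without special-function libraries. The sketch below, assuming only the Python standard library, approximates the incomplete beta integral by midpoint quadrature and cross-checks it against simulation via the gamma construction of the Dirichlet; the helper names are illustrative.

```python
import math
import random

def beta_pdf(x, a, b):
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
                    + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))

def exceedance_bivariate(a1, a2, grid=100_000):
    """phi_1 = p(r_1 > 1/2) for (r_1, r_2) ~ Dir(a_1, a_2), as in eq. (11),
    via midpoint quadrature of the Beta(a_1, a_2) density on [0, 1/2]."""
    h = 0.5 / grid
    cdf_half = sum(beta_pdf((i + 0.5) * h, a1, a2) for i in range(grid)) * h
    return 1.0 - cdf_half

a1, a2 = 3.0, 2.0
phi = exceedance_bivariate(a1, a2)

# Monte Carlo cross-check via the gamma construction of the Dirichlet
random.seed(0)
N = 100_000
hits = 0
for _ in range(N):
    g1 = random.gammavariate(a1, 1.0)
    g2 = random.gammavariate(a2, 1.0)
    hits += g1 / (g1 + g2) > 0.5
print(phi, hits / N)  # exact value is 0.6875 for these parameters
```

For α = (3, 2), the integral can be done by hand: ∫₁/₂¹ 12 x²(1 − x) dx = 0.6875, which both estimates reproduce.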
\[ Y_1 \sim \mathrm{Gam}(\alpha_1, \beta), \; \ldots, \; Y_k \sim \mathrm{Gam}(\alpha_k, \beta), \quad Y_s = \sum_{j=1}^k Y_j \]
\[ \Rightarrow\; X = (X_1, \ldots, X_k) = \left( \frac{Y_1}{Y_s}, \ldots, \frac{Y_k}{Y_s} \right) \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k) \; . \tag{14} \]
\[ \mathrm{Gam}(x; a, b) = \frac{b^a}{\Gamma(a)} x^{a-1} \exp[-bx] \quad\text{for}\quad x > 0 \; . \tag{15} \]
Consider the gamma random variables (→ II/3.4.1)
\[ q_1 \sim \mathrm{Gam}(\alpha_1, 1), \; \ldots, \; q_k \sim \mathrm{Gam}(\alpha_k, 1), \quad q_s = \sum_{j=1}^k q_j \; . \tag{16} \]
In order to obtain the exceedance probability φi , the dependency on qi in this probability still has
to be removed. From equations (4) and (18), it follows that
In other words, the exceedance probability (→ I/1.3.10) for one element from a Dirichlet-distributed
(→ II/4.4.1) random vector (→ I/1.2.3) is an integral from zero to infinity where the first term in
the integrand conforms to a product of gamma (→ II/3.4.1) cumulative distribution functions (→
I/1.8.1) and the second term is a gamma (→ II/3.4.1) probability density function (→ I/1.7.1).
Sources:
• Soch J, Allefeld C (2016): “Exceedance Probabilities for the Dirichlet Distribution”; in: arXiv
stat.AP, 1611.01439; URL: https://arxiv.org/abs/1611.01439.
5. MATRIX-VARIATE CONTINUOUS DISTRIBUTIONS 317
X ∼ MN (M, U, V ) , (1)
if and only if its probability density function (→ I/1.7.1) is given by
\[ \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( V^{-1} (X - M)^\mathrm{T} U^{-1} (X - M) \right) \right] \tag{2} \]
where M is an n × p real matrix, U is an n × n positive definite matrix and V is a p × p positive
definite matrix.
Sources:
• Wikipedia (2020): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-01-27; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Definition.
X ∼ MN (M, U, V ) , (1)
if and only if vec(X) is multivariate normally distributed (→ II/4.1.1)
Proof: The probability density function of the matrix-normal distribution (→ II/5.1.3) with n × p
mean M , n × n covariance across rows U and p × p covariance across columns V is
\[ \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( V^{-1} (X-M)^\mathrm{T} U^{-1} (X-M) \right) \right] \; . \tag{3} \]
Using the trace property tr(ABC) = tr(BCA), we have:
\[ \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( (X-M)^\mathrm{T} U^{-1} (X-M) V^{-1} \right) \right] \; . \tag{4} \]
Using the trace-vectorization relation tr(AᵀB) = vec(A)ᵀ vec(B), we have:
\[ \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{vec}(X-M)^\mathrm{T} \, \mathrm{vec}\!\left( U^{-1} (X-M) V^{-1} \right) \right] \; . \tag{5} \]
Using the vectorization-Kronecker relation vec(ABC) = (Cᵀ ⊗ A) vec(B), we have:
\[ \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{vec}(X-M)^\mathrm{T} \left( V^{-1} \otimes U^{-1} \right) \mathrm{vec}(X-M) \right] \; . \tag{6} \]
Using the Kronecker product inverse relation A⁻¹ ⊗ B⁻¹ = (A ⊗ B)⁻¹, we have:
\[ \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{vec}(X-M)^\mathrm{T} \left( V \otimes U \right)^{-1} \mathrm{vec}(X-M) \right] \; . \tag{7} \]
Using the linearity of the vectorization operator, vec(X − M) = vec(X) − vec(M), we have:
\[ \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \left[ \mathrm{vec}(X) - \mathrm{vec}(M) \right]^\mathrm{T} \left( V \otimes U \right)^{-1} \left[ \mathrm{vec}(X) - \mathrm{vec}(M) \right] \right] \; . \tag{8} \]
Using the Kronecker-determinant relation |A ⊗ B| = |A|ᵐ |B|ⁿ for an n × n matrix A and an m × m matrix B, we have:
\[ \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V \otimes U|}} \cdot \exp\!\left[ -\frac{1}{2} \left[ \mathrm{vec}(X) - \mathrm{vec}(M) \right]^\mathrm{T} \left( V \otimes U \right)^{-1} \left[ \mathrm{vec}(X) - \mathrm{vec}(M) \right] \right] \; . \tag{9} \]
This is the probability density function of the multivariate normal distribution (→ II/4.1.4) with the np × 1 mean vector vec(M) and the np × np covariance matrix V ⊗ U.
Sources:
• Wikipedia (2020): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-01-20; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Proof.
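The equivalence can be verified numerically: the matrix-normal log-density of X must coincide with the multivariate-normal log-density of vec(X) with mean vec(M) and covariance V ⊗ U. A sketch assuming NumPy, with illustrative helper names; note that vec stacks columns, hence the column-major ("F") flattening.

```python
import numpy as np

def matrix_normal_logpdf(X, M, U, V):
    n, p = X.shape
    D = X - M
    quad = np.trace(np.linalg.inv(V) @ D.T @ np.linalg.inv(U) @ D)
    return -0.5 * (n * p * np.log(2 * np.pi)
                   + n * np.log(np.linalg.det(V))
                   + p * np.log(np.linalg.det(U)) + quad)

def mvn_logpdf(x, m, S):
    d = len(x)
    r = x - m
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(S))
                   + r @ np.linalg.solve(S, r))

rng = np.random.default_rng(3)
n, p = 3, 2
M = rng.standard_normal((n, p))
A = rng.standard_normal((n, n)); U = A @ A.T + n * np.eye(n)
B = rng.standard_normal((p, p)); V = B @ B.T + p * np.eye(p)
X = rng.standard_normal((n, p))

lhs = matrix_normal_logpdf(X, M, U, V)
# vec(X) stacks columns, so use column-major ("F") flattening
rhs = mvn_logpdf(X.flatten("F"), M.flatten("F"), np.kron(V, U))
print(lhs, rhs)  # equal up to floating-point error
```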
X ∼ MN (M, U, V ) . (1)
Then, the probability density function (→ I/1.7.1) of X is
1 1 −1 T −1
f (X) = p · exp − tr V (X − M ) U (X − M ) . (2)
(2π)np |V |n |U |p 2
Proof: This follows directly from the definition of the matrix-normal distribution (→ II/5.1.1).
■
5.1.4 Mean
Theorem: Let X be a random matrix (→ I/1.2.4) following a matrix-normal distribution (→
II/5.1.1):
X ∼ MN (M, U, V ) . (1)
Then, the mean or expected value (→ I/1.10.1) of X is
E(X) = M . (2)
Proof: When X follows a matrix-normal distribution (→ II/5.1.1), its vectorized version follows a
multivariate normal distribution (→ II/5.1.2)
E[X] = M . (5)
Sources:
• Wikipedia (2022): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-09-15; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Expected_values.
5.1.5 Covariance
Theorem: Let X be an n × p random matrix (→ I/1.2.4) following a matrix-normal distribution
(→ II/5.1.1):
X ∼ MN (M, U, V ) . (1)
Then,
1) the covariance matrix (→ I/1.13.9) of each row of X is a scalar multiple of V
\[ \mathrm{Cov}(x_{i,\bullet}^\mathrm{T}) \propto V \quad\text{for all}\quad i = 1, \ldots, n \; ; \tag{2} \]
2) the covariance matrix (→ I/1.13.9) of each column of X is a scalar multiple of U
Proof:
1) The marginal distribution (→ I/1.5.3) of a given row of X is a multivariate normal distribution
(→ II/5.1.10)
\[ \mathrm{Cov}(x_{i,\bullet}^\mathrm{T}) = u_{ii} V \propto V \; . \tag{5} \]
2) The marginal distribution (→ I/1.5.3) of a given column of X is a multivariate normal distribution
(→ II/5.1.10)
Sources:
• Wikipedia (2022): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-09-15; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Expected_values.
X ∼ MN (M, U, V ) . (1)
Then, the differential entropy (→ I/2.2.1) of X in nats is
\[ \mathrm{h}(X) = \frac{np}{2} \ln(2\pi) + \frac{n}{2} \ln |V| + \frac{p}{2} \ln |U| + \frac{np}{2} \; . \tag{2} \]
\[ \mathrm{h}(X) = \frac{np}{2} \ln(2\pi) + \frac{1}{2} \ln \left( |V|^n |U|^p \right) + \frac{1}{2} np = \frac{np}{2} \ln(2\pi) + \frac{n}{2} \ln |V| + \frac{p}{2} \ln |U| + \frac{np}{2} \; . \tag{7} \]
■
■
P : X ∼ MN (M1 , U1 , V1 )
(1)
Q : X ∼ MN (M2 , U2 , V2 ) .
\[ \mathrm{KL}[P \,||\, Q] = \frac{1}{2} \left[ \mathrm{vec}(M_2 - M_1)^\mathrm{T} \, \mathrm{vec}\!\left( U_2^{-1} (M_2 - M_1) V_2^{-1} \right) + \mathrm{tr}\!\left( (V_2^{-1} V_1) \otimes (U_2^{-1} U_1) \right) - n \ln \frac{|V_1|}{|V_2|} - p \ln \frac{|U_1|}{|U_2|} - np \right] \; . \tag{2} \]
\[ \mathrm{KL}[P \,||\, Q] = \frac{1}{2} \left[ (\mathrm{vec}(M_2) - \mathrm{vec}(M_1))^\mathrm{T} (V_2 \otimes U_2)^{-1} (\mathrm{vec}(M_2) - \mathrm{vec}(M_1)) + \mathrm{tr}\!\left( (V_2 \otimes U_2)^{-1} (V_1 \otimes U_1) \right) - \ln \frac{|V_1 \otimes U_1|}{|V_2 \otimes U_2|} - np \right] \; . \tag{5} \]
Using the vectorization operator and Kronecker product properties
\[ \mathrm{KL}[P \,||\, Q] = \frac{1}{2} \left[ \mathrm{vec}(M_2 - M_1)^\mathrm{T} (V_2^{-1} \otimes U_2^{-1}) \, \mathrm{vec}(M_2 - M_1) + \mathrm{tr}\!\left( (V_2^{-1} V_1) \otimes (U_2^{-1} U_1) \right) - n \ln \frac{|V_1|}{|V_2|} - p \ln \frac{|U_1|}{|U_2|} - np \right] \; . \tag{10} \]
Using the relationship between Kronecker product and vectorization operator
\[ \mathrm{KL}[P \,||\, Q] = \frac{1}{2} \left[ \mathrm{vec}(M_2 - M_1)^\mathrm{T} \, \mathrm{vec}\!\left( U_2^{-1} (M_2 - M_1) V_2^{-1} \right) + \mathrm{tr}\!\left( (V_2^{-1} V_1) \otimes (U_2^{-1} U_1) \right) - n \ln \frac{|V_1|}{|V_2|} - p \ln \frac{|U_1|}{|U_2|} - np \right] \; . \tag{12} \]
■
5.1.8 Transposition
Theorem: Let X be a random matrix (→ I/1.2.4) following a matrix-normal distribution (→
II/5.1.1):
X ∼ MN (M, U, V ) . (1)
Then, the transpose of X also has a matrix-normal distribution:
X T ∼ MN (M T , V, U ) . (2)
Proof: The probability density function of the matrix-normal distribution (→ II/5.1.3) is:
\[ f(X) = \mathcal{MN}(X; M, U, V) = \sqrt{\frac{1}{(2\pi)^{np} |V|^n |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( V^{-1} (X - M)^\mathrm{T} U^{-1} (X - M) \right) \right] \; . \tag{3} \]
X ∼ MN (M, U, V ) . (1)
Then, a linear transformation of X is also matrix-normally distributed
\[ \mathrm{vec}(Y) = \mathrm{vec}(AXB + C) = \mathrm{vec}(AXB) + \mathrm{vec}(C) = (B^\mathrm{T} \otimes A) \, \mathrm{vec}(X) + \mathrm{vec}(C) \; . \tag{5} \]
\[ Y \sim \mathcal{MN}(AMB + C, \; AUA^\mathrm{T}, \; B^\mathrm{T} V B) \; . \tag{7} \]
■
X ∼ MN (M, U, V ) . (1)
Then,
1) the marginal distribution (→ I/1.5.3) of any subset matrix XI,J , obtained by dropping some rows
and/or columns from X, is also a matrix-normal distribution (→ II/5.1.1)
Proof:
1) Define a selector matrix A, such that aij = 1, if the i-th row in the subset matrix should be the j-th row from the original matrix (and aij = 0 otherwise)
\[ A \in \mathbb{R}^{|I| \times n}, \quad \text{s.t.} \quad a_{ij} = \begin{cases} 1, & \text{if } I_i = j \\ 0, & \text{otherwise} \end{cases} \tag{6} \]
and define a selector matrix B, such that bij = 1, if the j-th column in the subset matrix should be the i-th column from the original matrix (and bij = 0 otherwise)
\[ B \in \mathbb{R}^{p \times |J|}, \quad \text{s.t.} \quad b_{ij} = \begin{cases} 1, & \text{if } J_j = i \\ 0, & \text{otherwise} \; . \end{cases} \tag{7} \]
\[ A = e_i^\mathrm{T}, \quad B = I_p \; , \tag{10} \]
the i-th row of X can be expressed as
\[ A = I_n, \quad B = e_j \; , \tag{14} \]
the j-th column of X can be expressed as
\[ A = e_i^\mathrm{T}, \quad B = e_j \; , \tag{17} \]
the (i, j)-th entry of X can be expressed as
As xij is a scalar, this is equivalent to a univariate normal distribution (→ II/3.2.1) as a special case
(→ II/3.2.2) of the matrix-normal distribution (→ II/4.1.2):
Proof: If all entries of X are independent and standard normally distributed (→ II/3.2.3)
i.i.d.
xij ∼ N (0, 1) for all i = 1, . . . , n and j = 1, . . . , p , (2)
this implies a multivariate normal distribution with diagonal covariance matrix (→ II/4.1.15):
X ∼ MN (0np , In , Ip ) . (4)
Thus, with the linear transformation theorem for the matrix-normal distribution (→ II/5.1.9), it
follows that
\[ Y = M + AXB \sim \mathcal{MN}\!\left( M + A \, 0_{np} \, B, \; A I_n A^\mathrm{T}, \; B^\mathrm{T} I_p B \right) = \mathcal{MN}\!\left( M, \; A A^\mathrm{T}, \; B^\mathrm{T} B \right) = \mathcal{MN}(M, U, V) \; . \tag{5} \]
Sources:
• Wikipedia (2021): “Matrix normal distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-12-07; URL: https://en.wikipedia.org/wiki/Matrix_normal_distribution#Drawing_values_
from_the_distribution.
X ∼ MN (0, In , V ) . (1)
Define the scatter matrix S as the product of the transpose of X with itself:
\[ S = X^\mathrm{T} X = \sum_{i=1}^n x_i^\mathrm{T} x_i \; . \tag{2} \]
Then, the matrix S is said to follow a Wishart distribution with scale matrix V and degrees of
freedom n
S ∼ W(V, n) (3)
where n > p − 1 and V is a positive definite symmetric covariance matrix.
Sources:
• Wikipedia (2020): “Wishart distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Wishart_distribution#Definition.
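The definition can be illustrated by simulation, assuming NumPy: building S = XᵀX from rows drawn as N(0, V) and averaging many draws should recover the Wishart mean nV. The parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 2, 10
V = np.array([[2.0, 0.5],
              [0.5, 1.0]])
L = np.linalg.cholesky(V)

def wishart_draw():
    X = rng.standard_normal((n, p)) @ L.T   # rows of X are i.i.d. N(0, V)
    return X.T @ X                          # S = X^T X ~ W(V, n)

S_mean = sum(wishart_draw() for _ in range(50_000)) / 50_000
print(S_mean)  # close to n * V = [[20, 5], [5, 10]]
```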
P : S ∼ W(V1 , n1 )
(1)
Q : S ∼ W(V2 , n2 ) .
\[ \mathrm{KL}[P \,||\, Q] = \frac{1}{2} \left[ n_2 (\ln |V_2| - \ln |V_1|) + n_1 \, \mathrm{tr}(V_2^{-1} V_1) + 2 \ln \frac{\Gamma_p\!\left(\frac{n_2}{2}\right)}{\Gamma_p\!\left(\frac{n_1}{2}\right)} + (n_1 - n_2) \, \psi_p\!\left(\frac{n_1}{2}\right) - n_1 p \right] \tag{2} \]
where the multivariate gamma function $\Gamma_p$ is given by
\[ \Gamma_p(x) = \pi^{p(p-1)/4} \prod_{j=1}^{p} \Gamma\!\left( x - \frac{j-1}{2} \right) \tag{3} \]
and $\psi_p$ denotes the multivariate digamma function.
\[ \mathrm{KL}[P \,||\, Q] = \int_{\mathcal{X}} p(x) \ln \frac{p(x)}{q(x)} \, \mathrm{d}x \tag{5} \]
which, applied to the Wishart distributions (→ II/5.2.1) in (1), yields
\[ \mathrm{KL}[P \,||\, Q] = \int_{\mathbb{S}^p} \mathcal{W}(S; V_1, n_1) \ln \frac{\mathcal{W}(S; V_1, n_1)}{\mathcal{W}(S; V_2, n_2)} \, \mathrm{d}S = \left\langle \ln \frac{\mathcal{W}(S; V_1, n_1)}{\mathcal{W}(S; V_2, n_2)} \right\rangle_{p(S)} \; . \tag{6} \]
\[ \mathrm{KL}[P \,||\, Q] = \left\langle \ln \frac{ \sqrt{\frac{1}{2^{n_1 p} |V_1|^{n_1}}} \cdot \frac{1}{\Gamma_p\left(\frac{n_1}{2}\right)} \cdot |S|^{(n_1-p-1)/2} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( V_1^{-1} S \right) \right] }{ \sqrt{\frac{1}{2^{n_2 p} |V_2|^{n_2}}} \cdot \frac{1}{\Gamma_p\left(\frac{n_2}{2}\right)} \cdot |S|^{(n_2-p-1)/2} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( V_2^{-1} S \right) \right] } \right\rangle_{p(S)} \]
\[ = \left\langle \ln \left( \sqrt{2^{(n_2-n_1)p} \cdot \frac{|V_2|^{n_2}}{|V_1|^{n_1}}} \cdot \frac{\Gamma_p\!\left(\frac{n_2}{2}\right)}{\Gamma_p\!\left(\frac{n_1}{2}\right)} \cdot |S|^{(n_1-n_2)/2} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( V_1^{-1} S \right) + \frac{1}{2} \mathrm{tr}\!\left( V_2^{-1} S \right) \right] \right) \right\rangle_{p(S)} \]
\[ = \left\langle \frac{(n_2-n_1)p}{2} \ln 2 + \frac{n_2}{2} \ln |V_2| - \frac{n_1}{2} \ln |V_1| + \ln \frac{\Gamma_p\!\left(\frac{n_2}{2}\right)}{\Gamma_p\!\left(\frac{n_1}{2}\right)} + \frac{n_1-n_2}{2} \ln |S| - \frac{1}{2} \mathrm{tr}\!\left( V_1^{-1} S \right) + \frac{1}{2} \mathrm{tr}\!\left( V_2^{-1} S \right) \right\rangle_{p(S)} \]
\[ = \frac{(n_2-n_1)p}{2} \ln 2 + \frac{n_2}{2} \ln |V_2| - \frac{n_1}{2} \ln |V_1| + \ln \frac{\Gamma_p\!\left(\frac{n_2}{2}\right)}{\Gamma_p\!\left(\frac{n_1}{2}\right)} + \frac{n_1-n_2}{2} \langle \ln |S| \rangle_{p(S)} - \frac{1}{2} \left\langle \mathrm{tr}\!\left( V_1^{-1} S \right) \right\rangle_{p(S)} + \frac{1}{2} \left\langle \mathrm{tr}\!\left( V_2^{-1} S \right) \right\rangle_{p(S)} \; . \tag{7} \]
Using the expected values $\langle \ln |S| \rangle = p \ln 2 + \ln |V_1| + \psi_p\!\left(\frac{n_1}{2}\right)$ and $\langle S \rangle = n_1 V_1$ for $S \sim \mathcal{W}(V_1, n_1)$, this becomes:
\[ \mathrm{KL}[P \,||\, Q] = \frac{(n_2-n_1)p}{2} \ln 2 + \frac{n_2}{2} \ln |V_2| - \frac{n_1}{2} \ln |V_1| + \ln \frac{\Gamma_p\!\left(\frac{n_2}{2}\right)}{\Gamma_p\!\left(\frac{n_1}{2}\right)} + \frac{n_1-n_2}{2} \left[ p \ln 2 + \ln |V_1| + \psi_p\!\left(\frac{n_1}{2}\right) \right] - \frac{n_1}{2} \mathrm{tr}\!\left( V_1^{-1} V_1 \right) + \frac{n_1}{2} \mathrm{tr}\!\left( V_2^{-1} V_1 \right) \]
\[ = \frac{n_2}{2} (\ln |V_2| - \ln |V_1|) + \ln \frac{\Gamma_p\!\left(\frac{n_2}{2}\right)}{\Gamma_p\!\left(\frac{n_1}{2}\right)} + \frac{n_1-n_2}{2} \psi_p\!\left(\frac{n_1}{2}\right) - \frac{n_1}{2} \mathrm{tr}(I_p) + \frac{n_1}{2} \mathrm{tr}\!\left( V_2^{-1} V_1 \right) \]
\[ = \frac{1}{2} \left[ n_2 (\ln |V_2| - \ln |V_1|) + n_1 \mathrm{tr}(V_2^{-1} V_1) + 2 \ln \frac{\Gamma_p\!\left(\frac{n_2}{2}\right)}{\Gamma_p\!\left(\frac{n_1}{2}\right)} + (n_1 - n_2) \psi_p\!\left(\frac{n_1}{2}\right) - n_1 p \right] \; . \tag{11} \]
Sources:
• Penny, William D. (2001): “KL-Divergences of Normal, Gamma, Dirichlet and Wishart densities”;
in: University College, London, pp. 2-3, eqs. 13/15; URL: https://www.fil.ion.ucl.ac.uk/~wpenny/
publications/densities.ps.
• Wikipedia (2021): “Wishart distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2021-
12-02; URL: https://en.wikipedia.org/wiki/Wishart_distribution#KL-divergence.
X, Y ∼ NW(M, U, V, ν) , (1)
if the distribution of X conditional on Y is a matrix-normal distribution (→ II/5.1.1) with mean M ,
covariance across rows U , covariance across columns Y −1 and Y follows a Wishart distribution (→
II/5.2.1) with scale matrix V and degrees of freedom ν:
X|Y ∼ MN (M, U, Y −1 )
(2)
Y ∼ W(V, ν) .
The p × p matrix Y can be seen as the precision matrix (→ I/1.13.19) across the columns of the
n × p matrix X.
X, Y ∼ NW(M, U, V, ν) . (1)
Then, the joint probability (→ I/1.3.2) density function (→ I/1.7.1) of X and Y is
\[ p(X, Y) = \sqrt{\frac{1}{(2\pi)^{np} |U|^p |V|^\nu}} \cdot \frac{\sqrt{2^{-\nu p}}}{\Gamma_p\!\left(\frac{\nu}{2}\right)} \cdot |Y|^{(\nu+n-p-1)/2} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( Y \left[ (X - M)^\mathrm{T} U^{-1} (X - M) + V^{-1} \right] \right) \right] \; . \tag{2} \]
X|Y ∼ MN (M, U, Y −1 )
(3)
Y ∼ W(V, ν) .
Thus, using the probability density function of the matrix-normal distribution (→ II/5.1.3) and the
probability density function of the Wishart distribution, we have the following probabilities:
\[ p(X|Y) = \mathcal{MN}(X; M, U, Y^{-1}) = \sqrt{\frac{|Y|^n}{(2\pi)^{np} |U|^p}} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( Y (X - M)^\mathrm{T} U^{-1} (X - M) \right) \right] \]
\[ p(Y) = \mathcal{W}(Y; V, \nu) = \frac{1}{\Gamma_p\!\left(\frac{\nu}{2}\right)} \cdot \sqrt{\frac{1}{2^{\nu p} |V|^\nu}} \cdot |Y|^{(\nu-p-1)/2} \cdot \exp\!\left[ -\frac{1}{2} \mathrm{tr}\!\left( V^{-1} Y \right) \right] \; . \tag{4} \]
■
5.3.3 Mean
Theorem: Let X ∈ Rn×p and Y ∈ Rp×p follow a normal-Wishart distribution (→ II/5.3.1):
X, Y ∼ NW(M, U, V, ν) . (1)
Then, the expected value (→ I/1.10.1) of X and Y is
\[ \mathrm{E}(X) = \iint X \cdot p(X, Y) \, \mathrm{d}X \, \mathrm{d}Y = \iint X \cdot p(X|Y) \cdot p(Y) \, \mathrm{d}X \, \mathrm{d}Y = \int p(Y) \left[ \int X \cdot p(X|Y) \, \mathrm{d}X \right] \mathrm{d}Y \]
\[ = \int p(Y) \, \langle X \rangle_{\mathcal{MN}(M, U, Y^{-1})} \, \mathrm{d}Y = \int p(Y) \cdot M \, \mathrm{d}Y = M \int p(Y) \, \mathrm{d}Y = M \; , \tag{6} \]
and with the expected value of the Wishart distribution, E(Y) becomes
\[ \mathrm{E}(Y) = \int Y \cdot p(Y) \, \mathrm{d}Y = \langle Y \rangle_{\mathcal{W}(V, \nu)} = \nu V \; . \tag{7} \]
Thus, the expectation of the random matrix in equations (3) and (4) is
\[ \mathrm{E} \begin{bmatrix} X \\ Y \end{bmatrix} = \begin{bmatrix} M \\ \nu V \end{bmatrix} \; , \tag{8} \]
as indicated by equation (2).
■
Chapter III
Statistical Models
yi ∼ N (µ, σ 2 ), i = 1, . . . , n . (1)
Sources:
• Bishop, Christopher M. (2006): “Example: The univariate Gaussian”; in: Pattern Recognition for
Machine Learning, ch. 10.1.3, p. 470, eq. 10.21; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/
school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%
20%202006.pdf.
yi ∼ N (µ, σ 2 ), i = 1, . . . , n . (1)
Then, the maximum likelihood estimates (→ I/4.1.3) for mean µ and variance σ 2 are given by
1X
n
µ̂ = yi
n i=1
(2)
1X
n
σ̂ 2 = (yi − ȳ)2 .
n i=1
Proof: The likelihood function (→ I/5.1.2) for each observation is given by the probability density
function of the normal distribution (→ II/3.2.10)
\[ p(y_i | \mu, \sigma^2) = \mathcal{N}(y_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \exp\!\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^{\!2} \right] \tag{3} \]
and because observations are independent (→ I/1.3.6), the likelihood function for all observations is
the product of the individual ones:
\[ p(y | \mu, \sigma^2) = \prod_{i=1}^n p(y_i | \mu, \sigma^2) = \sqrt{\frac{1}{(2\pi\sigma^2)^n}} \cdot \exp\!\left[ -\frac{1}{2} \sum_{i=1}^n \left( \frac{y_i - \mu}{\sigma} \right)^{\!2} \right] \; . \tag{4} \]
This can be developed into
1. UNIVARIATE NORMAL DATA 335
\[ p(y | \mu, \sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^{\!n/2} \cdot \exp\!\left[ -\frac{1}{2} \sum_{i=1}^n \frac{y_i^2 - 2 y_i \mu + \mu^2}{\sigma^2} \right] = \left( \frac{1}{2\pi\sigma^2} \right)^{\!n/2} \cdot \exp\!\left[ -\frac{1}{2\sigma^2} \left( y^\mathrm{T} y - 2 n \bar{y} \mu + n \mu^2 \right) \right] \tag{5} \]
where $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ is the mean of data points and $y^\mathrm{T} y = \sum_{i=1}^n y_i^2$ is the sum of squared data points.
Thus, the log-likelihood function (→ I/4.1.2) is
\[ \mathrm{LL}(\mu, \sigma^2) = \log p(y | \mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \left( y^\mathrm{T} y - 2 n \bar{y} \mu + n \mu^2 \right) \; . \tag{6} \]
The derivative of the log-likelihood function with respect to µ is
\[ \frac{\mathrm{dLL}(\mu, \sigma^2)}{\mathrm{d}\mu} = \frac{n\bar{y}}{\sigma^2} - \frac{n\mu}{\sigma^2} = \frac{n}{\sigma^2} (\bar{y} - \mu) \tag{7} \]
and setting this derivative to zero gives the MLE for µ:
\[ \frac{\mathrm{dLL}(\hat{\mu}, \sigma^2)}{\mathrm{d}\mu} = 0 \]
\[ 0 = \frac{n}{\sigma^2} (\bar{y} - \hat{\mu}) \]
\[ \hat{\mu} = \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \; . \tag{8} \]
n i=1
The derivative of the log-likelihood function with respect to σ² is
\[ \frac{\mathrm{dLL}(\hat{\mu}, \sigma^2)}{\mathrm{d}\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \left( y^\mathrm{T} y - 2 n \bar{y} \hat{\mu} + n \hat{\mu}^2 \right) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n \left( y_i^2 - 2 y_i \hat{\mu} + \hat{\mu}^2 \right) = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i - \hat{\mu})^2 \tag{9} \]
and setting this derivative to zero gives the MLE for σ²:
\[ \frac{\mathrm{dLL}(\hat{\mu}, \hat{\sigma}^2)}{\mathrm{d}\sigma^2} = 0 \]
\[ \frac{n}{2\hat{\sigma}^2} = \frac{1}{2(\hat{\sigma}^2)^2} \sum_{i=1}^n (y_i - \hat{\mu})^2 \]
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\mu})^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})^2 \; . \tag{10} \]
Together, (8) and (10) constitute the MLE for the univariate Gaussian.
Sources:
• Bishop CM (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition for Machine
Learning, pp. 93-94, eqs. 2.121, 2.122; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/
Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%
202006.pdf.
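The two estimators in equation (2) can be computed directly from simulated data. A minimal sketch using only the Python standard library; the simulation parameters are illustrative.

```python
import random
import statistics

random.seed(0)
mu_true, sigma_true = 5.0, 2.0
y = [random.gauss(mu_true, sigma_true) for _ in range(100_000)]

n = len(y)
mu_hat = sum(y) / n                                   # eq. (2), sample mean
sigma2_hat = sum((yi - mu_hat) ** 2 for yi in y) / n  # eq. (2), 1/n (biased) form
s2_unbiased = statistics.variance(y)                  # 1/(n-1) form, for contrast
print(mu_hat, sigma2_hat)  # close to 5 and 4
```

Note that the MLE divides by n and is therefore slightly smaller than the unbiased sample variance, which divides by n − 1.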
yi ∼ N (µ, σ 2 ), i = 1, . . . , n (1)
be a univariate Gaussian data set (→ III/1.1.1) with unknown mean µ and unknown variance σ 2 .
Then, the test statistic (→ I/4.3.5)
\[ t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}} \tag{2} \]
with sample mean (→ I/1.10.2) ȳ and sample variance (→ I/1.11.2) s2 follows a Student’s t-
distribution (→ II/3.3.1) with n − 1 degrees of freedom
t ∼ t(n − 1) (3)
under the null hypothesis (→ I/4.3.2)
H0 : µ = µ0 . (4)
Proof: The sample mean (→ I/1.10.2) is given by
\[ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \tag{5} \]
and the sample variance (→ I/1.11.2) is given by
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2 \; . \tag{6} \]
Using the linear combination formula for normal random variables (→ II/3.2.26), the sample mean
follows a normal distribution (→ II/3.2.1) with the following parameters:
\[ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \sim \mathcal{N}\!\left( \frac{1}{n} \, n\mu, \; \left(\frac{1}{n}\right)^{\!2} n\sigma^2 \right) = \mathcal{N}(\mu, \sigma^2/n) \; . \tag{7} \]
Again employing the linear combination theorem and applying the null hypothesis from (4), the distribution of Z = √n (ȳ − µ₀)/σ becomes standard normal (→ II/3.2.3):
\[ Z = \frac{\sqrt{n}(\bar{y} - \mu_0)}{\sigma} \sim \mathcal{N}\!\left( \frac{\sqrt{n}}{\sigma}(\mu - \mu_0), \; \left(\frac{\sqrt{n}}{\sigma}\right)^{\!2} \frac{\sigma^2}{n} \right) \overset{H_0}{=} \mathcal{N}(0, 1) \; . \tag{8} \]
Because sample variances calculated from independent normal random variables follow a chi-squared
distribution (→ II/3.2.7), the distribution of V = (n − 1) s2 /σ 2 is
\[ V = \frac{(n-1) \, s^2}{\sigma^2} \sim \chi^2(n-1) \; . \tag{9} \]
Finally, since the ratio of a standard normal random variable and the square root of a chi-squared
random variable follows a t-distribution (→ II/3.3.1), the distribution of the test statistic (→ I/4.3.5)
is given by
\[ t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}} = \frac{Z}{\sqrt{V / (n-1)}} \sim t(n-1) \; . \tag{10} \]
This means that the null hypothesis (→ I/4.3.2) can be rejected when t is as extreme or more extreme
than the critical value (→ I/4.3.9) obtained from the Student’s t-distribution (→ II/3.3.1) with n − 1
degrees of freedom using a significance level (→ I/4.3.8) α.
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-03-12; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Derivation.
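The derivation can be checked by simulation: under H₀, the statistic computed from repeated samples should behave like t(n − 1), whose mean is 0 and whose variance is (n − 1)/(n − 3). A sketch using only the Python standard library, with illustrative parameter values.

```python
import math
import random
import statistics

random.seed(1)
mu0, sigma, n = 3.0, 1.5, 20

def t_statistic(sample, mu0):
    ybar = statistics.fmean(sample)
    s = statistics.stdev(sample)           # sample std with n - 1 denominator
    return (ybar - mu0) / (s / math.sqrt(len(sample)))

# under H0, t should follow t(n - 1): mean 0, variance (n - 1)/(n - 3)
ts = [t_statistic([random.gauss(mu0, sigma) for _ in range(n)], mu0)
      for _ in range(50_000)]
print(statistics.fmean(ts), statistics.variance(ts))  # near 0 and near 19/17
```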
y1i ∼ N (µ1 , σ 2 ), i = 1, . . . , n1
(1)
y2i ∼ N (µ2 , σ 2 ), i = 1, . . . , n2
be two univariate Gaussian data sets (→ III/1.1.1) representing two groups of unequal size n1 and n2
with unknown means µ1 and µ2 and equal unknown variance σ 2 . Then, the test statistic (→ I/4.3.5)
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - \mu_\Delta}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \tag{2} \]
with sample means (→ I/1.10.2) ȳ1 and ȳ2 and pooled standard deviation sp follows a Student’s
t-distribution (→ II/3.3.1) with n1 + n2 − 2 degrees of freedom
t ∼ t(n1 + n2 − 2) (3)
under the null hypothesis (→ I/4.3.2)
H0 : µ1 − µ2 = µ∆ . (4)
Proof: The sample means (→ I/1.10.2) are given by
\[ \bar{y}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} y_{1i} \; , \qquad \bar{y}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} y_{2i} \tag{5} \]
and the sample variances (→ I/1.11.2) are given by
\[ s_1^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (y_{1i} - \bar{y}_1)^2 \; , \qquad s_2^2 = \frac{1}{n_2 - 1} \sum_{i=1}^{n_2} (y_{2i} - \bar{y}_2)^2 \; . \tag{7} \]
Using the linear combination formula for normal random variables (→ II/3.2.26), the sample means follow normal distributions (→ II/3.2.1) with the following parameters:
\[ \bar{y}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} y_{1i} \sim \mathcal{N}\!\left( \frac{1}{n_1} n_1 \mu_1, \; \left(\frac{1}{n_1}\right)^{\!2} n_1 \sigma^2 \right) = \mathcal{N}(\mu_1, \sigma^2/n_1) \]
\[ \bar{y}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} y_{2i} \sim \mathcal{N}\!\left( \frac{1}{n_2} n_2 \mu_2, \; \left(\frac{1}{n_2}\right)^{\!2} n_2 \sigma^2 \right) = \mathcal{N}(\mu_2, \sigma^2/n_2) \; . \tag{8} \]
Again employing the linear combination theorem and applying the null hypothesis from (4), the distribution of Z becomes standard normal:
\[ Z = \frac{(\bar{y}_1 - \bar{y}_2) - \mu_\Delta}{\sigma \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim \mathcal{N}\!\left( \frac{(\mu_1 - \mu_2) - \mu_\Delta}{\sigma \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \; \frac{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}}{\sigma^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} \right) \overset{H_0}{=} \mathcal{N}(0, 1) \; . \tag{9} \]
Because sample variances calculated from independent normal random variables follow a chi-squared
distribution (→ II/3.2.7), the distribution of V = (n1 + n2 − 2) s2p /σ 2 is
\[ V = \frac{(n_1 + n_2 - 2) \, s_p^2}{\sigma^2} \sim \chi^2(n_1 + n_2 - 2) \; . \tag{10} \]
Finally, since the ratio of a standard normal random variable and the square root of a chi-squared
random variable follows a t-distribution (→ II/3.3.1), the distribution of the test statistic (→ I/4.3.5)
is given by
\[ t = \frac{(\bar{y}_1 - \bar{y}_2) - \mu_\Delta}{s_p \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{Z}{\sqrt{V / (n_1 + n_2 - 2)}} \sim t(n_1 + n_2 - 2) \; . \tag{11} \]
This means that the null hypothesis (→ I/4.3.2) can be rejected when t is as extreme or more
extreme than the critical value (→ I/4.3.9) obtained from the Student’s t-distribution (→ II/3.3.1)
with n1 + n2 − 2 degrees of freedom using a significance level (→ I/4.3.8) α.
■
Sources:
• Wikipedia (2021): “Student’s t-distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2021-03-12; URL: https://en.wikipedia.org/wiki/Student%27s_t-distribution#Derivation.
• Wikipedia (2021): “Student’s t-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-12;
URL: https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_similar_
variances_(1/2_%3C_sX1/sX2_%3C_2).
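The pooled test statistic can be sketched directly from the definitions, using only the Python standard library; the group sizes and generating parameters are illustrative, and the data are drawn under H₀ so the statistic should look like a draw from t(n₁ + n₂ − 2).

```python
import math
import random
import statistics

random.seed(5)
n1, n2 = 40, 60
y1 = [random.gauss(10.0, 2.0) for _ in range(n1)]
y2 = [random.gauss(10.0, 2.0) for _ in range(n2)]   # same mean, so H0 is true

ybar1, ybar2 = statistics.fmean(y1), statistics.fmean(y2)
s1sq, s2sq = statistics.variance(y1), statistics.variance(y2)
# pooled standard deviation combines the two unbiased variance estimates
sp = math.sqrt(((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2))
t = (ybar1 - ybar2 - 0.0) / (sp * math.sqrt(1 / n1 + 1 / n2))
print(t)  # ~ t(98) under H0, so typically within about +-2
```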
\[ t = \frac{\bar{d} - \mu_0}{s_d / \sqrt{n}} \quad\text{where}\quad d_i = y_{i1} - y_{i2} \tag{2} \]
with sample mean (→ I/1.10.2) d̄ and sample variance (→ I/1.11.2) s_d² follows a Student’s t-distribution (→ II/3.3.1) with n − 1 degrees of freedom
t ∼ t(n − 1) (3)
under the null hypothesis (→ I/4.3.2)
H0 : µ = µ0 . (4)
Proof: Define the pair-wise difference di = yi1 −yi2 which is, according to the linearity of the expected
value (→ I/1.10.5) and the invariance of the variance under addition (→ I/1.11.6), distributed as
Sources:
• Wikipedia (2021): “Student’s t-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-12;
URL: https://en.wikipedia.org/wiki/Student%27s_t-test#Dependent_t-test_for_paired_samples.
Proof: By definition, a conjugate prior (→ I/5.2.5) is a prior distribution (→ I/5.1.3) that, when
combined with the likelihood function (→ I/5.1.2), leads to a posterior distribution (→ I/5.1.7) that
belongs to the same family of probability distributions (→ I/1.5.1). This is fulfilled when the prior
density and the likelihood function depend on the model parameters in the same way,
i.e. the model parameters appear in the same functional form in both.
Equation (1) implies the following likelihood function (→ I/5.1.2)
\[ p(y | \mu, \sigma^2) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\!\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^{\!2} \right] = \frac{1}{(\sqrt{2\pi}\sigma)^n} \cdot \exp\!\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \tag{3} \]
or, in terms of the precision τ = 1/σ²:
\[ p(y | \mu, \tau) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \tau^{-1}) = \prod_{i=1}^n \sqrt{\frac{\tau}{2\pi}} \cdot \exp\!\left[ -\frac{\tau}{2} (y_i - \mu)^2 \right] = \left( \frac{\tau}{2\pi} \right)^{\!n/2} \cdot \exp\!\left[ -\frac{\tau}{2} \sum_{i=1}^n (y_i - \mu)^2 \right] \tag{4} \]
\[ p(y | \mu, \tau) = \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\!\left[ -\frac{\tau}{2} \sum_{i=1}^n \left( y_i^2 - 2\mu y_i + \mu^2 \right) \right] = \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\!\left[ -\frac{\tau}{2} \left( y^\mathrm{T} y - 2\mu n \bar{y} + n\mu^2 \right) \right] = \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\!\left[ -\frac{\tau n}{2} \left( \frac{1}{n} y^\mathrm{T} y - 2\mu\bar{y} + \mu^2 \right) \right] \tag{6} \]
where $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ is the mean of data points and $y^\mathrm{T} y = \sum_{i=1}^n y_i^2$ is the sum of squared data points. Completing the square over µ, finally gives
\[ p(y | \mu, \tau) = \sqrt{\frac{1}{(2\pi)^n}} \cdot \tau^{n/2} \cdot \exp\!\left[ -\frac{\tau n}{2} \left( (\mu - \bar{y})^2 - \bar{y}^2 + \frac{1}{n} y^\mathrm{T} y \right) \right] \; . \tag{7} \]
In other words, the likelihood function (→ I/5.1.2) is proportional to a power of τ times an exponential
of τ and an exponential of a squared form of µ, weighted by τ :
\[ p(y | \mu, \tau) \propto \tau^{n/2} \cdot \exp\!\left[ -\frac{\tau}{2} \left( y^\mathrm{T} y - n\bar{y}^2 \right) \right] \cdot \exp\!\left[ -\frac{\tau n}{2} (\mu - \bar{y})^2 \right] \; . \tag{8} \]
The same is true for a normal-gamma distribution (→ II/4.3.1) over µ and τ
\[ p(\mu, \tau) = \sqrt{\frac{\tau\lambda_0}{2\pi}} \cdot \exp\!\left[ -\frac{\tau\lambda_0}{2} (\mu - \mu_0)^2 \right] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \, \tau^{a_0 - 1} \exp[-b_0 \tau] \tag{10} \]
exhibits the same proportionality
\[ p(\mu, \tau) \propto \tau^{a_0 + 1/2 - 1} \cdot \exp[-\tau b_0] \cdot \exp\!\left[ -\frac{\tau\lambda_0}{2} (\mu - \mu_0)^2 \right] \tag{11} \]
and is therefore conjugate relative to the likelihood.
■
Sources:
• Bishop CM (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition for Machine
Learning, pp. 97-102, eq. 2.154; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%
20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
\[ \mu_n = \frac{\lambda_0 \mu_0 + n\bar{y}}{\lambda_0 + n} \]
\[ \lambda_n = \lambda_0 + n \]
\[ a_n = a_0 + \frac{n}{2} \]
\[ b_n = b_0 + \frac{1}{2} \left( y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \; . \tag{4} \]
Proof: According to Bayes’ theorem (→ I/5.3.1), the posterior distribution (→ I/5.1.7) is given by
\[ p(\mu, \tau | y) = \frac{p(y | \mu, \tau) \, p(\mu, \tau)}{p(y)} \; . \tag{5} \]
Since p(y) is just a normalization factor, the posterior is proportional (→ I/5.1.9) to the numerator:
\[ p(y | \mu, \sigma^2) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\!\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^{\!2} \right] = \frac{1}{(\sqrt{2\pi}\sigma)^n} \cdot \exp\!\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \tag{7} \]
or, in terms of the precision τ = 1/σ²:
\[ p(y | \mu, \tau) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \tau^{-1}) = \prod_{i=1}^n \sqrt{\frac{\tau}{2\pi}} \cdot \exp\!\left[ -\frac{\tau}{2} (y_i - \mu)^2 \right] = \left( \frac{\tau}{2\pi} \right)^{\!n/2} \cdot \exp\!\left[ -\frac{\tau}{2} \sum_{i=1}^n (y_i - \mu)^2 \right] \tag{8} \]
Combining the likelihood function (→ I/5.1.2) (8) with the prior distribution (→ I/5.1.3) (2), the
joint likelihood (→ I/5.1.5) of the model is given by
\[ p(y, \mu, \tau) = \sqrt{\frac{\tau^{n+1} \lambda_0}{(2\pi)^{n+1}}} \, \frac{b_0^{a_0}}{\Gamma(a_0)} \, \tau^{a_0-1} \exp[-b_0 \tau] \cdot \exp\!\left[ -\frac{\tau}{2} \left( \sum_{i=1}^n (y_i - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right) \right] \; . \tag{10} \]
Resolving the squares, this becomes
\[ p(y, \mu, \tau) = \sqrt{\frac{\tau^{n+1} \lambda_0}{(2\pi)^{n+1}}} \, \frac{b_0^{a_0}}{\Gamma(a_0)} \, \tau^{a_0-1} \exp[-b_0 \tau] \cdot \exp\!\left[ -\frac{\tau}{2} \left( (y^\mathrm{T} y - 2 \mu n \bar{y} + n \mu^2) + \lambda_0 (\mu^2 - 2 \mu \mu_0 + \mu_0^2) \right) \right] \tag{11} \]
where $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$ and $y^\mathrm{T} y = \sum_{i=1}^n y_i^2$, such that
\[ p(y, \mu, \tau) = \sqrt{\frac{\tau^{n+1} \lambda_0}{(2\pi)^{n+1}}} \, \frac{b_0^{a_0}}{\Gamma(a_0)} \, \tau^{a_0-1} \exp[-b_0 \tau] \cdot \exp\!\left[ -\frac{\tau}{2} \left( \mu^2 (\lambda_0 + n) - 2 \mu (\lambda_0 \mu_0 + n \bar{y}) + (y^\mathrm{T} y + \lambda_0 \mu_0^2) \right) \right] \; . \tag{12} \]
Completing the square over µ, we finally have
\[ p(y, \mu, \tau) = \sqrt{\frac{\tau^{n+1} \lambda_0}{(2\pi)^{n+1}}} \, \frac{b_0^{a_0}}{\Gamma(a_0)} \, \tau^{a_0-1} \exp[-b_0 \tau] \cdot \exp\!\left[ -\frac{\tau \lambda_n}{2} (\mu - \mu_n)^2 - \frac{\tau}{2} \left( y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \right] \tag{13} \]
with the posterior hyperparameters
\[ \mu_n = \frac{\lambda_0 \mu_0 + n \bar{y}}{\lambda_0 + n} \; , \qquad \lambda_n = \lambda_0 + n \tag{14} \]
and
\[ a_n = a_0 + \frac{n}{2} \; , \qquad b_n = b_0 + \frac{1}{2} \left( y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \; . \tag{16} \]
From the term in (13), we can isolate the posterior distribution over µ given τ :
Sources:
• Bishop CM (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition for Machine
Learning, pp. 97-102, eq. 2.154; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%
20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
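The update rules in equation (4) translate directly into code. A sketch using only the Python standard library; the prior hyperparameter values are illustrative, not from the text.

```python
import random

random.seed(3)
y = [random.gauss(5.0, 2.0) for _ in range(100)]
n = len(y)
ybar = sum(y) / n
yty = sum(yi * yi for yi in y)

# prior hyperparameters (illustrative values, not from the text)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# posterior hyperparameters, eq. (4)
mun = (lam0 * mu0 + n * ybar) / (lam0 + n)
lamn = lam0 + n
an = a0 + n / 2
bn = b0 + 0.5 * (yty + lam0 * mu0 ** 2 - lamn * mun ** 2)

print(mun, an / bn)  # posterior mean of mu near 5, of tau near 1/4
```

With enough data, the posterior mean µₙ approaches the sample mean and the posterior mean of the precision, aₙ/bₙ, approaches 1/σ².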
\[ \mu_n = \frac{\lambda_0 \mu_0 + n\bar{y}}{\lambda_0 + n} \]
\[ \lambda_n = \lambda_0 + n \]
\[ a_n = a_0 + \frac{n}{2} \]
\[ b_n = b_0 + \frac{1}{2} \left( y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \; . \tag{4} \]
2
Proof: According to the law of marginal probability (→ I/1.3.3), the model evidence (→ I/5.1.11)
for this model is:
\[ p(y|m) = \iint p(y | \mu, \tau) \, p(\mu, \tau) \, \mathrm{d}\mu \, \mathrm{d}\tau \; . \tag{5} \]
According to the law of conditional probability (→ I/1.3.4), the integrand is equivalent to the joint
likelihood (→ I/5.1.5):
\[ p(y|m) = \iint p(y, \mu, \tau) \, \mathrm{d}\mu \, \mathrm{d}\tau \; . \tag{6} \]
\[ p(y | \mu, \sigma^2) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\!\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^{\!2} \right] = \frac{1}{(\sqrt{2\pi}\sigma)^n} \cdot \exp\!\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] \tag{7} \]
or, in terms of the precision τ = 1/σ²:
\[ p(y | \mu, \tau) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \tau^{-1}) = \prod_{i=1}^n \sqrt{\frac{\tau}{2\pi}} \cdot \exp\!\left[ -\frac{\tau}{2} (y_i - \mu)^2 \right] = \left( \frac{\tau}{2\pi} \right)^{\!n/2} \cdot \exp\!\left[ -\frac{\tau}{2} \sum_{i=1}^n (y_i - \mu)^2 \right] \tag{8} \]
When deriving the posterior distribution (→ III/1.1.7) p(µ, τ |y), the joint likelihood p(y, µ, τ ) is
obtained as
\[ p(y, \mu, \tau) = \sqrt{\frac{\tau^{n+1}\lambda_0}{(2\pi)^{n+1}}} \, \frac{b_0^{a_0}}{\Gamma(a_0)} \tau^{a_0-1} \exp[-b_0\tau] \cdot \exp\!\left[ -\frac{\tau\lambda_n}{2}(\mu-\mu_n)^2 - \frac{\tau}{2}\left( y^\mathrm{T} y + \lambda_0\mu_0^2 - \lambda_n\mu_n^2 \right) \right] \; . \tag{9} \]
Using the probability density function of the normal distribution (→ II/3.2.10), we can rewrite this as
\[ p(y, \mu, \tau) = \sqrt{\frac{\tau^n}{(2\pi)^n}} \sqrt{\frac{\tau\lambda_0}{2\pi}} \sqrt{\frac{2\pi}{\tau\lambda_n}} \, \frac{b_0^{a_0}}{\Gamma(a_0)} \tau^{a_0-1} \exp[-b_0\tau] \cdot \mathcal{N}(\mu; \mu_n, (\tau\lambda_n)^{-1}) \cdot \exp\!\left[ -\frac{\tau}{2}\left( y^\mathrm{T} y + \lambda_0\mu_0^2 - \lambda_n\mu_n^2 \right) \right] \; . \tag{10} \]
Now, µ can be integrated out easily:
\[ \int p(y, \mu, \tau) \, \mathrm{d}\mu = \sqrt{\frac{1}{(2\pi)^n}} \sqrt{\frac{\lambda_0}{\lambda_n}} \, \frac{b_0^{a_0}}{\Gamma(a_0)} \tau^{a_0+n/2-1} \exp[-b_0\tau] \cdot \exp\!\left[ -\frac{\tau}{2}\left( y^\mathrm{T} y + \lambda_0\mu_0^2 - \lambda_n\mu_n^2 \right) \right] \; . \tag{11} \]
Using the probability density function of the gamma distribution (→ II/3.4.7), we can rewrite this as
\[ \int p(y, \mu, \tau) \, \mathrm{d}\mu = \sqrt{\frac{1}{(2\pi)^n}} \sqrt{\frac{\lambda_0}{\lambda_n}} \, \frac{b_0^{a_0}}{\Gamma(a_0)} \frac{\Gamma(a_n)}{b_n^{a_n}} \, \mathrm{Gam}(\tau; a_n, b_n) \; . \tag{12} \]
Finally, τ can also be integrated out:
\[ \iint p(y, \mu, \tau) \, \mathrm{d}\mu \, \mathrm{d}\tau = \sqrt{\frac{1}{(2\pi)^n}} \sqrt{\frac{\lambda_0}{\lambda_n}} \, \frac{\Gamma(a_n)}{\Gamma(a_0)} \frac{b_0^{a_0}}{b_n^{a_n}} \; . \tag{13} \]
Thus, the log model evidence (→ IV/3.1.3) of this model is given by
\[ \log p(y|m) = -\frac{n}{2}\log(2\pi) + \frac{1}{2}\log\frac{\lambda_0}{\lambda_n} + \log\Gamma(a_n) - \log\Gamma(a_0) + a_0 \log b_0 - a_n \log b_n \; . \tag{14} \]
■
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning, pp.
152-161, ex. 3.23, eq. 3.118; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%
20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
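Equation (14) is straightforward to implement. The sketch below, using only the Python standard library, compares the evidence of a prior centered on the data-generating mean with that of a badly mismatched prior; the hyperparameter values are illustrative.

```python
import math
import random

random.seed(4)
y = [random.gauss(0.0, 1.0) for _ in range(50)]
n = len(y)
ybar = sum(y) / n
yty = sum(v * v for v in y)

def log_model_evidence(mu0, lam0, a0, b0):
    """Log model evidence of the univariate Gaussian with NG prior, eq. (14)."""
    mun = (lam0 * mu0 + n * ybar) / (lam0 + n)
    lamn = lam0 + n
    an = a0 + n / 2
    bn = b0 + 0.5 * (yty + lam0 * mu0 ** 2 - lamn * mun ** 2)
    return (-n / 2 * math.log(2 * math.pi) + 0.5 * math.log(lam0 / lamn)
            + math.lgamma(an) - math.lgamma(a0)
            + a0 * math.log(b0) - an * math.log(bn))

# a prior centered on the truth should score higher than a mismatched one
print(log_model_evidence(0.0, 1.0, 1.0, 1.0),
      log_model_evidence(10.0, 1.0, 1.0, 1.0))
```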
\[ \mathrm{Acc}(m) = -\frac{1}{2} \frac{a_n}{b_n} \left( y^\mathrm{T} y - 2n\bar{y}\mu_n + n\mu_n^2 \right) - \frac{1}{2} n \lambda_n^{-1} + \frac{n}{2} \left( \psi(a_n) - \log(b_n) \right) - \frac{n}{2} \log(2\pi) \]
\[ \mathrm{Com}(m) = \frac{1}{2} \frac{a_n}{b_n} \left[ \lambda_0 (\mu_0 - \mu_n)^2 - 2(b_n - b_0) \right] + \frac{1}{2} \frac{\lambda_0}{\lambda_n} - \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} - \frac{1}{2} + a_0 \log\frac{b_n}{b_0} - \log\frac{\Gamma(a_n)}{\Gamma(a_0)} + (a_n - a_0) \psi(a_n) \tag{3} \]
where µn and λn as well as an and bn are the posterior hyperparameters for the univariate Gaussian
(→ III/1.1.7) and ȳ is the sample mean (→ I/1.10.2).
The accuracy term is the expectation (→ I/1.10.1) of the log-likelihood function (→ I/4.1.2) log p(y|µ, τ )
with respect to the posterior distribution (→ I/5.1.7) p(µ, τ |y). With the log-likelihood function for
the univariate Gaussian (→ III/1.1.2) and the posterior distribution for the univariate Gaussian (→
III/1.1.7), the model accuracy of m evaluates to:
\[ \mathrm{Acc}(m) = \left\langle \frac{n}{2}\log(\tau) - \frac{n}{2}\log(2\pi) - \frac{\tau}{2}\left( y^\mathrm{T} y - 2n\bar{y}\mu_n + n\mu_n^2 \right) - \frac{1}{2} n\lambda_n^{-1} \right\rangle_{\mathrm{Gam}(a_n, b_n)} \tag{5} \]
\[ = \frac{n}{2}\left( \psi(a_n) - \log(b_n) \right) - \frac{n}{2}\log(2\pi) - \frac{1}{2}\frac{a_n}{b_n}\left( y^\mathrm{T} y - 2n\bar{y}\mu_n + n\mu_n^2 \right) - \frac{1}{2} n\lambda_n^{-1} \]
\[ = -\frac{1}{2}\frac{a_n}{b_n}\left( y^\mathrm{T} y - 2n\bar{y}\mu_n + n\mu_n^2 \right) - \frac{1}{2} n\lambda_n^{-1} + \frac{n}{2}\left( \psi(a_n) - \log(b_n) \right) - \frac{n}{2}\log(2\pi) \]
The complexity penalty is the Kullback-Leibler divergence (→ I/2.5.1) of the posterior distribution
(→ I/5.1.7) p(µ, τ |y) from the prior distribution (→ I/5.1.3) p(µ, τ ). With the prior distribution (→
III/1.1.6) given by (2), the posterior distribution for the univariate Gaussian (→ III/1.1.7) and the
Kullback-Leibler divergence of the normal-gamma distribution (→ II/4.3.7), the model complexity
of m evaluates to:
yi ∼ N (µ, σ 2 ), i = 1, . . . , n . (1)
Sources:
• Bishop, Christopher M. (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition
for Machine Learning, ch. 2.3.6, p. 97, eq. 2.137; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/
school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%
20%202006.pdf.
yi ∼ N (µ, σ 2 ), i = 1, . . . , n . (1)
Then, the maximum likelihood estimate (→ I/4.1.3) for the mean µ is given by
µ̂ = ȳ (2)
where ȳ is the sample mean (→ I/1.10.2)
\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i . \quad (3)
Proof: The likelihood function (→ I/5.1.2) for each observation is given by the probability density
function of the normal distribution (→ II/3.2.10)
p(y_i|\mu) = \mathcal{N}(y_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right] \quad (4)
and because observations are independent (→ I/1.3.6), the likelihood function for all observations is
the product of the individual ones:
p(y|\mu) = \prod_{i=1}^n p(y_i|\mu) = \sqrt{\frac{1}{(2\pi\sigma^2)^n}} \cdot \exp\left[ -\frac{1}{2} \sum_{i=1}^n \left( \frac{y_i - \mu}{\sigma} \right)^2 \right] . \quad (5)
This can be developed into
p(y|\mu) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \cdot \exp\left[ -\frac{1}{2} \sum_{i=1}^n \frac{y_i^2 - 2 y_i \mu + \mu^2}{\sigma^2} \right]
= \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \cdot \exp\left[ -\frac{1}{2\sigma^2} \left( y^\mathrm{T} y - 2 n \bar{y} \mu + n \mu^2 \right) \right] \quad (6)
350 CHAPTER III. STATISTICAL MODELS
where \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i is the mean of data points and y^\mathrm{T} y = \sum_{i=1}^n y_i^2 is the sum of squared data points.
Thus, the log-likelihood function (→ I/4.1.2) is
\mathrm{LL}(\mu) = \log p(y|\mu) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \left( y^\mathrm{T} y - 2 n \bar{y} \mu + n \mu^2 \right) . \quad (7)
The derivatives of the log-likelihood with respect to µ are
\frac{\mathrm{dLL}(\mu)}{\mathrm{d}\mu} = \frac{n\bar{y}}{\sigma^2} - \frac{n\mu}{\sigma^2} = \frac{n}{\sigma^2} (\bar{y} - \mu)
\frac{\mathrm{d}^2 \mathrm{LL}(\mu)}{\mathrm{d}\mu^2} = -\frac{n}{\sigma^2} . \quad (8)
Setting the first derivative to zero, we obtain:
\frac{\mathrm{dLL}(\hat{\mu})}{\mathrm{d}\mu} = 0
0 = \frac{n}{\sigma^2} (\bar{y} - \hat{\mu})
0 = \bar{y} - \hat{\mu}
\hat{\mu} = \bar{y} \quad (9)

Plugging this estimate into the second derivative in (8) confirms that it is a maximum:

\frac{\mathrm{d}^2 \mathrm{LL}(\hat{\mu})}{\mathrm{d}\mu^2} = -\frac{n}{\sigma^2} < 0 . \quad (10)
This demonstrates that the estimate µ̂ = ȳ maximizes the likelihood p(y|µ).
■
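As a numerical sanity check (not part of the original proof), the log-likelihood (7) can be maximized on a fine grid and compared against the sample mean; the simulated data, n and σ below are arbitrary illustration choices.

```python
# Grid-search maximization of the Gaussian log-likelihood LL(mu) from (7),
# checked against the closed-form MLE mu_hat = ybar.
import math, random

random.seed(1)
sigma = 2.0
y = [random.gauss(5.0, sigma) for _ in range(100)]
n = len(y)
ybar = sum(y) / n
yty = sum(v * v for v in y)          # y'y, the sum of squared data points

def log_lik(mu):
    # LL(mu) from equation (7)
    return -n/2 * math.log(2*math.pi*sigma**2) \
           - (yty - 2*n*ybar*mu + n*mu**2) / (2*sigma**2)

# evaluate LL on a fine grid around the sample mean and take the maximizer
grid = [ybar - 1.0 + 2.0*i/20000 for i in range(20001)]
mu_hat = max(grid, key=log_lik)
```

The grid maximizer coincides with the sample mean up to the grid resolution, as the proof predicts.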
Sources:
• Bishop, Christopher M. (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition and Machine Learning, ch. 2.3.6, p. 98, eq. 2.143; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
yi ∼ N (µ, σ 2 ), i = 1, . . . , n (1)
be a univariate Gaussian data set (→ III/1.2.1) with unknown mean µ and known variance σ 2 . Then,
the test statistic (→ I/4.3.5)
z = \sqrt{n} \, \frac{\bar{y} - \mu_0}{\sigma} \quad (2)
with sample mean (→ I/1.10.2) ȳ follows a standard normal distribution (→ II/3.2.3)
z ∼ N (0, 1) (3)
1. UNIVARIATE NORMAL DATA 351
under the null hypothesis (→ I/4.3.2)

H_0: \mu = \mu_0 . \quad (4)
\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i . \quad (5)
Using the linear combination formula for normal random variables (→ II/3.2.26), the sample mean
follows a normal distribution (→ II/3.2.1) with the following parameters:
\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \sim \mathcal{N}\!\left( \frac{1}{n} \, n\mu, \left( \frac{1}{n} \right)^2 n\sigma^2 \right) = \mathcal{N}\!\left( \mu, \sigma^2/n \right) . \quad (6)
Again employing the linear combination theorem, the distribution of z = \sqrt{n/\sigma^2} \, (\bar{y} - \mu_0) becomes

z = \sqrt{\frac{n}{\sigma^2}} \, (\bar{y} - \mu_0) \sim \mathcal{N}\!\left( \sqrt{\frac{n}{\sigma^2}} \, (\mu - \mu_0), \left( \sqrt{\frac{n}{\sigma^2}} \right)^2 \frac{\sigma^2}{n} \right) = \mathcal{N}\!\left( \sqrt{n} \, \frac{\mu - \mu_0}{\sigma}, 1 \right) , \quad (7)
such that, under the null hypothesis in (4), we have z \sim \mathcal{N}(0, 1).
Sources:
• Wikipedia (2021): “Z-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-24; URL:
https://en.wikipedia.org/wiki/Z-test#Use_in_location_testing.
• Wikipedia (2021): “Gauß-Test”; in: Wikipedia – Die freie Enzyklopädie, retrieved on 2021-03-24;
URL: https://de.wikipedia.org/wiki/Gau%C3%9F-Test#Einstichproben-Gau%C3%9F-Test.
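A minimal sketch of the one-sample z-test from (2), using only the Python standard library; µ₀, σ, n and the simulated data are illustration choices, not from the text.

```python
# One-sample z-test: z = sqrt(n) * (ybar - mu0) / sigma is N(0,1) under H0.
import math, random

random.seed(42)
mu0, sigma, n = 3.0, 1.5, 50
y = [random.gauss(mu0, sigma) for _ in range(n)]   # data generated under H0

ybar = sum(y) / n
z = math.sqrt(n) * (ybar - mu0) / sigma            # test statistic (2)

# two-sided p-value via the standard normal CDF, Phi(x) = (1 + erf(x/sqrt(2)))/2
Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
p = 2.0 * (1.0 - Phi(abs(z)))
```

Because the data are simulated under H₀, the realized z should be an unremarkable draw from N(0, 1) and p should be roughly uniform on [0, 1].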
be two univariate Gaussian data sets (→ III/1.1.1) representing two groups of unequal size n1 and
n2 with unknown means µ1 and µ2 and unknown variances σ12 and σ22 . Then, the test statistic (→
I/4.3.5)
z = \frac{(\bar{y}_1 - \bar{y}_2) - \mu_\Delta}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \quad (2)
with sample means (→ I/1.10.2) ȳ1 and ȳ2 follows a standard normal distribution (→ II/3.2.3)
z ∼ N (0, 1) (3)
under the null hypothesis (→ I/4.3.2)
H0 : µ1 − µ2 = µ∆ . (4)
\bar{y}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} y_{1i}
\bar{y}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} y_{2i} . \quad (5)
Using the linear combination formula for normal random variables (→ II/3.2.26), the sample means follow normal distributions (→ II/3.2.1) with the following parameters:
\bar{y}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} y_{1i} \sim \mathcal{N}\!\left( \frac{1}{n_1} \, n_1 \mu_1, \left( \frac{1}{n_1} \right)^2 n_1 \sigma_1^2 \right) = \mathcal{N}\!\left( \mu_1, \sigma_1^2/n_1 \right)
\bar{y}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} y_{2i} \sim \mathcal{N}\!\left( \frac{1}{n_2} \, n_2 \mu_2, \left( \frac{1}{n_2} \right)^2 n_2 \sigma_2^2 \right) = \mathcal{N}\!\left( \mu_2, \sigma_2^2/n_2 \right) . \quad (6)
Again employing the linear combination theorem, the distribution of z = [(ȳ1 − ȳ2 )−µ∆ ]/σ∆ becomes
z = \frac{(\bar{y}_1 - \bar{y}_2) - \mu_\Delta}{\sigma_\Delta} \sim \mathcal{N}\!\left( \frac{(\mu_1 - \mu_2) - \mu_\Delta}{\sigma_\Delta}, \left( \frac{1}{\sigma_\Delta} \right)^2 \sigma_\Delta^2 \right) = \mathcal{N}\!\left( \frac{(\mu_1 - \mu_2) - \mu_\Delta}{\sigma_\Delta}, 1 \right) \quad (7)
where σ∆ is the pooled standard deviation
\sigma_\Delta = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} , \quad (8)
such that, under the null hypothesis in (4), we have z \sim \mathcal{N}(0, 1).
Sources:
• Wikipedia (2021): “Z-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-24; URL:
https://en.wikipedia.org/wiki/Z-test#Use_in_location_testing.
• Wikipedia (2021): “Gauß-Test”; in: Wikipedia – Die freie Enzyklopädie, retrieved on 2021-03-24;
URL: https://de.wikipedia.org/wiki/Gau%C3%9F-Test#Zweistichproben-Gau%C3%9F-Test_f%
C3%BCr_unabh%C3%A4ngige_Stichproben.
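The two-sample z-test can be sketched the same way; the group sizes, means and known variances below are illustration choices (the null hypothesis µ₁ − µ₂ = µ∆ holds for the simulated data).

```python
# Two-sample z-test with known variances, following (2) and (8).
import math, random

random.seed(0)
mu1, mu2, s1, s2, n1, n2 = 4.0, 3.0, 1.0, 2.0, 40, 60
y1 = [random.gauss(mu1, s1) for _ in range(n1)]
y2 = [random.gauss(mu2, s2) for _ in range(n2)]

yb1, yb2 = sum(y1)/n1, sum(y2)/n2
mu_delta = 1.0                            # H0: mu1 - mu2 = mu_delta (true here)
sd = math.sqrt(s1**2/n1 + s2**2/n2)       # pooled standard deviation, eq. (8)
z = ((yb1 - yb2) - mu_delta) / sd         # test statistic, eq. (2)
```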
z = \sqrt{n} \, \frac{\bar{d} - \mu_0}{\sigma} \quad \text{where} \quad d_i = y_{i1} - y_{i2} \quad (2)
with sample mean (→ I/1.10.2) d¯ follows a standard normal distribution (→ II/3.2.3)
z ∼ N (0, 1) (3)
under the null hypothesis (→ I/4.3.2)
H0 : µ = µ0 . (4)
Proof: Define the pair-wise difference d_i = y_{i1} - y_{i2} which is, according to the linearity of the expected value (→ I/1.10.5) and the invariance of the variance under addition (→ I/1.11.6), distributed as d_i \sim \mathcal{N}(\mu, \sigma^2).
Sources:
• Wikipedia (2021): “Z-test”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-24; URL:
https://en.wikipedia.org/wiki/Z-test#Use_in_location_testing.
• Wikipedia (2021): “Gauß-Test”; in: Wikipedia – Die freie Enzyklopädie, retrieved on 2021-03-24;
URL: https://de.wikipedia.org/wiki/Gau%C3%9F-Test#Zweistichproben-Gau%C3%9F-Test_f%
C3%BCr_abh%C3%A4ngige_(verbundene)_Stichproben.
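The paired z-test reduces to the one-sample case on the differences, which the following sketch illustrates; n, µ, σ and the paired data are illustration choices (the variance of the differences is treated as known).

```python
# Paired z-test: apply the one-sample statistic (2) to d_i = y_i1 - y_i2.
import math, random

random.seed(7)
n, mu, sigma = 30, 0.5, 1.0
base = [random.gauss(10.0, 2.0) for _ in range(n)]    # shared pair effect
diffs = [random.gauss(mu, sigma) for _ in range(n)]   # true pair differences
y1 = [b + d for b, d in zip(base, diffs)]
y2 = base

d = [a - b for a, b in zip(y1, y2)]
dbar = sum(d) / n
mu0 = mu                                  # H0: mu = mu0 is true here
z = math.sqrt(n) * (dbar - mu0) / sigma   # test statistic, eq. (2)
```

Note how the large shared pair effect cancels out in the differences, which is the point of the paired design.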
Proof: By definition, a conjugate prior (→ I/5.2.5) is a prior distribution (→ I/5.1.3) that, when
combined with the likelihood function (→ I/5.1.2), leads to a posterior distribution (→ I/5.1.7) that
belongs to the same family of probability distributions (→ I/1.5.1). This is fulfilled when the prior density and the likelihood function are proportional in the model parameters in the same way, i.e. when the model parameters appear in the same functional form in both.
Equation (1) implies the following likelihood function (→ I/5.1.2)
p(y|\mu) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \sigma^2)
= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right] \quad (3)
= \left( \sqrt{\frac{1}{2\pi\sigma^2}} \right)^n \cdot \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right] .
In terms of the inverse variance or precision (→ I/1.11.12) \tau = 1/\sigma^2, this likelihood can be written as

p(y|\mu) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \tau^{-1})
= \prod_{i=1}^n \sqrt{\frac{\tau}{2\pi}} \cdot \exp\left[ -\frac{\tau}{2} (y_i - \mu)^2 \right] \quad (4)
= \left( \sqrt{\frac{\tau}{2\pi}} \right)^n \cdot \exp\left[ -\frac{\tau}{2} \sum_{i=1}^n (y_i - \mu)^2 \right] .
Expanding the squares then gives

p(y|\mu) = \left( \frac{\tau}{2\pi} \right)^{n/2} \cdot \exp\left[ -\frac{\tau}{2} \sum_{i=1}^n \left( y_i^2 - 2 \mu y_i + \mu^2 \right) \right]
= \left( \frac{\tau}{2\pi} \right)^{n/2} \cdot \exp\left[ -\frac{\tau}{2} \left( \sum_{i=1}^n y_i^2 - 2 \mu \sum_{i=1}^n y_i + n \mu^2 \right) \right] \quad (5)
= \left( \frac{\tau}{2\pi} \right)^{n/2} \cdot \exp\left[ -\frac{\tau}{2} \left( y^\mathrm{T} y - 2 \mu n \bar{y} + n \mu^2 \right) \right]
= \left( \frac{\tau}{2\pi} \right)^{n/2} \cdot \exp\left[ -\frac{\tau n}{2} \left( \frac{1}{n} y^\mathrm{T} y - 2 \mu \bar{y} + \mu^2 \right) \right]
where \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i is the mean of data points and y^\mathrm{T} y = \sum_{i=1}^n y_i^2 is the sum of squared data points.
Completing the square over µ finally gives

p(y|\mu) = \left( \frac{\tau}{2\pi} \right)^{n/2} \cdot \exp\left[ -\frac{\tau n}{2} \left( (\mu - \bar{y})^2 - \bar{y}^2 + \frac{1}{n} y^\mathrm{T} y \right) \right] , \quad (6)

such that, as a function of µ, the likelihood is proportional to

p(y|\mu) \propto \exp\left[ -\frac{\tau n}{2} (\mu - \bar{y})^2 \right] . \quad (7)
The same is true for a normal distribution (→ II/3.2.1) over µ, whose probability density is also proportional to the exponential of a negative quadratic form in µ, so that prior density and likelihood are proportional in µ in the same way.
Sources:
• Bishop, Christopher M. (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition and Machine Learning, ch. 2.3.6, pp. 97-98, eq. 2.138; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
\mu_n = \frac{\lambda_0 \mu_0 + \tau n \bar{y}}{\lambda_0 + \tau n}
\lambda_n = \lambda_0 + \tau n \quad (4)
with the sample mean (→ I/1.10.2) ȳ and the inverse variance or precision (→ I/1.11.12) τ = 1/σ 2 .
Proof: According to Bayes’ theorem (→ I/5.3.1), the posterior distribution (→ I/5.1.7) is given by
p(\mu|y) = \frac{p(y|\mu) \, p(\mu)}{p(y)} . \quad (5)
Since p(y) is just a normalization factor, the posterior is proportional (→ I/5.1.9) to the numerator:
p(y|\mu) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \sigma^2)
= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right] \quad (7)
= \left( \sqrt{\frac{1}{2\pi\sigma^2}} \right)^n \cdot \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right]
p(y|\mu) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \tau^{-1})
= \prod_{i=1}^n \sqrt{\frac{\tau}{2\pi}} \cdot \exp\left[ -\frac{\tau}{2} (y_i - \mu)^2 \right] \quad (8)
= \left( \sqrt{\frac{\tau}{2\pi}} \right)^n \cdot \exp\left[ -\frac{\tau}{2} \sum_{i=1}^n (y_i - \mu)^2 \right]
Combining the likelihood function (→ I/5.1.2) (8) with the prior distribution (→ I/5.1.3) (2), the
joint likelihood (→ I/5.1.5) of the model is given by
p(y,\mu) = \left( \frac{\tau}{2\pi} \right)^{\frac{n}{2}} \cdot \left( \frac{\lambda_0}{2\pi} \right)^{\frac{1}{2}} \cdot \exp\left[ -\frac{1}{2} \left( \tau \sum_{i=1}^n (y_i - \mu)^2 + \lambda_0 (\mu - \mu_0)^2 \right) \right]
= \left( \frac{\tau}{2\pi} \right)^{\frac{n}{2}} \cdot \left( \frac{\lambda_0}{2\pi} \right)^{\frac{1}{2}} \cdot \exp\left[ -\frac{1}{2} \left( \tau \sum_{i=1}^n (y_i^2 - 2 y_i \mu + \mu^2) + \lambda_0 (\mu^2 - 2 \mu \mu_0 + \mu_0^2) \right) \right] \quad (11)
= \left( \frac{\tau}{2\pi} \right)^{\frac{n}{2}} \cdot \left( \frac{\lambda_0}{2\pi} \right)^{\frac{1}{2}} \cdot \exp\left[ -\frac{1}{2} \left( \tau \left( y^\mathrm{T} y - 2 n \bar{y} \mu + n \mu^2 \right) + \lambda_0 (\mu^2 - 2 \mu \mu_0 + \mu_0^2) \right) \right]
= \left( \frac{\tau}{2\pi} \right)^{\frac{n}{2}} \cdot \left( \frac{\lambda_0}{2\pi} \right)^{\frac{1}{2}} \cdot \exp\left[ -\frac{1}{2} \left( \mu^2 (\tau n + \lambda_0) - 2 \mu (\tau n \bar{y} + \lambda_0 \mu_0) + (\tau y^\mathrm{T} y + \lambda_0 \mu_0^2) \right) \right]
where \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i and y^\mathrm{T} y = \sum_{i=1}^n y_i^2. Completing the square in µ then yields
p(y,\mu) = \left( \frac{\tau}{2\pi} \right)^{\frac{n}{2}} \cdot \left( \frac{\lambda_0}{2\pi} \right)^{\frac{1}{2}} \cdot \exp\left[ -\frac{\lambda_n}{2} (\mu - \mu_n)^2 + f_n \right] \quad (12)
with the posterior hyperparameters (→ I/5.1.7)
\mu_n = \frac{\lambda_0 \mu_0 + \tau n \bar{y}}{\lambda_0 + \tau n}
\lambda_n = \lambda_0 + \tau n \quad (13)
Sources:
• Bishop, Christopher M. (2006): “Bayesian inference for the Gaussian”; in: Pattern Recognition and Machine Learning, ch. 2.3.6, p. 98, eqs. 2.139-2.142; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf.
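The posterior hyperparameters can be checked numerically by comparing the closed-form posterior N(µₙ, λₙ⁻¹) against a brute-force grid posterior; the prior values, σ, n and data below are illustration choices.

```python
# Grid-based posterior over mu, compared against mu_n and lambda_n.
import math, random

random.seed(3)
mu_true, sigma, n = 1.0, 2.0, 25
tau = 1.0 / sigma**2
y = [random.gauss(mu_true, sigma) for _ in range(n)]
ybar = sum(y) / n
mu0, lam0 = 0.0, 0.5                     # prior N(mu0, 1/lam0)

mu_n = (lam0*mu0 + tau*n*ybar) / (lam0 + tau*n)   # posterior mean
lam_n = lam0 + tau*n                               # posterior precision

# unnormalized log posterior (likelihood times prior) on a fine grid
def logpost(mu):
    return -tau/2 * sum((yi - mu)**2 for yi in y) - lam0/2 * (mu - mu0)**2

grid = [mu_n - 3.0 + 6.0*i/20000 for i in range(20001)]
w = [math.exp(logpost(m) - logpost(mu_n)) for m in grid]
Z = sum(w)
mean = sum(m*wi for m, wi in zip(grid, w)) / Z
var = sum((m - mean)**2 * wi for m, wi in zip(grid, w)) / Z
```

The grid mean and variance match µₙ and 1/λₙ up to discretization error.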
\mu_n = \frac{\lambda_0 \mu_0 + \tau n \bar{y}}{\lambda_0 + \tau n}
\lambda_n = \lambda_0 + \tau n \quad (4)
with the sample mean (→ I/1.10.2) ȳ and the inverse variance or precision (→ I/1.11.12) τ = 1/σ 2 .
Proof: According to the law of marginal probability (→ I/1.3.3), the model evidence (→ I/5.1.11)
for this model is:
p(y|m) = \int p(y|\mu) \, p(\mu) \, \mathrm{d}\mu . \quad (5)
According to the law of conditional probability (→ I/1.3.4), the integrand is equivalent to the joint
likelihood (→ I/5.1.5):
p(y|m) = \int p(y,\mu) \, \mathrm{d}\mu . \quad (6)
p(y|\mu,\sigma^2) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \sigma^2)
= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{1}{2} \left( \frac{y_i - \mu}{\sigma} \right)^2 \right] \quad (7)
= \left( \sqrt{\frac{1}{2\pi\sigma^2}} \right)^n \cdot \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2 \right]
p(y|\mu,\tau) = \prod_{i=1}^n \mathcal{N}(y_i; \mu, \tau^{-1})
= \prod_{i=1}^n \sqrt{\frac{\tau}{2\pi}} \cdot \exp\left[ -\frac{\tau}{2} (y_i - \mu)^2 \right] \quad (8)
= \left( \sqrt{\frac{\tau}{2\pi}} \right)^n \cdot \exp\left[ -\frac{\tau}{2} \sum_{i=1}^n (y_i - \mu)^2 \right]
When deriving the posterior distribution (→ III/1.2.7) p(µ|y), the joint likelihood p(y, µ) is obtained
as
p(y,\mu) = \left( \frac{\tau}{2\pi} \right)^{\frac{n}{2}} \cdot \sqrt{\frac{\lambda_0}{2\pi}} \cdot \exp\left[ -\frac{\lambda_n}{2} (\mu - \mu_n)^2 - \frac{1}{2} \left( \tau y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \right] . \quad (9)
Using the probability density function of the normal distribution (→ II/3.2.10), we can rewrite this
as
p(y,\mu) = \left( \frac{\tau}{2\pi} \right)^{\frac{n}{2}} \cdot \sqrt{\frac{\lambda_0}{2\pi}} \cdot \sqrt{\frac{2\pi}{\lambda_n}} \cdot \mathcal{N}(\mu; \mu_n, \lambda_n^{-1}) \cdot \exp\left[ -\frac{1}{2} \left( \tau y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \right] . \quad (10)
Now, µ can be integrated out using the properties of the probability density function (→ I/1.7.1):
p(y|m) = \int p(y,\mu) \, \mathrm{d}\mu = \left( \frac{\tau}{2\pi} \right)^{\frac{n}{2}} \cdot \sqrt{\frac{\lambda_0}{\lambda_n}} \cdot \exp\left[ -\frac{1}{2} \left( \tau y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \right] . \quad (11)
Thus, the log model evidence (→ IV/3.1.3) of this model is given by
\log p(y|m) = \frac{n}{2} \log\frac{\tau}{2\pi} + \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} - \frac{1}{2} \left( \tau y^\mathrm{T} y + \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) . \quad (12)
■
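The closed-form log model evidence (12) can be validated by numerically integrating the joint density over µ; the data and prior values below are illustration choices.

```python
# Trapezoidal integration of p(y|mu) p(mu), compared against exp(LME) from (12).
import math, random

random.seed(5)
n, sigma = 20, 1.0
tau = 1.0 / sigma**2
y = [random.gauss(0.4, sigma) for _ in range(n)]
ybar = sum(y) / n
yty = sum(v*v for v in y)
mu0, lam0 = 0.0, 1.0
mu_n = (lam0*mu0 + tau*n*ybar) / (lam0 + tau*n)
lam_n = lam0 + tau*n

lme = (n/2*math.log(tau/(2*math.pi)) + 0.5*math.log(lam0/lam_n)
       - 0.5*(tau*yty + lam0*mu0**2 - lam_n*mu_n**2))     # eq. (12)

def log_joint(mu):
    ll = n/2*math.log(tau/(2*math.pi)) - tau/2*sum((yi - mu)**2 for yi in y)
    lp = 0.5*math.log(lam0/(2*math.pi)) - lam0/2*(mu - mu0)**2
    return ll + lp

# integrate exp(log_joint - lme) over +/- 8 posterior standard deviations;
# the result should be 1 if (12) is correct
lo = mu_n - 8.0/math.sqrt(lam_n)
hi = mu_n + 8.0/math.sqrt(lam_n)
m = 4000
h = (hi - lo) / m
vals = [math.exp(log_joint(lo + i*h) - lme) for i in range(m + 1)]
integral = h * (sum(vals) - 0.5*vals[0] - 0.5*vals[-1])
```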
\mathrm{Acc}(m) = \frac{n}{2} \log\frac{\tau}{2\pi} - \frac{1}{2} \left( \tau y^\mathrm{T} y - 2 \tau n \bar{y} \mu_n + \tau n \mu_n^2 + \frac{\tau n}{\lambda_n} \right)
\mathrm{Com}(m) = \frac{1}{2} \left( \frac{\lambda_0}{\lambda_n} + \lambda_0 (\mu_0 - \mu_n)^2 - 1 - \log\frac{\lambda_0}{\lambda_n} \right) \quad (3)
where µn and λn are the posterior hyperparameters for the univariate Gaussian with known variance
(→ III/1.2.7), τ = 1/σ 2 is the inverse variance or precision (→ I/1.11.12) and ȳ is the sample mean
(→ I/1.10.2).
The accuracy term is the expectation (→ I/1.10.1) of the log-likelihood function (→ I/4.1.2) log p(y|µ)
with respect to the posterior distribution (→ I/5.1.7) p(µ|y). With the log-likelihood function for
the univariate Gaussian with known variance (→ III/1.2.2) and the posterior distribution for the
univariate Gaussian with known variance (→ III/1.2.7), the model accuracy of m evaluates to:
The complexity penalty is the Kullback-Leibler divergence (→ I/2.5.1) of the posterior distribution
(→ I/5.1.7) p(µ|y) from the prior distribution (→ I/5.1.3) p(µ). With the prior distribution (→
III/1.2.6) given by (2), the posterior distribution for the univariate Gaussian with known variance
(→ III/1.2.7) and the Kullback-Leibler divergence of the normal distribution (→ II/3.2.24), the
model complexity of m evaluates to:
m_0: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0
m_1: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}(\mu_0, \lambda_0^{-1}) . \quad (2)
Proof: The log Bayes factor is equal to the difference of two log model evidences (→ IV/3.3.8):
\mu_n = \frac{\lambda_0 \mu_0 + \tau n \bar{y}}{\lambda_0 + \tau n}
\lambda_n = \lambda_0 + \tau n \quad (8)
with the sample mean (→ I/1.10.2) ȳ and the inverse variance or precision (→ I/1.11.12) τ = 1/σ 2 .
■
m_0: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0
m_1: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}(\mu_0, \lambda_0^{-1}) . \quad (2)
Then, under the null hypothesis (→ I/4.3.2) that m0 generated the data, the expectation (→ I/1.10.1)
of the log Bayes factor (→ IV/3.3.6) in favor of m1 with µ0 = 0 against m0 is
\langle \mathrm{LBF}_{10} \rangle = \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} + \frac{1}{2} \frac{\lambda_n - \lambda_0}{\lambda_n} \quad (3)
where λn is the posterior precision for the univariate Gaussian with known variance (→ III/1.2.7).
Proof: The log Bayes factor for the univariate Gaussian with known variance (→ III/1.2.10) is
\mathrm{LBF}_{10} = \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} - \frac{1}{2} \left( \lambda_0 \mu_0^2 - \lambda_n \mu_n^2 \right) \quad (4)
where the posterior hyperparameters (→ I/5.1.7) are given by (→ III/1.2.7)
\mu_n = \frac{\lambda_0 \mu_0 + \tau n \bar{y}}{\lambda_0 + \tau n}
\lambda_n = \lambda_0 + \tau n \quad (5)
with the sample mean (→ I/1.10.2) ȳ and the inverse variance or precision (→ I/1.11.12) τ = 1/σ 2 .
Plugging µn from (5) into (4), we obtain:
1 λ0 1 (λ0 µ0 + τ nȳ)2
LBF10 = log − λ0 µ0 − λn
2
2 λn 2 λ2n
(6)
1 λ0 1 1 2 2
= log − λ0 µ0 − (λ0 µ0 − 2τ nλ0 µ0 ȳ + τ (nȳ) )
2 2 2
2 λn 2 λn
Because m1 uses a zero-mean prior distribution (→ I/5.1.3) with prior mean (→ I/1.10.1) µ0 = 0
per construction, the log Bayes factor simplifies to:
\mathrm{LBF}_{10} = \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} + \frac{1}{2} \frac{\tau^2 (n\bar{y})^2}{\lambda_n} . \quad (7)
From (1), we know that the data are distributed as yi ∼ N (µ, σ 2 ), such that we can derive the
expectation (→ I/1.10.1) of (nȳ)2 as follows:
\left\langle (n\bar{y})^2 \right\rangle = \left\langle \sum_{i=1}^n \sum_{j=1}^n y_i y_j \right\rangle = n \left\langle y_i^2 \right\rangle + (n^2 - n) \left\langle y_i y_j \right\rangle_{i \neq j}
= n (\mu^2 + \sigma^2) + (n^2 - n) \mu^2 \quad (8)
= n^2 \mu^2 + n \sigma^2 .
Applying this expected value (→ I/1.10.1) to (7), the expected LBF emerges as:
\langle \mathrm{LBF}_{10} \rangle = \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} + \frac{1}{2} \frac{\tau^2 (n^2 \mu^2 + n \sigma^2)}{\lambda_n}
= \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} + \frac{1}{2} \frac{(\tau n \mu)^2 + \tau n}{\lambda_n} . \quad (9)
Under the null hypothesis (→ I/4.3.2) that m0 generated the data, the unknown mean is µ = 0, such
that the log Bayes factor further simplifies to:
\langle \mathrm{LBF}_{10} \rangle = \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} + \frac{1}{2} \frac{\tau n}{\lambda_n} . \quad (10)
Finally, plugging λn from (5) into (10), we obtain:
\langle \mathrm{LBF}_{10} \rangle = \frac{1}{2} \log\frac{\lambda_0}{\lambda_n} + \frac{1}{2} \frac{\lambda_n - \lambda_0}{\lambda_n} . \quad (11)
■
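The expected log Bayes factor (11) can be checked by Monte Carlo simulation under m₀; n, σ, λ₀ and the number of replications are illustration choices.

```python
# Monte Carlo average of LBF10 (eq. 7 with mu0 = 0) under H0 (mu = 0),
# compared against the closed-form expectation (11).
import math, random

random.seed(11)
n, sigma, lam0 = 10, 1.0, 1.0
tau = 1.0 / sigma**2
lam_n = lam0 + tau * n
expected = 0.5*math.log(lam0/lam_n) + 0.5*(lam_n - lam0)/lam_n   # eq. (11)

R = 100000
total = 0.0
for _ in range(R):
    # under H0, s = n*ybar is normal with mean 0 and variance n*sigma^2
    s = random.gauss(0.0, math.sqrt(n) * sigma)
    total += 0.5*math.log(lam0/lam_n) + 0.5*tau**2*s*s/lam_n     # eq. (7)
mc = total / R
```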
m_0: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0
m_1: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}(\mu_0, \lambda_0^{-1}) . \quad (2)
\mathrm{cvLME}(m_0) = \frac{n}{2} \log\frac{\tau}{2\pi} - \frac{1}{2} \tau \, y^\mathrm{T} y
\mathrm{cvLME}(m_1) = \frac{n}{2} \log\frac{\tau}{2\pi} + \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \left[ y^\mathrm{T} y + \sum_{i=1}^S \left( \frac{\left( n_1 \bar{y}_1^{(i)} \right)^2}{n_1} - \frac{(n\bar{y})^2}{n} \right) \right] \quad (3)
where \bar{y} is the sample mean (→ I/1.10.2), \tau = 1/\sigma^2 is the inverse variance or precision (→ I/1.11.12), y_1^{(i)} are the training data in the i-th cross-validation fold and S is the number of data subsets (→ IV/3.1.9).
Proof: For evaluation of the cross-validated log model evidences (→ IV/3.1.9) (cvLME), we assume that the n data points are divided into S data subsets without remainder, i.e. such that S divides n. Then, the numbers of training data points n_1 and test data points n_2 are given by
n = n_1 + n_2
n_1 = \frac{S-1}{S} \, n \quad (4)
n_2 = \frac{1}{S} \, n ,
such that training data y1 and test data y2 in the i-th cross-validation fold are
y = \{ y_1, \ldots, y_n \}
y_1^{(i)} = \{ x \in y \mid x \notin y_2^{(i)} \} = y \setminus y_2^{(i)} \quad (5)
y_2^{(i)} = \{ y_{(i-1) \cdot n_2 + 1}, \ldots, y_{i \cdot n_2} \} .
First, we consider the null model m0 assuming µ = 0. Because this model has no free parameter,
nothing is estimated from the training data and the assumed parameter value is applied to the test
data. Consequently, the out-of-sample log model evidence (oosLME) is equal to the log-likelihood
function (→ III/1.2.2) of the test data at µ = 0:
\mathrm{oosLME}_i(m_0) = \log p\!\left( y_2^{(i)} \,\middle|\, \mu = 0 \right) = \frac{n_2}{2} \log\frac{\tau}{2\pi} - \frac{1}{2} \tau \left[ y_2^{(i)\mathrm{T}} y_2^{(i)} \right] . \quad (6)
By definition, the cross-validated log model evidence is the sum of out-of-sample log model evidences
(→ IV/3.1.9) over cross-validation folds, such that the cvLME of m0 is:
\mathrm{cvLME}(m_0) = \sum_{i=1}^S \mathrm{oosLME}_i(m_0)
= \sum_{i=1}^S \left( \frac{n_2}{2} \log\frac{\tau}{2\pi} - \frac{1}{2} \tau \left[ y_2^{(i)\mathrm{T}} y_2^{(i)} \right] \right) \quad (7)
= \frac{n}{2} \log\frac{\tau}{2\pi} - \frac{1}{2} \tau \, y^\mathrm{T} y .
Next, we have a look at the alternative m_1 assuming \mu \neq 0. First, the training data y_1^{(i)} are analyzed using a non-informative prior distribution (→ I/5.2.3) and applying the posterior distribution for the univariate Gaussian with known variance (→ III/1.2.7):
1. UNIVARIATE NORMAL DATA 365
\mu_0^{(1)} = 0
\lambda_0^{(1)} = 0
\mu_n^{(1)} = \frac{\tau n_1 \bar{y}_1^{(i)} + \lambda_0^{(1)} \mu_0^{(1)}}{\tau n_1 + \lambda_0^{(1)}} = \bar{y}_1^{(i)} \quad (8)
\lambda_n^{(1)} = \tau n_1 + \lambda_0^{(1)} = \tau n_1 .
This results in a posterior characterized by \mu_n^{(1)} and \lambda_n^{(1)}. Then, the test data y_2^{(i)} are analyzed using this posterior as an informative prior distribution (→ I/5.2.3), again applying the posterior distribution for the univariate Gaussian with known variance (→ III/1.2.7):
\mu_0^{(2)} = \mu_n^{(1)} = \bar{y}_1^{(i)}
\lambda_0^{(2)} = \lambda_n^{(1)} = \tau n_1
\mu_n^{(2)} = \frac{\tau n_2 \bar{y}_2^{(i)} + \lambda_0^{(2)} \mu_0^{(2)}}{\tau n_2 + \lambda_0^{(2)}} = \bar{y} \quad (9)
\lambda_n^{(2)} = \tau n_2 + \lambda_0^{(2)} = \tau n .
In the test data, we now have a prior characterized by \mu_0^{(2)}/\lambda_0^{(2)} and a posterior characterized by \mu_n^{(2)}/\lambda_n^{(2)}. Applying the log model evidence for the univariate Gaussian with known variance (→ III/1.2.8), the out-of-sample log model evidence (oosLME) therefore follows as
\mathrm{oosLME}_i(m_1) = \frac{n_2}{2} \log\frac{\tau}{2\pi} + \frac{1}{2} \log\frac{\lambda_0^{(2)}}{\lambda_n^{(2)}} - \frac{1}{2} \left[ \tau \, y_2^{(i)\mathrm{T}} y_2^{(i)} + \lambda_0^{(2)} \left( \mu_0^{(2)} \right)^2 - \lambda_n^{(2)} \left( \mu_n^{(2)} \right)^2 \right]
= \frac{n_2}{2} \log\frac{\tau}{2\pi} + \frac{1}{2} \log\frac{n_1}{n} - \frac{\tau}{2} \left[ y_2^{(i)\mathrm{T}} y_2^{(i)} + \frac{\left( n_1 \bar{y}_1^{(i)} \right)^2}{n_1} - \frac{(n\bar{y})^2}{n} \right] . \quad (10)
Again, because the cross-validated log model evidence is the sum of out-of-sample log model evidences
(→ IV/3.1.9) over cross-validation folds, the cvLME of m1 becomes:
\mathrm{cvLME}(m_1) = \sum_{i=1}^S \mathrm{oosLME}_i(m_1)
= \sum_{i=1}^S \left( \frac{n_2}{2} \log\frac{\tau}{2\pi} + \frac{1}{2} \log\frac{n_1}{n} - \frac{\tau}{2} \left[ y_2^{(i)\mathrm{T}} y_2^{(i)} + \frac{\left( n_1 \bar{y}_1^{(i)} \right)^2}{n_1} - \frac{(n\bar{y})^2}{n} \right] \right)
= \frac{S \cdot n_2}{2} \log\frac{\tau}{2\pi} + \frac{S}{2} \log\frac{n_1}{n} - \frac{\tau}{2} \left[ \sum_{i=1}^S y_2^{(i)\mathrm{T}} y_2^{(i)} + \sum_{i=1}^S \left( \frac{\left( n_1 \bar{y}_1^{(i)} \right)^2}{n_1} - \frac{(n\bar{y})^2}{n} \right) \right] \quad (11)
= \frac{n}{2} \log\frac{\tau}{2\pi} + \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \left[ y^\mathrm{T} y + \sum_{i=1}^S \left( \frac{\left( n_1 \bar{y}_1^{(i)} \right)^2}{n_1} - \frac{(n\bar{y})^2}{n} \right) \right] .
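The fold-wise derivation above can be verified numerically: computing each oosLME via the two-step posterior updates (8)-(9) and the known-variance log model evidence, and comparing the sum against the closed form in (11). S, n₂, σ and the data are illustration choices.

```python
# cvLME(m1): fold-wise Bayesian updating vs. the closed-form expression (11).
import math, random

random.seed(8)
S, n2 = 4, 10
n = S * n2
n1 = n - n2
sigma = 1.0
tau = 1.0 / sigma**2
y = [random.gauss(0.3, sigma) for _ in range(n)]
ybar = sum(y) / n
yty = sum(v*v for v in y)

cv = 0.0
for i in range(S):
    y2 = y[i*n2:(i+1)*n2]
    y1 = y[:i*n2] + y[(i+1)*n2:]
    mu0f, lam0f = sum(y1)/n1, tau*n1             # training posterior, eq. (8)
    lamnf = lam0f + tau*n2                       # test posterior, eq. (9)
    munf = (lam0f*mu0f + tau*n2*(sum(y2)/n2)) / lamnf
    # generic known-variance log model evidence applied to the test fold
    cv += (n2/2*math.log(tau/(2*math.pi)) + 0.5*math.log(lam0f/lamnf)
           - 0.5*(tau*sum(v*v for v in y2) + lam0f*mu0f**2 - lamnf*munf**2))

# closed form from (11)
acc = sum((n1*(sum(y[:i*n2] + y[(i+1)*n2:])/n1))**2/n1 - (n*ybar)**2/n
          for i in range(S))
closed = (n/2*math.log(tau/(2*math.pi)) + S/2*math.log((S-1)/S)
          - tau/2*(yty + acc))
```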
m_0: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0
m_1: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}(\mu_0, \lambda_0^{-1}) . \quad (2)
Then, the cross-validated (→ IV/3.1.9) log Bayes factor (→ IV/3.3.6) in favor of m1 against m0 is
\mathrm{cvLBF}_{10} = \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \sum_{i=1}^S \left( \frac{\left( n_1 \bar{y}_1^{(i)} \right)^2}{n_1} - \frac{(n\bar{y})^2}{n} \right) \quad (3)
where \bar{y} is the sample mean (→ I/1.10.2), \tau = 1/\sigma^2 is the inverse variance or precision (→ I/1.11.12), y_1^{(i)} are the training data in the i-th cross-validation fold and S is the number of data subsets (→ IV/3.1.9).
Proof: The relationship between log Bayes factor and log model evidences (→ IV/3.3.8) also holds for the cross-validated log Bayes factor (→ IV/3.3.6) (cvLBF) and the cross-validated log model evidences (→ IV/3.1.9) (cvLME):
\mathrm{cvLME}(m_0) = \frac{n}{2} \log\frac{\tau}{2\pi} - \frac{1}{2} \tau \, y^\mathrm{T} y
\mathrm{cvLME}(m_1) = \frac{n}{2} \log\frac{\tau}{2\pi} + \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \left[ y^\mathrm{T} y + \sum_{i=1}^S \left( \frac{\left( n_1 \bar{y}_1^{(i)} \right)^2}{n_1} - \frac{(n\bar{y})^2}{n} \right) \right] . \quad (5)
Subtracting the two cvLMEs from each other, the cvLBF emerges as
m_0: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu = 0
m_1: y_i \sim \mathcal{N}(\mu, \sigma^2), \quad \mu \sim \mathcal{N}(\mu_0, \lambda_0^{-1}) . \quad (2)
Then, the expectation (→ I/1.10.1) of the cross-validated (→ IV/3.1.9) log Bayes factor (→ IV/3.3.6)
(cvLBF) in favor of m1 against m0 is
\langle \mathrm{cvLBF}_{10} \rangle = \frac{S}{2} \log\frac{S-1}{S} + \frac{1}{2} \tau n \mu^2 \quad (3)
where τ = 1/σ 2 is the inverse variance or precision (→ I/1.11.12) and S is the number of data subsets
(→ IV/3.1.9).
Proof: The cross-validated log Bayes factor for the univariate Gaussian with known variance (→
III/1.2.13) is
\mathrm{cvLBF}_{10} = \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \sum_{i=1}^S \left( \frac{\left( n_1 \bar{y}_1^{(i)} \right)^2}{n_1} - \frac{(n\bar{y})^2}{n} \right) . \quad (4)
From (1), we know that the data are distributed as y_i \sim \mathcal{N}(\mu, \sigma^2), such that we can derive the expectations (→ I/1.10.1) of (n\bar{y})^2 and \left( n_1 \bar{y}_1^{(i)} \right)^2 as follows:
\left\langle (n\bar{y})^2 \right\rangle = \left\langle \sum_{i=1}^n \sum_{j=1}^n y_i y_j \right\rangle = n \left\langle y_i^2 \right\rangle + (n^2 - n) \left\langle y_i y_j \right\rangle_{i \neq j} = n (\mu^2 + \sigma^2) + (n^2 - n) \mu^2 = n^2 \mu^2 + n \sigma^2 , \quad (5)

and analogously \left\langle \left( n_1 \bar{y}_1^{(i)} \right)^2 \right\rangle = n_1^2 \mu^2 + n_1 \sigma^2.
Applying this expected value (→ I/1.10.1) to (4), the expected cvLBF emerges as:
368 CHAPTER III. STATISTICAL MODELS
\langle \mathrm{cvLBF}_{10} \rangle = \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \sum_{i=1}^S \left( \frac{\left\langle \left( n_1 \bar{y}_1^{(i)} \right)^2 \right\rangle}{n_1} - \frac{\left\langle (n\bar{y})^2 \right\rangle}{n} \right)
\overset{(5)}{=} \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \sum_{i=1}^S \left( \frac{n_1^2 \mu^2 + n_1 \sigma^2}{n_1} - \frac{n^2 \mu^2 + n \sigma^2}{n} \right) \quad (6)
= \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \sum_{i=1}^S \left[ (n_1 \mu^2 + \sigma^2) - (n \mu^2 + \sigma^2) \right]
= \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \sum_{i=1}^S (n_1 - n) \mu^2 .

Since n_1 - n = -n_2 and \sum_{i=1}^S n_2 = n, this finally becomes

\langle \mathrm{cvLBF}_{10} \rangle = \frac{S}{2} \log\frac{S-1}{S} - \frac{\tau}{2} \sum_{i=1}^S (-n_2) \mu^2 = \frac{S}{2} \log\frac{S-1}{S} + \frac{1}{2} \tau n \mu^2 . \quad (7)
■
y_{ij} = \mu_i + \varepsilon_{ij}
\varepsilon_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \quad (2)

where \varepsilon_{ij} is the error term (→ III/1.4.1) belonging to observation j in category i and the \varepsilon_{ij} are independent and identically distributed.
Sources:
• Bortz, Jürgen (1977): “Einfaktorielle Varianzanalyse”; in: Lehrbuch der Statistik. Für Sozialwis-
senschaftler, ch. 12.1, pp. 528ff.; URL: https://books.google.de/books?id=lNCyBgAAQBAJ.
• Denziloe (2018): “Derive the distribution of the ANOVA F-statistic under the alternative hypothe-
sis”; in: StackExchange CrossValidated, retrieved on 2022-11-06; URL: https://stats.stackexchange.
com/questions/355594/derive-the-distribution-of-the-anova-f-statistic-under-the-alternative-hypothesi.
\mathrm{SS}_\mathrm{treat} = \sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{y}_i - \bar{y})^2 . \quad (2)
Here, \bar{y}_i is the mean for the i-th level of the factor (out of k levels), computed from the n_i values y_{ij}, and \bar{y} is the mean across all values y_{ij}.
Sources:
• Wikipedia (2022): “Analysis of variance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-11-
15; URL: https://en.wikipedia.org/wiki/Analysis_of_variance#Partitioning_of_the_sum_of_
squares.
\hat{\mu}_i = \bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} . \quad (3)
\mathrm{RSS}(\mu) = \sum_{i=1}^k \sum_{j=1}^{n_i} \varepsilon_{ij}^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \mu_i)^2 \quad (4)
\frac{\mathrm{dRSS}(\mu)}{\mathrm{d}\mu_i} = \sum_{j=1}^{n_i} \frac{\mathrm{d}}{\mathrm{d}\mu_i} (y_{ij} - \mu_i)^2
= \sum_{j=1}^{n_i} 2 (y_{ij} - \mu_i)(-1)
= 2 \sum_{j=1}^{n_i} (\mu_i - y_{ij}) \quad (5)
= 2 n_i \mu_i - 2 \sum_{j=1}^{n_i} y_{ij} \quad \text{for} \quad i = 1, \ldots, k .
0 = 2 n_i \hat{\mu}_i - 2 \sum_{j=1}^{n_i} y_{ij}
\hat{\mu}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} \quad \text{for} \quad i = 1, \ldots, k . \quad (6)
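A quick numerical illustration (group sizes, means and data are arbitrary choices): the category means from (6) should minimize the residual sum of squares (4), so any perturbation of an estimate must increase the RSS.

```python
# Check that the category means minimize RSS for one-way ANOVA.
import random

random.seed(2)
groups = [[random.gauss(m, 1.0) for _ in range(nl)]
          for m, nl in [(0.0, 8), (1.0, 12), (0.5, 10)]]

def rss(mus):
    # residual sum of squares from (4)
    return sum((v - m)**2 for g, m in zip(groups, mus) for v in g)

mu_hat = [sum(g)/len(g) for g in groups]   # estimates from (6)
best = rss(mu_hat)

# perturbing any single estimate by +/- 0.1 strictly increases the RSS
worse = min(rss([m + e if i == j else m for j, m in enumerate(mu_hat)])
            for i in range(len(groups)) for e in (-0.1, 0.1))
```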
Proof: The total sum of squares (→ III/1.5.6) for one-way ANOVA (→ III/1.3.1) is given by
\mathrm{SS}_\mathrm{tot} = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \quad (3)
where ȳ is the mean across all values yij . This can be rewritten as
\sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} \left[ (y_{ij} - \bar{y}_i) + (\bar{y}_i - \bar{y}) \right]^2
= \sum_{i=1}^k \sum_{j=1}^{n_i} \left[ (y_{ij} - \bar{y}_i)^2 + (\bar{y}_i - \bar{y})^2 + 2 (y_{ij} - \bar{y}_i)(\bar{y}_i - \bar{y}) \right] \quad (4)
= \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{y}_i - \bar{y})^2 + 2 \sum_{i=1}^k (\bar{y}_i - \bar{y}) \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i) .
Because \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i) = 0 for each i, the final term vanishes and we have

\sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{y}_i - \bar{y})^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 . \quad (6)
With the treatment sum of squares (→ III/1.3.2) for one-way ANOVA (→ III/1.3.1)
\mathrm{SS}_\mathrm{treat} = \sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{y}_i - \bar{y})^2 \quad (7)
and the residual sum of squares (→ III/1.5.8) for one-way ANOVA (→ III/1.3.1)
\mathrm{SS}_\mathrm{res} = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 , \quad (8)
we finally have \mathrm{SS}_\mathrm{tot} = \mathrm{SS}_\mathrm{treat} + \mathrm{SS}_\mathrm{res}.
Sources:
• Wikipedia (2022): “Analysis of variance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-11-
15; URL: https://en.wikipedia.org/wiki/Analysis_of_variance#Partitioning_of_the_sum_of_
squares.
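The partition of the total sum of squares is easy to confirm numerically; group means, sizes and data below are arbitrary illustration choices.

```python
# Verify SS_tot = SS_treat + SS_res for one-way ANOVA.
import random

random.seed(2)
groups = [[random.gauss(m, 1.0) for _ in range(nl)]
          for m, nl in [(0.0, 8), (1.0, 12), (0.5, 10)]]
all_y = [v for g in groups for v in g]
n = len(all_y)
gbar = sum(all_y) / n                                   # grand mean

ss_tot = sum((v - gbar)**2 for v in all_y)
ss_treat = sum(len(g) * ((sum(g)/len(g)) - gbar)**2 for g in groups)
ss_res = sum((v - sum(g)/len(g))**2 for g in groups for v in g)
```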
F = \frac{\frac{1}{k-1} \sum_{i=1}^k n_i (\bar{y}_i - \bar{y})^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2} \quad (2)
F ∼ F(k − 1, n − k) (3)
under the null hypothesis (→ I/4.3.2)
H_0: \mu_1 = \ldots = \mu_k
H_1: \mu_i \neq \mu_j \quad \text{for at least one} \quad i, j \in \{1, \ldots, k\}, \; i \neq j . \quad (4)
\bar{y}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}
\bar{y} = \frac{1}{n} \sum_{i=1}^k \sum_{j=1}^{n_i} y_{ij} . \quad (6)
Let µ be the common mean (→ I/1.10.1) according to H0 given by (4), i.e. µ1 = . . . = µk = µ. Under
this null hypothesis, we have:
\sum_{i=1}^k \sum_{j=1}^{n_i} U_{ij}^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} \left( \frac{y_{ij} - \mu}{\sigma} \right)^2
= \frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} \left[ (y_{ij} - \bar{y}_i) + (\bar{y}_i - \bar{y}) + (\bar{y} - \mu) \right]^2 \quad (9)
= \frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} \left[ (y_{ij} - \bar{y}_i)^2 + (\bar{y}_i - \bar{y})^2 + (\bar{y} - \mu)^2 + 2 (y_{ij} - \bar{y}_i)(\bar{y}_i - \bar{y}) + 2 (y_{ij} - \bar{y}_i)(\bar{y} - \mu) + 2 (\bar{y}_i - \bar{y})(\bar{y} - \mu) \right] .
The mixed terms vanish, because

\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i) = \sum_{j=1}^{n_i} y_{ij} - n_i \bar{y}_i = \sum_{j=1}^{n_i} y_{ij} - n_i \cdot \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} = 0, \quad i = 1, \ldots, k \quad (10)
\sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{y}_i - \bar{y}) = \sum_{i=1}^k n_i (\bar{y}_i - \bar{y}) = \sum_{i=1}^k n_i \bar{y}_i - \bar{y} \sum_{i=1}^k n_i = \sum_{i=1}^k n_i \cdot \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} - n \cdot \frac{1}{n} \sum_{i=1}^k \sum_{j=1}^{n_i} y_{ij} = 0 , \quad (11)
such that

\sum_{i=1}^k \sum_{j=1}^{n_i} U_{ij}^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} \left[ \left( \frac{y_{ij} - \bar{y}_i}{\sigma} \right)^2 + \left( \frac{\bar{y}_i - \bar{y}}{\sigma} \right)^2 + \left( \frac{\bar{y} - \mu}{\sigma} \right)^2 \right]
= \sum_{i=1}^k \sum_{j=1}^{n_i} \left( \frac{y_{ij} - \bar{y}_i}{\sigma} \right)^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} \left( \frac{\bar{y}_i - \bar{y}}{\sigma} \right)^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} \left( \frac{\bar{y} - \mu}{\sigma} \right)^2 . \quad (12)
Cochran’s theorem states that, if a sum of squared standard normal (→ II/3.2.3) random variables
(→ I/1.2.2) can be written as a sum of squared forms
\sum_{i=1}^n U_i^2 = \sum_{j=1}^m Q_j \quad \text{where} \quad Q_j = U^\mathrm{T} B^{(j)} U \quad \text{with} \quad \sum_{j=1}^m B^{(j)} = I_n , \quad (13)
then the terms Qj are independent (→ I/1.3.6) and each term Qj follows a chi-squared distribution
(→ II/3.7.1) with rj degrees of freedom:
Qj ∼ χ2 (rj ), j = 1, . . . , m . (14)
Let U be the n × 1 column vector of all observations
374 CHAPTER III. STATISTICAL MODELS
U = \begin{bmatrix} u_1 \\ \vdots \\ u_k \end{bmatrix} \quad (15)

where the group-wise n_i \times 1 column vectors are

u_1 = \begin{bmatrix} (y_{1,1} - \mu)/\sigma \\ \vdots \\ (y_{1,n_1} - \mu)/\sigma \end{bmatrix}, \quad \ldots, \quad u_k = \begin{bmatrix} (y_{k,1} - \mu)/\sigma \\ \vdots \\ (y_{k,n_k} - \mu)/\sigma \end{bmatrix} . \quad (16)
Then, we observe that the sum in (12) can be represented in the form of (13) using the matrices
B^{(1)} = I_n - \mathrm{diag}\!\left( \frac{1}{n_1} J_{n_1}, \ldots, \frac{1}{n_k} J_{n_k} \right)
B^{(2)} = \mathrm{diag}\!\left( \frac{1}{n_1} J_{n_1}, \ldots, \frac{1}{n_k} J_{n_k} \right) - \frac{1}{n} J_n \quad (17)
B^{(3)} = \frac{1}{n} J_n
where Jn is an n × n matrix of ones and diag (A1 , . . . , An ) denotes a block-diagonal matrix composed
of A1 , . . . , An . We observe that those matrices satisfy
\sum_{i=1}^k \sum_{j=1}^{n_i} U_{ij}^2 = Q_1 + Q_2 + Q_3 = U^\mathrm{T} B^{(1)} U + U^\mathrm{T} B^{(2)} U + U^\mathrm{T} B^{(3)} U \quad (18)
as well as
\mathrm{rank}\!\left( B^{(1)} \right) = n - k
\mathrm{rank}\!\left( B^{(2)} \right) = k - 1 \quad (20)
\mathrm{rank}\!\left( B^{(3)} \right) = 1 .
Let’s write down the explained sum of squares (→ III/1.5.7) and the residual sum of squares (→
III/1.5.8) for one-way analysis of variance (→ III/1.3.1) as
\mathrm{ESS} = \sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{y}_i - \bar{y})^2
\mathrm{RSS} = \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 . \quad (21)
Then, using (12), (13), (14), (17) and (20), we find that
\frac{\mathrm{ESS}}{\sigma^2} = \sum_{i=1}^k \sum_{j=1}^{n_i} \left( \frac{\bar{y}_i - \bar{y}}{\sigma} \right)^2 = Q_2 = U^\mathrm{T} B^{(2)} U \sim \chi^2(k-1)
\frac{\mathrm{RSS}}{\sigma^2} = \sum_{i=1}^k \sum_{j=1}^{n_i} \left( \frac{y_{ij} - \bar{y}_i}{\sigma} \right)^2 = Q_1 = U^\mathrm{T} B^{(1)} U \sim \chi^2(n-k) . \quad (22)
Because ESS/σ 2 and RSS/σ 2 are also independent by (14), the F-statistic from (2) is equal to the
ratio of two independent chi-squared distributed (→ II/3.7.1) random variables (→ I/1.2.2) divided
by their degrees of freedom
F = \frac{(\mathrm{ESS}/\sigma^2)/(k-1)}{(\mathrm{RSS}/\sigma^2)/(n-k)}
= \frac{\mathrm{ESS}/(k-1)}{\mathrm{RSS}/(n-k)}
= \frac{\frac{1}{k-1} \sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{y}_i - \bar{y})^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2} \quad (23)
= \frac{\frac{1}{k-1} \sum_{i=1}^k n_i (\bar{y}_i - \bar{y})^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2} ,
F ∼ F(k − 1, n − k) (24)
under the null hypothesis (→ I/4.3.2) for the main effect.
Sources:
• Denziloe (2018): “Derive the distribution of the ANOVA F-statistic under the alternative hypothe-
sis”; in: StackExchange CrossValidated, retrieved on 2022-11-06; URL: https://stats.stackexchange.
com/questions/355594/derive-the-distribution-of-the-anova-f-statistic-under-the-alternative-hypothesi.
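The equality of the last two lines of (23) — the inner sum over j collapsing to a factor nᵢ — can be checked numerically; the group sizes and data are illustration choices.

```python
# F-statistic computed via both forms in (23).
import random

random.seed(4)
k = 3
groups = [[random.gauss(0.0, 1.0) for _ in range(nl)] for nl in (10, 15, 12)]
n = sum(len(g) for g in groups)
gbar = sum(v for g in groups for v in g) / n
means = [sum(g)/len(g) for g in groups]

ess = sum(len(g) * (m - gbar)**2 for g, m in zip(groups, means))
rss = sum((v - m)**2 for g, m in zip(groups, means) for v in g)
F = (ess / (k - 1)) / (rss / (n - k))

# same statistic with the inner sum over j written out explicitly
ess2 = sum((m - gbar)**2 for g, m in zip(groups, means) for _ in g)
F2 = (ess2 / (k - 1)) / (rss / (n - k))
```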
2) or, when using the reparametrized version of one-way ANOVA (→ III/1.3.7), the F-statistic can
be expressed as
F = \frac{\frac{1}{k-1} \sum_{i=1}^k n_i \hat{\delta}_i^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \hat{\mu} - \hat{\delta}_i)^2} . \quad (3)
Proof: The F-statistic for the main effect in one-way ANOVA (→ III/1.3.5) is given in terms of the
sample means (→ I/1.10.2) as
F = \frac{\frac{1}{k-1} \sum_{i=1}^k n_i (\bar{y}_i - \bar{y})^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2} \quad (4)
where \bar{y}_i is the average of all values y_{ij} from category i and \bar{y} is the grand mean of all values y_{ij} from all categories i = 1, \ldots, k.
1) The ordinary least squares estimates for one-way ANOVA (→ III/1.3.3) are
\hat{\mu} = \bar{y}
\hat{\delta}_i = \bar{y}_i - \bar{y} , \quad (7)
such that
F = \frac{\frac{1}{k-1} \sum_{i=1}^k n_i \hat{\delta}_i^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \hat{\mu} - \hat{\delta}_i)^2} . \quad (8)
\sum_{i=1}^k \frac{n_i}{n} \, \delta_i = 0 , \quad (3)
in which case
1) the model parameters are related to each other as
δi = µi − µ, i = 1, . . . , k ; (4)
2) the ordinary least squares estimates (→ III/1.3.3) are given by
\hat{\delta}_i = \bar{y}_i - \bar{y} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} - \frac{1}{n} \sum_{i=1}^k \sum_{j=1}^{n_i} y_{ij} ; \quad (5)
3) the following sum of squares (→ III/1.3.4) is chi-square distributed (→ II/3.7.1)
\frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} \left( \hat{\delta}_i - \delta_i \right)^2 \sim \chi^2(k-1) ; \quad (6)
4) and the following test statistic (→ I/4.3.5) is F-distributed (→ II/3.8.1)
F = \frac{\frac{1}{k-1} \sum_{i=1}^k n_i \hat{\delta}_i^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2} \sim \mathrm{F}(k-1, n-k) \quad (7)
under the null hypothesis for the main effect (→ III/1.3.5)
H0 : δ1 = . . . = δk = 0 . (8)
Proof:
1) Equating (1) with (2), we get:
\hat{\mu} = \bar{y}_{\bullet\bullet} = \frac{1}{n} \sum_{i=1}^k \sum_{j=1}^{n_i} y_{ij}
\hat{\delta}_i = \bar{y}_{i\bullet} - \bar{y}_{\bullet\bullet} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij} - \frac{1}{n} \sum_{i=1}^k \sum_{j=1}^{n_i} y_{ij} . \quad (10)
3) Let Uij = (yij − µ − δi )/σ, such that (→ II/3.2.4) Uij ∼ N (0, 1) and consider the sum of all
squared random variables (→ I/1.2.2) Uij :
\sum_{i=1}^k \sum_{j=1}^{n_i} U_{ij}^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} \left( \frac{y_{ij} - \mu - \delta_i}{\sigma} \right)^2
= \frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} \left[ (y_{ij} - \bar{y}_i) + ([\bar{y}_i - \bar{y}] - \delta_i) + (\bar{y} - \mu) \right]^2 . \quad (11)
This square of sums, using a number of intermediate steps, can be developed (→ III/1.3.5) into a
sum of squares:
\sum_{i=1}^k \sum_{j=1}^{n_i} U_{ij}^2 = \frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} \left[ (y_{ij} - \bar{y}_i)^2 + ([\bar{y}_i - \bar{y}] - \delta_i)^2 + (\bar{y} - \mu)^2 \right]
= \frac{1}{\sigma^2} \left[ \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} ([\bar{y}_i - \bar{y}] - \delta_i)^2 + \sum_{i=1}^k \sum_{j=1}^{n_i} (\bar{y} - \mu)^2 \right] . \quad (12)
To this sum, Cochran’s theorem for one-way analysis of variance can be applied (→ III/1.3.5), yielding
the distributions:
\frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \sim \chi^2(n-k)
\frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} ([\bar{y}_i - \bar{y}] - \delta_i)^2 \overset{(10)}{=} \frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} (\hat{\delta}_i - \delta_i)^2 \sim \chi^2(k-1) . \quad (13)
4) The ratio of two chi-square distributed (→ II/3.7.1) random variables (→ I/1.2.2), divided by
their degrees of freedom, is defined to be F-distributed (→ II/3.8.1), so that
F = \frac{\left[ \frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} (\hat{\delta}_i - \delta_i)^2 \right] / (k-1)}{\left[ \frac{1}{\sigma^2} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \right] / (n-k)}
= \frac{\frac{1}{k-1} \sum_{i=1}^k \sum_{j=1}^{n_i} (\hat{\delta}_i - \delta_i)^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}
= \frac{\frac{1}{k-1} \sum_{i=1}^k n_i (\hat{\delta}_i - \delta_i)^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2} \quad (14)
\overset{(8)}{=} \frac{\frac{1}{k-1} \sum_{i=1}^k n_i \hat{\delta}_i^2}{\frac{1}{n-k} \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2} ,
F ∼ F(k − 1, n − k) (15)
under the null hypothesis.
Sources:
• Wikipedia (2022): “Analysis of variance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
11-15; URL: https://en.wikipedia.org/wiki/Analysis_of_variance#For_a_single_factor.
\sum_{i=1}^a w_{ij} \alpha_i = 0 \quad \text{for all} \quad j = 1, \ldots, b
\sum_{j=1}^b w_{ij} \beta_j = 0 \quad \text{for all} \quad i = 1, \ldots, a
\sum_{i=1}^a w_{ij} \gamma_{ij} = 0 \quad \text{for all} \quad j = 1, \ldots, b \quad (4)
\sum_{j=1}^b w_{ij} \gamma_{ij} = 0 \quad \text{for all} \quad i = 1, \ldots, a

where the weights are w_{ij} = n_{ij}/n and the total sample size is n = \sum_{i=1}^a \sum_{j=1}^b n_{ij}.
Sources:
• Bortz, Jürgen (1977): “Zwei- und mehrfaktorielle Varianzanalyse”; in: Lehrbuch der Statistik. Für
Sozialwissenschaftler, ch. 12.2, pp. 538ff.; URL: https://books.google.de/books?id=lNCyBgAAQBAJ.
• ttd (2021): “Proof on SS_AB/σ² ∼ χ²((I−1)(J−1)) under the null hypothesis H_AB: d_ij = 0 for i = 1, ..., I and j = 1, ..., J”; in: StackExchange CrossValidated, retrieved on 2022-11-06; URL: https://stats.stackexchange.com/questions/545807/proof-on-ss-ab-sigma2-sim-chi2-i-1j-1-under-the-null-hypothesis.
Here, \bar{y}_{ij\bullet} is the mean for the (i,j)-th cell (out of a \times b cells), computed from the n_{ij} values y_{ijk}, \bar{y}_{i\bullet\bullet} and \bar{y}_{\bullet j\bullet} are the level means for the two factors and \bar{y}_{\bullet\bullet\bullet} is the mean across all values y_{ijk}.
Sources:
• Nandy, Siddhartha (2018): “Two-Way Analysis of Variance”; in: Stat 512: Applied Regression
Analysis, Purdue University, Summer 2018, Ch. 19; URL: https://www.stat.purdue.edu/~snandy/
stat512/topic7.pdf.
the parameters minimizing the residual sum of squares (→ III/1.5.8) and satisfying the constraints
for the model parameters (→ III/1.3.8) are given by
\hat{\mu} = \bar{y}_{\bullet\bullet\bullet}
\hat{\alpha}_i = \bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet\bullet\bullet}
\hat{\beta}_j = \bar{y}_{\bullet j\bullet} - \bar{y}_{\bullet\bullet\bullet} \quad (2)
\hat{\gamma}_{ij} = \bar{y}_{ij\bullet} - \bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet j\bullet} + \bar{y}_{\bullet\bullet\bullet}
where ȳ••• , ȳi•• , ȳ•j• and ȳij• are the following sample means (→ I/1.10.2):
\bar{y}_{\bullet\bullet\bullet} = \frac{1}{n} \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} y_{ijk}
\bar{y}_{i\bullet\bullet} = \frac{1}{n_{i\bullet}} \sum_{j=1}^b \sum_{k=1}^{n_{ij}} y_{ijk}
\bar{y}_{\bullet j\bullet} = \frac{1}{n_{\bullet j}} \sum_{i=1}^a \sum_{k=1}^{n_{ij}} y_{ijk} \quad (3)
\bar{y}_{ij\bullet} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} y_{ijk} .
Proof: In two-way ANOVA, model parameters are subject to the constraints (→ III/1.3.8)
\sum_{i=1}^a w_{ij} \alpha_i = 0 \quad \text{for all} \quad j = 1, \ldots, b
\sum_{j=1}^b w_{ij} \beta_j = 0 \quad \text{for all} \quad i = 1, \ldots, a
\sum_{i=1}^a w_{ij} \gamma_{ij} = 0 \quad \text{for all} \quad j = 1, \ldots, b \quad (5)
\sum_{j=1}^b w_{ij} \gamma_{ij} = 0 \quad \text{for all} \quad i = 1, \ldots, a
where wij = nij /n. The residual sum of squares (→ III/1.5.8) for this model is
\mathrm{RSS} = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} \varepsilon_{ijk}^2 = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij})^2 . \quad (6)
The derivatives of the residual sum of squares with respect to the model parameters are

\frac{\mathrm{dRSS}}{\mathrm{d}\mu} = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} \frac{\mathrm{d}}{\mathrm{d}\mu} (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij})^2 = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} \left[ -2 (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij}) \right]
\frac{\mathrm{dRSS}}{\mathrm{d}\alpha_i} = \sum_{j=1}^b \sum_{k=1}^{n_{ij}} \frac{\mathrm{d}}{\mathrm{d}\alpha_i} (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij})^2 = \sum_{j=1}^b \sum_{k=1}^{n_{ij}} \left[ -2 (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij}) \right]
\frac{\mathrm{dRSS}}{\mathrm{d}\beta_j} = \sum_{i=1}^a \sum_{k=1}^{n_{ij}} \frac{\mathrm{d}}{\mathrm{d}\beta_j} (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij})^2 = \sum_{i=1}^a \sum_{k=1}^{n_{ij}} \left[ -2 (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij}) \right] \quad (7)
\frac{\mathrm{dRSS}}{\mathrm{d}\gamma_{ij}} = \sum_{k=1}^{n_{ij}} \frac{\mathrm{d}}{\mathrm{d}\gamma_{ij}} (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij})^2 = \sum_{k=1}^{n_{ij}} \left[ -2 (y_{ijk} - \mu - \alpha_i - \beta_j - \gamma_{ij}) \right] .
Setting the derivative with respect to µ to zero and applying the constraints (5) gives

\hat{\mu} = \frac{1}{n} \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} \left( y_{ijk} - \hat{\alpha}_i - \hat{\beta}_j - \hat{\gamma}_{ij} \right) \overset{(5)}{=} \frac{1}{n} \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} y_{ijk} \overset{(3)}{=} \bar{y}_{\bullet\bullet\bullet} .
Similarly, setting the derivative with respect to \alpha_i to zero yields

\hat{\alpha}_i = \frac{1}{n_{i\bullet}} \sum_{j=1}^b \sum_{k=1}^{n_{ij}} y_{ijk} - \hat{\mu} - \sum_{j=1}^b \frac{n_{ij}}{n_{i\bullet}} \hat{\beta}_j - \sum_{j=1}^b \frac{n_{ij}}{n_{i\bullet}} \hat{\gamma}_{ij}
= \frac{1}{n_{i\bullet}} \sum_{j=1}^b \sum_{k=1}^{n_{ij}} y_{ijk} - \hat{\mu} - \frac{n}{n_{i\bullet}} \sum_{j=1}^b \frac{n_{ij}}{n} \hat{\beta}_j - \frac{n}{n_{i\bullet}} \sum_{j=1}^b \frac{n_{ij}}{n} \hat{\gamma}_{ij} \quad (12)
\overset{(5)}{=} \frac{1}{n_{i\bullet}} \sum_{j=1}^b \sum_{k=1}^{n_{ij}} y_{ijk} - \frac{1}{n} \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} y_{ijk}
\overset{(3)}{=} \bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet\bullet\bullet} .
Analogously, setting the derivative with respect to \beta_j to zero yields

\hat{\beta}_j = \frac{1}{n_{\bullet j}} \sum_{i=1}^a \sum_{k=1}^{n_{ij}} y_{ijk} - \hat{\mu} - \sum_{i=1}^a \frac{n_{ij}}{n_{\bullet j}} \hat{\alpha}_i - \sum_{i=1}^a \frac{n_{ij}}{n_{\bullet j}} \hat{\gamma}_{ij}
= \frac{1}{n_{\bullet j}} \sum_{i=1}^a \sum_{k=1}^{n_{ij}} y_{ijk} - \hat{\mu} - \frac{n}{n_{\bullet j}} \sum_{i=1}^a \frac{n_{ij}}{n} \hat{\alpha}_i - \frac{n}{n_{\bullet j}} \sum_{i=1}^a \frac{n_{ij}}{n} \hat{\gamma}_{ij} \quad (13)
\overset{(5)}{=} \frac{1}{n_{\bullet j}} \sum_{i=1}^a \sum_{k=1}^{n_{ij}} y_{ijk} - \frac{1}{n} \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^{n_{ij}} y_{ijk}
\overset{(3)}{=} \bar{y}_{\bullet j\bullet} - \bar{y}_{\bullet\bullet\bullet} .
Finally, setting the derivative with respect to \gamma_{ij} to zero yields

\hat{\gamma}_{ij} = \frac{1}{n_{ij}} \sum_{k=1}^{n_{ij}} y_{ijk} - \hat{\mu} - \hat{\alpha}_i - \hat{\beta}_j = \bar{y}_{ij\bullet} - \bar{y}_{i\bullet\bullet} - \bar{y}_{\bullet j\bullet} + \bar{y}_{\bullet\bullet\bullet} .
Sources:
• Olbricht, Gayla R. (2011): “Two-Way ANOVA: Interaction”; in: Stat 512: Applied Regression
Analysis, Purdue University, Spring 2011, Lect. 27; URL: https://www.stat.purdue.edu/~ghobbs/
STAT_512/Lecture_Notes/ANOVA/Topic_27.pdf.
Proof: The total sum of squares (→ III/1.5.6) for two-way ANOVA (→ III/1.3.8) is given by
TSS = ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳ•••)²

where ȳ••• is the mean across all values yijk. This can be rewritten as
∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳ•••)² = ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} [(yijk − ȳij•) + (ȳi•• − ȳ•••) + (ȳ•j• − ȳ•••) + (ȳij• − ȳi•• − ȳ•j• + ȳ•••)]² .        (4)
It can be shown (→ III/1.3.12) that the following sums are all zero:
∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•) = 0
∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳi•• − ȳ•••) = 0        (5)
∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳ•j• − ȳ•••) = 0
Thus, expanding the square in (4), the cross terms vanish and we obtain:

∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳ•••)² = ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² + ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳi•• − ȳ•••)² + ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳ•j• − ȳ•••)² + ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳij• − ȳi•• − ȳ•j• + ȳ•••)² .        (6)
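The partition (6) can be checked numerically. The sketch below (an illustration for a balanced design, where the partition holds exactly; data and names are illustrative) confirms that the four components add up to the total sum of squares:

```python
# Numerical sketch of the sum-of-squares partition (6) for balanced data.
import numpy as np

rng = np.random.default_rng(0)
a, b, m = 3, 4, 6
y = rng.normal(size=(a, b, m))

gm = y.mean()
ri = y.mean(axis=(1, 2))               # ȳi••
cj = y.mean(axis=(0, 2))               # ȳ•j•
cell = y.mean(axis=2)                  # ȳij•

tss = ((y - gm) ** 2).sum()
ss_res = ((y - cell[:, :, None]) ** 2).sum()
ss_a = m * b * ((ri - gm) ** 2).sum()
ss_b = m * a * ((cj - gm) ** 2).sum()
ss_ab = m * ((cell - ri[:, None] - cj[None, :] + gm) ** 2).sum()

print(np.isclose(tss, ss_res + ss_a + ss_b + ss_ab))   # True
```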
Denoting the individual sums of squares as SS_A = ∑_{i=1}^{a} ni• (ȳi•• − ȳ•••)², SS_B = ∑_{j=1}^{b} n•j (ȳ•j• − ȳ•••)², SS_A×B = ∑_{i=1}^{a} ∑_{j=1}^{b} nij (ȳij• − ȳi•• − ȳ•j• + ȳ•••)², and the residual sum of squares (→ III/1.5.8) for two-way ANOVA (→ III/1.3.8) as SS_res = ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)², we finally have:

TSS = SS_A + SS_B + SS_A×B + SS_res .
■
Sources:
• Nandy, Siddhartha (2018): “Two-Way Analysis of Variance”; in: Stat 512: Applied Regression
Analysis, Purdue University, Summer 2018, Ch. 19; URL: https://www.stat.purdue.edu/~snandy/
stat512/topic7.pdf.
• Wikipedia (2022): “Analysis of variance”; in: Wikipedia, the free encyclopedia, retrieved on 2022-11-
15; URL: https://en.wikipedia.org/wiki/Analysis_of_variance#Partitioning_of_the_sum_of_
squares.
∑_{i=1}^{a} (nij/n) αi = 0   for all j = 1, …, b
∑_{j=1}^{b} (nij/n) βj = 0   for all i = 1, …, a
∑_{i=1}^{a} (nij/n) γij = 0  for all j = 1, …, b        (2)
∑_{j=1}^{b} (nij/n) γij = 0  for all i = 1, …, a .
Then, the following sums of squares (→ III/1.3.11) are chi-square distributed (→ II/3.7.1)
(1/σ²) n(ȳ••• − µ)² = SS_M/σ² ∼ χ²(1)
(1/σ²) ∑_{i=1}^{a} ni• ([ȳi•• − ȳ•••] − αi)² = SS_A/σ² ∼ χ²(a − 1)
(1/σ²) ∑_{j=1}^{b} n•j ([ȳ•j• − ȳ•••] − βj)² = SS_B/σ² ∼ χ²(b − 1)        (3)
(1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} nij ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)² = SS_A×B/σ² ∼ χ²((a − 1)(b − 1))
(1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² = SS_res/σ² ∼ χ²(n − ab) .
where the sample means (→ I/1.10.2) are given by

ȳ••• = (1/n) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} yijk
ȳi•• = (1/ni•) ∑_{j=1}^{b} ∑_{k=1}^{nij} yijk
ȳ•j• = (1/n•j) ∑_{i=1}^{a} ∑_{k=1}^{nij} yijk        (5)
ȳij• = (1/nij) ∑_{k=1}^{nij} yijk .
According to the model given by (1), the observations are distributed as yijk ∼ N(µ + αi + βj + γij, σ²), such that the standardized residuals Uijk = (yijk − µ − αi − βj − γij)/σ follow a standard normal distribution (→ II/3.2.3). Their sum of squares can be expanded as follows:

∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} Uijk² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} [(yijk − µ − αi − βj − γij) − [ȳ••• + (ȳi•• − ȳ•••) + (ȳ•j• − ȳ•••) + (ȳij• − ȳi•• − ȳ•j• + ȳ•••)] + [ȳ••• + (ȳi•• − ȳ•••) + (ȳ•j• − ȳ•••) + (ȳij• − ȳi•• − ȳ•j• + ȳ•••)]]²
= (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} [(yijk − [ȳ••• + (ȳi•• − ȳ•••) + (ȳ•j• − ȳ•••) + (ȳij• − ȳi•• − ȳ•j• + ȳ•••)]) + (ȳ••• − µ) + ([ȳi•• − ȳ•••] − αi) + ([ȳ•j• − ȳ•••] − βj) + ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)]²        (9)
= (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} [(yijk − ȳij•) + (ȳ••• − µ) + ([ȳi•• − ȳ•••] − αi) + ([ȳ•j• − ȳ•••] − βj) + ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)]² .
The sums of the cross terms in (9) all vanish:

∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•) = ∑_{i=1}^{a} ∑_{j=1}^{b} [∑_{k=1}^{nij} yijk − nij · ȳij•]
= ∑_{i=1}^{a} ∑_{j=1}^{b} [∑_{k=1}^{nij} yijk − nij · (1/nij) ∑_{k=1}^{nij} yijk]        [by (5)]        (10)
= 0 ,

∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳi•• − ȳ•••] − αi) = ∑_{i=1}^{a} ni• · (ȳi•• − ȳ••• − αi)
= ∑_{i=1}^{a} ni• · ȳi•• − ȳ••• ∑_{i=1}^{a} ni• − ∑_{i=1}^{a} ni• αi
= ∑_{i=1}^{a} ni• · (1/ni•) ∑_{j=1}^{b} ∑_{k=1}^{nij} yijk − n · (1/n) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} yijk − ∑_{i=1}^{a} ni• αi        [by (5)]        (11)
= − ∑_{i=1}^{a} ni• αi = −n ∑_{i=1}^{a} ∑_{j=1}^{b} (nij/n) αi = 0        [by (2)]
∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳ•j• − ȳ•••] − βj) = ∑_{j=1}^{b} n•j · (ȳ•j• − ȳ••• − βj)
= ∑_{j=1}^{b} n•j · ȳ•j• − ȳ••• ∑_{j=1}^{b} n•j − ∑_{j=1}^{b} n•j βj
= ∑_{j=1}^{b} n•j · (1/n•j) ∑_{i=1}^{a} ∑_{k=1}^{nij} yijk − n · (1/n) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} yijk − ∑_{j=1}^{b} n•j βj        [by (5)]        (12)
= − ∑_{j=1}^{b} n•j βj = −n ∑_{j=1}^{b} ∑_{i=1}^{a} (nij/n) βj = 0        [by (2)]
∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)
= ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} [(ȳij• − ȳ•••) − (ȳi•• − ȳ•••) − (ȳ•j• − ȳ•••) − γij]
= ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳij• − ȳ••• − γij) − ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳi•• − ȳ•••) − ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳ•j• − ȳ•••)
= ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳij• − ȳ••• − γij)        [by (11) and (12)]
= ∑_{i=1}^{a} ∑_{j=1}^{b} nij ȳij• − ȳ••• ∑_{i=1}^{a} ∑_{j=1}^{b} nij − ∑_{i=1}^{a} ∑_{j=1}^{b} nij γij        (13)
= ∑_{i=1}^{a} ∑_{j=1}^{b} nij · (1/nij) ∑_{k=1}^{nij} yijk − n · (1/n) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} yijk − ∑_{i=1}^{a} ∑_{j=1}^{b} nij γij        [by (5)]
= − ∑_{i=1}^{a} ∑_{j=1}^{b} nij γij = −n ∑_{i=1}^{a} ∑_{j=1}^{b} (nij/n) γij = 0 .        [by (2)]
Thus, expanding the square in (9), only the squared terms remain:

∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} Uijk² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} [(yijk − ȳij•)² + (ȳ••• − µ)² + ([ȳi•• − ȳ•••] − αi)² + ([ȳ•j• − ȳ•••] − βj)² + ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)²]
= (1/σ²) [∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² + ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳ••• − µ)² + ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳi•• − ȳ•••] − αi)² + ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳ•j• − ȳ•••] − βj)² + ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)²] .        (15)
Cochran’s theorem states that, if a sum of squared standard normal (→ II/3.2.3) random variables
(→ I/1.2.2) can be written as a sum of squared forms
∑_{i=1}^{n} Ui² = ∑_{j=1}^{m} Qj   where   Qj = Uᵀ B^(j) U   with   ∑_{j=1}^{m} B^(j) = In ,        (16)
then the terms Qj are independent (→ I/1.3.6) and each term Qj follows a chi-squared distribution
(→ II/3.7.1) with rj degrees of freedom:
Qj ∼ χ2 (rj ), j = 1, . . . , m . (17)
First, we define the n × 1 vector U :
U = [u1•; …; ua•]   where   ui• = [ui1; …; uib]   where   uij = [(y_ij1 − µ − αi − βj − γij)/σ; …; (y_ij nij − µ − αi − βj − γij)/σ] .        (18)
Next, we specify the n × n matrices B^(1), …, B^(5):
B^(1) = In − diag( diag( (1/n11) J_n11, …, (1/n1b) J_n1b ), …, diag( (1/na1) J_na1, …, (1/nab) J_nab ) )
B^(2) = (1/n) Jn
B^(3) = diag( (1/n1•) J_n1•, …, (1/na•) J_na• ) − (1/n) Jn
B^(4) = M_B − (1/n) Jn        (19)
B^(5) = diag( diag( (1/n11) J_n11, …, (1/n1b) J_n1b ), …, diag( (1/na1) J_na1, …, (1/nab) J_nab ) ) − diag( (1/n1•) J_n1•, …, (1/na•) J_na• ) − M_B + (1/n) Jn
where Jn is an n × n matrix of ones, Jn,m is an n × m matrix of ones and diag (A1 , . . . , An ) denotes
a block-diagonal matrix composed of A1 , . . . , An . We observe that those matrices satisfy
∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} Uijk² = ∑_{l=1}^{5} Ql = ∑_{l=1}^{5} Uᵀ B^(l) U        (21)
as well as
∑_{l=1}^{5} B^(l) = In        (22)

and that the ranks of these matrices are

rank(B^(1)) = n − ab
rank(B^(2)) = 1
rank(B^(3)) = a − 1        (23)
rank(B^(4)) = b − 1
rank(B^(5)) = (a − 1)(b − 1) .
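Properties (22) and (23) can be illustrated numerically for a balanced design. In the sketch below, M_B is constructed as the averaging matrix of factor B, which is an assumption about the observation ordering i → j → k used in U; all other names are illustrative:

```python
# Sketch verifying the Cochran decomposition matrices: sum to identity
# (22) and ranks (23), for a small balanced design (n_ij = m).
import numpy as np

a, b, m = 3, 4, 2
n = a * b * m

def J(k):                                    # k x k matrix of ones
    return np.ones((k, k))

# averaging matrices under the ordering (i, j, k), k fastest
cell = np.kron(np.eye(a * b), J(m) / m)      # cell means
rows = np.kron(np.eye(a), J(b * m) / (b * m))  # row (factor A) means
Jn = J(n) / n                                # grand mean
idx = np.arange(n)
j_of = (idx // m) % b                        # column index of each observation
MB = (j_of[:, None] == j_of[None, :]) / (a * m)  # column (factor B) means

B = [np.eye(n) - cell, Jn, rows - Jn, MB - Jn, cell - rows - MB + Jn]

print(np.allclose(sum(B), np.eye(n)))                        # (22)
print([int(np.linalg.matrix_rank(Bl)) for Bl in B])          # ranks (23)
```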
Thus, the conditions for applying Cochran’s theorem given by (16) are fulfilled and we can use (15),
(17), (19) and (23) to conclude that
(1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳ••• − µ)² = Q2 = Uᵀ B^(2) U ∼ χ²(1)
(1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳi•• − ȳ•••] − αi)² = Q3 = Uᵀ B^(3) U ∼ χ²(a − 1)
(1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳ•j• − ȳ•••] − βj)² = Q4 = Uᵀ B^(4) U ∼ χ²(b − 1)        (24)
(1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)² = Q5 = Uᵀ B^(5) U ∼ χ²((a − 1)(b − 1))
(1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² = Q1 = Uᵀ B^(1) U ∼ χ²(n − ab) .
Finally, we identify the terms Q with sums of squares in two-way ANOVA (→ III/1.3.11) and simplify
them to reach the expressions given by (3):
SS_M/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳ••• − µ)² = (1/σ²) n(ȳ••• − µ)²
SS_A/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳi•• − ȳ•••] − αi)² = (1/σ²) ∑_{i=1}^{a} ni• ([ȳi•• − ȳ•••] − αi)²
SS_B/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳ•j• − ȳ•••] − βj)² = (1/σ²) ∑_{j=1}^{b} n•j ([ȳ•j• − ȳ•••] − βj)²        (25)
SS_A×B/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} nij ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)²
SS_res/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² .
Sources:
• Nandy, Siddhartha (2018): “Two-Way Analysis of Variance”; in: Stat 512: Applied Regression
Analysis, Purdue University, Summer 2018, Ch. 19; URL: https://www.stat.purdue.edu/~snandy/
stat512/topic7.pdf.
H0: α1 = … = αa = 0
H1: αi ≠ 0 for at least one i ∈ {1, …, a}        (4)

H0: β1 = … = βb = 0
H1: βj ≠ 0 for at least one j ∈ {1, …, b} .        (7)
Proof: Applying Cochran's theorem for two-way analysis of variance (→ III/1.3.12), we find that the following sums of squares
SS_A/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳi•• − ȳ•••] − αi)² = (1/σ²) ∑_{i=1}^{a} ni• ([ȳi•• − ȳ•••] − αi)²
SS_B/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳ•j• − ȳ•••] − βj)² = (1/σ²) ∑_{j=1}^{b} n•j ([ȳ•j• − ȳ•••] − βj)²        (8)
SS_res/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)²

are distributed as

SS_A/σ² ∼ χ²(a − 1)
SS_B/σ² ∼ χ²(b − 1)        (9)
SS_res/σ² ∼ χ²(n − ab) .
1) Thus, the F-statistic from (2) is equal to the ratio of two independent (→ I/1.3.6) chi-squared
distributed (→ II/3.7.1) random variables (→ I/1.2.2) divided by their degrees of freedom
F_A = [(SS_A/σ²)/(a − 1)] / [(SS_res/σ²)/(n − ab)]
    = [SS_A/(a − 1)] / [SS_res/(n − ab)]
    = [ (1/(a−1)) ∑_{i=1}^{a} ni• ([ȳi•• − ȳ•••] − αi)² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]        [by (8)]        (10)
    = [ (1/(a−1)) ∑_{i=1}^{a} ni• (ȳi•• − ȳ•••)² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ] .        [under H0 in (4), i.e. αi = 0]
2) Similarly, the F-statistic for factor B becomes

F_B = [(SS_B/σ²)/(b − 1)] / [(SS_res/σ²)/(n − ab)]
    = [SS_B/(b − 1)] / [SS_res/(n − ab)]
    = [ (1/(b−1)) ∑_{j=1}^{b} n•j ([ȳ•j• − ȳ•••] − βj)² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]        [by (8)]        (12)
    = [ (1/(b−1)) ∑_{j=1}^{b} n•j (ȳ•j• − ȳ•••)² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ] .        [under H0 in (7), i.e. βj = 0]
Sources:
• ttd (2021): “Proof on SSAB/s2 chi2(I-1)(J-1) under the null hypothesis HAB: dij=0 for i=1,...,I
and j=1,...,J”; in: StackExchange CrossValidated, retrieved on 2022-11-10; URL: https://stats.
stackexchange.com/questions/545807/proof-on-ss-ab-sigma2-sim-chi2-i-1j-1-under-the-null-hypothesis.
• JohnK (2014): “In a two-way ANOVA, how can the F-statistic for one factor have a central dis-
tribution if the null is false for the other factor?”; in: StackExchange CrossValidated, retrieved on
2022-11-10; URL: https://stats.stackexchange.com/questions/124166/in-a-two-way-anova-how-can-the-f-
H0: γ11 = … = γab = 0
H1: γij ≠ 0 for at least one (i, j) ∈ {1, …, a} × {1, …, b} .        (4)
Proof: Applying Cochran's theorem for two-way analysis of variance (→ III/1.3.12), we find that the following sums of squares
SS_A×B/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} nij ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)²        (5)
SS_res/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)²

are distributed as

SS_A×B/σ² ∼ χ²((a − 1)(b − 1))        (6)
SS_res/σ² ∼ χ²(n − ab) .
Thus, the F-statistic from (2) is equal to the ratio of two independent (→ I/1.3.6) chi-squared distributed (→ II/3.7.1) random variables (→ I/1.2.2) divided by their degrees of freedom:

F_A×B = [(SS_A×B/σ²)/((a − 1)(b − 1))] / [(SS_res/σ²)/(n − ab)]
      = [ (1/((a−1)(b−1))) ∑_{i=1}^{a} ∑_{j=1}^{b} nij ([ȳij• − ȳi•• − ȳ•j• + ȳ•••] − γij)² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]        [by (5)]

which, under the null hypothesis (4), i.e. γij = 0, reduces to the expression given in (2).
■
Sources:
• Nandy, Siddhartha (2018): “Two-Way Analysis of Variance”; in: Stat 512: Applied Regression
Analysis, Purdue University, Summer 2018, Ch. 19; URL: https://www.stat.purdue.edu/~snandy/
stat512/topic7.pdf.
• ttd (2021): “Proof on SSAB/s2 chi2(I-1)(J-1) under the null hypothesis HAB: dij=0 for i=1,...,I
and j=1,...,J”; in: StackExchange CrossValidated, retrieved on 2022-11-10; URL: https://stats.
stackexchange.com/questions/545807/proof-on-ss-ab-sigma2-sim-chi2-i-1j-1-under-the-null-hypothesis.
F_M = n(ȳ•••)² / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]        (2)
follows an F-distribution (→ II/3.8.1)
H0: µ = 0
H1: µ ≠ 0 .        (4)
Proof: Applying Cochran's theorem for two-way analysis of variance (→ III/1.3.12), we find that the following sums of squares
SS_M/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (ȳ••• − µ)² = (1/σ²) n(ȳ••• − µ)²        (5)
SS_res/σ² = (1/σ²) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)²

are distributed as

SS_M/σ² ∼ χ²(1)        (6)
SS_res/σ² ∼ χ²(n − ab) .
Thus, the F-statistic from (2) is equal to the ratio of two independent (→ I/1.3.6) chi-squared
distributed (→ II/3.7.1) random variables (→ I/1.2.2) divided by their degrees of freedom
F_M = [(SS_M/σ²)/1] / [(SS_res/σ²)/(n − ab)]
    = SS_M / [SS_res/(n − ab)]
    = [n(ȳ••• − µ)²] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]        [by (5)]        (7)
    = [n(ȳ•••)²] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ] .        [under H0 in (4), i.e. µ = 0, as given by (2)]
Sources:
• Nandy, Siddhartha (2018): “Two-Way Analysis of Variance”; in: Stat 512: Applied Regression
Analysis, Purdue University, Summer 2018, Ch. 19; URL: https://www.stat.purdue.edu/~snandy/
stat512/topic7.pdf.
• Olbricht, Gayla R. (2011): “Two-Way ANOVA: Interaction”; in: Stat 512: Applied Regression
Analysis, Purdue University, Spring 2011, Lect. 27; URL: https://www.stat.purdue.edu/~ghobbs/
STAT_512/Lecture_Notes/ANOVA/Topic_27.pdf.
F_M = nµ̂² / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ŷijk)² ]
F_A = [ (1/(a−1)) ∑_{i=1}^{a} ni• α̂i² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ŷijk)² ]
F_B = [ (1/(b−1)) ∑_{j=1}^{b} n•j β̂j² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ŷijk)² ]        (2)
F_A×B = [ (1/((a−1)(b−1))) ∑_{i=1}^{a} ∑_{j=1}^{b} nij γ̂ij² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ŷijk)² ] .

Proof: In terms of the sample means, the F-statistics for grand mean, main effects and interaction in two-way ANOVA are given by

F_M = n(ȳ•••)² / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]
F_A = [ (1/(a−1)) ∑_{i=1}^{a} ni• (ȳi•• − ȳ•••)² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]
F_B = [ (1/(b−1)) ∑_{j=1}^{b} n•j (ȳ•j• − ȳ•••)² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]        (4)
F_A×B = [ (1/((a−1)(b−1))) ∑_{i=1}^{a} ∑_{j=1}^{b} nij (ȳij• − ȳi•• − ȳ•j• + ȳ•••)² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ȳij•)² ]
and the ordinary least squares estimates for two-way ANOVA (→ III/1.3.10) are
µ̂ = ȳ•••
α̂i = ȳi•• − ȳ•••
(5)
β̂j = ȳ•j• − ȳ•••
γ̂ij = ȳij• − ȳi•• − ȳ•j• + ȳ•••
where the sample means (→ I/1.10.2) are given by
ȳ••• = (1/n) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} yijk
ȳi•• = (1/ni•) ∑_{j=1}^{b} ∑_{k=1}^{nij} yijk
ȳ•j• = (1/n•j) ∑_{i=1}^{a} ∑_{k=1}^{nij} yijk        (6)
ȳij• = (1/nij) ∑_{k=1}^{nij} yijk .
Since the fitted values are ŷijk = µ̂ + α̂i + β̂j + γ̂ij = ȳij• , plugging the estimates (5) into (4) yields:

F_M = nµ̂² / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ŷijk)² ]
F_A = [ (1/(a−1)) ∑_{i=1}^{a} ni• α̂i² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ŷijk)² ]
F_B = [ (1/(b−1)) ∑_{j=1}^{b} n•j β̂j² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ŷijk)² ]        (8)
F_A×B = [ (1/((a−1)(b−1))) ∑_{i=1}^{a} ∑_{j=1}^{b} nij γ̂ij² ] / [ (1/(n−ab)) ∑_{i=1}^{a} ∑_{j=1}^{b} ∑_{k=1}^{nij} (yijk − ŷijk)² ]

which is equivalent to (2).
■
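The sketch below (balanced simulated data; names illustrative) computes the four F-statistics of (8) from the OLS estimates and checks that the fitted values equal the cell means, so that the denominators in (2) and (4) coincide:

```python
# Sketch of (8): F-statistics from the two-way ANOVA OLS estimates.
import numpy as np

rng = np.random.default_rng(7)
a, b, m = 3, 4, 10
n = a * b * m
y = 1.0 + rng.normal(size=(a, b, m))    # nonzero grand mean, no effects

gm = y.mean()
ri = y.mean(axis=(1, 2)); cj = y.mean(axis=(0, 2)); cell = y.mean(axis=2)
mu, al, be = gm, ri - gm, cj - gm
ga = cell - ri[:, None] - cj[None, :] + gm

fitted = mu + al[:, None] + be[None, :] + ga
print(np.allclose(fitted, cell))                     # True: ŷijk = ȳij•

den = ((y - cell[:, :, None]) ** 2).sum() / (n - a * b)
F_M = n * mu ** 2 / den
F_A = (b * m) * (al ** 2).sum() / (a - 1) / den
F_B = (a * m) * (be ** 2).sum() / (b - 1) / den
F_AB = m * (ga ** 2).sum() / ((a - 1) * (b - 1)) / den
print(np.round([F_M, F_A, F_B, F_AB], 2))
```

Since the simulated effects are zero but the grand mean is not, F_M should come out large while the other statistics hover around one.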
y = β0 + β1 x + ε , (1)
together with a statement asserting a normal distribution (→ II/4.1.1) for ε
ε ∼ N (0, σ 2 V ) (2)
is called a univariate simple regression model or simply, “simple linear regression”.
• y is called “dependent variable”, “measured data” or “signal”;
• x is called “independent variable”, “predictor” or “covariate”;
• V is called “covariance matrix” or “covariance structure”;
• β1 is called “slope of the regression line (→ III/1.4.10)”;
• β0 is called “intercept of the regression line (→ III/1.4.10)”;
• ε is called “noise”, “errors” or “error terms”;
• σ 2 is called “noise variance” or “error variance”;
• n is the number of observations.
When the covariance structure V is equal to the n × n identity matrix, this is called simple linear
regression with independent and identically distributed (i.i.d.) observations:
i.i.d.
V = In ⇒ ε ∼ N (0, σ 2 In ) ⇒ εi ∼ N (0, σ 2 ) . (3)
In this case, the linear regression model can also be written as
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ) . (4)
Otherwise, it is called simple linear regression with correlated observations.
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_
line.
Proof: Without loss of generality, consider the simple linear regression case with uncorrelated errors
(→ III/1.4.1):
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ), i = 1, . . . , n . (2)
In matrix notation and using the multivariate normal distribution (→ II/4.1.1), this can also be
written as
y = β0 1n + β1 x + ε,   ε ∼ N(0, σ² In)
y = [1n  x] [β0; β1] + ε,   ε ∼ N(0, σ² In) .        (3)
Comparing with the multiple linear regression equations for uncorrelated errors (→ III/1.5.1), we
finally note:
y = Xβ + ε   with   X = [1n  x]   and   β = [β0; β1] .        (4)
In the case of correlated observations (→ III/1.4.1), the error distribution changes to (→ III/1.5.1):
ε ∼ N (0, σ 2 V ) . (5)
■
yi = β0 + β1 xi + εi ,   εi ∼ N(0, σ²),   i = 1, …, n ,        (1)
the parameters minimizing the residual sum of squares (→ III/1.5.8) are given by
β̂0 = ȳ − β̂1 x̄
β̂1 = sxy / sx²        (2)
where x̄ and ȳ are the sample means (→ I/1.10.2), s2x is the sample variance (→ I/1.11.2) of x and
sxy is the sample covariance (→ I/1.13.2) between x and y.
Proof: The residual sum of squares (→ III/1.5.8) for this model is RSS(β0, β1) = ∑_{i=1}^{n} (yi − β0 − β1 xi)² and its derivatives with respect to β0 and β1 are

dRSS(β0, β1)/dβ0 = ∑_{i=1}^{n} 2(yi − β0 − β1 xi)(−1) = −2 ∑_{i=1}^{n} (yi − β0 − β1 xi)
dRSS(β0, β1)/dβ1 = ∑_{i=1}^{n} 2(yi − β0 − β1 xi)(−xi) = −2 ∑_{i=1}^{n} (xi yi − β0 xi − β1 xi²) .        (4)
Setting these derivatives to zero at (β̂0, β̂1), we obtain:

0 = −2 ∑_{i=1}^{n} (yi − β̂0 − β̂1 xi)
0 = −2 ∑_{i=1}^{n} (xi yi − β̂0 xi − β̂1 xi²)        (5)

which can be rearranged into the normal equations:

β̂1 ∑_{i=1}^{n} xi + β̂0 · n = ∑_{i=1}^{n} yi
β̂1 ∑_{i=1}^{n} xi² + β̂0 ∑_{i=1}^{n} xi = ∑_{i=1}^{n} xi yi .        (6)
From the first equation, we can derive the estimate for the intercept:
β̂0 = (1/n) ∑_{i=1}^{n} yi − β̂1 · (1/n) ∑_{i=1}^{n} xi = ȳ − β̂1 x̄ .        (7)
From the second equation, we can derive the estimate for the slope:
β̂1 ∑_{i=1}^{n} xi² + β̂0 ∑_{i=1}^{n} xi = ∑_{i=1}^{n} xi yi
β̂1 ∑_{i=1}^{n} xi² + (ȳ − β̂1 x̄) ∑_{i=1}^{n} xi = ∑_{i=1}^{n} xi yi        [by (7)]
β̂1 (∑_{i=1}^{n} xi² − x̄ ∑_{i=1}^{n} xi) = ∑_{i=1}^{n} xi yi − ȳ ∑_{i=1}^{n} xi        (8)
β̂1 = (∑_{i=1}^{n} xi yi − ȳ ∑_{i=1}^{n} xi) / (∑_{i=1}^{n} xi² − x̄ ∑_{i=1}^{n} xi) .
Note that the numerator can be rewritten as

∑_{i=1}^{n} xi yi − ȳ ∑_{i=1}^{n} xi = ∑_{i=1}^{n} xi yi − n x̄ ȳ
= ∑_{i=1}^{n} xi yi − n x̄ ȳ − n x̄ ȳ + n x̄ ȳ
= ∑_{i=1}^{n} xi yi − ȳ ∑_{i=1}^{n} xi − x̄ ∑_{i=1}^{n} yi + ∑_{i=1}^{n} x̄ ȳ        (9)
= ∑_{i=1}^{n} (xi yi − xi ȳ − x̄ yi + x̄ ȳ)
= ∑_{i=1}^{n} (xi − x̄)(yi − ȳ)
and the denominator can be rewritten as

∑_{i=1}^{n} xi² − x̄ ∑_{i=1}^{n} xi = ∑_{i=1}^{n} xi² − n x̄²
= ∑_{i=1}^{n} xi² − 2n x̄ x̄ + n x̄²
= ∑_{i=1}^{n} xi² − 2 x̄ ∑_{i=1}^{n} xi + ∑_{i=1}^{n} x̄²        (10)
= ∑_{i=1}^{n} (xi² − 2 x̄ xi + x̄²)
= ∑_{i=1}^{n} (xi − x̄)² .
With (9) and (10), the estimate from (8) can be simplified as follows:
β̂1 = (∑_{i=1}^{n} xi yi − ȳ ∑_{i=1}^{n} xi) / (∑_{i=1}^{n} xi² − x̄ ∑_{i=1}^{n} xi)
   = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)²        (11)
   = sxy / sx² .
Together, (7) and (11) constitute the ordinary least squares parameter estimates for simple linear
regression.
■
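A quick numerical check of (7) and (11) — the closed-form estimates should agree with a generic least-squares fit (simulated data; names illustrative):

```python
# Sketch: closed-form OLS estimates for simple linear regression match
# numpy's degree-1 polynomial fit.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=50)

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()

ref = np.polyfit(x, y, 1)                 # [slope, intercept]
print(np.allclose([b1, b0], ref))         # True
```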
Sources:
• Penny, William (2006): “Linear regression”; in: Mathematics for Brain Imaging, ch. 1.2.2, pp.
14-16, eqs. 1.24/1.25; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
• Wikipedia (2021): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2021-10-27; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Derivation_of_simple_linear_regression_estimators.
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ), i = 1, . . . , n , (1)
the parameters minimizing the residual sum of squares (→ III/1.5.8) are given by
β̂0 = ȳ − β̂1 x̄
β̂1 = sxy / sx²        (2)
where x̄ and ȳ are the sample means (→ I/1.10.2), s2x is the sample variance (→ I/1.11.2) of x and
sxy is the sample covariance (→ I/1.13.2) between x and y.
Proof: Simple linear regression is a special case of multiple linear regression (→ III/1.4.2) with
X = [1n  x]   and   β = [β0; β1]        (3)
and ordinary least squares estimates (→ III/1.5.3) are given by
β̂ = (X T X)−1 X T y . (4)
Writing out equation (4), we have
β̂ = ( [1nᵀ; xᵀ] [1n  x] )⁻¹ [1nᵀ; xᵀ] y
  = [n, n x̄; n x̄, xᵀx]⁻¹ [n ȳ; xᵀy]
  = (1/(n xᵀx − (n x̄)²)) [xᵀx, −n x̄; −n x̄, n] [n ȳ; xᵀy]        (5)
  = (1/(n xᵀx − (n x̄)²)) [n ȳ xᵀx − n x̄ xᵀy; n xᵀy − (n x̄)(n ȳ)] .
Thus, the estimate for the slope is

β̂1 = (n xᵀy − (n x̄)(n ȳ)) / (n xᵀx − (n x̄)²)
   = (xᵀy − n x̄ ȳ) / (xᵀx − n x̄²)
   = (∑_{i=1}^{n} xi yi − ∑_{i=1}^{n} x̄ ȳ) / (∑_{i=1}^{n} xi² − ∑_{i=1}^{n} x̄²)        (6)
   = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)²
   = sxy / sx² .
And the estimate for the intercept is

β̂0 = (n ȳ xᵀx − n x̄ xᵀy) / (n xᵀx − (n x̄)²)
   = (ȳ xᵀx − x̄ xᵀy) / (xᵀx − n x̄²)
   = (ȳ xᵀx − x̄ xᵀy + n x̄² ȳ − n x̄² ȳ) / (xᵀx − n x̄²)
   = (ȳ (xᵀx − n x̄²) − x̄ (xᵀy − n x̄ ȳ)) / (xᵀx − n x̄²)        (7)
   = ȳ − x̄ (xᵀy − n x̄ ȳ) / (xᵀx − n x̄²)
   = ȳ − β̂1 x̄ .
■
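The matrix route (5)-(7) can be cross-checked numerically: (XᵀX)⁻¹Xᵀy with X = [1n x] should reproduce the closed-form estimates (simulated data; names illustrative):

```python
# Sketch of the matrix derivation: (X^T X)^{-1} X^T y reproduces
# beta1 = s_xy / s_x^2 and beta0 = ybar - beta1 * xbar.
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
y = 1.5 - 2.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)          # [beta0, beta1]

sxy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
sx2 = ((x - x.mean()) ** 2).sum() / (n - 1)
print(np.allclose(beta, [y.mean() - (sxy / sx2) * x.mean(), sxy / sx2]))  # True
```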
y = β0 + β1 x + ε, εi ∼ N (0, σ 2 ), i = 1, . . . , n (1)
and consider estimation using ordinary least squares (→ III/1.4.3). Then, the expected values (→
I/1.10.1) of the estimated parameters are
E(β̂0 ) = β0
(2)
E(β̂1 ) = β1
which means that the ordinary least squares solution (→ III/1.4.3) produces unbiased estimators.
Proof: According to the simple linear regression model in (1), the expectation of a single data point
is
E(yi ) = β0 + β1 xi . (3)
The ordinary least squares estimates for simple linear regression (→ III/1.4.3) are given by
β̂0 = (1/n) ∑_{i=1}^{n} yi − β̂1 · (1/n) ∑_{i=1}^{n} xi
β̂1 = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)² .        (4)

Defining the weights

ci = (xi − x̄) / ∑_{i=1}^{n} (xi − x̄)² ,        (5)

we note that
∑_{i=1}^{n} ci = ∑_{i=1}^{n} (xi − x̄) / ∑_{i=1}^{n} (xi − x̄)² = (∑_{i=1}^{n} xi − n x̄) / ∑_{i=1}^{n} (xi − x̄)² = (n x̄ − n x̄) / ∑_{i=1}^{n} (xi − x̄)² = 0 ,        (6)
and

∑_{i=1}^{n} ci xi = ∑_{i=1}^{n} (xi − x̄) xi / ∑_{i=1}^{n} (xi − x̄)² = (∑_{i=1}^{n} xi² − n x̄²) / ∑_{i=1}^{n} (xi − x̄)² = ∑_{i=1}^{n} (xi² − 2 x̄ xi + x̄²) / ∑_{i=1}^{n} (xi − x̄)² = 1 .        (7)
With (5), the estimate for the slope from (4) becomes
β̂1 = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)²
   = ∑_{i=1}^{n} ci (yi − ȳ)        (8)
   = ∑_{i=1}^{n} ci yi − ȳ ∑_{i=1}^{n} ci
and with (3), (6) and (7), the expectation of the slope estimate becomes:

E(β̂1) = E( ∑_{i=1}^{n} ci yi − ȳ ∑_{i=1}^{n} ci )
      = ∑_{i=1}^{n} ci E(yi) − ȳ ∑_{i=1}^{n} ci
      = β1 ∑_{i=1}^{n} ci xi + β0 ∑_{i=1}^{n} ci − ȳ ∑_{i=1}^{n} ci        (9)
      = β1 .
Finally, with (3) and (9), the expectation of the intercept estimate from (4) becomes
E(β̂0) = E( (1/n) ∑_{i=1}^{n} yi − β̂1 · (1/n) ∑_{i=1}^{n} xi )
      = (1/n) ∑_{i=1}^{n} E(yi) − E(β̂1) · x̄
      = (1/n) ∑_{i=1}^{n} (β0 + β1 xi) − β1 · x̄        (10)
      = β0 + β1 x̄ − β1 x̄
      = β0 .
■
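Unbiasedness can also be illustrated by simulation: averaging the OLS estimates over many data sets drawn from (1) should approach (β0, β1). The sketch below uses illustrative parameter values:

```python
# Monte Carlo sketch of unbiasedness: mean of the OLS estimates over
# repeated simulations approaches the true (beta0, beta1).
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 1.0, 2.0, 1.0
x = np.linspace(0, 1, 20)                       # fixed design

est = np.empty((5000, 2))
for r in range(5000):
    y = beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    est[r] = [y.mean() - b1 * x.mean(), b1]

print(np.round(est.mean(axis=0), 1))            # close to [1., 2.]
```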
Sources:
• Penny, William (2006): “Finding the uncertainty in estimating the slope”; in: Mathematics for
Brain Imaging, ch. 1.2.4, pp. 18-20, eq. 1.37; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/
mbi_course.pdf.
• Wikipedia (2021): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2021-10-27; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Unbiasedness_and_variance_of_%7F’%22%60UNIQ--postMath-00000037-QINU%60%
22’%7F.
y = β0 + β1 x + ε, εi ∼ N (0, σ 2 ), i = 1, . . . , n (1)
and consider estimation using ordinary least squares (→ III/1.4.3). Then, the variances (→ I/1.11.1)
of the estimated parameters are
Var(β̂0) = (xᵀx/n) · σ² / ((n − 1) sx²)
Var(β̂1) = σ² / ((n − 1) sx²)        (2)
where s2x is the sample variance (→ I/1.11.2) of x and xT x is the sum of squared values of the
covariate.
Proof: According to the simple linear regression model in (1), the variance of a single data point is

Var(yi) = σ² .        (3)

The ordinary least squares estimates for simple linear regression (→ III/1.4.3) are given by

β̂0 = (1/n) ∑_{i=1}^{n} yi − β̂1 · (1/n) ∑_{i=1}^{n} xi
β̂1 = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)² .        (4)
Defining the weights

ci = (xi − x̄) / ∑_{i=1}^{n} (xi − x̄)² ,        (5)

we note that

∑_{i=1}^{n} ci² = ∑_{i=1}^{n} [ (xi − x̄) / ∑_{i=1}^{n} (xi − x̄)² ]²
= ∑_{i=1}^{n} (xi − x̄)² / [ ∑_{i=1}^{n} (xi − x̄)² ]²        (6)
= 1 / ∑_{i=1}^{n} (xi − x̄)² .
With (5), the estimate for the slope from (4) becomes
β̂1 = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)²
   = ∑_{i=1}^{n} ci (yi − ȳ)        (7)
   = ∑_{i=1}^{n} ci yi − ȳ ∑_{i=1}^{n} ci
and with (3) and (6) as well as invariance (→ I/1.11.6), scaling (→ I/1.11.7) and additivity (→
I/1.11.10) of the variance, the variance of β̂1 is:
Var(β̂1) = Var( ∑_{i=1}^{n} ci yi − ȳ ∑_{i=1}^{n} ci )
        = Var( ∑_{i=1}^{n} ci yi )
        = ∑_{i=1}^{n} ci² Var(yi)
        = σ² ∑_{i=1}^{n} ci²        (8)
        = σ² / ∑_{i=1}^{n} (xi − x̄)²
        = σ² / ( (n − 1) · (1/(n−1)) ∑_{i=1}^{n} (xi − x̄)² )
        = σ² / ((n − 1) sx²) .
Finally, with (3) and (8), the variance of the intercept estimate from (4) becomes:
Var(β̂0) = Var( (1/n) ∑_{i=1}^{n} yi − β̂1 · (1/n) ∑_{i=1}^{n} xi )
        = Var( (1/n) ∑_{i=1}^{n} yi ) + Var( β̂1 · x̄ )
        = (1/n)² ∑_{i=1}^{n} Var(yi) + x̄² · Var(β̂1)        (9)
        = (1/n²) ∑_{i=1}^{n} σ² + x̄² σ² / ((n − 1) sx²)
        = σ²/n + σ² x̄² / ((n − 1) sx²) .
Applying the formula for the sample variance (→ I/1.11.2) s2x , we finally get:
Var(β̂0) = σ² ( 1/n + x̄² / ∑_{i=1}^{n} (xi − x̄)² )
        = σ² ( (1/n) ∑_{i=1}^{n} (xi − x̄)² + x̄² ) / ∑_{i=1}^{n} (xi − x̄)²
        = σ² ( (1/n) ∑_{i=1}^{n} (xi² − 2 x̄ xi + x̄²) + x̄² ) / ∑_{i=1}^{n} (xi − x̄)²
        = σ² ( (1/n) ∑_{i=1}^{n} xi² − 2 x̄ · (1/n) ∑_{i=1}^{n} xi + x̄² + x̄² ) / ∑_{i=1}^{n} (xi − x̄)²        (10)
        = σ² ( (1/n) ∑_{i=1}^{n} xi² − 2 x̄² + 2 x̄² ) / ∑_{i=1}^{n} (xi − x̄)²
        = σ² · (1/n) ∑_{i=1}^{n} xi² / ( (n − 1) · (1/(n−1)) ∑_{i=1}^{n} (xi − x̄)² )
        = (xᵀx/n) · σ² / ((n − 1) sx²) .
■
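The closed forms in (2) can be checked against the diagonal of σ²(XᵀX)⁻¹ from the general OLS theory (a deterministic check on a toy design; names illustrative):

```python
# Sketch: variance formulas (2) match the diagonal of sigma^2 (X^T X)^{-1}.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
n, sigma2 = x.size, 2.5
X = np.column_stack([np.ones(n), x])
cov = sigma2 * np.linalg.inv(X.T @ X)

sx2 = ((x - x.mean()) ** 2).sum() / (n - 1)
var_b0 = (x @ x / n) * sigma2 / ((n - 1) * sx2)
var_b1 = sigma2 / ((n - 1) * sx2)
print(np.allclose([var_b0, var_b1], np.diag(cov)))   # True
```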
Sources:
• Penny, William (2006): “Finding the uncertainty in estimating the slope”; in: Mathematics for
Brain Imaging, ch. 1.2.4, pp. 18-20, eq. 1.37; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/
mbi_course.pdf.
• Wikipedia (2021): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2021-10-27; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Unbiasedness_and_variance_of_%7F’%22%60UNIQ--postMath-00000037-QINU%60%
22’%7F.
y = β0 + β1 x + ε, εi ∼ N (0, σ 2 ) (1)
and consider estimation using ordinary least squares (→ III/1.4.3). Then, the estimated parameters
are normally distributed (→ II/4.1.1) as
[β̂0; β̂1] ∼ N( [β0; β1], σ²/((n − 1) sx²) · [xᵀx/n, −x̄; −x̄, 1] )        (2)
where x̄ is the sample mean (→ I/1.10.2) and s2x is the sample variance (→ I/1.11.2) of x.
Proof: Simple linear regression is a special case of multiple linear regression (→ III/1.4.2) with
X = [1n  x]   and   β = [β0; β1] ,        (3)
y = Xβ + ε, ε ∼ N (0, σ 2 In ) (4)
and ordinary least squares estimates (→ III/1.5.3) are given by
β̂ = (X T X)−1 X T y . (5)
From (4) and the linear transformation theorem for the multivariate normal distribution (→ II/4.1.12),
it follows that
y ∼ N Xβ, σ 2 In . (6)
From (5), in combination with (6) and the transformation theorem (→ II/4.1.12), it follows that
β̂ ∼ N( (XᵀX)⁻¹XᵀXβ, σ²(XᵀX)⁻¹Xᵀ In X(XᵀX)⁻¹ )
  ∼ N( β, σ²(XᵀX)⁻¹ ) .        (7)
Applying (3), the covariance matrix (→ II/4.1.1) can be further developed as follows:
σ²(XᵀX)⁻¹ = σ² ( [1nᵀ; xᵀ] [1n  x] )⁻¹
= σ² [n, n x̄; n x̄, xᵀx]⁻¹
= (σ²/(n xᵀx − (n x̄)²)) [xᵀx, −n x̄; −n x̄, n]        (8)
= (σ²/(xᵀx − n x̄²)) [xᵀx/n, −x̄; −x̄, 1] .
Finally, the denominator can be restated as follows:

xᵀx − n x̄² = ∑_{i=1}^{n} xi² − n x̄² = ∑_{i=1}^{n} (xi − x̄)² = (n − 1) sx² .        (9)
■
Sources:
• Wikipedia (2021): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2021-11-09; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Unbiasedness_and_variance_of_%7F’%22%60UNIQ--postMath-00000037-QINU%60%
22’%7F.
Proof: The parameter estimates for simple linear regression are bivariate normally distributed under
ordinary least squares (→ III/1.4.7):
[β̂0; β̂1] ∼ N( [β0; β1], σ²/((n − 1) sx²) · [xᵀx/n, −x̄; −x̄, 1] ) .        (1)
Because the covariance matrix (→ I/1.13.9) of the multivariate normal distribution (→ II/4.1.1)
contains the pairwise covariances of the random variables (→ I/1.2.2), we can deduce that the
covariance (→ I/1.13.1) of β̂0 and β̂1 is:
Cov(β̂0, β̂1) = − σ² x̄ / ((n − 1) sx²)        (2)
where σ 2 is the noise variance (→ III/1.4.1), s2x is the sample variance (→ I/1.11.2) of x and n is the
number of observations. When x is mean-centered, we have x̄ = 0, such that:
Cov(β̂0, β̂1) = 0 .        (3)
■
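Both (2) and (3) can be checked numerically against the off-diagonal of σ²(XᵀX)⁻¹ (toy covariate; names illustrative):

```python
# Sketch: off-diagonal of sigma^2 (X^T X)^{-1} equals
# -sigma^2 xbar / ((n-1) s_x^2), and vanishes after mean-centering x.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
n, sigma2 = x.size, 1.0

def cov_beta(x):
    X = np.column_stack([np.ones(x.size), x])
    return sigma2 * np.linalg.inv(X.T @ X)

sx2 = ((x - x.mean()) ** 2).sum() / (n - 1)
print(np.isclose(cov_beta(x)[0, 1],
                 -sigma2 * x.mean() / ((n - 1) * sx2)))   # True
print(np.isclose(cov_beta(x - x.mean())[0, 1], 0.0))      # True
```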
Proof:
1) Under unaltered y and x, ordinary least squares estimates for simple linear regression (→ III/1.4.3)
are
β̂0 = ȳ − β̂1 x̄
β̂1 = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)² = sxy / sx²        (1)
with sample means (→ I/1.10.2) x̄ and ȳ, sample variance (→ I/1.11.2) s2x and sample covariance
(→ I/1.13.2) sxy , such that β0 estimates “the mean y at x = 0”.
2) When the covariate is mean-centered, i.e. x̃i = xi − x̄, we can see that β̂1 (x̃, y) = β̂1 (x, y) (centering x changes neither the sample covariance nor the sample variance), but β̂0 (x̃, y) = ȳ ̸= β̂0 (x, y); specifically, β0 now estimates "the mean y at the mean x".
3) When the data are mean-centered, i.e. ỹi = yi − ȳ, we can see that β̂1 (x, ỹ) = β̂1 (x, y) (centering y does not change the sample covariance), but β̂0 (x, ỹ) = −β̂1 x̄ ̸= β̂0 (x, y); specifically, β0 now estimates "the mean x, multiplied with the negative slope".
4) When both x and y are mean-centered,

x̃i = xi − x̄ ⇒ x̃¯ = 0
ỹi = yi − ȳ ⇒ ỹ¯ = 0 ,        (6)

and we can see that β̂1 (x̃, ỹ) = β̂1 (x, y), but β̂0 (x̃, ỹ) ̸= β̂0 (x, y); specifically, β0 is now forced to become zero.
■
yi = β0 + β1 xi + εi , εi ∼ N (0, σ 2 ) . (1)
Then, given some parameters β0 , β1 ∈ R, the set
L(β0 , β1 ) = (x, y) ∈ R2 | y = β0 + β1 x (2)
is called a "regression line". Accordingly, for the estimated parameters (→ III/1.4.3) β̂0 and β̂1, the set L(β̂0, β̂1) = {(x, y) ∈ R² | y = β̂0 + β̂1 x} is called the "estimated regression line".
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_
line.
Proof: Plugging the point (x̄, ȳ) into the regression line equation and using the ordinary least squares estimate (→ III/1.4.3) β̂0 = ȳ − β̂1 x̄ gives

ȳ = β̂0 + β̂1 x̄
ȳ = ȳ − β̂1 x̄ + β̂1 x̄        (2)
ȳ = ȳ
which is a true statement. Thus, the regression line (→ III/1.4.10) goes through the center of mass
point (x̄, ȳ), if the model (→ III/1.4.1) includes an intercept term β0 .
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Numerical_properties.
Proof: The intersection point of the regression line (→ III/1.4.10) with the y-axis is
S(0|β̂0 ) . (3)
Let a be a vector describing the direction of the regression line, let b be the vector pointing from S
to O and let p be the vector pointing from S to P .
Because β̂1 is the slope of the regression line, we have
a = [1; β̂1] .        (4)
Moreover, with the points O and S, we have
b = [xo; yo] − [0; β̂0] = [xo; yo − β̂0] .        (5)
Because P is located on the regression line, p is collinear with a and thus a scalar multiple of this
vector:
p=w·a. (6)
Moreover, as P is the point on the regression line which is closest to O, this means that the vector
b − p is orthogonal to a, such that the inner product of these two vectors is equal to zero:
aT (b − p) = 0 . (7)
Rearranging this equation gives
aᵀ(b − p) = 0
aᵀ(b − w · a) = 0
aᵀb − w · aᵀa = 0        (8)
w · aᵀa = aᵀb
w = aᵀb / aᵀa .
With (4) and (5), w can be calculated as
w = aᵀb / aᵀa
  = [1; β̂1]ᵀ [xo; yo − β̂0] / ( [1; β̂1]ᵀ [1; β̂1] )        (9)
  = ( xo + (yo − β̂0) β̂1 ) / ( 1 + β̂1² ) .
Finally, with the point S (3) and the vector p (6), the coordinates of P are obtained as
[xp; yp] = [0; β̂0] + w · [1; β̂1] = [w; β̂0 + β̂1 w] .        (10)
Together, (10) and (9) constitute the proof of equation (2).
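The closed-form projection (9)-(10) can be checked against a brute-force search over points on the line (values are illustrative):

```python
# Sketch of (9)-(10): foot of the perpendicular from O = (xo, yo) onto
# the line y = b0 + b1*x, compared with a dense search on the line.
import numpy as np

b0, b1 = 1.0, 0.5
xo, yo = 3.0, 4.0

w = (xo + (yo - b0) * b1) / (1 + b1 ** 2)
P = np.array([w, b0 + b1 * w])                 # closed form (10)

ts = np.linspace(-10, 10, 200001)              # brute-force candidates
pts = np.stack([ts, b0 + b1 * ts], axis=1)
best = pts[np.argmin(((pts - [xo, yo]) ** 2).sum(axis=1))]
print(np.allclose(P, best, atol=1e-3))         # True
```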
Sources:
• Penny, William (2006): “Projections”; in: Mathematics for Brain Imaging, ch. 1.4.10, pp. 34-35,
eqs. 1.87/1.88; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
TSS = (n − 1) sy²
ESS = (n − 1) sxy² / sx²        (1)
RSS = (n − 1) ( sy² − sxy² / sx² )
where s2x and s2y are the sample variances (→ I/1.11.2) of x and y and sxy is the sample covariance
(→ I/1.13.2) between x and y.
Proof: The ordinary least squares parameter estimates (→ III/1.4.3) are given by
β̂0 = ȳ − β̂1 x̄   and   β̂1 = sxy / sx² .        (2)
The total sum of squares (→ III/1.5.6) is thus

TSS = ∑_{i=1}^{n} (yi − ȳ)²
    = (n − 1) · (1/(n−1)) ∑_{i=1}^{n} (yi − ȳ)²        (4)
    = (n − 1) sy² .
The explained sum of squares (→ III/1.5.7) is

ESS = ∑_{i=1}^{n} (ŷi − ȳ)²
    = ∑_{i=1}^{n} (β̂0 + β̂1 xi − ȳ)²
    = ∑_{i=1}^{n} (ȳ − β̂1 x̄ + β̂1 xi − ȳ)²        [by (2)]
    = ∑_{i=1}^{n} ( β̂1 (xi − x̄) )²
    = ∑_{i=1}^{n} ( (sxy/sx²)(xi − x̄) )²        [by (2)]        (6)
    = (sxy/sx²)² ∑_{i=1}^{n} (xi − x̄)²
    = (sxy²/sx⁴) (n − 1) sx²
    = (n − 1) sxy²/sx² .
And the residual sum of squares (→ III/1.5.8) is

RSS = ∑_{i=1}^{n} (yi − ŷi)²
    = ∑_{i=1}^{n} (yi − β̂0 − β̂1 xi)²
    = ∑_{i=1}^{n} (yi − ȳ + β̂1 x̄ − β̂1 xi)²        [by (2)]
    = ∑_{i=1}^{n} ( (yi − ȳ) − β̂1 (xi − x̄) )²
    = ∑_{i=1}^{n} [ (yi − ȳ)² − 2 β̂1 (xi − x̄)(yi − ȳ) + β̂1² (xi − x̄)² ]        (8)
    = ∑_{i=1}^{n} (yi − ȳ)² − 2 β̂1 ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) + β̂1² ∑_{i=1}^{n} (xi − x̄)²
    = (n − 1) sy² − 2 (sxy/sx²)(n − 1) sxy + (sxy²/sx⁴)(n − 1) sx²        [by (2)]
    = (n − 1) ( sy² − sxy²/sx² ) .
■
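The three identities in (1) can be verified numerically (simulated data; names illustrative):

```python
# Sketch of (1): TSS, ESS and RSS from the fitted line match their
# expressions in terms of sample (co)variances.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=30)
y = 0.7 * x + rng.normal(scale=0.5, size=30)
n = x.size

sx2 = x.var(ddof=1); sy2 = y.var(ddof=1)
sxy = np.cov(x, y, ddof=1)[0, 1]
b1 = sxy / sx2; b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

tss = ((y - y.mean()) ** 2).sum()
ess = ((yhat - y.mean()) ** 2).sum()
rss = ((y - yhat) ** 2).sum()
print(np.allclose([tss, ess, rss],
                  [(n - 1) * sy2,
                   (n - 1) * sxy ** 2 / sx2,
                   (n - 1) * (sy2 - sxy ** 2 / sx2)]))   # True
```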
For simple linear regression (→ III/1.4.1), the estimation matrix E, the projection matrix P and the residual-forming matrix R are given by

E = (1/((n − 1) sx²)) [ (xᵀx/n) 1nᵀ − x̄ xᵀ ; −x̄ 1nᵀ + xᵀ ]

P = (1/((n − 1) sx²)) ·
[ (xᵀx/n) − 2x̄x1 + x1²                  ⋯  (xᵀx/n) − x̄(x1 + xn) + x1 xn ]
[            ⋮                          ⋱               ⋮               ]        (1)
[ (xᵀx/n) − x̄(x1 + xn) + x1 xn          ⋯  (xᵀx/n) − 2x̄xn + xn²        ]

R = (1/((n − 1) sx²)) ·
[ (n − 1)(xᵀx/n) + x̄(2x1 − nx̄) − x1²    ⋯  −(xᵀx/n) + x̄(x1 + xn) − x1 xn        ]
[            ⋮                          ⋱               ⋮                       ]
[ −(xᵀx/n) + x̄(x1 + xn) − x1 xn         ⋯  (n − 1)(xᵀx/n) + x̄(2xn − nx̄) − xn²   ]
where 1n is an n × 1 vector of ones, x is the n × 1 single predictor variable, x̄ is the sample mean (→
I/1.10.2) of x and s2x is the sample variance (→ I/1.11.2) of x.
Proof: Simple linear regression is a special case of multiple linear regression (→ III/1.4.2) with
X = [1n  x]   and   β = [β0; β1] ,        (2)
such that the simple linear regression model can also be written as
y = Xβ + ε, ε ∼ N (0, σ 2 In ) . (3)
Moreover, we note the following equality (→ III/1.4.7):

xᵀx − n x̄² = (n − 1) sx² .        (4)

The estimation matrix is given by

E = (XᵀX)⁻¹Xᵀ ,        (5)

which is a 2 × n matrix and can be reformulated as follows:
E = (XᵀX)⁻¹Xᵀ
  = ( [1nᵀ; xᵀ] [1n  x] )⁻¹ [1nᵀ; xᵀ]
  = [n, n x̄; n x̄, xᵀx]⁻¹ [1nᵀ; xᵀ]
  = (1/(n xᵀx − (n x̄)²)) [xᵀx, −n x̄; −n x̄, n] [1nᵀ; xᵀ]        (6)
  = (1/(xᵀx − n x̄²)) [xᵀx/n, −x̄; −x̄, 1] [1nᵀ; xᵀ]
  = (1/((n − 1) sx²)) [ (xᵀx/n) 1nᵀ − x̄ xᵀ ; −x̄ 1nᵀ + xᵀ ] .        [by (4)]
Denoting the two rows of E as e1 and e2, the projection matrix becomes:

P = X E = [1n  x] [e1; e2]
  = (1/((n − 1) sx²)) [1, x1; ⋮, ⋮; 1, xn] [ (xᵀx/n) − x̄x1  ⋯  (xᵀx/n) − x̄xn ; −x̄ + x1  ⋯  −x̄ + xn ]        (8)
  = (1/((n − 1) sx²)) ·
[ (xᵀx/n) − 2x̄x1 + x1²           ⋯  (xᵀx/n) − x̄(x1 + xn) + x1 xn ]
[            ⋮                   ⋱               ⋮               ]
[ (xᵀx/n) − x̄(x1 + xn) + x1 xn   ⋯  (xᵀx/n) − 2x̄xn + xn²        ] .
$$\begin{aligned} R = I_n - P &= \begin{bmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{bmatrix} - \begin{bmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{n1} & \cdots & p_{nn} \end{bmatrix} \\ &\overset{(4)}{=} \frac{1}{(n-1)\, s_x^2} \begin{bmatrix} x^\mathrm{T}x - n\bar{x}^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & x^\mathrm{T}x - n\bar{x}^2 \end{bmatrix} \\ &\hspace{2em} - \frac{1}{(n-1)\, s_x^2} \begin{bmatrix} (x^\mathrm{T}x/n) - 2\bar{x}x_1 + x_1^2 & \cdots & (x^\mathrm{T}x/n) - \bar{x}(x_1+x_n) + x_1 x_n \\ \vdots & \ddots & \vdots \\ (x^\mathrm{T}x/n) - \bar{x}(x_1+x_n) + x_1 x_n & \cdots & (x^\mathrm{T}x/n) - 2\bar{x}x_n + x_n^2 \end{bmatrix} \\ &= \frac{1}{(n-1)\, s_x^2} \begin{bmatrix} (n-1)(x^\mathrm{T}x/n) + \bar{x}(2x_1 - n\bar{x}) - x_1^2 & \cdots & -(x^\mathrm{T}x/n) + \bar{x}(x_1+x_n) - x_1 x_n \\ \vdots & \ddots & \vdots \\ -(x^\mathrm{T}x/n) + \bar{x}(x_1+x_n) - x_1 x_n & \cdots & (n-1)(x^\mathrm{T}x/n) + \bar{x}(2x_n - n\bar{x}) - x_n^2 \end{bmatrix} . \end{aligned} \quad (10)$$
■
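The closed-form entries of E and P derived above can be checked against their defining formulas E = (XᵀX)⁻¹Xᵀ and P = XE. A short NumPy sketch (illustrative only, with random data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])     # design matrix [1_n, x]

E = np.linalg.inv(X.T @ X) @ X.T         # estimation matrix
P = X @ E                                # projection matrix
R = np.eye(n) - P                        # residual-forming matrix

xbar = x.mean()
c = 1.0 / ((n - 1) * x.var(ddof=1))      # prefactor 1 / ((n-1) s_x^2)

# closed-form entries from equation (1)
E_closed = c * np.vstack([(x @ x / n) * np.ones(n) - xbar * x,
                          -xbar * np.ones(n) + x])
P_closed = c * ((x @ x / n) - xbar * (x[:, None] + x[None, :])
                + x[:, None] * x[None, :])
assert np.allclose(E, E_closed)
assert np.allclose(P, P_closed)
assert np.allclose(R, np.eye(n) - P_closed)
```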
$$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 V) , \quad (1)$$

the parameters minimizing the weighted residual sum of squares (→ III/1.5.8) are given by

$$\begin{aligned} \hat{\beta}_0 &= \frac{x^\mathrm{T} V^{-1} x \; 1_n^\mathrm{T} V^{-1} y - 1_n^\mathrm{T} V^{-1} x \; x^\mathrm{T} V^{-1} y}{x^\mathrm{T} V^{-1} x \; 1_n^\mathrm{T} V^{-1} 1_n - 1_n^\mathrm{T} V^{-1} x \; x^\mathrm{T} V^{-1} 1_n} \\[1ex] \hat{\beta}_1 &= \frac{1_n^\mathrm{T} V^{-1} 1_n \; x^\mathrm{T} V^{-1} y - x^\mathrm{T} V^{-1} 1_n \; 1_n^\mathrm{T} V^{-1} y}{1_n^\mathrm{T} V^{-1} 1_n \; x^\mathrm{T} V^{-1} x - x^\mathrm{T} V^{-1} 1_n \; 1_n^\mathrm{T} V^{-1} x} . \end{aligned} \quad (2)$$
Proof: Let there be an n × n matrix W, such that

$$W V W^\mathrm{T} = I_n . \quad (3)$$
Since V is a covariance matrix and thus symmetric, W is also symmetric and can be expressed as the matrix square root of the inverse of V:

$$W V W = I_n \quad \Leftrightarrow \quad V = W^{-1} W^{-1} \quad \Leftrightarrow \quad V^{-1} = W W \quad \Leftrightarrow \quad W = V^{-1/2} . \quad (4)$$

Because β0 is a scalar, (1) may also be written as

$$y = \beta_0 1_n + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 V) . \quad (5)$$

Left-multiplying (5) with W, the linear transformation theorem (→ II/4.1.12) implies that

$$W y = \beta_0 W 1_n + \beta_1 W x + W \varepsilon, \quad W \varepsilon \sim N(0, \sigma^2 W V W^\mathrm{T}) . \quad (6)$$

Applying (3), we see that (6) is actually a linear regression model (→ III/1.5.1) with independent observations

$$\tilde{y} = \begin{bmatrix} \tilde{x}_0 & \tilde{x} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \tilde{\varepsilon}, \quad \tilde{\varepsilon} \sim N(0, \sigma^2 I_n) \quad (7)$$
where ỹ = W y, x̃0 = W 1n, x̃ = W x and ε̃ = W ε, such that we can apply the ordinary least squares solution (→ III/1.5.3) giving:

$$\begin{aligned} \hat{\beta} &= (\tilde{X}^\mathrm{T} \tilde{X})^{-1} \tilde{X}^\mathrm{T} \tilde{y} \\ &= \left( \begin{bmatrix} \tilde{x}_0^\mathrm{T} \\ \tilde{x}^\mathrm{T} \end{bmatrix} \begin{bmatrix} \tilde{x}_0 & \tilde{x} \end{bmatrix} \right)^{-1} \begin{bmatrix} \tilde{x}_0^\mathrm{T} \\ \tilde{x}^\mathrm{T} \end{bmatrix} \tilde{y} . \end{aligned} \quad (8)$$

Applying the inverse of a 2 × 2 matrix, this becomes

$$\begin{aligned} \hat{\beta} &= \frac{1}{\tilde{x}_0^\mathrm{T}\tilde{x}_0 \, \tilde{x}^\mathrm{T}\tilde{x} - \tilde{x}_0^\mathrm{T}\tilde{x} \, \tilde{x}^\mathrm{T}\tilde{x}_0} \begin{bmatrix} \tilde{x}^\mathrm{T}\tilde{x} & -\tilde{x}_0^\mathrm{T}\tilde{x} \\ -\tilde{x}^\mathrm{T}\tilde{x}_0 & \tilde{x}_0^\mathrm{T}\tilde{x}_0 \end{bmatrix} \begin{bmatrix} \tilde{x}_0^\mathrm{T} \\ \tilde{x}^\mathrm{T} \end{bmatrix} \tilde{y} \\ &= \frac{1}{\tilde{x}_0^\mathrm{T}\tilde{x}_0 \, \tilde{x}^\mathrm{T}\tilde{x} - \tilde{x}_0^\mathrm{T}\tilde{x} \, \tilde{x}^\mathrm{T}\tilde{x}_0} \begin{bmatrix} \tilde{x}^\mathrm{T}\tilde{x} \, \tilde{x}_0^\mathrm{T} - \tilde{x}_0^\mathrm{T}\tilde{x} \, \tilde{x}^\mathrm{T} \\ \tilde{x}_0^\mathrm{T}\tilde{x}_0 \, \tilde{x}^\mathrm{T} - \tilde{x}^\mathrm{T}\tilde{x}_0 \, \tilde{x}_0^\mathrm{T} \end{bmatrix} \tilde{y} \\ &= \frac{1}{\tilde{x}_0^\mathrm{T}\tilde{x}_0 \, \tilde{x}^\mathrm{T}\tilde{x} - \tilde{x}_0^\mathrm{T}\tilde{x} \, \tilde{x}^\mathrm{T}\tilde{x}_0} \begin{bmatrix} \tilde{x}^\mathrm{T}\tilde{x} \, \tilde{x}_0^\mathrm{T}\tilde{y} - \tilde{x}_0^\mathrm{T}\tilde{x} \, \tilde{x}^\mathrm{T}\tilde{y} \\ \tilde{x}_0^\mathrm{T}\tilde{x}_0 \, \tilde{x}^\mathrm{T}\tilde{y} - \tilde{x}^\mathrm{T}\tilde{x}_0 \, \tilde{x}_0^\mathrm{T}\tilde{y} \end{bmatrix} . \end{aligned} \quad (9)$$

Substituting back x̃0 = W 1n, x̃ = W x and ỹ = W y, we obtain

$$\begin{aligned} \hat{\beta} &= \frac{1}{1_n^\mathrm{T} W^\mathrm{T}W 1_n \, x^\mathrm{T}W^\mathrm{T}W x - 1_n^\mathrm{T}W^\mathrm{T}W x \, x^\mathrm{T}W^\mathrm{T}W 1_n} \begin{bmatrix} x^\mathrm{T}W^\mathrm{T}W x \, 1_n^\mathrm{T}W^\mathrm{T}W y - 1_n^\mathrm{T}W^\mathrm{T}W x \, x^\mathrm{T}W^\mathrm{T}W y \\ 1_n^\mathrm{T}W^\mathrm{T}W 1_n \, x^\mathrm{T}W^\mathrm{T}W y - x^\mathrm{T}W^\mathrm{T}W 1_n \, 1_n^\mathrm{T}W^\mathrm{T}W y \end{bmatrix} \\ &\overset{(4)}{=} \frac{1}{x^\mathrm{T}V^{-1}x \, 1_n^\mathrm{T}V^{-1}1_n - 1_n^\mathrm{T}V^{-1}x \, x^\mathrm{T}V^{-1}1_n} \begin{bmatrix} x^\mathrm{T}V^{-1}x \, 1_n^\mathrm{T}V^{-1}y - 1_n^\mathrm{T}V^{-1}x \, x^\mathrm{T}V^{-1}y \\ 1_n^\mathrm{T}V^{-1}1_n \, x^\mathrm{T}V^{-1}y - x^\mathrm{T}V^{-1}1_n \, 1_n^\mathrm{T}V^{-1}y \end{bmatrix} \\ &= \begin{bmatrix} \dfrac{x^\mathrm{T}V^{-1}x \, 1_n^\mathrm{T}V^{-1}y - 1_n^\mathrm{T}V^{-1}x \, x^\mathrm{T}V^{-1}y}{x^\mathrm{T}V^{-1}x \, 1_n^\mathrm{T}V^{-1}1_n - 1_n^\mathrm{T}V^{-1}x \, x^\mathrm{T}V^{-1}1_n} \\[2ex] \dfrac{1_n^\mathrm{T}V^{-1}1_n \, x^\mathrm{T}V^{-1}y - x^\mathrm{T}V^{-1}1_n \, 1_n^\mathrm{T}V^{-1}y}{1_n^\mathrm{T}V^{-1}1_n \, x^\mathrm{T}V^{-1}x - x^\mathrm{T}V^{-1}1_n \, 1_n^\mathrm{T}V^{-1}x} \end{bmatrix} . \end{aligned} \quad (10)$$
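The scalar formulas for β̂0 and β̂1 in (2) can be compared against the generic weighted least squares solution (XᵀV⁻¹X)⁻¹XᵀV⁻¹y. A NumPy sketch (illustrative; data and covariance are randomly generated):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
x = rng.normal(size=n)
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)              # a positive-definite covariance matrix
y = 1.0 + 0.5 * x + rng.normal(size=n)

ones = np.ones(n)
Vi = np.linalg.inv(V)

# scalar building blocks of equation (2)
xVx, xVy, xV1 = x @ Vi @ x, x @ Vi @ y, x @ Vi @ ones
oV1, oVy = ones @ Vi @ ones, ones @ Vi @ y
den = xVx * oV1 - xV1 * xV1              # note 1'V^-1 x = x'V^-1 1 (V symmetric)
b0 = (xVx * oVy - xV1 * xVy) / den
b1 = (oV1 * xVy - xV1 * oVy) / den

# compare against the generic WLS solution
X = np.column_stack([ones, x])
beta = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)
assert np.allclose([b0, b1], beta)
```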
$$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 V) , \quad (1)$$

the parameters minimizing the weighted residual sum of squares (→ III/1.5.8) are given by

$$\begin{aligned} \hat{\beta}_0 &= \frac{x^\mathrm{T} V^{-1} x \; 1_n^\mathrm{T} V^{-1} y - 1_n^\mathrm{T} V^{-1} x \; x^\mathrm{T} V^{-1} y}{x^\mathrm{T} V^{-1} x \; 1_n^\mathrm{T} V^{-1} 1_n - 1_n^\mathrm{T} V^{-1} x \; x^\mathrm{T} V^{-1} 1_n} \\[1ex] \hat{\beta}_1 &= \frac{1_n^\mathrm{T} V^{-1} 1_n \; x^\mathrm{T} V^{-1} y - x^\mathrm{T} V^{-1} 1_n \; 1_n^\mathrm{T} V^{-1} y}{1_n^\mathrm{T} V^{-1} 1_n \; x^\mathrm{T} V^{-1} x - x^\mathrm{T} V^{-1} 1_n \; 1_n^\mathrm{T} V^{-1} x} . \end{aligned} \quad (2)$$

Proof: Simple linear regression is a special case of multiple linear regression (→ III/1.4.2) with

$$X = \begin{bmatrix} 1_n & x \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \quad (3)$$
and the weighted least squares parameter estimates (→ III/1.5.20) are given by

$$\hat{\beta} = (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} y . \quad (4)$$

Writing out equation (4), we have
$$\begin{aligned} \hat{\beta} &= \left( \begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix} V^{-1} \begin{bmatrix} 1_n & x \end{bmatrix} \right)^{-1} \begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix} V^{-1} y \\ &= \begin{bmatrix} 1_n^\mathrm{T} V^{-1} 1_n & 1_n^\mathrm{T} V^{-1} x \\ x^\mathrm{T} V^{-1} 1_n & x^\mathrm{T} V^{-1} x \end{bmatrix}^{-1} \begin{bmatrix} 1_n^\mathrm{T} V^{-1} y \\ x^\mathrm{T} V^{-1} y \end{bmatrix} \\ &= \frac{1}{x^\mathrm{T} V^{-1} x \, 1_n^\mathrm{T} V^{-1} 1_n - 1_n^\mathrm{T} V^{-1} x \, x^\mathrm{T} V^{-1} 1_n} \begin{bmatrix} x^\mathrm{T} V^{-1} x & -1_n^\mathrm{T} V^{-1} x \\ -x^\mathrm{T} V^{-1} 1_n & 1_n^\mathrm{T} V^{-1} 1_n \end{bmatrix} \begin{bmatrix} 1_n^\mathrm{T} V^{-1} y \\ x^\mathrm{T} V^{-1} y \end{bmatrix} \\ &= \frac{1}{x^\mathrm{T} V^{-1} x \, 1_n^\mathrm{T} V^{-1} 1_n - 1_n^\mathrm{T} V^{-1} x \, x^\mathrm{T} V^{-1} 1_n} \begin{bmatrix} x^\mathrm{T} V^{-1} x \, 1_n^\mathrm{T} V^{-1} y - 1_n^\mathrm{T} V^{-1} x \, x^\mathrm{T} V^{-1} y \\ 1_n^\mathrm{T} V^{-1} 1_n \, x^\mathrm{T} V^{-1} y - x^\mathrm{T} V^{-1} 1_n \, 1_n^\mathrm{T} V^{-1} y \end{bmatrix} . \end{aligned} \quad (5)$$

Thus, the individual parameter estimates are

$$\hat{\beta}_0 = \frac{x^\mathrm{T} V^{-1} x \, 1_n^\mathrm{T} V^{-1} y - 1_n^\mathrm{T} V^{-1} x \, x^\mathrm{T} V^{-1} y}{x^\mathrm{T} V^{-1} x \, 1_n^\mathrm{T} V^{-1} 1_n - 1_n^\mathrm{T} V^{-1} x \, x^\mathrm{T} V^{-1} 1_n} \quad \text{and} \quad \hat{\beta}_1 = \frac{1_n^\mathrm{T} V^{-1} 1_n \, x^\mathrm{T} V^{-1} y - x^\mathrm{T} V^{-1} 1_n \, 1_n^\mathrm{T} V^{-1} y}{1_n^\mathrm{T} V^{-1} 1_n \, x^\mathrm{T} V^{-1} x - x^\mathrm{T} V^{-1} 1_n \, 1_n^\mathrm{T} V^{-1} x} . \quad (6)$$
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n , \quad (1)$$

the maximum likelihood estimates (→ I/4.1.3) of β0, β1 and σ2 are given by

$$\begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\ \hat{\beta}_1 &= \frac{s_{xy}}{s_x^2} \\ \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \end{aligned} \quad (2)$$
where x̄ and ȳ are the sample means (→ I/1.10.2), s2x is the sample variance (→ I/1.11.2) of x and
sxy is the sample covariance (→ I/1.13.2) between x and y.
Proof: With the probability density function of the normal distribution (→ II/3.2.10) and probability
under independence (→ I/1.3.6), the linear regression equation (1) implies the following likelihood
function (→ I/5.1.2)
$$\begin{aligned} p(y | \beta_0, \beta_1, \sigma^2) &= \prod_{i=1}^n p(y_i | \beta_0, \beta_1, \sigma^2) \\ &= \prod_{i=1}^n N(y_i; \beta_0 + \beta_1 x_i, \sigma^2) \\ &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma} \cdot \exp\left[ -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right] \\ &= \frac{1}{\sqrt{(2\pi\sigma^2)^n}} \cdot \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \right] \end{aligned} \quad (3)$$

and thus, the log-likelihood function

$$\mathrm{LL}(\beta_0, \beta_1, \sigma^2) = \log p(y | \beta_0, \beta_1, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 . \quad (4)$$

The derivative of the log-likelihood function (4) with respect to β0 is

$$\frac{\mathrm{d}\,\mathrm{LL}(\beta_0, \beta_1, \sigma^2)}{\mathrm{d}\beta_0} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) \quad (5)$$
and setting this derivative to zero gives the MLE for β0 :
$$\begin{aligned} \frac{\mathrm{d}\,\mathrm{LL}(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2)}{\mathrm{d}\beta_0} &= 0 \\ 0 &= \frac{1}{\hat{\sigma}^2} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \\ 0 &= \sum_{i=1}^n y_i - n \hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^n x_i \\ \hat{\beta}_0 &= \frac{1}{n} \sum_{i=1}^n y_i - \hat{\beta}_1 \frac{1}{n} \sum_{i=1}^n x_i \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} . \end{aligned} \quad (6)$$
The derivative of the log-likelihood function (4) at $\hat{\beta}_0$ with respect to β1 is

$$\frac{\mathrm{d}\,\mathrm{LL}(\hat{\beta}_0, \beta_1, \sigma^2)}{\mathrm{d}\beta_1} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i y_i - \hat{\beta}_0 x_i - \beta_1 x_i^2) \quad (7)$$
and setting this derivative to zero gives the MLE for β1:

$$\begin{aligned} \frac{\mathrm{d}\,\mathrm{LL}(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2)}{\mathrm{d}\beta_1} &= 0 \\ 0 &= \frac{1}{\hat{\sigma}^2} \sum_{i=1}^n (x_i y_i - \hat{\beta}_0 x_i - \hat{\beta}_1 x_i^2) \\ 0 &= \sum_{i=1}^n x_i y_i - \hat{\beta}_0 \sum_{i=1}^n x_i - \hat{\beta}_1 \sum_{i=1}^n x_i^2 \\ 0 &\overset{(6)}{=} \sum_{i=1}^n x_i y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) \sum_{i=1}^n x_i - \hat{\beta}_1 \sum_{i=1}^n x_i^2 \\ 0 &= \sum_{i=1}^n x_i y_i - \bar{y} \sum_{i=1}^n x_i + \hat{\beta}_1 \bar{x} \sum_{i=1}^n x_i - \hat{\beta}_1 \sum_{i=1}^n x_i^2 \\ 0 &= \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} + \hat{\beta}_1 n\bar{x}^2 - \hat{\beta}_1 \sum_{i=1}^n x_i^2 \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n x_i y_i - \sum_{i=1}^n \bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - \sum_{i=1}^n \bar{x}^2} \\ \hat{\beta}_1 &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ \hat{\beta}_1 &= \frac{s_{xy}}{s_x^2} . \end{aligned} \quad (8)$$
The derivative of the log-likelihood function (4) at (β̂0 , β̂1 ) with respect to σ 2 is
$$\frac{\mathrm{d}\,\mathrm{LL}(\hat{\beta}_0, \hat{\beta}_1, \sigma^2)}{\mathrm{d}\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \quad (9)$$
and setting this derivative to zero gives the MLE for σ 2 :
$$\begin{aligned} \frac{\mathrm{d}\,\mathrm{LL}(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2)}{\mathrm{d}\sigma^2} &= 0 \\ 0 &= -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \\ \frac{n}{2\hat{\sigma}^2} &= \frac{1}{2(\hat{\sigma}^2)^2} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \\ \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 . \end{aligned} \quad (10)$$
Together, (6), (8) and (10) constitute the MLE for simple linear regression.
■
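The MLE can be checked numerically by verifying that the score equations (5), (7) and (9) vanish at the closed-form estimates from (2). A NumPy sketch (illustrative data; not part of the original proof):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(size=n)
y = -1.0 + 0.8 * x + 0.5 * rng.normal(size=n)

sxy = np.cov(x, y)[0, 1]
b1 = sxy / x.var(ddof=1)                 # slope MLE
b0 = y.mean() - b1 * x.mean()            # intercept MLE
res = y - b0 - b1 * x
s2 = np.mean(res ** 2)                   # variance MLE uses 1/n, not 1/(n-1)

# all three derivatives of the log-likelihood are zero at the MLE
assert np.isclose(res.sum() / s2, 0.0, atol=1e-6)                          # eq. (5)
assert np.isclose((x * res).sum() / s2, 0.0, atol=1e-6)                    # eq. (7)
assert np.isclose(-n / (2 * s2) + (res**2).sum() / (2 * s2**2), 0.0,
                  atol=1e-6)                                               # eq. (9)
```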
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n , \quad (1)$$

the maximum likelihood estimates (→ I/4.1.3) of β0, β1 and σ2 are given by

$$\begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\ \hat{\beta}_1 &= \frac{s_{xy}}{s_x^2} \\ \hat{\sigma}^2 &= \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \end{aligned} \quad (2)$$
where x̄ and ȳ are the sample means (→ I/1.10.2), s2x is the sample variance (→ I/1.11.2) of x and
sxy is the sample covariance (→ I/1.13.2) between x and y.
Proof: Simple linear regression is a special case of multiple linear regression (→ III/1.4.2) with
$$X = \begin{bmatrix} 1_n & x \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \quad (3)$$

and the weighted least squares estimates (→ III/1.5.26) are given by

$$\begin{aligned} \hat{\beta} &= (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} y \\ \hat{\sigma}^2 &= \frac{1}{n} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta}) . \end{aligned} \quad (4)$$
Under independent observations, the covariance matrix is

$$V = I_n , \quad (5)$$

such that the parameter estimate from (4) becomes

$$\begin{aligned} \hat{\beta} &= \left( \begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix} V^{-1} \begin{bmatrix} 1_n & x \end{bmatrix} \right)^{-1} \begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix} V^{-1} y \\ &= \left( \begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix} \begin{bmatrix} 1_n & x \end{bmatrix} \right)^{-1} \begin{bmatrix} 1_n^\mathrm{T} \\ x^\mathrm{T} \end{bmatrix} y \end{aligned} \quad (6)$$
which is equal to the ordinary least squares solution for simple linear regression (→ III/1.4.4):
$$\begin{aligned} \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \\ \hat{\beta}_1 &= \frac{s_{xy}}{s_x^2} . \end{aligned} \quad (7)$$
Likewise, the noise variance estimate from (4) becomes

$$\begin{aligned} \hat{\sigma}^2 &= \frac{1}{n} (y - X\hat{\beta})^\mathrm{T} V^{-1} (y - X\hat{\beta}) \\ &= \frac{1}{n} \left( y - \begin{bmatrix} 1_n & x \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \right)^\mathrm{T} \left( y - \begin{bmatrix} 1_n & x \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} \right) \\ &= \frac{1}{n} \left( y - \hat{\beta}_0 - \hat{\beta}_1 x \right)^\mathrm{T} \left( y - \hat{\beta}_0 - \hat{\beta}_1 x \right) \\ &= \frac{1}{n} \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 . \end{aligned} \quad (8)$$
Proof: The residuals are defined as the estimated error terms (→ III/1.4.1)
$$\begin{aligned} \sum_{i=1}^n \hat{\varepsilon}_i &= \sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \\ &= \sum_{i=1}^n (y_i - \bar{y} + \hat{\beta}_1 \bar{x} - \hat{\beta}_1 x_i) \\ &= \sum_{i=1}^n y_i - n\bar{y} + \hat{\beta}_1 n\bar{x} - \hat{\beta}_1 \sum_{i=1}^n x_i \\ &= n\bar{y} - n\bar{y} + \hat{\beta}_1 n\bar{x} - \hat{\beta}_1 n\bar{x} \\ &= 0 . \end{aligned} \quad (3)$$
Thus, the sum of the residuals (→ III/1.5.8) is zero under ordinary least squares (→ III/1.4.3), if
the model (→ III/1.4.1) includes an intercept term β0 .
■
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Numerical_properties.
Proof: The residuals are defined as the estimated error terms (→ III/1.4.1)
$$\begin{aligned} \sum_{i=1}^n x_i \hat{\varepsilon}_i &= \sum_{i=1}^n x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) \\ &= \sum_{i=1}^n \left( x_i y_i - \hat{\beta}_0 x_i - \hat{\beta}_1 x_i^2 \right) \\ &= \sum_{i=1}^n \left( x_i y_i - x_i (\bar{y} - \hat{\beta}_1 \bar{x}) - \hat{\beta}_1 x_i^2 \right) \\ &= \sum_{i=1}^n \left( x_i (y_i - \bar{y}) + \hat{\beta}_1 (\bar{x} x_i - x_i^2) \right) \\ &= \left( \sum_{i=1}^n x_i y_i - \bar{y} \sum_{i=1}^n x_i \right) - \hat{\beta}_1 \left( \sum_{i=1}^n x_i^2 - \bar{x} \sum_{i=1}^n x_i \right) \\ &= \left( \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} - n\bar{x}\bar{y} + n\bar{x}\bar{y} \right) - \hat{\beta}_1 \left( \sum_{i=1}^n x_i^2 - 2 n\bar{x}\bar{x} + n\bar{x}^2 \right) \\ &= \left( \sum_{i=1}^n x_i y_i - \bar{y} \sum_{i=1}^n x_i - \bar{x} \sum_{i=1}^n y_i + n\bar{x}\bar{y} \right) - \hat{\beta}_1 \left( \sum_{i=1}^n x_i^2 - 2\bar{x} \sum_{i=1}^n x_i + n\bar{x}^2 \right) \\ &= \sum_{i=1}^n (x_i y_i - \bar{y} x_i - \bar{x} y_i + \bar{x}\bar{y}) - \hat{\beta}_1 \sum_{i=1}^n \left( x_i^2 - 2\bar{x} x_i + \bar{x}^2 \right) \\ &= \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) - \hat{\beta}_1 \sum_{i=1}^n (x_i - \bar{x})^2 \\ &= (n-1)\, s_{xy} - \frac{s_{xy}}{s_x^2} \, (n-1)\, s_x^2 \\ &= (n-1)\, s_{xy} - (n-1)\, s_{xy} \\ &= 0 . \end{aligned} \quad (3)$$

Because an inner product of zero also implies zero correlation (→ I/1.14.1), this demonstrates that residuals (→ III/1.5.8) and covariate (→ III/1.4.1) values are uncorrelated under ordinary least squares (→ III/1.4.3).
Because an inner product of zero also implies zero correlation (→ I/1.14.1), this demonstrates that
residuals (→ III/1.5.8) and covariate (→ III/1.4.1) values are uncorrelated under ordinary least
squares (→ III/1.4.3).
■
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Numerical_properties.
$$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n \quad (1)$$

and consider estimation using ordinary least squares (→ III/1.4.3). Then, residual variance (→ IV/1.1.1) and sample variance (→ I/1.11.2) are related to each other via the correlation coefficient (→ I/1.14.1):

$$\hat{\sigma}^2 = \left( 1 - r_{xy}^2 \right) s_y^2 . \quad (2)$$
Proof: The residual variance (→ IV/1.1.1) can be expressed in terms of the residual sum of squares
(→ III/1.5.8):
$$\hat{\sigma}^2 = \frac{1}{n-1} \, \mathrm{RSS}(\hat{\beta}_0, \hat{\beta}_1) \quad (3)$$

and the residual sum of squares for simple linear regression (→ III/1.4.13) is

$$\mathrm{RSS}(\hat{\beta}_0, \hat{\beta}_1) = (n-1) \left( s_y^2 - \frac{s_{xy}^2}{s_x^2} \right) . \quad (4)$$

Combining (3) and (4), we obtain:

$$\begin{aligned} \hat{\sigma}^2 &= s_y^2 - \frac{s_{xy}^2}{s_x^2} \\ &= \left( 1 - \frac{s_{xy}^2}{s_x^2 s_y^2} \right) s_y^2 \\ &= \left( 1 - \left( \frac{s_{xy}}{s_x s_y} \right)^2 \right) s_y^2 . \end{aligned} \quad (5)$$
Using the relationship between correlation, covariance and standard deviation (→ I/1.14.1)
$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)} \sqrt{\mathrm{Var}(Y)}} \quad (6)$$

which also holds for sample correlation, sample covariance (→ I/1.13.2) and sample standard deviation (→ I/1.16.1)

$$r_{xy} = \frac{s_{xy}}{s_x s_y} , \quad (7)$$

we get the final result:

$$\hat{\sigma}^2 = \left( 1 - r_{xy}^2 \right) s_y^2 . \quad (8)$$
Sources:
• Penny, William (2006): “Relation to correlation”; in: Mathematics for Brain Imaging, ch. 1.2.3,
p. 18, eq. 1.28; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Numerical_properties.
$$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n \quad (1)$$

and consider estimation using ordinary least squares (→ III/1.4.3). Then, correlation coefficient (→ I/1.14.3) and the estimated value of the slope parameter (→ III/1.4.1) are related to each other via the sample (→ I/1.11.2) standard deviations (→ I/1.16.1):

$$r_{xy} = \hat{\beta}_1 \frac{s_x}{s_y} . \quad (2)$$

Proof: The ordinary least squares estimate of the slope (→ III/1.4.3) is given by

$$\hat{\beta}_1 = \frac{s_{xy}}{s_x^2} . \quad (3)$$
Using the relationship between covariance and correlation (→ I/1.13.7)
$$\begin{aligned} \hat{\beta}_1 &= \frac{s_{xy}}{s_x^2} \\ \hat{\beta}_1 &= \frac{s_x \, r_{xy} \, s_y}{s_x^2} \\ \hat{\beta}_1 &= r_{xy} \frac{s_y}{s_x} \\ \Leftrightarrow \quad r_{xy} &= \hat{\beta}_1 \frac{s_x}{s_y} . \end{aligned} \quad (6)$$
■
Sources:
• Penny, William (2006): “Relation to correlation”; in: Mathematics for Brain Imaging, ch. 1.2.3,
p. 18, eq. 1.27; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_
line.
$$y = \beta_0 + \beta_1 x + \varepsilon, \quad \varepsilon_i \sim N(0, \sigma^2), \quad i = 1, \ldots, n \quad (1)$$
and consider estimation using ordinary least squares (→ III/1.4.3). Then, the coefficient of deter-
mination (→ IV/1.2.1) is equal to the squared correlation coefficient (→ I/1.14.3) between x and
y:
$$R^2 = r_{xy}^2 . \quad (2)$$
Proof: The ordinary least squares estimates for simple linear regression (→ III/1.4.3) are
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad \text{and} \quad \hat{\beta}_1 = \frac{s_{xy}}{s_x^2} . \quad (3)$$
sx
The coefficient of determination (→ IV/1.2.1) R2 is defined as the proportion of the variance explained
by the independent variables, relative to the total variance in the data. This can be quantified as the
ratio of explained sum of squares (→ III/1.5.7) to total sum of squares (→ III/1.5.6):
$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} . \quad (4)$$
Using the explained and total sum of squares for simple linear regression (→ III/1.4.13), we have:
$$\begin{aligned} R^2 &= \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \\ &= \frac{\sum_{i=1}^n (\bar{y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \\ &= \hat{\beta}_1^2 \, \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \\ &= \hat{\beta}_1^2 \, \frac{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}{\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2} \\ &= \hat{\beta}_1^2 \, \frac{s_x^2}{s_y^2} \\ &= \left( \hat{\beta}_1 \frac{s_x}{s_y} \right)^2 . \end{aligned} \quad (6)$$
Using the relationship between correlation coefficient and slope estimate (→ III/1.4.22), we conclude:
$$R^2 = \left( \hat{\beta}_1 \frac{s_x}{s_y} \right)^2 = r_{xy}^2 . \quad (7)$$
■
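The identity R² = r²ₓᵧ is straightforward to confirm numerically. A NumPy sketch (illustrative data; not part of the original proof):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=25)
y = 1.0 - 0.7 * x + rng.normal(size=25)

b1 = np.cov(x, y)[0, 1] / x.var(ddof=1)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

# coefficient of determination as ESS / TSS, eq. (4)
R2 = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(R2, r**2)              # eq. (2)
```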
Sources:
• Wikipedia (2021): “Simple linear regression”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Simple_linear_regression#Fitting_the_regression_
line.
• Wikipedia (2021): “Coefficient of determination”; in: Wikipedia, the free encyclopedia, retrieved on
2021-10-27; URL: https://en.wikipedia.org/wiki/Coefficient_of_determination#As_squared_correlation_
coefficient.
• Wikipedia (2021): “Correlation”; in: Wikipedia, the free encyclopedia, retrieved on 2021-10-27;
URL: https://en.wikipedia.org/wiki/Correlation#Sample_correlation_coefficient.
y = Xβ + ε , (1)
together with a statement asserting a normal distribution (→ II/4.1.1) for ε
ε ∼ N (0, σ 2 V ) (2)
is called a univariate linear regression model or simply, “multiple linear regression”.
• y is called “measured data”, “dependent variable” or “measurements”;
• X is called “design matrix”, “set of independent variables” or “predictors”;
• V is called “covariance matrix” or “covariance structure”;
• β are called “regression coefficients” or “weights”;
• ε is called “noise”, “errors” or “error terms”;
• σ 2 is called “noise variance” or “error variance”;
• n is the number of observations;
• p is the number of predictors.
Alternatively, the linear combination may also be written as
$$y = \sum_{i=1}^p \beta_i x_i + \varepsilon \quad (3)$$

or, including an explicit intercept term, as

$$y = \beta_0 + \sum_{i=1}^p \beta_i x_i + \varepsilon . \quad (4)$$

If V = I_n, the model is called multiple linear regression with independent observations:

$$V = I_n \quad \Rightarrow \quad \varepsilon \sim N(0, \sigma^2 I_n) \quad \Rightarrow \quad \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2) . \quad (5)$$

Otherwise, it is called multiple linear regression with correlated observations.
Sources:
• Wikipedia (2020): “Linear regression”; in: Wikipedia, the free encyclopedia, retrieved on 2020-03-
21; URL: https://en.wikipedia.org/wiki/Linear_regression#Simple_and_multiple_linear_regression.
$$Y = y, \quad B = \beta, \quad E = \varepsilon \quad \text{and} \quad \Sigma = \sigma^2 \quad (1)$$
where y, β, ε and σ 2 are the data vector, regression coefficients, noise vector and noise variance from
multiple linear regression (→ III/1.5.1).
Proof: The linear regression model with correlated errors (→ III/1.5.1) is given by:
y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (2)
Because ε is an n × 1 vector and σ 2 is scalar, we have the following identities:
$$\mathrm{vec}(\varepsilon) = \varepsilon \quad \text{and} \quad \sigma^2 \otimes V = \sigma^2 V . \quad (3)$$
Thus, using the relationship between multivariate normal and matrix normal distribution (→ II/5.1.2),
equation (2) can also be written as
y = Xβ + ε, ε ∼ MN (0, V, σ 2 ) . (4)
Comparing with the general linear model with correlated observations (→ III/2.1.1)
Y = XB + E, E ∼ MN (0, V, Σ) , (5)
we finally note the equivalences given in equation (1).
Sources:
• Wikipedia (2022): “General linear model”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
07-21; URL: https://en.wikipedia.org/wiki/General_linear_model#Comparison_to_multiple_
linear_regression.
β̂ = (X T X)−1 X T y . (2)
Proof: Let β̂ be the ordinary least squares (OLS) solution and let ε̂ = y − X β̂ be the resulting vector
of residuals. Then, this vector must be orthogonal to the design matrix,
X T ε̂ = 0 , (3)
because if it wasn’t, there would be another solution β̃ giving another vector ε̃ with a smaller residual
sum of squares. From (3), the OLS formula can be directly derived:
$$\begin{aligned} X^\mathrm{T} \hat{\varepsilon} &= 0 \\ X^\mathrm{T} (y - X \hat{\beta}) &= 0 \\ X^\mathrm{T} y - X^\mathrm{T} X \hat{\beta} &= 0 \\ X^\mathrm{T} X \hat{\beta} &= X^\mathrm{T} y \\ \hat{\beta} &= (X^\mathrm{T} X)^{-1} X^\mathrm{T} y . \end{aligned} \quad (4)$$
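The normal equations and the orthogonality condition (3) are easy to verify numerically. A NumPy sketch (random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)   # solve (X^T X) beta = X^T y
res = y - X @ beta
assert np.allclose(X.T @ res, 0)           # residuals orthogonal to X, eq. (3)
# agrees with a generic least-squares solver
assert np.allclose(beta, np.linalg.lstsq(X, y, rcond=None)[0])
```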
Sources:
• Stephan, Klaas Enno (2010): “The General Linear Model (GLM)”; in: Methods and models for
fMRI data analysis in neuroeconomics, Lecture 3, Slides 10/11; URL: http://www.socialbehavior.
uzh.ch/teaching/methodsspring10.html.
$$\hat{\beta} = (X^\mathrm{T} X)^{-1} X^\mathrm{T} y . \quad (2)$$

Proof: The residual sum of squares (→ III/1.5.8) can be expanded as

$$\begin{aligned} \mathrm{RSS}(\beta) &= (y - X\beta)^\mathrm{T} (y - X\beta) \\ &= y^\mathrm{T} y - y^\mathrm{T} X \beta - \beta^\mathrm{T} X^\mathrm{T} y + \beta^\mathrm{T} X^\mathrm{T} X \beta \\ &= y^\mathrm{T} y - 2 \beta^\mathrm{T} X^\mathrm{T} y + \beta^\mathrm{T} X^\mathrm{T} X \beta . \end{aligned} \quad (4)$$

The derivative of (4) with respect to β is

$$\frac{\mathrm{d}\,\mathrm{RSS}(\beta)}{\mathrm{d}\beta} = -2 X^\mathrm{T} y + 2 X^\mathrm{T} X \beta \quad (5)$$

and setting this derivative to zero, we obtain:

$$\begin{aligned} \frac{\mathrm{d}\,\mathrm{RSS}(\hat{\beta})}{\mathrm{d}\beta} &= 0 \\ 0 &= -2 X^\mathrm{T} y + 2 X^\mathrm{T} X \hat{\beta} \\ X^\mathrm{T} X \hat{\beta} &= X^\mathrm{T} y \\ \hat{\beta} &= (X^\mathrm{T} X)^{-1} X^\mathrm{T} y . \end{aligned} \quad (6)$$
Sources:
• Wikipedia (2020): “Proofs involving ordinary least squares”; in: Wikipedia, the free encyclopedia,
retrieved on 2020-02-03; URL: https://en.wikipedia.org/wiki/Proofs_involving_ordinary_least_
squares#Least_squares_estimator_for_%CE%B2.
• ad (2015): “Derivation of the Least Squares Estimator for Beta in Matrix Notation”; in: Economic
Theory Blog, retrieved on 2021-05-27; URL: https://economictheoryblog.com/2015/02/19/ols_
estimator/.
$$\hat{\beta}_1 = \frac{x_2^\mathrm{T} x_2 \, x_1^\mathrm{T} y - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} y}{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} x_2 - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} x_1} \quad \text{and} \quad \hat{\beta}_2 = \frac{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} y - x_2^\mathrm{T} x_1 \, x_1^\mathrm{T} y}{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} x_2 - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} x_1} \quad (2)$$

and 2) if the two regressors are orthogonal to each other, they simplify to

$$\hat{\beta}_1 = \frac{x_1^\mathrm{T} y}{x_1^\mathrm{T} x_1} \quad \text{and} \quad \hat{\beta}_2 = \frac{x_2^\mathrm{T} y}{x_2^\mathrm{T} x_2} , \quad \text{if} \; x_1 \perp x_2 . \quad (3)$$
Proof: The model in (1) is a special case of multiple linear regression (→ III/1.5.1) and the ordinary
least squares solution for multiple linear regression (→ III/1.5.3) is:
$$\hat{\beta} = (X^\mathrm{T} X)^{-1} X^\mathrm{T} y . \quad (4)$$

1) Plugging $X = \begin{bmatrix} x_1 & x_2 \end{bmatrix}$ into this equation, we obtain:

$$\begin{aligned} \hat{\beta} &= \left( \begin{bmatrix} x_1^\mathrm{T} \\ x_2^\mathrm{T} \end{bmatrix} \begin{bmatrix} x_1 & x_2 \end{bmatrix} \right)^{-1} \begin{bmatrix} x_1^\mathrm{T} \\ x_2^\mathrm{T} \end{bmatrix} y \\ &= \begin{bmatrix} x_1^\mathrm{T} x_1 & x_1^\mathrm{T} x_2 \\ x_2^\mathrm{T} x_1 & x_2^\mathrm{T} x_2 \end{bmatrix}^{-1} \begin{bmatrix} x_1^\mathrm{T} y \\ x_2^\mathrm{T} y \end{bmatrix} . \end{aligned} \quad (5)$$

Applying the inverse of a 2 × 2 matrix, this becomes

$$\begin{aligned} \hat{\beta} &= \frac{1}{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} x_2 - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} x_1} \begin{bmatrix} x_2^\mathrm{T} x_2 & -x_1^\mathrm{T} x_2 \\ -x_2^\mathrm{T} x_1 & x_1^\mathrm{T} x_1 \end{bmatrix} \begin{bmatrix} x_1^\mathrm{T} y \\ x_2^\mathrm{T} y \end{bmatrix} \\ &= \frac{1}{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} x_2 - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} x_1} \begin{bmatrix} x_2^\mathrm{T} x_2 \, x_1^\mathrm{T} y - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} y \\ x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} y - x_2^\mathrm{T} x_1 \, x_1^\mathrm{T} y \end{bmatrix} , \end{aligned} \quad (7)$$

such that the individual estimates are

$$\begin{aligned} \hat{\beta}_1 &= \frac{x_2^\mathrm{T} x_2 \, x_1^\mathrm{T} y - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} y}{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} x_2 - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} x_1} \\[1ex] \hat{\beta}_2 &= \frac{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} y - x_2^\mathrm{T} x_1 \, x_1^\mathrm{T} y}{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} x_2 - x_1^\mathrm{T} x_2 \, x_2^\mathrm{T} x_1} . \end{aligned} \quad (8)$$
1 x2 x2 x1
2) If two regressors are orthogonal to each other, this means that the inner product of the corresponding vectors is zero:

$$x_1 \perp x_2 \quad \Leftrightarrow \quad x_1^\mathrm{T} x_2 = x_2^\mathrm{T} x_1 = 0 . \quad (9)$$

Applying this to equation (8), we obtain:

$$\hat{\beta}_1 = \frac{x_2^\mathrm{T} x_2 \, x_1^\mathrm{T} y}{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} x_2} = \frac{x_1^\mathrm{T} y}{x_1^\mathrm{T} x_1} \quad \text{and} \quad \hat{\beta}_2 = \frac{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} y}{x_1^\mathrm{T} x_1 \, x_2^\mathrm{T} x_2} = \frac{x_2^\mathrm{T} y}{x_2^\mathrm{T} x_2} . \quad (10)$$
■
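Both cases — the general two-regressor formulas (2) and the orthogonal simplification (3) — can be checked numerically. A NumPy sketch (random illustrative data; the second regressor is orthogonalized explicitly for the second check):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 15
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 0.5 * x1 - 1.2 * x2 + rng.normal(size=n)

# general case, eq. (2)
den = (x1 @ x1) * (x2 @ x2) - (x1 @ x2) * (x2 @ x1)
b1 = ((x2 @ x2) * (x1 @ y) - (x1 @ x2) * (x2 @ y)) / den
b2 = ((x1 @ x1) * (x2 @ y) - (x2 @ x1) * (x1 @ y)) / den
X = np.column_stack([x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose([b1, b2], beta)

# orthogonal case, eq. (3): estimates decouple
x2o = x2 - (x1 @ x2) / (x1 @ x1) * x1     # make x2 orthogonal to x1
Xo = np.column_stack([x1, x2o])
bo = np.linalg.solve(Xo.T @ Xo, Xo.T @ y)
assert np.allclose(bo, [x1 @ y / (x1 @ x1), x2o @ y / (x2o @ x2o)])
```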
Sources:
• Wikipedia (2020): “Total sum of squares”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-21; URL: https://en.wikipedia.org/wiki/Total_sum_of_squares.
with estimated regression coefficients β̂, e.g. obtained via ordinary least squares (→ III/1.5.3).
Sources:
• Wikipedia (2020): “Explained sum of squares”; in: Wikipedia, the free encyclopedia, retrieved on
2020-03-21; URL: https://en.wikipedia.org/wiki/Explained_sum_of_squares.
with estimated regression coefficients β̂, e.g. obtained via ordinary least squares (→ III/1.5.3).
Sources:
• Wikipedia (2020): “Residual sum of squares”; in: Wikipedia, the free encyclopedia, retrieved on
2020-03-21; URL: https://en.wikipedia.org/wiki/Residual_sum_of_squares.
$$\begin{aligned} \mathrm{TSS} &= \sum_{i=1}^n (y_i - \bar{y} + \hat{y}_i - \hat{y}_i)^2 \\ &= \sum_{i=1}^n \left( (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) \right)^2 \\ &= \sum_{i=1}^n \left( (\hat{y}_i - \bar{y}) + \hat{\varepsilon}_i \right)^2 \\ &= \sum_{i=1}^n \left( (\hat{y}_i - \bar{y})^2 + 2 \hat{\varepsilon}_i (\hat{y}_i - \bar{y}) + \hat{\varepsilon}_i^2 \right) \\ &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 \sum_{i=1}^n \hat{\varepsilon}_i (\hat{y}_i - \bar{y}) \\ &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 \sum_{i=1}^n \hat{\varepsilon}_i (x_i \hat{\beta} - \bar{y}) \\ &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 \sum_{i=1}^n \hat{\varepsilon}_i \left( \sum_{j=1}^p x_{ij} \hat{\beta}_j \right) - 2 \sum_{i=1}^n \hat{\varepsilon}_i \bar{y} \\ &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 + 2 \sum_{j=1}^p \hat{\beta}_j \sum_{i=1}^n \hat{\varepsilon}_i x_{ij} - 2 \bar{y} \sum_{i=1}^n \hat{\varepsilon}_i . \end{aligned} \quad (4)$$
The fact that the design matrix includes a constant regressor ensures that

$$\sum_{i=1}^n \hat{\varepsilon}_i = \hat{\varepsilon}^\mathrm{T} 1_n = 0 \quad (5)$$

and because the residuals are orthogonal to the design matrix (→ III/1.5.3), we have

$$\sum_{i=1}^n \hat{\varepsilon}_i x_{ij} = \hat{\varepsilon}^\mathrm{T} x_j = 0 . \quad (6)$$

Thus, the last two terms in (4) vanish and, with the definitions of explained (→ III/1.5.7) and residual sum of squares (→ III/1.5.8), it is

$$\mathrm{TSS} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n \hat{\varepsilon}_i^2 = \mathrm{ESS} + \mathrm{RSS} .$$
Sources:
• Wikipedia (2020): “Partition of sums of squares”; in: Wikipedia, the free encyclopedia, retrieved on
2020-03-09; URL: https://en.wikipedia.org/wiki/Partition_of_sums_of_squares#Partitioning_
the_sum_of_squares_in_linear_regression.
Ey = β̂ . (1)
P y = ŷ = X β̂ . (1)
Sources:
• Wikipedia (2020): “Projection matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-
22; URL: https://en.wikipedia.org/wiki/Projection_matrix#Overview.
Ry = ε̂ = y − ŷ = y − X β̂ . (1)
Sources:
• Wikipedia (2020): “Projection matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-
22; URL: https://en.wikipedia.org/wiki/Projection_matrix#Properties.
$$\begin{aligned} \hat{\beta} &= E y \\ \hat{y} &= P y \\ \hat{\varepsilon} &= R y \end{aligned} \quad (2)$$

where

$$\begin{aligned} E &= (X^\mathrm{T} X)^{-1} X^\mathrm{T} \\ P &= X (X^\mathrm{T} X)^{-1} X^\mathrm{T} \\ R &= I_n - X (X^\mathrm{T} X)^{-1} X^\mathrm{T} \end{aligned} \quad (3)$$
matrix (→ III/1.5.12) and n is the number of observations.
Proof:
1) Ordinary least squares parameter estimates of β are defined as minimizing the residual sum of
squares (→ III/1.5.8)
$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \; (y - X\beta)^\mathrm{T} (y - X\beta) , \quad (4)$$

the solution of which is given by

$$\hat{\beta} = (X^\mathrm{T} X)^{-1} X^\mathrm{T} y \overset{(3)}{=} E y . \quad (5)$$
2) The fitted signal is given by multiplying the design matrix with the estimated regression coefficients
ŷ = X β̂ (6)
and using (5), this becomes
$$\hat{y} = X (X^\mathrm{T} X)^{-1} X^\mathrm{T} y \overset{(3)}{=} P y . \quad (7)$$
3) The residuals of the model are calculated by subtracting the fitted signal from the measured signal
ε̂ = y − ŷ (8)
and using (7), this becomes
$$\hat{\varepsilon} = y - X (X^\mathrm{T} X)^{-1} X^\mathrm{T} y = \left( I_n - X (X^\mathrm{T} X)^{-1} X^\mathrm{T} \right) y \overset{(3)}{=} R y . \quad (9)$$
Sources:
• Stephan, Klaas Enno (2010): “The General Linear Model (GLM)”; in: Methods and models for
fMRI data analysis in neuroeconomics, Lecture 3, Slide 10; URL: http://www.socialbehavior.uzh.
ch/teaching/methodsspring10.html.
$$P^\mathrm{T} = P \quad \text{and} \quad R^\mathrm{T} = R . \quad (1)$$

Proof: Let X be the design matrix from the linear regression model (→ III/1.5.1). Then, the matrix XᵀX is symmetric, because

$$(X^\mathrm{T} X)^\mathrm{T} = X^\mathrm{T} (X^\mathrm{T})^\mathrm{T} = X^\mathrm{T} X . \quad (2)$$

Thus, the inverse of XᵀX is also symmetric, i.e.

$$\left( (X^\mathrm{T} X)^{-1} \right)^\mathrm{T} = (X^\mathrm{T} X)^{-1} . \quad (3)$$

1) The projection matrix for ordinary least squares is given by (→ III/1.5.13)

$$P = X (X^\mathrm{T} X)^{-1} X^\mathrm{T} , \quad (4)$$

such that

$$\begin{aligned} P^\mathrm{T} &= \left( X (X^\mathrm{T} X)^{-1} X^\mathrm{T} \right)^\mathrm{T} \\ &= (X^\mathrm{T})^\mathrm{T} \left( (X^\mathrm{T} X)^{-1} \right)^\mathrm{T} X^\mathrm{T} \\ &\overset{(3)}{=} X (X^\mathrm{T} X)^{-1} X^\mathrm{T} = P . \end{aligned} \quad (5)$$

2) The residual-forming matrix is given by (→ III/1.5.13)

$$R = I_n - P , \quad (6)$$

such that

$$\begin{aligned} R^\mathrm{T} &= (I_n - P)^\mathrm{T} \\ &= I_n^\mathrm{T} - P^\mathrm{T} \\ &\overset{(5)}{=} I_n - P \\ &\overset{(6)}{=} R . \end{aligned} \quad (7)$$
■
Sources:
• Wikipedia (2020): “Projection matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2022-12-
22; URL: https://en.wikipedia.org/wiki/Projection_matrix#Properties.
$$P^2 = P \quad \text{and} \quad R^2 = R . \quad (1)$$

Proof:
1) The projection matrix for ordinary least squares is given by (→ III/1.5.13)

$$P = X (X^\mathrm{T} X)^{-1} X^\mathrm{T} , \quad (2)$$

such that

$$\begin{aligned} P^2 &= X (X^\mathrm{T} X)^{-1} X^\mathrm{T} X (X^\mathrm{T} X)^{-1} X^\mathrm{T} \\ &= X (X^\mathrm{T} X)^{-1} X^\mathrm{T} = P . \end{aligned} \quad (3)$$

2) The residual-forming matrix is given by (→ III/1.5.13)

$$R = I_n - P , \quad (4)$$

such that

$$\begin{aligned} R^2 &= (I_n - P)(I_n - P) \\ &= I_n - P - P + P^2 \\ &\overset{(3)}{=} I_n - 2P + P \\ &= I_n - P \\ &\overset{(4)}{=} R . \end{aligned} \quad (5)$$
Sources:
• Wikipedia (2020): “Projection matrix”; in: Wikipedia, the free encyclopedia, retrieved on 2020-07-
22; URL: https://en.wikipedia.org/wiki/Projection_matrix#Properties.
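Both properties — symmetry and idempotency of P and R — can be confirmed numerically. A NumPy sketch with a random design matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 12, 4
X = rng.normal(size=(n, p))

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection matrix
R = np.eye(n) - P                        # residual-forming matrix

assert np.allclose(P, P.T) and np.allclose(R, R.T)      # symmetry
assert np.allclose(P @ P, P) and np.allclose(R @ R, R)  # idempotency
```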
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
and consider estimation using weighted least squares (→ III/1.5.20). Then, the estimated parameters
and the vector of residuals (→ III/1.5.18) are independent from each other:
$$\hat{\beta} = (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} y \quad \text{and} \quad \hat{\varepsilon} = y - X \hat{\beta} \quad \text{ind.} \quad (2)$$

Proof: Equation (1) implies the following distribution (→ III/1.5.18) for the random vector (→ I/1.2.3) y:

$$y \sim N\left( X\beta, \sigma^2 V \right) \sim N(X\beta, \Sigma) \quad \text{with} \quad \Sigma = \sigma^2 V . \quad (3)$$

Note that the estimated parameters and residuals can be written as projections from the same random vector (→ III/1.5.13) y:

$$\hat{\beta} = (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} y = A y \quad \text{with} \quad A = (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} \quad (4)$$

$$\hat{\varepsilon} = y - X\hat{\beta} = \left( I_n - X (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} \right) y = B y \quad \text{with} \quad B = I_n - X (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} . \quad (5)$$

Two projections AZ and BZ from the same multivariate normal (→ II/4.1.1) random vector (→ I/1.2.3) Z ∼ N(µ, Σ) are independent, if and only if the following condition holds (→ III/1.5.16):

$$A \Sigma B^\mathrm{T} = 0 . \quad (6)$$

Combining (3), (4) and (5), we check whether this is fulfilled in the present case:

$$\begin{aligned} A \Sigma B^\mathrm{T} &= \sigma^2 (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} V \left( I_n - V^{-1} X (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} \right) \\ &= \sigma^2 \left( (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} - (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} X (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} \right) \\ &= \sigma^2 \left( (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} - (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} \right) = 0 . \end{aligned}$$

This demonstrates that β̂ and ε̂ – and likewise, all pairs of terms separately derived (→ III/1.5.24) from β̂ and ε̂ – are statistically independent (→ I/1.3.6).
Sources:
• jld (2018): “Understanding t-test for linear regression”; in: StackExchange CrossValidated, re-
trieved on 2022-12-13; URL: https://stats.stackexchange.com/a/344008.
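The independence condition AΣBᵀ = 0 can also be checked numerically. A NumPy sketch (illustrative random design and covariance; `A0` is just a helper to build a positive-definite V):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 10, 3
X = rng.normal(size=(n, p))
A0 = rng.normal(size=(n, n))
V = A0 @ A0.T + n * np.eye(n)            # positive-definite covariance matrix
Vi = np.linalg.inv(V)
sigma2 = 0.7

A = np.linalg.inv(X.T @ Vi @ X) @ X.T @ Vi       # beta-hat = A y, eq. (4)
B = np.eye(n) - X @ A                            # eps-hat  = B y, eq. (5)
Sigma = sigma2 * V
assert np.allclose(A @ Sigma @ B.T, 0, atol=1e-8)   # condition (6)
```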
$$\begin{aligned} \hat{\beta} &\sim N\left( \beta, \; \sigma^2 (X^\mathrm{T} X)^{-1} \right) \\ \hat{y} &\sim N\left( X\beta, \; \sigma^2 P \right) \\ \hat{\varepsilon} &\sim N\left( 0, \; \sigma^2 (I_n - P) \right) \end{aligned} \quad (2)$$

where P is the projection matrix (→ III/1.5.11) for ordinary least squares (→ III/1.5.3).

Proof: We will use the linear transformation theorem for the multivariate normal distribution (→ II/4.1.12):

$$x \sim N(\mu, \Sigma) \quad \Rightarrow \quad y = A x \sim N\left( A\mu, \; A \Sigma A^\mathrm{T} \right) . \quad (4)$$

The linear regression model with independent observations is

$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n) . \quad (5)$$

Applying (4) to (5), the measured data are distributed as

$$y \sim N\left( X\beta, \; \sigma^2 I_n \right) . \quad (6)$$

1) The parameter estimates from ordinary least squares (→ III/1.5.3) are given by

$$\hat{\beta} = (X^\mathrm{T} X)^{-1} X^\mathrm{T} y \quad (7)$$

and thus, by applying (4) to (7), they are distributed as

$$\hat{\beta} \sim N\left( (X^\mathrm{T} X)^{-1} X^\mathrm{T} X \beta, \; \sigma^2 (X^\mathrm{T} X)^{-1} X^\mathrm{T} I_n X (X^\mathrm{T} X)^{-1} \right) \sim N\left( \beta, \; \sigma^2 (X^\mathrm{T} X)^{-1} \right) . \quad (8)$$

2) The fitted signal is given by (→ III/1.5.13)

$$\hat{y} = X \hat{\beta} = X (X^\mathrm{T} X)^{-1} X^\mathrm{T} y \quad (9)$$

and thus, by applying (4) to (9), it is distributed as

$$\hat{y} \sim N\left( X\beta, \; \sigma^2 X (X^\mathrm{T} X)^{-1} X^\mathrm{T} \right) \sim N\left( X\beta, \; \sigma^2 P \right) . \quad (10)$$

3) The residuals are given by (→ III/1.5.13)

$$\hat{\varepsilon} = y - X\hat{\beta} = \left( I_n - X (X^\mathrm{T} X)^{-1} X^\mathrm{T} \right) y \quad (11)$$

and thus, by applying (4) to (11), they are distributed as

$$\begin{aligned} \hat{\varepsilon} &\sim N\left( \left[ I_n - X (X^\mathrm{T} X)^{-1} X^\mathrm{T} \right] X\beta, \; \sigma^2 [I_n - P] \, I_n \, [I_n - P]^\mathrm{T} \right) \\ &\sim N\left( X\beta - X\beta, \; \sigma^2 [I_n - P][I_n - P]^\mathrm{T} \right) \\ &\sim N\left( 0, \; \sigma^2 (I_n - P) \right) , \end{aligned} \quad (12)$$

where the last step follows from symmetry and idempotency of the residual-forming matrix Iₙ − P.
Sources:
• Koch, Karl-Rudolf (2007): “Linear Model”; in: Introduction to Bayesian Statistics, Springer,
Berlin/Heidelberg, 2007, ch. 4, eqs. 4.2, 4.30; URL: https://www.springer.com/de/book/9783540727231;
DOI: 10.1007/978-3-540-72726-2.
• Penny, William (2006): “Multiple Regression”; in: Mathematics for Brain Imaging, ch. 1.5, pp.
39-41, eqs. 1.106-1.110; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 V) \quad (1)$$

and consider estimation using weighted least squares (→ III/1.5.20). Then, the estimated parameters, fitted signal and residuals are distributed as

$$\begin{aligned} \hat{\beta} &\sim N\left( \beta, \; \sigma^2 (X^\mathrm{T} V^{-1} X)^{-1} \right) \\ \hat{y} &\sim N\left( X\beta, \; \sigma^2 P V \right) \\ \hat{\varepsilon} &\sim N\left( 0, \; \sigma^2 (I_n - P) V \right) \end{aligned} \quad (2)$$

where P is the projection matrix (→ III/1.5.11) for weighted least squares (→ III/1.5.20).

Proof: We will use the linear transformation theorem for the multivariate normal distribution (→ II/4.1.12):

$$x \sim N(\mu, \Sigma) \quad \Rightarrow \quad y = A x \sim N\left( A\mu, \; A \Sigma A^\mathrm{T} \right) . \quad (4)$$

From (1), the measured data are distributed as

$$y \sim N\left( X\beta, \; \sigma^2 V \right) . \quad (5)$$

1) The parameter estimates from weighted least squares (→ III/1.5.20) are given by

$$\hat{\beta} = (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} y \quad (6)$$

and thus, by applying (4) to (6), they are distributed as

$$\hat{\beta} \sim N\left( (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} X \beta, \; \sigma^2 (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} V V^{-1} X (X^\mathrm{T} V^{-1} X)^{-1} \right) \sim N\left( \beta, \; \sigma^2 (X^\mathrm{T} V^{-1} X)^{-1} \right) . \quad (7)$$

2) The fitted signal in multiple linear regression (→ III/1.5.13) is given by

$$\hat{y} = X \hat{\beta} = X (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} y = P y \quad (8)$$

and thus, by applying (4) to (8), it is distributed as

$$\hat{y} \sim N\left( P X \beta, \; \sigma^2 P V P^\mathrm{T} \right) \sim N\left( X\beta, \; \sigma^2 P V \right) . \quad (9)$$

3) The residuals are given by

$$\hat{\varepsilon} = y - X\hat{\beta} = (I_n - P) y \quad (10)$$

and thus, by applying (4) to (10), they are distributed as

$$\begin{aligned} \hat{\varepsilon} &\sim N\left( (I_n - P) X\beta, \; \sigma^2 (I_n - P) V (I_n - P)^\mathrm{T} \right) \\ &\sim N\left( X\beta - X\beta, \; \sigma^2 \left[ V - V P^\mathrm{T} - P V + P V P^\mathrm{T} \right] \right) \\ &\sim N\left( 0, \; \sigma^2 \left[ V - 2 P V + P V \right] \right) \\ &\sim N\left( 0, \; \sigma^2 \left[ V - P V \right] \right) \\ &\sim N\left( 0, \; \sigma^2 (I_n - P) V \right) , \end{aligned} \quad (11)$$

using that VPᵀ = PV and PVPᵀ = PV for P = X(XᵀV⁻¹X)⁻¹XᵀV⁻¹.
Sources:
• Koch, Karl-Rudolf (2007): “Linear Model”; in: Introduction to Bayesian Statistics, Springer,
Berlin/Heidelberg, 2007, ch. 4, eqs. 4.2, 4.30; URL: https://www.springer.com/de/book/9783540727231;
DOI: 10.1007/978-3-540-72726-2.
• Penny, William (2006): “Multiple Regression”; in: Mathematics for Brain Imaging, ch. 1.5, pp.
39-41, eqs. 1.106-1.110; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to
the problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroIm-
age, vol. 209, art. 116449, eq. A.10; URL: https://www.sciencedirect.com/science/article/pii/
S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
• Soch J, Meyer AP, Allefeld C, Haynes JD (2017): “How to improve parameter estimates in GLM-
based fMRI data analysis: cross-validated Bayesian model averaging”; in: NeuroImage, vol. 158, pp.
186-195, eq. A.2; URL: https://www.sciencedirect.com/science/article/pii/S105381191730527X;
DOI: 10.1016/j.neuroimage.2017.06.056.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
and consider estimation using weighted least squares (→ III/1.5.20). Then, the residual sum of
squares (→ III/1.5.8) ε̂T ε̂, divided by the true error variance (→ III/1.5.1) σ 2 , follows a chi-squared
distribution (→ II/3.7.1) with n − p degrees of freedom
$$\frac{\hat{\varepsilon}^\mathrm{T} \hat{\varepsilon}}{\sigma^2} \sim \chi^2(n - p) \quad (2)$$
where n and p are the dimensions of the n × p design matrix (→ III/1.5.1) X.
Proof: Let there be an n × n matrix W, such that

$$W V W^\mathrm{T} = I_n . \quad (3)$$

Then, left-multiplying the regression model in (1) with W gives

$$W y = W X \beta + W \varepsilon, \quad W \varepsilon \sim N(0, \sigma^2 W V W^\mathrm{T}) \quad (4)$$

which can be rewritten as

$$\tilde{y} = \tilde{X} \beta + \tilde{\varepsilon}, \quad \tilde{\varepsilon} \sim N(0, \sigma^2 I_n) \quad \text{with} \quad \tilde{y} = W y, \; \tilde{X} = W X, \; \tilde{\varepsilon} = W \varepsilon , \quad (5)$$

such that

$$\tilde{y} \sim N\left( \tilde{X} \beta, \; \sigma^2 I_n \right) . \quad (6)$$
With that, we have obtained a linear regression model (→ III/1.5.1) with independent observations.
Cochran's theorem for multivariate normal variables states that, for an n × 1 normal random vector (→ II/4.1.1) whose covariance matrix (→ I/1.13.9) is a scalar multiple of the identity matrix, a specific squared form will follow a non-central chi-squared distribution where the degrees of freedom and the non-centrality parameter depend on the matrix in the quadratic form:

$$x \sim N(\mu, \sigma^2 I_n) \quad \Rightarrow \quad y = x^\mathrm{T} A x / \sigma^2 \sim \chi^2\left( \mathrm{tr}(A), \; \mu^\mathrm{T} A \mu \right) . \quad (7)$$

First, we formulate the residuals (→ III/1.5.13) in terms of the transformed measurements ỹ:

$$\hat{\varepsilon} = \tilde{R} \tilde{y} \quad \text{with} \quad \tilde{R} = I_n - \tilde{P} = I_n - \tilde{X} (\tilde{X}^\mathrm{T} \tilde{X})^{-1} \tilde{X}^\mathrm{T} , \quad (8)$$

such that

$$\frac{\hat{\varepsilon}^\mathrm{T} \hat{\varepsilon}}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^n \hat{\varepsilon}_i^2 = \tilde{y}^\mathrm{T} \tilde{R}^\mathrm{T} \tilde{R} \tilde{y} / \sigma^2 \quad (9)$$

and, because the residual-forming matrix is symmetric and idempotent,

$$\frac{\hat{\varepsilon}^\mathrm{T} \hat{\varepsilon}}{\sigma^2} = \tilde{y}^\mathrm{T} \tilde{R} \tilde{y} / \sigma^2 . \quad (10)$$
With that, we can apply Cochran’s theorem given by (7) which yields
$$\begin{aligned} \frac{\hat{\varepsilon}^\mathrm{T} \hat{\varepsilon}}{\sigma^2} &\sim \chi^2\left( \mathrm{tr}(I_n - \tilde{P}), \; \beta^\mathrm{T} \tilde{X}^\mathrm{T} \tilde{R} \tilde{X} \beta \right) \\ &\sim \chi^2\left( \mathrm{tr}(I_n) - \mathrm{tr}(\tilde{P}), \; \beta^\mathrm{T} \tilde{X}^\mathrm{T} (I_n - \tilde{P}) \tilde{X} \beta \right) \\ &\sim \chi^2\left( \mathrm{tr}(I_n) - \mathrm{tr}(\tilde{X} (\tilde{X}^\mathrm{T} \tilde{X})^{-1} \tilde{X}^\mathrm{T}), \; \beta^\mathrm{T} \left( \tilde{X}^\mathrm{T} \tilde{X} - \tilde{X}^\mathrm{T} \tilde{X} (\tilde{X}^\mathrm{T} \tilde{X})^{-1} \tilde{X}^\mathrm{T} \tilde{X} \right) \beta \right) \\ &\sim \chi^2\left( \mathrm{tr}(I_n) - \mathrm{tr}(\tilde{X}^\mathrm{T} \tilde{X} (\tilde{X}^\mathrm{T} \tilde{X})^{-1}), \; \beta^\mathrm{T} \left( \tilde{X}^\mathrm{T} \tilde{X} - \tilde{X}^\mathrm{T} \tilde{X} \right) \beta \right) \\ &\sim \chi^2\left( \mathrm{tr}(I_n) - \mathrm{tr}(I_p), \; \beta^\mathrm{T} 0_{pp} \beta \right) \\ &\sim \chi^2(n - p, 0) . \end{aligned} \quad (11)$$
Because a non-central chi-squared distribution with non-centrality parameter of zero reduces to the
central chi-squared distribution, we obtain our final result:
$$\frac{\hat{\varepsilon}^\mathrm{T} \hat{\varepsilon}}{\sigma^2} \sim \chi^2(n - p) . \quad (12)$$
■
Sources:
• Koch, Karl-Rudolf (2007): “Estimation of the Variance Factor in Traditional Statistics”; in: In-
troduction to Bayesian Statistics, Springer, Berlin/Heidelberg, 2007, ch. 4.2.3, eq. 4.37; URL:
https://www.springer.com/de/book/9783540727231; DOI: 10.1007/978-3-540-72726-2.
• Penny, William (2006): “Estimating error variance”; in: Mathematics for Brain Imaging, ch. 2.2,
pp. 49-51, eqs. 2.4-2.8; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
• Wikipedia (2022): “Ordinary least squares”; in: Wikipedia, the free encyclopedia, retrieved on
2022-12-13; URL: https://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation.
• ocram (2022): “Why is RSS distributed chi square times n-p?”; in: StackExchange CrossValidated,
retrieved on 2022-12-21; URL: https://stats.stackexchange.com/a/20230.
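The χ²(n − p) distribution of RSS/σ² can be illustrated with a small Monte Carlo simulation: simulate many datasets from the (independent-observations) model, compute RSS/σ² for each, and compare the empirical mean and variance to those of a χ²(n − p) distribution. A NumPy sketch (seeded random data; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p, sigma2 = 20, 3, 2.0
X = rng.normal(size=(n, p))
R = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual-forming matrix

m = 50000                                          # number of simulated datasets
beta = np.ones(p)
Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=(m, n))
res = Y @ R.T                                      # residuals of each dataset
stats = np.sum(res ** 2, axis=1) / sigma2          # RSS / sigma^2 per dataset

# chi^2(n - p) has mean (n - p) and variance 2(n - p)
assert abs(stats.mean() - (n - p)) < 0.5
assert abs(stats.var() - 2 * (n - p)) < 4.0
```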
$$y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 V) , \quad (1)$$

the parameters minimizing the weighted residual sum of squares (→ III/1.5.8) are given by

$$\hat{\beta} = (X^\mathrm{T} V^{-1} X)^{-1} X^\mathrm{T} V^{-1} y . \quad (2)$$

Proof: Let there be an n × n matrix W, such that

$$W V W^\mathrm{T} = I_n . \quad (3)$$

Since V is a covariance matrix and thus symmetric, W is also symmetric and can be expressed as the matrix square root of the inverse of V:

$$W V W = I_n \quad \Leftrightarrow \quad V = W^{-1} W^{-1} \quad \Leftrightarrow \quad V^{-1} = W W \quad \Leftrightarrow \quad W = V^{-1/2} . \quad (4)$$

Left-multiplying the linear regression equation (1) with W, the linear transformation theorem (→ II/4.1.12) implies that

$$W y = W X \beta + W \varepsilon, \quad W \varepsilon \sim N(0, \sigma^2 W V W^\mathrm{T}) . \quad (5)$$

Applying (3), we see that (5) is actually a linear regression model (→ III/1.5.1) with independent observations

$$W y = W X \beta + W \varepsilon, \quad W \varepsilon \sim N(0, \sigma^2 I_n) , \quad (6)$$

such that we can apply the ordinary least squares solution (→ III/1.5.3) with X̃ = WX and ỹ = Wy:

$$\begin{aligned} \hat{\beta} &= (\tilde{X}^\mathrm{T} \tilde{X})^{-1} \tilde{X}^\mathrm{T} \tilde{y} \\ &= \left( (W X)^\mathrm{T} W X \right)^{-1} (W X)^\mathrm{T} W y \\ &= \left( X^\mathrm{T} W^\mathrm{T} W X \right)^{-1} X^\mathrm{T} W^\mathrm{T} W y \\ &= \left( X^\mathrm{T} W W X \right)^{-1} X^\mathrm{T} W W y \\ &\overset{(4)}{=} \left( X^\mathrm{T} V^{-1} X \right)^{-1} X^\mathrm{T} V^{-1} y . \end{aligned} \quad (7)$$
Sources:
• Stephan, Klaas Enno (2010): “The General Linear Model (GLM)”; in: Methods and models for
fMRI data analysis in neuroeconomics, Lecture 3, Slides 20/23; URL: http://www.socialbehavior.
uzh.ch/teaching/methodsspring10.html.
• Wikipedia (2021): “Weighted least squares”; in: Wikipedia, the free encyclopedia, retrieved on
2021-11-17; URL: https://en.wikipedia.org/wiki/Weighted_least_squares#Motivation.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) , (1)
the parameters minimizing the weighted residual sum of squares (→ III/1.5.8) are given by
β̂ = (X T V −1 X)−1 X T V −1 y . (2)
Proof: Let W be an n × n matrix, such that

W V Wᵀ = Iₙ . (3)
Since V is a covariance matrix and thus symmetric, W is also symmetric and can be expressed as
the matrix square root of the inverse of V :
W V W = In ⇔ V = W −1 W −1 ⇔ V −1 = W W ⇔ W = V −1/2 . (4)
Left-multiplying the linear regression equation (1) with W , the linear transformation theorem (→
II/4.1.12) implies that
W y = W Xβ + W ε, W ε ∼ N (0, σ 2 W V W T ) . (5)
Applying (3), we see that (5) is actually a linear regression model (→ III/1.5.1) with independent
observations
W y = W Xβ + W ε, W ε ∼ N (0, σ 2 In ) . (6)
With this, we can express the weighted residual sum of squares (→ III/1.5.8) as
wRSS(β) = Σᵢ₌₁ⁿ (Wε)ᵢ² = (Wε)ᵀ(Wε) = (Wy − WXβ)ᵀ(Wy − WXβ) . (7)

Expanding this expression and applying (4), we have:

wRSS(β) = yᵀWᵀWy − yᵀWᵀWXβ − βᵀXᵀWᵀWy + βᵀXᵀWᵀWXβ
        = yᵀWWy − 2βᵀXᵀWWy + βᵀXᵀWWXβ      (8)
        = yᵀV⁻¹y − 2βᵀXᵀV⁻¹y + βᵀXᵀV⁻¹Xβ .
The derivative of (8) with respect to β is

dwRSS(β)/dβ = −2XᵀV⁻¹y + 2XᵀV⁻¹Xβ (9)
and setting this derivative to zero yields the minimizing parameter values:

dwRSS(β̂)/dβ = 0
0 = −2XᵀV⁻¹y + 2XᵀV⁻¹Xβ̂
XᵀV⁻¹Xβ̂ = XᵀV⁻¹y      (10)
β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹y .
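That (10) indeed minimizes the weighted residual sum of squares can also be verified by random perturbation. This is an illustrative sketch (not from the book), assuming NumPy and a hypothetical heteroscedastic V:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 15, 2
X = rng.normal(size=(n, p))
V = np.diag(rng.uniform(0.5, 2.0, size=n))   # hypothetical heteroscedastic covariance structure
y = X @ np.array([2.0, -1.0]) + rng.normal(size=n) * np.sqrt(np.diag(V))

Vinv = np.linalg.inv(V)
b_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)   # estimate (10)

def wrss(b):
    # weighted residual sum of squares (7)
    r = y - X @ b
    return float(r @ Vinv @ r)

# the WLS estimate minimizes wRSS: no random perturbation does better
for _ in range(100):
    assert wrss(b_hat) <= wrss(b_hat + rng.normal(scale=0.1, size=p)) + 1e-12
```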
1.5.22 t-contrast
Definition: Consider a linear regression model (→ III/1.5.1) with n × p design matrix X and p × 1
regression coefficients β:
y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (1)
Then, a t-contrast is specified by a p × 1 vector c and it entails the null hypothesis (→ I/4.3.2) that
the product of this vector and the regression coefficients is zero:
H0 : cT β = 0 . (2)
Consequently, the alternative hypothesis (→ I/4.3.3) of a two-tailed t-test (→ I/4.2.4) is
H1 : cT β ̸= 0 (3)
and the alternative hypothesis (→ I/4.3.3) of a one-sided t-test (→ I/4.2.4) would be

H₁ : cᵀβ > 0 . (4)
Sources:
• Stephan, Klaas Enno (2010): “Classical (frequentist) inference”; in: Methods and models for fMRI
data analysis in neuroeconomics, Lecture 4, Slides 7/9; URL: http://www.socialbehavior.uzh.ch/
teaching/methodsspring10.html.
1.5.23 F-contrast
Definition: Consider a linear regression model (→ III/1.5.1) with n × p design matrix X and p × 1
regression coefficients β:
y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (1)
Then, an F-contrast is specified by a p × q matrix C, yielding a q × 1 vector γ = C T β, and it entails
the null hypothesis (→ I/4.3.2) that each value in this vector is zero:
H0 : γ1 = 0 ∧ . . . ∧ γq = 0 . (2)
Consequently, the alternative hypothesis (→ I/4.3.3) of the statistical test (→ I/4.3.1) would be that
at least one entry of this vector is non-zero:
H1 : γ1 ̸= 0 ∨ . . . ∨ γq ̸= 0 . (3)
Here, C is called the “contrast matrix” and C T β are called the “contrast values”. With estimated
regression coefficients, C T β̂ are called the “estimated contrast values”.
Sources:
• Stephan, Klaas Enno (2010): “Classical (frequentist) inference”; in: Methods and models for fMRI
data analysis in neuroeconomics, Lecture 4, Slides 23/25; URL: http://www.socialbehavior.uzh.
ch/teaching/methodsspring10.html.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
and a t-contrast (→ III/1.5.22) on the model parameters. Then, the test statistic

t = cᵀβ̂ / √( σ̂² cᵀ(XᵀV⁻¹X)⁻¹c ) (3)

with the parameter estimates (→ III/1.5.26)

β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹y
σ̂² = (1/(n − p)) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂) (4)

follows a t-distribution (→ II/3.3.1)

t ∼ t(n − p) (5)

under the null hypothesis (→ I/4.3.2)

H₀ : cᵀβ = 0
H₁ : cᵀβ > 0 . (6)
Proof:
1) We know that the estimated regression coefficients in linear regression follow a multivariate normal
distribution (→ III/1.5.18):
β̂ ∼ N( β, σ² (XᵀV⁻¹X)⁻¹ ) . (7)
Define the quantity

z = cᵀβ̂ / √( σ² cᵀ(XᵀV⁻¹X)⁻¹c ) . (9)

Again applying the linear transformation theorem (→ II/4.1.12), this is distributed as

z ∼ N( cᵀβ / √( σ² cᵀ(XᵀV⁻¹X)⁻¹c ), 1 ) (10)
and thus follows a standard normal distribution (→ II/3.2.3) under the null hypothesis (→ I/4.3.2):

z ∼ N(0, 1), if H₀ . (11)

2) We also know that the residual sum of squares (→ III/1.5.8), divided by the true error variance (→ III/1.5.1),

v = (1/σ²) Σᵢ₌₁ⁿ ε̂ᵢ² = ε̂ᵀε̂ / σ² = (1/σ²) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂) (12)

follows a chi-squared distribution (→ III/1.5.19):

v ∼ χ²(n − p) . (13)
3) Because the estimated regression coefficients and the vector of residuals are independent from
each other (→ III/1.5.16)
z = cᵀβ̂ / √( σ² cᵀ(XᵀV⁻¹X)⁻¹c )  and  v = ε̂ᵀε̂ / σ²  ind. , (15)
the following quantity is, by definition, t-distributed (→ II/3.3.1)
t = z / √( v/(n − p) ) ∼ t(n − p), if H₀ (16)
and the quantity can be evaluated as:
t (16)= z / √( v/(n − p) )
  (15)= cᵀβ̂ / √( σ² cᵀ(XᵀV⁻¹X)⁻¹c ) · √( (n − p) / (ε̂ᵀε̂/σ²) )
      = cᵀβ̂ / √( (ε̂ᵀε̂/(n − p)) · cᵀ(XᵀV⁻¹X)⁻¹c )      (17)
  (12)= cᵀβ̂ / √( ((y − Xβ̂)ᵀV⁻¹(y − Xβ̂)/(n − p)) · cᵀ(XᵀV⁻¹X)⁻¹c )
  (4) = cᵀβ̂ / √( σ̂² cᵀ(XᵀV⁻¹X)⁻¹c ) .
This means that the null hypothesis (→ I/4.3.2) in (6) can be rejected when t from (17) is as
extreme or more extreme than the critical value (→ I/4.3.9) obtained from Student’s t-distribution
(→ II/3.3.1) with n − p degrees of freedom using a significance level (→ I/4.3.8) α.
■
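The algebraic identity in (17), namely that the t-statistic computed from σ̂² equals z/√(v/(n−p)) computed from the true σ², can be checked numerically. This sketch is not from the book; it assumes NumPy and simulates data under H₀:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 12, 3
X = rng.normal(size=(n, p))
V = np.eye(n)
sigma2 = 1.7                      # true noise variance, used only for z and v
beta = np.zeros(p)                # H0 is true: c'beta = 0
y = X @ beta + np.sqrt(sigma2) * rng.normal(size=n)
c = np.array([1.0, -1.0, 0.0])

Vinv = np.linalg.inv(V)
B = np.linalg.inv(X.T @ Vinv @ X)
b_hat = B @ X.T @ Vinv @ y
res = y - X @ b_hat
s2_hat = (res @ Vinv @ res) / (n - p)

# route 1, equation (3): t from the estimated noise variance
t_stat = (c @ b_hat) / np.sqrt(s2_hat * c @ B @ c)

# route 2, equation (16): z over sqrt(v/(n-p)) with the true sigma^2
z = (c @ b_hat) / np.sqrt(sigma2 * c @ B @ c)
v = (res @ Vinv @ res) / sigma2
assert np.isclose(t_stat, z / np.sqrt(v / (n - p)))
```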
Sources:
• Stephan, Klaas Enno (2010): “Classical (frequentist) inference”; in: Methods and models for fMRI
data analysis in neuroeconomics, Lecture 4, Slides 7/9; URL: http://www.socialbehavior.uzh.ch/
teaching/methodsspring10.html.
• Walter, Henrik (ed.) (2005): “Datenanalyse für funktionell bildgebende Verfahren”; in: Funk-
tionelle Bildgebung in Psychiatrie und Psychotherapie, Schattauer, Stuttgart/New York, 2005, p.
40; URL: https://books.google.de/books?id=edWzKAHi7jQC&source=gbs_navlinks_s.
• jld (2018): “Understanding t-test for linear regression”; in: StackExchange CrossValidated, re-
trieved on 2022-12-13; URL: https://stats.stackexchange.com/a/344008.
• Soch, Joram (2020): “Distributional Transformation Improves Decoding Accuracy When Predict-
ing Chronological Age From Structural MRI”; in: Frontiers in Psychiatry, vol. 11, art. 604268, eqs.
8/9; URL: https://www.frontiersin.org/articles/10.3389/fpsyt.2020.604268/full; DOI: 10.3389/fp-
syt.2020.604268.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
and an F-contrast (→ III/1.5.23) on the model parameters. Then, the test statistic

F = (Cᵀβ̂)ᵀ ( σ̂² Cᵀ(XᵀV⁻¹X)⁻¹C )⁻¹ (Cᵀβ̂) / q (3)

with the parameter estimates (→ III/1.5.26)

β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹y
σ̂² = (1/(n − p)) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂) (4)

follows an F-distribution (→ II/3.8.1)

F ∼ F(q, n − p) (5)

under the null hypothesis (→ I/4.3.2)

H₀ : γ₁ = 0 ∧ … ∧ γ_q = 0
H₁ : γ₁ ≠ 0 ∨ … ∨ γ_q ≠ 0 . (6)
Proof:
1) We know that the estimated regression coefficients in linear regression follow a multivariate normal
distribution (→ III/1.5.18):
β̂ ∼ N( β, σ² (XᵀV⁻¹X)⁻¹ ) . (7)
Thus, the estimated contrast vector (→ III/1.5.23) γ̂ = C T β̂ is also distributed according to a
multivariate normal distribution (→ II/4.1.12):
γ̂ ∼ N( Cᵀβ, σ² Cᵀ(XᵀV⁻¹X)⁻¹C ) . (8)
Substituting the noise variance σ 2 with the noise precision τ = 1/σ 2 , we can also write this down as
a conditional distribution (→ I/1.5.4):
γ̂ | τ ∼ N( Cᵀβ, (τQ)⁻¹ ) with Q = ( Cᵀ(XᵀV⁻¹X)⁻¹C )⁻¹ . (9)
2) We also know that the residual sum of squares (→ III/1.5.8), divided by the true error variance (→ III/1.5.1),

(1/σ²) Σᵢ₌₁ⁿ ε̂ᵢ² = ε̂ᵀε̂ / σ² = (1/σ²) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂) (10)

follows a chi-squared distribution (→ III/1.5.19):

ε̂ᵀε̂ / σ² = τ ε̂ᵀε̂ ∼ χ²(n − p) . (11)
The chi-squared distribution is a special case of the gamma distribution (→ II/3.7.2)

X ∼ χ²(k) ⇒ X ∼ Gam( k/2, 1/2 ) , (12)

and the gamma distribution changes under multiplication (→ II/3.4.6) in the following way:

X ∼ Gam(a, b) ⇒ cX ∼ Gam( a, b/c ) . (13)
Thus, combining (12) and (13) with (11), we obtain the marginal distribution (→ I/1.5.3) of τ as:
(1/(ε̂ᵀε̂)) · τ ε̂ᵀε̂ = τ ∼ Gam( (n − p)/2, (ε̂ᵀε̂)/2 ) . (14)
3) Note that the joint distribution (→ I/1.5.2) of γ̂ and τ is, following from (9) and (14) and by
definition, a normal-gamma distribution (→ II/4.3.1):
γ̂, τ ∼ NG( Cᵀβ, Q, (n − p)/2, (ε̂ᵀε̂)/2 ) . (15)
The marginal distribution of a normal-gamma distribution with respect to the normal random variable is a multivariate t-distribution (→ II/4.3.8):

X, Y ∼ NG(µ, Λ, a, b) ⇒ X ∼ t( µ, ((a/b) Λ)⁻¹, 2a ) . (16)

Thus, the marginal distribution (→ I/1.5.3) of γ̂ is:

γ̂ ∼ t( Cᵀβ, ( ((n − p)/(ε̂ᵀε̂)) Q )⁻¹, n − p ) . (17)
4) Because of the following relationship between the multivariate t-distribution and the F-distribution (→ II/4.2.3),

X ∼ t(µ, Σ, ν), X ∈ ℝ^q ⇒ (X − µ)ᵀ Σ⁻¹ (X − µ) / q ∼ F(q, ν) , (18)

the quantity

F = (γ̂ − Cᵀβ)ᵀ ( ((n − p)/(ε̂ᵀε̂)) Q ) (γ̂ − Cᵀβ) / q (19)

follows an F-distribution with q and n − p degrees of freedom under the null hypothesis, and it can be evaluated as:

F (19)= (γ̂ − Cᵀβ)ᵀ ( ((n − p)/(ε̂ᵀε̂)) Q ) (γ̂ − Cᵀβ) / q
  (6) = γ̂ᵀ ( ((n − p)/(ε̂ᵀε̂)) Q ) γ̂ / q
  (2) = β̂ᵀ C ( (n − p)/(ε̂ᵀε̂) ) Q Cᵀβ̂ / q
  (9) = β̂ᵀ C ( (n − p)/(ε̂ᵀε̂) ) ( Cᵀ(XᵀV⁻¹X)⁻¹C )⁻¹ Cᵀβ̂ / q      (20)
  (10)= β̂ᵀ C ( (n − p)/((y − Xβ̂)ᵀV⁻¹(y − Xβ̂)) ) ( Cᵀ(XᵀV⁻¹X)⁻¹C )⁻¹ Cᵀβ̂ / q
  (4) = β̂ᵀ C ( 1/σ̂² ) ( Cᵀ(XᵀV⁻¹X)⁻¹C )⁻¹ Cᵀβ̂ / q
      = β̂ᵀ C ( σ̂² Cᵀ(XᵀV⁻¹X)⁻¹C )⁻¹ Cᵀβ̂ / q .
This means that the null hypothesis (→ I/4.3.2) in (6) can be rejected when F from (20) is as
extreme or more extreme than the critical value (→ I/4.3.9) obtained from Fisher’s F-distribution
(→ II/3.8.1) with q numerator and n − p denominator degrees of freedom using a significance level
(→ I/4.3.8) α.
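For a single contrast (q = 1), the F-statistic in (20) reduces to the square of the corresponding t-statistic. The following NumPy sketch (not part of the book, with simulated data) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 20, 3
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, 0.0, 0.0]) + rng.normal(size=n)
Vinv = np.eye(n)

B = np.linalg.inv(X.T @ Vinv @ X)
b_hat = B @ X.T @ Vinv @ y
res = y - X @ b_hat
s2_hat = (res @ Vinv @ res) / (n - p)

C = np.array([[0.0], [1.0], [0.0]])     # p x q contrast matrix with q = 1
q = C.shape[1]
gam = C.T @ b_hat                       # estimated contrast values
F = float(gam @ np.linalg.inv(s2_hat * C.T @ B @ C) @ gam) / q   # statistic (20)

c = C[:, 0]
t = (c @ b_hat) / np.sqrt(s2_hat * c @ B @ c)                    # t-statistic (3)
assert np.isclose(F, t ** 2)
```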
Sources:
• Stephan, Klaas Enno (2010): “Classical (frequentist) inference”; in: Methods and models for fMRI
data analysis in neuroeconomics, Lecture 4, Slides 23/25; URL: http://www.socialbehavior.uzh.
ch/teaching/methodsspring10.html.
• Koch, Karl-Rudolf (2007): “Multivariate Distributions”; in: Introduction to Bayesian Statistics,
Springer, Berlin/Heidelberg, 2007, ch. 2.5, eqs. 2.202, 2.213, 2.211; URL: https://www.springer.
com/de/book/9783540727231; DOI: 10.1007/978-3-540-72726-2.
• jld (2018): “Understanding t-test for linear regression”; in: StackExchange CrossValidated, re-
trieved on 2022-12-13; URL: https://stats.stackexchange.com/a/344008.
• Penny, William (2006): “Comparing nested GLMs”; in: Mathematics for Brain Imaging, ch. 2.3,
pp. 51-52, eq. 2.9; URL: https://ueapsylabs.co.uk/sites/wpenny/mbi/mbi_course.pdf.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) , (1)
the maximum likelihood estimates (→ I/4.1.3) of β and σ 2 are given by
β̂ = (X T V −1 X)−1 X T V −1 y
1 (2)
σ̂ 2 = (y − X β̂)T V −1 (y − X β̂) .
n
Proof: With the probability density function of the multivariate normal distribution (→ II/4.1.4),
the linear regression equation (1) implies the following log-likelihood function (→ I/4.1.2), where P = V⁻¹:

LL(β, σ²) = −(n/2) log(2π) − (n/2) log(σ²) − (1/2) log|V|
          − (1/(2σ²)) ( yᵀPy − 2βᵀXᵀPy + βᵀXᵀPXβ ) . (5)
The derivative of the log-likelihood function with respect to β is

dLL(β, σ²)/dβ = d/dβ [ −(1/(2σ²)) ( yᵀPy − 2βᵀXᵀPy + βᵀXᵀPXβ ) ]
             = (1/(2σ²)) d/dβ [ 2βᵀXᵀPy − βᵀXᵀPXβ ]
             = (1/(2σ²)) ( 2XᵀPy − 2XᵀPXβ )      (6)
             = (1/σ²) ( XᵀPy − XᵀPXβ )
and setting this derivative to zero gives the MLE for β:
dLL(β̂, σ²)/dβ = 0
0 = (1/σ²) ( XᵀPy − XᵀPXβ̂ )
0 = XᵀPy − XᵀPXβ̂      (7)
XᵀPXβ̂ = XᵀPy
β̂ = (XᵀPX)⁻¹ XᵀPy
The derivative of the log-likelihood function with respect to σ² is

dLL(β̂, σ²)/dσ² = d/dσ² [ −(n/2) log(σ²) − (1/(2σ²)) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂) ]
              = −(n/2)(1/σ²) + (1/(2(σ²)²)) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂) (8)
and setting this derivative to zero gives the MLE for σ²:

dLL(β̂, σ̂²)/dσ² = 0
0 = −n/(2σ̂²) + (1/(2(σ̂²)²)) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂)
n/(2σ̂²) = (1/(2(σ̂²)²)) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂)      (9)
σ̂² = (1/n) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂)
Together, (7) and (9) constitute the MLE for multiple linear regression.
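That (β̂, σ̂²) from (7) and (9) jointly maximize the log-likelihood can be probed by random perturbations. This is an illustrative NumPy sketch with simulated data, not part of the book:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
V = np.eye(n)
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
P = np.linalg.inv(V)

# maximum likelihood estimates (7) and (9)
b_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
s2_hat = float((y - X @ b_hat) @ P @ (y - X @ b_hat)) / n

def ll(b, s2):
    # log-likelihood function (5)
    r = y - X @ b
    sign, logdetV = np.linalg.slogdet(V)
    return -n/2*np.log(2*np.pi) - n/2*np.log(s2) - logdetV/2 - (r @ P @ r)/(2*s2)

# no random perturbation of the MLE achieves a higher log-likelihood
for _ in range(50):
    db = rng.normal(scale=0.05, size=p)
    s2p = max(s2_hat + rng.normal(scale=0.05), 1e-6)
    assert ll(b_hat, s2_hat) >= ll(b_hat + db, s2p) - 1e-12
```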
m : y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (1)
Then, the maximum log-likelihood (→ I/4.1.5) for this model is
MLL(m) = −(n/2) log( RSS/n ) − (n/2) [1 + log(2π)] (2)

under uncorrelated observations (→ III/1.5.1), i.e. if V = Iₙ, and

MLL(m) = −(n/2) log( wRSS/n ) − (n/2) [1 + log(2π)] − (1/2) log|V| , (3)

in the general case, i.e. if V ≠ Iₙ, where RSS is the residual sum of squares (→ III/1.5.8) and wRSS
is the weighted residual sum of squares (→ III/1.5.21).
Proof: The likelihood function (→ I/5.1.2) for multiple linear regression is given by (→ III/1.5.26)
such that, with |σ 2 V | = (σ 2 )n |V |, the log-likelihood function (→ I/4.1.2) for this model becomes (→
III/1.5.26)
(1/n) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂) = (1/n) (Wy − WXβ̂)ᵀ(Wy − WXβ̂) = (1/n) Σᵢ₌₁ⁿ (Wε̂)ᵢ² = wRSS/n (7)
where W = V −1/2 . Plugging (6) into (5), we obtain the maximum log-likelihood (→ I/4.1.5) as
MLL(m) = LL(β̂, σ̂²)
       = −(n/2) log(2π) − (n/2) log(σ̂²) − (1/2) log|V| − (1/(2σ̂²)) (y − Xβ̂)ᵀV⁻¹(y − Xβ̂)
       = −(n/2) log(2π) − (n/2) log( wRSS/n ) − (1/2) log|V| − (1/2) · (n/wRSS) · wRSS      (8)
       = −(n/2) log( wRSS/n ) − (n/2) [1 + log(2π)] − (1/2) log|V|
which proves the result in (3). Under uncorrelated observations (→ III/1.5.1), i.e. if V = Iₙ, we have

σ̂² = (1/n) (y − Xβ̂)ᵀ(y − Xβ̂) = (1/n) Σᵢ₌₁ⁿ ε̂ᵢ² = RSS/n (9)

and

(1/2) log|V| = (1/2) log|Iₙ| = (1/2) log 1 = 0 , (10)
such that
MLL(m) = −(n/2) log( RSS/n ) − (n/2) [1 + log(2π)] (11)
which proves the result in (2). This completes the proof.
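The closed form (2) can be compared against a direct evaluation of the log-likelihood at the maximum likelihood estimates. This is a NumPy sketch with simulated data (not from the book); V = Iₙ is assumed:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)
res = y - X @ b_hat
rss = float(res @ res)
s2_hat = rss / n

# direct evaluation of the log-likelihood at the ML estimates
ll_max = -n/2*np.log(2*np.pi) - n/2*np.log(s2_hat) - rss/(2*s2_hat)

# closed form (2): MLL(m) = -n/2 log(RSS/n) - n/2 [1 + log(2 pi)]
mll = -n/2*np.log(rss/n) - n/2*(1 + np.log(2*np.pi))
assert np.isclose(ll_max, mll)
```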
Sources:
• Claeskens G, Hjort NL (2008): “Akaike’s information criterion”; in: Model Selection and Model Av-
eraging, ex. 2.2, p. 66; URL: https://www.cambridge.org/core/books/model-selection-and-model-averaging
E6F1EC77279D1223423BB64FC3A12C37; DOI: 10.1017/CBO9780511790485.
m : y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (1)
Then, the deviance (→ IV/2.3.2) for this model is
D(β, σ²) = RSS/σ² + n · [ log(σ²) + log(2π) ] (2)
under uncorrelated observations (→ III/1.5.1), i.e. if V = In , and
D(β, σ²) = wRSS/σ² + n · [ log(σ²) + log(2π) ] + log|V| , (3)
in the general case, i.e. if V ̸= In , where RSS is the residual sum of squares (→ III/1.5.8) and wRSS
is the weighted residual sum of squares (→ III/1.5.21).
Proof: The likelihood function (→ I/5.1.2) for multiple linear regression is given by (→ III/1.5.26)
such that, with |σ 2 V | = (σ 2 )n |V |, the log-likelihood function (→ I/4.1.2) for this model becomes (→
III/1.5.26)
−(1/(2σ²)) (y − Xβ)ᵀV⁻¹(y − Xβ) = −(1/(2σ²)) (Wy − WXβ)ᵀ(Wy − WXβ) = −(1/(2σ²)) Σᵢ₌₁ⁿ (Wε)ᵢ² = −wRSS/(2σ²) (6)
where W = V −1/2 . Plugging (6) into (5) and multiplying with −2, we obtain the deviance (→
IV/2.3.2) as
D(β, σ²) = −2 LL(β, σ²)
         = −2 [ −wRSS/(2σ²) − (n/2) log(σ²) − (n/2) log(2π) − (1/2) log|V| ]      (7)
         = wRSS/σ² + n · [ log(σ²) + log(2π) ] + log|V|
which proves the result in (3). Under uncorrelated observations (→ III/1.5.1), i.e. if V = Iₙ, we have

−(1/(2σ²)) (y − Xβ)ᵀV⁻¹(y − Xβ) = −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) = −(1/(2σ²)) Σᵢ₌₁ⁿ εᵢ² = −RSS/(2σ²) (8)
and
log|V| = log|Iₙ| = log 1 = 0 , (9)
such that
D(β, σ²) = RSS/σ² + n · [ log(σ²) + log(2π) ] (10)
which proves the result in (2). This completes the proof.
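The relation D = −2·LL behind (2) can be verified numerically for V = Iₙ. This is an illustrative NumPy sketch with simulated data, not from the book:

```python
import numpy as np

rng = np.random.default_rng(11)
n, p = 20, 2
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0])
s2 = 2.0
y = X @ beta + np.sqrt(s2) * rng.normal(size=n)

rss = float((y - X @ beta) @ (y - X @ beta))
# log-likelihood of the linear regression model with V = I_n
ll = -rss/(2*s2) - n/2*np.log(s2) - n/2*np.log(2*np.pi)

# (2): D(beta, sigma^2) = RSS/sigma^2 + n [log(sigma^2) + log(2 pi)]
D = rss/s2 + n*(np.log(s2) + np.log(2*np.pi))
assert np.isclose(D, -2*ll)
```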
m : y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (1)
Then, the Akaike information criterion (→ IV/2.1.1) for this model is
AIC(m) = n log( wRSS/n ) + n [1 + log(2π)] + log|V| + 2(p + 1) (2)
where wRSS is the weighted residual sum of squares (→ III/1.5.8), p is the number of regressors
(→ III/1.5.1) in the design matrix X and n is the number of observations (→ III/1.5.1) in the data
vector y.
Sources:
• Claeskens G, Hjort NL (2008): “Akaike’s information criterion”; in: Model Selection and Model Av-
eraging, ex. 2.2, p. 66; URL: https://www.cambridge.org/core/books/model-selection-and-model-averaging
E6F1EC77279D1223423BB64FC3A12C37; DOI: 10.1017/CBO9780511790485.
m : y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (1)
Then, the Bayesian information criterion (→ IV/2.2.1) for this model is
BIC(m) = n log( wRSS/n ) + n [1 + log(2π)] + log|V| + log(n) (p + 1) (2)
where wRSS is the weighted residual sum of squares (→ III/1.5.8), p is the number of regressors
(→ III/1.5.1) in the design matrix X and n is the number of observations (→ III/1.5.1) in the data
vector y.
m : y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (1)
Then, the corrected Akaike information criterion (→ IV/2.1.2) for this model is
AICc(m) = n log( wRSS/n ) + n [1 + log(2π)] + log|V| + 2n(p + 1)/(n − p − 2) (2)
where wRSS is the weighted residual sum of squares (→ III/1.5.8), p is the number of regressors
(→ III/1.5.1) in the design matrix X and n is the number of observations (→ III/1.5.1) in the data
vector y.
Proof: The corrected Akaike information criterion (→ IV/2.1.2) is given in terms of the AIC as

AICc(m) = AIC(m) + (2k² + 2k)/(n − k − 1) (3)
where AIC(m) is the Akaike information criterion (→ IV/2.1.1), k is the number of free parameters
in m and n is the number of observations.
The Akaike information criterion for multiple linear regression (→ III/1.5.27) is given by
AIC(m) = n log( wRSS/n ) + n [1 + log(2π)] + log|V| + 2(p + 1) (4)
and the number of free parameters in multiple linear regression (→ III/1.5.1) is k = p + 1, i.e. one for
each regressor in the design matrix (→ III/1.5.1) X, plus one for the noise variance (→ III/1.5.1)
σ².
Thus, the corrected AIC of m follows from (3) and (4) as
AICc(m) = n log( wRSS/n ) + n [1 + log(2π)] + log|V| + 2k + (2k² + 2k)/(n − k − 1)
        = n log( wRSS/n ) + n [1 + log(2π)] + log|V| + (2nk − 2k² − 2k)/(n − k − 1) + (2k² + 2k)/(n − k − 1)
        = n log( wRSS/n ) + n [1 + log(2π)] + log|V| + 2nk/(n − k − 1)      (5)
        = n log( wRSS/n ) + n [1 + log(2π)] + log|V| + 2n(p + 1)/(n − p − 2) .
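The bookkeeping between AIC, BIC and AICc can be illustrated with hypothetical numbers (a sketch outside the book; NumPy is assumed and wRSS is a made-up value, not computed from data):

```python
import numpy as np

n, p = 50, 4
k = p + 1                      # free parameters: p coefficients + noise variance
wrss = 123.4                   # hypothetical weighted residual sum of squares
logdetV = 0.0                  # V = I_n

base = n*np.log(wrss/n) + n*(1 + np.log(2*np.pi)) + logdetV
aic  = base + 2*k                          # AIC (4)
bic  = base + np.log(n)*k                  # BIC
aicc = base + 2*n*k/(n - k - 1)            # AICc (5)

# (3): AICc = AIC + (2k^2 + 2k)/(n - k - 1)
assert np.isclose(aicc, aic + (2*k**2 + 2*k)/(n - k - 1))
# with k = p + 1, the denominator n - k - 1 equals n - p - 2
assert n - k - 1 == n - p - 2
```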
Sources:
• Claeskens G, Hjort NL (2008): “Akaike’s information criterion”; in: Model Selection and Model Av-
eraging, ex. 2.5, p. 67; URL: https://www.cambridge.org/core/books/model-selection-and-model-averaging
E6F1EC77279D1223423BB64FC3A12C37; DOI: 10.1017/CBO9780511790485.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients β
and unknown noise variance σ 2 .
Then, the conjugate prior (→ I/5.2.5) for this model is a normal-gamma distribution (→ II/4.3.1)
Proof: By definition, a conjugate prior (→ I/5.2.5) is a prior distribution (→ I/5.1.3) that, when
combined with the likelihood function (→ I/5.1.2), leads to a posterior distribution (→ I/5.1.7) that
belongs to the same family of probability distributions (→ I/1.5.1). This is fulfilled when the prior
density and the likelihood function are proportional to the model parameters in the same way, i.e.
the model parameters appear in the same functional form in both.
Equation (1) implies the following likelihood function (→ I/5.1.2)
p(y|β, σ²) = N(y; Xβ, σ²V) = √( 1/((2π)ⁿ |σ²V|) ) · exp[ −(1/(2σ²)) (y − Xβ)ᵀV⁻¹(y − Xβ) ] (3)

which, for mathematical convenience, can also be parametrized as

p(y|β, τ) = N(y; Xβ, (τP)⁻¹) = √( |τP|/(2π)ⁿ ) · exp[ −(τ/2) (y − Xβ)ᵀP(y − Xβ) ] (4)
using the noise precision τ = 1/σ 2 and the n × n precision matrix P = V −1 .
In other words, the likelihood function (→ I/5.1.2) is proportional to a power of τ , times an expo-
nential of τ and an exponential of a squared form of β, weighted by τ :
p(y|β, τ) ∝ τ^(n/2) · exp[ −(τ/2) ( yᵀPy − yᵀQy ) ] · exp[ −(τ/2) (β − X̃y)ᵀ XᵀPX (β − X̃y) ] . (8)
The same is true for a normal-gamma distribution (→ II/4.3.1) over β and τ
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161, ex. 3.12, eq. 3.112; URL: https://www.springer.com/gp/book/9780387310732.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients β
and unknown noise variance σ 2 . Moreover, assume a normal-gamma prior distribution (→ III/1.6.1)
over the model parameters β and τ = 1/σ 2 :
µₙ = Λₙ⁻¹ (XᵀPy + Λ₀µ₀)
Λₙ = XᵀPX + Λ₀
aₙ = a₀ + n/2      (4)
bₙ = b₀ + (1/2) ( yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) .
Proof: According to Bayes’ theorem (→ I/5.3.1), the posterior distribution (→ I/5.1.7) is given by
p(β, τ|y) = p(y|β, τ) p(β, τ) / p(y) . (5)
Since p(y) is just a normalization factor, the posterior is proportional (→ I/5.1.9) to the numerator:
Combining the likelihood function (→ I/5.1.2) (8) with the prior distribution (→ I/5.1.3) (2), the
joint likelihood (→ I/5.1.5) of the model is given by
p(y, β, τ) = √( τ^(n+p)/(2π)^(n+p) · |P| |Λ₀| ) · ( b₀^a₀ / Γ(a₀) ) τ^(a₀−1) exp[−b₀τ] ·
             exp[ −(τ/2) ( (y − Xβ)ᵀP(y − Xβ) + (β − µ₀)ᵀΛ₀(β − µ₀) ) ] . (10)
Expanding the products in the exponent gives:
p(y, β, τ) = √( τ^(n+p)/(2π)^(n+p) · |P| |Λ₀| ) · ( b₀^a₀ / Γ(a₀) ) τ^(a₀−1) exp[−b₀τ] ·
             exp[ −(τ/2) ( yᵀPy − yᵀPXβ − βᵀXᵀPy + βᵀXᵀPXβ + βᵀΛ₀β − βᵀΛ₀µ₀ − µ₀ᵀΛ₀β + µ₀ᵀΛ₀µ₀ ) ] . (11)
Completing the square over β, we obtain

p(y, β, τ) = √( τ^(n+p)/(2π)^(n+p) · |P| |Λ₀| ) · ( b₀^a₀ / Γ(a₀) ) τ^(a₀−1) exp[−b₀τ] ·
             exp[ −(τ/2) ( (β − µₙ)ᵀΛₙ(β − µₙ) + yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) ] (12)
with the posterior hyperparameters (→ I/5.1.7)
µₙ = Λₙ⁻¹ (XᵀPy + Λ₀µ₀)
Λₙ = XᵀPX + Λ₀ . (13)
aₙ = a₀ + n/2
bₙ = b₀ + (1/2) ( yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) . (15)
From the term in (14), we can isolate the posterior distribution over β given τ :
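The posterior hyperparameters (13) and (15) are easy to compute in practice. The sketch below (not part of the book) uses NumPy, simulated data and a hypothetical weakly informative prior, and checks that with a nearly flat prior the posterior mean approaches the maximum likelihood estimate:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)
P = np.eye(n)                          # precision matrix, V = I_n

# hypothetical weakly informative prior hyperparameters
mu0, L0 = np.zeros(p), 1e-3*np.eye(p)
a0, b0 = 1e-3, 1e-3

# posterior hyperparameters (13) and (15)
Ln = X.T @ P @ X + L0
mun = np.linalg.solve(Ln, X.T @ P @ y + L0 @ mu0)
an = a0 + n/2
bn = b0 + 0.5*(y @ P @ y + mu0 @ L0 @ mu0 - mun @ Ln @ mun)

# with a nearly flat prior, the posterior mean approaches the ML/WLS estimate
b_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
assert np.allclose(mun, b_hat, atol=1e-2)
assert bn > 0
```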
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161, ex. 3.12, eq. 3.113; URL: https://www.springer.com/gp/book/9780387310732.
m : y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients β
and unknown noise variance σ 2 . Moreover, assume a normal-gamma prior distribution (→ III/1.6.1)
over the model parameters β and τ = 1/σ 2 :
log p(y|m) = (1/2) log|P| − (n/2) log(2π) + (1/2) log|Λ₀| − (1/2) log|Λₙ|
           + log Γ(aₙ) − log Γ(a₀) + a₀ log b₀ − aₙ log bₙ (3)
with the posterior hyperparameters (→ III/1.6.2)

µₙ = Λₙ⁻¹ (XᵀPy + Λ₀µ₀)
Λₙ = XᵀPX + Λ₀
aₙ = a₀ + n/2      (4)
bₙ = b₀ + (1/2) ( yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) .
Proof: According to the law of marginal probability (→ I/1.3.3), the model evidence (→ I/5.1.11)
for this model is:
p(y|m) = ∫∫ p(y|β, τ) p(β, τ) dβ dτ . (5)
According to the law of conditional probability (→ I/1.3.4), the integrand is equivalent to the joint
likelihood (→ I/5.1.5):
p(y|m) = ∫∫ p(y, β, τ) dβ dτ . (6)
When deriving the posterior distribution (→ III/1.6.2) p(β, τ |y), the joint likelihood p(y, β, τ ) is
obtained as
p(y, β, τ) = √( τⁿ|P|/(2π)ⁿ ) · √( τᵖ|Λ₀|/(2π)ᵖ ) · ( b₀^a₀ / Γ(a₀) ) τ^(a₀−1) exp[−b₀τ] ·
             exp[ −(τ/2) ( (β − µₙ)ᵀΛₙ(β − µₙ) + yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) ] . (9)
Using the probability density function of the multivariate normal distribution (→ II/4.1.4), we can
rewrite this as
p(y, β, τ) = √( τⁿ|P|/(2π)ⁿ ) · √( τᵖ|Λ₀|/(2π)ᵖ ) · √( (2π)ᵖ/(τᵖ|Λₙ|) ) · ( b₀^a₀ / Γ(a₀) ) τ^(a₀−1) exp[−b₀τ] ·
             N(β; µₙ, (τΛₙ)⁻¹) · exp[ −(τ/2) ( yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) ] . (10)
Now, β can be integrated out easily:
∫ p(y, β, τ) dβ = √( τⁿ|P|/(2π)ⁿ ) · √( |Λ₀|/|Λₙ| ) · ( b₀^a₀ / Γ(a₀) ) τ^(a₀−1) exp[−b₀τ] ·
                 exp[ −(τ/2) ( yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) ] . (11)
Using the probability density function of the gamma distribution (→ II/3.4.7), we can rewrite this
as
∫ p(y, β, τ) dβ = √( |P|/(2π)ⁿ ) · √( |Λ₀|/|Λₙ| ) · ( b₀^a₀ / Γ(a₀) ) · ( Γ(aₙ) / bₙ^aₙ ) · Gam(τ; aₙ, bₙ) . (12)
Finally, τ can also be integrated out:
∫∫ p(y, β, τ) dβ dτ = √( |P|/(2π)ⁿ ) · √( |Λ₀|/|Λₙ| ) · ( Γ(aₙ) / Γ(a₀) ) · ( b₀^a₀ / bₙ^aₙ ) = p(y|m) . (13)
Thus, the log model evidence (→ IV/3.1.3) of this model is given by
log p(y|m) = (1/2) log|P| − (n/2) log(2π) + (1/2) log|Λ₀| − (1/2) log|Λₙ|
           + log Γ(aₙ) − log Γ(a₀) + a₀ log b₀ − aₙ log bₙ . (14)
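Given the posterior hyperparameters, the log model evidence (14) can be evaluated directly. Below is a NumPy sketch (not from the book) with simulated data and hypothetical prior hyperparameters, using the standard library's math.lgamma for the log-gamma terms:

```python
import numpy as np
from math import lgamma, log, pi

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0]) + rng.normal(size=n)
P = np.eye(n)

mu0, L0 = np.zeros(p), 1e-2*np.eye(p)
a0, b0 = 1.0, 1.0                       # hypothetical prior hyperparameters

# posterior hyperparameters (4)
Ln = X.T @ P @ X + L0
mun = np.linalg.solve(Ln, X.T @ P @ y + L0 @ mu0)
an = a0 + n/2
bn = b0 + 0.5*(y @ P @ y + mu0 @ L0 @ mu0 - mun @ Ln @ mun)

# (14): log model evidence from the posterior hyperparameters
lme = (0.5*np.linalg.slogdet(P)[1] - n/2*log(2*pi)
       + 0.5*np.linalg.slogdet(L0)[1] - 0.5*np.linalg.slogdet(Ln)[1]
       + lgamma(an) - lgamma(a0) + a0*log(b0) - an*log(bn))
assert np.isfinite(lme)
```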
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161, ex. 3.23, eq. 3.118; URL: https://www.springer.com/gp/book/9780387310732.
m : y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients β
and unknown noise variance σ 2 . Moreover, assume a normal-gamma prior distribution (→ III/1.6.1)
over the model parameters β and τ = 1/σ 2 :
Acc(m) = −(1/2)(aₙ/bₙ) (y − Xµₙ)ᵀP(y − Xµₙ) − (1/2) tr( XᵀPX Λₙ⁻¹ )
       + (1/2) log|P| − (n/2) log(2π) + (n/2) ( ψ(aₙ) − log(bₙ) )

Com(m) = (1/2)(aₙ/bₙ) [ (µ₀ − µₙ)ᵀΛ₀(µ₀ − µₙ) − 2(bₙ − b₀) ] + (1/2) tr( Λ₀Λₙ⁻¹ ) − (1/2) log( |Λ₀|/|Λₙ| ) − p/2      (3)
       + a₀ log( bₙ/b₀ ) − log( Γ(aₙ)/Γ(a₀) ) + (aₙ − a₀) ψ(aₙ) .
where µn , Λn , an and bn are the posterior hyperparameters for Bayesian linear regression (→ III/1.6.2)
and P is the data precision matrix (→ I/1.13.19): P = V −1 .
1) The accuracy term is the expectation (→ I/1.10.1) of the log-likelihood function (→ I/4.1.2)
log p(y|β, τ ) with respect to the posterior distribution (→ I/5.1.7) p(β, τ |y). This expectation can be
rewritten as:
Acc(m) = ∫∫ p(β, τ|y) log p(y|β, τ) dβ dτ
       = ∫ p(τ|y) [ ∫ p(β|τ, y) log p(y|β, τ) dβ ] dτ      (5)
       = ⟨ ⟨ log p(y|β, τ) ⟩_p(β|τ,y) ⟩_p(τ|y) .
With the log-likelihood function for multiple linear regression (→ III/1.5.26), we have:
Acc(m) = ⟨⟨ log( √( 1/((2π)ⁿ|σ²V|) ) · exp[ −(1/2)(y − Xβ)ᵀ(σ²V)⁻¹(y − Xβ) ] ) ⟩_p(β|τ,y) ⟩_p(τ|y)
       = ⟨⟨ log( √( τⁿ|P|/(2π)ⁿ ) · exp[ −(1/2)(y − Xβ)ᵀ(τP)(y − Xβ) ] ) ⟩_p(β|τ,y) ⟩_p(τ|y)
       = ⟨ (1/2) log|P| + (n/2) log τ − (n/2) log(2π) − ⟨ (1/2)(y − Xβ)ᵀ(τP)(y − Xβ) ⟩_p(β|τ,y) ⟩_p(τ|y)
       = ⟨ (1/2) log|P| + (n/2) log τ − (n/2) log(2π) − (τ/2) ⟨ yᵀPy − 2yᵀPXβ + βᵀXᵀPXβ ⟩_p(β|τ,y) ⟩_p(τ|y) . (6)
With the posterior distribution for Bayesian linear regression (→ III/1.6.2), this becomes:
Acc(m) = ⟨ (1/2) log|P| + (n/2) log τ − (n/2) log(2π) − (τ/2) ⟨ yᵀPy − 2yᵀPXβ + βᵀXᵀPXβ ⟩_N(β;µₙ,(τΛₙ)⁻¹) ⟩_Gam(τ;aₙ,bₙ) . (7)
If x ∼ N(µ, Σ), then its expected value is (→ II/4.1.8)
⟨x⟩ = µ (8)
and the expectation of a quadratic form is given by (→ I/1.10.9)
⟨xᵀAx⟩ = µᵀAµ + tr(AΣ) . (9)
Thus, the model accuracy of m evaluates to:
Acc(m) = ⟨ (1/2) log|P| + (n/2) log τ − (n/2) log(2π)
         − (τ/2) [ yᵀPy − 2yᵀPXµₙ + µₙᵀXᵀPXµₙ + (1/τ) tr( XᵀPX Λₙ⁻¹ ) ] ⟩_Gam(τ;aₙ,bₙ)
       = ⟨ (1/2) log|P| + (n/2) log τ − (n/2) log(2π) − (τ/2) (y − Xµₙ)ᵀP(y − Xµₙ) − (1/2) tr( XᵀPX Λₙ⁻¹ ) ⟩_Gam(τ;aₙ,bₙ) . (10)
Using ⟨τ⟩_Gam(τ;aₙ,bₙ) = aₙ/bₙ (→ II/3.4.10) and ⟨log τ⟩_Gam(τ;aₙ,bₙ) = ψ(aₙ) − log(bₙ) (→ II/3.4.12), this evaluates to

Acc(m) = −(1/2)(aₙ/bₙ) (y − Xµₙ)ᵀP(y − Xµₙ) − (1/2) tr( XᵀPX Λₙ⁻¹ )
       + (1/2) log|P| − (n/2) log(2π) + (n/2) ( ψ(aₙ) − log(bₙ) ) (13)
which proves the first part of (3).
2) The complexity penalty is the Kullback-Leibler divergence (→ I/2.5.1) of the posterior distribution
(→ I/5.1.7) p(β, τ |y) from the prior distribution (→ I/5.1.3) p(β, τ ). This can be rewritten as follows:
Com(m) = ∫∫ p(β, τ|y) log( p(β, τ|y) / p(β, τ) ) dβ dτ
       = ∫∫ p(β|τ, y) p(τ|y) log( (p(β|τ, y) p(τ|y)) / (p(β|τ) p(τ)) ) dβ dτ
       = ∫ p(τ|y) [ ∫ p(β|τ, y) log( p(β|τ, y)/p(β|τ) ) dβ ] dτ + ∫ p(τ|y) log( p(τ|y)/p(τ) ) [ ∫ p(β|τ, y) dβ ] dτ      (14)
       = ⟨ KL[ p(β|τ, y) || p(β|τ) ] ⟩_p(τ|y) + KL[ p(τ|y) || p(τ) ] .
With the prior distribution (→ III/1.6.1) given by (2) and the posterior distribution for Bayesian
linear regression (→ III/1.6.2), this becomes:
Com(m) = ⟨ KL[ N(β; µₙ, (τΛₙ)⁻¹) || N(β; µ₀, (τΛ₀)⁻¹) ] ⟩_Gam(τ;aₙ,bₙ)
       + KL[ Gam(τ; aₙ, bₙ) || Gam(τ; a₀, b₀) ] . (15)
With the Kullback-Leibler divergence for the multivariate normal distribution (→ II/4.1.11)
KL[ N(µ₁, Σ₁) || N(µ₂, Σ₂) ] = (1/2) [ (µ₂ − µ₁)ᵀΣ₂⁻¹(µ₂ − µ₁) + tr( Σ₂⁻¹Σ₁ ) − ln( |Σ₁|/|Σ₂| ) − n ] (16)
and the Kullback-Leibler divergence for the gamma distribution

KL[ Gam(a₁, b₁) || Gam(a₂, b₂) ] = a₂ ln( b₁/b₂ ) − ln( Γ(a₁)/Γ(a₂) ) + (a₁ − a₂) ψ(a₁) − (b₁ − b₂) (a₁/b₁) , (17)
the model complexity of m evaluates to:
Com(m) = ⟨ (1/2) [ (µ₀ − µₙ)ᵀ(τΛ₀)(µ₀ − µₙ) + tr( (τΛ₀)(τΛₙ)⁻¹ ) − log( |(τΛₙ)⁻¹|/|(τΛ₀)⁻¹| ) − p ] ⟩_p(τ|y)
       + a₀ log( bₙ/b₀ ) − log( Γ(aₙ)/Γ(a₀) ) + (aₙ − a₀) ψ(aₙ) − (bₙ − b₀) (aₙ/bₙ) . (18)
Using ⟨τ⟩_Gam(τ;aₙ,bₙ) = aₙ/bₙ (→ II/3.4.10), this evaluates to

Com(m) = (1/2)(aₙ/bₙ) (µ₀ − µₙ)ᵀΛ₀(µ₀ − µₙ) + (1/2) tr( Λ₀Λₙ⁻¹ ) − (1/2) log( |Λ₀|/|Λₙ| ) − p/2
       + a₀ log( bₙ/b₀ ) − log( Γ(aₙ)/Γ(a₀) ) + (aₙ − a₀) ψ(aₙ) − (bₙ − b₀) (aₙ/bₙ) . (19)
Rearranging the terms, we finally have

Com(m) = (1/2)(aₙ/bₙ) [ (µ₀ − µₙ)ᵀΛ₀(µ₀ − µₙ) − 2(bₙ − b₀) ] + (1/2) tr( Λ₀Λₙ⁻¹ ) − (1/2) log( |Λ₀|/|Λₙ| ) − p/2
       + a₀ log( bₙ/b₀ ) − log( Γ(aₙ)/Γ(a₀) ) + (aₙ − a₀) ψ(aₙ) (20)

which proves the second part of (3).
The log model evidence (→ III/1.6.3) of this model is

log p(y|m) = (1/2) log|P| − (n/2) log(2π) + (1/2) log|Λ₀| − (1/2) log|Λₙ|
           + log Γ(aₙ) − log Γ(a₀) + a₀ log b₀ − aₙ log bₙ (22)
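Since the posterior used here is exact, the free energy Acc(m) − Com(m) must equal the log model evidence. The following sketch (not part of the book; NumPy assumed, with a hand-rolled digamma approximation) verifies this numerically:

```python
import numpy as np
from math import lgamma, log, pi

def digamma(x):
    # asymptotic series with recurrence; adequate for the arguments used here
    r = 0.0
    while x < 6.0:
        r -= 1.0/x
        x += 1.0
    f = 1.0/(x*x)
    return r + log(x) - 0.5/x - f*(1/12 - f*(1/120 - f/252))

rng = np.random.default_rng(4)
n, p = 25, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)
P = np.eye(n)

mu0, L0 = np.zeros(p), 0.1*np.eye(p)
a0, b0 = 2.0, 1.0                      # hypothetical prior hyperparameters

Ln = X.T @ P @ X + L0
mun = np.linalg.solve(Ln, X.T @ P @ y + L0 @ mu0)
an = a0 + n/2
bn = b0 + 0.5*(y @ P @ y + mu0 @ L0 @ mu0 - mun @ Ln @ mun)

ld = lambda M: np.linalg.slogdet(M)[1]
# accuracy and complexity per (3)
acc = (-0.5*(an/bn)*float((y - X @ mun) @ P @ (y - X @ mun))
       - 0.5*np.trace(X.T @ P @ X @ np.linalg.inv(Ln))
       + 0.5*ld(P) - n/2*log(2*pi) + n/2*(digamma(an) - log(bn)))
com = (0.5*(an/bn)*(float((mu0 - mun) @ L0 @ (mu0 - mun)) - 2*(bn - b0))
       + 0.5*np.trace(L0 @ np.linalg.inv(Ln)) - 0.5*(ld(L0) - ld(Ln)) - p/2
       + a0*log(bn/b0) - (lgamma(an) - lgamma(a0)) + (an - a0)*digamma(an))
# log model evidence per (22)
lme = (0.5*ld(P) - n/2*log(2*pi) + 0.5*ld(L0) - 0.5*ld(Ln)
       + lgamma(an) - lgamma(a0) + a0*log(b0) - an*log(bn))

assert np.isclose(acc - com, lme)      # free energy = log model evidence
```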
Sources:
• Soch J, Allefeld A (2016): “Kullback-Leibler Divergence for the Normal-Gamma Distribution”;
in: arXiv math.ST, 1611.01437, eqs. 23/30; URL: https://arxiv.org/abs/1611.01437.
• Soch J, Allefeld C, Haynes JD (2016): “How to avoid mismodelling in GLM-based fMRI data anal-
ysis: cross-validated Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469-489, Appendix C;
URL: https://www.sciencedirect.com/science/article/pii/S1053811916303615; DOI: 10.1016/j.neuroimage.
• Soch J (2018): “cvBMS and cvBMA: filling in the gaps”; in: arXiv stat.ME, sect. 2.2, eqs. 8-24;
URL: https://arxiv.org/abs/1807.01585; DOI: 10.48550/arXiv.1807.01585.
µₙ = Λₙ⁻¹ (XᵀPy + Λ₀µ₀)
Λₙ = XᵀPX + Λ₀
aₙ = a₀ + n/2      (10)
bₙ = b₀ + (1/2) ( yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) .
Thus, we have the following posterior expectations:
⟨β⟩_β,τ|y = µₙ (11)

⟨τ⟩_β,τ|y = aₙ/bₙ (12)

⟨log τ⟩_τ|y = ψ(aₙ) − log(bₙ) (13)

⟨βᵀAβ⟩_β|τ,y = µₙᵀAµₙ + tr( A(τΛₙ)⁻¹ ) = µₙᵀAµₙ + (1/τ) tr( AΛₙ⁻¹ ) . (14)
In these identities, we have used the mean of the multivariate normal distribution (→ II/4.1.8), the
mean of the gamma distribution (→ II/3.4.10), the logarithmic expectation of the gamma distribution
(→ II/3.4.12), the expectation of a quadratic form (→ I/1.10.9) and the covariance of the multivariate
normal distribution (→ II/4.1.9).
With that, the deviance at the expectation is:
(5)
D(⟨β⟩ , ⟨τ ⟩) = n · log(2π) − n · log(⟨τ ⟩) − log |P | + τ · (y − X ⟨β⟩)T P (y − X ⟨β⟩)
(11)
= n · log(2π) − n · log(⟨τ ⟩) − log |P | + τ · (y − Xµn )T P (y − Xµn ) (15)
(12) an an
= n · log(2π) − n · log − log |P | + · (y − Xµn )T P (y − Xµn ) .
bn bn
The expectation of the deviance is

⟨D(β, τ)⟩ (5)= ⟨ n · log(2π) − n · log(τ) − log|P| + τ · (y − Xβ)ᵀP(y − Xβ) ⟩
            = n · log(2π) − n · ⟨log(τ)⟩ − log|P| + ⟨ τ · (y − Xβ)ᵀP(y − Xβ) ⟩
        (13)= n · log(2π) − n · [ψ(aₙ) − log(bₙ)] − log|P| + ⟨ τ · ⟨ (y − Xβ)ᵀP(y − Xβ) ⟩_β|τ,y ⟩_τ|y
        (14)= n · log(2π) − n · [ψ(aₙ) − log(bₙ)] − log|P| + (aₙ/bₙ) · (y − Xµₙ)ᵀP(y − Xµₙ) + tr( XᵀPX Λₙ⁻¹ ) . (16)
Finally, the deviance information criterion is

DIC(m) (8)= 2 ⟨D(β, τ)⟩ − D(⟨β⟩, ⟨τ⟩)
      (16)= 2 [ n · log(2π) − n · [ψ(aₙ) − log(bₙ)] − log|P| + (aₙ/bₙ) · (y − Xµₙ)ᵀP(y − Xµₙ) + tr( XᵀPX Λₙ⁻¹ ) ]
      (15)− [ n · log(2π) − n · log( aₙ/bₙ ) − log|P| + (aₙ/bₙ) · (y − Xµₙ)ᵀP(y − Xµₙ) ]      (17)
         = n · log(2π) − 2nψ(aₙ) + 2n log(bₙ) + n log(aₙ) − n log(bₙ) − log|P| + (aₙ/bₙ) (y − Xµₙ)ᵀP(y − Xµₙ) + 2 tr( XᵀPX Λₙ⁻¹ )
         = n · log(2π) − n [ 2ψ(aₙ) − log(aₙ) − log(bₙ) ] − log|P| + (aₙ/bₙ) (y − Xµₙ)ᵀP(y − Xµₙ) + 2 tr( XᵀPX Λₙ⁻¹ ) .
This conforms to equation (3).
■
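The chain (17) can be cross-checked by computing the DIC both as 2⟨D⟩ − D(⟨β⟩, ⟨τ⟩) and from the closed form. This is a NumPy sketch, not from the book; the digamma function is approximated by an asymptotic series:

```python
import numpy as np
from math import log, pi

def digamma(x):
    # asymptotic series with recurrence, adequate for the arguments used here
    r = 0.0
    while x < 6.0:
        r -= 1.0/x
        x += 1.0
    f = 1.0/(x*x)
    return r + log(x) - 0.5/x - f*(1/12 - f*(1/120 - f/252))

rng = np.random.default_rng(6)
n, p = 20, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
P = np.eye(n)

mu0, L0, a0, b0 = np.zeros(p), 0.01*np.eye(p), 1.0, 1.0    # hypothetical prior
Ln = X.T @ P @ X + L0
mun = np.linalg.solve(Ln, X.T @ P @ y + L0 @ mu0)
an, bn = a0 + n/2, b0 + 0.5*(y @ P @ y + mu0 @ L0 @ mu0 - mun @ Ln @ mun)

ld = np.linalg.slogdet(P)[1]
Q = float((y - X @ mun) @ P @ (y - X @ mun))
T = float(np.trace(X.T @ P @ X @ np.linalg.inv(Ln)))

# (15): deviance at the posterior expectations
D_hat = n*log(2*pi) - n*log(an/bn) - ld + (an/bn)*Q
# (16): posterior expectation of the deviance
D_bar = n*log(2*pi) - n*(digamma(an) - log(bn)) - ld + (an/bn)*Q + T
# (8): DIC = 2 <D> - D(<beta>, <tau>), versus the closed form in (17)
dic = 2*D_bar - D_hat
dic_cf = (n*log(2*pi) - n*(2*digamma(an) - log(an) - log(bn)) - ld
          + (an/bn)*Q + 2*T)
assert np.isclose(dic, dic_cf)
```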
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
and assume a normal-gamma (→ II/4.3.1) prior distribution (→ I/5.1.3) over the model parameters
β and τ = 1/σ 2 :
H1 : cT β > 0 (3)
is given by
Pr(H₁|y) = 1 − T( −cᵀµ / √(cᵀΣc); ν ) (4)
where c is a p × 1 contrast vector (→ III/1.5.22), T(x; ν) is the cumulative distribution function (→
I/1.8.1) of the t-distribution (→ II/3.3.1) with ν degrees of freedom and µ, Σ and ν can be obtained
from the posterior hyperparameters (→ I/5.1.7) of Bayesian linear regression.
Proof: The posterior distribution for Bayesian linear regression (→ III/1.6.2) is given by a normal-
gamma distribution (→ II/4.3.1) over β and τ = 1/σ 2
µₙ = Λₙ⁻¹ (XᵀPy + Λ₀µ₀)
Λₙ = XᵀPX + Λ₀
aₙ = a₀ + n/2      (6)
bₙ = b₀ + (1/2) ( yᵀPy + µ₀ᵀΛ₀µ₀ − µₙᵀΛₙµₙ ) .
The marginal distribution of a normal-gamma distribution is a multivariate t-distribution (→ II/4.3.8),
such that the marginal (→ I/1.5.3) posterior (→ I/5.1.7) distribution of β is
β ∼ t(µ, Σ, ν) (7)

where

µ = µₙ
Σ = ( (aₙ/bₙ) Λₙ )⁻¹      (8)
ν = 2aₙ .
Define the quantity γ = cᵀβ. According to the linear transformation theorem for the multivariate
t-distribution (→ II/4.2.1), γ also follows a t-distribution:

γ ∼ t( cᵀµ, cᵀΣc, ν ) . (9)
Using the relation between non-standardized t-distribution and standard t-distribution (→ II/3.3.4),
we can finally write:
Pr(H₁|y) = 1 − T( (0 − cᵀµ) / √(cᵀΣc); ν )
         = 1 − T( −cᵀµ / √(cᵀΣc); ν ) . (11)
cT Σc
■
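The marginal-t argument in (7) to (9) can be validated by Monte Carlo: sampling (β, τ) from the normal-gamma posterior and sampling γ = cᵀβ from its marginal t-distribution should give matching exceedance probabilities. This is a NumPy sketch with simulated data and hypothetical priors, not from the book:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.3, 0.8]) + rng.normal(size=n)
P = np.eye(n)

mu0, L0, a0, b0 = np.zeros(p), 0.01*np.eye(p), 1.0, 1.0    # hypothetical prior
Ln = X.T @ P @ X + L0
mun = np.linalg.solve(Ln, X.T @ P @ y + L0 @ mu0)
an, bn = a0 + n/2, b0 + 0.5*(y @ P @ y + mu0 @ L0 @ mu0 - mun @ Ln @ mun)

c = np.array([0.0, 1.0])
Sigma = np.linalg.inv((an/bn)*Ln)          # (8)
nu = 2*an

m = 200000
# route 1: sample from the normal-gamma posterior (5), then apply the contrast
tau = rng.gamma(an, 1/bn, size=m)
Lchol = np.linalg.cholesky(np.linalg.inv(Ln))
beta = mun[:, None] + (Lchol @ rng.normal(size=(p, m))) / np.sqrt(tau)
p1 = np.mean(c @ beta > 0)
# route 2: sample gamma = c'beta from its marginal t-distribution, per (9)
gam = c @ mun + np.sqrt(c @ Sigma @ c) * rng.standard_t(nu, size=m)
p2 = np.mean(gam > 0)
assert abs(p1 - p2) < 0.01                 # both estimates of Pr(H1|y) agree
```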
Sources:
• Koch, Karl-Rudolf (2007): “Multivariate t-distribution”; in: Introduction to Bayesian Statistics,
Springer, Berlin/Heidelberg, 2007, eqs. 2.235, 2.236, 2.213, 2.210, 2.188; URL: https://www.
springer.com/de/book/9783540727231; DOI: 10.1007/978-3-540-72726-2.
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
and assume a normal-gamma (→ II/4.3.1) prior distribution (→ I/5.1.3) over the model parameters
β and τ = 1/σ 2 :
H0 : C T β = 0 (3)
is given by the credibility level
(1 − α) = F( (µᵀC (CᵀΣC)⁻¹ Cᵀµ) / q; q, ν ) (4)
where C is a p × q contrast matrix (→ III/1.5.23), F(x; v, w) is the cumulative distribution function
(→ I/1.8.1) of the F-distribution (→ II/3.8.1) with v numerator degrees of freedom, w denominator
degrees of freedom and µ, Σ and ν can be obtained from the posterior hyperparameters (→ I/5.1.7)
of Bayesian linear regression.
Proof: The posterior distribution for Bayesian linear regression (→ III/1.6.2) is given by a normal-
gamma distribution (→ II/4.3.1) over β and τ = 1/σ²:
$$\begin{split} \mu_n &= \Lambda_n^{-1} (X^T P y + \Lambda_0 \mu_0) \\ \Lambda_n &= X^T P X + \Lambda_0 \\ a_n &= a_0 + \frac{n}{2} \\ b_n &= b_0 + \frac{1}{2} \left( y^T P y + \mu_0^T \Lambda_0 \mu_0 - \mu_n^T \Lambda_n \mu_n \right) . \end{split} \quad (6)$$
The marginal distribution of a normal-gamma distribution is a multivariate t-distribution (→ II/4.3.8),
such that the marginal (→ I/1.5.3) posterior (→ I/5.1.7) distribution of β is
$$\begin{split} \mu &= \mu_n \\ \Sigma &= \left( \frac{a_n}{b_n} \Lambda_n \right)^{-1} \\ \nu &= 2\, a_n . \end{split} \quad (8)$$
Define the quantity γ = C^T β. According to the linear transformation theorem for the multivariate t-distribution (→ II/4.2.1), γ also follows a multivariate t-distribution:
$$\gamma \sim t\left( C^T \mu,\; C^T \Sigma\, C;\; \nu \right) . \quad (9)$$
$$\begin{split} (1 - \alpha) &= \mathrm{F}\left( Q_{\mathrm{F}}(0);\; q, \nu \right) \\ &= \mathrm{F}\left( \mu^T C (C^T \Sigma\, C)^{-1} C^T \mu \,/\, q;\; q, \nu \right) . \end{split} \quad (11)$$
■
Sources:
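As a numerical illustration of (4), the credibility level can be evaluated with SciPy's F-distribution CDF playing the role of F(x; v, w). This is a minimal sketch, assuming the posterior quantities µ, Σ and ν are already available; the function name is hypothetical:

```python
import numpy as np
from scipy import stats

def credibility_level(C, mu, Sigma, nu):
    """(1 - alpha) from eq. (4) for a p x q contrast matrix C."""
    q = C.shape[1]
    Cm = C.T @ mu
    # quadratic form mu^T C (C^T Sigma C)^{-1} C^T mu, divided by q
    stat = Cm @ np.linalg.solve(C.T @ Sigma @ C, Cm) / q
    return stats.f.cdf(stat, q, nu)
```

When µ = 0, the quadratic form vanishes and the credibility level is zero, i.e. H0 cannot be rejected at any level.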
1. UNIVARIATE NORMAL DATA 481
y = Xβ + ε, ε ∼ N (0, Σ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X and known n × n covariance matrix Σ as well as unknown p × 1 regression coefficients β.
Then, the conjugate prior (→ I/5.2.5) for this model is a multivariate normal distribution (→ II/4.1.1):
$$p(\beta) = \mathcal{N}(\beta; \mu_0, \Sigma_0) . \quad (2)$$
Proof: By definition, a conjugate prior (→ I/5.2.5) is a prior distribution (→ I/5.1.3) that, when
combined with the likelihood function (→ I/5.1.2), leads to a posterior distribution (→ I/5.1.7) that
belongs to the same family of probability distributions (→ I/1.5.1). This is fulfilled when the prior
density and the likelihood function are proportional to the model parameters in the same way, i.e.
the model parameters appear in the same functional form in both.
Equation (1) implies the following likelihood function (→ I/5.1.2):
$$p(y|\beta) = \mathcal{N}(y; X\beta, \Sigma) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \exp\left[ -\frac{1}{2} (y - X\beta)^T \Sigma^{-1} (y - X\beta) \right] . \quad (3)$$
Expanding the product in the exponent, we have:
$$p(y|\beta) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} \left( y^T \Sigma^{-1} y - y^T \Sigma^{-1} X\beta - \beta^T X^T \Sigma^{-1} y + \beta^T X^T \Sigma^{-1} X\beta \right) \right] . \quad (4)$$
Completing the square over β, one obtains
$$p(y|\beta) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} \left( (\beta - \tilde{X}y)^T X^T \Sigma^{-1} X (\beta - \tilde{X}y) - y^T Q y + y^T \Sigma^{-1} y \right) \right] \quad (5)$$
where $\tilde{X} = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1}$ and $Q = \tilde{X}^T (X^T \Sigma^{-1} X)\, \tilde{X}$.
$$p(y|\beta) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \exp\left[ -\frac{1}{2} \left( y^T \Sigma^{-1} y - y^T Q y \right) \right] \cdot \exp\left[ -\frac{1}{2} (\beta - \tilde{X}y)^T X^T \Sigma^{-1} X (\beta - \tilde{X}y) \right] . \quad (6)$$
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161, eq. 3.48; URL: https://www.springer.com/gp/book/9780387310732.
• Penny WD (2012): “Comparing Dynamic Causal Models using AIC, BIC and Free Energy”; in:
NeuroImage, vol. 59, iss. 2, pp. 319-330, eq. 9; URL: https://www.sciencedirect.com/science/
article/pii/S1053811911008160; DOI: 10.1016/j.neuroimage.2011.07.039.
y = Xβ + ε, ε ∼ N (0, Σ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X and known n × n covariance matrix Σ as well as unknown p × 1 regression coefficients β.
Moreover, assume a multivariate normal distribution (→ III/1.7.1) over the model parameter β:
$$p(\beta) = \mathcal{N}(\beta; \mu_0, \Sigma_0) . \quad (2)$$
Then, the posterior distribution (→ I/5.1.7) is also a multivariate normal distribution
$$p(\beta|y) = \mathcal{N}(\beta; \mu_n, \Sigma_n) \quad (3)$$
with the posterior hyperparameters
$$\begin{split} \mu_n &= \Sigma_n \left( X^T \Sigma^{-1} y + \Sigma_0^{-1} \mu_0 \right) \\ \Sigma_n &= \left( X^T \Sigma^{-1} X + \Sigma_0^{-1} \right)^{-1} . \end{split} \quad (4)$$
Proof: According to Bayes’ theorem (→ I/5.3.1), the posterior distribution (→ I/5.1.7) is given by
$$p(\beta|y) = \frac{p(y|\beta)\, p(\beta)}{p(y)} . \quad (5)$$
Since p(y) is just a normalization factor, the posterior is proportional (→ I/5.1.9) to the numerator:
$$p(y, \beta) = \sqrt{\frac{1}{(2\pi)^{n+p} |\Sigma| |\Sigma_0|}} \cdot \exp\left[ -\frac{1}{2} \left( (y - X\beta)^T \Sigma^{-1} (y - X\beta) + (\beta - \mu_0)^T \Sigma_0^{-1} (\beta - \mu_0) \right) \right] . \quad (9)$$
Expanding the products in the exponent gives:
$$\begin{split} p(y, \beta) = \sqrt{\frac{1}{(2\pi)^{n+p} |\Sigma| |\Sigma_0|}} \cdot \exp\Big[ -\frac{1}{2} \big( &y^T \Sigma^{-1} y - y^T \Sigma^{-1} X\beta - \beta^T X^T \Sigma^{-1} y + \beta^T X^T \Sigma^{-1} X\beta \\ &+ \beta^T \Sigma_0^{-1} \beta - \beta^T \Sigma_0^{-1} \mu_0 - \mu_0^T \Sigma_0^{-1} \beta + \mu_0^T \Sigma_0^{-1} \mu_0 \big) \Big] . \end{split} \quad (10)$$
$$p(y, \beta) = \sqrt{\frac{1}{(2\pi)^{n+p} |\Sigma| |\Sigma_0|}} \cdot \exp\left[ -\frac{1}{2} \left( \beta^T [X^T \Sigma^{-1} X + \Sigma_0^{-1}] \beta - 2 \beta^T [X^T \Sigma^{-1} y + \Sigma_0^{-1} \mu_0] + y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 \right) \right] . \quad (11)$$
Completing the square over β, one obtains
$$p(y, \beta) = \sqrt{\frac{1}{(2\pi)^{n+p} |\Sigma| |\Sigma_0|}} \cdot \exp\left[ -\frac{1}{2} \left( (\beta - \mu_n)^T \Sigma_n^{-1} (\beta - \mu_n) + \left( y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_n^T \Sigma_n^{-1} \mu_n \right) \right) \right] \quad (12)$$
where
$$\begin{split} \mu_n &= \Sigma_n \left( X^T \Sigma^{-1} y + \Sigma_0^{-1} \mu_0 \right) \\ \Sigma_n &= \left( X^T \Sigma^{-1} X + \Sigma_0^{-1} \right)^{-1} . \end{split} \quad (13)$$
Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161, eqs. 3.49-3.51, ex. 3.7; URL: https://www.springer.com/gp/book/9780387310732.
• Penny WD (2012): “Comparing Dynamic Causal Models using AIC, BIC and Free Energy”; in:
NeuroImage, vol. 59, iss. 2, pp. 319-330, eq. 27; URL: https://www.sciencedirect.com/science/
article/pii/S1053811911008160; DOI: 10.1016/j.neuroimage.2011.07.039.
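The posterior hyperparameters in (13) are straightforward to compute. A minimal sketch, assuming y, X, Σ and the prior hyperparameters µ0, Σ0 are given; the function name is hypothetical:

```python
import numpy as np

def blr_posterior(y, X, Sigma, mu0, Sigma0):
    """Posterior N(mu_n, Sigma_n) for beta in y = X beta + e, e ~ N(0, Sigma),
    under the prior beta ~ N(mu0, Sigma0), following eq. (13)."""
    P = np.linalg.inv(Sigma)
    Sigma_n = np.linalg.inv(X.T @ P @ X + np.linalg.inv(Sigma0))
    mu_n = Sigma_n @ (X.T @ P @ y + np.linalg.solve(Sigma0, mu0))
    return mu_n, Sigma_n
```

With a very flat prior (large Σ0), the posterior mean approaches the ordinary least squares estimate, as one would expect.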
m : y = Xβ + ε, ε ∼ N (0, Σ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X and known n × n covariance matrix Σ as well as unknown p × 1 regression coefficients β.
Moreover, assume a multivariate normal distribution (→ III/1.7.1) over the model parameter β:
$$p(\beta) = \mathcal{N}(\beta; \mu_0, \Sigma_0) . \quad (2)$$
Then, the log model evidence (→ I/5.1.11) for this model is
$$\begin{split} \log p(y|m) = &-\frac{1}{2} e_y^T \Sigma^{-1} e_y - \frac{1}{2} \log|\Sigma| - \frac{n}{2} \log(2\pi) \\ &-\frac{1}{2} e_\beta^T \Sigma_0^{-1} e_\beta - \frac{1}{2} \log|\Sigma_0| + \frac{1}{2} \log|\Sigma_n| \end{split} \quad (3)$$
1. UNIVARIATE NORMAL DATA 485
with the “prediction error” and “parameter error” terms
$$\begin{split} e_y &= y - X\mu_n \\ e_\beta &= \mu_0 - \mu_n \end{split} \quad (4)$$
and the posterior hyperparameters (→ III/1.7.2)
$$\begin{split} \mu_n &= \Sigma_n \left( X^T \Sigma^{-1} y + \Sigma_0^{-1} \mu_0 \right) \\ \Sigma_n &= \left( X^T \Sigma^{-1} X + \Sigma_0^{-1} \right)^{-1} . \end{split} \quad (5)$$
Proof: According to the law of marginal probability (→ I/1.3.3), the model evidence (→ I/5.1.11)
for this model is:
$$p(y|m) = \int p(y|\beta)\, p(\beta)\, d\beta . \quad (6)$$
According to the law of conditional probability (→ I/1.3.4), the integrand is equivalent to the joint
likelihood (→ I/5.1.5):
$$p(y|m) = \int p(y, \beta)\, d\beta . \quad (7)$$
When deriving the posterior distribution (→ III/1.7.2), the joint likelihood (→ I/5.1.5) was obtained as
$$p(y, \beta) = \sqrt{\frac{1}{(2\pi)^{n+p} |\Sigma| |\Sigma_0|}} \cdot \exp\left[ -\frac{1}{2} \left( (\beta - \mu_n)^T \Sigma_n^{-1} (\beta - \mu_n) + \left( y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_n^T \Sigma_n^{-1} \mu_n \right) \right) \right] . \quad (9)$$
Using the probability density function of the multivariate normal distribution (→ II/4.1.4), we can
rewrite this as
$$p(y, \beta) = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \sqrt{\frac{1}{(2\pi)^p |\Sigma_0|}} \cdot \sqrt{\frac{(2\pi)^p |\Sigma_n|}{1}} \cdot \mathcal{N}(\beta; \mu_n, \Sigma_n) \cdot \exp\left[ -\frac{1}{2} \left( y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_n^T \Sigma_n^{-1} \mu_n \right) \right] . \quad (10)$$
With that, β can be integrated out easily:
$$\int p(y, \beta)\, d\beta = \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \cdot \sqrt{\frac{|\Sigma_n|}{|\Sigma_0|}} \cdot \exp\left[ -\frac{1}{2} \left( y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_n^T \Sigma_n^{-1} \mu_n \right) \right] . \quad (11)$$
Now consider the remaining term in the exponent
$$y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_n^T \Sigma_n^{-1} \mu_n \quad (12)$$
and plug in the posterior covariance
$$\Sigma_n = \left( X^T \Sigma^{-1} X + \Sigma_0^{-1} \right)^{-1} . \quad (13)$$
This gives
$$\begin{split} &\; y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_n^T \Sigma_n^{-1} \mu_n \\ &= y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_n^T \left( X^T \Sigma^{-1} X + \Sigma_0^{-1} \right) \mu_n \\ &= y^T \Sigma^{-1} y + \mu_0^T \Sigma_0^{-1} \mu_0 - \mu_n^T X^T \Sigma^{-1} X \mu_n - \mu_n^T \Sigma_0^{-1} \mu_n \\ &= (y - X\mu_n)^T \Sigma^{-1} (y - X\mu_n) + (\mu_0 - \mu_n)^T \Sigma_0^{-1} (\mu_0 - \mu_n) \\ &\overset{(4)}{=} e_y^T \Sigma^{-1} e_y + e_\beta^T \Sigma_0^{-1} e_\beta . \end{split} \quad (14)$$
Thus, the log model evidence (→ I/5.1.11) of this model is
$$\begin{split} \log p(y|m) = &-\frac{1}{2} e_y^T \Sigma^{-1} e_y - \frac{1}{2} \log|\Sigma| - \frac{n}{2} \log(2\pi) \\ &-\frac{1}{2} e_\beta^T \Sigma_0^{-1} e_\beta - \frac{1}{2} \log|\Sigma_0| + \frac{1}{2} \log|\Sigma_n| . \end{split} \quad (16)$$
■
Sources:
• Penny WD (2012): “Comparing Dynamic Causal Models using AIC, BIC and Free Energy”; in:
NeuroImage, vol. 59, iss. 2, pp. 319-330, eqs. 19-23; URL: https://www.sciencedirect.com/science/
article/pii/S1053811911008160; DOI: 10.1016/j.neuroimage.2011.07.039.
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161; URL: https://www.springer.com/gp/book/9780387310732.
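Equation (16) can be checked against a direct evaluation of the marginal density, since under (1) and (2) the marginal of y is N(Xµ0, Σ + XΣ0X^T). A minimal sketch, with a hypothetical function name:

```python
import numpy as np

def blr_log_evidence(y, X, Sigma, mu0, Sigma0):
    """log p(y|m) for Bayesian linear regression with known covariance, eq. (16)."""
    n = y.shape[0]
    P = np.linalg.inv(Sigma)
    Sigma_n = np.linalg.inv(X.T @ P @ X + np.linalg.inv(Sigma0))
    mu_n = Sigma_n @ (X.T @ P @ y + np.linalg.solve(Sigma0, mu0))
    e_y = y - X @ mu_n              # prediction error, eq. (4)
    e_b = mu0 - mu_n                # parameter error, eq. (4)
    ld = lambda A: np.linalg.slogdet(A)[1]
    return (-0.5 * e_y @ P @ e_y - 0.5 * ld(Sigma) - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * e_b @ np.linalg.solve(Sigma0, e_b) - 0.5 * ld(Sigma0) + 0.5 * ld(Sigma_n))
```

Agreement with the marginal normal density provides a useful end-to-end check of the derivation.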
m : y = Xβ + ε, ε ∼ N (0, Σ) (1)
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X and known n × n covariance matrix Σ as well as unknown p × 1 regression coefficients β.
Moreover, assume a multivariate normal distribution (→ III/1.7.1) over the model parameter β:
$$p(\beta) = \mathcal{N}(\beta; \mu_0, \Sigma_0) . \quad (2)$$
Then, the accuracy and complexity of this model are
$$\begin{split} \mathrm{Acc}(m) &= -\frac{1}{2} e_y^T \Sigma^{-1} e_y - \frac{1}{2} \log|\Sigma| - \frac{n}{2} \log(2\pi) - \frac{1}{2}\, \mathrm{tr}\!\left( X^T \Sigma^{-1} X\, \Sigma_n \right) \\ \mathrm{Com}(m) &= \frac{1}{2} e_\beta^T \Sigma_0^{-1} e_\beta + \frac{1}{2} \log|\Sigma_0| - \frac{1}{2} \log|\Sigma_n| + \frac{1}{2}\, \mathrm{tr}\!\left( \Sigma_0^{-1} \Sigma_n \right) - \frac{p}{2} \end{split} \quad (3)$$
with the “prediction error” and “parameter error” terms
$$\begin{split} e_y &= y - X\mu_n \\ e_\beta &= \mu_0 - \mu_n \end{split} \quad (4)$$
and the posterior hyperparameters (→ III/1.7.2)
$$\begin{split} \mu_n &= \Sigma_n \left( X^T \Sigma^{-1} y + \Sigma_0^{-1} \mu_0 \right) \\ \Sigma_n &= \left( X^T \Sigma^{-1} X + \Sigma_0^{-1} \right)^{-1} . \end{split} \quad (5)$$
1) The accuracy term is the expectation (→ I/1.10.1) of the log-likelihood function (→ I/4.1.2)
log p(y|β) with respect to the posterior distribution (→ I/5.1.7) p(β|y):
$$\begin{split} \mathrm{Acc}(m) &= \left\langle \log\left( \sqrt{\frac{1}{(2\pi)^n |\Sigma|}} \exp\left[ -\frac{1}{2} (y - X\beta)^T \Sigma^{-1} (y - X\beta) \right] \right) \right\rangle_{p(\beta|y)} \\ &= -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2} \left\langle (y - X\beta)^T \Sigma^{-1} (y - X\beta) \right\rangle_{p(\beta|y)} \\ &= -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2} \left\langle y^T \Sigma^{-1} y - 2\, y^T \Sigma^{-1} X\beta + \beta^T X^T \Sigma^{-1} X \beta \right\rangle_{p(\beta|y)} . \end{split} \quad (8)$$
With the posterior distribution for Bayesian linear regression with known covariance (→ III/1.7.2),
this becomes:
$$\mathrm{Acc}(m) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| - \frac{1}{2} \left\langle y^T \Sigma^{-1} y - 2\, y^T \Sigma^{-1} X\beta + \beta^T X^T \Sigma^{-1} X \beta \right\rangle_{\mathcal{N}(\beta; \mu_n, \Sigma_n)} . \quad (9)$$
The expected value (→ I/1.10.1) of the multivariate normal distribution is
$$\langle x \rangle = \mu \quad (10)$$
and the expectation of a quadratic form is given by (→ I/1.10.9)
$$\left\langle x^T A x \right\rangle = \mu^T A \mu + \mathrm{tr}(A \Sigma) . \quad (11)$$
Thus, the model accuracy of m evaluates to
$$\begin{split} \mathrm{Acc}(m) &= -\frac{n}{2} \log(2\pi) - \frac{1}{2} \log|\Sigma| \\ &\quad - \frac{1}{2} \left( y^T \Sigma^{-1} y - 2\, y^T \Sigma^{-1} X\mu_n + \mu_n^T X^T \Sigma^{-1} X \mu_n + \mathrm{tr}\!\left( X^T \Sigma^{-1} X\, \Sigma_n \right) \right) \\ &= -\frac{1}{2} (y - X\mu_n)^T \Sigma^{-1} (y - X\mu_n) - \frac{1}{2} \log|\Sigma| - \frac{n}{2} \log(2\pi) - \frac{1}{2}\, \mathrm{tr}\!\left( X^T \Sigma^{-1} X\, \Sigma_n \right) \\ &\overset{(4)}{=} -\frac{1}{2} e_y^T \Sigma^{-1} e_y - \frac{1}{2} \log|\Sigma| - \frac{n}{2} \log(2\pi) - \frac{1}{2}\, \mathrm{tr}\!\left( X^T \Sigma^{-1} X\, \Sigma_n \right) \end{split} \quad (12)$$
which proves the first part of (3).
2) The complexity penalty is the Kullback-Leibler divergence (→ I/2.5.1) of the posterior distribution
(→ I/5.1.7) p(β|y) from the prior distribution (→ I/5.1.3) p(β):
$$\begin{split} \mathrm{Com}(m) &= \frac{1}{2} \left[ (\mu_0 - \mu_n)^T \Sigma_0^{-1} (\mu_0 - \mu_n) + \mathrm{tr}\!\left( \Sigma_0^{-1} \Sigma_n \right) - \log\frac{|\Sigma_n|}{|\Sigma_0|} - p \right] \\ &= \frac{1}{2} (\mu_0 - \mu_n)^T \Sigma_0^{-1} (\mu_0 - \mu_n) + \frac{1}{2} \log|\Sigma_0| - \frac{1}{2} \log|\Sigma_n| + \frac{1}{2}\, \mathrm{tr}\!\left( \Sigma_0^{-1} \Sigma_n \right) - \frac{p}{2} \\ &\overset{(4)}{=} \frac{1}{2} e_\beta^T \Sigma_0^{-1} e_\beta + \frac{1}{2} \log|\Sigma_0| - \frac{1}{2} \log|\Sigma_n| + \frac{1}{2}\, \mathrm{tr}\!\left( \Sigma_0^{-1} \Sigma_n \right) - \frac{p}{2} \end{split} \quad (16)$$
which proves the second part of (3).
Finally, it can be verified that Acc(m) − Com(m) equals the log model evidence (→ III/1.7.3)
$$\begin{split} \log p(y|m) = &-\frac{1}{2} e_y^T \Sigma^{-1} e_y - \frac{1}{2} \log|\Sigma| - \frac{n}{2} \log(2\pi) \\ &-\frac{1}{2} e_\beta^T \Sigma_0^{-1} e_\beta - \frac{1}{2} \log|\Sigma_0| + \frac{1}{2} \log|\Sigma_n| . \end{split} \quad (18)$$
This requires recognizing, based on (5), that
$$\begin{split} &- \frac{1}{2}\, \mathrm{tr}\!\left( X^T \Sigma^{-1} X\, \Sigma_n \right) - \frac{1}{2}\, \mathrm{tr}\!\left( \Sigma_0^{-1} \Sigma_n \right) + \frac{p}{2} \\ &= -\frac{1}{2}\, \mathrm{tr}\!\left( \left[ X^T \Sigma^{-1} X + \Sigma_0^{-1} \right] \Sigma_n \right) + \frac{p}{2} \\ &= -\frac{1}{2}\, \mathrm{tr}\!\left( \Sigma_n^{-1} \Sigma_n \right) + \frac{p}{2} \\ &= -\frac{1}{2}\, \mathrm{tr}\!\left( I_p \right) + \frac{p}{2} \\ &= -\frac{p}{2} + \frac{p}{2} \\ &= 0 . \end{split} \quad (19)$$
Sources:
• Penny WD (2012): “Comparing Dynamic Causal Models using AIC, BIC and Free Energy”; in:
NeuroImage, vol. 59, iss. 2, pp. 319-330, eqs. 20-21; URL: https://www.sciencedirect.com/science/
article/pii/S1053811911008160; DOI: 10.1016/j.neuroimage.2011.07.039.
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition for Machine Learning,
pp. 152-161; URL: https://www.springer.com/gp/book/9780387310732.
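The decomposition in (3) and the cancellation in (19) can be checked numerically: Acc(m) − Com(m) must equal the log model evidence, which in turn equals the marginal normal density N(y; Xµ0, Σ + XΣ0X^T). A minimal sketch with a hypothetical function name:

```python
import numpy as np

def accuracy_complexity(y, X, Sigma, mu0, Sigma0):
    """Acc(m) and Com(m) from eq. (3); their difference is the log model evidence."""
    n, p = X.shape
    P = np.linalg.inv(Sigma)
    Sigma_n = np.linalg.inv(X.T @ P @ X + np.linalg.inv(Sigma0))
    mu_n = Sigma_n @ (X.T @ P @ y + np.linalg.solve(Sigma0, mu0))
    e_y = y - X @ mu_n
    e_b = mu0 - mu_n
    ld = lambda A: np.linalg.slogdet(A)[1]
    acc = (-0.5 * e_y @ P @ e_y - 0.5 * ld(Sigma) - 0.5 * n * np.log(2 * np.pi)
           - 0.5 * np.trace(X.T @ P @ X @ Sigma_n))
    com = (0.5 * e_b @ np.linalg.solve(Sigma0, e_b) + 0.5 * ld(Sigma0) - 0.5 * ld(Sigma_n)
           + 0.5 * np.trace(np.linalg.solve(Sigma0, Sigma_n)) - 0.5 * p)
    return acc, com
```
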
Y = XB + E, E ∼ MN (0, V, Σ) (1)
is called a multivariate linear regression model or simply a “general linear model”.
• Y is called “data matrix”, “set of dependent variables” or “measurements”;
• X is called “design matrix”, “set of independent variables” or “predictors”;
• B are called “regression coefficients” or “weights”;
• E is called “noise matrix” or “error terms”;
• V is called “covariance across rows”;
• Σ is called “covariance across columns”;
• n is the number of observations;
• v is the number of measurements;
• p is the number of predictors.
When rows of Y correspond to units of time, e.g. subsequent measurements, V is called “temporal
covariance”. When columns of Y correspond to units of space, e.g. measurement channels, Σ is called
“spatial covariance”.
When the covariance matrix V is a scalar multiple of the n×n identity matrix, this is called a general
linear model with independent and identically distributed (i.i.d.) observations:
$$V = \lambda I_n \quad \Rightarrow \quad E \sim \mathcal{MN}(0, \lambda I_n, \Sigma) \quad \Rightarrow \quad \varepsilon_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \lambda\Sigma) . \quad (2)$$
Otherwise, it is called a general linear model with correlated observations.
Sources:
• Wikipedia (2020): “General linear model”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-21; URL: https://en.wikipedia.org/wiki/General_linear_model.
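The i.i.d. special case in (2) can be simulated directly, since E ~ MN(0, λI_n, Σ) means the n rows of E are independent draws from N(0, λΣ). A minimal sketch with assumed dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, v, lam = 20000, 3, 2.0
A = rng.standard_normal((v, v))
Sigma = A @ A.T + v * np.eye(v)      # column ("spatial") covariance, assumed for illustration
# E ~ MN(0, lam * I_n, Sigma): rows are i.i.d. draws from N(0, lam * Sigma)
L = np.linalg.cholesky(lam * Sigma)
E = rng.standard_normal((n, v)) @ L.T
S_hat = E.T @ E / n                  # sample estimate; should approximate lam * Sigma
```

With many rows, the sample covariance of the rows approaches λΣ, illustrating the equivalence in (2).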
Y = XB + E, E ∼ MN (0, σ 2 In , Σ) , (1)
the ordinary least squares (→ III/1.5.3) parameters estimates are given by
B̂ = (X T X)−1 X T Y . (2)
Proof: Let B̂ be the ordinary least squares (→ III/1.5.3) (OLS) solution and let Ê = Y − X B̂ be
the resulting matrix of residuals. According to the exogeneity assumption of OLS, the errors have
conditional mean (→ I/1.10.1) zero
E(E|X) = 0 , (3)
2. MULTIVARIATE NORMAL DATA 491
a direct consequence of which is that the regressors are uncorrelated with the errors
E(X T E) = 0 , (4)
which, in the finite sample, means that the residual matrix must be orthogonal to the design matrix:
X T Ê = 0 . (5)
From (5), the OLS formula can be directly derived:
$$\begin{split} X^T \hat{E} &= 0 \\ X^T \left( Y - X\hat{B} \right) &= 0 \\ X^T Y - X^T X \hat{B} &= 0 \\ X^T X \hat{B} &= X^T Y \\ \hat{B} &= (X^T X)^{-1} X^T Y . \end{split} \quad (6)$$
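The OLS formula (2) and the orthogonality constraint (5) can be checked numerically; a minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, v = 50, 3, 2
X = rng.standard_normal((n, p))
Y = X @ rng.standard_normal((p, v)) + 0.1 * rng.standard_normal((n, v))
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # eq. (2): (X^T X)^{-1} X^T Y
E_hat = Y - X @ B_hat                       # residual matrix
```

The residuals come out orthogonal to the design matrix, exactly as required by (5).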
Y = XB + E, E ∼ MN (0, V, Σ) , (1)
the weighted least squares (→ III/1.5.20) parameter estimates are given by
B̂ = (X T V −1 X)−1 X T V −1 Y . (2)
Proof: Consider an n × n whitening matrix W, such that
$$W V W^T = I_n . \quad (3)$$
Since V is a covariance matrix and thus symmetric, W is also symmetric and can be expressed as
the matrix square root of the inverse of V :
W W = V −1 ⇔ W = V −1/2 . (4)
Left-multiplying the linear regression equation (1) with W , the linear transformation theorem (→
II/5.1.9) implies that
W Y = W XB + W E, W E ∼ MN (0, W V W T , Σ) . (5)
Applying (3), we see that (5) is actually a general linear model (→ III/2.1.1) with independent observations in the whitened data Ỹ = W Y and whitened design matrix X̃ = W X, so that the ordinary least squares solution applies:
$$\begin{split} \hat{B} &= (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{Y} \\ &= \left[ (WX)^T WX \right]^{-1} (WX)^T WY \\ &= \left( X^T W^T W X \right)^{-1} X^T W^T W Y \\ &= \left( X^T W W X \right)^{-1} X^T W W Y \\ &\overset{(4)}{=} \left( X^T V^{-1} X \right)^{-1} X^T V^{-1} Y . \end{split} \quad (7)$$
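The equivalence between weighted least squares and OLS on whitened data can be verified numerically. A minimal sketch, with hypothetical helper names and W = V^{-1/2} computed via an eigendecomposition:

```python
import numpy as np

def wls(Y, X, V):
    """Weighted least squares, eq. (2): B_hat = (X^T V^{-1} X)^{-1} X^T V^{-1} Y."""
    Vi = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ Y)

def whiten(V):
    """Symmetric whitening matrix W = V^{-1/2}, eq. (4)."""
    w, U = np.linalg.eigh(V)
    return U @ np.diag(w ** -0.5) @ U.T
```

Running OLS on (WX, WY) reproduces the WLS estimates, mirroring the proof step by step.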
Y = XB + E, E ∼ MN (0, V, Σ) , (1)
maximum likelihood estimates (→ I/4.1.3) for the unknown parameters B and Σ are given by
$$\begin{split} \hat{B} &= (X^T V^{-1} X)^{-1} X^T V^{-1} Y \\ \hat{\Sigma} &= \frac{1}{n} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B}) . \end{split} \quad (2)$$
Proof: The log-likelihood function (→ I/4.1.2) for this model is
$$\begin{split} \mathrm{LL}(B, \Sigma) = &-\frac{nv}{2} \log(2\pi) - \frac{n}{2} \log|\Sigma| + \frac{v}{2} \log|P| \\ &-\frac{1}{2}\, \mathrm{tr}\left[ \Sigma^{-1} \left( Y^T P Y - Y^T P X B - B^T X^T P Y + B^T X^T P X B \right) \right] . \end{split} \quad (5)$$
The derivative of the log-likelihood with respect to B is
$$\begin{split} \frac{d\mathrm{LL}(B, \Sigma)}{dB} &= \frac{d}{dB} \left( -\frac{1}{2}\, \mathrm{tr}\left[ \Sigma^{-1} \left( Y^T P Y - Y^T P X B - B^T X^T P Y + B^T X^T P X B \right) \right] \right) \\ &= \frac{d}{dB} \left( -\frac{1}{2}\, \mathrm{tr}\left[ -2\, \Sigma^{-1} Y^T P X B \right] \right) + \frac{d}{dB} \left( -\frac{1}{2}\, \mathrm{tr}\left[ \Sigma^{-1} B^T X^T P X B \right] \right) \\ &= -\frac{1}{2} \left( -2\, X^T P Y \Sigma^{-1} \right) - \frac{1}{2} \left( X^T P X B \Sigma^{-1} + (X^T P X)^T B (\Sigma^{-1})^T \right) \\ &= X^T P Y \Sigma^{-1} - X^T P X B \Sigma^{-1} \end{split} \quad (6)$$
and setting this derivative to zero gives the MLE for B:
$$\begin{split} \frac{d\mathrm{LL}(\hat{B}, \Sigma)}{dB} &= 0 \\ 0 &= X^T P Y \Sigma^{-1} - X^T P X \hat{B} \Sigma^{-1} \\ 0 &= X^T P Y - X^T P X \hat{B} \\ X^T P X \hat{B} &= X^T P Y \\ \hat{B} &= \left( X^T P X \right)^{-1} X^T P Y \end{split} \quad (7)$$
The derivative of the log-likelihood with respect to Σ is
$$\begin{split} \frac{d\mathrm{LL}(\hat{B}, \Sigma)}{d\Sigma} &= \frac{d}{d\Sigma} \left( -\frac{n}{2} \log|\Sigma| - \frac{1}{2}\, \mathrm{tr}\left[ \Sigma^{-1} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B}) \right] \right) \\ &= -\frac{n}{2} \left( \Sigma^{-1} \right)^T + \frac{1}{2} \left( \Sigma^{-1} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B})\, \Sigma^{-1} \right)^T \\ &= -\frac{n}{2}\, \Sigma^{-1} + \frac{1}{2}\, \Sigma^{-1} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B})\, \Sigma^{-1} \end{split} \quad (8)$$
and setting this derivative to zero gives the MLE for Σ:
$$\begin{split} \frac{d\mathrm{LL}(\hat{B}, \hat{\Sigma})}{d\Sigma} &= 0 \\ 0 &= -\frac{n}{2}\, \hat{\Sigma}^{-1} + \frac{1}{2}\, \hat{\Sigma}^{-1} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B})\, \hat{\Sigma}^{-1} \\ \frac{n}{2}\, \hat{\Sigma}^{-1} &= \frac{1}{2}\, \hat{\Sigma}^{-1} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B})\, \hat{\Sigma}^{-1} \\ \hat{\Sigma}^{-1} &= \frac{1}{n}\, \hat{\Sigma}^{-1} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B})\, \hat{\Sigma}^{-1} \\ I_v &= \frac{1}{n} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B})\, \hat{\Sigma}^{-1} \\ \hat{\Sigma} &= \frac{1}{n} (Y - X\hat{B})^T V^{-1} (Y - X\hat{B}) \end{split} \quad (9)$$
Together, (7) and (9) constitute the MLE for the GLM.
■
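The ML estimates (2) are easy to compute; a minimal sketch, with a hypothetical function name:

```python
import numpy as np

def glm_mle(Y, X, V):
    """ML estimates for the GLM Y = XB + E, E ~ MN(0, V, Sigma), eq. (2)."""
    n = Y.shape[0]
    P = np.linalg.inv(V)
    B_hat = np.linalg.solve(X.T @ P @ X, X.T @ P @ Y)
    R = Y - X @ B_hat                 # residual matrix
    Sigma_hat = R.T @ P @ R / n       # weighted residual covariance
    return B_hat, Sigma_hat
```

For V = I_n, the coefficient estimate reduces to ordinary least squares, and the estimated noise covariance is symmetric positive semi-definite by construction.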
Y = XB + E, E ∼ MN (0, V, Σ) (1)
Y = Xt Γ + Et , Et ∼ MN (0, V, Σt ) (2)
and assume that Xt can be transformed into X using a transformation matrix T ∈ Rt×p
X = Xt T (3)
where p < t and X, Xt and T have full ranks rk(X) = p, rk(Xt ) = t and rk(T ) = p.
Then, a linear model (→ III/2.1.1) of the parameter estimates from (2), under the assumption of
(1), is called a transformed general linear model.
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to
the problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroIm-
age, vol. 209, art. 116449, Appendix A; URL: https://www.sciencedirect.com/science/article/pii/
S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
Y = Xt Γ + Et , Et ∼ MN (0, V, Σt ) (2)
and a matrix T transforming Xt into X:
X = Xt T . (3)
Then, the transformed general linear model (→ III/2.2.1) is given by
Γ̂ = T B + H, H ∼ MN (0, U, Σ) (4)
where the covariance across rows (→ II/5.1.1) is U = (XtT V −1 Xt )−1 .
Proof: The linear transformation theorem for the matrix-normal distribution (→ II/5.1.9) states:
Y ∼ MN (XB, V, Σ) (7)
Combining (6) with (7), the distribution of Γ̂ is
$$\begin{split} \hat{\Gamma} &\sim \mathcal{MN}\left( (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} X B,\; (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} V V^{-1} X_t (X_t^T V^{-1} X_t)^{-1},\; \Sigma \right) \\ &\sim \mathcal{MN}\left( (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} X_t\, T B,\; (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} X_t\, (X_t^T V^{-1} X_t)^{-1},\; \Sigma \right) \\ &\sim \mathcal{MN}\left( T B,\; (X_t^T V^{-1} X_t)^{-1},\; \Sigma \right) . \end{split} \quad (8)$$
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to the
problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroImage, vol.
209, art. 116449, Appendix A, Theorem 1; URL: https://www.sciencedirect.com/science/article/
pii/S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
and the transformed general linear model (→ III/2.2.1)
Γ̂ = T B + H, H ∼ MN (0, U, Σ) (2)
which are linked to each other (→ III/2.2.2) via
X = Xt T . (4)
Then, the parameter estimates for B from (1) and (2) are equivalent.
Proof: The weighted least squares parameter estimates (→ III/2.1.3) for (1) are given by
B̂ = (X T V −1 X)−1 X T V −1 Y (5)
and the weighted least squares parameter estimates (→ III/2.1.3) for (2) are given by
B̂ = (T T U −1 T )−1 T T U −1 Γ̂ . (6)
The covariance across rows for the transformed general linear model (→ III/2.2.2) is equal to
$$U = (X_t^T V^{-1} X_t)^{-1} , \quad (7)$$
such that combining (6), (7), (3) and (4) yields
$$\begin{split} \hat{B} &\overset{(6)}{=} (T^T U^{-1} T)^{-1}\, T^T U^{-1} \hat{\Gamma} \\ &\overset{(7)}{=} (T^T X_t^T V^{-1} X_t\, T)^{-1}\, T^T X_t^T V^{-1} X_t\, \hat{\Gamma} \\ &\overset{(4)}{=} (X^T V^{-1} X)^{-1}\, T^T X_t^T V^{-1} X_t\, \hat{\Gamma} \\ &\overset{(3)}{=} (X^T V^{-1} X)^{-1}\, T^T X_t^T V^{-1} X_t\, (X_t^T V^{-1} X_t)^{-1} X_t^T V^{-1} Y \\ &= (X^T V^{-1} X)^{-1}\, T^T X_t^T V^{-1} Y \\ &\overset{(4)}{=} (X^T V^{-1} X)^{-1} X^T V^{-1} Y \end{split} \quad (8)$$
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to the
problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroImage, vol.
209, art. 116449, Appendix A, Theorem 2; URL: https://www.sciencedirect.com/science/article/
pii/S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
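The equivalence of the two estimators can be verified numerically. A minimal sketch for the i.i.d. case V = I_n, with assumed dimensions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, t, p, v = 60, 8, 3, 2
Xt = rng.standard_normal((n, t))        # first-level design matrix
T = rng.standard_normal((t, p))         # transformation matrix, eq. (3)
X = Xt @ T                              # second-level design: X = Xt T
Y = rng.standard_normal((n, v))
# first-level parameter estimates and U^{-1} (V = I_n for simplicity)
G_hat = np.linalg.solve(Xt.T @ Xt, Xt.T @ Y)
U_inv = Xt.T @ Xt                       # since U = (Xt^T Xt)^{-1}
# WLS in the transformed GLM, eq. (6), vs. WLS in the original GLM, eq. (5)
B_tglm = np.linalg.solve(T.T @ U_inv @ T, T.T @ U_inv @ G_hat)
B_glm = np.linalg.solve(X.T @ X, X.T @ Y)
```

Both routes produce the same coefficient matrix, as the theorem asserts.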
Y = XB + E, E ∼ MN (0, V, Σ) . (1)
Then, a linear model (→ III/2.1.1) of X in terms of Y , under the assumption of (1), is called an
inverse general linear model.
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to
the problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroIm-
age, vol. 209, art. 116449, Appendix C; URL: https://www.sciencedirect.com/science/article/pii/
S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Y = XB + E, E ∼ MN (0, V, Σ) . (1)
Then, the inverse general linear model (→ III/2.3.1) of X ∈ Rn×p is given by
X = Y W + N, N ∼ MN (0, V, Σx ) (2)
where W ∈ Rv×p is a matrix, such that B W = Ip , and the covariance across columns (→ II/5.1.1)
is Σx = W T ΣW .
Proof: The linear transformation theorem for the matrix-normal distribution (→ II/5.1.9) states:
X = Y W + N, N ∼ MN (0, V, W T ΣW ) (6)
which is equivalent to (2).
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to the
problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroImage, vol.
209, art. 116449, Appendix C, Theorem 4; URL: https://www.sciencedirect.com/science/article/
pii/S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
implying the inverse general linear model (→ III/2.3.2) of X ∈ Rn×p
X = Y W + N, N ∼ MN (0, V, Σx ) . (2)
where
B W = Ip and Σx = W T ΣW . (3)
Then, the weighted least squares solution (→ III/2.1.3) for W is the best linear unbiased estimator
of W .
Proof: The linear transformation theorem for the matrix-normal distribution (→ II/5.1.9) states:
The weighted least squares parameter estimates (→ III/2.1.3) for (2) are given by
Ŵ = (Y T V −1 Y )−1 Y T V −1 X . (5)
The best linear unbiased estimator θ̂ of a certain quantity θ estimated from measured data y is 1) an
estimator resulting from a linear operation f (y), 2) whose expected value is equal to θ and 3) which
has, among those satisfying 1) and 2), the minimum variance (→ I/1.11.1).
Consider an arbitrary linear estimator W̃ = M X with a v × n matrix M. By the linear transformation theorem (→ II/5.1.9),
$$\tilde{W} = M X \sim \mathcal{MN}\left( M Y W,\; M V M^T,\; \Sigma_x \right) , \quad (6)$$
so that unbiasedness requires (→ II/5.1.4) that M Y = I_v. This is fulfilled by any matrix
M = (Y T V −1 Y )−1 Y T V −1 + D (7)
where D is a v × n matrix which satisfies DY = 0.
3) Third, the best linear unbiased estimator is the one with minimum variance (→ I/1.11.1), i.e. the
one that minimizes the expected Frobenius norm
$$\mathrm{Var}\left[ \tilde{W} \right] = \left\langle \mathrm{tr}\left[ (\tilde{W} - W)^T (\tilde{W} - W) \right] \right\rangle . \quad (8)$$
$$\begin{split} \mathrm{Var}\left[ \tilde{W}(M) \right] &\overset{(8)}{=} \left\langle \mathrm{tr}\left[ (\tilde{W} - W)^T (\tilde{W} - W) \right] \right\rangle \\ &= \left\langle \mathrm{tr}\left[ (\tilde{W} - W) (\tilde{W} - W)^T \right] \right\rangle \\ &= \mathrm{tr}\left[ \left\langle (\tilde{W} - W) (\tilde{W} - W)^T \right\rangle \right] \\ &\overset{(10)}{=} \mathrm{tr}\left[ \mathrm{tr}(\Sigma_x)\, M V M^T \right] \\ &= \mathrm{tr}(\Sigma_x)\, \mathrm{tr}\!\left( M V M^T \right) . \end{split} \quad (11)$$
$$\begin{split} \mathrm{Var}\left[ \tilde{W}(D) \right] &\overset{(7)}{=} \mathrm{tr}(\Sigma_x)\, \mathrm{tr}\left[ \left( (Y^T V^{-1} Y)^{-1} Y^T V^{-1} + D \right) V \left( (Y^T V^{-1} Y)^{-1} Y^T V^{-1} + D \right)^T \right] \\ &= \mathrm{tr}(\Sigma_x)\, \mathrm{tr}\Big[ (Y^T V^{-1} Y)^{-1} Y^T V^{-1} V V^{-1} Y (Y^T V^{-1} Y)^{-1} + (Y^T V^{-1} Y)^{-1} Y^T V^{-1} V D^T \\ &\qquad\qquad + D V V^{-1} Y (Y^T V^{-1} Y)^{-1} + D V D^T \Big] \\ &= \mathrm{tr}(\Sigma_x) \left( \mathrm{tr}\left[ (Y^T V^{-1} Y)^{-1} \right] + \mathrm{tr}\left[ D V D^T \right] \right) . \end{split} \quad (12)$$
Since D V D^T is a positive semi-definite matrix, all its eigenvalues are non-negative. Because the trace of a square matrix is the sum of its eigenvalues, the minimum variance is achieved by D = 0, thus producing Ŵ as in (5).
Sources:
• Soch J, Allefeld C, Haynes JD (2020): “Inverse transformed encoding models – a solution to the
problem of correlated trial-by-trial parameter estimates in fMRI decoding”; in: NeuroImage, vol.
209, art. 116449, Appendix C, Theorem 5; URL: https://www.sciencedirect.com/science/article/
pii/S1053811919310407; DOI: 10.1016/j.neuroimage.2019.116449.
$$\hat{X} = Y W . \quad (1)$$
Given that the columns of X̂ are linearly independent, the linear model Y = X̂ A^T + E, subject to the constraint X̂^T E = 0, is called the corresponding forward model.
Sources:
• Haufe S, Meinecke F, Görgen K, Dähne S, Haynes JD, Blankertz B, Bießmann F (2014): “On
the interpretation of weight vectors of linear models in multivariate neuroimaging”; in: Neu-
roImage, vol. 87, pp. 96–110, eq. 3; URL: https://www.sciencedirect.com/science/article/pii/
S1053811913010914; DOI: 10.1016/j.neuroimage.2013.10.067.
X̂ = Y W . (1)
Then, the parameter matrix of the corresponding forward model (→ III/2.3.4) is equal to
$$A = \Sigma_y W \Sigma_x^{-1} \quad (2)$$
where
$$\begin{split} \Sigma_x &= \hat{X}^T \hat{X} \\ \Sigma_y &= Y^T Y . \end{split} \quad (3)$$
Proof: The corresponding forward model (→ III/2.3.4) is given by
$$Y = \hat{X} A^T + E , \quad (4)$$
subject to the constraint that predicted X and errors E are uncorrelated:
$$\hat{X}^T E = 0 . \quad (5)$$
With that, we can directly derive the parameter matrix A:
$$\begin{split} Y &\overset{(4)}{=} \hat{X} A^T + E \\ \hat{X} A^T &= Y - E \\ \hat{X}^T \hat{X} A^T &= \hat{X}^T (Y - E) \\ \hat{X}^T \hat{X} A^T &= \hat{X}^T Y - \hat{X}^T E \\ \hat{X}^T \hat{X} A^T &\overset{(5)}{=} \hat{X}^T Y \\ \hat{X}^T \hat{X} A^T &\overset{(1)}{=} W^T Y^T Y \\ \Sigma_x A^T &\overset{(3)}{=} W^T \Sigma_y \\ A^T &= \Sigma_x^{-1} W^T \Sigma_y \\ A &= \Sigma_y W \Sigma_x^{-1} . \end{split} \quad (6)$$
■
Sources:
• Haufe S, Meinecke F, Görgen K, Dähne S, Haynes JD, Blankertz B, Bießmann F (2014): “On
the interpretation of weight vectors of linear models in multivariate neuroimaging”; in: NeuroIm-
age, vol. 87, pp. 96–110, Theorem 1; URL: https://www.sciencedirect.com/science/article/pii/
S1053811913010914; DOI: 10.1016/j.neuroimage.2013.10.067.
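The closed form (2) and the orthogonality property (5) can be checked numerically with simulated data; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, v, p = 30, 5, 2
Y = rng.standard_normal((n, v))
W = rng.standard_normal((v, p))         # backward-model weights
X_hat = Y @ W                           # eq. (1)
Sigma_x = X_hat.T @ X_hat               # eq. (3)
Sigma_y = Y.T @ Y                       # eq. (3)
A = Sigma_y @ W @ np.linalg.inv(Sigma_x)  # eq. (2): forward-model parameters
E = Y - X_hat @ A.T                     # forward-model errors, eq. (4)
```

The errors come out orthogonal to the predictions, matching the constraint in (5).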
X̂ = Y W . (1)
Then, there exists a corresponding forward model (→ III/2.3.4) and the parameters of this forward model (→ III/2.3.5) are equal to
$$A = \Sigma_y W \Sigma_x^{-1} \quad \text{where} \quad \Sigma_x = \hat{X}^T \hat{X} \quad \text{and} \quad \Sigma_y = Y^T Y . \quad (3)$$
1) Because the columns of X̂ are assumed to be linearly independent by definition of the corresponding
forward model (→ III/2.3.4), the matrix Σx = X̂ T X̂ is invertible, such that A in (3) is well-defined.
2) Moreover, the solution for the matrix A satisfies the constraint of the corresponding forward model
(→ III/2.3.4) for predicted X and errors E to be uncorrelated which can be shown as follows:
$$\begin{split} \hat{X}^T E &\overset{(2)}{=} \hat{X}^T \left( Y - \hat{X} A^T \right) \\ &\overset{(3)}{=} \hat{X}^T Y - \hat{X}^T \hat{X}\, \Sigma_x^{-1} W^T \Sigma_y \\ &\overset{(3)}{=} \hat{X}^T Y - \hat{X}^T \hat{X} \left( \hat{X}^T \hat{X} \right)^{-1} W^T Y^T Y \\ &\overset{(1)}{=} (Y W)^T Y - W^T Y^T Y \\ &= W^T Y^T Y - W^T Y^T Y \\ &= 0 . \end{split} \quad (4)$$
Sources:
• Haufe S, Meinecke F, Görgen K, Dähne S, Haynes JD, Blankertz B, Bießmann F (2014): “On
the interpretation of weight vectors of linear models in multivariate neuroimaging”; in: NeuroIm-
age, vol. 87, pp. 96–110, Appendix B; URL: https://www.sciencedirect.com/science/article/pii/
S1053811913010914; DOI: 10.1016/j.neuroimage.2013.10.067.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
be a general linear model (→ III/2.1.1) with measured n × v data matrix Y , known n × p design
matrix X, known n × n covariance structure (→ II/5.1.1) V as well as unknown p × v regression
coefficients B and unknown v × v noise covariance (→ II/5.1.1) Σ.
Then, the conjugate prior (→ I/5.2.5) for this model is a normal-Wishart distribution (→ II/5.3.1):
$$p(B, T) = \mathcal{MN}\left( B; M_0, \Lambda_0^{-1}, T^{-1} \right) \cdot \mathcal{W}\left( T; \Omega_0^{-1}, \nu_0 \right) . \quad (2)$$
Proof: By definition, a conjugate prior (→ I/5.2.5) is a prior distribution (→ I/5.1.3) that, when
combined with the likelihood function (→ I/5.1.2), leads to a posterior distribution (→ I/5.1.7) that
belongs to the same family of probability distributions (→ I/1.5.1). This is fulfilled when the prior
density and the likelihood function are proportional to the model parameters in the same way, i.e.
the model parameters appear in the same functional form in both.
Equation (1) implies the following likelihood function (→ I/5.1.2)
$$p(Y|B, \Sigma) = \mathcal{MN}(Y; XB, V, \Sigma) = \sqrt{\frac{1}{(2\pi)^{nv} |\Sigma|^n |V|^v}} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Sigma^{-1} (Y - XB)^T V^{-1} (Y - XB) \right) \right] \quad (3)$$
which, for mathematical convenience, can also be parametrized as
$$p(Y|B, T) = \mathcal{MN}(Y; XB, P^{-1}, T^{-1}) = \sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T (Y - XB)^T P (Y - XB) \right) \right] \quad (4)$$
using the v × v precision matrix (→ I/1.13.19) T = Σ−1 and the n × n precision matrix (→ I/1.13.19)
P = V −1 .
Expanding the product in the exponent, we have:
$$p(Y|B, T) = \sqrt{\frac{|P|^v}{(2\pi)^{nv}}} \cdot |T|^{n/2} \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \left[ Y^T P Y - Y^T P X B - B^T X^T P Y + B^T X^T P X B \right] \right) \right] . \quad (6)$$
Completing the square over B finally gives
$$p(Y|B, T) = \sqrt{\frac{|P|^v}{(2\pi)^{nv}}} \cdot |T|^{n/2} \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \left[ (B - \tilde{X}Y)^T X^T P X (B - \tilde{X}Y) - Y^T Q Y + Y^T P Y \right] \right) \right] \quad (7)$$
where $\tilde{X} = (X^T P X)^{-1} X^T P$ and $Q = \tilde{X}^T (X^T P X)\, \tilde{X}$.
In other words, the likelihood function (→ I/5.1.2) is proportional to a power of the determinant of
T , times an exponential of the trace of T and an exponential of the trace of a squared form of B,
weighted by T :
$$p(Y|B, T) \propto |T|^{n/2} \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \left[ Y^T P Y - Y^T Q Y \right] \right) \right] \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T (B - \tilde{X}Y)^T X^T P X (B - \tilde{X}Y) \right) \right] . \quad (8)$$
The same is true for a normal-Wishart distribution (→ II/5.3.1) over B and T
$$p(B, T) = \sqrt{\frac{|T|^p |\Lambda_0|^v}{(2\pi)^{pv}}} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T (B - M_0)^T \Lambda_0 (B - M_0) \right) \right] \cdot \frac{1}{\Gamma_v\!\left( \frac{\nu_0}{2} \right)} \sqrt{\frac{|\Omega_0|^{\nu_0}}{2^{\nu_0 v}}}\, |T|^{(\nu_0 - v - 1)/2} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Omega_0 T \right) \right] \quad (10)$$
exhibits the same proportionality
$$p(B, T) \propto |T|^{(\nu_0 + p - v - 1)/2} \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \Omega_0 \right) \right] \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T (B - M_0)^T \Lambda_0 (B - M_0) \right) \right] . \quad (11)$$
Sources:
• Wikipedia (2020): “Bayesian multivariate linear regression”; in: Wikipedia, the free encyclopedia,
retrieved on 2020-09-03; URL: https://en.wikipedia.org/wiki/Bayesian_multivariate_linear_regression#
Conjugate_prior_distribution.
Y = XB + E, E ∼ MN (0, V, Σ) (1)
be a general linear model (→ III/2.1.1) with measured n×v data matrix Y , known n×p design matrix
X, known n × n covariance structure (→ II/5.1.1) V as well as unknown p × v regression coefficients
B and unknown v × v noise covariance (→ II/5.1.1) Σ. Moreover, assume a normal-Wishart prior
distribution (→ III/2.4.1) over the model parameters B and T = Σ⁻¹:
$$p(B, T) = \mathcal{MN}\left( B; M_0, \Lambda_0^{-1}, T^{-1} \right) \cdot \mathcal{W}\left( T; \Omega_0^{-1}, \nu_0 \right) . \quad (2)$$
Then, the posterior distribution (→ I/5.1.7) is also a normal-Wishart distribution (→ II/5.3.1)
$$p(B, T|Y) = \mathcal{MN}\left( B; M_n, \Lambda_n^{-1}, T^{-1} \right) \cdot \mathcal{W}\left( T; \Omega_n^{-1}, \nu_n \right) \quad (3)$$
with the posterior hyperparameters
$$\begin{split} M_n &= \Lambda_n^{-1} (X^T P Y + \Lambda_0 M_0) \\ \Lambda_n &= X^T P X + \Lambda_0 \\ \Omega_n &= \Omega_0 + Y^T P Y + M_0^T \Lambda_0 M_0 - M_n^T \Lambda_n M_n \\ \nu_n &= \nu_0 + n . \end{split} \quad (4)$$
Proof: According to Bayes' theorem (→ I/5.3.1), the posterior distribution (→ I/5.1.7) is given by
$$p(B, T|Y) = \frac{p(Y|B, T)\, p(B, T)}{p(Y)} . \quad (5)$$
Since p(Y) is just a normalization factor, the posterior is proportional (→ I/5.1.9) to the numerator:
$$p(B, T|Y) \propto p(Y|B, T)\, p(B, T) = p(Y, B, T) . \quad (6)$$
Equation (1) implies the following likelihood function (→ I/5.1.2)
$$p(Y|B, \Sigma) = \mathcal{MN}(Y; XB, V, \Sigma) = \sqrt{\frac{1}{(2\pi)^{nv} |\Sigma|^n |V|^v}} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Sigma^{-1} (Y - XB)^T V^{-1} (Y - XB) \right) \right] \quad (7)$$
which, for mathematical convenience, can also be parametrized as
$$p(Y|B, T) = \mathcal{MN}(Y; XB, P^{-1}, T^{-1}) = \sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T (Y - XB)^T P (Y - XB) \right) \right] \quad (8)$$
using the v × v precision matrix (→ I/1.13.19) T = Σ−1 and the n × n precision matrix (→ I/1.13.19)
P = V −1 .
Combining the likelihood function (→ I/5.1.2) (8) with the prior distribution (→ I/5.1.3) (2), the
joint likelihood (→ I/5.1.5) of the model is given by
$$\begin{split} p(Y, B, T) = &\sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \sqrt{\frac{|T|^p |\Lambda_0|^v}{(2\pi)^{pv}}} \frac{1}{\Gamma_v\!\left( \frac{\nu_0}{2} \right)} \sqrt{\frac{|\Omega_0|^{\nu_0}}{2^{\nu_0 v}}}\, |T|^{(\nu_0 - v - 1)/2} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Omega_0 T \right) \right] \cdot \\ &\exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \left[ (Y - XB)^T P (Y - XB) + (B - M_0)^T \Lambda_0 (B - M_0) \right] \right) \right] . \end{split} \quad (10)$$
Expanding the products in the exponent gives:
$$\begin{split} p(Y, B, T) = &\sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \sqrt{\frac{|T|^p |\Lambda_0|^v}{(2\pi)^{pv}}} \frac{1}{\Gamma_v\!\left( \frac{\nu_0}{2} \right)} \sqrt{\frac{|\Omega_0|^{\nu_0}}{2^{\nu_0 v}}}\, |T|^{(\nu_0 - v - 1)/2} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Omega_0 T \right) \right] \cdot \\ &\exp\Big[ -\frac{1}{2}\, \mathrm{tr}\Big( T \Big[ Y^T P Y - Y^T P X B - B^T X^T P Y + B^T X^T P X B \\ &\qquad\qquad + B^T \Lambda_0 B - B^T \Lambda_0 M_0 - M_0^T \Lambda_0 B + M_0^T \Lambda_0 M_0 \Big] \Big) \Big] . \end{split} \quad (11)$$
Completing the square over B, we obtain
$$\begin{split} p(Y, B, T) = &\sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \sqrt{\frac{|T|^p |\Lambda_0|^v}{(2\pi)^{pv}}} \frac{1}{\Gamma_v\!\left( \frac{\nu_0}{2} \right)} \sqrt{\frac{|\Omega_0|^{\nu_0}}{2^{\nu_0 v}}}\, |T|^{(\nu_0 - v - 1)/2} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Omega_0 T \right) \right] \cdot \\ &\exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \left[ (B - M_n)^T \Lambda_n (B - M_n) + \left( Y^T P Y + M_0^T \Lambda_0 M_0 - M_n^T \Lambda_n M_n \right) \right] \right) \right] \end{split} \quad (12)$$
where
$$\begin{split} M_n &= \Lambda_n^{-1} (X^T P Y + \Lambda_0 M_0) \\ \Lambda_n &= X^T P X + \Lambda_0 . \end{split} \quad (13)$$
Ergo, the joint likelihood is proportional to
$$p(Y, B, T) \propto |T|^{p/2} \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T (B - M_n)^T \Lambda_n (B - M_n) \right) \right] \cdot |T|^{(\nu_n - v - 1)/2} \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Omega_n T \right) \right] \quad (14)$$
with the posterior hyperparameters
$$\begin{split} \Omega_n &= \Omega_0 + Y^T P Y + M_0^T \Lambda_0 M_0 - M_n^T \Lambda_n M_n \\ \nu_n &= \nu_0 + n . \end{split} \quad (15)$$
From the term in (14), we can isolate the posterior distribution over B given T :
Sources:
• Wikipedia (2020): “Bayesian multivariate linear regression”; in: Wikipedia, the free encyclopedia,
retrieved on 2020-09-03; URL: https://en.wikipedia.org/wiki/Bayesian_multivariate_linear_regression#
Posterior_distribution.
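The update equations (4) are direct to implement; a minimal sketch with a hypothetical function name, taking the prior hyperparameters M0, Λ0, Ω0, ν0 and the precision matrix P = V⁻¹ as given:

```python
import numpy as np

def glm_nw_posterior(Y, X, P, M0, L0, O0, nu0):
    """Posterior hyperparameters of eq. (4) for the GLM with a
    normal-Wishart prior NW(M0, Lambda0, Omega0, nu0)."""
    Ln = X.T @ P @ X + L0
    Mn = np.linalg.solve(Ln, X.T @ P @ Y + L0 @ M0)
    On = O0 + Y.T @ P @ Y + M0.T @ L0 @ M0 - Mn.T @ Ln @ Mn
    nun = nu0 + Y.shape[0]
    return Mn, Ln, On, nun
```

With a near-flat prior (Λ0 ≈ 0), the posterior mean M_n reduces to the weighted least squares estimate, and Ω_n stays symmetric positive definite.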
Y = XB + E, E ∼ MN (0, V, Σ) (1)
be a general linear model (→ III/2.1.1) with measured n×v data matrix Y , known n×p design matrix
X, known n × n covariance structure (→ II/5.1.1) V as well as unknown p × v regression coefficients
B and unknown v × v noise covariance (→ II/5.1.1) Σ. Moreover, assume a normal-Wishart prior distribution (→ III/2.4.1) over the model parameters B and T = Σ⁻¹. Then, the log model evidence (→ I/5.1.11) for this model is
$$\begin{split} \log p(Y|m) = &\;\frac{v}{2} \log|P| - \frac{nv}{2} \log(2\pi) + \frac{v}{2} \log|\Lambda_0| - \frac{v}{2} \log|\Lambda_n| \\ &+ \frac{\nu_0}{2} \log\left| \tfrac{1}{2} \Omega_0 \right| - \frac{\nu_n}{2} \log\left| \tfrac{1}{2} \Omega_n \right| + \log \Gamma_v\!\left( \tfrac{\nu_n}{2} \right) - \log \Gamma_v\!\left( \tfrac{\nu_0}{2} \right) \end{split} \quad (3)$$
with the posterior hyperparameters (→ III/2.4.2)
$$\begin{split} M_n &= \Lambda_n^{-1} (X^T P Y + \Lambda_0 M_0) \\ \Lambda_n &= X^T P X + \Lambda_0 \\ \Omega_n &= \Omega_0 + Y^T P Y + M_0^T \Lambda_0 M_0 - M_n^T \Lambda_n M_n \\ \nu_n &= \nu_0 + n . \end{split} \quad (4)$$
Proof: According to the law of marginal probability (→ I/1.3.3), the model evidence (→ I/5.1.11)
for this model is:
$$p(Y|m) = \iint p(Y|B, T)\, p(B, T)\, dB\, dT . \quad (5)$$
According to the law of conditional probability (→ I/1.3.4), the integrand is equivalent to the joint
likelihood (→ I/5.1.5):
$$p(Y|m) = \iint p(Y, B, T)\, dB\, dT . \quad (6)$$
Equation (1) implies the following likelihood function (→ I/5.1.2)
$$p(Y|B, \Sigma) = \mathcal{MN}(Y; XB, V, \Sigma) = \sqrt{\frac{1}{(2\pi)^{nv} |\Sigma|^n |V|^v}} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Sigma^{-1} (Y - XB)^T V^{-1} (Y - XB) \right) \right] \quad (7)$$
which, for mathematical convenience, can also be parametrized as
$$p(Y|B, T) = \mathcal{MN}(Y; XB, P^{-1}, T^{-1}) = \sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T (Y - XB)^T P (Y - XB) \right) \right] \quad (8)$$
using the v × v precision matrix (→ I/1.13.19) T = Σ−1 and the n × n precision matrix (→ I/1.13.19)
P = V −1 .
When deriving the posterior distribution (→ III/2.4.2) p(B, T |Y ), the joint likelihood p(Y, B, T ) is
obtained as
$$\begin{split} p(Y, B, T) = &\sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \sqrt{\frac{|T|^p |\Lambda_0|^v}{(2\pi)^{pv}}} \frac{1}{\Gamma_v\!\left( \frac{\nu_0}{2} \right)} \sqrt{\frac{|\Omega_0|^{\nu_0}}{2^{\nu_0 v}}}\, |T|^{(\nu_0 - v - 1)/2} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Omega_0 T \right) \right] \cdot \\ &\exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \left[ (B - M_n)^T \Lambda_n (B - M_n) + \left( Y^T P Y + M_0^T \Lambda_0 M_0 - M_n^T \Lambda_n M_n \right) \right] \right) \right] . \end{split} \quad (9)$$
Using the probability density function of the matrix-normal distribution (→ II/5.1.3), we can rewrite
this as
$$\begin{split} p(Y, B, T) = &\sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \sqrt{\frac{(2\pi)^{pv}}{|T|^p |\Lambda_n|^v}} \sqrt{\frac{|T|^p |\Lambda_0|^v}{(2\pi)^{pv}}} \frac{1}{\Gamma_v\!\left( \frac{\nu_0}{2} \right)} \sqrt{\frac{|\Omega_0|^{\nu_0}}{2^{\nu_0 v}}}\, |T|^{(\nu_0 - v - 1)/2} \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( \Omega_0 T \right) \right] \cdot \\ &\mathcal{MN}\left( B; M_n, \Lambda_n^{-1}, T^{-1} \right) \cdot \exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \left[ Y^T P Y + M_0^T \Lambda_0 M_0 - M_n^T \Lambda_n M_n \right] \right) \right] . \end{split} \quad (10)$$
Now, B can be integrated out easily:
$$\begin{split} \int p(Y, B, T)\, dB = &\sqrt{\frac{|T|^n |P|^v}{(2\pi)^{nv}}} \sqrt{\frac{|\Lambda_0|^v}{|\Lambda_n|^v}} \frac{1}{\Gamma_v\!\left( \frac{\nu_0}{2} \right)} \sqrt{\frac{|\Omega_0|^{\nu_0}}{2^{\nu_0 v}}}\, |T|^{(\nu_0 - v - 1)/2} \cdot \\ &\exp\left[ -\frac{1}{2}\, \mathrm{tr}\left( T \left[ \Omega_0 + Y^T P Y + M_0^T \Lambda_0 M_0 - M_n^T \Lambda_n M_n \right] \right) \right] . \end{split} \quad (11)$$
Using the probability density function of the Wishart distribution, we can rewrite this as
$$\int p(Y, B, T)\, dB = \sqrt{\frac{|P|^v}{(2\pi)^{nv}}} \sqrt{\frac{|\Lambda_0|^v}{|\Lambda_n|^v}} \sqrt{\frac{|\Omega_0|^{\nu_0}}{2^{\nu_0 v}}} \sqrt{\frac{2^{\nu_n v}\, \Gamma_v\!\left( \frac{\nu_n}{2} \right)^2}{|\Omega_n|^{\nu_n}\, \Gamma_v\!\left( \frac{\nu_0}{2} \right)^2}} \cdot \mathcal{W}\left( T; \Omega_n^{-1}, \nu_n \right) . \quad (12)$$
Finally, T can also be integrated out:
$$\iint p(Y, B, T)\, dB\, dT = \sqrt{\frac{|P|^v}{(2\pi)^{nv}}} \sqrt{\frac{|\Lambda_0|^v}{|\Lambda_n|^v}} \cdot \frac{\left| \frac{1}{2} \Omega_0 \right|^{\frac{\nu_0}{2}}}{\left| \frac{1}{2} \Omega_n \right|^{\frac{\nu_n}{2}}} \cdot \frac{\Gamma_v\!\left( \frac{\nu_n}{2} \right)}{\Gamma_v\!\left( \frac{\nu_0}{2} \right)} = p(Y|m) . \quad (13)$$
Thus, the log model evidence (→ I/5.1.11) of this model is
$$\begin{split} \log p(Y|m) = &\;\frac{v}{2} \log|P| - \frac{nv}{2} \log(2\pi) + \frac{v}{2} \log|\Lambda_0| - \frac{v}{2} \log|\Lambda_n| \\ &+ \frac{\nu_0}{2} \log\left| \tfrac{1}{2} \Omega_0 \right| - \frac{\nu_n}{2} \log\left| \tfrac{1}{2} \Omega_n \right| + \log \Gamma_v\!\left( \tfrac{\nu_n}{2} \right) - \log \Gamma_v\!\left( \tfrac{\nu_0}{2} \right) . \end{split} \quad (14)$$
■
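Equation (14) can be implemented with `scipy.special.multigammaln` serving as log Γ_v. A minimal sketch with a hypothetical function name; in the scalar case (n = p = v = 1) the result can be cross-checked by brute-force numerical integration of (6):

```python
import numpy as np
from scipy.special import multigammaln

def glm_log_evidence(Y, X, P, M0, L0, O0, nu0):
    """log p(Y|m) for the GLM with normal-Wishart prior, eq. (14)."""
    n, v = Y.shape
    Ln = X.T @ P @ X + L0
    Mn = np.linalg.solve(Ln, X.T @ P @ Y + L0 @ M0)
    On = O0 + Y.T @ P @ Y + M0.T @ L0 @ M0 - Mn.T @ Ln @ Mn
    nun = nu0 + n
    sld = lambda A: np.linalg.slogdet(A)[1]    # log-determinant
    return (0.5 * v * sld(P) - 0.5 * n * v * np.log(2 * np.pi)
            + 0.5 * v * sld(L0) - 0.5 * v * sld(Ln)
            + 0.5 * nu0 * sld(0.5 * O0) - 0.5 * nun * sld(0.5 * On)
            + multigammaln(0.5 * nun, v) - multigammaln(0.5 * nu0, v))
```

For v = 1, the normal-Wishart prior reduces to a normal-gamma prior with a0 = ν0/2 and b0 = Ω0/2, which is what the integration check below exploits.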
3 Count data
3.1 Binomial observations
3.1.1 Definition
Definition: An ordered pair (n, y) with n ∈ N and y ∈ N0 , where y is the number of successes in n
trials, constitutes a set of binomial observations.
y ∼ Bin(n, p) . (1)
Then, the null hypothesis (→ I/4.3.2)
H0 : p = p0 (2)
is rejected (→ I/4.3.1) at significance level (→ I/4.3.8) α, if
y ≤ c1 or y ≥ c2 (3)
where c1 is the largest integer value, such that
$$\sum_{x=0}^{c_1} \mathrm{Bin}(x; n, p_0) \leq \frac{\alpha}{2} , \quad (4)$$
and c2 is the smallest integer value, such that
$$\sum_{x=c_2}^{n} \mathrm{Bin}(x; n, p_0) \leq \frac{\alpha}{2} , \quad (5)$$
where Bin(x; n, p) is the probability mass function of the binomial distribution (→ II/1.3.2):
$$\mathrm{Bin}(x; n, p) = \binom{n}{x} p^x (1 - p)^{n - x} . \quad (6)$$
Proof: The alternative hypothesis (→ I/4.3.3) relative to H0 for a two-sided test (→ I/4.3.4) is
H1 : p ̸= p0 . (7)
We can use y as a test statistic (→ I/4.3.5). Its sampling distribution (→ I/1.5.5) is given by (1). The
cumulative distribution function (→ I/1.8.1) (CDF) of the test statistic under the null hypothesis is
thus equal to the cumulative distribution function of a binomial distribution with success probability
(→ II/1.3.1) p0 :
\mathrm{Pr}(y \leq z \mid H_0) = \sum_{x=0}^{z} \mathrm{Bin}(x; n, p_0) = \sum_{x=0}^{z} \binom{n}{x} p_0^x (1-p_0)^{n-x} .   (8)
3. COUNT DATA 509
The critical value (→ I/4.3.9) is the value of y, such that the probability of observing this or more extreme values of the test statistic is equal to or smaller than α. Since H0 and H1 define a two-tailed test, we need two critical values c1 and c2 that satisfy

\mathrm{Pr}(y \leq c_1 \mid H_0) \leq \frac{\alpha}{2} \quad \text{and} \quad \mathrm{Pr}(y \geq c_2 \mid H_0) \leq \frac{\alpha}{2} .

Given the test statistic's CDF in (8), this is fulfilled by the values c1 and c2 defined in (4) and (5). Thus, the null hypothesis H0 can be rejected (→ I/4.3.9), if the observed test statistic falls inside the rejection region (→ I/4.3.1), i.e. if y ≤ c1 or y ≥ c2.
Sources:
• Wikipedia (2023): “Binomial test”; in: Wikipedia, the free encyclopedia, retrieved on 2023-12-16;
URL: https://en.wikipedia.org/wiki/Binomial_test#Usage.
• Wikipedia (2023): “Binomialtest”; in: Wikipedia – Die freie Enzyklopädie, retrieved on 2023-12-16;
URL: https://de.wikipedia.org/wiki/Binomialtest#Signifikanzniveau_und_kritische_Werte.
y ∼ Bin(n, p) . (1)
Then, the maximum likelihood estimator (→ I/4.1.3) of p is
\hat{p} = \frac{y}{n} .   (2)
Proof: With the probability mass function of the binomial distribution (→ II/1.3.2), equation (1)
implies the following likelihood function (→ I/5.1.2):
p(y|p) = \mathrm{Bin}(y; n, p) = \binom{n}{y} p^y (1-p)^{n-y} .   (3)
Thus, the log-likelihood function (→ I/4.1.2) is

LL(p) = \log \binom{n}{y} + y \log(p) + (n-y) \log(1-p) ,   (4)

its first derivative is

\frac{\mathrm{d}LL(p)}{\mathrm{d}p} = \frac{y}{p} - \frac{n-y}{1-p}   (5)
and setting this derivative to zero gives the MLE for p:

\frac{\mathrm{d}LL(\hat{p})}{\mathrm{d}p} = 0
0 = \frac{y}{\hat{p}} - \frac{n-y}{1-\hat{p}}
\frac{n-y}{1-\hat{p}} = \frac{y}{\hat{p}}
(n-y)\,\hat{p} = y\,(1-\hat{p})
n\hat{p} - y\hat{p} = y - y\hat{p}
n\hat{p} = y
\hat{p} = \frac{y}{n} .   (6)
■
Sources:
• Wikipedia (2022): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2022-
11-23; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Statistical_inference.
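As a quick numerical illustration of this result (not part of the proof), one can evaluate the log-likelihood from (3), up to the constant binomial coefficient, on a grid and confirm that it peaks at y/n:

```python
from math import log

def binom_log_lik(p, n, y):
    """Log-likelihood of equation (3), dropping the constant log C(n, y)."""
    return y * log(p) + (n - y) * log(1 - p)

n, y = 30, 12
grid = [i / 1000 for i in range(1, 1000)]
p_best = max(grid, key=lambda p: binom_log_lik(p, n, y))
print(p_best)  # → 0.4, i.e. y/n
```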
y ∼ Bin(n, p) . (1)
Then, the maximum log-likelihood (→ I/4.1.5) for this model is
MLL = LL(\hat{p})
= \log\binom{n}{y} + y \log\frac{y}{n} + (n-y) \log\left(1 - \frac{y}{n}\right)
= \log\binom{n}{y} + y \log\frac{y}{n} + (n-y) \log\frac{n-y}{n}
= \log\binom{n}{y} + y \log(y) + (n-y) \log(n-y) - n \log(n) .   (5)
With the definition of the binomial coefficient

\binom{n}{k} = \frac{n!}{k!\,(n-k)!}   (6)

and the definition of the gamma function, \Gamma(n+1) = n!, the maximum log-likelihood in (5) can equivalently be written as

MLL = \log\Gamma(n+1) - \log\Gamma(y+1) - \log\Gamma(n-y+1) + y\log(y) + (n-y)\log(n-y) - n\log(n) .
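Since the factorials in the binomial coefficient overflow quickly, the maximum log-likelihood is best evaluated with the log-gamma function; a minimal sketch (assuming 0 < y < n):

```python
from math import lgamma, log

def binom_mll(n, y):
    """Maximum log-likelihood of the binomial model, cf. equation (5),
    with log C(n, y) = lgamma(n+1) - lgamma(y+1) - lgamma(n-y+1)."""
    log_coeff = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    return log_coeff + y * log(y) + (n - y) * log(n - y) - n * log(n)

print(binom_mll(100, 42))
```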
y ∼ Bin(n, p) . (1)
Moreover, assume a beta prior distribution (→ III/3.1.6) over the model parameter p:

p \sim \mathrm{Bet}(\alpha_0, \beta_0) .   (2)
Proof: Given the prior distribution (→ I/5.1.3) in (2), the posterior distribution (→ I/5.1.7) for
binomial observations (→ III/3.1.1) is also a beta distribution (→ III/3.1.7)
\alpha_n = \alpha_0 + y
\beta_n = \beta_0 + (n - y) .   (5)
\hat{p}_{\mathrm{MAP}} = \frac{\alpha_n - 1}{\alpha_n + \beta_n - 2}
\overset{(5)}{=} \frac{\alpha_0 + y - 1}{\alpha_0 + y + \beta_0 + (n-y) - 2}
= \frac{\alpha_0 + y - 1}{\alpha_0 + \beta_0 + n - 2} .   (7)
■
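The posterior update in (5) and the mode in (7) are simple enough to state as code; a short sketch with arbitrary example values:

```python
def beta_posterior(y, n, alpha0, beta0):
    """Posterior hyperparameters under a Bet(alpha0, beta0) prior, cf. (5)."""
    return alpha0 + y, beta0 + (n - y)

def map_estimate(y, n, alpha0, beta0):
    """Maximum-a-posteriori estimate of p, cf. equation (7)."""
    alpha_n, beta_n = beta_posterior(y, n, alpha0, beta0)
    return (alpha_n - 1) / (alpha_n + beta_n - 2)

# With a flat Bet(1, 1) prior, the MAP estimate equals the MLE y/n:
print(map_estimate(7, 10, 1, 1))  # → 0.7
```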
y ∼ Bin(n, p) . (1)
Then, the conjugate prior (→ I/5.2.5) for the model parameter p is a beta distribution (→ II/3.9.1):

p \sim \mathrm{Bet}(\alpha_0, \beta_0) .   (2)
Proof: With the probability mass function of the binomial distribution (→ II/1.3.2), the likelihood
function (→ I/5.1.2) implied by (1) is given by
p(y|p) = \binom{n}{y} p^y (1-p)^{n-y} .   (3)

In other words, the likelihood function is proportional to a power of p times a power of (1 − p):

p(y|p) \propto p^y (1-p)^{n-y} .   (4)
■
Sources:
• Wikipedia (2020): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-23; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Estimation_of_parameters.
y ∼ Bin(n, p) . (1)
Moreover, assume a beta prior distribution (→ III/3.1.6) over the model parameter p:

p \sim \mathrm{Bet}(\alpha_0, \beta_0) .   (2)
\alpha_n = \alpha_0 + y
\beta_n = \beta_0 + (n - y) .   (4)
Proof: With the probability mass function of the binomial distribution (→ II/1.3.2), the likelihood
function (→ I/5.1.2) implied by (1) is given by
p(y|p) = \binom{n}{y} p^y (1-p)^{n-y} .   (5)
Combining the likelihood function (5) with the prior distribution (2), the joint likelihood (→ I/5.1.5) of the model is given by

p(y, p) = p(y|p)\, p(p) = \binom{n}{y} p^y (1-p)^{n-y} \cdot \frac{1}{B(\alpha_0, \beta_0)}\, p^{\alpha_0 - 1} (1-p)^{\beta_0 - 1} .   (6)
Sources:
• Wikipedia (2020): “Binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
01-23; URL: https://en.wikipedia.org/wiki/Binomial_distribution#Estimation_of_parameters.
y ∼ Bin(n, p) . (1)
Moreover, assume a beta prior distribution (→ III/3.1.6) over the model parameter p:
\alpha_n = \alpha_0 + y
\beta_n = \beta_0 + (n - y) .   (4)
Proof: With the probability mass function of the binomial distribution (→ II/1.3.2), the likelihood
function (→ I/5.1.2) implied by (1) is given by
p(y|p) = \binom{n}{y} p^y (1-p)^{n-y} .   (5)
Combining the likelihood function (5) with the prior distribution (2), the joint likelihood (→ I/5.1.5) of the model is given by

p(y, p) = p(y|p)\, p(p) = \binom{n}{y} p^y (1-p)^{n-y} \cdot \frac{1}{B(\alpha_0, \beta_0)}\, p^{\alpha_0 - 1} (1-p)^{\beta_0 - 1} .   (6)
Note that the model evidence is the marginal density of the joint likelihood (→ I/5.1.11):
Z
p(y) = p(y, p) dp . (7)
p(y, p) = \binom{n}{y} \frac{1}{B(\alpha_0, \beta_0)}\, B(\alpha_n, \beta_n) \cdot \frac{1}{B(\alpha_n, \beta_n)}\, p^{\alpha_n - 1} (1-p)^{\beta_n - 1} .   (8)
Using the probability density function of the beta distribution (→ II/3.9.3), p can now be integrated
out easily
p(y) = \int \binom{n}{y} \frac{B(\alpha_n, \beta_n)}{B(\alpha_0, \beta_0)} \cdot \frac{1}{B(\alpha_n, \beta_n)}\, p^{\alpha_n - 1} (1-p)^{\beta_n - 1} \, \mathrm{d}p
= \binom{n}{y} \frac{B(\alpha_n, \beta_n)}{B(\alpha_0, \beta_0)} \int \mathrm{Bet}(p; \alpha_n, \beta_n) \, \mathrm{d}p
= \binom{n}{y} \frac{B(\alpha_n, \beta_n)}{B(\alpha_0, \beta_0)} .   (9)
Sources:
• Wikipedia (2020): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2020-01-24; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#Motivation_and_
derivation.
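Equation (9) is conveniently evaluated on the log scale via the log-gamma function; a small sketch. With a flat Bet(1, 1) prior, the model evidence is 1/(n + 1) for every y, which serves as a check:

```python
from math import lgamma, log

def log_beta(a, b):
    """log B(a, b) expressed through log-gamma functions."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_model_evidence(y, n, alpha0, beta0):
    """Logarithm of equation (9):
    log p(y) = log C(n, y) + log B(alpha_n, beta_n) - log B(alpha_0, beta_0)."""
    alpha_n, beta_n = alpha0 + y, beta0 + (n - y)
    log_coeff = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    return log_coeff + log_beta(alpha_n, beta_n) - log_beta(alpha0, beta0)

print(log_model_evidence(3, 10, 1, 1), -log(11))  # both ≈ -2.3979
```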
y ∼ Bin(n, p) . (1)
Moreover, assume two statistical models (→ I/5.1.4), one assuming that p is 0.5 (null model (→
I/4.3.2)), the other imposing a beta distribution (→ III/3.1.6) as the prior distribution (→ I/5.1.3)
on the model parameter p (alternative (→ I/4.3.3)):

m_0 : \; y \sim \mathrm{Bin}(n, p), \; p = 0.5
m_1 : \; y \sim \mathrm{Bin}(n, p), \; p \sim \mathrm{Bet}(\alpha_0, \beta_0) .   (2)
Proof: The log Bayes factor is equal to the difference of two log model evidences (→ IV/3.3.8):
The log model evidence of the alternative m1 is the logarithm of the model evidence for binomial observations,

LME(m_1) = \log p(y|m_1) = \log\binom{n}{y} + \log B(\alpha_n, \beta_n) - \log B(\alpha_0, \beta_0) .   (5)

Because the null model m0 has no free parameter, its log model evidence equals the log-likelihood at p = 0.5:

LME(m_0) = \log p(y|p = 0.5) = \log\binom{n}{y} + y \log(0.5) + (n-y) \log(1 - 0.5)
= \log\binom{n}{y} + n \log\frac{1}{2} .   (6)
Subtracting the two LMEs from each other, the LBF emerges as
LBF_{10} = \log B(\alpha_n, \beta_n) - \log B(\alpha_0, \beta_0) - n \log\frac{1}{2}   (7)
where the posterior hyperparameters (→ I/5.1.7) are given by (→ III/3.1.7)
\alpha_n = \alpha_0 + y
\beta_n = \beta_0 + (n - y)   (8)
with the number of trials (→ II/1.3.1) n and the number of successes (→ II/1.3.1) y.
■
y ∼ Bin(n, p) . (1)
Moreover, assume two statistical models (→ I/5.1.4), one assuming that p is 0.5 (null model (→
I/4.3.2)), the other imposing a beta distribution (→ III/3.1.6) as the prior distribution (→ I/5.1.3)
on the model parameter p (alternative (→ I/4.3.3)):

m_0 : \; y \sim \mathrm{Bin}(n, p), \; p = 0.5
m_1 : \; y \sim \mathrm{Bin}(n, p), \; p \sim \mathrm{Bet}(\alpha_0, \beta_0) .   (2)
Proof: The posterior probability for one of two models is a function of the log Bayes factor in favor
of this model (→ IV/3.4.4):
exp(LBF12 )
p(m1 |y) = . (4)
exp(LBF12 ) + 1
The log Bayes factor in favor of the alternative model for binomial observations (→ III/3.1.9) is given
by
LBF_{10} = \log B(\alpha_n, \beta_n) - \log B(\alpha_0, \beta_0) - n \log\frac{1}{2}   (5)
and the corresponding Bayes factor (→ IV/3.3.1), i.e. exponentiated log Bayes factor (→ IV/3.3.7),
is equal to
BF_{10} = \exp(LBF_{10}) = 2^n \cdot \frac{B(\alpha_n, \beta_n)}{B(\alpha_0, \beta_0)} .   (6)
Thus, the posterior probability of the alternative, assuming a prior distribution over the probability
p, compared to the null model, assuming a fixed probability p = 0.5, follows as
p(m_1|y) \overset{(4)}{=} \frac{\exp(LBF_{10})}{\exp(LBF_{10}) + 1}
\overset{(6)}{=} \frac{2^n \cdot \frac{B(\alpha_n, \beta_n)}{B(\alpha_0, \beta_0)}}{2^n \cdot \frac{B(\alpha_n, \beta_n)}{B(\alpha_0, \beta_0)} + 1}
= \frac{1}{1 + 2^{-n} \left[ B(\alpha_0, \beta_0) / B(\alpha_n, \beta_n) \right]}   (7)
\alpha_n = \alpha_0 + y
\beta_n = \beta_0 + (n - y)   (8)
with the number of trials (→ II/1.3.1) n and the number of successes (→ II/1.3.1) y.
■
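Equation (7) can be implemented stably by working with the log Bayes factor; a brief sketch with a flat prior as default (our choice, not prescribed by the text):

```python
from math import exp, lgamma, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def posterior_model_prob(y, n, alpha0=1.0, beta0=1.0):
    """Posterior probability of m1 versus the null model p = 0.5, cf. (7)."""
    alpha_n, beta_n = alpha0 + y, beta0 + (n - y)
    lbf10 = log_beta(alpha_n, beta_n) - log_beta(alpha0, beta0) - n * log(0.5)
    return 1.0 / (1.0 + exp(-lbf10))  # logistic of the log Bayes factor

# 16/20 successes favor the alternative more than 10/20 successes do:
print(posterior_model_prob(16, 20), posterior_model_prob(10, 20))
```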
y ∼ Mult(n, p) . (1)
Then, the null hypothesis (→ I/4.3.2)
where Pr0 (x) is the probability of observing the numbers of occurrences x = [x1 , . . . , xk ] under the
null hypothesis:
\mathrm{Pr}_0(x) = n! \prod_{j=1}^{k} \frac{p_{0j}^{x_j}}{x_j!} .   (4)
\mathrm{Pr}(y|H_0) = n! \prod_{j=1}^{k} \frac{p_{0j}^{y_j}}{y_j!} .   (8)
The probability of observing any other set of counts x, given H0 , is
\mathrm{Pr}(x|H_0) = n! \prod_{j=1}^{k} \frac{p_{0j}^{x_j}}{x_j!} .   (9)
The p-value (→ I/4.3.10) is the probability of observing a value of the test statistic (→ I/4.3.5) that is as extreme as or more extreme than the actually observed test statistic. Any set of counts x may be considered as extreme as or more extreme than the actually observed counts y, if the former is equally probable or less probable than the latter. Thus, the p-value is given by

p = \sum_{x \,:\, \mathrm{Pr}(x|H_0) \,\leq\, \mathrm{Pr}(y|H_0)} \mathrm{Pr}(x|H_0) .
The null hypothesis is then rejected (→ I/4.3.1), if

p < \alpha .   (12)
■
Sources:
• Wikipedia (2023): “Multinomial test”; in: Wikipedia, the free encyclopedia, retrieved on 2023-12-
23; URL: https://en.wikipedia.org/wiki/Multinomial_test.
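The exact multinomial test can be implemented by enumerating all count vectors x with Σ xj = n, which is only feasible for small n and k; a minimal sketch (helper names are our own):

```python
from math import factorial

def mult_pmf(x, p0):
    """Pr(x | H0) as in equation (9)."""
    prob = factorial(sum(x))
    for xj, pj in zip(x, p0):
        prob *= pj**xj / factorial(xj)
    return prob

def compositions(n, k):
    """All count vectors of length k summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_test(y, p0):
    """Exact p-value: total mass of all x no more probable than y under H0."""
    obs = mult_pmf(y, p0)
    pval = 0.0
    for x in compositions(sum(y), len(y)):
        pr = mult_pmf(x, p0)
        if pr <= obs * (1 + 1e-9):  # tolerance for floating-point ties
            pval += pr
    return pval

print(multinomial_test((6, 0, 0), (1/3, 1/3, 1/3)))  # → 3/729 ≈ 0.0041
```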
y ∼ Mult(n, p) . (1)
Then, the maximum likelihood estimator (→ I/4.1.3) of p is
\hat{p} = \frac{1}{n}\, y , \quad \text{i.e.} \quad \hat{p}_j = \frac{y_j}{n} \quad \text{for all} \quad j = 1, \ldots, k .   (2)
Proof: Note that the marginal distribution of each element in a multinomial random vector is a
binomial distribution
yj ∼ Bin(n, pj ) (4)
which implies the likelihood function (→ III/3.1.3)
p(y_j|p_j) = \mathrm{Bin}(y_j; n, p_j) = \binom{n}{y_j} p_j^{y_j} (1 - p_j)^{n - y_j} .   (5)
To this, we can apply maximum likelihood estimation for binomial observations (→ III/3.1.3), such
that the MLE for each pj is
\hat{p}_j = \frac{y_j}{n} .   (6)
■
y ∼ Mult(n, p) . (1)
Then, the maximum log-likelihood (→ I/4.1.5) for this model is
MLL = \log\Gamma(n+1) - \sum_{j=1}^{k} \log\Gamma(y_j + 1) - n \log(n) + \sum_{j=1}^{k} y_j \log(y_j) .   (2)
Proof: With the probability mass function of the multinomial distribution (→ II/2.2.2), equation
(1) implies the following likelihood function (→ I/5.1.2):
p(y|p) = \mathrm{Mult}(y; n, p) = \binom{n}{y_1, \ldots, y_k} \prod_{j=1}^{k} p_j^{y_j} .   (3)
Plugging (5) into (4), we obtain the maximum log-likelihood (→ I/4.1.5) of the multinomial obser-
vation model in (1) as
MLL = LL(\hat{p})
= \log\binom{n}{y_1, \ldots, y_k} + \sum_{j=1}^{k} y_j \log\frac{y_j}{n}
= \log\binom{n}{y_1, \ldots, y_k} + \sum_{j=1}^{k} \left[ y_j \log(y_j) - y_j \log(n) \right]
= \log\binom{n}{y_1, \ldots, y_k} + \sum_{j=1}^{k} y_j \log(y_j) - \sum_{j=1}^{k} y_j \log(n)
= \log\binom{n}{y_1, \ldots, y_k} + \sum_{j=1}^{k} y_j \log(y_j) - n \log(n) .   (6)
Finally, writing the multinomial coefficient in terms of the gamma function, \log\binom{n}{y_1, \ldots, y_k} = \log\Gamma(n+1) - \sum_{j=1}^{k}\log\Gamma(y_j+1), we arrive at

MLL = \log\Gamma(n+1) - \sum_{j=1}^{k} \log\Gamma(y_j + 1) - n \log(n) + \sum_{j=1}^{k} y_j \log(y_j) .   (9)
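Equation (9) is again best evaluated with the log-gamma function, with terms where yj = 0 contributing zero; a brief sketch:

```python
from math import lgamma, log

def mult_mll(y):
    """Maximum log-likelihood of the multinomial model, cf. equation (9)."""
    n = sum(y)
    out = lgamma(n + 1) - n * log(n)
    for yj in y:
        out -= lgamma(yj + 1)
        if yj > 0:
            out += yj * log(yj)  # y log(y) -> 0 as y -> 0
    return out

print(mult_mll((4, 3, 3)))
```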
y ∼ Mult(n, p) . (1)
Moreover, assume a Dirichlet prior distribution (→ III/3.2.6) over the model parameter p:

p \sim \mathrm{Dir}(\alpha_0) .   (2)
Proof: Given the prior distribution (→ I/5.1.3) in (2), the posterior distribution (→ I/5.1.7) for
multinomial observations (→ III/3.2.1) is also a Dirichlet distribution (→ III/3.2.7)
Applying (6) to (4) with (5), the maximum-a-posteriori estimates (→ I/5.1.8) of p follow as
\hat{p}_{i,\mathrm{MAP}} = \frac{\alpha_{ni} - 1}{\sum_j \alpha_{nj} - k}
\overset{(5)}{=} \frac{\alpha_{0i} + y_i - 1}{\sum_j (\alpha_{0j} + y_j) - k}
= \frac{\alpha_{0i} + y_i - 1}{\sum_j \alpha_{0j} + \sum_j y_j - k} .   (7)
y ∼ Mult(n, p) . (1)
Then, the conjugate prior (→ I/5.2.5) for the model parameter p is a Dirichlet distribution (→ II/4.4.1):

p \sim \mathrm{Dir}(\alpha_0) .   (2)
Proof: With the probability mass function of the multinomial distribution (→ II/2.2.2), the likeli-
hood function (→ I/5.1.2) implied by (1) is given by
p(y|p) = \binom{n}{y_1, \ldots, y_k} \prod_{j=1}^{k} p_j^{y_j} .   (3)
In other words, the likelihood function is proportional to a product of powers of the entries of the
vector p:
p(y|p) \propto \prod_{j=1}^{k} p_j^{y_j} .   (4)
p(p) \propto \prod_{j=1}^{k} p_j^{\alpha_{0j} - 1}   (7)
Sources:
• Wikipedia (2020): “Dirichlet distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-11; URL: https://en.wikipedia.org/wiki/Dirichlet_distribution#Conjugate_to_categorical/multinomi
y ∼ Mult(n, p) . (1)
Moreover, assume a Dirichlet prior distribution (→ III/3.2.6) over the model parameter p:

p \sim \mathrm{Dir}(\alpha_0) .   (2)
Proof: With the probability mass function of the multinomial distribution (→ II/2.2.2), the likeli-
hood function (→ I/5.1.2) implied by (1) is given by
p(y|p) = \binom{n}{y_1, \ldots, y_k} \prod_{j=1}^{k} p_j^{y_j} .   (5)
Combining the likelihood function (5) with the prior distribution (2), the joint likelihood (→ I/5.1.5) of the model is given by

p(y, p) = p(y|p)\, p(p) = \binom{n}{y_1, \ldots, y_k} \prod_{j=1}^{k} p_j^{y_j} \cdot \frac{\Gamma\big(\sum_{j=1}^{k} \alpha_{0j}\big)}{\prod_{j=1}^{k} \Gamma(\alpha_{0j})} \prod_{j=1}^{k} p_j^{\alpha_{0j} - 1} .   (6)
Note that the posterior distribution is proportional to the joint likelihood (→ I/5.1.9):
p(p|y) \propto \prod_{j=1}^{k} p_j^{\alpha_{nj} - 1}   (8)
which, when normalized to one, results in the probability density function of the Dirichlet distribution
(→ II/4.4.2):
p(p|y) = \frac{\Gamma\big(\sum_{j=1}^{k} \alpha_{nj}\big)}{\prod_{j=1}^{k} \Gamma(\alpha_{nj})} \prod_{j=1}^{k} p_j^{\alpha_{nj} - 1} = \mathrm{Dir}(p; \alpha_n) .   (9)
Sources:
• Wikipedia (2020): “Dirichlet distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-11; URL: https://en.wikipedia.org/wiki/Dirichlet_distribution#Conjugate_to_categorical/multinomi
y ∼ Mult(n, p) . (1)
Moreover, assume a Dirichlet prior distribution (→ III/3.2.6) over the model parameter p:

p \sim \mathrm{Dir}(\alpha_0) .   (2)
\log p(y|m) = \log\Gamma(n+1) - \sum_{j=1}^{k} \log\Gamma(y_j + 1)
+ \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{0j}\Big) - \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{nj}\Big)
+ \sum_{j=1}^{k} \log\Gamma(\alpha_{nj}) - \sum_{j=1}^{k} \log\Gamma(\alpha_{0j}) .   (3)
Proof: With the probability mass function of the multinomial distribution (→ II/2.2.2), the likeli-
hood function (→ I/5.1.2) implied by (1) is given by
p(y|p) = \binom{n}{y_1, \ldots, y_k} \prod_{j=1}^{k} p_j^{y_j} .   (5)
Combining the likelihood function (5) with the prior distribution (2), the joint likelihood (→ I/5.1.5) of the model is given by

p(y, p) = p(y|p)\, p(p) = \binom{n}{y_1, \ldots, y_k} \prod_{j=1}^{k} p_j^{y_j} \cdot \frac{\Gamma\big(\sum_{j=1}^{k} \alpha_{0j}\big)}{\prod_{j=1}^{k} \Gamma(\alpha_{0j})} \prod_{j=1}^{k} p_j^{\alpha_{0j} - 1} .   (6)
Note that the model evidence is the marginal density of the joint likelihood:
p(y) = \int p(y, p) \, \mathrm{d}p .   (7)
Using the probability density function of the Dirichlet distribution (→ II/4.4.2), p can now be
integrated out easily
p(y) = \int \binom{n}{y_1, \ldots, y_k} \frac{\Gamma\big(\sum_j \alpha_{0j}\big)}{\prod_j \Gamma(\alpha_{0j})} \cdot \frac{\prod_j \Gamma(\alpha_{nj})}{\Gamma\big(\sum_j \alpha_{nj}\big)} \cdot \frac{\Gamma\big(\sum_j \alpha_{nj}\big)}{\prod_j \Gamma(\alpha_{nj})} \prod_j p_j^{\alpha_{nj} - 1} \, \mathrm{d}p
= \binom{n}{y_1, \ldots, y_k} \frac{\Gamma\big(\sum_j \alpha_{0j}\big)}{\prod_j \Gamma(\alpha_{0j})} \cdot \frac{\prod_j \Gamma(\alpha_{nj})}{\Gamma\big(\sum_j \alpha_{nj}\big)} \int \mathrm{Dir}(p; \alpha_n) \, \mathrm{d}p
= \binom{n}{y_1, \ldots, y_k} \frac{\Gamma\big(\sum_j \alpha_{0j}\big)}{\Gamma\big(\sum_j \alpha_{nj}\big)} \cdot \frac{\prod_j \Gamma(\alpha_{nj})}{\prod_j \Gamma(\alpha_{0j})} ,   (9)
such that the log model evidence (→ IV/3.1.3) becomes

\log p(y|m) = \log\binom{n}{y_1, \ldots, y_k} + \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{0j}\Big) - \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{nj}\Big)
+ \sum_{j=1}^{k} \log\Gamma(\alpha_{nj}) - \sum_{j=1}^{k} \log\Gamma(\alpha_{0j}) .   (10)
Writing the multinomial coefficient in terms of the gamma function, we finally have

\log p(y|m) = \log\Gamma(n+1) - \sum_{j=1}^{k} \log\Gamma(y_j + 1)
+ \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{0j}\Big) - \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{nj}\Big)
+ \sum_{j=1}^{k} \log\Gamma(\alpha_{nj}) - \sum_{j=1}^{k} \log\Gamma(\alpha_{0j}) .   (13)
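A direct transcription of equation (13); for k = 2 and a flat prior, this reduces to the beta-binomial evidence 1/(n + 1), which we use as a check:

```python
from math import lgamma, log

def log_evidence_mult(y, alpha0):
    """Log model evidence for multinomial data under Dir(alpha0), cf. (13)."""
    alpha_n = [a + yj for a, yj in zip(alpha0, y)]
    n = sum(y)
    out = lgamma(n + 1) - sum(lgamma(yj + 1) for yj in y)
    out += lgamma(sum(alpha0)) - lgamma(sum(alpha_n))
    out += sum(lgamma(a) for a in alpha_n) - sum(lgamma(a) for a in alpha0)
    return out

print(log_evidence_mult((3, 7), (1, 1)), -log(11))  # both ≈ -2.3979
```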
y ∼ Mult(n, p) . (1)
Moreover, assume two statistical models (→ I/5.1.4), one assuming that each pj is 1/k (null model
(→ I/4.3.2)), the other imposing a Dirichlet distribution (→ III/3.2.6) as the prior distribution (→
I/5.1.3) on the model parameters p_1, \ldots, p_k (alternative (→ I/4.3.3)):

m_0 : \; y \sim \mathrm{Mult}(n, p), \; p = [1/k, \ldots, 1/k]
m_1 : \; y \sim \mathrm{Mult}(n, p), \; p \sim \mathrm{Dir}(\alpha_0) .   (2)
Then, the log Bayes factor (→ IV/3.3.7) in favor of m1 against m0 is

LBF_{10} = \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{0j}\Big) - \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{nj}\Big)
+ \sum_{j=1}^{k} \log\Gamma(\alpha_{nj}) - \sum_{j=1}^{k} \log\Gamma(\alpha_{0j}) - n \log\frac{1}{k}   (3)
where Γ(x) is the gamma function and αn are the posterior hyperparameters for multinomial obser-
vations (→ III/3.2.7) which are functions of the numbers of observations (→ II/2.2.1) y1 , . . . , yk .
Proof: The log Bayes factor is equal to the difference of two log model evidences (→ IV/3.3.8):
LME(m_1) = \log p(y|m_1) = \log\Gamma(n+1) - \sum_{j=1}^{k} \log\Gamma(y_j + 1)
+ \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{0j}\Big) - \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{nj}\Big)
+ \sum_{j=1}^{k} \log\Gamma(\alpha_{nj}) - \sum_{j=1}^{k} \log\Gamma(\alpha_{0j}) .   (5)
Because the null model m0 has no free parameter, its log model evidence (→ IV/3.1.3) (logarithmized
marginal likelihood (→ I/5.1.11)) is equal to the log-likelihood function for multinomial observations
(→ III/3.2.3) at the value p0 = [1/k, . . . , 1/k]:
LME(m_0) = \log p(y|p = p_0) = \log\binom{n}{y_1, \ldots, y_k} + \sum_{j=1}^{k} y_j \log\frac{1}{k}
= \log\binom{n}{y_1, \ldots, y_k} + n \log\frac{1}{k} .   (6)
Subtracting the two LMEs from each other, the LBF emerges as
LBF_{10} = \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{0j}\Big) - \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{nj}\Big)
+ \sum_{j=1}^{k} \log\Gamma(\alpha_{nj}) - \sum_{j=1}^{k} \log\Gamma(\alpha_{0j}) - n \log\frac{1}{k}   (7)
where the posterior hyperparameters (→ III/3.2.7) are given by

\alpha_n = \alpha_0 + y
= [\alpha_{01}, \ldots, \alpha_{0k}] + [y_1, \ldots, y_k]
= [\alpha_{01} + y_1, \ldots, \alpha_{0k} + y_k] ,
\quad \text{i.e.} \quad \alpha_{nj} = \alpha_{0j} + y_j \quad \text{for all} \quad j = 1, \ldots, k   (8)
y ∼ Mult(n, p) . (1)
Moreover, assume two statistical models (→ I/5.1.4), one assuming that each pj is 1/k (null model
(→ I/4.3.2)), the other imposing a Dirichlet distribution (→ III/3.2.6) as the prior distribution (→
I/5.1.3) on the model parameters p_1, \ldots, p_k (alternative (→ I/4.3.3)):

m_0 : \; y \sim \mathrm{Mult}(n, p), \; p = [1/k, \ldots, 1/k]
m_1 : \; y \sim \mathrm{Mult}(n, p), \; p \sim \mathrm{Dir}(\alpha_0) .   (2)
Then, the posterior probability (→ IV/3.4.1) of the alternative model (→ I/4.3.3) is given by
p(m_1|y) = \frac{1}{1 + k^{-n} \cdot \frac{\Gamma\left(\sum_{j=1}^{k} \alpha_{nj}\right)}{\Gamma\left(\sum_{j=1}^{k} \alpha_{0j}\right)} \cdot \frac{\prod_{j=1}^{k} \Gamma(\alpha_{0j})}{\prod_{j=1}^{k} \Gamma(\alpha_{nj})}}   (3)
where Γ(x) is the gamma function and αn are the posterior hyperparameters for multinomial obser-
vations (→ III/3.2.7) which are functions of the numbers of observations (→ II/2.2.1) y1 , . . . , yk .
Proof: The posterior probability for one of two models is a function of the log Bayes factor in favor
of this model (→ IV/3.4.4):
exp(LBF12 )
p(m1 |y) = . (4)
exp(LBF12 ) + 1
The log Bayes factor in favor of the alternative model for multinomial observations (→ III/3.2.9) is
given by
LBF_{10} = \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{0j}\Big) - \log\Gamma\Big(\sum_{j=1}^{k} \alpha_{nj}\Big)
+ \sum_{j=1}^{k} \log\Gamma(\alpha_{nj}) - \sum_{j=1}^{k} \log\Gamma(\alpha_{0j}) - n \log\frac{1}{k}   (5)
and the corresponding Bayes factor (→ IV/3.3.1), i.e. exponentiated log Bayes factor (→ IV/3.3.7),
is equal to
BF_{10} = \exp(LBF_{10}) = k^n \cdot \frac{\Gamma\left(\sum_{j=1}^{k} \alpha_{0j}\right)}{\Gamma\left(\sum_{j=1}^{k} \alpha_{nj}\right)} \cdot \frac{\prod_{j=1}^{k} \Gamma(\alpha_{nj})}{\prod_{j=1}^{k} \Gamma(\alpha_{0j})} .   (6)
Thus, the posterior probability of the alternative, assuming a prior distribution over the probabilities
p1 , . . . , pk , compared to the null model, assuming fixed probabilities p = [1/k, . . . , 1/k], follows as
p(m_1|y) \overset{(4)}{=} \frac{\exp(LBF_{10})}{\exp(LBF_{10}) + 1}
\overset{(6)}{=} \frac{k^n \cdot \frac{\Gamma(\sum_j \alpha_{0j})}{\Gamma(\sum_j \alpha_{nj})} \cdot \frac{\prod_j \Gamma(\alpha_{nj})}{\prod_j \Gamma(\alpha_{0j})}}{k^n \cdot \frac{\Gamma(\sum_j \alpha_{0j})}{\Gamma(\sum_j \alpha_{nj})} \cdot \frac{\prod_j \Gamma(\alpha_{nj})}{\prod_j \Gamma(\alpha_{0j})} + 1}
= \frac{1}{1 + k^{-n} \cdot \frac{\Gamma(\sum_j \alpha_{nj})}{\Gamma(\sum_j \alpha_{0j})} \cdot \frac{\prod_j \Gamma(\alpha_{0j})}{\prod_j \Gamma(\alpha_{nj})}} .   (7)
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Sources:
• Wikipedia (2020): “Poisson distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-22; URL: https://en.wikipedia.org/wiki/Poisson_distribution#Parameter_estimation.
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Then, the maximum likelihood estimate (→ I/4.1.3) for the rate parameter λ is given by
λ̂ = ȳ (2)
where ȳ is the sample mean (→ I/1.10.2)
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i .   (3)
Proof: The likelihood function (→ I/5.1.2) for each observation is given by the probability mass
function of the Poisson distribution (→ II/1.5.2)
p(y_i|\lambda) = \mathrm{Poiss}(y_i; \lambda) = \frac{\lambda^{y_i} \cdot \exp(-\lambda)}{y_i!}   (4)
and because observations are independent (→ I/1.3.6), the likelihood function for all observations is
the product of the individual ones:
p(y|\lambda) = \prod_{i=1}^{n} p(y_i|\lambda) = \prod_{i=1}^{n} \frac{\lambda^{y_i} \cdot \exp(-\lambda)}{y_i!} .   (5)
Thus, the log-likelihood function (→ I/4.1.2) is

LL(\lambda) = \log p(y|\lambda) = \log \prod_{i=1}^{n} \frac{\lambda^{y_i} \cdot \exp(-\lambda)}{y_i!}   (6)
which can be developed into
LL(\lambda) = \sum_{i=1}^{n} \log\frac{\lambda^{y_i} \cdot \exp(-\lambda)}{y_i!}
= \sum_{i=1}^{n} \left[ y_i \log(\lambda) - \lambda - \log(y_i!) \right]
= -\sum_{i=1}^{n} \lambda + \log(\lambda) \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \log(y_i!)
= -n\lambda + \log(\lambda) \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \log(y_i!) .   (7)
The first two derivatives of the log-likelihood are

\frac{\mathrm{d}LL(\lambda)}{\mathrm{d}\lambda} = \frac{1}{\lambda} \sum_{i=1}^{n} y_i - n
\frac{\mathrm{d}^2 LL(\lambda)}{\mathrm{d}\lambda^2} = -\frac{1}{\lambda^2} \sum_{i=1}^{n} y_i .   (8)
Setting the first derivative to zero, we obtain:

\frac{\mathrm{d}LL(\hat\lambda)}{\mathrm{d}\lambda} = 0
0 = \frac{1}{\hat\lambda} \sum_{i=1}^{n} y_i - n
\hat\lambda = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y} .   (9)
Plugging this value into the second derivative, we confirm:

\frac{\mathrm{d}^2 LL(\hat\lambda)}{\mathrm{d}\lambda^2} = -\frac{1}{\bar{y}^2} \sum_{i=1}^{n} y_i
= -\frac{n \cdot \bar{y}}{\bar{y}^2}
= -\frac{n}{\bar{y}} < 0 .   (10)

This demonstrates that the estimate \hat\lambda = \bar{y} maximizes the likelihood p(y|\lambda).
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Then, the conjugate prior (→ I/5.2.5) for the model parameter λ is a gamma distribution (→
II/3.4.1):
Proof: With the probability mass function of the Poisson distribution (→ II/1.5.2), the likelihood function (→ I/5.1.2) for each observation implied by (1) is given by

p(y_i|\lambda) = \mathrm{Poiss}(y_i; \lambda) = \frac{\lambda^{y_i} \cdot \exp[-\lambda]}{y_i!}   (3)

and because observations are independent (→ I/1.3.6), the likelihood function for all observations is the product of the individual ones:
p(y|\lambda) = \prod_{i=1}^{n} p(y_i|\lambda) = \prod_{i=1}^{n} \frac{\lambda^{y_i} \cdot \exp[-\lambda]}{y_i!} .   (4)
Resolving the product in the likelihood function, we have
p(y|\lambda) = \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \prod_{i=1}^{n} \lambda^{y_i} \cdot \prod_{i=1}^{n} \exp[-\lambda]
= \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \lambda^{n\bar{y}} \cdot \exp[-n\lambda]   (5)
where \bar{y} is the sample mean (→ I/1.10.2):

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i .   (6)
In other words, the likelihood function is proportional to a power of λ times an exponential of λ:

p(y|\lambda) \propto \lambda^{n\bar{y}} \exp[-n\lambda] .   (7)

The probability density function of a gamma distribution (→ II/3.4.7) over λ

p(\lambda) = \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]   (9)

exhibits the same proportionality, p(\lambda) \propto \lambda^{a_0 - 1} \exp[-b_0 \lambda], and is therefore a conjugate prior for λ.
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.14ff.; URL:
http://www.stat.columbia.edu/~gelman/book/.
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Moreover, assume a gamma prior distribution (→ III/3.3.3) over the model parameter λ:

\lambda \sim \mathrm{Gam}(a_0, b_0) .   (2)
a_n = a_0 + n\bar{y}
b_n = b_0 + n .   (4)
Proof: With the probability mass function of the Poisson distribution (→ II/1.5.2) and the prior density in (2), the joint likelihood (→ I/5.1.5) of the model is given by
p(y, \lambda) = \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \prod_{i=1}^{n} \lambda^{y_i} \cdot \prod_{i=1}^{n} \exp[-\lambda] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \lambda^{n\bar{y}} \exp[-n\lambda] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \cdot \lambda^{a_0 + n\bar{y} - 1} \cdot \exp[-(b_0 + n)\lambda]   (8)
where \bar{y} is the sample mean (→ I/1.10.2):

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i .   (9)
Note that the posterior distribution is proportional to the joint likelihood (→ I/5.1.9):
p(\lambda|y) = \frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} \exp[-b_n \lambda] = \mathrm{Gam}(\lambda; a_n, b_n) .   (12)
■
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.15; URL:
http://www.stat.columbia.edu/~gelman/book/.
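The conjugate update (4) in code, together with the posterior mean, which shrinks the sample mean toward the prior mean a0/b0; the prior values below are arbitrary examples:

```python
def gamma_posterior(y, a0, b0):
    """Posterior hyperparameters for Poisson data under Gam(a0, b0), cf. (4):
    an = a0 + n * ybar = a0 + sum(y), bn = b0 + n."""
    return a0 + sum(y), b0 + len(y)

y = [3, 5, 4, 6, 2]
a_n, b_n = gamma_posterior(y, a0=2.0, b0=1.0)
print(a_n / b_n)  # posterior mean of lambda, between 2/1 and mean(y) = 4
```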
yi ∼ Poiss(λ), i = 1, . . . , n . (1)
Moreover, assume a gamma prior distribution (→ III/3.3.3) over the model parameter λ:

\lambda \sim \mathrm{Gam}(a_0, b_0) .   (2)
a_n = a_0 + n\bar{y}
b_n = b_0 + n .   (4)
Proof: With the probability mass function of the Poisson distribution (→ II/1.5.2), the likelihood function (→ I/5.1.2) for each observation implied by (1) is given by

p(y_i|\lambda) = \mathrm{Poiss}(y_i; \lambda) = \frac{\lambda^{y_i} \cdot \exp[-\lambda]}{y_i!}   (5)

and because observations are independent (→ I/1.3.6), the likelihood function for all observations is the product of the individual ones:
p(y|\lambda) = \prod_{i=1}^{n} p(y_i|\lambda) = \prod_{i=1}^{n} \frac{\lambda^{y_i} \cdot \exp[-\lambda]}{y_i!} .   (6)
Combining the likelihood function (6) with the prior distribution (2), the joint likelihood (→ I/5.1.5)
of the model is given by
p(y, \lambda) = \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \prod_{i=1}^{n} \lambda^{y_i} \cdot \prod_{i=1}^{n} \exp[-\lambda] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \lambda^{n\bar{y}} \exp[-n\lambda] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \cdot \lambda^{a_0 + n\bar{y} - 1} \cdot \exp[-(b_0 + n)\lambda]   (8)
where \bar{y} is the sample mean (→ I/1.10.2):

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i .   (9)
Note that the model evidence is the marginal density of the joint likelihood (→ I/5.1.11):
p(y) = \int p(y, \lambda) \, \mathrm{d}\lambda .   (10)
Setting a_n = a_0 + n\bar{y} and b_n = b_0 + n, the joint likelihood (8) can also be written as

p(y, \lambda) = \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \cdot \frac{\Gamma(a_n)}{b_n^{a_n}} \cdot \frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} \exp[-b_n \lambda] .   (11)

Using the probability density function of the gamma distribution (→ II/3.4.7), λ can now be integrated out easily:

p(y) = \int \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \cdot \frac{\Gamma(a_n)}{b_n^{a_n}} \cdot \frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} \exp[-b_n \lambda] \, \mathrm{d}\lambda
= \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \frac{\Gamma(a_n)}{\Gamma(a_0)} \cdot \frac{b_0^{a_0}}{b_n^{a_n}} \int \mathrm{Gam}(\lambda; a_n, b_n) \, \mathrm{d}\lambda
= \prod_{i=1}^{n} \frac{1}{y_i!} \cdot \frac{\Gamma(a_n)}{\Gamma(a_0)} \cdot \frac{b_0^{a_0}}{b_n^{a_n}} .   (12)
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.14; URL:
http://www.stat.columbia.edu/~gelman/book/.
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Then, the maximum likelihood estimate (→ I/4.1.3) for the rate parameter λ is given by
\hat\lambda = \frac{\bar{y}}{\bar{x}}   (2)
where ȳ and x̄ are the sample means (→ I/1.10.2)
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .   (3)
Proof: With the probability mass function of the Poisson distribution (→ II/1.5.2), the likelihood function (→ I/5.1.2) for each observation implied by (1) is given by

p(y_i|\lambda) = \mathrm{Poiss}(y_i; \lambda x_i) = \frac{(\lambda x_i)^{y_i} \cdot \exp[-\lambda x_i]}{y_i!}   (4)

and because observations are independent (→ I/1.3.6), the likelihood function for all observations is the product of the individual ones:
p(y|\lambda) = \prod_{i=1}^{n} p(y_i|\lambda) = \prod_{i=1}^{n} \frac{(\lambda x_i)^{y_i} \cdot \exp[-\lambda x_i]}{y_i!} .   (5)
Thus, the log-likelihood function (→ I/4.1.2) is

LL(\lambda) = \log p(y|\lambda) = \log \prod_{i=1}^{n} \frac{(\lambda x_i)^{y_i} \cdot \exp[-\lambda x_i]}{y_i!}   (6)
which can be developed into
LL(\lambda) = \sum_{i=1}^{n} \log\frac{(\lambda x_i)^{y_i} \cdot \exp[-\lambda x_i]}{y_i!}
= \sum_{i=1}^{n} \left[ y_i \log(\lambda x_i) - \lambda x_i - \log(y_i!) \right]
= -\sum_{i=1}^{n} \lambda x_i + \sum_{i=1}^{n} y_i \left[ \log(\lambda) + \log(x_i) \right] - \sum_{i=1}^{n} \log(y_i!)
= -\lambda \sum_{i=1}^{n} x_i + \log(\lambda) \sum_{i=1}^{n} y_i + \sum_{i=1}^{n} y_i \log(x_i) - \sum_{i=1}^{n} \log(y_i!)
= -n\bar{x}\lambda + n\bar{y}\log(\lambda) + \sum_{i=1}^{n} y_i \log(x_i) - \sum_{i=1}^{n} \log(y_i!) .   (7)
The first two derivatives of the log-likelihood are

\frac{\mathrm{d}LL(\lambda)}{\mathrm{d}\lambda} = -n\bar{x} + \frac{n\bar{y}}{\lambda}
\frac{\mathrm{d}^2 LL(\lambda)}{\mathrm{d}\lambda^2} = -\frac{n\bar{y}}{\lambda^2} .   (8)
Setting the first derivative to zero, we obtain:
\frac{\mathrm{d}LL(\hat\lambda)}{\mathrm{d}\lambda} = 0
0 = -n\bar{x} + \frac{n\bar{y}}{\hat\lambda}
\hat\lambda = \frac{n\bar{y}}{n\bar{x}} = \frac{\bar{y}}{\bar{x}} .   (9)
Plugging this value into the second derivative, we confirm:
\frac{\mathrm{d}^2 LL(\hat\lambda)}{\mathrm{d}\lambda^2} = -\frac{n\bar{y}}{\hat\lambda^2}
= -\frac{n \cdot \bar{y}}{(\bar{y}/\bar{x})^2}
= -\frac{n \cdot \bar{x}^2}{\bar{y}} < 0 .   (10)
This demonstrates that the estimate λ̂ = ȳ/x̄ maximizes the likelihood p(y|λ).
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Then, the conjugate prior (→ I/5.2.5) for the model parameter λ is a gamma distribution (→
II/3.4.1):

\lambda \sim \mathrm{Gam}(a_0, b_0) .   (2)
Proof: With the probability mass function of the Poisson distribution (→ II/1.5.2), the likelihood
function (→ I/5.1.2) for each observation implied by (1) is given by
p(y|\lambda) = \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \prod_{i=1}^{n} \lambda^{y_i} \cdot \prod_{i=1}^{n} \exp[-\lambda x_i]
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \lambda^{\sum_{i=1}^{n} y_i} \cdot \exp\Big[ -\lambda \sum_{i=1}^{n} x_i \Big]
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \lambda^{n\bar{y}} \cdot \exp[-n\bar{x}\lambda]   (5)
where \bar{y} and \bar{x} are the sample means (→ I/1.10.2):

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .   (6)
In other words, the likelihood function is proportional to a power of λ times an exponential of λ. The probability density function of a gamma distribution (→ II/3.4.7) over λ

p(\lambda) = \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]   (9)

exhibits the same proportionality and is therefore a conjugate prior for λ.
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.14ff.; URL:
http://www.stat.columbia.edu/~gelman/book/.
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Moreover, assume a gamma prior distribution (→ III/3.4.3) over the model parameter λ:

\lambda \sim \mathrm{Gam}(a_0, b_0) .   (2)
a_n = a_0 + n\bar{y}
b_n = b_0 + n\bar{x} .   (4)
Proof: With the probability mass function of the Poisson distribution (→ II/1.5.2), the likelihood
function (→ I/5.1.2) for each observation implied by (1) is given by
p(y|\lambda) = \prod_{i=1}^{n} p(y_i|\lambda) = \prod_{i=1}^{n} \frac{(\lambda x_i)^{y_i} \cdot \exp[-\lambda x_i]}{y_i!} .   (6)
Combining the likelihood function (6) with the prior distribution (2), the joint likelihood (→ I/5.1.5)
of the model is given by
p(y, \lambda) = \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \lambda^{y_i} \exp[-\lambda x_i] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \lambda^{\sum_{i} y_i} \exp\Big[ -\lambda \sum_{i} x_i \Big] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \lambda^{n\bar{y}} \exp[-n\bar{x}\lambda] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \cdot \lambda^{a_0 + n\bar{y} - 1} \cdot \exp[-(b_0 + n\bar{x})\lambda]   (8)
where \bar{y} and \bar{x} are the sample means (→ I/1.10.2):

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .   (9)
Note that the posterior distribution is proportional to the joint likelihood (→ I/5.1.9):
p(\lambda|y) = \frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} \exp[-b_n \lambda] = \mathrm{Gam}(\lambda; a_n, b_n) .   (12)
■
Sources:
• Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2014): “Other standard
single-parameter models”; in: Bayesian Data Analysis, 3rd edition, ch. 2.6, p. 45, eq. 2.15; URL:
http://www.stat.columbia.edu/~gelman/book/.
yi ∼ Poiss(λxi ), i = 1, . . . , n . (1)
Moreover, assume a gamma prior distribution (→ III/3.4.3) over the model parameter λ:

\lambda \sim \mathrm{Gam}(a_0, b_0) .   (2)
\log p(y|m) = \sum_{i=1}^{n} y_i \log(x_i) - \sum_{i=1}^{n} \log(y_i!)
+ \log\Gamma(a_n) - \log\Gamma(a_0) + a_0 \log b_0 - a_n \log b_n   (3)
where the posterior hyperparameters are given by

a_n = a_0 + n\bar{y}
b_n = b_0 + n\bar{x} .   (4)
Proof: With the probability mass function of the Poisson distribution (→ II/1.5.2) and the prior density in (2), the joint likelihood (→ I/5.1.5) of the model is given by
p(y, \lambda) = \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \lambda^{y_i} \exp[-\lambda x_i] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \lambda^{\sum_{i} y_i} \exp\Big[ -\lambda \sum_{i} x_i \Big] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \lambda^{n\bar{y}} \exp[-n\bar{x}\lambda] \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \lambda^{a_0 - 1} \exp[-b_0 \lambda]
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \cdot \lambda^{a_0 + n\bar{y} - 1} \cdot \exp[-(b_0 + n\bar{x})\lambda]   (8)
where \bar{y} and \bar{x} are the sample means (→ I/1.10.2):

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i .   (9)
Note that the model evidence is the marginal density of the joint likelihood (→ I/5.1.11):
p(y) = \int p(y, \lambda) \, \mathrm{d}\lambda .   (10)
Setting an = a0 + nȳ and bn = b0 + nx̄, the joint likelihood can also be written as
p(y, \lambda) = \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \cdot \frac{\Gamma(a_n)}{b_n^{a_n}} \cdot \frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} \exp[-b_n \lambda] .   (11)
Using the probability density function of the gamma distribution (→ II/3.4.7), λ can now be inte-
grated out easily
p(y) = \int \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \cdot \frac{\Gamma(a_n)}{b_n^{a_n}} \cdot \frac{b_n^{a_n}}{\Gamma(a_n)} \lambda^{a_n - 1} \exp[-b_n \lambda] \, \mathrm{d}\lambda
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \frac{\Gamma(a_n)}{\Gamma(a_0)} \cdot \frac{b_0^{a_0}}{b_n^{a_n}} \int \mathrm{Gam}(\lambda; a_n, b_n) \, \mathrm{d}\lambda
= \prod_{i=1}^{n} \frac{x_i^{y_i}}{y_i!} \cdot \frac{\Gamma(a_n)}{\Gamma(a_0)} \cdot \frac{b_0^{a_0}}{b_n^{a_n}} ,   (12)
such that the log model evidence (→ IV/3.1.3) is

\log p(y|m) = \sum_{i=1}^{n} y_i \log(x_i) - \sum_{i=1}^{n} \log(y_i!)
+ \log\Gamma(a_n) - \log\Gamma(a_0) + a_0 \log b_0 - a_n \log b_n .   (13)
■
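Equation (13) transcribed into Python; as a check, a single observation with exposure x1 = 1 under a Gam(1, 1) prior gives the evidence p(y1 = 0) = 1/2, which can also be verified analytically:

```python
from math import lgamma, log

def log_evidence_poisson_exp(y, x, a0, b0):
    """Log model evidence of the Poisson model with exposures, cf. (13)."""
    a_n = a0 + sum(y)            # a0 + n * ybar
    b_n = b0 + sum(x)            # b0 + n * xbar
    out = sum(yi * log(xi) for yi, xi in zip(y, x))
    out -= sum(lgamma(yi + 1) for yi in y)
    out += lgamma(a_n) - lgamma(a0) + a0 * log(b0) - a_n * log(b_n)
    return out

# One observation y = 0 with exposure x = 1 under a Gam(1, 1) prior:
print(log_evidence_poisson_exp([0], [1.0], 1.0, 1.0), log(0.5))  # equal
```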
4. FREQUENCY DATA 543
4 Frequency data
4.1 Beta-distributed data
4.1.1 Definition
Definition: Beta-distributed data are defined as a set of proportions y = {y1 , . . . , yn } with yi ∈
[0, 1], i = 1, . . . , n, independent and identically distributed according to a beta distribution (→
II/3.9.1) with shapes α and β:
Then, the method-of-moments estimates (→ I/4.1.6) for the shape parameters α and β are given by

\hat\alpha = \bar{y} \left( \frac{\bar{y}(1-\bar{y})}{\bar{v}} - 1 \right)
\hat\beta = (1-\bar{y}) \left( \frac{\bar{y}(1-\bar{y})}{\bar{v}} - 1 \right)   (2)
where ȳ is the sample mean (→ I/1.10.2) and v̄ is the unbiased sample variance (→ I/1.11.2):
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
\bar{v} = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 .   (3)
Proof: Mean (→ II/3.9.6) and variance (→ II/3.9.7) of the beta distribution (→ II/3.9.1) in terms
of the parameters α and β are given by
\mathrm{E}(X) = \frac{\alpha}{\alpha + \beta}
\mathrm{Var}(X) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} .   (4)
Thus, matching the moments (→ I/4.1.6) requires us to solve the following equation system for α
and β:
\bar{y} = \frac{\alpha}{\alpha + \beta}
\bar{v} = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} .   (5)
Rearranging the first equation for β, we obtain:

\bar{y}(\alpha + \beta) = \alpha
\alpha\bar{y} + \beta\bar{y} = \alpha
\beta\bar{y} = \alpha - \alpha\bar{y}
\beta = \frac{\alpha}{\bar{y}} - \alpha
\beta = \alpha \left( \frac{1}{\bar{y}} - 1 \right) .   (6)
If we define q = 1/ȳ − 1 and plug (6) into the second equation, we have:
\bar{v} = \frac{\alpha \cdot \alpha q}{(\alpha + \alpha q)^2 (\alpha + \alpha q + 1)}
= \frac{\alpha^2 q}{(\alpha(1+q))^2 (\alpha(1+q) + 1)}
= \frac{q}{(1+q)^2 (\alpha(1+q) + 1)}
= \frac{q}{\alpha(1+q)^3 + (1+q)^2}
\quad \Leftrightarrow \quad q = \bar{v} \left[ \alpha(1+q)^3 + (1+q)^2 \right] .   (7)
Noting that 1 + q = 1/ȳ and q = (1 − ȳ)/ȳ, one obtains for α:
\frac{1-\bar{y}}{\bar{y}} = \bar{v} \left( \frac{\alpha}{\bar{y}^3} + \frac{1}{\bar{y}^2} \right)
\frac{1-\bar{y}}{\bar{y}\,\bar{v}} = \frac{\alpha}{\bar{y}^3} + \frac{1}{\bar{y}^2}
\frac{\bar{y}^2 (1-\bar{y})}{\bar{v}} = \alpha + \bar{y}
\alpha = \frac{\bar{y}^2 (1-\bar{y})}{\bar{v}} - \bar{y}
= \bar{y} \left( \frac{\bar{y}(1-\bar{y})}{\bar{v}} - 1 \right) .   (8)
Plugging this into equation (6), one obtains for β:
\beta = \bar{y} \left( \frac{\bar{y}(1-\bar{y})}{\bar{v}} - 1 \right) \cdot \frac{1-\bar{y}}{\bar{y}}
= (1-\bar{y}) \left( \frac{\bar{y}(1-\bar{y})}{\bar{v}} - 1 \right) .   (9)
Together, (8) and (9) constitute the method-of-moment estimates of α and β.
■
Sources:
• Wikipedia (2020): “Beta distribution”; in: Wikipedia, the free encyclopedia, retrieved on 2020-01-
20; URL: https://en.wikipedia.org/wiki/Beta_distribution#Method_of_moments.
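The estimators in (2) can be written directly in code. By construction, plugging α̂ and β̂ back into (4) reproduces the sample mean and variance, which makes for an easy self-check:

```python
def beta_mom(y):
    """Method-of-moments estimates for the beta distribution, cf. (2)."""
    n = len(y)
    ybar = sum(y) / n
    vbar = sum((yi - ybar)**2 for yi in y) / (n - 1)  # unbiased variance
    common = ybar * (1 - ybar) / vbar - 1
    return ybar * common, (1 - ybar) * common

alpha, beta = beta_mom([0.2, 0.4, 0.6, 0.8])
print(alpha, beta)  # symmetric data -> alpha == beta
```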
Definition: Dirichlet-distributed data are defined as a set of vectors y = \{y_1, \ldots, y_n\} with

y_i = [y_{i1}, \ldots, y_{ik}], \quad y_{ij} \in [0,1] \quad \text{and} \quad \sum_{j=1}^{k} y_{ij} = 1   (1)
for all i = 1, . . . , n (and j = 1, . . . , k) and each yi is independent and identically distributed according
to a Dirichlet distribution (→ II/4.4.1) with concentration parameters α = [α1 , . . . , αk ]:
yi ∼ Dir(α), i = 1, . . . , n . (2)
yi ∼ Dir(α), i = 1, . . . , n . (1)
Then, the maximum likelihood estimate (→ I/4.1.3) for the concentration parameters α can be
obtained by iteratively computing
" ! #
(new)
X
k
(old)
αj = ψ −1 ψ αj + log ȳj (2)
j=1
where \psi(x) is the digamma function and \overline{\log y_j} is given by:

\overline{\log y_j} = \frac{1}{n} \sum_{i=1}^{n} \log y_{ij} .   (3)
Proof: The likelihood function (→ I/5.1.2) for each observation is given by the probability density
function of the Dirichlet distribution (→ II/4.4.2)
p(y_i|\alpha) = \frac{\Gamma\left(\sum_{j=1}^{k} \alpha_j\right)}{\prod_{j=1}^{k} \Gamma(\alpha_j)} \prod_{j=1}^{k} y_{ij}^{\alpha_j - 1}   (4)
and because observations are independent (→ I/1.3.6), the likelihood function for all observations is
the product of the individual ones:
p(y|\alpha) = \prod_{i=1}^{n} p(y_i|\alpha) = \prod_{i=1}^{n} \frac{\Gamma\left(\sum_{j=1}^{k} \alpha_j\right)}{\prod_{j=1}^{k} \Gamma(\alpha_j)} \prod_{j=1}^{k} y_{ij}^{\alpha_j - 1} .   (5)
Thus, the log-likelihood function (→ I/4.1.2) is

LL(\alpha) = \log p(y|\alpha) = \log \prod_{i=1}^{n} \frac{\Gamma\left(\sum_{j=1}^{k} \alpha_j\right)}{\prod_{j=1}^{k} \Gamma(\alpha_j)} \prod_{j=1}^{k} y_{ij}^{\alpha_j - 1}   (6)
which can be developed into

LL(\alpha) = \sum_{i=1}^{n} \log\Gamma\Big(\sum_{j=1}^{k} \alpha_j\Big) - \sum_{i=1}^{n}\sum_{j=1}^{k} \log\Gamma(\alpha_j) + \sum_{i=1}^{n}\sum_{j=1}^{k} (\alpha_j - 1) \log y_{ij}
= n \log\Gamma\Big(\sum_{j=1}^{k} \alpha_j\Big) - n \sum_{j=1}^{k} \log\Gamma(\alpha_j) + n \sum_{j=1}^{k} (\alpha_j - 1) \frac{1}{n}\sum_{i=1}^{n} \log y_{ij}
= n \log\Gamma\Big(\sum_{j=1}^{k} \alpha_j\Big) - n \sum_{j=1}^{k} \log\Gamma(\alpha_j) + n \sum_{j=1}^{k} (\alpha_j - 1)\, \overline{\log y_j}   (7)
where we have defined

\overline{\log y_j} = \frac{1}{n} \sum_{i=1}^{n} \log y_{ij} .   (8)
The derivative of the log-likelihood with respect to a particular parameter αj is
" ! #
dLL(α) d Xk X k X k
= n log Γ αj − n log Γ(αj ) + n (αj − 1) log ȳj
dαj dαj j=1 j=1 j=1
" !#
d Xk
d d
= n log Γ αj − [n log Γ(αj )] + [n(αj − 1) log ȳj ] (9)
dαj j=1
dαj dαj
!
Xk
= nψ αj − nψ(αj ) + n log ȳj
j=1
d log Γ(x)
ψ(x) = . (10)
dx
Setting this derivative to zero, we obtain:
\frac{\mathrm{d}LL(\alpha)}{\mathrm{d}\alpha_j} = 0
0 = n\,\psi\Big(\sum_{j=1}^{k} \alpha_j\Big) - n\,\psi(\alpha_j) + n\,\overline{\log y_j}
0 = \psi\Big(\sum_{j=1}^{k} \alpha_j\Big) - \psi(\alpha_j) + \overline{\log y_j}
\psi(\alpha_j) = \psi\Big(\sum_{j=1}^{k} \alpha_j\Big) + \overline{\log y_j}
\alpha_j = \psi^{-1}\left[ \psi\Big(\sum_{j=1}^{k} \alpha_j\Big) + \overline{\log y_j} \right] .   (11)
In the following, we will use a fixed-point iteration to maximize LL(α). Given an initial guess for α,
we construct a lower bound on the likelihood function (7) which is tight at α. The maximum of this
bound is computed and it becomes the new guess. Because the Dirichlet distribution (→ II/4.4.1)
belongs to the exponential family, the log-likelihood function is convex in α and the maximum is the
only stationary point, such that the procedure is guaranteed to converge to the maximum.
In our case, we use the tangent-line bound on the log-gamma function, \log\Gamma(x) \geq \log\Gamma(\hat{x}) + (x - \hat{x})\,\psi(\hat{x}), applied to the first term of the normalized log-likelihood

\frac{1}{n} LL(\alpha) = \log\Gamma\Big(\sum_{j=1}^{k} \alpha_j\Big) - \sum_{j=1}^{k} \log\Gamma(\alpha_j) + \sum_{j=1}^{k} (\alpha_j - 1)\, \overline{\log y_j}

which, for a given guess \hat\alpha, yields the lower bound

\frac{1}{n} LL(\alpha) \geq \log\Gamma\Big(\sum_{j=1}^{k} \hat\alpha_j\Big) + \Big(\sum_{j=1}^{k} \alpha_j - \sum_{j=1}^{k} \hat\alpha_j\Big)\,\psi\Big(\sum_{j=1}^{k} \hat\alpha_j\Big) - \sum_{j=1}^{k} \log\Gamma(\alpha_j) + \sum_{j=1}^{k} (\alpha_j - 1)\, \overline{\log y_j}
\frac{1}{n} LL(\alpha) \geq \sum_{j=1}^{k} \alpha_j\,\psi\Big(\sum_{j=1}^{k} \hat\alpha_j\Big) - \sum_{j=1}^{k} \log\Gamma(\alpha_j) + \sum_{j=1}^{k} (\alpha_j - 1)\, \overline{\log y_j} + \mathrm{const.}   (13)
which leads to the following fixed-point iteration using (11):
" ! #
(new)
X k
(old)
αj = ψ −1 ψ αj + log ȳj . (14)
j=1
Sources:
• Minka TP (2012): “Estimating a Dirichlet distribution”; in: Papers by Tom Minka, retrieved on
2020-10-22; URL: https://tminka.github.io/papers/dirichlet/minka-dirichlet.pdf.
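The fixed-point iteration (14) is straightforward to implement. The following is a minimal sketch in Python (assuming NumPy and SciPy; the function names `dirichlet_mle` and `inv_psi` are illustrative, not from the source). The digamma inverse is computed by Newton's method, with the initialization suggested in Minka's paper:

```python
import numpy as np
from scipy.special import psi, polygamma

def inv_psi(y, n_iter=10):
    """Invert the digamma function by Newton's method."""
    # initialization from Minka (2012), Appendix C
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - psi(1.0)))
    for _ in range(n_iter):
        x = x - (psi(x) - y) / polygamma(1, x)   # Newton step on psi(x) = y
    return x

def dirichlet_mle(Y, tol=1e-10, max_iter=1000):
    """Fixed-point iteration (14) for the Dirichlet MLE.

    Y : (n, k) array of observations on the probability simplex.
    """
    log_y_bar = np.mean(np.log(Y), axis=0)       # log ybar_j, eq. (8)
    alpha = np.ones(Y.shape[1])                  # initial guess
    for _ in range(max_iter):
        new_alpha = inv_psi(psi(alpha.sum()) + log_y_bar)   # eq. (14)
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```

At the fixed point, the gradient (11) vanishes, which can be used as a convergence diagnostic.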
548 CHAPTER III. STATISTICAL MODELS
\[ \hat{\alpha} = \frac{n m_1 - m_2}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \qquad \text{and} \qquad \hat{\beta} = \frac{(n - m_1)\left( n - \frac{m_2}{m_1} \right)}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \quad (2) \]
where m1 and m2 are the first two raw sample moments (→ I/1.18.3):
\[ m_1 = \frac{1}{N} \sum_{i=1}^{N} y_i \; , \qquad m_2 = \frac{1}{N} \sum_{i=1}^{N} y_i^2 \; . \quad (3) \]
Proof: The first two raw moments of the beta-binomial distribution in terms of the parameters α
and β are given by
\[ \mu_1 = \frac{n\alpha}{\alpha + \beta} \; , \qquad \mu_2 = \frac{n\alpha \, (n\alpha + \beta + n)}{(\alpha + \beta)(\alpha + \beta + 1)} \; . \quad (4) \]
Thus, matching the moments (→ I/4.1.6) requires us to solve the following equation system for α
and β:
\[ m_1 = \frac{n\alpha}{\alpha + \beta} \; , \qquad m_2 = \frac{n\alpha \, (n\alpha + \beta + n)}{(\alpha + \beta)(\alpha + \beta + 1)} \; . \quad (5) \]
Solving the first equation for β, we obtain:

\[ \begin{aligned}
m_1 (\alpha + \beta) &= n\alpha \\
m_1 \alpha + m_1 \beta &= n\alpha \\
m_1 \beta &= n\alpha - m_1 \alpha \\
\beta &= \frac{n\alpha}{m_1} - \alpha \quad (6) \\
\beta &= \alpha \left( \frac{n}{m_1} - 1 \right) \; .
\end{aligned} \]
If we define q = n/m1 − 1 and plug (6) into the second equation, we have:
\[ \begin{aligned}
m_2 &= \frac{n\alpha \, (n\alpha + \alpha q + n)}{(\alpha + \alpha q)(\alpha + \alpha q + 1)} \\
&= \frac{n\alpha \, (\alpha (n + q) + n)}{\alpha (1 + q)(\alpha (1 + q) + 1)} \\
&= \frac{n \, (\alpha (n + q) + n)}{(1 + q)(\alpha (1 + q) + 1)} \quad (7) \\
&= \frac{n \, (\alpha (n + q) + n)}{\alpha (1 + q)^2 + (1 + q)} \; .
\end{aligned} \]
Noting that 1 + q = n/m1 and expanding the fraction with m1 , one obtains:
\[ \begin{aligned}
m_2 &= \frac{n \left( \alpha \left( \frac{n}{m_1} + n - 1 \right) + n \right)}{\alpha \frac{n^2}{m_1^2} + \frac{n}{m_1}} \\
m_2 &= \frac{\alpha \left( n + n m_1 - m_1 \right) + n m_1}{\alpha \frac{n}{m_1} + 1} \\
\frac{\alpha n m_2}{m_1} + m_2 &= \alpha \left( n + n m_1 - m_1 \right) + n m_1 \quad (8) \\
\alpha \left( \frac{n m_2}{m_1} - (n + n m_1 - m_1) \right) &= n m_1 - m_2 \\
\alpha \left( n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1 \right) &= n m_1 - m_2 \\
\alpha &= \frac{n m_1 - m_2}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \; .
\end{aligned} \]
Plugging this into (6) finally yields:

\[ \begin{aligned}
\beta &= \alpha \left( \frac{n}{m_1} - 1 \right) \\
\beta &= \frac{n m_1 - m_2}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \left( \frac{n}{m_1} - 1 \right) \\
\beta &= \frac{n^2 - n m_1 - \frac{n m_2}{m_1} + m_2}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \quad (9) \\
\hat{\beta} &= \frac{(n - m_1)\left( n - \frac{m_2}{m_1} \right)}{n \left( \frac{m_2}{m_1} - m_1 - 1 \right) + m_1} \; .
\end{aligned} \]
Sources:
• statisticsmatt (2022): “Method of Moments Estimation Beta Binomial Distribution”; in: YouTube,
retrieved on 2022-10-07; URL: https://www.youtube.com/watch?v=18PWnWJsPnA.
• Wikipedia (2022): “Beta-binomial distribution”; in: Wikipedia, the free encyclopedia, retrieved on
2022-10-07; URL: https://en.wikipedia.org/wiki/Beta-binomial_distribution#Method_of_moments.
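The estimator (2) is a one-liner per parameter once the sample moments (3) are in hand. A minimal sketch in plain Python (the function name `beta_binomial_mom` and the toy data are illustrative, not from the source):

```python
def beta_binomial_mom(y, n):
    """Method-of-moments estimates for the beta-binomial distribution.

    y : sequence of counts y_i in {0, ..., n}; n : number of trials per observation.
    Returns (alpha_hat, beta_hat) computed from the first two raw sample moments.
    """
    N = len(y)
    m1 = sum(y) / N                        # first raw sample moment, eq. (3)
    m2 = sum(yi ** 2 for yi in y) / N      # second raw sample moment
    denom = n * (m2 / m1 - m1 - 1) + m1    # shared denominator in eq. (2)
    alpha_hat = (n * m1 - m2) / denom
    beta_hat = (n - m1) * (n - m2 / m1) / denom
    return alpha_hat, beta_hat
```

By construction, plugging the estimates back into the population moments (4) reproduces the sample moments exactly, which is a convenient sanity check.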
5 Categorical data
5.1 Logistic regression
5.1.1 Definition
Definition: A logistic regression model is given by a set of binary observations yi ∈ {0, 1} , i =
1, . . . , n, a set of predictors xj ∈ Rn , j = 1, . . . , p, a base b and the assumption that the log-odds are
a linear combination of the predictors:
li = xi β + εi , i = 1, . . . , n (1)
where l_i are the log-odds that y_i = 1:

\[ l_i = \log_b \frac{\Pr(y_i = 1)}{\Pr(y_i = 0)} \quad (2) \]
and xi is the i-th row of the n × p matrix
X = [x1 , . . . , xp ] . (3)
Within this model,
• y are called “categorical observations” or “dependent variable”;
• X is called “design matrix” or “set of independent variables”;
• β are called “regression coefficients” or “weights”;
• εi is called “noise” or “error term”;
• n is the number of observations;
• p is the number of predictors.
Sources:
• Wikipedia (2020): “Logistic regression”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
06-28; URL: https://en.wikipedia.org/wiki/Logistic_regression#Logistic_model.
li = xi β + εi , i = 1, . . . , n (1)
where xi are the predictors corresponding to the i-th observation yi and li are the log-odds that
yi = 1.
Then, the log-odds in favor of y_i = 1 against y_i = 0 can also be expressed as

\[ l_i = \log_b \frac{p(x_i | y_i = 1) \, p(y_i = 1)}{p(x_i | y_i = 0) \, p(y_i = 0)} \; . \]
Proof: Using Bayes’ theorem (→ I/5.3.1) and the law of marginal probability (→ I/1.3.3), the
posterior probabilities (→ I/5.1.7) for yi = 1 and yi = 0 are given by
\[ \begin{aligned}
l_i &= \log_b \frac{p(y_i = 1 | x_i)}{p(y_i = 0 | x_i)} \\
&= \log_b \frac{p(x_i | y_i = 1) \, p(y_i = 1)}{p(x_i | y_i = 0) \, p(y_i = 0)} \; .
\end{aligned} \quad (4) \]
Sources:
• Bishop, Christopher M. (2006): “Linear Models for Classification”; in: Pattern Recognition for
Machine Learning, ch. 4, p. 197, eq. 4.58; URL: http://users.isr.ist.utl.pt/~wurmd/Livros/school/
Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%
202006.pdf.
li = xi β + εi , i = 1, . . . , n (1)
where xi are the predictors corresponding to the i-th observation yi and li are the log-odds that
yi = 1.
Then, the probability that yi = 1 is given by
\[ \Pr(y_i = 1) = \frac{1}{1 + b^{-(x_i \beta + \varepsilon_i)}} \quad (2) \]
where b is the base used to form the log-odds li .
Proof: Abbreviating p_i = Pr(y_i = 1) and rearranging the log-odds model, we have:

\[ \begin{aligned}
\log_b \frac{p_i}{1 - p_i} &= x_i \beta + \varepsilon_i \\
\frac{p_i}{1 - p_i} &= b^{x_i \beta + \varepsilon_i} \\
p_i &= b^{x_i \beta + \varepsilon_i} (1 - p_i) \\
p_i \left( 1 + b^{x_i \beta + \varepsilon_i} \right) &= b^{x_i \beta + \varepsilon_i} \quad (4) \\
p_i &= \frac{b^{x_i \beta + \varepsilon_i}}{1 + b^{x_i \beta + \varepsilon_i}} \\
p_i &= \frac{b^{x_i \beta + \varepsilon_i}}{b^{x_i \beta + \varepsilon_i} \left( 1 + b^{-(x_i \beta + \varepsilon_i)} \right)} \\
p_i &= \frac{1}{1 + b^{-(x_i \beta + \varepsilon_i)}}
\end{aligned} \]
which proves the identity given by (2).
Sources:
• Wikipedia (2020): “Logistic regression”; in: Wikipedia, the free encyclopedia, retrieved on 2020-
03-03; URL: https://en.wikipedia.org/wiki/Logistic_regression#Logistic_model.
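The identity (2) is the familiar logistic (sigmoid) function, generalized to an arbitrary base b. A minimal sketch (assuming NumPy; the function name `prob_one` is illustrative):

```python
import numpy as np

def prob_one(x_beta, b=np.e):
    """Pr(y_i = 1) from the linear predictor x_beta (log-odds to base b), eq. (2)."""
    return 1.0 / (1.0 + b ** (-np.asarray(x_beta, dtype=float)))
```

Inverting the function recovers the log-odds, so `prob_one` and the log-odds transform form a round trip.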
Chapter IV
Model Selection
1 Goodness-of-fit measures
1.1 Residual variance
1.1.1 Definition
Definition: Let there be a linear regression model (→ III/1.5.1)
y = Xβ + ε, ε ∼ N (0, σ 2 V ) (1)
with measured data y, known design matrix X and covariance structure V as well as unknown
regression coefficients β and noise variance σ 2 .
Then, an estimate of the noise variance σ 2 is called the “residual variance” σ̂ 2 , e.g. obtained via
maximum likelihood estimation (→ I/4.1.3).
Then, 1) the maximum likelihood estimator (→ I/4.1.3) of σ² is

\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad (2) \]

where

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \quad (3) \]

2) and σ̂² is a biased estimator of σ²

\[ \mathrm{E}\!\left[ \hat{\sigma}^2 \right] \neq \sigma^2 \; , \quad (4) \]

more precisely:

\[ \mathrm{E}\!\left[ \hat{\sigma}^2 \right] = \frac{n-1}{n} \sigma^2 \; . \quad (5) \]
Proof:
1) This is equivalent to the maximum likelihood estimator for the univariate Gaussian with unknown
variance (→ III/1.1.2) and a special case of the maximum likelihood estimator for multiple linear
regression (→ III/1.5.26) in which y = x, X = 1n and β̂ = x̄:
\[ \begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n} (y - X\hat{\beta})^{\mathrm{T}} (y - X\hat{\beta}) \\
&= \frac{1}{n} (x - 1_n \bar{x})^{\mathrm{T}} (x - 1_n \bar{x}) \quad (6) \\
&= \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \; .
\end{aligned} \]
1. GOODNESS-OF-FIT MEASURES 557
2) The expectation (→ I/1.10.1) of the maximum likelihood estimator (→ I/4.1.3) can be developed
as follows:
" n #
2 1X
E σ̂ = E (xi − x̄) 2
n i=1
" n #
1 X
= E (xi − x̄) 2
n
" n
i=1
#
1 X
= E xi − 2xi x̄ + x̄
2 2
n
" n
i=1
#
1 X X n X n
= E xi − 2
2
xi x̄ + x̄ 2
n
" n
i=1 i=1
#i=1
(7)
1 X
= E xi − 2nx̄ + nx̄
2 2 2
n
" n
i=1
#
1 X
= E xi − nx̄
2 2
n i=1
!
1 X 2 2
n
= E xi − nE x̄
n i=1
1 X 2
n
= E xi − E x̄2
n i=1
2 1 X n
E σ̂ = Var(xi ) + E(xi )2 − Var(x̄) + E(x̄)2 . (10)
n i=1
From (1), it follows that
" #
1 Xn
1X
n
E [x̄] = E xi = E [xi ]
n i=1 n i=1
1X
n
(11) 1 (12)
= µ= ·n·µ
n i=1 n
=µ.
" #
1X 1 X
n n
Var [x̄] = Var xi = 2 Var [xi ]
n i=1 n i=1
1 X 2
n
(11) 1 (13)
= 2 σ = 2 · n · σ2
n i=1 n
1 2
= σ .
n
Plugging (11), (12) and (13) into (10), we have
\[ \begin{aligned}
\mathrm{E}\!\left[ \hat{\sigma}^2 \right] &= \frac{1}{n} \sum_{i=1}^{n} \left( \sigma^2 + \mu^2 \right) - \left( \frac{1}{n} \sigma^2 + \mu^2 \right) \\
&= \frac{1}{n} \cdot n \cdot \left( \sigma^2 + \mu^2 \right) - \frac{1}{n} \sigma^2 - \mu^2 \\
&= \sigma^2 + \mu^2 - \frac{1}{n} \sigma^2 - \mu^2 \quad (14) \\
&= \frac{n-1}{n} \sigma^2
\end{aligned} \]
which proves the bias given by (5).
Sources:
• Liang, Dawen (????): “Maximum Likelihood Estimator for Variance is Biased: Proof”, retrieved
on 2020-02-24; URL: https://dawenl.github.io/files/mle_biased.pdf.
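The bias (5) is easy to see in simulation. A minimal sketch (assuming NumPy; the sample size, mean and variance are arbitrary illustrative choices):

```python
import numpy as np

# Monte Carlo check of E[sigma_hat^2] = (n-1)/n * sigma^2 for the MLE of the variance
rng = np.random.default_rng(1)
n, mu, sigma2 = 5, 1.0, 4.0
x = rng.normal(mu, np.sqrt(sigma2), size=(100_000, n))     # 100,000 samples of size n
# biased MLE (dividing by n) computed per sample
sigma2_mle = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)
bias_factor = sigma2_mle.mean() / sigma2                    # close to (n-1)/n = 0.8
```

With many replications, `bias_factor` concentrates near (n − 1)/n rather than 1, which is exactly the systematic shrinkage derived in (14).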
y = Xβ + ε, ε ∼ N (0, σ 2 V ) . (1)
Then,
1) the maximum likelihood estimator (→ I/4.1.3) of σ 2 is
\[ \hat{\sigma}^2 = \frac{1}{n} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta}) \quad (2) \]

where

\[ \hat{\beta} = \left( X^{\mathrm{T}} V^{-1} X \right)^{-1} X^{\mathrm{T}} V^{-1} y \quad (3) \]
2) and σ̂² is a biased estimator of σ²

\[ \mathrm{E}\!\left[ \hat{\sigma}^2 \right] \neq \sigma^2 \; , \quad (4) \]

more precisely:

\[ \mathrm{E}\!\left[ \hat{\sigma}^2 \right] = \frac{n-p}{n} \sigma^2 \; . \quad (5) \]
Proof:
1) This follows from maximum likelihood estimation for multiple linear regression (→ III/1.5.26)
and is a special case (→ III/1.5.2) of maximum likelihood estimation for the general linear model
(→ III/2.1.4) in which Y = y, B = β and Σ = σ 2 :
\[ \begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n} (Y - X\hat{B})^{\mathrm{T}} V^{-1} (Y - X\hat{B}) \quad (6) \\
&= \frac{1}{n} (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta}) \; .
\end{aligned} \]
2) We know that the residual sum of squares, divided by the true noise variance, is following a
chi-squared distribution (→ III/1.5.19):
\[ \frac{\hat{\varepsilon}^{\mathrm{T}} \hat{\varepsilon}}{\sigma^2} \sim \chi^2(n-p) \qquad \text{where} \qquad \hat{\varepsilon}^{\mathrm{T}} \hat{\varepsilon} = (y - X\hat{\beta})^{\mathrm{T}} V^{-1} (y - X\hat{\beta}) \; . \quad (7) \]

With (2), this implies:

\[ \frac{n \hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-p) \; . \quad (8) \]
Using the relationship between chi-squared distribution and gamma distribution (→ II/3.7.2)
\[ X \sim \chi^2(k) \quad\Rightarrow\quad cX \sim \mathrm{Gam}\!\left( \frac{k}{2}, \frac{1}{2c} \right) \; , \quad (9) \]
we can deduce from (8) that
\[ \hat{\sigma}^2 = \frac{\sigma^2}{n} \cdot \frac{n \hat{\sigma}^2}{\sigma^2} \sim \mathrm{Gam}\!\left( \frac{n-p}{2}, \frac{n}{2\sigma^2} \right) \; . \quad (10) \]
Using the expected value of the gamma distribution (→ II/3.4.10)
\[ X \sim \mathrm{Gam}(a, b) \quad\Rightarrow\quad \mathrm{E}(X) = \frac{a}{b} \; , \quad (11) \]
we finally have:

\[ \mathrm{E}\!\left[ \hat{\sigma}^2 \right] = \frac{(n-p)/2}{n/(2\sigma^2)} = \frac{n-p}{n} \sigma^2 \quad (12) \]
which proves the relationship given by (5).
■
Sources:
• ocram (2022): “Why is RSS distributed chi square times n-p?”; in: StackExchange CrossValidated,
retrieved on 2022-12-21; URL: https://stats.stackexchange.com/a/20230.
\[ \hat{\sigma}^2_{\mathrm{unb}} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \; . \quad (2) \]
Proof: It can be shown that (→ IV/1.1.2) the maximum likelihood estimator (→ I/4.1.3) of σ 2
\[ \hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad (3) \]
is a biased estimator in the sense that
\[ \mathrm{E}\!\left[ \hat{\sigma}^2_{\mathrm{MLE}} \right] = \frac{n-1}{n} \sigma^2 \; . \quad (4) \]
From (4), it follows that
\[ \mathrm{E}\!\left[ \frac{n}{n-1} \hat{\sigma}^2_{\mathrm{MLE}} \right] = \frac{n}{n-1} \, \mathrm{E}\!\left[ \hat{\sigma}^2_{\mathrm{MLE}} \right] \overset{(4)}{=} \frac{n}{n-1} \cdot \frac{n-1}{n} \sigma^2 = \sigma^2 \; , \quad (5) \]
such that an unbiased estimator can be constructed as
\[ \hat{\sigma}^2_{\mathrm{unb}} = \frac{n}{n-1} \hat{\sigma}^2_{\mathrm{MLE}} \overset{(3)}{=} \frac{n}{n-1} \cdot \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \; . \quad (6) \]
Sources:
• Liang, Dawen (????): “Maximum Likelihood Estimator for Variance is Biased: Proof”, retrieved
on 2020-02-25; URL: https://dawenl.github.io/files/mle_biased.pdf.
1.2 R-squared
1.2.1 Definition
Definition: Let there be a linear regression model (→ III/1.5.1) with independent (→ I/1.3.6)
observations
\[ y = X\beta + \varepsilon, \qquad \varepsilon_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \quad (1) \]
with measured data y, known design matrix X as well as unknown regression coefficients β and noise
variance σ 2 .
Then, the proportion of the variance of the dependent variable y (“total variance (→ III/1.5.6)”)
that can be predicted from the independent variables X (“explained variance (→ III/1.5.7)”) is called
“coefficient of determination”, “R-squared” or R2 .
Sources:
• Wikipedia (2020): “Coefficient of determination”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-02-25; URL: https://en.wikipedia.org/wiki/Mean_squared_error#Proof_of_variance_
and_bias_relationship.
\[ R^2_{\mathrm{adj}} = 1 - \frac{\mathrm{RSS}/(n-p)}{\mathrm{TSS}/(n-1)} \quad (3) \]
where the residual (→ III/1.5.8) and total sum of squares (→ III/1.5.6) are
\[ \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \; , \quad \hat{y} = X\hat{\beta} \; ; \qquad \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \; , \quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \quad (4) \]
where X is the n × p design matrix and β̂ are the ordinary least squares (→ III/1.5.3) estimates.
Proof: The coefficient of determination (→ IV/1.2.1) R2 is defined as the proportion of the variance
explained by the independent variables, relative to the total variance in the data.
With the explained sum of squares (→ III/1.5.7) ESS, R² is given by

\[ R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} \; , \quad (6) \]
which is equal to

\[ R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} \; , \quad (7) \]
because (→ III/1.5.9) TSS = ESS + RSS.
Replacing RSS and TSS by the variance estimates RSS/(n − p) and TSS/(n − 1) gives the adjusted R² which adjusts R² for the number of explanatory variables.
Sources:
• Wikipedia (2019): “Coefficient of determination”; in: Wikipedia, the free encyclopedia, retrieved on
2019-12-06; URL: https://en.wikipedia.org/wiki/Coefficient_of_determination#Adjusted_R2.
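The quantities in (3) and (4) follow directly from an ordinary least squares fit. A minimal sketch (assuming NumPy; the function name `r_squared` and the toy data are illustrative, not from the source):

```python
import numpy as np

def r_squared(y, X, adjusted=False):
    """R-squared (and adjusted R-squared, eq. 3) for an OLS fit.

    X is the n x p design matrix and should contain an intercept column.
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS estimates
    rss = np.sum((y - X @ beta) ** 2)             # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
    if adjusted:
        n, p = X.shape
        return 1.0 - (rss / (n - p)) / (tss / (n - 1))
    return 1.0 - rss / tss

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])         # intercept plus one regressor
y = np.array([0.0, 1.0, 2.0, 4.0])
```

For this toy data set, RSS = 0.30 and TSS = 8.75, and the adjusted value is strictly smaller than the plain R², reflecting the penalty for the extra regressor.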
\[ R^2 = 1 - \left( \exp[\Delta\mathrm{MLL}] \right)^{-2/n} \quad (2) \]
where n is the number of observations and ∆MLL is the difference in maximum log-likelihood between
the model given by (1) and a linear regression model with only a constant regressor.
Proof: First, we express the maximum log-likelihood (→ I/4.1.5) (MLL) of a linear regression model
in terms of its residual sum of squares (→ III/1.5.8) (RSS). The model in (1) implies the following
log-likelihood function (→ I/4.1.2)
\[ \mathrm{LL}(\beta, \sigma^2) = \log p(y | \beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (y - X\beta)^{\mathrm{T}} (y - X\beta) \; , \quad (3) \]
such that maximum likelihood estimates are (→ III/1.5.26)
\[ \hat{\beta} = (X^{\mathrm{T}} X)^{-1} X^{\mathrm{T}} y \quad (4) \]

\[ \hat{\sigma}^2 = \frac{1}{n} (y - X\hat{\beta})^{\mathrm{T}} (y - X\hat{\beta}) \quad (5) \]
and the residual sum of squares (→ III/1.5.8) is
\[ \mathrm{RSS} = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}^{\mathrm{T}} \hat{\varepsilon} = (y - X\hat{\beta})^{\mathrm{T}} (y - X\hat{\beta}) = n \cdot \hat{\sigma}^2 \; . \quad (6) \]
Since β̂ and σ̂ 2 are maximum likelihood estimates (→ I/4.1.3), plugging them into the log-likelihood
function gives the maximum log-likelihood:
\[ \mathrm{MLL} = \mathrm{LL}(\hat{\beta}, \hat{\sigma}^2) = -\frac{n}{2} \log(2\pi\hat{\sigma}^2) - \frac{1}{2\hat{\sigma}^2} (y - X\hat{\beta})^{\mathrm{T}} (y - X\hat{\beta}) \; . \quad (7) \]
With (6) for the first σ̂² and (5) for the second σ̂², the MLL becomes

\[ \mathrm{MLL} = -\frac{n}{2} \log(\mathrm{RSS}) - \frac{n}{2} \log\left( \frac{2\pi}{n} \right) - \frac{n}{2} \; . \quad (8) \]
Second, we establish the relationship between maximum log-likelihood (MLL) and coefficient of
determination (R²). Consider the two models
\[ m_0: \; X_0 = 1_n \qquad\qquad m_1: \; X_1 = X \quad (9) \]
For m1 , the residual sum of squares is given by (6); and for m0 , the residual sum of squares is equal
to the total sum of squares (→ III/1.5.6):
\[ \mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \; . \quad (10) \]
Thus, with (8), the difference in maximum log-likelihood between m₁ and m₀ is ΔMLL = MLL(m₁) − MLL(m₀) = −(n/2) log(RSS) + (n/2) log(TSS), such that

\[ \begin{aligned}
\exp[\Delta\mathrm{MLL}] &= \exp\left[ -\frac{n}{2} \log(\mathrm{RSS}) + \frac{n}{2} \log(\mathrm{TSS}) \right] \\
&= \left( \exp\left[ \log(\mathrm{RSS}) - \log(\mathrm{TSS}) \right] \right)^{-n/2} \\
&= \left( \frac{\exp[\log(\mathrm{RSS})]}{\exp[\log(\mathrm{TSS})]} \right)^{-n/2} \quad (12) \\
&= \left( \frac{\mathrm{RSS}}{\mathrm{TSS}} \right)^{-n/2} \; .
\end{aligned} \]
Taking both sides to the power of −2/n and subtracting from 1, we have
\[ \begin{aligned}
\left( \exp[\Delta\mathrm{MLL}] \right)^{-2/n} &= \frac{\mathrm{RSS}}{\mathrm{TSS}} \\
1 - \left( \exp[\Delta\mathrm{MLL}] \right)^{-2/n} &= 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = R^2
\end{aligned} \quad (13) \]
which proves the identity given above.
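The identity can be confirmed numerically on any data set by computing the MLL of both models via (8). A minimal sketch (assuming NumPy; the toy data are arbitrary illustrative choices):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 4.0])
n = y.size

def mll(rss):
    # eq. (8): MLL = -n/2 log(RSS) - n/2 log(2*pi/n) - n/2
    return -n / 2 * np.log(rss) - n / 2 * np.log(2 * np.pi / n) - n / 2

X1 = np.column_stack([np.ones(n), x])          # m1: full design matrix
b1 = np.linalg.lstsq(X1, y, rcond=None)[0]
rss = np.sum((y - X1 @ b1) ** 2)               # RSS of m1
tss = np.sum((y - y.mean()) ** 2)              # RSS of m0 (constant regressor only)

delta_mll = mll(rss) - mll(tss)                # difference in maximum log-likelihood
r2_from_mll = 1 - np.exp(delta_mll) ** (-2 / n)
```

The value obtained from ΔMLL agrees with the direct computation 1 − RSS/TSS, as (13) requires.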
\[ \mathrm{SNR} = \frac{\mathrm{Var}(X\hat{\beta})}{\hat{\sigma}^2} \; . \quad (2) \]
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 6; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
\[ \hat{\beta} = (X^{\mathrm{T}} X)^{-1} X^{\mathrm{T}} y \; . \quad (2) \]
Then, the signal-to-noise ratio (→ IV/1.3.1) can be expressed in terms of the coefficient of determination (→ IV/1.2.1)

\[ \mathrm{SNR} = \frac{R^2}{1 - R^2} \quad (3) \]
and vice versa

\[ R^2 = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}} \; , \quad (4) \]
if the predicted signal mean is equal to the actual signal mean.
Note that it is irrelevant whether we use the biased estimator of the variance (→ IV/1.1.2) (dividing by n) or the unbiased estimator of the variance (→ IV/1.1.4) (dividing by n−1), because the relevant terms cancel out.
If the predicted signal mean is equal to the actual signal mean – which is the case when variable regressors in X have mean zero, such that they are orthogonal to a constant regressor in X –, this means that \( \bar{\hat{y}} = \bar{y} \), such that

\[ \mathrm{SNR} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \; . \quad (7) \]
Then, the SNR can be written in terms of the explained (→ III/1.5.7), residual (→ III/1.5.8) and
total sum of squares (→ III/1.5.6):
\[ \mathrm{SNR} = \frac{\mathrm{ESS}}{\mathrm{RSS}} = \frac{\mathrm{ESS}/\mathrm{TSS}}{\mathrm{RSS}/\mathrm{TSS}} \; . \quad (8) \]
With the derivation of the coefficient of determination (→ IV/1.2.2), this becomes
\[ \mathrm{SNR} = \frac{R^2}{1 - R^2} \; . \quad (9) \]
Rearranging this equation for the coefficient of determination (→ IV/1.2.1), we have
\[ R^2 = \frac{\mathrm{SNR}}{1 + \mathrm{SNR}} \; . \quad (10) \]
■
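The equivalence (9) is easy to verify on a concrete OLS fit with an intercept, where the predicted mean equals the actual mean. A minimal sketch (assuming NumPy; the toy data are illustrative):

```python
import numpy as np

# check SNR = Var(X beta_hat) / sigma_hat^2 = R^2 / (1 - R^2) for an OLS fit
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 4.0])
X = np.column_stack([np.ones_like(x), x])       # includes a constant regressor
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta

# biased variances (dividing by n); the 1/n factors cancel in the ratio
snr = np.var(y_hat) / np.mean((y - y_hat) ** 2)
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because the fit contains an intercept, `np.var(y_hat)` equals ESS/n and the ratio reduces to ESS/RSS, matching (8).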
Sources:
• Akaike H (1974): “A New Look at the Statistical Model Identification”; in: IEEE Transactions on
Automatic Control, vol. AC-19, no. 6, pp. 716-723; URL: https://ieeexplore.ieee.org/document/
1100705; DOI: 10.1109/TAC.1974.1100705.
Then, the corrected Akaike information criterion (→ IV/2.1.2) (AICc ) of this model is defined as
\[ \mathrm{AIC}_{\mathrm{c}}(m) = \mathrm{AIC}(m) + \frac{2k^2 + 2k}{n - k - 1} \quad (2) \]
where AIC(m) is the Akaike information criterion (→ IV/2.1.1) and k is the number of free param-
eters estimated via (1).
Sources:
• Hurvich CM, Tsai CL (1989): “Regression and time series model selection in small samples”; in:
Biometrika, vol. 76, no. 2, pp. 297-307; URL: https://academic.oup.com/biomet/article-abstract/
76/2/297/265326; DOI: 10.1093/biomet/76.2.297.
\[ \mathrm{AIC}_{\mathrm{c}}(m) = \mathrm{AIC}(m) + \frac{2k^2 + 2k}{n - k - 1} \; . \quad (2) \]
Note that the number of free model parameters k is finite. Thus, we have:
\[ \begin{aligned}
\lim_{n \to \infty} \mathrm{AIC}_{\mathrm{c}}(m) &= \lim_{n \to \infty} \left[ \mathrm{AIC}(m) + \frac{2k^2 + 2k}{n - k - 1} \right] \\
&= \lim_{n \to \infty} \mathrm{AIC}(m) + \lim_{n \to \infty} \frac{2k^2 + 2k}{n - k - 1} \quad (3) \\
&= \mathrm{AIC}(m) + 0 \\
&= \mathrm{AIC}(m) \; .
\end{aligned} \]
Sources:
• Wikipedia (2022): “Akaike information criterion”; in: Wikipedia, the free encyclopedia, retrieved on
2022-03-18; URL: https://en.wikipedia.org/wiki/Akaike_information_criterion#Modification_for_
small_sample_size.
\[ \mathrm{AIC}_{\mathrm{c}}(m) = \mathrm{AIC}(m) + \frac{2k^2 + 2k}{n - k - 1} \; . \quad (3) \]
Plugging (2) into (3), we obtain:
\[ \begin{aligned}
\mathrm{AIC}_{\mathrm{c}}(m) &= -2 \log p(y | \hat{\theta}, m) + 2k + \frac{2k^2 + 2k}{n - k - 1} \\
&= -2 \log p(y | \hat{\theta}, m) + \frac{2k(n - k - 1)}{n - k - 1} + \frac{2k^2 + 2k}{n - k - 1} \\
&= -2 \log p(y | \hat{\theta}, m) + \frac{2nk - 2k^2 - 2k}{n - k - 1} + \frac{2k^2 + 2k}{n - k - 1} \quad (4) \\
&= -2 \log p(y | \hat{\theta}, m) + \frac{2nk}{n - k - 1} \; .
\end{aligned} \]
Sources:
• Wikipedia (2022): “Akaike information criterion”; in: Wikipedia, the free encyclopedia, retrieved on
2022-03-11; URL: https://en.wikipedia.org/wiki/Akaike_information_criterion#Modification_for_
small_sample_size.
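The compact form in (4) gives a one-line implementation. A minimal sketch in plain Python (the helper names `aic` and `aicc` are illustrative, not from the source):

```python
def aic(ll, k):
    """Akaike information criterion from maximum log-likelihood ll and k parameters."""
    return -2.0 * ll + 2.0 * k

def aicc(ll, k, n):
    """Corrected AIC in the compact form of eq. (4): -2 log L + 2nk/(n-k-1)."""
    return -2.0 * ll + 2.0 * n * k / (n - k - 1)
```

The two forms agree with (2)/(3), and the correction vanishes as n grows, consistent with the limit derivation above.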
Sources:
• Schwarz G (1978): “Estimating the Dimension of a Model”; in: The Annals of Statistics, vol. 6,
no. 2, pp. 461-464; URL: https://www.jstor.org/stable/2958889.
2.2.2 Derivation
Theorem: Let p(y|θ, m) be the likelihood function (→ I/5.1.2) of a generative model (→ I/5.1.1)
m ∈ M with model parameters θ ∈ Θ describing measured data y ∈ Rn . Let p(θ|m) be a prior
distribution (→ I/5.1.3) on the model parameters. Assume that likelihood function and prior density
are twice differentiable.
Then, as the number of data points goes to infinity, an approximation to the log-marginal likelihood (→ I/5.1.11) log p(y|m), up to constant terms not depending on the model, is given by the Bayesian information criterion (→ IV/2.2.1) (BIC) as

\[ \mathrm{BIC}(m) = -2 \log p(y | \hat{\theta}, m) + p \log n \]

where θ̂ is the maximum likelihood estimate and p is the number of model parameters.
\[ g(\theta) = p(\theta | m) \; , \qquad h(\theta) = \frac{1}{n} \mathrm{LL}(\theta) \; . \quad (3) \]
Then, the marginal likelihood (→ I/5.1.11) can be written as

\[ p(y|m) = \int_{\Theta} p(y | \theta, m) \, p(\theta | m) \, \mathrm{d}\theta = \int_{\Theta} \exp\left[ n \, h(\theta) \right] g(\theta) \, \mathrm{d}\theta \; . \quad (4) \]
\[ -2 \log p(y|m) \approx -2 \, \mathrm{LL}(\hat{\theta}) + p \log n - p \log(2\pi) - 2 \log p(\hat{\theta} | m) + \log |J(\hat{\theta})| \; . \quad (8) \]
As n → ∞, the last three terms are O_p(1) and can therefore be ignored when comparing between models M = {m₁, …, m_M} and using p(y|m_j) to compute posterior model probabilities (→ IV/3.4.1) p(m_j|y). With that, the BIC is given as

\[ \mathrm{BIC}(m) = -2 \log p(y | \hat{\theta}, m) + p \log n \; . \]
Sources:
• Claeskens G, Hjort NL (2008): “The Bayesian information criterion”; in: Model Selection and Model
Averaging, ch. 3.2, pp. 78-81; URL: https://www.cambridge.org/core/books/model-selection-and-model-av
E6F1EC77279D1223423BB64FC3A12C37; DOI: 10.1017/CBO9780511790485.
Sources:
• Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A (2002): “Bayesian measures of model com-
plexity and fit”; in: Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol.
64, iss. 4, pp. 583-639; URL: https://rss.onlinelibrary.wiley.com/doi/10.1111/1467-9868.00353;
DOI: 10.1111/1467-9868.00353.
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eqs. 10-12; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
2.3.2 Deviance
Definition: Let there be a generative model (→ I/5.1.1) m describing measured data y using model parameters θ. Then, the deviance of m is a function of θ which multiplies the log-likelihood function (→ I/4.1.2) with −2:

\[ D(\theta) = -2 \log p(y | \theta, m) \; . \]
Sources:
• Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A (2002): “Bayesian measures of model com-
plexity and fit”; in: Journal of the Royal Statistical Society, Series B: Statistical Methodology, vol.
64, iss. 4, pp. 583-639; URL: https://rss.onlinelibrary.wiley.com/doi/10.1111/1467-9868.00353;
DOI: 10.1111/1467-9868.00353.
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eqs. 10-12; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
• Wikipedia (2022): “Deviance information criterion”; in: Wikipedia, the free encyclopedia, retrieved
on 2022-03-01; URL: https://en.wikipedia.org/wiki/Deviance_information_criterion#Definition.
3. BAYESIAN MODEL SELECTION 571
Sources:
• Penny WD (2012): “Comparing Dynamic Causal Models using AIC, BIC and Free Energy”; in:
NeuroImage, vol. 59, iss. 2, pp. 319-330, eq. 15; URL: https://www.sciencedirect.com/science/
article/pii/S1053811911008160; DOI: 10.1016/j.neuroimage.2011.07.039.
3.1.2 Derivation
Theorem: Let p(y|θ, m) be a likelihood function (→ I/5.1.2) of a generative model (→ I/5.1.1) m
for making inferences on model parameters θ given measured data y. Moreover, let p(θ|m) be a prior
distribution (→ I/5.1.3) on model parameters θ in the parameter space Θ. Then, the model evidence
(→ IV/3.1.1) (ME) can be expressed in terms of likelihood (→ I/5.1.2) and prior (→ I/5.1.3) as
\[ \mathrm{ME}(m) = \int_{\Theta} p(y | \theta, m) \, p(\theta | m) \, \mathrm{d}\theta \; . \quad (1) \]
Proof: This a consequence of the law of marginal probability (→ I/1.3.3) for continuous variables
(→ I/1.2.6)
\[ p(y|m) = \int_{\Theta} p(y, \theta | m) \, \mathrm{d}\theta \quad (2) \]

in combination with the law of conditional probability (→ I/1.3.4), p(y, θ|m) = p(y|θ, m) p(θ|m).
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 13; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Proof: This a consequence of the law of marginal probability (→ I/1.3.3) for continuous variables
(→ I/1.2.6)
\[ p(y|m) = \int_{\Theta} p(y, \theta | m) \, \mathrm{d}\theta \quad (3) \]
Proof: For a full probability model (→ I/5.1.4), Bayes’ theorem (→ I/5.3.1) makes a statement
about the posterior distribution (→ I/5.1.7):
\[ p(\theta | y, m) = \frac{p(y | \theta, m) \, p(\theta | m)}{p(y | m)} \; . \quad (3) \]
Rearranging for p(y|m) and logarithmizing, we have:
\[ \begin{aligned}
\mathrm{LME}(m) = \log p(y|m) &= \log \frac{p(y | \theta, m) \, p(\theta | m)}{p(\theta | y, m)} \quad (4) \\
&= \log p(y | \theta, m) + \log p(\theta | m) - \log p(\theta | y, m) \; .
\end{aligned} \]
Proof: We consider Bayesian inference on data y using model (→ I/5.1.1) m with parameters θ.
Then, Bayes’ theorem (→ I/5.3.1) makes a statement about the posterior distribution (→ I/5.1.7),
i.e. the probability of parameters, given the data and the model:
p(y|θ, m) p(θ|m)
p(θ|y, m) = . (4)
p(y|m)
Rearranging this for the model evidence (→ IV/3.1.5), we have:
\[ p(y | m) = \frac{p(y | \theta, m) \, p(\theta | m)}{p(\theta | y, m)} \; . \quad (5) \]
Logarithmizing both sides of the equation, we obtain:

\[ \log p(y | m) = \log p(y | \theta, m) - \log \frac{p(\theta | y, m)}{p(\theta | m)} \; . \quad (6) \]
Now taking the expectation over the posterior distribution yields:
\[ \log p(y | m) = \int p(\theta | y, m) \log p(y | \theta, m) \, \mathrm{d}\theta - \int p(\theta | y, m) \log \frac{p(\theta | y, m)}{p(\theta | m)} \, \mathrm{d}\theta \; . \quad (7) \]
By definition, the left-hand side is the log model evidence and the terms on the right-hand side correspond to the posterior expectation of the log-likelihood function and the Kullback-Leibler divergence of posterior from prior, i.e.

\[ \mathrm{LME}(m) = \left\langle \log p(y | \theta, m) \right\rangle_{p(\theta | y, m)} - \mathrm{KL}\!\left[ p(\theta | y, m) \, \| \, p(\theta | m) \right] \; . \]
Sources:
• Beal & Ghahramani (2003): “The variational Bayesian EM algorithm for incomplete data: with
application to scoring graphical model structures”; in: Bayesian Statistics, vol. 7; URL: https:
//mlg.eng.cam.ac.uk/zoubin/papers/valencia02.pdf.
• Penny et al. (2007): “Bayesian Comparison of Spatially Regularised General Linear Models”; in:
Human Brain Mapping, vol. 28, pp. 275–293; URL: https://onlinelibrary.wiley.com/doi/full/10.
1002/hbm.20327; DOI: 10.1002/hbm.20327.
• Soch et al. (2016): “How to avoid mismodelling in GLM-based fMRI data analysis: cross-validated
Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469–489; URL: https://www.sciencedirect.
com/science/article/pii/S1053811916303615; DOI: 10.1016/j.neuroimage.2016.07.047.
\[ \mathrm{LME}^{*}(m_i) = \log p(y | m_i) - \frac{1}{M} \sum_{j=1}^{M} \log p(y | m_j) \; . \quad (1) \]
To prove the theorem, we will now rewrite the right-hand side until we arrive at an expression for
the normalized model evidence (→ IV/3.1.3). First, applying c logb a = logb ac , we obtain
\[ \mathrm{LME}^{*}(m_i) = \log p(y | m_i) - \log \prod_{j=1}^{M} p(y | m_j)^{1/M} \; . \quad (2) \]
Finally, the right-hand side is equal to the logarithm of the ratio of m_i's model evidence to the geometric mean of all model evidences.
Sources:
• Penny, Will (2015): “Bayesian model selection for group studies using Gibbs sampling”; in: SPM12,
retrieved on 2023-09-08; URL: https://github.com/spm/spm12/blob/master/spm_BMS_gibbs.
m.
• Soch, Joram (2018): “Random Effects Bayesian Model Selection using Variational Bayes”; in:
MACS – a new SPM toolbox for model assessment, comparison and selection, retrieved on 2023-09-
08; URL: https://github.com/JoramSoch/MACS/blob/master/ME_BMS_RFX_VB.m; DOI: 10.5281/ze
odo.845404.
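Numerically, the normalization in (1) is just a mean subtraction, which leaves log Bayes factors between models unchanged. A minimal sketch (assuming NumPy; the function name `normalized_lme` and the example values are illustrative):

```python
import numpy as np

def normalized_lme(lme):
    """Normalized log model evidences, eq. (1): subtract the mean LME."""
    lme = np.asarray(lme, dtype=float)
    return lme - lme.mean()
```

Exponentiating a normalized LME gives the ratio of that model's evidence to the geometric mean of all model evidences, as noted above.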
Sources:
• Wikipedia (2020): “Lindley’s paradox”; in: Wikipedia, the free encyclopedia, retrieved on 2020-11-
25; URL: https://en.wikipedia.org/wiki/Lindley%27s_paradox#Bayesian_approach.
\[ \mathrm{cvLME}(m) = \sum_{i=1}^{S} \log \int p(y_i | \theta, m) \, p(\theta | y_{\neg i}, m) \, \mathrm{d}\theta \quad (1) \]

where \( y_{\neg i} = \bigcup_{j \neq i} y_j \) is the union of all data subsets except y_i and p(θ|y_¬i, m) is the posterior distribution (→ I/5.1.7) obtained from y_¬i when using the prior distribution (→ I/5.1.3) p_ni(θ|m):
Sources:
• Soch J, Allefeld C, Haynes JD (2016): “How to avoid mismodelling in GLM-based fMRI data anal-
ysis: cross-validated Bayesian model selection”; in: NeuroImage, vol. 141, pp. 469-489, eqs. 13-15;
URL: https://www.sciencedirect.com/science/article/pii/S1053811916303615; DOI: 10.1016/j.neuroimage.
• Soch J, Meyer AP, Allefeld C, Haynes JD (2017): “How to improve parameter estimates in GLM-
based fMRI data analysis: cross-validated Bayesian model averaging”; in: NeuroImage, vol. 158,
pp. 186-195, eq. 6; URL: https://www.sciencedirect.com/science/article/pii/S105381191730527X;
DOI: 10.1016/j.neuroimage.2017.06.056.
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eqs. 14-15; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
• Soch J (2018): “cvBMS and cvBMA: filling in the gaps”; in: arXiv stat.ME, 1807.01585, eq. 1;
URL: https://arxiv.org/abs/1807.01585.
and (→ I/5.2.7)
Sources:
• Wikipedia (2020): “Empirical Bayes method”; in: Wikipedia, the free encyclopedia, retrieved on
2020-11-25; URL: https://en.wikipedia.org/wiki/Empirical_Bayes_method#Introduction.
• Penny, W.D. and Ridgway, G.R. (2013): “Efficient Posterior Probability Mapping Using Savage-
Dickey Ratios”; in: PLoS ONE, vol. 8, iss. 3, art. e59655, eqs. 7/11; URL: https://journals.plos.
org/plosone/article?id=10.1371/journal.pone.0059655; DOI: 10.1371/journal.pone.0059655.
and

\[ \mathrm{KL}\!\left[ q(\theta) \, \| \, p(\theta | m) \right] = \int q(\theta) \log \frac{q(\theta)}{p(\theta | m)} \, \mathrm{d}\theta \; . \quad (3) \]
Sources:
• Wikipedia (2020): “Variational Bayesian methods”; in: Wikipedia, the free encyclopedia, retrieved
on 2020-11-25; URL: https://en.wikipedia.org/wiki/Variational_Bayesian_methods#Evidence_
lower_bound.
• Penny W, Flandin G, Trujillo-Barreto N (2007): “Bayesian Comparison of Spatially Regularised
General Linear Models”; in: Human Brain Mapping, vol. 28, pp. 275–293, eqs. 2-9; URL: https:
//onlinelibrary.wiley.com/doi/full/10.1002/hbm.20327; DOI: 10.1002/hbm.20327.
f ⇔ m1 ∨ . . . ∨ mM . (1)
Then, the family evidence (FE) of f is defined as the marginal probability (→ I/1.3.3) relative to the model evidences (→ IV/3.1.1) p(y|m_i), conditional only on f:

\[ \mathrm{FE}(f) = p(y | f) \; . \]
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 16; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
3.2.2 Derivation
Theorem: Let f be a family of M generative models (→ I/5.1.1) m1 , . . . , mM with model evidences
(→ IV/3.1.1) p(y|m1 ), . . . , p(y|mM ). Then, the family evidence (→ IV/3.2.1) can be expressed in
terms of the model evidences as
\[ \mathrm{FE}(f) = \sum_{i=1}^{M} p(y | m_i) \, p(m_i | f) \quad (1) \]
where p(mi |f ) are the within-family (→ IV/3.2.3) prior (→ I/5.1.3) model (→ I/5.1.1) probabilities
(→ I/1.3.1).
Proof: This a consequence of the law of marginal probability (→ I/1.3.3) for discrete variables (→
I/1.2.6)
\[ p(y | f) = \sum_{i=1}^{M} p(y, m_i | f) \quad (2) \]
in combination with the law of conditional probability, p(y, m_i|f) = p(y|m_i) p(m_i|f), such that

\[ \mathrm{FE}(f) = p(y | f) = \sum_{i=1}^{M} p(y | m_i) \, p(m_i | f) \; . \quad (5) \]
f ⇔ m1 ∨ . . . ∨ mM . (1)
Then, the log family evidence is given by the logarithm of the family evidence (→ IV/3.2.1):
\[ \mathrm{LFE}(f) = \log p(y | f) = \log \sum_{i=1}^{M} p(y | m_i) \, p(m_i | f) \; . \quad (2) \]
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 16; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
\[ \mathrm{LFE}(f) = \log \sum_{i=1}^{M} p(y | m_i) \, p(m_i | f) \quad (2) \]

where p(m_i|f) are the within-family (→ IV/3.2.3) prior (→ I/5.1.3) model (→ I/5.1.1) probabilities (→ I/1.3.1).
\[ p(f) = \sum_{i=1}^{M} p(m_i) \quad (3) \]

\[ p(f | y) = \sum_{i=1}^{M} p(m_i | y) \quad (4) \]
\[ \begin{aligned}
p(y | f) &= \frac{\sum_{i=1}^{M} p(y | m_i) \, p(m_i)}{\sum_{i=1}^{M} p(m_i)} \\
&= \sum_{i=1}^{M} p(y | m_i) \cdot \frac{p(m_i)}{\sum_{i=1}^{M} p(m_i)} \\
&= \sum_{i=1}^{M} p(y | m_i) \cdot \frac{p(m_i, f)}{p(f)} \quad (9) \\
&= \sum_{i=1}^{M} p(y | m_i) \cdot p(m_i | f) \; .
\end{aligned} \]
where p(mi |fj ) are within-family (→ IV/3.2.3) prior (→ I/5.1.3) model (→ I/5.1.1) probabilities (→
I/1.3.1).
Proof: Let us consider the (unlogarithmized) family evidence p(y|fj ). According to the law of
marginal probability (→ I/1.3.3), this conditional probability is given by
\[ p(y | f_j) = \sum_{m_i \in f_j} p(y | m_i, f_j) \cdot p(m_i | f_j) \; . \quad (2) \]
Because model families are mutually exclusive, it holds that p(y|mi , fj ) = p(y|mi ), such that
\[ p(y | f_j) = \sum_{m_i \in f_j} p(y | m_i) \cdot p(m_i | f_j) \; . \quad (3) \]
Logarithmizing transforms the family evidence p(y|fj ) into the log family evidence LFE(fj ):
\[ \mathrm{LFE}(f_j) = \log \sum_{m_i \in f_j} p(y | m_i) \cdot p(m_i | f_j) \; . \quad (4) \]
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 16; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
\[ \mathrm{LFE}(f_j) \approx L^{*}(f_j) + \log \sum_{m_i \in f_j} \exp[L'(m_i)] \cdot p(m_i | f_j) \quad (1) \]

where L*(f_j) is the maximum log model evidence in family f_j, L′(m_i) is the difference of each log model evidence to the family's maximum and p(m_i|f_j) are within-family prior model probabilities (→ IV/3.2.2).
2) Under the condition that prior model probabilities are equal within model families, the approximation simplifies to

\[ \mathrm{LFE}(f_j) = L^{*}(f_j) + \log \sum_{m_i \in f_j} \exp[L'(m_i)] - \log M_j \quad (2) \]
Proof: The log family evidence is given in terms of log model evidences (→ IV/3.2.5) as
\[ \mathrm{LFE}(f_j) = \log \sum_{m_i \in f_j} \left[ \exp[\mathrm{LME}(m_i)] \cdot p(m_i | f_j) \right] \; . \quad (3) \]
Often, especially for complex models or many observations, log model evidences (→ IV/3.1.3) are
highly negative, such that calculation of the term exp[LME(mi )] in modern computers will give model
evidences (→ IV/3.1.1) as zero, making calculation of LFEs impossible.
1) As a solution, we select the maximum LME within each family, L*(f_j) = max_{m_i ∈ f_j} LME(m_i), and define the differences to this maximum, L′(m_i) = LME(m_i) − L*(f_j), such that LME(m_i) = L′(m_i) + L*(f_j) — referred to as relation (5) below. Should exp[L′(m_i)] still evaluate to zero for some model, that model is also much less evident than the family's best model in this case – making the approximation acceptable.
Using the relation (5), equation (3) can be reworked into
\[ \begin{aligned}
\mathrm{LFE}(f_j) &= \log \sum_{m_i \in f_j} \exp[L'(m_i) + L^{*}(f_j)] \cdot p(m_i | f_j) \\
&= \log \sum_{m_i \in f_j} \exp[L'(m_i)] \cdot \exp[L^{*}(f_j)] \cdot p(m_i | f_j) \\
&= \log \left( \exp[L^{*}(f_j)] \sum_{m_i \in f_j} \exp[L'(m_i)] \cdot p(m_i | f_j) \right) \quad (6) \\
&= L^{*}(f_j) + \log \sum_{m_i \in f_j} \exp[L'(m_i)] \cdot p(m_i | f_j) \; .
\end{aligned} \]
2) With equal within-family prior model probabilities p(m_i|f_j) = 1/M_j, this becomes:

\[ \begin{aligned}
\mathrm{LFE}(f_j) &= L^{*}(f_j) + \log \sum_{m_i \in f_j} \exp[L'(m_i)] \cdot \frac{1}{M_j} \\
&= L^{*}(f_j) + \log \left( \frac{1}{M_j} \sum_{m_i \in f_j} \exp[L'(m_i)] \right) \quad (8) \\
&= L^{*}(f_j) + \log \sum_{m_i \in f_j} \exp[L'(m_i)] - \log M_j \; .
\end{aligned} \]
Sources:
• Soch J (2018): “cvBMS and cvBMA: filling in the gaps”; in: arXiv stat.ME, 1807.01585, sect. 2.3,
eq. 32; URL: https://arxiv.org/abs/1807.01585.
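This is the standard log-sum-exp trick. A minimal sketch (assuming NumPy; the function name `log_family_evidence` is illustrative, not from the source):

```python
import numpy as np

def log_family_evidence(lme, prior=None):
    """Numerically stable LFE from log model evidences, eqs. (2)/(6).

    lme   : log model evidences of the models within the family
    prior : within-family prior model probabilities (default: uniform)
    """
    lme = np.asarray(lme, dtype=float)
    if prior is None:
        prior = np.full(lme.shape, 1.0 / lme.size)
    L_star = lme.max()                      # the family's best LME, L*(f_j)
    L_prime = lme - L_star                  # differences L'(m_i), at most 0
    return L_star + np.log(np.sum(np.exp(L_prime) * prior))
```

Even for strongly negative LMEs, where exp[LME] underflows to zero in double precision, the stable form returns a finite value that agrees with the direct formula whenever the latter is computable.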
\[ \mathrm{BF}_{12} = \frac{p(y | m_1)}{p(y | m_2)} \; . \quad (1) \]
Note that by Bayes’ theorem (→ I/5.3.1), the ratio of posterior model probabilities (→ IV/3.4.1)
(i.e., the posterior model odds) can be written as
\[ \frac{p(m_1 | y)}{p(m_2 | y)} = \frac{p(m_1)}{p(m_2)} \cdot \mathrm{BF}_{12} \; . \quad (3) \]
In other words, the Bayes factor can be viewed as the factor by which the prior model odds are
updated (after observing data y) to posterior model odds – which is also expressed by Bayes’ rule
(→ I/5.3.2).
Sources:
• Kass, Robert E. and Raftery, Adrian E. (1995): “Bayes Factors”; in: Journal of the American
Statistical Association, vol. 90, no. 430, pp. 773-795; URL: https://dx.doi.org/10.1080/01621459.
1995.10476572; DOI: 10.1080/01621459.1995.10476572.
3.3.2 Transitivity
Theorem: Consider three competing models (→ I/5.1.1) m₁, m₂, and m₃ for observed data y. Then the Bayes factor (→ IV/3.3.1) for m₁ over m₃ can be written as:

\[ \mathrm{BF}_{13} = \mathrm{BF}_{12} \cdot \mathrm{BF}_{23} \; . \quad (1) \]
Proof: By definition (→ IV/3.3.1), the Bayes factor BF13 is the ratio of marginal likelihoods of data y over m1 and m3, respectively. That is,

\[ \mathrm{BF}_{13} = \frac{p(y \mid m_1)}{p(y \mid m_3)} \; . \qquad (2) \]
We can equivalently write

\[ \begin{split} \mathrm{BF}_{13} &\overset{(2)}{=} \frac{p(y \mid m_1)}{p(y \mid m_3)} \\ &= \frac{p(y \mid m_1)}{p(y \mid m_3)} \cdot \frac{p(y \mid m_2)}{p(y \mid m_2)} \\ &= \frac{p(y \mid m_1)}{p(y \mid m_2)} \cdot \frac{p(y \mid m_2)}{p(y \mid m_3)} \\ &\overset{(2)}{=} \mathrm{BF}_{12} \cdot \mathrm{BF}_{23} \; . \end{split} \qquad (3) \]

■
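The transitivity in equation (3) can be checked numerically; the following minimal Python sketch uses arbitrary illustrative marginal likelihood values:

```python
# Illustrative marginal likelihoods p(y | m1), p(y | m2), p(y | m3).
py_m1, py_m2, py_m3 = 0.12, 0.03, 0.004

# Bayes factors as ratios of marginal likelihoods, cf. eq. (2).
BF12 = py_m1 / py_m2
BF23 = py_m2 / py_m3
BF13 = py_m1 / py_m3

# Transitivity, cf. eq. (3): BF13 = BF12 * BF23.
assert abs(BF13 - BF12 * BF23) < 1e-9
```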
584 CHAPTER IV. MODEL SELECTION
\[ \mathrm{BF}_{01} = \frac{p(\delta = \delta_0 \mid y, m_1)}{p(\delta = \delta_0 \mid m_1)} \; . \qquad (1) \]
Proof: By definition (→ IV/3.3.1), the Bayes factor BF01 is the ratio of marginal likelihoods of data y over m0 and m1, respectively. That is,

\[ \mathrm{BF}_{01} = \frac{p(y \mid m_0)}{p(y \mid m_1)} \; . \qquad (2) \]
The key idea in the proof is that we can use a “change of variables” technique to express BF01
entirely in terms of the “encompassing” model m1 . This proceeds by first unpacking the marginal
likelihood (→ I/5.1.11) for m0 over the nuisance parameter φ and then using the fact that m0 is a
sharp hypothesis nested within m1 to rewrite everything in terms of m1 . Specifically,
\[ \begin{split} p(y \mid m_0) &= \int p(y \mid \phi, m_0) \, p(\phi \mid m_0) \, \mathrm{d}\phi \\ &= \int p(y \mid \phi, \delta = \delta_0, m_1) \, p(\phi \mid \delta = \delta_0, m_1) \, \mathrm{d}\phi \\ &= p(y \mid \delta = \delta_0, m_1) \; . \end{split} \qquad (3) \]
Moreover, applying Bayes’ theorem (→ I/5.3.1) to δ under m1 and rearranging gives

\[ p(y \mid \delta = \delta_0, m_1) = \frac{p(\delta = \delta_0 \mid y, m_1) \, p(y \mid m_1)}{p(\delta = \delta_0 \mid m_1)} \; . \qquad (4) \]
Thus we have

\[ \begin{split} \mathrm{BF}_{01} &\overset{(2)}{=} \frac{p(y \mid m_0)}{p(y \mid m_1)} \\ &= p(y \mid m_0) \cdot \frac{1}{p(y \mid m_1)} \\ &\overset{(3)}{=} p(y \mid \delta = \delta_0, m_1) \cdot \frac{1}{p(y \mid m_1)} \\ &\overset{(4)}{=} \frac{p(\delta = \delta_0 \mid y, m_1) \, p(y \mid m_1)}{p(\delta = \delta_0 \mid m_1)} \cdot \frac{1}{p(y \mid m_1)} \\ &= \frac{p(\delta = \delta_0 \mid y, m_1)}{p(\delta = \delta_0 \mid m_1)} \; , \end{split} \qquad (5) \]

which proves the statement. ■
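The equality of equations (2) and (5) can be verified numerically in a simple conjugate case. The following Python sketch (an illustration under assumed values, not part of the original proof) uses a normal model with known variance, where the sample mean satisfies ȳ | δ ~ N(δ, σ²/n), the prior under m1 is δ ~ N(0, τ²), δ0 = 0, and the nuisance parameter φ is omitted for simplicity:

```python
import math

def normpdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Assumed setup: y_bar ~ N(delta, s2/n); prior delta ~ N(0, t2) under m1;
# m0 fixes delta = delta0 = 0.
n, s2, t2 = 20, 4.0, 1.0
y_bar = 0.5
se2 = s2 / n  # sampling variance of the mean

# Direct Bayes factor, cf. eq. (2): ratio of marginal likelihoods.
bf_direct = normpdf(y_bar, 0.0, se2) / normpdf(y_bar, 0.0, t2 + se2)

# Savage-Dickey ratio, cf. eq. (1): posterior vs. prior density at delta0 = 0.
post_var = 1.0 / (1.0 / t2 + 1.0 / se2)
post_mean = post_var * (y_bar / se2)
bf_sd = normpdf(0.0, post_mean, post_var) / normpdf(0.0, 0.0, t2)

assert abs(bf_direct - bf_sd) < 1e-9
```

Both routes give the same Bayes factor, as the theorem asserts; only the Savage-Dickey route avoids computing the marginal likelihood of m0 explicitly.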
Sources:
• Faulkenberry, Thomas J. (2019): “A tutorial on generalizing the default Bayesian t-test via pos-
terior sampling and encompassing priors”; in: Communications for Statistical Applications and
Methods, vol. 26, no. 2, pp. 217-238; URL: https://dx.doi.org/10.29220/CSAM.2019.26.2.217;
DOI: 10.29220/CSAM.2019.26.2.217.
• Penny, W.D. and Ridgway, G.R. (2013): “Efficient Posterior Probability Mapping Using Savage-
Dickey Ratios”; in: PLoS ONE, vol. 8, iss. 3, art. e59655, eq. 16; URL: https://journals.plos.org/
plosone/article?id=10.1371/journal.pone.0059655; DOI: 10.1371/journal.pone.0059655.
\[ \mathrm{BF}_{1e} = \frac{c}{d} = \frac{1/d}{1/c} \qquad (1) \]

where 1/d and 1/c represent the proportions of the posterior and prior of the encompassing model, respectively, that are in agreement with the inequality constraint imposed by the nested model m1.
Proof: Consider first that for any model m1 on data y with parameter θ, Bayes’ theorem (→ I/5.3.1) implies

\[ p(\theta \mid y, m_1) = \frac{p(y \mid \theta, m_1) \cdot p(\theta \mid m_1)}{p(y \mid m_1)} \; . \qquad (2) \]

Rearranging equation (2) allows us to write the marginal likelihood (→ I/5.1.11) for y under m1 as

\[ p(y \mid m_1) = \frac{p(y \mid \theta, m_1) \cdot p(\theta \mid m_1)}{p(\theta \mid y, m_1)} \; . \qquad (3) \]
Taking the ratio of the marginal likelihoods for m1 and the encompassing model (→ IV/3.3.5) me yields the following Bayes factor (→ IV/3.3.1):

\[ \mathrm{BF}_{1e} = \frac{p(\theta' \mid m_1) \, / \, p(\theta' \mid y, m_1)}{p(\theta' \mid m_e) \, / \, p(\theta' \mid y, m_e)} \; . \qquad (5) \]
Because m1 is nested within me via an inequality constraint, the prior p(θ′ | m1) is simply a truncation of the encompassing prior p(θ′ | me). Thus, we can express p(θ′ | m1) in terms of the encompassing prior p(θ′ | me) by multiplying the encompassing prior by an indicator function over m1 and then normalizing the resulting product. That is,

\[ p(\theta' \mid m_1) = \left( \frac{1}{\int I_{\theta' \in m_1} \, p(\theta' \mid m_e) \, \mathrm{d}\theta'} \right) \cdot I_{\theta' \in m_1} \cdot p(\theta' \mid m_e) \; , \]

where Iθ′∈m1 is an indicator function. For parameters θ′ ∈ m1, this indicator function is identically equal to 1, so the expression in parentheses reduces to a constant, say c, allowing us to write the prior as

\[ p(\theta' \mid m_1) = c \cdot p(\theta' \mid m_e) \quad \text{for all} \quad \theta' \in m_1 \; . \]
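The encompassing Bayes factor in equation (1) can be checked numerically in a simple one-parameter case. In the following Python sketch, all concrete values are assumed for illustration: the encompassing prior is θ ~ N(0, τ²), the constraint of m1 is θ > 0, and the likelihood is normal with known variance:

```python
import math

def normpdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def normcdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

# Assumed setup: y_bar ~ N(theta, se2); encompassing prior theta ~ N(0, t2);
# the nested model m1 imposes the inequality constraint theta > 0.
t2, se2, y_bar = 1.0, 0.2, 0.5

# 1/c and 1/d: prior and posterior proportions in agreement with theta > 0.
prior_prop = 1.0 - normcdf(0.0, 0.0, t2)  # = 0.5 by symmetry
post_var = 1.0 / (1.0 / t2 + 1.0 / se2)
post_mean = post_var * (y_bar / se2)
post_prop = 1.0 - normcdf(0.0, post_mean, post_var)
bf_ep = post_prop / prior_prop            # eq. (1): (1/d) / (1/c)

# Direct check: ratio of marginal likelihoods, with the truncated prior
# 2 * N(theta; 0, t2) * I(theta > 0) under m1 (midpoint Riemann sum).
h = 0.001
py_m1 = 2.0 * h * sum(normpdf(y_bar, (k + 0.5) * h, se2)
                      * normpdf((k + 0.5) * h, 0.0, t2)
                      for k in range(10000))
py_me = normpdf(y_bar, 0.0, t2 + se2)
assert abs(py_m1 / py_me - bf_ep) < 1e-3
```

The ratio of posterior to prior mass satisfying the constraint reproduces the marginal likelihood ratio, which is what makes the encompassing prior method attractive: it needs only samples (or masses) under the encompassing model.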
Sources:
• Klugkist, I., Kato, B., and Hoijtink, H. (2005): “Bayesian model selection using encompassing
priors”; in: Statistica Neerlandica, vol. 59, no. 1., pp. 57-69; URL: https://dx.doi.org/10.1111/j.
1467-9574.2005.00279.x; DOI: 10.1111/j.1467-9574.2005.00279.x.
• Faulkenberry, Thomas J. (2019): “A tutorial on generalizing the default Bayesian t-test via pos-
terior sampling and encompassing priors”; in: Communications for Statistical Applications and
Methods, vol. 26, no. 2, pp. 217-238; URL: https://dx.doi.org/10.29220/CSAM.2019.26.2.217;
DOI: 10.29220/CSAM.2019.26.2.217.
\[ \neg (m_1 \wedge m_2) \qquad (1) \]

Then, the Bayes factor in favor of m1 and against m2 is the ratio of the model evidences (→ I/5.1.11) of m1 and m2:

\[ \mathrm{BF}_{12} = \frac{p(y \mid m_1)}{p(y \mid m_2)} \; . \qquad (2) \]

The log Bayes factor is given by the logarithm of the Bayes factor:

\[ \mathrm{LBF}_{12} = \log \mathrm{BF}_{12} = \log \frac{p(y \mid m_1)}{p(y \mid m_2)} \; . \qquad (3) \]
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 18; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
\[ \mathrm{LBF}_{12} = \log \frac{p(y \mid m_1)}{p(y \mid m_2)} \; . \qquad (2) \]

Proof: The Bayes factor (→ IV/3.3.1) is defined as the posterior (→ I/5.1.7) odds ratio when both models (→ I/5.1.1) are equally likely a priori (→ I/5.1.3):

\[ \mathrm{BF}_{12} = \frac{p(m_1 \mid y)}{p(m_2 \mid y)} \; . \qquad (3) \]

Plugging in the posterior odds ratio according to Bayes’ rule (→ I/5.3.2), we have

\[ \mathrm{BF}_{12} = \frac{p(y \mid m_1)}{p(y \mid m_2)} \cdot \frac{p(m_1)}{p(m_2)} \; . \qquad (4) \]

When both models are equally likely a priori, the prior (→ I/5.1.3) odds ratio is one, such that

\[ \mathrm{BF}_{12} = \frac{p(y \mid m_1)}{p(y \mid m_2)} \; . \qquad (5) \]

Equation (2) follows by taking the logarithm of both sides of (5).
■
Proof: The Bayes factor (→ IV/3.3.1) is defined as the ratio of the model evidences (→ I/5.1.11) of m1 and m2

\[ \mathrm{BF}_{12} = \frac{p(y \mid m_1)}{p(y \mid m_2)} \qquad (2) \]

and the log Bayes factor (→ IV/3.3.6) is defined as the logarithm of the Bayes factor

\[ \mathrm{LBF}_{12} = \log \frac{p(y \mid m_1)}{p(y \mid m_2)} \; . \qquad (4) \]

With the definition of the log model evidence (→ IV/3.1.3), LME(m) = log p(y|m), this can be rewritten as

\[ \mathrm{LBF}_{12} = \mathrm{LME}(m_1) - \mathrm{LME}(m_2) \; . \]
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 18; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 23; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
3.4.2 Derivation
Theorem: Let there be a set of generative models (→ I/5.1.1) m1 , . . . , mM with model evidences
(→ I/5.1.11) p(y|m1 ), . . . , p(y|mM ) and prior probabilities (→ I/5.1.3) p(m1 ), . . . , p(mM ). Then, the
posterior probability (→ IV/3.4.1) of model mi is given by
\[ p(m_i \mid y) = \frac{p(y \mid m_i) \, p(m_i)}{\sum_{j=1}^{M} p(y \mid m_j) \, p(m_j)} \; , \quad i = 1, \ldots, M \; . \qquad (1) \]
Proof: From Bayes’ theorem (→ I/5.3.1), the posterior model probability (→ IV/3.4.1) of the i-th
model can be derived as
\[ p(m_i \mid y) = \frac{p(y \mid m_i) \, p(m_i)}{p(y)} \; . \qquad (2) \]
Using the law of marginal probability (→ I/1.3.3), the denominator can be rewritten, such that

\[ p(m_i \mid y) = \frac{p(y \mid m_i) \, p(m_i)}{\sum_{j=1}^{M} p(y, m_j)} \; . \qquad (3) \]
Finally, factorizing the joint probabilities in the denominator with the law of conditional probability (→ I/1.3.4), p(y, mj) = p(y|mj) p(mj), we obtain

\[ p(m_i \mid y) = \frac{p(y \mid m_i) \, p(m_i)}{\sum_{j=1}^{M} p(y \mid m_j) \, p(m_j)} \; . \qquad (4) \]

■
\[ p(m_i \mid y) = \frac{\mathrm{BF}_{i,0} \cdot \alpha_i}{\sum_{j=1}^{M} \mathrm{BF}_{j,0} \cdot \alpha_j} \; , \qquad (1) \]

where BFi,0 is the Bayes factor (→ IV/3.3.1) comparing model mi with m0 and αi is the prior (→ I/5.1.3) odds ratio of model mi against m0.
Proof: By definition (→ IV/3.3.1), the Bayes factor of mi against the reference model m0 is

\[ \mathrm{BF}_{i,0} = \frac{p(y \mid m_i)}{p(y \mid m_0)} \qquad (2) \]
and prior odds ratio of mi against m0

\[ \alpha_i = \frac{p(m_i)}{p(m_0)} \; . \qquad (3) \]
By derivation of the posterior model probability (→ IV/3.4.2), we have

\[ p(m_i \mid y) = \frac{p(y \mid m_i) \cdot p(m_i)}{\sum_{j=1}^{M} p(y \mid m_j) \cdot p(m_j)} \; . \qquad (4) \]
Now applying (2) and (3) to (4), i.e. substituting p(y|mj) = BFj,0 · p(y|m0) and p(mj) = αj · p(m0), we have

\[ p(m_i \mid y) = \frac{\mathrm{BF}_{i,0} \, p(y \mid m_0) \cdot \alpha_i \, p(m_0)}{\sum_{j=1}^{M} \mathrm{BF}_{j,0} \, p(y \mid m_0) \cdot \alpha_j \, p(m_0)} \; , \qquad (5) \]

such that, after cancelling p(y|m0) and p(m0),

\[ p(m_i \mid y) = \frac{\mathrm{BF}_{i,0} \cdot \alpha_i}{\sum_{j=1}^{M} \mathrm{BF}_{j,0} \cdot \alpha_j} \; . \qquad (6) \]

■
Sources:
• Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999): “Bayesian Model Averaging: A Tu-
torial”; in: Statistical Science, vol. 14, no. 4, pp. 382–417, eq. 9; URL: https://projecteuclid.org/
euclid.ss/1009212519; DOI: 10.1214/ss/1009212519.
\[ p(m_1 \mid y) = \frac{\exp(\mathrm{LBF}_{12})}{\exp(\mathrm{LBF}_{12}) + 1} \; . \qquad (1) \]
\[ \frac{p(m_1 \mid y)}{p(m_2 \mid y)} = \mathrm{BF}_{12} \; . \qquad (4) \]
Because the two posterior model probabilities (→ IV/3.4.1) add up to 1, we have

\[ \frac{p(m_1 \mid y)}{1 - p(m_1 \mid y)} = \mathrm{BF}_{12} \; . \qquad (5) \]
Now rearranging for the posterior probability (→ IV/3.4.1), this gives

\[ p(m_1 \mid y) = \frac{\mathrm{BF}_{12}}{\mathrm{BF}_{12} + 1} \; . \qquad (6) \]
Because the log Bayes factor is the logarithm of the Bayes factor (→ IV/3.3.6), we finally have

\[ p(m_1 \mid y) = \frac{\exp(\mathrm{LBF}_{12})}{\exp(\mathrm{LBF}_{12}) + 1} \; . \qquad (7) \]
■
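Equation (7) is the logistic function applied to the log Bayes factor. A minimal Python check (with an illustrative LBF value):

```python
import math

def posterior_m1(lbf12):
    """Posterior probability of m1 from the log Bayes factor, cf. eq. (7),
    assuming equal prior model probabilities. For large |lbf12|, the
    algebraically identical form 1 / (1 + exp(-lbf12)) is more stable."""
    return math.exp(lbf12) / (math.exp(lbf12) + 1.0)

lbf = 1.5
p1 = posterior_m1(lbf)

# Cross-check with eq. (4): the posterior odds equal the Bayes factor exp(LBF12).
odds = p1 / (1.0 - p1)
assert abs(odds - math.exp(lbf)) < 1e-9

# LBF12 = 0 corresponds to equal posterior probabilities.
assert abs(posterior_m1(0.0) - 0.5) < 1e-12
```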
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 21; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
• Zeidman P, Silson EH, Schwarzkopf DS, Baker CI, Penny W (2018): “Bayesian population re-
ceptive field modelling”; in: NeuroImage, vol. 180, pp. 173-187, eq. 11; URL: https://www.
sciencedirect.com/science/article/pii/S1053811917307462; DOI: 10.1016/j.neuroimage.2017.09.008.
\[ p(m_i \mid y) = \frac{\exp[\mathrm{LME}(m_i)] \, p(m_i)}{\sum_{j=1}^{M} \exp[\mathrm{LME}(m_j)] \, p(m_j)} \; , \quad i = 1, \ldots, M \; , \qquad (1) \]
Proof: By derivation of the posterior model probability (→ IV/3.4.2), we have

\[ p(m_i \mid y) = \frac{p(y \mid m_i) \, p(m_i)}{\sum_{j=1}^{M} p(y \mid m_j) \, p(m_j)} \; . \qquad (2) \]
Since the log model evidence (→ IV/3.1.3) is the logarithm of the model evidence, i.e. p(y|mj) = exp[LME(mj)], this becomes

\[ p(m_i \mid y) = \frac{\exp[\mathrm{LME}(m_i)] \, p(m_i)}{\sum_{j=1}^{M} \exp[\mathrm{LME}(m_j)] \, p(m_j)} \; . \qquad (5) \]
■
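Computationally, equation (1) is a softmax over log model evidences plus log prior probabilities; subtracting the maximum LME before exponentiating (the shift cancels between numerator and denominator) avoids the underflow discussed for log family evidences (→ IV/3.2.5). A minimal Python sketch (function name illustrative):

```python
import math

def model_posteriors(lmes, priors):
    """Posterior model probabilities from log model evidences, cf. eq. (1).
    Shifting by max(LME) leaves the ratios unchanged but keeps exp() finite."""
    l_star = max(lmes)
    w = [math.exp(l - l_star) * p for l, p in zip(lmes, priors)]
    s = sum(w)
    return [x / s for x in w]

# LMEs so negative that exp(LME) would underflow to zero directly.
post = model_posteriors([-1200.0, -1203.0, -1210.0], [1/3, 1/3, 1/3])
assert abs(sum(post) - 1.0) < 1e-12
```

With equal priors, the ratio of any two posterior probabilities equals the exponentiated LME difference, i.e. the Bayes factor (→ IV/3.3.1) between the two models.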
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 23; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
\[ p(\theta \mid y) = \sum_{i=1}^{M} p(\theta \mid y, m_i) \cdot p(m_i \mid y) \; . \qquad (1) \]
Sources:
• Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999): “Bayesian Model Averaging: A Tu-
torial”; in: Statistical Science, vol. 14, no. 4, pp. 382–417, eq. 1; URL: https://projecteuclid.org/
euclid.ss/1009212519; DOI: 10.1214/ss/1009212519.
3.5.2 Derivation
Theorem: Let m1 , . . . , mM be M statistical models (→ I/5.1.4) with posterior model probabilities
(→ IV/3.4.1) p(m1 |y), . . . , p(mM |y) and posterior distributions (→ I/5.1.7) p(θ|y, m1 ), . . . , p(θ|y, mM ).
Then, the marginal (→ I/1.5.3) posterior (→ I/5.1.7) density (→ I/1.7.1), conditional (→ I/1.3.4)
on the measured data y, but unconditional (→ I/1.3.3) on the modelling approach m, is given by:
\[ p(\theta \mid y) = \sum_{i=1}^{M} p(\theta \mid y, m_i) \cdot p(m_i \mid y) \; . \qquad (1) \]
Proof: Using the law of marginal probability (→ I/1.3.3), the probability distribution of the shared
parameters θ conditional (→ I/1.3.4) on the measured data y can be obtained by marginalizing (→
I/1.3.3) over the discrete random variable (→ I/1.2.2) model m:
\[ p(\theta \mid y) = \sum_{i=1}^{M} p(\theta, m_i \mid y) \; . \qquad (2) \]
Using the law of the conditional probability (→ I/1.3.4), the summand can be expanded to give
\[ p(\theta \mid y) = \sum_{i=1}^{M} p(\theta \mid y, m_i) \cdot p(m_i \mid y) \; , \qquad (3) \]

where p(θ|y, mi) is the posterior distribution (→ I/5.1.7) of the i-th model and p(mi|y) is the posterior probability (→ IV/3.4.1) of the i-th model.
■
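Equation (3) states that the model-averaged posterior is a finite mixture of the per-model posteriors, weighted by the posterior model probabilities. A minimal Python sketch with assumed normal posteriors and assumed posterior model probabilities:

```python
import math

def normpdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

post_model = [0.7, 0.3]                # p(m_i | y), assumed, summing to 1
post_theta = [(0.0, 1.0), (2.0, 0.5)]  # (mean, variance) of p(theta | y, m_i), assumed

def bma_density(theta):
    """Model-averaged posterior density of theta, cf. eq. (1)."""
    return sum(pm * normpdf(theta, mu, v)
               for pm, (mu, v) in zip(post_model, post_theta))

# The mixture mean is the probability-weighted average of the per-model means.
bma_mean = sum(pm * mu for pm, (mu, _) in zip(post_model, post_theta))
```

Because each component is a normalized density and the weights sum to one, the averaged density integrates to one, so it is itself a valid posterior density for θ.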
\[ p(\theta \mid y) = \sum_{i=1}^{M} p(\theta \mid m_i, y) \cdot \frac{\exp[\mathrm{LME}(m_i)] \, p(m_i)}{\sum_{j=1}^{M} \exp[\mathrm{LME}(m_j)] \, p(m_j)} \; , \qquad (1) \]
Proof: According to the law of marginal probability (→ I/1.3.3), the probability of the shared
parameters θ conditional on the measured data y can be obtained (→ IV/3.5.2) by marginalizing
over the discrete variable model m:
\[ p(\theta \mid y) = \sum_{i=1}^{M} p(\theta \mid m_i, y) \cdot p(m_i \mid y) \; , \qquad (2) \]
where p(mi |y) is the posterior probability (→ IV/3.4.1) of the i-th model. One can express posterior
model probabilities in terms of log model evidences (→ IV/3.4.5) as
\[ p(m_i \mid y) = \frac{\exp[\mathrm{LME}(m_i)] \, p(m_i)}{\sum_{j=1}^{M} \exp[\mathrm{LME}(m_j)] \, p(m_j)} \; . \qquad (3) \]

Plugging (3) into (2) yields equation (1). ■
Sources:
• Soch J, Allefeld C (2018): “MACS – a new SPM toolbox for model assessment, comparison and
selection”; in: Journal of Neuroscience Methods, vol. 306, pp. 19-31, eq. 25; URL: https://www.
sciencedirect.com/science/article/pii/S0165027018301468; DOI: 10.1016/j.jneumeth.2018.05.017.
Chapter V
Appendix
1 Proof by Number
P216 | ugkv-lbfmean | Expectation of the log Bayes factor for the univariate Gaussian with known variance | JoramSoch | 2021-03-24 | 362
P217 | ugkv-cvlme | Cross-validated log model evidence for the univariate Gaussian with known variance | JoramSoch | 2021-03-24 | 363
P218 | ugkv-cvlbf | Cross-validated log Bayes factor for the univariate Gaussian with known variance | JoramSoch | 2021-03-24 | 366
P219 | ugkv-cvlbfmean | Expectation of the cross-validated log Bayes factor for the univariate Gaussian with known variance | JoramSoch | 2021-03-24 | 367
P220 | cdf-pit | Probability integral transform using cumulative distribution function | JoramSoch | 2021-04-07 | 31
P221 | cdf-itm | Inverse transformation method using cumulative distribution function | JoramSoch | 2021-04-07 | 32
P222 | cdf-dt | Distributional transformation using cumulative distribution function | JoramSoch | 2021-04-07 | 33
P223 | ug-mle | Maximum likelihood estimation for the univariate Gaussian | JoramSoch | 2021-04-16 | 334
P224 | poissexp-mle | Maximum likelihood estimation for the Poisson distribution with exposure values | JoramSoch | 2021-04-16 | 536
P225 | poiss-prior | Conjugate prior distribution for Poisson-distributed data | JoramSoch | 2020-04-21 | 531
P226 | poiss-post | Posterior distribution for Poisson-distributed data | JoramSoch | 2020-04-21 | 533
P227 | poiss-lme | Log model evidence for Poisson-distributed data | JoramSoch | 2020-04-21 | 534
P228 | beta-mean | Mean of the beta distribution | JoramSoch | 2021-04-29 | 255
P229 | beta-var | Variance of the beta distribution | JoramSoch | 2021-04-29 | 256
P230 | poiss-var | Variance of the Poisson distribution | JoramSoch | 2021-04-29 | 159
P231 | mvt-f | Relationship between multivariate t-distribution and F-distribution | JoramSoch | 2021-05-04 | 296
P232 | nst-t | Relationship between non-standardized t-distribution and t-distribution | JoramSoch | 2021-05-11 | 209
P268 | iglm-blue | Best linear unbiased estimator for the inverse general linear model | JoramSoch | 2021-10-21 | 497
P269 | cfm-para | Parameters of the corresponding forward model | JoramSoch | 2021-10-21 | 499
P270 | cfm-exist | Existence of a corresponding forward model | JoramSoch | 2021-10-21 | 500
P271 | slr-ols | Ordinary least squares for simple linear regression | JoramSoch | 2021-10-27 | 401
P272 | slr-olsmean | Expectation of parameter estimates for simple linear regression | JoramSoch | 2021-10-27 | 405
P273 | slr-olsvar | Variance of parameter estimates for simple linear regression | JoramSoch | 2021-10-27 | 407
P274 | slr-meancent | Effects of mean-centering on parameter estimates for simple linear regression | JoramSoch | 2021-10-27 | 413
P275 | slr-comp | The regression line goes through the center of mass point | JoramSoch | 2021-10-27 | 414
P276 | slr-ressum | The sum of residuals is zero in simple linear regression | JoramSoch | 2021-10-27 | 428
P277 | slr-rescorr | The residuals and the covariate are uncorrelated in simple linear regression | JoramSoch | 2021-10-27 | 429
P278 | slr-resvar | Relationship between residual variance and sample variance in simple linear regression | JoramSoch | 2021-10-27 | 430
P279 | slr-corr | Relationship between correlation coefficient and slope estimate in simple linear regression | JoramSoch | 2021-10-27 | 431
P280 | slr-rsq | Relationship between coefficient of determination and correlation coefficient in simple linear regression | JoramSoch | 2021-10-27 | 432
P281 | slr-mlr | Simple linear regression is a special case of multiple linear regression | JoramSoch | 2021-11-09 | 400
P282 | slr-olsdist | Distribution of parameter estimates for simple linear regression | JoramSoch | 2021-11-09 | 410
P283 | slr-proj | Projection of a data point to the regression line | JoramSoch | 2021-11-09 | 415
P284 | slr-sss | Sums of squares for simple linear regression | JoramSoch | 2021-11-09 | 417
P285 | slr-mat | Transformation matrices for simple linear regression | JoramSoch | 2021-11-09 | 419
P286 | slr-wls | Weighted least squares for simple linear regression | JoramSoch | 2021-11-16 | 421
P287 | slr-mle | Maximum likelihood estimation for simple linear regression | JoramSoch | 2021-11-16 | 424
P288 | slr-ols2 | Ordinary least squares for simple linear regression | JoramSoch | 2021-11-16 | 404
P289 | slr-wls2 | Weighted least squares for simple linear regression | JoramSoch | 2021-11-16 | 423
P290 | slr-mle2 | Maximum likelihood estimation for simple linear regression | JoramSoch | 2021-11-16 | 427
P291 | mean-tot | Law of total expectation | JoramSoch | 2021-11-26 | 51
P292 | var-tot | Law of total variance | JoramSoch | 2021-11-26 | 59
P293 | cov-tot | Law of total covariance | JoramSoch | 2021-11-26 | 65
P294 | dir-kl | Kullback-Leibler divergence for the Dirichlet distribution | JoramSoch | 2021-12-02 | 312
P295 | wish-kl | Kullback-Leibler divergence for the Wishart distribution | JoramSoch | 2021-12-02 | 327
P296 | matn-kl | Kullback-Leibler divergence for the matrix-normal distribution | JoramSoch | 2021-12-02 | 321
P297 | matn-samp | Sampling from the matrix-normal distribution | JoramSoch | 2021-12-07 | 326
P298 | mean-tr | Expected value of the trace of a matrix | JoramSoch | 2021-12-07 | 47
P299 | corr-z | Correlation coefficient in terms of standard scores | JoramSoch | 2021-12-14 | 75
P300 | corr-range | Correlation always falls between -1 and +1 | JoramSoch | 2021-12-14 | 74
P301 | bern-var | Variance of the Bernoulli distribution | JoramSoch | 2022-01-20 | 143
P302 | bin-var | Variance of the binomial distribution | JoramSoch | 2022-01-20 | 148
P320 | slr-olscorr | Parameter estimates for simple linear regression are uncorrelated after mean-centering | JoramSoch | 2022-04-14 | 412
P321 | norm-probstd | Probability of normal random variable being within standard deviations from its mean | JoramSoch | 2022-05-08 | 192
P322 | mult-cov | Covariance matrix of the multinomial distribution | adkipnis | 2022-05-11 | 165
P323 | nw-pdf | Probability density function of the normal-Wishart distribution | JoramSoch | 2022-05-14 | 329
P324 | ng-nw | Normal-gamma distribution is a special case of normal-Wishart distribution | JoramSoch | 2022-05-20 | 298
P325 | lognorm-cdf | Cumulative distribution function of the log-normal distribution | majapavlo | 2022-06-29 | 236
P326 | lognorm-qf | Quantile function of the log-normal distribution | majapavlo | 2022-07-09 | 237
P327 | nw-mean | Mean of the normal-Wishart distribution | JoramSoch | 2022-07-14 | 331
P328 | gam-wish | Gamma distribution is a special case of Wishart distribution | JoramSoch | 2022-07-14 | 212
P329 | mlr-glm | Multiple linear regression is a special case of the general linear model | JoramSoch | 2022-07-21 | 434
P330 | mvn-matn | Multivariate normal distribution is a special case of matrix-normal distribution | JoramSoch | 2022-07-31 | 276
P331 | norm-mvn | Normal distribution is a special case of multivariate normal distribution | JoramSoch | 2022-08-19 | 178
P332 | t-mvt | t-distribution is a special case of multivariate t-distribution | JoramSoch | 2022-08-25 | 208
P333 | mvt-pdf | Probability density function of the multivariate t-distribution | JoramSoch | 2022-09-02 | 295
P334 | bern-ent | Entropy of the Bernoulli distribution | JoramSoch | 2022-09-02 | 144
P335 | bin-ent | Entropy of the binomial distribution | JoramSoch | 2022-09-02 | 150
P336 | cat-ent | Entropy of the categorical distribution | JoramSoch | 2022-09-09 | 162
2 Definition by Number
D115 | vblme | Variational Bayesian log model evidence | JoramSoch | 2020-11-25 | 577
D116 | prior-flat | Flat, hard and soft prior distribution | JoramSoch | 2020-12-02 | 128
D117 | prior-uni | Uniform and non-uniform prior distribution | JoramSoch | 2020-12-02 | 128
D118 | prior-inf | Informative and non-informative prior distribution | JoramSoch | 2020-12-02 | 129
D119 | prior-emp | Empirical and theoretical prior distribution | JoramSoch | 2020-12-02 | 129
D120 | prior-conj | Conjugate and non-conjugate prior distribution | JoramSoch | 2020-12-02 | 129
D121 | prior-maxent | Maximum entropy prior distribution | JoramSoch | 2020-12-02 | 130
D122 | prior-eb | Empirical Bayes prior distribution | JoramSoch | 2020-12-02 | 130
D123 | prior-ref | Reference prior distribution | JoramSoch | 2020-12-02 | 130
D124 | ug | Univariate Gaussian | JoramSoch | 2021-03-03 | 334
D125 | h0 | Null hypothesis | JoramSoch | 2021-03-12 | 120
D126 | h1 | Alternative hypothesis | JoramSoch | 2021-03-12 | 120
D127 | hyp | Statistical hypothesis | JoramSoch | 2021-03-19 | 118
D128 | hyp-simp | Simple and composite hypothesis | JoramSoch | 2021-03-19 | 118
D129 | hyp-point | Point and set hypothesis | JoramSoch | 2021-03-19 | 118
D130 | test | Statistical hypothesis test | JoramSoch | 2021-03-19 | 119
D131 | tstat | Test statistic | JoramSoch | 2021-03-19 | 121
D132 | size | Size of a statistical test | JoramSoch | 2021-03-19 | 121
D133 | alpha | Significance level | JoramSoch | 2021-03-19 | 122
D134 | cval | Critical value | JoramSoch | 2021-03-19 | 122
D135 | pval | p-value | JoramSoch | 2021-03-19 | 122
D136 | ugkv | Univariate Gaussian with known variance | JoramSoch | 2021-03-23 | 349
D137 | power | Power of a statistical test | JoramSoch | 2021-03-31 | 121
D138 | hyp-tail | One-tailed and two-tailed hypothesis | JoramSoch | 2021-03-31 | 119
D139 | test-tail | One-tailed and two-tailed test | JoramSoch | 2021-03-31 | 120
3 Proof by Topic
A
• Accuracy and complexity for Bayesian linear regression, 471
• Accuracy and complexity for Bayesian linear regression with known covariance, 486
• Accuracy and complexity for the univariate Gaussian, 347
• Accuracy and complexity for the univariate Gaussian with known variance, 359
• Addition law of probability, 13
• Addition of the differential entropy upon multiplication with a constant, 93
• Addition of the differential entropy upon multiplication with invertible matrix, 94
• Additivity of the Kullback-Leibler divergence for independent distributions, 109
• Additivity of the variance for independent random variables, 59
• Akaike information criterion for multiple linear regression, 463
• Application of Cochran’s theorem to two-way analysis of variance, 386
• Approximation of log family evidences based on log model evidences, 581
B
• Bayes’ rule, 131
• Bayes’ theorem, 131
• Bayesian information criterion for multiple linear regression, 464
• Bayesian model averaging in terms of log model evidences, 593
• Best linear unbiased estimator for the inverse general linear model, 497
• Binomial test, 508
C
• Characteristic function of a function of a random variable, 35
• Chi-squared distribution is a special case of gamma distribution, 244
• Combined posterior distributions in terms of individual posterior distributions obtained from conditionally independent data, 126
• Concavity of the Shannon entropy, 86
• Conditional distributions of the multivariate normal distribution, 289
• Conditional distributions of the normal-gamma distribution, 308
• Conjugate prior distribution for Bayesian linear regression, 465
• Conjugate prior distribution for Bayesian linear regression with known covariance, 481
• Conjugate prior distribution for binomial observations, 512
• Conjugate prior distribution for multinomial observations, 522
• Conjugate prior distribution for multivariate Bayesian linear regression, 501
• Conjugate prior distribution for Poisson-distributed data, 531
• Conjugate prior distribution for the Poisson distribution with exposure values, 538
• Conjugate prior distribution for the univariate Gaussian, 340
• Conjugate prior distribution for the univariate Gaussian with known variance, 353
• Construction of confidence intervals using Wilks’ theorem, 114
• Construction of unbiased estimator for variance, 560
• Continuous uniform distribution maximizes differential entropy for fixed range, 176
• Convexity of the cross-entropy, 88
• Convexity of the Kullback-Leibler divergence, 108
• Corrected Akaike information criterion converges to uncorrected Akaike information criterion when infinite data are available, 566
D
• Derivation of Bayesian model averaging, 592
• Derivation of R² and adjusted R², 561
• Derivation of the Bayesian information criterion, 568
• Derivation of the family evidence, 578
• Derivation of the log Bayes factor, 587
• Derivation of the log family evidence, 579
• Derivation of the log model evidence, 572
• Derivation of the model evidence, 571
• Derivation of the posterior model probability, 589
• Deviance for multiple linear regression, 461
• Deviance information criterion for multiple linear regression, 475
• Differential entropy can be negative, 91
• Differential entropy for the matrix-normal distribution, 320
• Differential entropy of the continuous uniform distribution, 174
• Differential entropy of the gamma distribution, 223
• Differential entropy of the multivariate normal distribution, 284
• Differential entropy of the normal distribution, 202
• Differential entropy of the normal-gamma distribution, 303
E
• Effects of mean-centering on parameter estimates for simple linear regression, 413
• Encompassing prior method for computing Bayes factors, 585
• Entropy of the Bernoulli distribution, 144
• Entropy of the binomial distribution, 150
• Entropy of the categorical distribution, 162
• Entropy of the discrete uniform distribution, 138
• Entropy of the multinomial distribution, 166
• Equivalence of matrix-normal distribution and multivariate normal distribution, 317
• Equivalence of operations for model evidence and log model evidence, 574
• Equivalence of parameter estimates from the transformed general linear model, 495
• Exceedance probabilities for the Dirichlet distribution, 313
• Existence of a corresponding forward model, 500
• Expectation of a quadratic form, 48
• Expectation of parameter estimates for simple linear regression, 405
• Expectation of the cross-validated log Bayes factor for the univariate Gaussian with known variance, 367
• Expectation of the log Bayes factor for the univariate Gaussian with known variance, 362
• Expected value of a non-negative random variable, 41
• Expected value of the trace of a matrix, 47
• Expected value of x times ln(x) for a gamma distribution, 222
• Exponential distribution is a special case of gamma distribution, 225
• Expression of the cumulative distribution function of the normal distribution without the error function, 190
• Expression of the probability mass function of the beta-binomial distribution using only the gamma function, 156
• Extreme points of the probability density function of the normal distribution, 200
F
• F-statistic in terms of ordinary least squares estimates in one-way analysis of variance, 375
• F-statistics in terms of ordinary least squares estimates in two-way analysis of variance, 398
• F-test for grand mean in two-way analysis of variance, 396
• F-test for interaction in two-way analysis of variance, 395
• F-test for main effect in one-way analysis of variance, 371
• F-test for main effect in two-way analysis of variance, 392
• F-test for multiple linear regression using contrast-based inference, 455
G
• Gamma distribution is a special case of Wishart distribution, 212
• Gaussian integral, 185
• Gibbs’ inequality, 89
I
• Independence of estimated parameters and residuals in multiple linear regression, 444
• Independence of products of multivariate normal random vector, 294
• Inflection points of the probability density function of the normal distribution, 201
• Invariance of the covariance matrix under addition of constant vector, 69
• Invariance of the differential entropy under addition of a constant, 92
• Invariance of the Kullback-Leibler divergence under parameter transformation, 110
• Invariance of the variance under addition of a constant, 57
• Inverse transformation method using cumulative distribution function, 32
J
• Joint likelihood is the product of likelihood function and prior density, 125
K
• Kullback-Leibler divergence for the Bernoulli distribution, 145
• Kullback-Leibler divergence for the binomial distribution, 151
• Kullback-Leibler divergence for the continuous uniform distribution, 175
• Kullback-Leibler divergence for the Dirichlet distribution, 312
• Kullback-Leibler divergence for the discrete uniform distribution, 139
• Kullback-Leibler divergence for the gamma distribution, 224
• Kullback-Leibler divergence for the matrix-normal distribution, 321
• Kullback-Leibler divergence for the multivariate normal distribution, 285
• Kullback-Leibler divergence for the normal distribution, 203
• Kullback-Leibler divergence for the normal-gamma distribution, 304
• Kullback-Leibler divergence for the Wishart distribution, 327
L
• Law of the unconscious statistician, 51
• Law of total covariance, 65
• Law of total expectation, 51
• Law of total probability, 13
• Law of total variance, 59
• Linear combination of independent normal random variables, 206
• Linear transformation theorem for the matrix-normal distribution, 323
• Linear transformation theorem for the moment-generating function, 36
• Linear transformation theorem for the multivariate normal distribution, 287
• Linearity of the expected value, 43
• Log Bayes factor for binomial observations, 515
• Log Bayes factor for multinomial observations, 526
• Log Bayes factor for the univariate Gaussian with known variance, 361
M
• Marginal distribution of a conditional binomial distribution, 153
• Marginal distributions for the matrix-normal distribution, 324
• Marginal distributions of the multivariate normal distribution, 288
• Marginal distributions of the normal-gamma distribution, 306
• Marginal likelihood is a definite integral of joint likelihood, 127
• Maximum likelihood estimation can result in biased estimates, 116
• Maximum likelihood estimation for binomial observations, 509
• Maximum likelihood estimation for Dirichlet-distributed data, 545
• Maximum likelihood estimation for multinomial observations, 519
• Maximum likelihood estimation for multiple linear regression, 458
• Maximum likelihood estimation for Poisson-distributed data, 530
• Maximum likelihood estimation for simple linear regression, 424
• Maximum likelihood estimation for simple linear regression, 427
• Maximum likelihood estimation for the general linear model, 492
• Maximum likelihood estimation for the Poisson distribution with exposure values, 536
• Maximum likelihood estimation for the univariate Gaussian, 334
• Maximum likelihood estimation for the univariate Gaussian with known variance, 349
• Maximum likelihood estimator of variance in multiple linear regression is biased, 558
• Maximum likelihood estimator of variance is biased, 556
• Maximum log-likelihood for binomial observations, 510
• Maximum log-likelihood for multinomial observations, 520
• Maximum log-likelihood for multiple linear regression, 460
• Maximum-a-posteriori estimation for binomial observations, 511
• Maximum-a-posteriori estimation for multinomial observations, 521
• Mean of the Bernoulli distribution, 142
• Mean of the beta distribution, 255
• Mean of the binomial distribution, 148
• Mean of the categorical distribution, 161
• Mean of the continuous uniform distribution, 171
• Mean of the ex-Gaussian distribution, 269
• Mean of the exponential distribution, 229
N
• Necessary and sufficient condition for independence of multivariate normal random variables, 292
• Non-invariance of the differential entropy under change of variables, 95
• (Non-)Multiplicativity of the expected value, 45
• Non-negativity of the expected value, 42
• Non-negativity of the Kullback-Leibler divergence, 105
• Non-negativity of the Kullback-Leibler divergence, 106
• Non-negativity of the Shannon entropy, 85
• Non-negativity of the variance, 55
• Non-symmetry of the Kullback-Leibler divergence, 107
• Normal distribution is a special case of multivariate normal distribution, 178
O
• One-sample t-test for independent observations, 336
• One-sample z-test for independent observations, 350
• Ordinary least squares for multiple linear regression, 435
• Ordinary least squares for multiple linear regression, 435
• Ordinary least squares for multiple linear regression with two regressors, 436
• Ordinary least squares for one-way analysis of variance, 369
• Ordinary least squares for simple linear regression, 401
• Ordinary least squares for simple linear regression, 404
• Ordinary least squares for the general linear model, 490
• Ordinary least squares for two-way analysis of variance, 380
P
• Paired t-test for dependent observations, 339
• Paired z-test for dependent observations, 353
• Parameter estimates for simple linear regression are uncorrelated after mean-centering, 412
• Parameters of the corresponding forward model, 499
• Partition of a covariance matrix into expected values, 67
• Partition of covariance into expected values, 63
• Partition of skewness into expected values, 61
• Partition of sums of squares in one-way analysis of variance, 370
• Partition of sums of squares in ordinary least squares, 439
• Partition of sums of squares in two-way analysis of variance, 384
• Partition of the log model evidence into accuracy and complexity, 573
• Partition of the mean squared error into bias and variance, 113
• Partition of variance into expected values, 55
• Positive semi-definiteness of the covariance matrix, 68
• Posterior credibility region against the omnibus null hypothesis for Bayesian linear regression, 479
• Posterior density is proportional to joint likelihood, 126
• Posterior distribution for Bayesian linear regression, 467
• Posterior distribution for Bayesian linear regression with known covariance, 482
• Posterior distribution for binomial observations, 513
• Posterior distribution for multinomial observations, 523
• Posterior distribution for multivariate Bayesian linear regression, 503
• Posterior distribution for Poisson-distributed data, 533
• Posterior distribution for the Poisson distribution with exposure values, 539
• Posterior distribution for the univariate Gaussian, 342
• Posterior distribution for the univariate Gaussian with known variance, 355
• Posterior model probabilities in terms of Bayes factors, 589
• Posterior model probabilities in terms of log model evidences, 591
• Posterior model probability in terms of log Bayes factor, 590
• Posterior probability of the alternative hypothesis for Bayesian linear regression, 477
• Posterior probability of the alternative model for binomial observations, 517
• Posterior probability of the alternative model for multinomial observations, 528
• Probability and log-odds in logistic regression, 551
Q
• Quantile function is inverse of strictly monotonically increasing cumulative distribution function, 34
• Quantile function of the continuous uniform distribution, 170
• Quantile function of the discrete uniform distribution, 137
• Quantile function of the exponential distribution, 228
• Quantile function of the gamma distribution, 217
• Quantile function of the log-normal distribution, 237
• Quantile function of the normal distribution, 193
R
• Range of probability, 12
• Range of the variance of the Bernoulli distribution, 143
• Range of the variance of the binomial distribution, 149
• Relation of continuous Kullback-Leibler divergence to differential entropy, 112
• Relation of continuous mutual information to joint and conditional differential entropy, 104
• Relation of continuous mutual information to marginal and conditional differential entropy, 102
• Relation of continuous mutual information to marginal and joint differential entropy, 103
• Relation of discrete Kullback-Leibler divergence to Shannon entropy, 111
• Relation of mutual information to joint and conditional entropy, 100
• Relation of mutual information to marginal and conditional entropy, 98
• Relation of mutual information to marginal and joint entropy, 99
• Relationship between chi-squared distribution and beta distribution, 250
• Relationship between coefficient of determination and correlation coefficient in simple linear regression, 432
• Relationship between correlation coefficient and slope estimate in simple linear regression, 431
• Relationship between covariance and correlation, 65
• Relationship between covariance matrix and correlation matrix, 71
• Relationship between gamma distribution and standard gamma distribution, 213
• Relationship between gamma distribution and standard gamma distribution, 214
• Relationship between multivariate normal distribution and chi-squared distribution, 277
• Relationship between multivariate t-distribution and F-distribution, 296
• Relationship between non-standardized t-distribution and t-distribution, 209
• Relationship between normal distribution and chi-squared distribution, 181
• Relationship between normal distribution and standard normal distribution, 178
• Relationship between normal distribution and standard normal distribution, 180
• Relationship between normal distribution and t-distribution, 183
• Relationship between precision matrix and correlation matrix, 73
• Relationship between R² and maximum log-likelihood, 562
• Relationship between residual variance and sample variance in simple linear regression, 430
• Relationship between second raw moment, variance and mean, 82
• Relationship between signal-to-noise ratio and R², 564
• Reparametrization for one-way analysis of variance, 376
S
• Sampling from the matrix-normal distribution, 326
• Sampling from the normal-gamma distribution, 310
• Savage-Dickey density ratio for computing Bayes factors, 584
• Scaling of a random variable following the gamma distribution, 215
• Scaling of the covariance matrix upon multiplication with constant matrix, 70
• Scaling of the variance upon multiplication with a constant, 57
• Second central moment is variance, 83
• Self-covariance equals variance, 64
• Simple linear regression is a special case of multiple linear regression, 400
• Skewness of the ex-Gaussian distribution, 271
• Skewness of the exponential distribution, 233
• Skewness of the Wald distribution, 262
• Square of expectation of product is less than or equal to product of expectation of squares, 49
• Sums of squares for simple linear regression, 417
• Symmetry of the covariance, 63
• Symmetry of the covariance matrix, 67
T
• t-distribution is a special case of multivariate t-distribution, 208
• t-test for multiple linear regression using contrast-based inference, 453
• The p-value follows a uniform distribution under the null hypothesis, 123
• The regression line goes through the center of mass point, 414
• The residuals and the covariate are uncorrelated in simple linear regression, 429
• The sum of residuals is zero in simple linear regression, 428
• Transformation matrices for ordinary least squares, 441
• Transformation matrices for simple linear regression, 419
• Transitivity of Bayes factors, 583
• Transposition of a matrix-normal random variable, 322
• Two-sample t-test for independent observations, 337
• Two-sample z-test for independent observations, 351
V
• Value of the probability-generating function for argument one, 39
• Value of the probability-generating function for argument zero, 39
• Variance of constant is zero, 56
• Variance of parameter estimates for simple linear regression, 407
• Variance of the Bernoulli distribution, 143
• Variance of the beta distribution, 256
• Variance of the binomial distribution, 148
• Variance of the continuous uniform distribution, 173
• Variance of the ex-Gaussian distribution, 270
• Variance of the exponential distribution, 231
• Variance of the gamma distribution, 219
• Variance of the linear combination of two random variables, 58
• Variance of the log-normal distribution, 242
• Variance of the normal distribution, 197
• Variance of the Poisson distribution, 159
W
• Weighted least squares for multiple linear regression, 450
• Weighted least squares for multiple linear regression, 451
• Weighted least squares for simple linear regression, 421
• Weighted least squares for simple linear regression, 423
• Weighted least squares for the general linear model, 491
4 Definition by Topic
A
• Akaike information criterion, 566
• Alternative hypothesis, 120
B
• Bayes factor, 582
• Bayesian information criterion, 568
• Bayesian model averaging, 592
• Bernoulli distribution, 141
• Beta distribution, 249
• Beta-binomial data, 548
• Beta-binomial distribution, 154
• Beta-distributed data, 543
• Binomial distribution, 146
• Binomial observations, 508
• Bivariate normal distribution, 278
C
• Categorical distribution, 161
• Central moment, 82
• Characteristic function, 34
• Chi-squared distribution, 244
• Coefficient of determination, 561
• Conditional differential entropy, 97
• Conditional entropy, 87
• Conditional independence, 7
• Conditional probability distribution, 16
• Confidence interval, 114
• Conjugate and non-conjugate prior distribution, 129
• Constant, 4
• Continuous uniform distribution, 168
• Corrected Akaike information criterion, 566
• Correlation, 74
• Correlation matrix, 76
• Corresponding forward model, 499
• Covariance, 62
• Covariance matrix, 66
• Critical value, 122
• Cross-covariance matrix, 70
• Cross-entropy, 88
• Cross-validated log model evidence, 575
• Cumulant-generating function, 40
• Cumulative distribution function, 27
D
• Deviance, 570
E
• Empirical and theoretical prior distribution, 129
• Empirical Bayes, 132
• Empirical Bayes prior distribution, 130
• Empirical Bayesian log model evidence, 576
• Encompassing model, 586
• Estimation matrix, 440
• Event space, 2
• ex-Gaussian distribution, 266
• Exceedance probability, 9
• Expected value, 41
• Expected value of a random matrix, 54
• Expected value of a random vector, 53
• Explained sum of squares, 438
• Exponential distribution, 225
F
• F-contrast for contrast-based inference in multiple linear regression, 452
• F-distribution, 247
• Family evidence, 577
• Flat, hard and soft prior distribution, 128
• Full probability model, 124
• Full width at half maximum, 78
G
• Gamma distribution, 212
• General linear model, 490
• Generative model, 124
I
• Informative and non-informative prior distribution, 129
• Interaction sum of squares, 380
• Inverse general linear model, 496
J
• Joint cumulative distribution function, 33
• Joint differential entropy, 97
• Joint entropy, 87
• Joint likelihood, 125
• Joint probability, 5
• Joint probability distribution, 16
K
• Kolmogorov axioms of probability, 9
• Kullback-Leibler divergence, 105
L
• Law of conditional probability, 6
• Law of marginal probability, 5
• Likelihood function, 124
• Log Bayes factor, 587
• Log family evidence, 578
• Log model evidence, 571
• Log-likelihood function, 116
• Log-normal distribution, 234
• Logistic regression, 551
M
• Marginal likelihood, 127
• Marginal probability distribution, 16
• Matrix-normal distribution, 317
• Maximum, 79
• Maximum entropy prior distribution, 130
• Maximum likelihood estimation, 116
• Maximum log-likelihood, 117
• Maximum-a-posteriori estimation, 125
• Mean squared error, 113
• Median, 77
• Method-of-moments estimation, 117
• Minimum, 79
• Mode, 78
• Model evidence, 571
• Moment, 79
• Moment-generating function, 35
• Multinomial distribution, 163
• Multinomial observations, 518
• Multiple linear regression, 433
• Multivariate normal distribution, 276
• Multivariate t-distribution, 295
• Mutual exclusivity, 8
• Mutual information, 101
N
• Non-standardized t-distribution, 208
• Normal distribution, 177
• Normal-gamma distribution, 297
• Normal-Wishart distribution, 329
• Null hypothesis, 120
O
• One-tailed and two-tailed hypothesis, 119
• One-tailed and two-tailed test, 120
• One-way analysis of variance, 368
P
• p-value, 122
• Point and set hypothesis, 118
• Poisson distribution, 158
• Poisson distribution with exposure values, 536
• Poisson-distributed data, 529
• Posterior distribution, 125
• Posterior model probability, 588
• Power of a statistical test, 121
• Precision, 60
• Precision matrix, 72
• Prior distribution, 124
• Probability, 5
• Probability density function, 20
• Probability distribution, 16
• Probability mass function, 17
• Probability space, 2
• Probability-generating function, 38
• Projection matrix, 440
Q
• Quantile function, 34
R
• Random event, 3
• Random experiment, 2
• Random matrix, 3
• Random variable, 3
• Random vector, 3
• Raw moment, 81
• Reference prior distribution, 130
• Regression line, 414
• Residual sum of squares, 438
• Residual variance, 556
• Residual-forming matrix, 440
S
• Sample correlation coefficient, 75
• Sample correlation matrix, 77
• Sample covariance, 62
• Sample covariance matrix, 66
• Sample mean, 41
• Sample skewness, 61
• Sample space, 2
• Sample variance, 54
• Sampling distribution, 17
• Shannon entropy, 85
• Signal-to-noise ratio, 564
• Significance level, 122
• Simple and composite hypothesis, 118
• Simple linear regression, 399
• Size of a statistical test, 121
• Skewness, 61
• Standard deviation, 78
• Standard gamma distribution, 213
• Standard normal distribution, 178
• Standard uniform distribution, 168
• Standardized moment, 84
• Statistical hypothesis, 118
• Statistical hypothesis test, 119
• Statistical independence, 6
T
• t-contrast for contrast-based inference in multiple linear regression, 452
• t-distribution, 207
• Test statistic, 121
• Total sum of squares, 438
• Transformed general linear model, 494
• Treatment sum of squares, 369
• Two-way analysis of variance, 379
U
• Uniform and non-uniform prior distribution, 128
• Uniform-prior log model evidence, 575
• Univariate and multivariate random variable, 4
• Univariate Gaussian, 334
• Univariate Gaussian with known variance, 349
V
• Variance, 54
• Variational Bayes, 133
• Variational Bayesian log model evidence, 577
W
• Wald distribution, 257
• Wishart distribution, 327