
CHAPTER III. STATISTICAL MODELS

$$
\begin{aligned}
\mu_0^{(i+1)} &= \mu_n^{(i)} \\
\Lambda_0^{(i+1)} &= \Lambda_n^{(i)} \\
a_0^{(i+1)} &= a_n^{(i)} \\
b_0^{(i+1)} &= b_n^{(i)} \;.
\end{aligned} \tag{7}
$$

The posterior distribution for Bayesian linear regression when observing a single data set is given by
the following hyperparameter equations (→ III/1.6.2):

$$
\begin{aligned}
\mu_n &= \Lambda_n^{-1} \left( X^\mathrm{T} P y + \Lambda_0 \mu_0 \right) \\
\Lambda_n &= X^\mathrm{T} P X + \Lambda_0 \\
a_n &= a_0 + \frac{n}{2} \\
b_n &= b_0 + \frac{1}{2} \left( y^\mathrm{T} P y + \mu_0^\mathrm{T} \Lambda_0 \mu_0 - \mu_n^\mathrm{T} \Lambda_n \mu_n \right) \;.
\end{aligned} \tag{8}
$$
We can apply (8) to calculate the posterior hyperparameters after seeing the first data set:

$$
\begin{aligned}
\mu_n^{(1)} &= \left( \Lambda_n^{(1)} \right)^{-1} \left( X_1^\mathrm{T} P_1 y_1 + \Lambda_0^{(1)} \mu_0^{(1)} \right) \\
&= \left( \Lambda_n^{(1)} \right)^{-1} \left( X_1^\mathrm{T} P_1 y_1 + \Lambda_0 \mu_0 \right) \\
\Lambda_n^{(1)} &= X_1^\mathrm{T} P_1 X_1 + \Lambda_0^{(1)} \\
&= X_1^\mathrm{T} P_1 X_1 + \Lambda_0 \\
a_n^{(1)} &= a_0^{(1)} + \frac{n_1}{2} \\
&= a_0 + \frac{n_1}{2} \\
b_n^{(1)} &= b_0^{(1)} + \frac{1}{2} \left( y_1^\mathrm{T} P_1 y_1 + \mu_0^{(1)\,\mathrm{T}} \Lambda_0^{(1)} \mu_0^{(1)} - \mu_n^{(1)\,\mathrm{T}} \Lambda_n^{(1)} \mu_n^{(1)} \right) \\
&= b_0 + \frac{1}{2} \left( y_1^\mathrm{T} P_1 y_1 + \mu_0^\mathrm{T} \Lambda_0 \mu_0 - \mu_n^{(1)\,\mathrm{T}} \Lambda_n^{(1)} \mu_n^{(1)} \right) \;.
\end{aligned} \tag{9}
$$
These are the prior hyperparameters before seeing the second data set:

$$
\begin{aligned}
\mu_0^{(2)} &= \mu_n^{(1)} \\
\Lambda_0^{(2)} &= \Lambda_n^{(1)} \\
a_0^{(2)} &= a_n^{(1)} \\
b_0^{(2)} &= b_n^{(1)} \;.
\end{aligned} \tag{10}
$$

Thus, we can again use (8) to calculate the posterior hyperparameters after seeing the second data set.
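Read together, (8) and (10) define a recursive updating scheme. The following sketch checks numerically that applying the update to two data sets in sequence gives the same posterior hyperparameters as a single update on the combined data; the design matrices, coefficient values, and identity precision matrices below are hypothetical placeholders, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_update(X, y, P, mu0, Lam0, a0, b0):
    """One application of the hyperparameter equations (8)."""
    Lam_n = X.T @ P @ X + Lam0
    mu_n = np.linalg.solve(Lam_n, X.T @ P @ y + Lam0 @ mu0)
    a_n = a0 + len(y) / 2
    b_n = b0 + 0.5 * (y @ P @ y + mu0 @ Lam0 @ mu0 - mu_n @ Lam_n @ mu_n)
    return mu_n, Lam_n, a_n, b_n

# two simulated data sets from the same linear model (values arbitrary)
n1, n2, p = 30, 20, 3
X1, X2 = rng.standard_normal((n1, p)), rng.standard_normal((n2, p))
beta_true = np.array([1.0, -0.5, 2.0])
y1 = X1 @ beta_true + 0.1 * rng.standard_normal(n1)
y2 = X2 @ beta_true + 0.1 * rng.standard_normal(n2)
P1, P2 = np.eye(n1), np.eye(n2)

# prior hyperparameters (hypothetical)
mu0, Lam0, a0, b0 = np.zeros(p), np.eye(p), 1.0, 1.0

# sequential: update on data set 1, reuse the posterior as prior (eq. 10),
# then update on data set 2
post1 = posterior_update(X1, y1, P1, mu0, Lam0, a0, b0)
post12 = posterior_update(X2, y2, P2, *post1)

# joint: a single update on the stacked data
X, y = np.vstack([X1, X2]), np.concatenate([y1, y2])
post_joint = posterior_update(X, y, np.eye(n1 + n2), mu0, Lam0, a0, b0)

# sequential updating matches the joint update
for s, j in zip(post12, post_joint):
    assert np.allclose(s, j)
```

The agreement of the two computations is exactly the point of the derivation: under the normal-gamma prior, yesterday's posterior can serve as today's prior without changing the final result.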

Completing the square over β, we finally have

$$
\begin{aligned}
p(y, \beta, \tau) = \; & \sqrt{\frac{\tau^{n+p}}{(2\pi)^{n+p}} |P| |\Lambda_0|} \cdot \frac{b_0^{a_0}}{\Gamma(a_0)} \, \tau^{a_0 - 1} \exp[-b_0 \tau] \; \cdot \\
& \exp\!\left[ -\frac{\tau}{2} \left( (\beta - \mu_n)^\mathrm{T} \Lambda_n (\beta - \mu_n) + \left( y^\mathrm{T} P y + \mu_0^\mathrm{T} \Lambda_0 \mu_0 - \mu_n^\mathrm{T} \Lambda_n \mu_n \right) \right) \right]
\end{aligned} \tag{12}
$$
with the posterior hyperparameters (→ I/5.1.7)

$$
\begin{aligned}
\mu_n &= \Lambda_n^{-1} \left( X^\mathrm{T} P y + \Lambda_0 \mu_0 \right) \\
\Lambda_n &= X^\mathrm{T} P X + \Lambda_0 \;.
\end{aligned} \tag{13}
$$

Ergo, the joint likelihood is proportional to


$$
p(y, \beta, \tau) \propto \tau^{p/2} \cdot \exp\!\left[ -\frac{\tau}{2} (\beta - \mu_n)^\mathrm{T} \Lambda_n (\beta - \mu_n) \right] \cdot \tau^{a_n - 1} \cdot \exp[-b_n \tau] \tag{14}
$$
with the posterior hyperparameters (→ I/5.1.7)

$$
\begin{aligned}
a_n &= a_0 + \frac{n}{2} \\
b_n &= b_0 + \frac{1}{2} \left( y^\mathrm{T} P y + \mu_0^\mathrm{T} \Lambda_0 \mu_0 - \mu_n^\mathrm{T} \Lambda_n \mu_n \right) \;.
\end{aligned} \tag{15}
$$
From the term in (14), we can isolate the posterior distribution over β given τ :

$$p(\beta | \tau, y) = \mathcal{N}\!\left( \beta; \, \mu_n, (\tau \Lambda_n)^{-1} \right) \;. \tag{16}$$


From the remaining term, we can isolate the posterior distribution over τ :

$$p(\tau | y) = \mathrm{Gam}(\tau; \, a_n, b_n) \;. \tag{17}$$


Together, (16) and (17) constitute the joint (→ I/1.3.2) posterior distribution (→ I/5.1.7) of β and
τ.
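One way to use the factorization (16)-(17) in practice is ancestral sampling: draw τ from the gamma marginal, then β given τ from the conditional normal. The sketch below does this with hypothetical posterior hyperparameter values (not derived from any data set in the text) and checks the implied marginal moments:

```python
import numpy as np

rng = np.random.default_rng(4)

# hypothetical posterior hyperparameters
p = 2
mu_n = np.array([0.5, -1.0])
Lam_n = np.array([[4.0, 1.0],
                  [1.0, 3.0]])
a_n, b_n = 10.0, 5.0

n_draws = 200_000
# (17): tau ~ Gam(a_n, b_n); NumPy's gamma uses shape/scale, so scale = 1/b_n
tau = rng.gamma(shape=a_n, scale=1.0 / b_n, size=n_draws)
# (16): beta | tau ~ N(mu_n, (tau * Lam_n)^{-1})
L = np.linalg.cholesky(np.linalg.inv(Lam_n))
z = rng.standard_normal((n_draws, p))
beta = mu_n + (z @ L.T) / np.sqrt(tau)[:, None]

# sampled moments match the normal-gamma posterior: E[tau] = a_n / b_n,
# E[beta] = mu_n
assert abs(tau.mean() - a_n / b_n) < 0.05
assert np.allclose(beta.mean(axis=0), mu_n, atol=0.02)
```

Dividing the conditional-normal draws by the square root of τ implements the precision scaling (τΛ_n)^{-1} without refactorizing Λ_n for every draw.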


Sources:
• Bishop CM (2006): “Bayesian linear regression”; in: Pattern Recognition and Machine Learning, pp. 152–161, ex. 3.12, eq. 3.113; URL: https://www.springer.com/gp/book/9780387310732.

1.6.3 Log model evidence


Theorem: Let

$$m: \; y = X\beta + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 V) \tag{1}$$
be a linear regression model (→ III/1.5.1) with measured n × 1 data vector y, known n × p design
matrix X, known n × n covariance structure V as well as unknown p × 1 regression coefficients β
and unknown noise variance σ 2 . Moreover, assume a normal-gamma prior distribution (→ III/1.6.1)
over the model parameters β and τ = 1/σ 2 :


Sources:
• Wikipedia (2020): “Variance”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-06; URL: https://en.wikipedia.org/wiki/Variance#Basic_properties.

1.11.5 Variance of a constant


Theorem: The variance (→ I/1.11.1) of a constant (→ I/1.2.5) is zero

a = const. ⇒ Var(a) = 0 (1)


and if the variance (→ I/1.11.1) of X is zero, then X is a constant (→ I/1.2.5)

Var(X) = 0 ⇒ X = const. (2)

Proof:
1) A constant (→ I/1.2.5) is defined as a quantity that always has the same value. Thus, if understood
as a random variable (→ I/1.2.2), the expected value (→ I/1.10.1) of a constant is equal to itself:

E(a) = a . (3)
Plugged into the formula of the variance (→ I/1.11.1), we have

$$
\begin{aligned}
\mathrm{Var}(a) &= \mathrm{E}\!\left[ (a - \mathrm{E}(a))^2 \right] \\
&= \mathrm{E}\!\left[ (a - a)^2 \right] \\
&= \mathrm{E}(0) \;.
\end{aligned} \tag{4}
$$

Applied to the formula of the expected value (→ I/1.10.1), this gives

$$\mathrm{E}(0) = \sum_{x=0} x \cdot f_X(x) = 0 \cdot 1 = 0 \;. \tag{5}$$

Together, (4) and (5) imply (1).

2) The variance (→ I/1.11.1) is defined as

$$\mathrm{Var}(X) = \mathrm{E}\!\left[ (X - \mathrm{E}(X))^2 \right] \;. \tag{6}$$

Because (X − E(X))² is non-negative (→ I/1.10.4), the only way for the variance to become zero is if the squared deviation is always zero:

$$(X - \mathrm{E}(X))^2 = 0 \;. \tag{7}$$
This, in turn, requires that X is equal to its expected value (→ I/1.10.1)

$$X = \mathrm{E}(X) \tag{8}$$

which can only be the case if X always has the same value (→ I/1.2.5):

$$X = \mathrm{const.} \tag{9}$$
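Both directions of the theorem can be illustrated numerically; in this sketch (sample sizes and values are arbitrary), a finite array of draws stands in for the random variable:

```python
import numpy as np

# direction (1): a constant has zero variance
a = np.full(1000, 2.5)          # 1000 "draws" of the constant 2.5
assert np.var(a) == 0.0

# direction (2): zero variance forces a constant; a degenerate normal
# with scale 0 only ever produces its mean, so every draw equals E(X)
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.0, size=1000)
assert np.var(x) == 0.0
assert np.all(x == 2.0)
```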

• Stephan KE, Penny WD, Daunizeau J, Moran RJ, Friston KJ (2009): “Bayesian model selection for group studies”; in: NeuroImage, vol. 46, pp. 1004–1017, eq. 16; URL: https://www.sciencedirect.com/science/article/abs/pii/S1053811909002638; DOI: 10.1016/j.neuroimage.2009.03.025.
• Soch J, Allefeld C (2016): “Exceedance Probabilities for the Dirichlet Distribution”; in: arXiv stat.AP, 1611.01439; URL: https://arxiv.org/abs/1611.01439.

1.3.6 Statistical independence


Definition: Generally speaking, random variables (→ I/1.2.2) are statistically independent, if their
joint probability (→ I/1.3.2) can be expressed in terms of their marginal probabilities (→ I/1.3.3).

1) A set of discrete random variables (→ I/1.2.2) X₁, …, Xₙ with possible values 𝒳₁, …, 𝒳ₙ is called statistically independent, if

$$p(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} p(X_i = x_i) \quad \text{for all} \; x_i \in \mathcal{X}_i, \; i = 1, \ldots, n \tag{1}$$

where p(x1 , . . . , xn ) are the joint probabilities (→ I/1.3.2) of X1 , . . . , Xn and p(xi ) are the marginal
probabilities (→ I/1.3.3) of Xi .

2) A set of continuous random variables (→ I/1.2.2) X₁, …, Xₙ defined on the domains 𝒳₁, …, 𝒳ₙ is called statistically independent, if

$$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i) \quad \text{for all} \; x_i \in \mathcal{X}_i, \; i = 1, \ldots, n \tag{2}$$

or equivalently, if the probability densities (→ I/1.7.1) exist, if

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) \quad \text{for all} \; x_i \in \mathcal{X}_i, \; i = 1, \ldots, n \tag{3}$$

where F are the joint (→ I/1.5.2) or marginal (→ I/1.5.3) cumulative distribution functions (→
I/1.8.1) and f are the respective probability density functions (→ I/1.7.1).
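Definition (1) can be checked empirically. In this sketch (two simulated fair dice; the sample size is arbitrary), the relative frequency of each joint outcome factors into the product of the marginal frequencies, up to sampling noise:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

# two independent fair dice
n = 200_000
x = rng.integers(1, 7, size=n)
y = rng.integers(1, 7, size=n)

# empirical joint probability vs product of empirical marginals, eq. (1)
for i, j in product(range(1, 7), repeat=2):
    p_joint = np.mean((x == i) & (y == j))
    p_prod = np.mean(x == i) * np.mean(y == j)
    assert abs(p_joint - p_prod) < 0.01   # equal up to sampling noise
```

For dependent variables (e.g. y = x), the same check fails for most outcome pairs, which is what distinguishes a joint distribution from the product of its marginals.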

Sources:
• Wikipedia (2020): “Independence (probability theory)”; in: Wikipedia, the free encyclopedia, retrieved on 2020-06-06; URL: https://en.wikipedia.org/wiki/Independence_(probability_theory)#Definition.

1.3.7 Conditional independence


Definition: Generally speaking, random variables (→ I/1.2.2) are conditionally independent given
another random variable, if they are statistically independent (→ I/1.3.6) in their conditional prob-
ability distributions (→ I/1.5.4) given this random variable.

1) A set of discrete random variables (→ I/1.2.6) X₁, …, Xₙ with possible values 𝒳₁, …, 𝒳ₙ is called conditionally independent given the random variable Y with possible values 𝒴, if

$$p(X_1 = x_1, \ldots, X_n = x_n | Y = y) = \prod_{i=1}^{n} p(X_i = x_i | Y = y) \quad \text{for all} \; x_i \in \mathcal{X}_i \; \text{and all} \; y \in \mathcal{Y} \tag{1}$$

1) expressing the first k moments (→ I/1.18.1) of y in terms of θ

$$
\begin{aligned}
\mu_1 &= f_1(\theta_1, \ldots, \theta_k) \\
&\;\,\vdots \\
\mu_k &= f_k(\theta_1, \ldots, \theta_k) \;,
\end{aligned} \tag{1}
$$

2) calculating the first k sample moments (→ I/1.18.1) from y

$$\hat{\mu}_1(y), \ldots, \hat{\mu}_k(y) \tag{2}$$

3) and solving the system of k equations

$$
\begin{aligned}
\hat{\mu}_1(y) &= f_1(\hat{\theta}_1, \ldots, \hat{\theta}_k) \\
&\;\,\vdots \\
\hat{\mu}_k(y) &= f_k(\hat{\theta}_1, \ldots, \hat{\theta}_k)
\end{aligned} \tag{3}
$$

for θ̂₁, …, θ̂ₖ, which are subsequently referred to as “method-of-moments estimates”.
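As an illustration of the three steps (the normal model and all numbers below are a hypothetical example, not taken from the text): for y ~ N(µ, σ²), the first two moments are µ₁ = µ and µ₂ = µ² + σ², so matching them to the sample moments and solving gives the estimates directly:

```python
import numpy as np

rng = np.random.default_rng(3)

# step 1: moments in terms of theta = (mu, sigma^2):
#   mu_1 = mu,  mu_2 = mu^2 + sigma^2
# step 2: compute the first two sample moments
y = rng.normal(loc=1.5, scale=2.0, size=200_000)
m1 = np.mean(y)
m2 = np.mean(y ** 2)

# step 3: solve the two equations for the estimates
mu_hat = m1
sigma2_hat = m2 - m1 ** 2

# the estimates recover the true parameters up to sampling error
assert abs(mu_hat - 1.5) < 0.05
assert abs(sigma2_hat - 4.0) < 0.1
```

Here the system (3) happens to be solvable in closed form; for other distributions the same steps may require a numerical root finder.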

Sources:
• Wikipedia (2021): “Method of moments (statistics)”; in: Wikipedia, the free encyclopedia, retrieved on 2021-04-29; URL: https://en.wikipedia.org/wiki/Method_of_moments_(statistics)#Method.

4.2 Statistical hypotheses


4.2.1 Statistical hypothesis
Definition: A statistical hypothesis is a statement about the parameters of a distribution describing
a population from which observations can be sampled as measured data.
More precisely, let m be a generative model (→ I/5.1.1) describing measured data y in terms of a
distribution D(θ) with model parameters θ ∈ Θ. Then, a statistical hypothesis is formally specified
as

$$H: \; \theta \in \Theta^{*} \quad \text{where} \quad \Theta^{*} \subset \Theta \;. \tag{1}$$

Sources:
• Wikipedia (2021): “Statistical hypothesis testing”; in: Wikipedia, the free encyclopedia, retrieved on 2021-03-19; URL: https://en.wikipedia.org/wiki/Statistical_hypothesis_testing#Definition_of_terms.

4.2.2 Simple vs. composite


Definition: Let H be a statistical hypothesis (→ I/4.2.1). Then,
