Lecture Notes Week 2
Subjects                Sections
Statistical model
Population              5.1.1
Histograms
QQ-plot
Location-scale family   3.5
Exponential family      3.4
The statistical sciences are concerned with answering questions or making decisions in
the face of uncertainty. Examples of such questions are:
- What is the probability that a destructive tornado hits the US next year?
Definition 2.3 (5.1.1). If X1 , . . . , Xn are iid with unknown pdf g, then we call X1 , . . . , Xn
a random sample from the population g.
If the underlying data generating process is iid, then the pdf splits:
\[
f(x_1, \dots, x_n) = \prod_{i=1}^{n} g(x_i).
\]
From now on, we will always write f ’s to denote multivariate pdf’s and g’s to denote
univariate ones. Moreover, we will directly call N a model and abuse notation by writing
distribution names instead of pdf’s, i.e.
\[
\{\text{Bernoulli}(p) \mid p \in [0,1]\} = \{g(x \mid p) \mid p \in [0,1]\} = \{p^x (1-p)^{1-x} \mid p \in [0,1]\}.
\]
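For instance, for an iid Bernoulli($p$) sample the joint pdf factors as
\[
f(x_1, \dots, x_n \mid p) = \prod_{i=1}^{n} p^{x_i} (1-p)^{1-x_i} = p^{\sum_{i=1}^{n} x_i} (1-p)^{n - \sum_{i=1}^{n} x_i}.
\]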
Example 2.5 (Milk sales). Research question: Suppose you own a store that sells milk. Since unsold milk spoils, you don’t want to order too much in the morning. On the other hand, you don’t want
to run out of milk too early in the day, because this upsets your customers. How much milk should you buy every morning? There are many different possible ways to formulate the research question. We don’t want to disappoint our customers, but we also don’t want to have too much excess supply. One possible way to frame the research question is to ask “What is the minimal amount of milk I should buy such that, with 99% certainty, no customer finds an empty store?” To answer this question we write down the number of daily customers for three months to obtain data $(x_1, \dots, x_n)$, which we assume comes from a stochastic vector $(X_1, \dots, X_n)$ with unknown pdf $f$. In this example it is much less reasonable to assume that the data generating process is iid. Surely people buy more milk on the weekend than on a Monday, and the amount of milk bought today probably depends on the amount bought yesterday. Nevertheless, we assume the data has been adjusted for these effects and continue with our iid presumption. Now,
what could be a possible set of distributions for the number of sales on a single day?
To approximate customer entry behaviour we assume that there is a large number of potential customers living in an area around the store, each of whom has an independent but equally small probability of entering the store on a given day. We don’t know the number of potential customers, or how likely they are to come to the store, therefore we include all possible remaining distributions. Verify that the model so defined equals the set $\{\text{Binomial}(k, p) \mid k \in \mathbb{N}, p \in [0,1]\}$.
Let m be the number of cartons of milk we buy in the morning. Then the research
question translates to determining the minimal m such that P (X1 > m) ≤ 0.01. We can
only calculate this probability if the true k0 and p0 are known. Estimating their values by
using the observed numbers of costumers in the last n days is called parameter estimation,
which is the second main subject of this course.
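As a concrete illustration, here is a minimal sketch of the computation, assuming hypothetical true values $k_0 = 500$ and $p_0 = 0.1$ (in practice these are unknown and must be estimated from the data):

```python
from scipy.stats import binom

# Hypothetical true parameter values (assumptions for illustration only;
# in reality k0 and p0 must be estimated from the observed counts).
k0, p0 = 500, 0.1

# Minimal m with P(X1 > m) <= 0.01, i.e. P(X1 <= m) >= 0.99.
# binom.ppf returns the smallest integer whose cdf is at least 0.99.
m = int(binom.ppf(0.99, k0, p0))
print(m)  # cartons of milk to stock each morning
```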
Example 2.6 (Celestial distance). Research question: A physicist wants to find the
distance µ0 between two celestial bodies. Therefore he measures this distance n times,
yielding varying results $(x_1, \dots, x_n)$ due to equipment inaccuracy. If the measurements are performed in a consistent manner, then it is reasonable to assume that the data is an iid realisation of a random sample $X = (X_1, \dots, X_n)$ with population $g$. To define a statistical model we examine the unobserved measurement errors $e_i = X_i - \mu_0$, which are also random variables. An error can often be interpreted as the total sum of many small independent errors. It follows by the central limit theorem that the errors are then approximately normally distributed, and thus the $X_i$ are also approximately normally distributed. An appropriate statistical model could therefore be $\{N(\mu, \sigma^2) \mid \mu \ge 0, \sigma^2 > 0\}$. The mathematician and physicist Carl Friedrich Gauss discovered the normal distribution precisely while trying to gain insight into this research question.
An intuitive way to estimate µ0 would be to take the average of the n measurements.
A common assumption is that errors have expectation zero, that is E(ei ) = 0. In that case
we obtain by the law of large numbers that
\[
\frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}\sum_{i=1}^{n} (\mu_0 + e_i) = \mu_0 + \frac{1}{n}\sum_{i=1}^{n} e_i \approx \mu_0 + E(e_1) = \mu_0.
\]
We will show later on in the course that averaging is the best way, according to some
criteria, to estimate µ0 if the Xi are truly normally distributed. However, suppose that
this is not the case and instead that the Xi are Cauchy distributed. Then their first
moment does not exist, thus the law of large numbers does not apply and hence $\frac{1}{n}\sum_{i=1}^{n} e_i$
does not converge to zero. The estimate in this case is likely to be terrible.
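A small simulation (a sketch; the sample sizes and seed are arbitrary choices) makes the contrast visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Compare sample means of iid standard normal and standard Cauchy errors.
for n in [100, 10_000, 1_000_000]:
    normal_mean = rng.standard_normal(n).mean()
    cauchy_mean = rng.standard_cauchy(n).mean()
    print(f"n={n:>9}: normal {normal_mean:+.4f}, cauchy {cauchy_mean:+.4f}")

# The normal means settle near zero as n grows. The Cauchy means do not:
# the mean of n iid standard Cauchy variables is itself standard Cauchy,
# no matter how large n is.
```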
2.2 Model validation
Throughout this course we will assume that our statistical models are correct, which means
that we assume that there is a unique (unknown) θ0 ∈ Θ such that X1 ∼ gθ0 . We have
seen in the previous example, however, that assuming a Gaussian model incorrectly can
lead to mistakes. Many times we have multiple potential statistical models, none of which
are completely undisputed. In cases like these it is necessary to validate the chosen model.
This section discusses methods that give us insight into whether our chosen model is correct
or not. We assume that (x1 , . . . , xn ) is a realisation of a random vector (X1 , . . . , Xn ) of
iid random variables with pdf g and cdf G.
2.2.1 Histograms
A simple technique to get a first impression from the density g is to plot a histogram of
the data x. Let a0 < a1 < . . . < am be an even partition of the range of the xi , that is
aj − aj−1 = c is constant for 1 ≤ j ≤ m. For any y ∈ R, the histogram function hn is
defined as
\[
h_n(y) = \sum_{j=1}^{m} \sum_{i=1}^{n} 1_{\{a_{j-1} < y \le a_j\}} 1_{\{a_{j-1} < x_i \le a_j\}} = \sum_{j=1}^{m} 1_{\{a_{j-1} < y \le a_j\}} \left( \sum_{i=1}^{n} 1_{\{a_{j-1} < x_i \le a_j\}} \right).
\]
That is, the histogram function counts the number of observations on each interval defined
by the partition. It can be very useful to plot both a histogram and a given density in one
figure to compare them against one another. In that case we have to rescale the histogram,
since a pdf integrates to one, while hn integrates to c × n. Therefore we define
\[
\tilde{h}_n(y) = \frac{1}{cn} \sum_{j=1}^{m} \sum_{i=1}^{n} 1_{\{a_{j-1} < y \le a_j\}} 1_{\{a_{j-1} < x_i \le a_j\}}.
\]
If n and m are large, then the histogram can give a good approximation of the density g.
To motivate this, take a y ∈ (aj−1 , aj ]. Then, the histogram function is approximated by
\[
\tilde{h}_n(y) = \frac{1}{cn} \sum_{i=1}^{n} 1_{\{a_{j-1} < x_i \le a_j\}} \overset{(i)}{\approx} \frac{1}{c} P(a_{j-1} < X_1 \le a_j) = \frac{1}{c} \int_{a_{j-1}}^{a_j} g(x)\,dx \overset{(ii)}{\approx} g(y),
\]
where the approximation in (i) follows from the law of large numbers, while approximation
(ii) holds true if g does not vary too much on (aj−1 , aj ]. Note that variability of g on a given
interval goes down as the width of the interval decreases, which happens as m increases.
Histograms can thus give an impression of g. Unfortunately, to make the impression
good, we need a lot of data and the right choice of c, the width of the intervals. Too many
intervals, and the histogram will contain too many peaks, which makes it hard to notice characteristics of g. Too few intervals results in a total loss of detail, and therefore there
is little we can say about g. Hence, we usually cannot expect more from a histogram than
a first impression. Figure 1 and Figure 2 show two simulated histograms compared to
their shared true pdf, which is a Normal(185, 36), for one hundred observations. Notice
how deceptive the second histogram can be if the true density is unknown.
[Figure 1; x-axis: Lengths]
[Figure 2; x-axis: Lengths]
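Figures like these can be produced along the following lines (a sketch assuming the Normal(185, 36) above, so standard deviation 6; matplotlib's density=True performs the $1/(cn)$ rescaling of $\tilde{h}_n$ when the bins have equal width):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=185, scale=6, size=100)  # Normal(185, 36) sample

# density=True divides the counts by c*n, so the bars match a pdf's scale.
plt.hist(x, bins=15, density=True, alpha=0.5, label="rescaled histogram")

grid = np.linspace(165, 205, 400)
plt.plot(grid, norm.pdf(grid, loc=185, scale=6), label="true pdf")
plt.xlabel("Lengths")
plt.legend()
plt.show()
```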
2.2.2 QQ-plots
Suppose that you suspect that the random sample X1 , . . . , Xn has population pdf h and
cdf H. QQ-plots are a popular way to quickly check whether these suspicions might be
true, i.e. whether g = h and G = H. The idea is based on the quantiles associated with
a distribution. Let $Y$ be a random variable with distribution $g$, independent of the sample. Then by symmetry we have that
\[
P\left(Y \le X_{(k)}\right) = \frac{k}{n+1},
\]
since among the $n+1$ exchangeable random variables $Y, X_1, \dots, X_n$, the variable $Y$ is equally likely to occupy each of the $n+1$ rank positions.
It follows that the order statistics can be used as an approximation for the quantiles as
for each 1 ≤ k ≤ n we have
\[
P\left(Y \le X_{(k)}\right) = \frac{k}{n+1} \;\Rightarrow\; G(x_{(k)}) = P\left(Y \le x_{(k)}\right) \approx \frac{k}{n+1} \;\Rightarrow\; x_{(k)} \approx G^{-1}\!\left(\frac{k}{n+1}\right).
\]
A QQ-plot, or quantile-quantile plot, is a scatter plot of the points $\left(x_{(k)}, H^{-1}\!\left(\frac{k}{n+1}\right)\right)$.
If indeed G = H, then these points should all approximately lie on the y = x line of the
graph. If this is not the case, then we have an immediate visual aid that tells us that H
is not a good approximation.
Definition 2.8 (3.5). Let $h$ be a pdf with cdf $H$. The location-scale family of $h$ is the set of pdfs
\[
\left\{ h_{\mu,\sigma}(x) = \frac{1}{\sigma} h\!\left(\frac{x-\mu}{\sigma}\right) \;\middle|\; \mu \in \mathbb{R},\ \sigma > 0 \right\},
\]
with corresponding cdfs $H_{\mu,\sigma}(y) = H\!\left(\frac{y-\mu}{\sigma}\right)$.
Lemma 2.9. Let $Y$ be a random variable with cdf $H$, let $\mu \in \mathbb{R}$ and $\sigma > 0$ and define $Y_{\mu,\sigma} = \mu + \sigma Y$. Then $Y_{\mu,\sigma}$ has cdf $H_{\mu,\sigma}$.
Proof. This follows immediately from calculating
\[
P(Y_{\mu,\sigma} \le y) = P(\mu + \sigma Y \le y) = P\!\left(Y \le \frac{y-\mu}{\sigma}\right) = H\!\left(\frac{y-\mu}{\sigma}\right).
\]
Example 2.10. Suppose that Y ∼ N (0, 1). Then we know that µ + σY ∼ N (µ, σ 2 ) and
thus the location-scale family of N (0, 1) is the set of all normal distributions.
Importantly, QQ-plots can be used to check whether the data generating process is a
member of a certain location-scale family. Suppose that the data is a sample drawn from
some distribution $g(x \mid \mu, \sigma)$ that is a member of the location-scale family of $h$, with cdf $H$.
Then, it follows that
\[
\frac{k}{n+1} \approx G(x_{(k)} \mid \mu, \sigma) \overset{(*)}{=} H\!\left(\frac{x_{(k)}-\mu}{\sigma}\right) \;\Rightarrow\; H^{-1}\!\left(\frac{k}{n+1}\right) \approx -\frac{\mu}{\sigma} + \frac{1}{\sigma} x_{(k)},
\]
where $(*)$ follows from Lemma 2.9. Hence, even though the data is a sample drawn from $g(x \mid \mu, \sigma)$ and not $h(x)$, when plotting the points $\left(x_{(k)}, H^{-1}\!\left(\frac{k}{n+1}\right)\right)$, they should roughly
follow a straight line with intercept −µ/σ and slope 1/σ. In this case we can conclude
that the location-scale family of h is a good statistical model.
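This suggests a simple recipe, sketched below: fit a least-squares line through the QQ-points and recover estimates of $\mu$ and $\sigma$ from the fitted intercept $-\mu/\sigma$ and slope $1/\sigma$ (the simulated data and seed are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.sort(rng.normal(loc=185, scale=6, size=100))  # order statistics x_(k)
n = x.size

# Standard normal quantiles H^{-1}(k/(n+1)) for k = 1, ..., n.
q = norm.ppf(np.arange(1, n + 1) / (n + 1))

# Fit q ~ a + b * x through the QQ-points (x_(k), q_k).
b, a = np.polyfit(x, q, deg=1)  # slope first, then intercept

sigma_hat = 1 / b    # slope is 1/sigma
mu_hat = -a / b      # intercept is -mu/sigma
print(mu_hat, sigma_hat)  # should land near 185 and 6
```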
We now have a simple graphical aid to check if the set of normal distributions is a
good statistical model for our data. If the QQ-plot of our data with the standard normal
is approximately a straight line, then that is an indication that the model is correct. We
show the QQ-plot for simulated data from the Normal(185, 36) distribution compared
to the N(0, 1) distribution in Figure 3. In Figure 4 we compare simulated data from a Student's t distribution with three degrees of freedom to the N(0, 1) distribution.
[Figure 3: QQ plot of sample data versus standard normal; x-axis: Standard Normal Quantiles]
[Figure 4: QQ plot; x-axis: Standard Normal Quantiles, y-axis: Quantiles of Input Sample]
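Plots in the style of Figures 3 and 4 can be generated as follows (a sketch; following the figures, the theoretical quantiles are put on the horizontal axis):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(3)
n = 100
q = norm.ppf(np.arange(1, n + 1) / (n + 1))  # H^{-1}(k/(n+1))

fig, (ax1, ax2) = plt.subplots(1, 2)

# Figure 3 style: normal data, so the points hug a straight line.
ax1.scatter(q, np.sort(rng.normal(185, 6, n)))
ax1.set_xlabel("Standard Normal Quantiles")

# Figure 4 style: Student's t with 3 df, heavy tails bend off the line.
ax2.scatter(q, np.sort(rng.standard_t(3, n)))
ax2.set_xlabel("Standard Normal Quantiles")
ax2.set_ylabel("Quantiles of Input Sample")

plt.show()
```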
2.3 Exponential family
Definition 2.11 (3.4). A family of pdfs or pmfs $\{g(x \mid \theta) \mid \theta \in \Theta\}$ is called an exponential family if its members can be written as
\[
g(x \mid \theta) = h(x)\, c(\theta) \exp\!\left( \sum_{i=1}^{k} w_i(\theta) t_i(x) \right), \tag{2.2}
\]
where $h(x), c(\theta) \ge 0$, $t_1(x), \dots, t_k(x)$ are real-valued functions of $x$ that do not depend on $\theta$, and $w_1(\theta), \dots, w_k(\theta)$ are real-valued functions of the parameter(s) $\theta$.
The exponential family contains many famous probability distributions, including most
of the distributions that you studied in probability theory.
Example 2.12 (3.4.1). Let X ∼ Binomial(n, p) with pdf given by
\[
g(x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x}, \quad 0 < p < 1.
\]
Then, g(x|n, p) is a member of the exponential family, which becomes clear upon rewriting
\[
g(x \mid n, p) = \binom{n}{x} p^x (1-p)^{n-x} = \binom{n}{x} (1-p)^n \left( \frac{p}{1-p} \right)^{x} = \binom{n}{x} (1-p)^n \exp\!\left( \log\!\left( \frac{p}{1-p} \right) x \right),
\]
such that $h(x) = \binom{n}{x}$, $c(\theta) = (1-p)^n$, $w_1(\theta) = \log \frac{p}{1-p}$, and $t_1(x) = x$.
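The decomposition is easy to verify numerically; a quick sketch (with arbitrary values n = 10 and p = 0.3):

```python
import numpy as np
from scipy.stats import binom
from scipy.special import comb

n, p = 10, 0.3
x = np.arange(0, n + 1)

# Pieces of the exponential-family form h(x) * c(theta) * exp(w1(theta) * t1(x)).
h = comb(n, x)               # h(x) = C(n, x)
c = (1 - p) ** n             # c(theta) = (1 - p)^n
w1 = np.log(p / (1 - p))     # w1(theta) = log(p / (1 - p))
t1 = x                       # t1(x) = x

# The product recovers the Binomial(n, p) pmf.
print(np.allclose(h * c * np.exp(w1 * t1), binom.pmf(x, n, p)))  # True
```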
Example 2.13. In the same way one can verify that the normal distribution, with pdf
\[
g(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right), \quad \mu \in \mathbb{R}, \text{ and } \sigma^2 \in \mathbb{R}^+,
\]
is a member of the exponential family.
However, as (2.2) does not allow for separate inclusion of information related to the support, i.e. the “$0 < x < \infty$” part, it is best to include this directly in the pdf with the use of the indicator function
\[
1_A(x) = \begin{cases} 1 & x \in A, \\ 0 & x \notin A. \end{cases}
\]
Whenever the support of the distribution does not depend on the parameter, the indicator
function related to the support will simply get absorbed into the h(x) function. However,
if the support does depend on the parameter, the indicator function will not be parameter
free. Since we cannot split the indicator function into a function h(x) that depends only
on the data and a function c(θ) that depends only on the parameter, such distributions
will in general not be members of an exponential family.
Example 2.14. Let X ∼ Binomial(k, p), with both k and p unknown. Then the pdf of X
is given by
\[
f(x \mid k, p) = \binom{k}{x} p^x (1-p)^{k-x} 1_{\{0,1,\dots,k\}}(x).
\]
Since the indicator function cannot be split into an h(x) and c(θ) function, nor can it be
represented by an exponential function, this is not a member of the exponential family.