Lecture Notes Week 2

2 Statistical models

Subjects                 Sections
Statistical model
Population               5.1.1
Histograms
QQ-plot
Location-scale family    3.5
Exponential family       3.4

The statistical sciences are concerned with answering questions or making decisions in
the face of uncertainty. Examples of such questions are
• What is the probability that a destructive tornado hits the US next year?

• Is a new medical procedure better than the older one?

• How sure are we about the predictions of a political election?


The statistics approach starts by collecting information or data (x1 , . . . , xn ) that can
help us answer the research question. These could for example be previous times between
tornadoes, the number of recovered patients in a test group and voting choices for a subgroup of
the population. The hope is that one can then use these data to understand the dynamics
of the underlying process. Unfortunately, one quickly realises that the underlying dynamics
producing the data are often too complex to fully describe. Tornado behaviour depends
on global weather conditions, medical effectiveness depends on lifestyle and election
choices depend on social structures in our society. Therefore, instead of trying to fully
understand the underlying process, we assume that the observed data is a realization of
a stochastic vector (X1 , . . . , Xn ) with unknown pdf f . The goal then becomes to try to
determine f , because if we have it we can use it to answer the research question. Again,
unfortunately f is unknown, but we can use our (limited) knowledge and the data to define
a set of possible pdf’s that f could belong to.
Definition 2.1. A statistical model for (X1 , . . . , Xn ) is a collection of probability distri-
bution functions M = {f (x | θ) | θ ∈ Θ}, where Θ is a set and θ is an indexing parameter.
A statistical model represents all probability distributions that we a priori deem pos-
sible for (X1 , . . . , Xn ). If we have very little information about the underlying process,
then the statistical model has to be larger. In fact, a model can be so large that Θ is
infinite-dimensional; for example, we could include all pdfs that are unimodal, i.e. contain
a single peak. If we have more information about the underlying process, then we can use
that to narrow down the statistical model.
Definition 2.2. A statistical model is called parametric if there exists a k ∈ N such that
Θ ⊆ Rk .
In this course we will focus on parametric models, as they can already accurately
describe a large portion of the world around us. Another thing we will typically assume
is that our data is independent and identically distributed (iid). This simplifies our situation
significantly: whenever we have more than one observation, i.e. n > 1, the pdfs f (x | θ) are
multivariate, but under the iid assumption they factor into univariate pieces.

Definition 2.3 (5.1.1). If X1 , . . . , Xn are iid with unknown pdf g, then we call X1 , . . . , Xn
a random sample from the population g.
If the underlying data generating process is iid, then the pdf splits:
    f(x_1, \dots, x_n) = \prod_{i=1}^{n} g(x_i).

To construct a statistical model it therefore suffices to specify a collection of univariate
distributions N = {g(x | θ) | θ ∈ Θ}, as it defines the model

    M = \Big\{ \prod_{i=1}^{n} g(x_i \mid \theta) \;\Big|\; \theta \in \Theta \Big\}.

From now on, we will always write f ’s to denote multivariate pdf’s and g’s to denote
univariate ones. Moreover, we will directly call N a model and abuse notation by writing
distribution names instead of pdf’s, e.g.

• {Bernoulli(p) | p ∈ [0, 1]} = {g(x | p) | p ∈ [0, 1]} = {p^x (1 − p)^{1−x} | p ∈ [0, 1]}.

• {Exponential(λ) | λ > 0} = {g(x | λ) | λ > 0} = {λe^{−λx} | λ > 0}.

2.1 Examples of statistical models


Let’s discuss some examples of simple research questions, the accompanying data and the
associated statistical model.
Example 2.4 (Coin wager). I have a coin and offer to bet you a thousand euros on
whether the next flip ends up heads. Research question: should you take the bet? To
decide whether you should accept, you might be interested in whether the coin is fair, more
specifically in how likely it is that heads turns up on the flip. To obtain an indication, I
allow you to throw the coin one hundred times to obtain data (x1 , . . . , xn ). Now, even
in such a simple setting, fully describing the underlying dynamics behind the coin flips is
impossible. The outcome of a coin toss depends on countless small factors like air pressure,
wind direction, the strength that I will use to flip the coin, the time that I decide to catch
the coin and more. Therefore we assume that the data is a realization of a stochastic
vector (X1 , . . . , Xn ) with unknown pdf f . To define a statistical model we assume that
the coin flips are independent, so that it suffices to find a univariate pdf g. A single coin
flip has only two possible outcomes, zero and one, but we have no information how likely
each outcome is, so for our model we take all possible distributions on these two points.
Verify that the model defined equals the set {Bernoulli(p) | p ∈ [0, 1]}.
Our research question now translates to whether p0 = P(X1 = 1) < 0.5. Suppose
we have observed 99 heads; then most likely p0 > 0.5, but we cannot be sure, since any
0 < p < 1 can produce the observed data. At what point should we be convinced that
p0 is indeed smaller than one half? At 49 observed heads? Maybe we want to be more
conservative: 40 heads? Formalising this procedure is called hypothesis testing, which is
one of the two main subjects that we will discuss in this course.
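
As a quick illustration of the kind of calculation involved, here is a minimal sketch in
Python (assuming scipy is available; the numbers are only illustrative): under a fair coin,
how surprising would 49 or 40 heads in one hundred flips be?

    # A minimal sketch: tail probabilities of the number of heads in 100
    # flips of a fair coin (p = 0.5), previewing hypothesis testing.
    from scipy.stats import binom

    n = 100
    for heads in (49, 40):
        tail = binom.cdf(heads, n, 0.5)  # P(number of heads <= heads)
        print(f"P(at most {heads} heads | p = 0.5) = {tail:.4f}")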
Example 2.5 (Milk sales). Suppose you own a store that sells milk. Every morning a
truck comes in bringing fresh dairy, which is put in the store freezer and sold during the
day to customers. Storing milk is expensive, because the freezer uses a lot of energy, so

you don’t want to order too much in the morning. On the other hand, you don’t want
to run out of milk too early in the day, because this upsets your customers. How much
milk should you buy every morning? There are many different possible ways to formulate
the research question. We don’t want to disappoint our customers, but we also don’t
want to have too much excess supply. One possible way to frame the research question
is to ask: “What is the minimal amount of milk I should buy such that, with 99%
certainty, no customer finds an empty store?” To answer this question we write down the
number of daily customers for three months to obtain data (x1 , . . . , xn ), which we assume
comes from a stochastic vector (X1 , . . . , Xn ) with unknown pdf f . In this example it is
far less reasonable to assume that the data generating process is iid. Surely people
buy more milk at the weekend than on a Monday; also, the amount of milk bought today
probably depends on the amount of milk bought yesterday. Nevertheless, we assume the
data has been adjusted for these effects and continue with our iid presumption. Now,
what could be a possible set of distributions for the number of sales on a single day?
To approximate customer arrival behaviour we assume that there is a large number of
potential customers who live in an area around the store, where each one has
an independent but equally small probability of entering the store on a given day. We don’t
know the number of potential customers, or how likely each of them is to come to the store,
therefore we include all possible resulting distributions. Verify that the model defined equals the
set {Binomial(k, p) | k ∈ N, p ∈ [0, 1]}.
Let m be the number of cartons of milk we buy in the morning. Then the research
question translates to determining the minimal m such that P (X1 > m) ≤ 0.01. We can
only calculate this probability if the true k0 and p0 are known. Estimating their values by
using the observed numbers of customers in the last n days is called parameter estimation,
which is the second main subject of this course.
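
To make this concrete, here is a minimal sketch in Python (assuming scipy; the values of
k0 and p0 are made up for illustration) that finds the minimal m once k0 and p0 are given:

    # A minimal sketch: the smallest m with P(X1 > m) <= 0.01 for assumed
    # true parameter values k0 and p0.
    from scipy.stats import binom

    k0, p0 = 500, 0.1                    # hypothetical true values
    m = 0
    while binom.sf(m, k0, p0) > 0.01:    # sf(m) = P(X1 > m)
        m += 1
    print(f"buy at least m = {m} cartons each morning")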
Example 2.6 (Celestial distance). Research question: A physicist wants to find the
distance µ0 between two celestial bodies. To this end, he measures this distance n times,
yielding varying results (x1 , . . . , xn ) due to equipment inaccuracy. If the measurements
are performed in a consistent manner, then it is reasonable to assume that the data is an
iid realisation of a random sample X = (X1 , . . . , Xn ) with population g. To define a
statistical model we examine the unobserved measurement errors ei = Xi − µ0 , which
are also random variables. An error can often be interpreted as the total sum of many
small independent errors. It follows by the central limit theorem that the errors are
then approximately normally distributed and thus the Xi are also approximately normally
distributed. An appropriate statistical model could therefore be {N(µ, σ²) | µ ≥ 0, σ² > 0}.
The mathematician and physicist Carl Friedrich Gauss discovered the normal distribution
exactly by trying to gain insights into this research question.
An intuitive way to estimate µ0 would be to take the average of the n measurements.
A common assumption is that errors have expectation zero, that is E(ei ) = 0. In that case
we obtain by the law of large numbers that
    \frac{1}{n} \sum_{i=1}^{n} X_i = \frac{1}{n} \sum_{i=1}^{n} (\mu_0 + e_i) = \mu_0 + \frac{1}{n} \sum_{i=1}^{n} e_i \approx \mu_0 + E(e_1) = \mu_0.
We will show later on in the course that averaging is the best way, according to some
criteria, to estimate µ0 if the Xi are truly normally distributed. However, suppose that
this is not the case and instead that the Xi are Cauchy distributed. Then their first
moment does not exist, thus the law of large numbers does not apply and hence
\frac{1}{n} \sum_{i=1}^{n} e_i does not converge to zero. The estimate in this case is
likely to be terrible.
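
The contrast is easy to see in simulation. The following minimal sketch (assuming numpy;
the true distance mu0 is made up) compares the sample mean under normal and Cauchy errors:

    # A minimal sketch: the sample mean stabilises around mu0 for normal
    # errors, but keeps fluctuating for Cauchy errors (no first moment).
    import numpy as np

    rng = np.random.default_rng(0)
    mu0 = 10.0                           # hypothetical true distance
    for n in (100, 10_000, 1_000_000):
        x_normal = mu0 + rng.normal(0.0, 1.0, n)
        x_cauchy = mu0 + rng.standard_cauchy(n)
        print(f"n={n:>9}: normal mean {x_normal.mean():.3f}, "
              f"cauchy mean {x_cauchy.mean():.3f}")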

2.2 Model validation
Throughout this course we will assume that our statistical models are correct, which means
that we assume that there is a unique (unknown) θ0 ∈ Θ such that X1 ∼ gθ0 . We have
seen in the previous example, however, that assuming a Gaussian model incorrectly can
lead to mistakes. Often we have multiple potential statistical models, none of which
is completely undisputed. In cases like these it is necessary to validate the chosen model.
This section discusses methods that give us insight into whether our chosen model is correct
or not. We assume that (x1 , . . . , xn ) is a realisation of a random vector (X1 , . . . , Xn ) of
iid random variables with pdf g and cdf G.

2.2.1 Histograms
A simple technique to get a first impression of the density g is to plot a histogram of
the data x. Let a0 < a1 < . . . < am be an even partition of the range of the xi , that is
aj − aj−1 = c is constant for 1 ≤ j ≤ m. For any y ∈ R, the histogram function hn is
defined as
    h_n(y) = \sum_{j=1}^{m} \sum_{i=1}^{n} 1_{\{a_{j-1} < y \le a_j\}}\, 1_{\{a_{j-1} < x_i \le a_j\}} = \sum_{j=1}^{m} 1_{\{a_{j-1} < y \le a_j\}} \Big( \sum_{i=1}^{n} 1_{\{a_{j-1} < x_i \le a_j\}} \Big).

That is, the histogram function counts the number of observations on each interval defined
by the partition. It can be very useful to plot both a histogram and a given density in one
figure to compare them against one another. In that case we have to rescale the histogram,
since a pdf integrates to one, while hn integrates to c × n. Therefore we define
    \tilde{h}_n(y) = \frac{1}{cn} \sum_{j=1}^{m} \sum_{i=1}^{n} 1_{\{a_{j-1} < y \le a_j\}}\, 1_{\{a_{j-1} < x_i \le a_j\}}.

If n and m are large, then the histogram can give a good approximation of the density g.
To motivate this, take a y ∈ (aj−1 , aj ]. Then, the histogram function is approximated by
    \tilde{h}_n(y) = \frac{1}{cn} \sum_{i=1}^{n} 1_{\{a_{j-1} < x_i \le a_j\}} \overset{(i)}{\approx} \frac{1}{c}\, P(a_{j-1} < X_1 \le a_j) = \frac{1}{c} \int_{a_{j-1}}^{a_j} g(x)\, dx \overset{(ii)}{\approx} g(y),

where the approximation in (i) follows from the law of large numbers, while approximation
(ii) holds true if g does not vary too much on (aj−1 , aj ]. Note that variability of g on a given
interval goes down as the width of the interval decreases, which happens as m increases.
Histograms can thus give an impression of g. Unfortunately, to make the impression
good, we need a lot of data and the right choice of c, the width of the intervals. Too many
intervals, and the histogram will contain too many peaks, which makes it hard to notice
characteristics of g. Too few intervals results in a total loss of detail, and therefore there
is little we can say about g. Hence, we usually cannot expect more from a histogram than
a first impression. Figure 1 and Figure 2 show two simulated histograms compared to
their shared true pdf, which is a Normal(185, 36), for one hundred observations. Notice
how deceptive the second histogram can be if the true density is unknown.
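
A rescaled histogram like the ones below can be produced along the following lines; this is
a minimal sketch (assuming numpy, scipy and matplotlib), not the exact code behind the figures:

    # A minimal sketch: rescaled histogram of 100 simulated observations
    # against the true Normal(185, 36) density (so sigma = 6).
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    x = rng.normal(185, 6, size=100)

    grid = np.linspace(160, 210, 400)
    plt.hist(x, bins=10, density=True)   # density=True rescales by 1/(cn)
    plt.plot(grid, norm.pdf(grid, 185, 6))
    plt.xlabel("Lengths")
    plt.show()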

[Figure 1: rescaled histogram of the simulated data with the true density; x-axis: Lengths (165–205).]

[Figure 2: rescaled histogram of the simulated data with the true density; x-axis: Lengths (165–200).]

2.2.2 QQ-plots
Suppose that you suspect that the random sample X1 , . . . , Xn has population pdf h and
cdf H. QQ-plots are a popular way to quickly check whether these suspicions might be
true, i.e. whether g = h and G = H. The idea is based on the quantiles associated with
a distribution. Let Y be a random variable with pdf g, independent of the Xi . Then by
symmetry we have that

    P(Y \le X_{(1)}) = P(X_{(1)} < Y \le X_{(2)}) = \cdots = P(X_{(n-1)} < Y \le X_{(n)}) = P(Y > X_{(n)}) = \frac{1}{n+1}.

5
It follows that the order statistics can be used as an approximation for the quantiles as
for each 1 ≤ k ≤ n we have
    P\big(Y \le X_{(k)}\big) = \frac{k}{n+1} \;\Rightarrow\; G(x_{(k)}) = P\big(Y \le x_{(k)}\big) \approx \frac{k}{n+1} \;\Rightarrow\; x_{(k)} \approx G^{-1}\!\Big(\frac{k}{n+1}\Big).

A QQ-plot, or quantile-quantile plot, is a scatter plot of the points \big( x_{(k)}, H^{-1}\big(\frac{k}{n+1}\big) \big).
If indeed G = H, then these points should all approximately lie on the y = x line of the
graph. If this is not the case, then we have an immediate visual aid that tells us that H
is not a good approximation.
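
Such a plot is straightforward to make by hand. A minimal sketch (assuming numpy, scipy
and matplotlib), checking a sample against the standard normal:

    # A minimal sketch: plot the order statistics x_(k) against the
    # standard normal quantiles H^{-1}(k / (n + 1)).
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    x = np.sort(rng.normal(0, 1, 200))   # order statistics x_(k)
    n = len(x)
    q = norm.ppf(np.arange(1, n + 1) / (n + 1))

    plt.scatter(x, q, s=8)
    plt.plot(q, q)                       # the y = x reference line
    plt.xlabel("Sample quantiles")
    plt.ylabel("Standard normal quantiles")
    plt.show()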

2.3 The location-scale family


With statistical models being defined as a set containing probability distributions, a lot of
research has been conducted on the properties of various special collections of distributions.
One intuitive, but very flexible, collection of distributions is the location-scale family.
Essentially, a location-scale family is created by taking any pdf and allowing for its graph
to shift along the x-axis, as well as contract or expand while retaining its basic shape (and
of course while still integrating to 1). A formal definition is given below.
Definition 2.7 (3.5.5). Let g(x) be any pdf. Then

(2.1)    \Big\{ g(x \mid \mu, \sigma) = \frac{1}{\sigma}\, g\!\Big(\frac{x - \mu}{\sigma}\Big) \;\Big|\; \mu \in \mathbb{R},\ \sigma > 0 \Big\}

is called the location-scale family of g.
Perhaps without realizing it, you have all been introduced already to at least one
location-scale family, namely the family of distributions given by {Normal(µ, σ²) | µ ∈ R, σ > 0}.
To convince yourself that this indeed forms a location-scale family, take the standard
normal with pdf given by

    g(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2},

and apply the transformation from (2.1).
The location-scale family introduces a very simple connection between the cumulative
distribution functions as well.
Lemma 2.8. Let g(x|µ, σ) be a member of the location-scale family of g. Then the cdf of
g(x|µ, σ) satisfies

    G(x \mid \mu, \sigma) = G\!\Big(\frac{x - \mu}{\sigma}\Big),

where G is the cdf of g.
Proof. Tutorial exercise. 

Lemma 2.9. Let Y be a random variable with cdf H, let µ ∈ R and σ > 0 and define
Yµ,σ = µ + σY . Then Yµ,σ has cdf Hµ,σ(y) = H((y − µ)/σ).
Proof. This follows immediately from calculating

    P(Y_{\mu,\sigma} \le y) = P(\mu + \sigma Y \le y) = P\!\Big(Y \le \frac{y - \mu}{\sigma}\Big) = H\!\Big(\frac{y - \mu}{\sigma}\Big).


Example 2.10. Suppose that Y ∼ N(0, 1). Then we know that µ + σY ∼ N(µ, σ²) and
thus the location-scale family of N (0, 1) is the set of all normal distributions.

Importantly, QQ-plots can be used to check whether the data generating process is a
member of a certain location-scale family. Suppose that the data is a sample drawn from
some distribution g(x|µ, σ) that is a member of the location-scale family of h with cdf H.
Then it follows that

    \frac{k}{n+1} \approx G(x_{(k)} \mid \mu, \sigma) \overset{(*)}{=} H\!\Big(\frac{x_{(k)} - \mu}{\sigma}\Big) \;\Rightarrow\; H^{-1}\!\Big(\frac{k}{n+1}\Big) \approx -\frac{\mu}{\sigma} + \frac{1}{\sigma}\, x_{(k)},

where (∗) follows from Lemma 2.8. Hence, even though the data is a sample drawn from
g(x|µ, σ) and not h(x), when plotting the points \big( x_{(k)}, H^{-1}\big(\frac{k}{n+1}\big) \big),
they should roughly follow a straight line with intercept −µ/σ and slope 1/σ. In this case
we can conclude
that the location-scale family of h is a good statistical model.
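
This also suggests a crude way to read off µ and σ from the plot: fit a straight line and
invert its slope and intercept. A minimal sketch (assuming numpy and scipy; the simulated
data are illustrative):

    # A minimal sketch: recover mu and sigma from the QQ-plot line with
    # intercept -mu/sigma and slope 1/sigma.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    x = np.sort(rng.normal(185, 6, 100))  # data from Normal(185, 36)
    n = len(x)
    q = norm.ppf(np.arange(1, n + 1) / (n + 1))

    slope, intercept = np.polyfit(x, q, 1)  # q ~ -mu/sigma + x/sigma
    sigma_hat = 1 / slope
    mu_hat = -intercept * sigma_hat
    print(f"mu approx {mu_hat:.1f}, sigma approx {sigma_hat:.1f}")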
We now have a simple graphical aid to check if the set of normal distributions is a
good statistical model for our data. If the QQ-plot of our data with the standard normal
is approximately a straight line, then that is an indication that the model is correct. We
show the QQ-plot for simulated data from the Normal(185, 36) distribution compared
to the N(0, 1) distribution in Figure 3. In Figure 4 we compare simulated data from a
Student's t distribution with three degrees of freedom to the N(0, 1) distribution.

[Figure 3: QQ plot of sample data versus standard normal; x-axis: Standard Normal Quantiles, y-axis: Quantiles of Input Sample.]

[Figure 4: QQ plot of sample data versus standard normal; x-axis: Standard Normal Quantiles, y-axis: Quantiles of Input Sample.]

2.4 The exponential family


Another important family of distributions in statistics is the exponential family.

Definition 2.11 (3.4.1). A family of pdfs or pmfs is called an exponential family if it
can be expressed as

(2.2)    g(x \mid \theta) = h(x)\, c(\theta) \exp\!\Big( \sum_{i=1}^{k} w_i(\theta)\, t_i(x) \Big),

where h(x) ≥ 0 and c(θ) ≥ 0, t1 (x), . . . , tk (x) are real-valued functions of x that do not
depend on θ, and w1 (θ), . . . , wk (θ) are real-valued functions of the parameter(s) θ.

The exponential family contains many famous probability distributions, including most
of the distributions that you studied in probability theory.
Example 2.12 (3.4.1). Let X ∼ Binomial(n, p) with pmf given by

    g(x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad 0 < p < 1.

Then g(x | n, p) is a member of the exponential family, which becomes clear upon rewriting

    g(x \mid n, p) = \binom{n}{x} p^x (1 - p)^{n - x} = \binom{n}{x} (1 - p)^n \Big( \frac{p}{1 - p} \Big)^{x} = \binom{n}{x} (1 - p)^n \exp\!\Big( \log\!\Big( \frac{p}{1 - p} \Big)\, x \Big),

such that h(x) = \binom{n}{x}, c(\theta) = (1 - p)^n, w_1(\theta) = \log\big(\frac{p}{1 - p}\big) and t_1(x) = x.
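
As a quick sanity check, the factorisation can be verified numerically; a minimal sketch in
Python (with illustrative values of n and p):

    # A minimal sketch: check the exponential-family factorisation of the
    # Binomial(n, p) pmf for all x in {0, ..., n}.
    from math import comb, exp, isclose, log

    n, p = 10, 0.3
    for x in range(n + 1):
        direct = comb(n, x) * p**x * (1 - p)**(n - x)
        factored = comb(n, x) * (1 - p)**n * exp(log(p / (1 - p)) * x)
        assert isclose(direct, factored)
    print("factorisation checked for x = 0, ..., n")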

Example 2.13 (3.4.4). Let X ∼ Normal(µ, σ²) with pdf given by

    g(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\!\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big), \quad \mu \in \mathbb{R},\ \sigma^2 \in \mathbb{R}^{+}.

Then g(x | µ, σ²) is a member of the exponential family (exercise).


One of the nice statistical properties of the exponential family is that there exist short-
cuts to derive the moments of its member distributions. However, the property that is
exploited most throughout this course is the fact that the h(x) component can be ignored
when estimating θ. Indeed, all the information relevant to the parameter that can be
extracted from the data turns out to be contained in the ti (x) functions. This often allows
for substantial data reduction without loss of information, which is the topic for next week.
When evaluating whether a specific distribution is a member of the exponential family,
it is good practice to include the support explicitly into the expression of the distribution.
For example, we know that the Exponential(λ) distribution has pdf

    g(x \mid \lambda) = \lambda e^{-\lambda x}, \quad 0 < x < \infty.

However, as (2.2) does not allow for separate inclusions of information related to the
support, i.e. the “0 < x < ∞” part, it is best to include this directly into the pdf with
the use of the indicator function:

    g(x \mid \lambda) = \lambda e^{-\lambda x}\, 1_{(0,\infty)}(x),

where

    1_A(x) = \begin{cases} 1 & x \in A, \\ 0 & x \notin A. \end{cases}
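
For instance, the Exponential(λ) pdf above fits the form (2.2) with

    h(x) = 1_{(0,\infty)}(x), \quad c(\lambda) = \lambda, \quad w_1(\lambda) = -\lambda, \quad t_1(x) = x.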
Whenever the support of the distribution does not depend on the parameter, the indicator
function related to the support will simply get absorbed into the h(x) function. However,
if the support does depend on the parameter, the indicator function will not be parameter
free. Since we cannot split the indicator function into a function h(x) that depends only
on the data and a function c(θ) that depends only on the parameter, such distributions
will in general not be members of an exponential family.

Example 2.14. Let X ∼ Binomial(k, p), with both k and p unknown. Then the pmf of X
is given by

    g(x \mid k, p) = \binom{k}{x} p^x (1 - p)^{k - x}\, 1_{\{0,1,\dots,k\}}(x).
Since the indicator function cannot be split into an h(x) and c(θ) function, nor can it be
represented by an exponential function, this is not a member of the exponential family.

