
Samples and Statistics

Paolo Zacchia

Probability and Statistics

Lecture 4
Samples

Statistical analysis is based on data samples collected from a population of interest.

Definition 1
Sample. A sample is a collection $\{x_i\}_{i=1}^N$ of realizations of some $N$ random vectors $\{x_i\}_{i=1}^N$ associated with some population of interest. Each unit of this population is typically called a unit of observation and its associated realization $x_i$ is identified by a unique subscript $i$.

Note: if the data are collected from $N$ random variables, the sample is written as $\{x_i\}_{i=1}^N$; if from $N$ random matrices, as $\{X_i\}_{i=1}^N$.

Definition 2
Sample size. The dimension N of a sample is called size.
Random samples
Definition 3
Random sample. A sample is random if its realizations are drawn from independent and identically distributed (i.i.d.) random vectors $\{x_i\}_{i=1}^N$ (or variables, or matrices).

Random samples are typically thought to be the product of sampling with replacement from a population that follows a distribution that is described by a random vector $x$.

Importantly, not all samples are random.

Definition 4
Non-random sample. A sample is said to be non-random if the realizations that compose it are not drawn from i.i.d. random variables, vectors, or matrices. Instead, these may be:
• independent and not identically distributed (i.n.i.d.);
• not independent and identically distributed (n.i.i.d.);
• not independent, not identically distributed (n.i.n.i.d.).
Samples and Statistics
Although non-random samples are common (they are ubiquitous, say, in econometrics), random samples are an important benchmark. In fact, the i.i.d. property lets one express the joint distribution of the sample as:
$$
f_{x_1,\ldots,x_N}(x_1,\ldots,x_N;\theta) = f_x(x_1;\theta)\times\cdots\times f_x(x_N;\theta) = \prod_{i=1}^N f_x(x_i;\theta)
$$
which is the distribution of an $NK$-long random vector $(x_1,\ldots,x_N)$.


With random samples it is easier to study the distribution of statistics.

Definition 5
Statistic. A function of the N random variables, vectors or matrices
that are specific to each i-th unit of observation and that generate a
sample is called a statistic. Any statistic is itself a random variable,
vector or matrix.

Definition 6
Sampling distribution. The probability distribution of a statistic is
called its sampling distribution.
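To make the idea of a sampling distribution concrete, here is a minimal simulation sketch (assuming NumPy is available; the normal population and all numerical values are arbitrary choices): many random samples are drawn, the sample mean is computed on each, and the collection of realized means approximates the sampling distribution of that statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 30                  # sample size (assumed for illustration)
R = 10_000              # number of simulated samples
mu, sigma = 2.0, 1.5    # population parameters (assumed)

# Each row is one random sample of size N drawn i.i.d. from N(mu, sigma^2).
samples = rng.normal(mu, sigma, size=(R, N))

# The statistic computed on every sample: its sampling distribution is
# approximated by the empirical distribution of these R realizations.
sample_means = samples.mean(axis=1)

print("mean of sample means:", sample_means.mean())        # close to mu
print("std  of sample means:", sample_means.std(ddof=1))   # close to sigma/sqrt(N)
```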
Sample mean
The two most common and important sample statistics are defined next.

Definition 7
Sample mean. In samples derived from random vectors, the sample mean is a vector-valued statistic usually denoted as $\bar{x}$ and defined as follows.
$$
\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i
$$
This definition can be reduced to samples that are drawn from univariate random variables, in which case the usual notation is $\bar{X}$:
$$
\bar{X} = \frac{1}{N}\sum_{i=1}^N X_i
$$
or extended to samples drawn from random matrices, where one can write $\bar{X}$ and the definition is again analogous.
Sample variance-covariance
Definition 8
Sample variance-covariance. In samples collected from random vectors, the sample variance-covariance is a matrix-valued statistic usually denoted by $S$ and defined as follows.
$$
S = \frac{1}{N-1}\sum_{i=1}^N (x_i-\bar{x})(x_i-\bar{x})^T
$$
In samples from univariate random variables, this statistic is simply called the sample variance, its associated notation is $S^2$, and it is a scalar.
$$
S^2 = \frac{1}{N-1}\sum_{i=1}^N \left(X_i-\bar{X}\right)^2
$$
In this specific case, the square root of the sample variance is written $S=\sqrt{S^2}$ and is called the sample standard deviation. In order to extend this definition to sampling from random matrices it is necessary to work with three-dimensional arrays.
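As a numerical illustration of Definitions 7 and 8 (a sketch assuming NumPy; the dimension, sample size, and simulated data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

N, K = 200, 3                      # sample size and vector dimension (assumed)
X = rng.normal(size=(N, K))        # rows are realizations x_i of a K-dimensional vector

x_bar = X.mean(axis=0)             # sample mean, a K-vector

# Sample variance-covariance with the 1/(N-1) normalization of Definition 8.
deviations = X - x_bar
S = deviations.T @ deviations / (N - 1)

# np.cov uses the same (N-1) convention when rowvar=False, so it should agree.
assert np.allclose(S, np.cov(X, rowvar=False))
print("sample mean:\n", x_bar)
print("sample variance-covariance:\n", S)
```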
Properties of key sample statistics, I (1/3)
In what follows, some key results about the sample mean and sample variance-covariance are presented. They are crucial for the derivation of the moments and, sometimes, the distribution of these statistics.

These properties are collected in three “cumulative” theorems.

Theorem 1
Properties of simple sample statistics (1). Consider a sample $\{x_i\}_{i=1}^N$, its sample mean $\bar{x}$, and its sample variance-covariance $S$. The following two properties are true:
a. $\bar{x} = \arg\min_{a\in\mathbb{R}^K} \sum_{i=1}^N (x_i-a)^T(x_i-a)$;
b. $(N-1)\,S = \sum_{i=1}^N x_i x_i^T - N\cdot\bar{x}\bar{x}^T$.

Proof.
(Continues. . . )
Properties of key sample statistics, I (2/3)
Theorem 1
Proof.
(Continued.) To show point a. note that:
$$
\begin{aligned}
\sum_{i=1}^N (x_i-a)^T(x_i-a) &= \sum_{i=1}^N (x_i-\bar{x}+\bar{x}-a)^T(x_i-\bar{x}+\bar{x}-a) \\
&= \sum_{i=1}^N (x_i-\bar{x})^T(x_i-\bar{x}) + \sum_{i=1}^N (\bar{x}-a)^T(\bar{x}-a) \\
&\quad + \sum_{i=1}^N (x_i-\bar{x})^T(\bar{x}-a) + \sum_{i=1}^N (\bar{x}-a)^T(x_i-\bar{x}) \\
&= \sum_{i=1}^N (x_i-\bar{x})^T(x_i-\bar{x}) + \sum_{i=1}^N (\bar{x}-a)^T(\bar{x}-a)
\end{aligned}
$$
where the two cross terms are both equal to zero by definition of the sample mean; in the last line, the first term does not depend on $a$ while the second is minimized at $a=\bar{x}$. (Continues. . . )
Properties of key sample statistics, I (3/3)

Theorem 1
Proof.
(Continued.) To show b. simply note that:
$$
\begin{aligned}
\sum_{i=1}^N (x_i-\bar{x})(x_i-\bar{x})^T &= \sum_{i=1}^N x_i x_i^T - \sum_{i=1}^N x_i\bar{x}^T - \sum_{i=1}^N \bar{x}x_i^T + N\cdot\bar{x}\bar{x}^T \\
&= \sum_{i=1}^N x_i x_i^T - N\cdot\bar{x}\bar{x}^T
\end{aligned}
$$
and the result again follows from the definition of the sample mean.
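A quick numerical sanity check of point b. (a sketch assuming NumPy; the data are randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 50, 4
X = rng.normal(size=(N, K))        # rows x_i
x_bar = X.mean(axis=0)

# Left-hand side: (N - 1) S, computed from deviations around the sample mean.
lhs = (X - x_bar).T @ (X - x_bar)

# Right-hand side: sum of x_i x_i^T minus N times the outer product of x_bar.
rhs = X.T @ X - N * np.outer(x_bar, x_bar)

assert np.allclose(lhs, rhs)
print("Theorem 1, point b. verified numerically.")
```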
Properties of key sample statistics, II (1/3)
Theorem 2
Properties of simple sample statistics (2). Consider a random sample $\{x_i\}_{i=1}^N$ drawn from a random vector $x$, a transformation of this vector $y=g(x)$, and suppose that all the moments expressed in the mean vector $E[y]$ and in the variance-covariance matrix $\mathrm{Var}[y]$ are defined. The following two properties are true:
a. $E\left[\sum_{i=1}^N y_i\right] = N\cdot E[y_i]$;
b. $\mathrm{Var}\left[\sum_{i=1}^N y_i\right] = N\cdot \mathrm{Var}[y_i]$.

Proof.
To show a. simply observe that:
$$
E\left[\sum_{i=1}^N y_i\right] = \sum_{i=1}^N E[y_i] = N\cdot E[y_i]
$$
which follows from the linear properties of expectations and from the moments of $y_i$ for $i=1,\ldots,N$ being identical. (Continues. . . )
Properties of key sample statistics, II (2/3)
Theorem 2
Proof.
(Continued.) The demonstration of b. is as follows.
$$
\begin{aligned}
\mathrm{Var}\left[\sum_{i=1}^N y_i\right] &= E\left[\left(\sum_{i=1}^N y_i - E\left[\sum_{i=1}^N y_i\right]\right)\left(\sum_{i=1}^N y_i - E\left[\sum_{i=1}^N y_i\right]\right)^T\right] \\
&= E\left[\left(\sum_{i=1}^N \left(y_i - E[y_i]\right)\right)\left(\sum_{i=1}^N \left(y_i - E[y_i]\right)\right)^T\right] \\
&= E\left[\sum_{i=1}^N \left(y_i - E[y_i]\right)\left(y_i - E[y_i]\right)^T\right] \\
&= \sum_{i=1}^N E\left[\left(y_i - E[y_i]\right)\left(y_i - E[y_i]\right)^T\right] \\
&= N\cdot\mathrm{Var}[y_i]
\end{aligned}
$$
(Continues. . . )
Properties of key sample statistics, II (3/3)

Theorem 2
Proof.
(Continued.) In the above derivation for b. the first line is just the definition of variance for $\sum_{i=1}^N y_i$, the second line applies the linear properties of expectations while also rearranging terms, and the third line rearranges terms again after observing that, for $i\neq j$:
$$
E\left[\left(y_i - E[y_i]\right)\left(y_j - E[y_j]\right)^T\right] = 0
$$
which follows from the independence of the realizations in the random sample. The fourth line is another application of the linear properties of expectations, while the fifth line again exploits the fact that all the realizations follow from identically distributed random variables.

Note: independence (from samples being i.i.d.) is used to prove point b. but not point a. of the theorem.
Properties of key sample statistics, III (1/3)
Theorem 3
Properties of simple sample statistics (3). Consider a random sample $\{x_i\}_{i=1}^N$ drawn from a random vector $x$ whose mean vector is $E[x]$ and whose variance-covariance matrix is $\mathrm{Var}[x]<\infty$ (finite). The following three properties are true:
a. $E[\bar{x}] = E[x]$;
b. $\mathrm{Var}[\bar{x}] = \mathrm{Var}[x]/N$;
c. $E[S] = \mathrm{Var}[x]$.

Proof.
To show a. it is sufficient to apply Theorem 2, point a. for $y=x$:
$$
E[\bar{x}] = E\left[\frac{1}{N}\sum_{i=1}^N x_i\right] = \frac{1}{N}\sum_{i=1}^N E[x_i] = \frac{1}{N}\cdot N\cdot E[x_i] = E[x]
$$
(Continues. . . )
Properties of key sample statistics, III (2/3)

Theorem 3
Proof.
(Continued.) Point b. proceeds similarly.
$$
\mathrm{Var}[\bar{x}] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^N x_i\right] = \frac{1}{N^2}\sum_{i=1}^N \mathrm{Var}[x_i] = \frac{1}{N^2}\cdot N\cdot \mathrm{Var}[x_i] = \frac{\mathrm{Var}[x]}{N}
$$
(Continues. . . )
Properties of key sample statistics, III (3/3)
Theorem 3
Proof.
(Continued.) The proof of point c. is as follows:
$$
\begin{aligned}
E[S] &= E\left[\frac{1}{N-1}\left(\sum_{i=1}^N x_i x_i^T - N\cdot\bar{x}\bar{x}^T\right)\right] \\
&= \frac{1}{N-1}\left(\sum_{i=1}^N E\left[x_i x_i^T\right] - N\cdot E\left[\bar{x}\bar{x}^T\right]\right) \\
&= \frac{1}{N-1}\left(N\cdot\mathrm{Var}[x_i] - N\cdot\mathrm{Var}[\bar{x}]\right) \\
&= \frac{N}{N-1}\left(1-\frac{1}{N}\right)\mathrm{Var}[x] \\
&= \mathrm{Var}[x]
\end{aligned}
$$
where the third line follows after adding and subtracting $N\cdot E[x]\,E[x]^T$.
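These three properties can be checked by simulation; the sketch below (assuming NumPy; the population parameters, sample size, and number of replications are arbitrary choices) averages $\bar{X}$ and $S^2$ over many univariate samples.

```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = 5.0, 2.0     # population mean and standard deviation (assumed)
N, R = 25, 20_000        # sample size and number of replications (assumed)

samples = rng.normal(mu, sigma, size=(R, N))
means = samples.mean(axis=1)
variances = samples.var(axis=1, ddof=1)   # ddof=1 gives the 1/(N-1) estimator

print("E[X_bar]   ~", means.mean(),      "(theory:", mu, ")")
print("Var[X_bar] ~", means.var(ddof=1), "(theory:", sigma**2 / N, ")")
print("E[S^2]     ~", variances.mean(),  "(theory:", sigma**2, ")")
```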
Normal sampling
• While performing statistical estimation and inference, it is often useful to know the exact sampling distribution of selected statistics like $\bar{x}$ and $S$.
• This is usually possible only in selected cases, like the one where the sample is drawn from a normal distribution.
• Let sample $\{x_i\}_{i=1}^N$ be drawn from $X\sim N\left(\mu,\sigma^2\right)$; then:
$$
\bar{X} \sim N\left(\mu, \frac{\sigma^2}{N}\right)
$$
follows from the property of (independent) normal r.v.s.
• An equivalent, convenient formulation also follows.
$$
\sqrt{N}\,\frac{\bar{X}-\mu}{\sigma} \sim N(0,1)
$$
The t-statistic
• The “standardized” statistic $\sqrt{N}\left(\bar{X}-\mu\right)/\sigma$ is appealing for the sake of testing hypotheses about $\mu$.
• However, it has a shortcoming: usually, $\sigma$ is unknown.
• Intuitively, one could replace $\sigma$ with the sample standard deviation $S$. This gives rise to the following statistic.

Definition 9
The t-statistic. Given a univariate sample $\{x_i\}_{i=1}^N$ of size $N$ drawn from a sequence of random variables $X_1,\ldots,X_N$, a t-statistic is defined as the following quantity:
$$
t = \sqrt{N}\,\frac{\bar{X}-\mu}{S}
$$
where $\bar{X}$ is the sample mean whose expectation is $\mu = E\left[\bar{X}\right]$, and $S$ is the sample standard deviation.
Properties of normal sampling (1/6)
The exact sampling distribution of the t-statistic is well known, but deriving it requires a few steps.

Theorem 4
Sampling from the Normal Distribution. Consider a random sample $\{x_i\}_{i=1}^N$ drawn from a random variable following the normal distribution $X\sim N\left(\mu,\sigma^2\right)$, and the random variables corresponding to the two sample statistics $\bar{X}$ and $S^2$. The following three properties are true:
a. $\bar{X}$ and $S^2$ are independent;
b. $\bar{X}\sim N\left(\mu,\sigma^2/N\right)$;
c. $(N-1)\,S^2/\sigma^2 \sim \chi^2_{N-1}$.

Proof.
Point b. is straightforward, point c. is quite easy to show, but point a. requires some more effort. (Continues. . . )
Properties of normal sampling (2/6)
Theorem 4
Proof.
(Continued.) Start with the observation that the sample variance can be expressed in terms of only $N-1$ of the original random variables:
$$
\begin{aligned}
S^2 &= \frac{1}{N-1}\sum_{i=1}^N \left(X_i-\bar{X}\right)^2 \\
&= \frac{1}{N-1}\left[\left(X_1-\bar{X}\right)^2 + \sum_{i=2}^N \left(X_i-\bar{X}\right)^2\right] \\
&= \frac{1}{N-1}\left[\left(\sum_{i=2}^N \left(X_i-\bar{X}\right)\right)^2 + \sum_{i=2}^N \left(X_i-\bar{X}\right)^2\right]
\end{aligned}
$$
where the last line follows from $\sum_{i=1}^N \left(X_i-\bar{X}\right) = 0$. Hence, proving that the sample mean is independent of the sample variance amounts to showing that it is independent of $N-1$ out of $N$ normally distributed random variables, say $X_2-\bar{X},\ldots,X_N-\bar{X}$. (Continues. . . )
Properties of normal sampling (3/6)
Theorem 4
Proof.
(Continued.) Work with the standardization $Z_i = \left(X_i-\mu\right)/\sigma$ for $i=1,\ldots,N$, and let $z = \left(Z_1,\ldots,Z_N\right)$. Define the following random vector $\tilde{z}$ of length $N$ as a function of $z$.
$$
\tilde{z} = \begin{pmatrix} \bar{Z} \\ \tilde{Z}_2 \\ \vdots \\ \tilde{Z}_N \end{pmatrix}
= \begin{pmatrix} \bar{Z} \\ Z_2-\bar{Z} \\ \vdots \\ Z_N-\bar{Z} \end{pmatrix}
= \begin{pmatrix}
N^{-1} & N^{-1} & \ldots & N^{-1} \\
-N^{-1} & 1-N^{-1} & \ldots & -N^{-1} \\
\vdots & \vdots & \ddots & \vdots \\
-N^{-1} & -N^{-1} & \ldots & 1-N^{-1}
\end{pmatrix} z
$$
One shall show that $\tilde{z}$ is composed of independent random variables:
• this would show that $\bar{Z}$ is independent of $Z_2-\bar{Z},\ldots,Z_N-\bar{Z}$;
• so, $\bar{X}$ would be independent of $X_2-\bar{X},\ldots,X_N-\bar{X}$ (and of $S^2$).
The above transformation is linear and its Jacobian has determinant $1/N$, thus it is invertible. By the properties of linear transformations, its inverse has a Jacobian with determinant $N$. (Continues. . . )
Properties of normal sampling (4/6)
Theorem 4
Proof.
(Continued.) The joint p.d.f. of $\tilde{z}$ obtains directly from that of $z$:
$$
\begin{aligned}
f_{\tilde{z}}\left(\bar{z},\tilde{z}_2,\ldots,\tilde{z}_N\right) &= \frac{N}{\sqrt{(2\pi)^N}}\exp\left[-\frac{1}{2}\left(\bar{z}-\sum_{i=2}^N \tilde{z}_i\right)^2 - \frac{1}{2}\sum_{i=2}^N \left(\bar{z}+\tilde{z}_i\right)^2\right] \\
&= \sqrt{\frac{N}{2\pi}}\exp\left(-\frac{N\bar{z}^2}{2}\right)\times \\
&\quad\times \sqrt{\frac{N}{(2\pi)^{N-1}}}\exp\left[-\frac{1}{2}\left(\sum_{i=2}^N \tilde{z}_i\right)^2 - \frac{1}{2}\sum_{i=2}^N \tilde{z}_i^2\right] \\
&= f_{\bar{Z}}\left(\bar{z}\right)\cdot f_{\tilde{z}_{-1}}\left(\tilde{z}_2,\ldots,\tilde{z}_N\right)
\end{aligned}
$$
and it can be clearly decomposed into the product of two components: the p.d.f. of $\bar{Z}$ and that of all the other elements of $\tilde{z}$. Therefore, $\bar{Z}$ is independent of $Z_2-\bar{Z},\ldots,Z_N-\bar{Z}$: a. is proved. (Continues. . . )
Properties of normal sampling (5/6)
Theorem 4
Proof.
(Continued.) Moving to the other points, b. as argued is obvious, while to demonstrate point c. it is easiest to proceed as follows.
$$
\begin{aligned}
(N-1)\,\frac{S^2}{\sigma^2} &= \sum_{i=1}^N \frac{\left(X_i-\bar{X}\right)^2}{\sigma^2} \\
&= \sum_{i=1}^N \frac{\left(X_i-\mu+\mu-\bar{X}\right)^2}{\sigma^2} \\
&= \sum_{i=1}^N \frac{\left(X_i-\mu\right)^2}{\sigma^2} + \frac{N\left(\bar{X}-\mu\right)^2}{\sigma^2} - 2\left(\bar{X}-\mu\right)\sum_{i=1}^N \frac{X_i-\mu}{\sigma^2} \\
&= \sum_{i=1}^N \left(\frac{X_i-\mu}{\sigma}\right)^2 - \left(\sqrt{N}\,\frac{\bar{X}-\mu}{\sigma}\right)^2
\end{aligned}
$$
(Continues. . . )
Properties of normal sampling (6/6)
Theorem 4
Proof.
(Continued.) The statistic $(N-1)\,S^2/\sigma^2$ is thus shown to be the sum of the squares of $N$ independent random variables – all of which follow the standard normal distribution – minus the square of another random variable that also follows the standard normal distribution. By the result in a. the latter is independent of the former. Note that, using m.g.f.s:
$$
M_{\bar{\bar{Z}}^2}(t)\, M_{(N-1)S^2/\sigma^2}(t) = \prod_{i=1}^N M_{Z_i^2}(t)
$$
where $\bar{\bar{Z}} \equiv \sqrt{N}\left(\bar{X}-\mu\right)/\sigma$ and $Z_i \equiv \left(X_i-\mu\right)/\sigma$, or equivalently:
$$
M_{(N-1)S^2/\sigma^2}(t) = \frac{1}{M_{\bar{\bar{Z}}^2}(t)}\prod_{i=1}^N M_{Z_i^2}(t) = (1-2t)^{-\frac{1}{2}(N-1)}
$$
following since $\bar{\bar{Z}}^2, Z_i^2 \sim \chi^2_1$. Therefore, $(N-1)\,S^2/\sigma^2 \sim \chi^2_{N-1}$: point c. is proved too.
A t-distribution for the t-statistic
With these results at hand, it is possible to return to the t-statistic and derive its distribution. Note that:
$$
t = \sqrt{N}\,\frac{\bar{X}-\mu}{S} = \frac{\sqrt{N}\,\dfrac{\bar{X}-\mu}{\sigma}}{\sqrt{(N-1)\,\dfrac{S^2}{\sigma^2}}\cdot\dfrac{1}{\sqrt{N-1}}} \sim T_{N-1}
$$
is the ratio between two independent random variables:
• the numerator follows the standard normal distribution,
• while the denominator equals the square root of a random variable following the chi-squared distribution with $N-1$ degrees of freedom, divided by the square root of $N-1$.
Hence, by Observation 2 from Lecture 3, a t-statistic follows the Student's t-distribution with $N-1$ degrees of freedom.
The F-statistic
Knowing the distribution of the t-statistic helps conduct tests about $\mu$ in a normal random sample, but what about $\sigma^2$?

Definition 10
Normal variance ratio. Consider two univariate random samples $\{x_i\}_{i=1}^{N_X}$ and $\{y_i\}_{i=1}^{N_Y}$ of sizes $N_X$ and $N_Y$ respectively, each drawn from one of two independent sequences of random variables $(X_1,\ldots,X_{N_X})$ and $(Y_1,\ldots,Y_{N_Y})$ whose distributions are as follows.
$$
X_i \sim N\left(\mu_X,\sigma^2_X\right) \quad\text{for } i=1,\ldots,N_X
$$
$$
Y_j \sim N\left(\mu_Y,\sigma^2_Y\right) \quad\text{for } j=1,\ldots,N_Y
$$
The normal variance ratio is defined as the following F-statistic:
$$
F = \frac{S^2_X/\sigma^2_X}{S^2_Y/\sigma^2_Y}
$$
where $S^2_X$ and $S^2_Y$ are the sample variances of the two random samples.
An F-distribution for the F-statistic
• The F-statistic is used to test whether any two populations have identical/similar variance.
• To this end, one shall know the exact sampling distribution of the F-statistic: again, ratios of random variables come to the rescue.
• Both the numerator and the denominator of F, if multiplied by $N_X-1$ and $N_Y-1$ respectively, follow – by Theorem 4 – a chi-squared distribution with those given numbers as their respective degrees of freedom.
• Therefore, by Observation 3 from Lecture 3:
$$
F = \frac{S^2_X/\sigma^2_X}{S^2_Y/\sigma^2_Y} \sim F_{N_X-1,N_Y-1}
$$
a result that can be exploited in statistical inference.
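A quick simulation check of this distributional result (a sketch assuming NumPy and SciPy; the sample sizes and population variances are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

NX, NY = 12, 15
sigma_X, sigma_Y = 2.0, 0.5     # assumed population standard deviations
R = 30_000

SX2 = rng.normal(0.0, sigma_X, size=(R, NX)).var(axis=1, ddof=1)
SY2 = rng.normal(0.0, sigma_Y, size=(R, NY)).var(axis=1, ddof=1)
F = (SX2 / sigma_X**2) / (SY2 / sigma_Y**2)

# The empirical quantiles should match those of F_{NX-1, NY-1}.
for q in (0.1, 0.5, 0.9):
    print(q, np.quantile(F, q), stats.f.ppf(q, dfn=NX - 1, dfd=NY - 1))
```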


Multivariate normal sampling
• The analysis of normal sampling so far concerns univariate samples. These ideas also extend to a multivariate setting.
• Specifically, let the random sample $\{x_i\}_{i=1}^N$ be drawn from some multivariate normal distribution $x\sim N\left(\mu,\Sigma\right)$.
• By the properties of i.i.d. samples and of the multivariate normal distribution, the following holds.
$$
\bar{x} \sim N\left(\mu, \frac{\Sigma}{N}\right)
$$
• In statistical inference and in testing hypotheses, interest usually falls on the vector of means $\mu$.
• A scalar statistic that summarizes, while standardizing, all sample means appears useful here.
The u-statistic.
Definition 11
The u-statistic. Given a multivariate sample $\{x_i\}_{i=1}^N$ of size $N$ drawn from a sequence of random vectors $x_1,\ldots,x_N$, a u-statistic is defined as the following quantity:
$$
u = N\left(\bar{x}-\mu\right)^T \Sigma^{-1} \left(\bar{x}-\mu\right) = N\sum_{k=1}^K\sum_{\ell=1}^K \sigma^{*-1}_{k\ell}\left(\bar{X}_k-\mu_k\right)\left(\bar{X}_\ell-\mu_\ell\right)
$$
where $\bar{x}$ is the sample mean, whose expectation and variance are $\mu = E\left[x_i\right]$ and $\Sigma/N$ (with $\Sigma = \mathrm{Var}\left[x_i\right]$) respectively, and where $\sigma^{*-1}_{k\ell}$ is the $k\ell$-th element of $\Sigma^{-1}$.

To better interpret the u-statistic, one might analyze its expression as a quadratic form in the second equality of the definition: the statistic is a second-degree polynomial in the $K$ deviations of all univariate sample means from their respective mean parameters, normalized through the population variance-covariance.
The distribution of the u-statistic (1/2)

One more time, using this statistic for inference purposes requires the derivation of its exact distribution.

Theorem 5
Sampling from the Multivariate Normal Distribution. Let a random sample $\{x_i\}_{i=1}^N$ be drawn from some $K$-dimensional random vector following the multivariate normal distribution, $x\sim N\left(\mu,\Sigma\right)$. In this environment, the u-statistic follows the chi-squared distribution with $K$ degrees of freedom.
$$
u = N\left(\bar{x}-\mu\right)^T \Sigma^{-1} \left(\bar{x}-\mu\right) \sim \chi^2_K
$$

Proof.
As usual, the result requires finding the m.g.f. of the random variable of interest: here, the u-statistic. (Continues. . . )
The distribution of the u-statistic (2/2)
Theorem 5
Proof.
(Continued.) This requires some linear algebra.
$$
\begin{aligned}
M_u(t) &= \int_{\mathbb{R}^K} \exp\left[N\left(\bar{x}-\mu\right)^T \Sigma^{-1} \left(\bar{x}-\mu\right) t\right] f_{\bar{x}}\left(\bar{x};\mu,\frac{\Sigma}{N}\right) d\bar{x} \\
&= \int_{\mathbb{R}^K} \sqrt{\frac{N^K}{(2\pi)^K |\Sigma|}} \exp\left[-\frac{N}{2}\left(\bar{x}-\mu\right)^T (1-2t)\,\Sigma^{-1} \left(\bar{x}-\mu\right)\right] d\bar{x} \\
&= \frac{1}{\sqrt{(1-2t)^K}} \times \int_{\mathbb{R}^K} \sqrt{\frac{\left[(1-2t)N\right]^K}{(2\pi)^K |\Sigma|}} \times \\
&\qquad\qquad \times \exp\left[-\frac{N}{2}\left(\bar{x}-\mu\right)^T (1-2t)\,\Sigma^{-1} \left(\bar{x}-\mu\right)\right] d\bar{x} \\
&= (1-2t)^{-\frac{K}{2}}
\end{aligned}
$$
Note: the integral in the third line equals one, since the integrand is the p.d.f. of a multivariate normal distribution with mean $\mu$ and variance-covariance $\Sigma/\left[(1-2t)N\right]$. By recognizing the m.g.f. it follows that $u\sim\chi^2_K$.
Hotelling’s t-squared statistic
The same problem arises again! As in the univariate case, if $\Sigma$ is unknown the u-statistic is unusable. What if $\Sigma$ is replaced by its sample analog $S$?

Definition 12
Hotelling's “t-squared” statistic. Given some multivariate sample $\{x_i\}_{i=1}^N$ of size $N$ drawn from a sequence of random vectors $x_1,\ldots,x_N$, Hotelling's t-squared statistic is defined as the random variable:
$$
t^2 = N\left(\bar{x}-\mu\right)^T S^{-1} \left(\bar{x}-\mu\right) = N\sum_{k=1}^K\sum_{\ell=1}^K S^{*-1}_{k\ell}\left(\bar{X}_k-\mu_k\right)\left(\bar{X}_\ell-\mu_\ell\right)
$$
where $\bar{x}$ is the sample mean whose expectation is $\mu = E\left[x_i\right]$, $S$ is the sample variance-covariance, and $S^{*-1}_{k\ell}$ is the $k\ell$-th element of $S^{-1}$.

One can prove that a rescaled version of $t^2$ follows the F-distribution with paired degrees of freedom $K$ and $N-K$.
$$
\frac{N-K}{K(N-1)}\,t^2 = \frac{N(N-K)}{K(N-1)}\left(\bar{x}-\mu\right)^T S^{-1} \left(\bar{x}-\mu\right) \sim F_{K,N-K}
$$
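A minimal sketch of how these quantities could be computed (assuming NumPy and SciPy; the dimension, sample size, and hypothesized mean vector are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

K, N = 3, 40
mu0 = np.zeros(K)                     # hypothesized mean vector (assumed)
X = rng.normal(size=(N, K))           # sample actually drawn with mean mu0

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)           # sample variance-covariance (1/(N-1))

diff = x_bar - mu0
t2 = N * diff @ np.linalg.solve(S, diff)          # Hotelling's t-squared
F_stat = (N - K) / (K * (N - 1)) * t2             # rescaled statistic

# Under the null, F_stat follows F_{K, N-K}; this gives a p-value.
p_value = 1.0 - stats.f.cdf(F_stat, dfn=K, dfd=N - K)
print("t2 =", t2, " F =", F_stat, " p-value =", p_value)
```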
Order statistics
It is often useful to study the values that the realizations of a random variable take at a given position in the order of realizations (smallest value, highest value, et cetera).

Definition 13
Order statistics. Consider a sample $\{x_i\}_{i=1}^N$ of realizations obtained from univariate random variables $\{X_i\}_{i=1}^N$. Suppose that these values are placed in ascending order, where subscripts surrounded by parentheses denote an observation's position in the order:
$$
x_{(1)} \le x_{(2)} \le \cdots \le x_{(N)}
$$
therefore, $x_{(1)} = \min\{x_i\}_{i=1}^N$ and $x_{(N)} = \max\{x_i\}_{i=1}^N$. The $j$-th order statistic is the random variable – denoted as $X_{(j)}$ – that generates the $j$-th realization in the above sequence, that is $x_{(j)}$.

Every univariate sample has $N$ associated order statistics, which must satisfy the following property.
$$
X_{(1)} \le X_{(2)} \le \cdots \le X_{(N)}
$$


Minima, maxima, ranges and medians
Definition 14
Sample Minimum. The sample minimum is the first order statistic, $X_{(1)}$.

Definition 15
Sample Maximum. The sample maximum is the $N$-th order statistic, $X_{(N)}$.

Definition 16
Sample Range. The sample range $R$ is the difference between the sample maximum and the sample minimum: $R = X_{(N)} - X_{(1)}$.

Definition 17
Sample Median. The sample median $M$ is a function of a sample's most central order statistics.
$$
M = \begin{cases}
X_{\left(\frac{N+1}{2}\right)} & \text{if } N \text{ is odd} \\[6pt]
\dfrac{1}{2}\left(X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}\right) & \text{if } N \text{ is even}
\end{cases}
$$
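These definitions translate directly into code; the sketch below (assuming NumPy, with arbitrary simulated data) computes the order statistics and the derived quantities.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=11)            # a univariate sample of odd size N = 11

x_sorted = np.sort(x)              # x_(1) <= x_(2) <= ... <= x_(N)

sample_min = x_sorted[0]           # first order statistic
sample_max = x_sorted[-1]          # N-th order statistic
sample_range = sample_max - sample_min
# With N odd this is the ((N+1)/2)-th order statistic, matching Definition 17.
sample_median = np.median(x)

print(sample_min, sample_max, sample_range, sample_median)
```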
Which sampling distribution for order statistics?
• One may want to provide a distribution for order statistics (e.g. to model the probabilities for minima and maxima).
• If the sample is random, the c.d.f. of the $j$-th order statistic can be expressed in terms of the following joint probability.
$$
F_{X_{(j)}}(x) = P\left(X_{(1)}\le x \cap \cdots \cap X_{(j)}\le x\right)
$$
• At least $j$ observations (possibly more) must be less than or equal to $x$.
• This allows one to provide formulae for the p.d.f.s and c.d.f.s of order statistics, although these can be hard to use.
• Some relevant cases exist: that is, order statistics for uniform distributions, and minima or maxima for selected distributions.
Sampling distribution of order statistics (1/5)
Theorem 6
Sampling distribution of order statistics in random samples. In a univariate random sample, the c.d.f. of the $j$-th order statistic is based on the binomial distribution:
$$
F_{X_{(j)}}(x) = \sum_{k=j}^N \binom{N}{k} \left[F_X(x)\right]^k \left[1-F_X(x)\right]^{N-k}
$$
where $F_X(x)$ is the c.d.f. of the random variable $X$ that generates the sample. Two particular cases are the c.d.f.s of the minimum and the maximum, which are as follows.
$$
F_{X_{(1)}}(x) = 1 - \left[1-F_X(x)\right]^N
$$
$$
F_{X_{(N)}}(x) = \left[F_X(x)\right]^N
$$

Proof.
(Continues. . . )
Sampling distribution of order statistics (2/5)
Theorem 6
Proof.
(Continued.) For at least $j$ realizations to be less than or equal to $x$, the event defined as $X_i\le x$ must occur an integer number $k$ of times, with $j\le k\le N$, whereas the complementary event $X_i>x$ must instead occur $N-k$ times. If the sample is random (i.i.d.), these two events occur with probabilities that are constant across all realizations.
$$
P\left(X_i\le x\right) = F_X(x)
$$
$$
P\left(X_i>x\right) = 1 - F_X(x)
$$
As the sample is random, all joint combinations of said events can be expressed as the appropriate product of those probabilities. For a given $k$, any such joint event can be expressed through a binomial distribution, with the binomial coefficient counting all potential combinations with $k$ “successes” ($X_i\le x$) and $N-k$ “failures” ($X_i>x$). Summing over the eligible values of $k$ delivers the result sought after, of which the distributions for the minimum and the maximum are special cases.
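The binomial formula of Theorem 6 can be checked by simulation; the sketch below (assuming NumPy and SciPy; the exponential population, the values of $j$, $N$, and the evaluation point are arbitrary) compares the empirical c.d.f. of $X_{(j)}$ with the theoretical expression.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

N, j = 7, 3            # sample size and order-statistic index (assumed)
R = 100_000            # number of simulated samples
x0 = 0.5               # point at which the c.d.f. is evaluated (assumed)

samples = rng.exponential(scale=1.0, size=(R, N))
x_j = np.sort(samples, axis=1)[:, j - 1]          # j-th order statistic

empirical = np.mean(x_j <= x0)

# Theorem 6: sum_{k=j}^{N} C(N, k) F(x)^k (1 - F(x))^(N-k)
Fx = stats.expon.cdf(x0, scale=1.0)
theoretical = sum(stats.binom.pmf(k, N, Fx) for k in range(j, N + 1))

print("empirical:", empirical, " theoretical:", theoretical)
```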
Sampling distribution of order statistics (3/5)
If the sample is drawn from a continuous distribution, one can also derive the p.d.f. of an order statistic of interest.

Corollary
(Theorem 6.) If $X$ is a continuous random variable with p.d.f. $f_X(x)$, the p.d.f. of the $j$-th order statistic is the following.
$$
f_{X_{(j)}}(x) = \frac{N!}{(j-1)!\,(N-j)!}\, f_X(x) \left[F_X(x)\right]^{j-1} \left[1-F_X(x)\right]^{N-j}
$$

Proof.
The result obtains by taking the first derivative of $F_{X_{(j)}}(x)$ and some manipulation. The initial differentiation is shown in the next slide; it applies the chain rule. The third line in the following display obtains by isolating the term corresponding to $k=j$. One must then show that the two elements that are left over cancel out against one another. (Continues. . . )
Sampling distribution of order statistics (4/5)
Proof.
(Continued.) The initial operations are as follows.
$$
\begin{aligned}
f_{X_{(j)}}(x) &= \frac{dF_{X_{(j)}}(x)}{dx} \\
&= \sum_{k=j}^N \binom{N}{k}\Big\{ k\left[F_X(x)\right]^{k-1}\left[1-F_X(x)\right]^{N-k} f_X(x) \\
&\qquad\qquad - (N-k)\left[F_X(x)\right]^{k}\left[1-F_X(x)\right]^{N-k-1} f_X(x) \Big\} \\
&= \frac{N!}{(j-1)!\,(N-j)!}\, f_X(x) \left[F_X(x)\right]^{j-1}\left[1-F_X(x)\right]^{N-j} \\
&\quad + \sum_{k=j+1}^N \binom{N}{k} k\left[F_X(x)\right]^{k-1}\left[1-F_X(x)\right]^{N-k} f_X(x) \\
&\quad - \sum_{k=j}^N \binom{N}{k} (N-k)\left[F_X(x)\right]^{k}\left[1-F_X(x)\right]^{N-k-1} f_X(x)
\end{aligned}
$$
(Continues. . . )
Sampling distribution of order statistics (5/5)
Proof.
(Continued.) Re-index the summation of the second term above and note that in the summation of the third term, the element for $k=N$ is zero. Thus, the p.d.f. can be rewritten as follows.
$$
\begin{aligned}
f_{X_{(j)}}(x) &= \frac{N!}{(j-1)!\,(N-j)!}\, f_X(x) \left[F_X(x)\right]^{j-1}\left[1-F_X(x)\right]^{N-j} \\
&\quad + \sum_{k=j}^{N-1} \binom{N}{k+1} (k+1)\left[F_X(x)\right]^{k}\left[1-F_X(x)\right]^{N-k-1} f_X(x) \\
&\quad - \sum_{k=j}^{N-1} \binom{N}{k} (N-k)\left[F_X(x)\right]^{k}\left[1-F_X(x)\right]^{N-k-1} f_X(x)
\end{aligned}
$$
Since, by simple manipulation of factorials, it is:
$$
\binom{N}{k+1}(k+1) = \frac{N!}{k!\,(N-k-1)!} = \binom{N}{k}(N-k)
$$
it follows that the two terms in question cancel out.


Order statistics for the uniform distribution
These formulae can seldom be linked to any known distributions. One notable case is the following.

Observation 1
Consider a random sample obtained from the standard continuous uniform distribution, $X\sim U(0,1)$. The $j$-th order statistic is such that $X_{(j)}\sim\mathrm{Beta}\,(j, N-j+1)$.
Proof.
Since $F_X(x)=x$ and $f_X(x)=1$ for $x\in(0,1)$, while $f_X(x)=0$ otherwise, the density function of $X_{(j)}$ is, for $x\in(0,1)$:
$$
\begin{aligned}
f_{X_{(j)}}(x) &= \frac{N!}{(j-1)!\,(N-j)!}\, x^{j-1} (1-x)^{N-j} \\
&= \frac{\Gamma(N+1)}{\Gamma(j)\,\Gamma(N-j+1)}\, x^{j-1} (1-x)^{(N-j+1)-1} \\
&= \frac{1}{B(j, N-j+1)}\, x^{j-1} (1-x)^{(N-j+1)-1}
\end{aligned}
$$
which is the p.d.f. of the postulated Beta distribution.


Min-max stability
• The result about this uniform distribution is useful: it applies to percentiles $p$ drawn from any distribution, since $p\sim U(0,1)$.
• Other results are specific to the minima and maxima: sometimes these two extreme order statistics follow yet another distribution from the same family whence the sample is drawn.
• This deserves a proper definition.

Definition 18
Extreme order statistics (minimum-maximum) stability. Consider a random sample drawn from some known distribution. If the sample minimum (maximum) follows another distribution of the same family, that distribution is said to be min-stable (max-stable).

• Min-/max-stable distributions include the exponential as well as the GEV distributions.
Min-stability of the exponential distribution

Observation 2
Consider a random sample drawn from the exponential distribution
with parameter λ, X ∼ Exp (λ). In this case, the first order statistic
(the minimum) is such that X(1) ∼ Exp N −1 λ .
Proof.
By applying the formula for the distribution of the minimum:
N
FX(1) (x; λ, N ) = 1 − [1 − FX (x; λ)]
  N
1
= 1 − exp − x
λ
 
N
= 1 − exp − x
λ

the postulated c.d.f. obtains directly.
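A simulation sketch of this min-stability property (assuming NumPy; $\lambda$ is treated as the scale parameter, matching the $\exp(-x/\lambda)$ form above, and its value, $N$, and the number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)

lam, N, R = 4.0, 8, 100_000     # scale parameter, sample size, replications (assumed)

# Minimum of each of R exponential samples of size N.
minima = rng.exponential(scale=lam, size=(R, N)).min(axis=1)

# Min-stability: X_(1) should be exponential with scale lam / N.
print("mean of minima:", minima.mean(), "(theory:", lam / N, ")")
print("std  of minima:", minima.std(ddof=1), "(theory:", lam / N, ")")
```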


Max-stability of the Gumbel distribution
Observation 3
Consider a random sample drawn from the Type I GEV (Gumbel) distribution with parameters $\mu$ and $\sigma$, $X\sim EV_1(\mu,\sigma)$. The top order statistic (the maximum) is such that $X_{(N)}\sim EV_1\left(\mu-\sigma\log(N),\sigma\right)$.
Proof.
By applying the formula for the distribution of the maximum:
$$
\begin{aligned}
F_{X_{(N)}}(x;\mu,\sigma,N) &= \left[F_X(x;\mu,\sigma)\right]^N \\
&= \left[\exp\left(-\exp\left(\frac{x-\mu}{\sigma}\right)\right)\right]^N \\
&= \exp\left(-N\exp\left(\frac{x-\mu}{\sigma}\right)\right) \\
&= \exp\left(-\exp\left(\frac{x-\mu+\sigma\log(N)}{\sigma}\right)\right)
\end{aligned}
$$
one obtains the Gumbel c.d.f. that was argued.


Max-stability in other GEV distributions
Observation 4
Consider a random sample drawn from the Type II GEV (Fréchet) distribution with parameters $\alpha$, $\mu$, and $\sigma$, $Y\sim EV_2(\alpha,\mu,\sigma)$. The top order statistic (the maximum) is such that $Y_{(N)}\sim EV_2\left(\alpha,\mu,\sigma N^{1/\alpha}\right)$. The result is identical in the Type III GEV (reverse Weibull) case: if $Y\sim EV_3(\alpha,\mu,\sigma)$, it is $Y_{(N)}\sim EV_3\left(\alpha,\mu,\sigma N^{1/\alpha}\right)$.
Proof.
Here, applying the formula for the distribution of the maximum:
$$
\begin{aligned}
F_{Y_{(N)}}(y;\alpha,\mu,\sigma,N) &= \left[F_Y(y;\alpha,\mu,\sigma)\right]^N \\
&= \left[\exp\left(-\left(\frac{y-\mu}{\sigma}\right)^{-\alpha}\right)\right]^N \\
&= \exp\left(-\left(N^{-\frac{1}{\alpha}}\,\frac{y-\mu}{\sigma}\right)^{-\alpha}\right)
\end{aligned}
$$
allows one to show both the Fréchet and the reverse Weibull results.
Min-stability of the Weibull distribution

Observation 5
Consider a random sample that is drawn from the traditional Weibull distribution with parameters $\alpha$, $\mu$, and $\sigma$, $W\sim\mathrm{Weibull}\,(\alpha,\mu,\sigma)$. The sample minimum is such that $W_{(1)}\sim\mathrm{Weibull}\left(\alpha,\mu,\sigma N^{1/\alpha}\right)$.
Proof.
Things proceed similarly as in the Fréchet and reverse Weibull cases.
$$
\begin{aligned}
F_{W_{(1)}}(w;\alpha,\mu,\sigma,N) &= 1 - \left[1-F_W(w;\alpha,\mu,\sigma)\right]^N \\
&= 1 - \left[\exp\left(-\left(\frac{w-\mu}{\sigma}\right)^{-\alpha}\right)\right]^N \\
&= 1 - \exp\left(-\left(N^{-\frac{1}{\alpha}}\,\frac{w-\mu}{\sigma}\right)^{-\alpha}\right)
\end{aligned}
$$
Note that here, the formula for the minimum is applied instead.
Sufficient Statistics
The final kind of statistic introduced in this lecture serves as a first step for the study of estimation (Lecture 5).

Definition 19
Sufficient statistics. Consider a given sample generated by a list of random vectors $(x_1,\ldots,x_N)$. Suppose that the joint distribution of the sample depends, among other things, on some parameter $\theta$; write the associated p.m.f. or p.d.f. as $f_{x_1,\ldots,x_N}(x_1,\ldots,x_N;\theta)$. A statistic $T = T(x_1,\ldots,x_N)$ is said to be sufficient if the joint distribution of the sample, conditional on it, does not depend on $\theta$:
$$
f_{x_1,\ldots,x_N|T}\left(x_1,\ldots,x_N\,\middle|\,T(x_1,\ldots,x_N)\right) = \frac{f_{x_1,\ldots,x_N}(x_1,\ldots,x_N;\theta)}{q_T\left(T(x_1,\ldots,x_N);\theta\right)}
$$
where $q_T\left(T(x_1,\ldots,x_N);\theta\right)$ is the p.m.f. or the p.d.f. of the sufficient statistic in question.

This can also be expressed by saying that the joint conditional density is constant as a function of $\theta$.
The Sufficiency Principle
• The intuition behind sufficient statistics is that they “exhaust” all the information about $\theta$ that is contained in a sample.
• This aids estimation and inference in various ways.
• The role of sufficient statistics in inference is summarized by the following statistical principle. This is a postulate (an axiom) of statistical analysis.

Statistical Principle 1. Sufficiency. If $T = T(x_1,\ldots,x_N)$ is a sufficient statistic for a parameter $\theta$, any evaluation about the latter should depend solely on the sufficient statistic or a function thereof. That is, if $(x_1,\ldots,x_N)$ and $(y_1,\ldots,y_N)$ are two, possibly different, sample realizations such that
$$
T(x_1,\ldots,x_N) = T(y_1,\ldots,y_N)
$$
all statistical evaluations about $\theta$ should be identical regardless of the exact observed values in either realization.
Example: a Bernoulli sufficient statistic
• Consider a random sample drawn from $X\sim\mathrm{Be}\,(p)$.
• The count of “successes” is a sufficient statistic for $p$:
$$
T = T(X_1,\ldots,X_N) = \sum_{i=1}^N X_i
$$
Its realization is $t = T(x_1,\ldots,x_N) = \sum_{i=1}^N x_i$.
• Apply the definition, observing that $T\sim BN(p,N)$:
$$
\frac{f_{X_1,\ldots,X_N}(x_1,\ldots,x_N;p)}{q_T(t;p,N)} = \frac{\prod_{i=1}^N p^{x_i}(1-p)^{1-x_i}}{\binom{N}{t}p^t(1-p)^{N-t}} = \frac{p^t(1-p)^{N-t}}{\binom{N}{t}p^t(1-p)^{N-t}} = \frac{t!\,(N-t)!}{N!}
$$
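A small simulation illustrates what sufficiency means here: conditional on the observed count $t$, every arrangement of successes is equally likely, whatever $p$ generated the data. The sketch below (assuming NumPy; the values of $p$, $N$, and $t$ are arbitrary) checks the conditional probability $t!\,(N-t)!/N!$ empirically.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(10)

p, N, t = 0.3, 6, 2           # assumed parameter, sample size and target count
R = 200_000

samples = rng.binomial(1, p, size=(R, N))
given_t = samples[samples.sum(axis=1) == t]       # condition on T = t

# Probability of one specific arrangement (first t ones, then zeros),
# conditional on T = t; sufficiency says it equals 1 / C(N, t) for any p.
target = np.concatenate([np.ones(t), np.zeros(N - t)])
empirical = np.mean(np.all(given_t == target, axis=1))

print("empirical:", empirical, " theoretical:", 1 / comb(N, t))
```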
Example: X̄ suffices for the normal’s µ (1/2)
• Consider a random sample drawn from $X\sim N\left(\mu,\sigma^2\right)$.
• The sample mean $\bar{X}$ is a sufficient statistic for $\mu$.
• Recall that $\bar{X}\sim N\left(\mu,\sigma^2/N\right)$, and write the realization of $\bar{X}$ as follows.
$$
\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i
$$
• In the application that follows next, the following property is used (as in the proof of point c. of Theorem 4).
$$
\sum_{i=1}^N (x_i-\mu)^2 = \sum_{i=1}^N (x_i-\bar{x})^2 + N(\bar{x}-\mu)^2 - 2(\bar{x}-\mu)\underbrace{\sum_{i=1}^N (x_i-\bar{x})}_{=0}
$$
Example: X̄ suffices for the normal’s µ (2/2)
The derivation follows suit; note how the mentioned property is applied in the second line.
$$
\begin{aligned}
\frac{f_{X_1,\ldots,X_N}\left(x_1,\ldots,x_N;\mu,\sigma^2\right)}{q_{\bar{X}}\left(\bar{x};\mu,\sigma^2/N\right)}
&= \frac{\displaystyle\prod_{i=1}^N \sqrt{(2\pi\sigma^2)^{-1}}\,\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{-1}N}\,\exp\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)} \\[4pt]
&= \frac{\sqrt{(2\pi\sigma^2)^{-N}}\,\exp\left(-\displaystyle\sum_{i=1}^N\frac{(x_i-\mu)^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{-1}N}\,\exp\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)} \\[4pt]
&= \frac{\exp\left(-\displaystyle\sum_{i=1}^N\frac{(x_i-\bar{x})^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{N-1}N}}
\end{aligned}
$$
Example: a sufficient order statistic
• Consider a random sample drawn from $X\sim U(0,\theta)$.
• The maximum $X_{(N)}$ is a sufficient statistic for parameter $\theta$.
• The realization of $X_{(N)}$ is $x_{(N)} = \max\{x_1,\ldots,x_N\}$, and:
$$
q_{X_{(N)}}\left(x_{(N)};\theta\right) = \frac{d}{dx_{(N)}}\left[\frac{x_{(N)}}{\theta}\right]^N \cdot \mathbb{1}\left[x_{(N)}\in(0,\theta)\right] = \frac{N x_{(N)}^{N-1}}{\theta^N}\cdot \mathbb{1}\left[x_{(N)}\in(0,\theta)\right]
$$
• . . . hence, applying the definition here gives:
$$
\frac{f_{X_1,\ldots,X_N}(x_1,\ldots,x_N;\theta)}{q_{X_{(N)}}\left(x_{(N)};\theta\right)} = \frac{1}{N x_{(N)}^{N-1}}\cdot \mathbb{1}\left[x_{(N)}\in(0,\theta)\right]
$$
since here $f_{X_1,\ldots,X_N}(x_1,\ldots,x_N;\theta) = \left[f_X(x;\theta)\right]^N = \theta^{-N}$.
The factorization theorem (1/7)
The following important result helps identify sufficient statistics.

Theorem 7
Fisher-Neyman's Factorization Theorem. Consider a sample generated by a sequence of random vectors $(x_1,\ldots,x_N)$, whose joint distribution has a p.m.f. or a p.d.f. $f_{x_1,\ldots,x_N}(x_1,\ldots,x_N;\theta)$ that also depends on some parameter $\theta$. Then, a statistic $T = T(x_1,\ldots,x_N)$ is sufficient for $\theta$ if and only if it is possible to identify two functions $g\left(T(x_1,\ldots,x_N);\theta\right)$ and $h(x_1,\ldots,x_N)$ such that the following holds.
$$
f_{x_1,\ldots,x_N}(x_1,\ldots,x_N;\theta) = g\left(T(x_1,\ldots,x_N);\theta\right)\cdot h(x_1,\ldots,x_N)
$$
Observe that the function $g\left(T(x_1,\ldots,x_N);\theta\right)$ depends on $\theta$, but the function $h(x_1,\ldots,x_N)$ does not.

Proof.
This proof is complex, and it is only fully developed in the discrete case; the continuous case is only sketched. (Continues. . . )
The factorization theorem (2/7)
Theorem 7
Proof.
(Continued.) Start from the discrete case and the “necessity” part: if the factorization exists, then $T$ is sufficient for $\theta$. Write the p.m.f. of $T$ as $q_T\left(T(x_1,\ldots,x_N);\theta\right)$. Furthermore, define the set of vectors spanning the same space as $(x_1,\ldots,x_N)$ and resulting in the same value for $T$, as follows.
$$
A_T(x_1,\ldots,x_N) \equiv \left\{y_1,\ldots,y_N : T(x_1,\ldots,x_N) = T(y_1,\ldots,y_N)\right\}
$$
By the properties of probability functions, the following holds.
$$
\begin{aligned}
q_T\left(T(x_1,\ldots,x_N);\theta\right) &= \sum_{y_1,\ldots,y_N\in A_T} f_{x_1,\ldots,x_N}(y_1,\ldots,y_N;\theta) \\
&= \sum_{y_1,\ldots,y_N\in A_T} g\left(T(y_1,\ldots,y_N);\theta\right) h(y_1,\ldots,y_N)
\end{aligned}
$$
(Continues. . . )
The factorization theorem (3/7)
Theorem 7
Proof.
(Continued.) Since $T(y_1,\ldots,y_N)$ is constant in $A_T(x_1,\ldots,x_N)$, it is:
$$
q_T\left(T(x_1,\ldots,x_N);\theta\right) = g\left(T(x_1,\ldots,x_N);\theta\right) \sum_{y_1,\ldots,y_N\in A_T} h(y_1,\ldots,y_N)
$$
where in both cases $A_T$ is shorthand notation for $A_T(x_1,\ldots,x_N)$. It then follows that:
$$
\frac{f_{x_1,\ldots,x_N}(x_1,\ldots,x_N;\theta)}{q_T\left(T(x_1,\ldots,x_N);\theta\right)} = \frac{g\left(T(x_1,\ldots,x_N);\theta\right)\cdot h(x_1,\ldots,x_N)}{q_T\left(T(x_1,\ldots,x_N);\theta\right)} = \frac{h(x_1,\ldots,x_N)}{\sum_{y_1,\ldots,y_N\in A_T} h(y_1,\ldots,y_N)}
$$
since $g\left(T(y_1,\ldots,y_N);\theta\right)$ cancels in the right-hand-side ratio; the latter no longer depends on $\theta$, indicating that $T$ is a sufficient statistic. (Continues. . . )
The factorization theorem (4/7)
Theorem 7
Proof.
(Continued.) Consider the “sufficiency” part of the discrete case. Recall the interpretation of a joint p.m.f. as a probability function:
$$
\begin{aligned}
f_{x_1,\ldots,x_N}(x_1,\ldots,x_N;\theta) &= P\left(\bigcap_{i=1}^N x_i = x_i;\theta\right) \\
&= P\left(\left.\bigcap_{i=1}^N x_i = x_i\,\right|\, T = T(x_1,\ldots,x_N)\right) \times P\left(T = T(x_1,\ldots,x_N);\theta\right) \\
&= h(x_1,\ldots,x_N)\cdot q_T\left(T(x_1,\ldots,x_N);\theta\right)
\end{aligned}
$$
The second line follows from the definition of conditional probability, while the third just renames the previous probability functions, noting that the conditional probability of the sample given $T$ is expressible as some generic function $h(x_1,\ldots,x_N)$ that does not depend on $\theta$ by the definition of sufficient statistic. (Continues. . . )
The factorization theorem (5/7)
Theorem 7
Proof.
(Continued.) Move to the continuous case. Consider some bijective and differentiable transformations that do not depend on $\theta$:
$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} = \begin{pmatrix} g_1(x_1,\ldots,x_N) \\ g_2(x_1,\ldots,x_N) \\ \vdots \\ g_N(x_1,\ldots,x_N) \end{pmatrix}
$$
where at least one element of this list (suppose $Y_{11}$ in $y_1$) is fixed as $Y_{11} = T(x_1,\ldots,x_N)$ by construction. The inverse transformation is as follows.
$$
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} g_1^{-1}(y_1,\ldots,y_N) \\ g_2^{-1}(y_1,\ldots,y_N) \\ \vdots \\ g_N^{-1}(y_1,\ldots,y_N) \end{pmatrix}
$$
(Continues. . . )
The factorization theorem (6/7)
Theorem 7
Proof.
(Continued.) In order to show necessity, write the joint p.d.f. of the transformation as:
$$
\begin{aligned}
f_{y_1,\ldots,y_N}(y_1,\ldots,y_N;\theta) &= f_{x_1,\ldots,x_N}(w_1,\ldots,w_N;\theta)\cdot|J^*| \\
&= g\left(T(w_1,\ldots,w_N);\theta\right)\cdot h(w_1,\ldots,w_N)\cdot|J^*| \\
&= g\left(y_{11};\theta\right)\cdot h(w_1,\ldots,w_N)\cdot|J^*|
\end{aligned}
$$
where $|J^*|$ is shorthand for the absolute value of the Jacobian of the inverse transformation, and the second line follows by hypothesis. It is obvious that the marginal distribution of $Y_{11}$, that is the density function $q_T\left(T(x_1,\ldots,x_N);\theta\right)$ of the statistic of interest $T$, inherits a factorization analogous to the above, and since $y_{11} = T(x_1,\ldots,x_N)$, it can be shown that the ratio between the joint p.d.f. of the sample and the p.d.f. of $T$ does not depend on $\theta$; hence $T$ is sufficient. (Continues. . . )
The factorization theorem (7/7)

Theorem 7
Proof.
(Continued.) In order to show the “sufficiency” part of the Theorem (if $T$ is sufficient, then a proper factorization can be expressed) apply the definition of conditional density function to show that:
$$
f_{y_1,\ldots,y_N}(y_1,\ldots,y_N;\theta) = q_T\left(y_{11};\theta\right)\cdot f_{\{y_1,\ldots,y_N\}\setminus Y_{11}}\left(\left.\{y_1,\ldots,y_N\}\setminus y_{11}\,\right|\,Y_{11}\right)
$$
where the notation $\{\cdot\}\setminus Y_{11}$ denotes a list that excludes $Y_{11}$. Dividing both sides of the above by $|J^*|$ returns the desired factorization for:
$$
h(x_1,\ldots,x_N) = \frac{f_{\{y_1,\ldots,y_N\}\setminus Y_{11}}\left(\left.\{y_1,\ldots,y_N\}\setminus y_{11}\,\right|\,Y_{11}\right)}{|J^*|}
$$
and for $g\left(T(x_1,\ldots,x_N);\theta\right) = q_T\left(T(x_1,\ldots,x_N);\theta\right)$.


Uses of the factorization theorem
The factorization theorem is especially useful to show that multiple statistics are simultaneously sufficient for a number of associated parameters. This is usually expressed through a vector of statistics $t(x_1,\ldots,x_N)$.
$$
t(x_1,\ldots,x_N) = \begin{pmatrix} T_1(x_1,\ldots,x_N) \\ T_2(x_1,\ldots,x_N) \\ \vdots \\ T_K(x_1,\ldots,x_N) \end{pmatrix}
$$
These statistics are said to be simultaneously sufficient for a vector of parameters $\theta$:
$$
\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_J \end{pmatrix}
$$
where generally it may be that $K\neq J$. The factorization theorem can be extended to allow for $g\left(t(x_1,\ldots,x_N);\theta\right)$ to be the joint p.d.f. of all these statistics and for a multidimensional parameter vector.
Example: sufficiency for the normal distribution
• The earlier result on $\bar{X}$ being sufficient for $\mu$ in the normal case can also be obtained via the factorization theorem with:
$$
g(\bar{x};\mu) = \exp\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)
$$
$$
h(x_1,\ldots,x_N) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{N}{2}}\exp\left(-\sum_{i=1}^N\frac{(x_i-\bar{x})^2}{2\sigma^2}\right)
$$
• . . . but this still ignores $\sigma^2$. Consider $S^2$ and its realization.
$$
s^2 = \frac{1}{N-1}\sum_{i=1}^N (x_i-\bar{x})^2
$$
• It turns out that $\left(\bar{X},S^2\right)$ are jointly sufficient for $\left(\mu,\sigma^2\right)$. If
$$
g\left(\bar{x},s^2;\mu,\sigma^2\right) = \left(\frac{1}{\sigma^2}\right)^{\frac{N}{2}}\exp\left(-\frac{N(\bar{x}-\mu)^2+(N-1)s^2}{2\sigma^2}\right)
$$
and $h(x_1,\ldots,x_N) = (2\pi)^{-N/2}$, the theorem applies nicely.
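The jointly sufficient pair is easily computed in code; the sketch below (assuming NumPy; the parameter values are arbitrary) verifies numerically that $g\left(\bar{x},s^2;\mu,\sigma^2\right)\cdot h(x_1,\ldots,x_N)$ reproduces the joint normal density.

```python
import numpy as np

rng = np.random.default_rng(11)

mu, sigma, N = 1.0, 2.0, 30            # assumed population parameters and size
x = rng.normal(mu, sigma, size=N)

# The jointly sufficient pair for (mu, sigma^2) in normal sampling.
x_bar = x.mean()
s2 = x.var(ddof=1)

# log of the joint density, split as log g(x_bar, s2; mu, sigma^2) + log h(x).
log_g = -N / 2 * np.log(sigma**2) - (N * (x_bar - mu) ** 2 + (N - 1) * s2) / (2 * sigma**2)
log_h = -N / 2 * np.log(2 * np.pi)

# Direct evaluation of the joint normal log-density for comparison.
log_joint = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

assert np.isclose(log_g + log_h, log_joint)
print("factorization verified:", log_g + log_h, "=", log_joint)
```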
Example: sufficiency for the multivariate normal
• These results extend to the multivariate case, where $\bar{x}$ with realization $\bar{x} = \frac{1}{N}\sum_{i=1}^N x_i$ is sufficient for $\mu$.
$$
\frac{f_{x_1,\ldots,x_N}(x_1,\ldots,x_N;\mu,\Sigma)}{q_{\bar{x}}\left(\bar{x};\mu,\Sigma/N\right)} = \frac{\exp\left(-\dfrac{1}{2}\displaystyle\sum_{i=1}^N (x_i-\bar{x})^T\Sigma^{-1}(x_i-\bar{x})\right)}{\sqrt{\left[(2\pi)^K|\Sigma|\right]^{N-1}N^K}}
$$
• Again, this ignores $\Sigma$. Consider $S$ and its realization.
$$
S = \frac{1}{N-1}\sum_{i=1}^N (x_i-\bar{x})(x_i-\bar{x})^T
$$
• Thus, $(\bar{x},S)$ are jointly sufficient for $(\mu,\Sigma)$. Let:
$$
g(\bar{x},S;\mu,\Sigma) = \frac{1}{|\Sigma|^{\frac{N}{2}}}\exp\left(-\frac{N}{2}(\bar{x}-\mu)^T\Sigma^{-1}(\bar{x}-\mu) - \frac{N-1}{2}\,\mathrm{tr}\left(\Sigma^{-1}S\right)\right)
$$
and $h(x_1,\ldots,x_N) = (2\pi)^{-\frac{NK}{2}}$; the theorem applies again.
Example: sufficiency for the uniform distribution
• Let a sample be drawn from $X\sim U(\alpha,\beta)$, where $\alpha$ and $\beta$ are unknown parameters.
• The minimum $X_{(1)}$ and maximum $X_{(N)}$, with realizations $x_{(1)} = \min\{x_1,\ldots,x_N\}$ and $x_{(N)} = \max\{x_1,\ldots,x_N\}$, are jointly sufficient for $\alpha$ and $\beta$.
• The joint p.d.f. of the sample here is:
$$
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N;\alpha,\beta) = \left(\frac{1}{\beta-\alpha}\right)^N\cdot\mathbb{1}\left[\alpha\le x_1,\ldots,x_N\le\beta\right]
$$
• . . . hence, the factorization theorem here applies by setting
$$
g\left(x_{(1)},x_{(N)};\alpha,\beta\right) = \left(\frac{1}{\beta-\alpha}\right)^N\cdot\mathbb{1}\left[\alpha\le x_{(1)}\right]\cdot\mathbb{1}\left[x_{(N)}\le\beta\right]
$$
and $h(x_1,\ldots,x_N) = 1$.
The exponential (macro-)family
The factorization theorem is extremely easy to apply to a wide array of distributions.

Definition 20
Exponential (Macro-)family. A family of probability distributions expressed by a vector of parameters $\theta = (\theta_1,\ldots,\theta_J)$ is said to belong to the exponential (macro-)family if the associated p.m.f.s or p.d.f.s can be written, for $J\le L$, as follows.
$$
f_X(x;\theta) = h(x)\,c(\theta)\exp\left(\sum_{\ell=1}^L w_\ell(\theta)\,t_\ell(x)\right)
$$
Here $h(x)$ and $t_\ell(x)$ are functions of the realizations $x$, while $c(\theta)\ge 0$ and $w_\ell(\theta)$ are functions of the parameters $\theta$, with $\ell=1,\ldots,L$.

• The Bernoulli, geometric, Poisson, normal, lognormal, Beta, and Gamma (including its special cases) families are all sub-families of the exponential macro-family.
Sufficiency and the exponential family
Theorem 8
Sufficient statistics and the exponential family. If a random sample is obtained from any random variable $X$ whose distribution belongs to the exponential family, the $L$ statistics in the vector:
$$
t(X_1,\ldots,X_N) = \begin{pmatrix} \sum_{i=1}^N t_1(X_i) \\ \sum_{i=1}^N t_2(X_i) \\ \vdots \\ \sum_{i=1}^N t_L(X_i) \end{pmatrix}
$$
are simultaneously sufficient for $\theta$, where the functions $t_\ell(x)$ are as in the previous definition of the exponential family for $\ell=1,\ldots,L$.
Proof.
The joint density of the sample can be expressed as:
$$
f_{X_1,\ldots,X_N}(x_1,\ldots,x_N;\theta) = \left(\prod_{i=1}^N h(x_i)\right)\left[c(\theta)\right]^N\exp\left(\sum_{\ell=1}^L w_\ell(\theta)\sum_{i=1}^N t_\ell(x_i)\right)
$$
and applying the factorization theorem is straightforward.
Sufficiency and the exponential family: examples
• The Bernoulli p.m.f. can be written, for $x\in\{0,1\}$, as:
$$
f_X(x;p) = (1-p)\exp\left(\log\left(\frac{p}{1-p}\right)x\right)
$$
implying that $T = \sum_{i=1}^N X_i$ is sufficient for $p$.
• The normal p.d.f. can be written as:
$$
f_X\left(x;\mu,\sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)\exp\left(\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2\right)
$$
so $T_1 = \sum_{i=1}^N X_i$ and $T_2 = \sum_{i=1}^N X_i^2$ “suffice” for $\mu$ and $\sigma^2$.
• The Gamma p.d.f. can be written, for $x>0$, as:
$$
f_X(x;\alpha,\beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\exp\left[(\alpha-1)\log(x) - \beta x\right]
$$
so $T_1 = \sum_{i=1}^N \log(X_i)$ and $T_2 = \sum_{i=1}^N X_i$ suffice for $\alpha$ and $\beta$.
Transformations of sufficient statistics
• If $T(x_1,\ldots,x_N)$ is a sufficient statistic for some parameter $\theta$, a transformation $T'(x_1,\ldots,x_N) = g\left(T(x_1,\ldots,x_N)\right)$ is also sufficient for $\theta$ if $g(\cdot)$ does not depend on $\theta$.
• This conclusion also applies to the multidimensional case where $t'(x_1,\ldots,x_N) = g\left(t(x_1,\ldots,x_N)\right)$.
• Example (normal): if $T_1 = \sum_{i=1}^N X_i$ and $T_2 = \sum_{i=1}^N X_i^2$:
$$
\bar{X} = \frac{1}{N}T_1 \quad\text{and}\quad S^2 = \frac{1}{N-1}\left(T_2 - \frac{T_1^2}{N}\right)
$$
are also sufficient for $\mu$ and $\sigma^2$ of the normal distribution.
• Example (Gamma): if $T_1 = \sum_{i=1}^N \log(X_i)$ and $T_2 = \sum_{i=1}^N X_i$:
$$
T_1' = \exp(T_1) = \prod_{i=1}^N X_i \quad\text{and}\quad T_2' = T_2 = \sum_{i=1}^N X_i
$$
are also sufficient for $\alpha$ and $\beta$ of the Gamma distribution.
