Paolo Zacchia
Lecture 4
Samples
Definition 1
Sample. A sample is a collection {x_i}_{i=1}^N of realizations of some N random vectors {x_i}_{i=1}^N associated with some population of interest. Each unit of this population is typically called a unit of observation, and its associated realization x_i is identified by a unique subscript i.
Note: if the data are collected from N random variables, the sample is written as {x_i}_{i=1}^N; if from N random matrices, as {X_i}_{i=1}^N.
Definition 2
Sample size. The dimension N of a sample is called its size.
Random samples
Definition 3
Random sample. A sample is random if its realizations are drawn from independent and identically distributed (i.i.d.) random vectors {x_i}_{i=1}^N (or variables, or matrices).
Definition 4
Non-random sample. A sample is said to be non-random if the realizations that compose it are not drawn from i.i.d. random variables, vectors, or matrices. Instead, these may be:
• independent and not identically distributed (i.n.i.d.);
• not independent and identically distributed (n.i.i.d.);
• not independent, not identically distributed (n.i.n.i.d.).
Samples and Statistics
Although non-random samples are common (they are ubiquitous, say, in econometrics), random samples are an important benchmark. In fact, the i.i.d. property allows one to express the joint distribution of the sample as:
$$f_{x_1,\dots,x_N}(x_1,\dots,x_N;\theta) = f_x(x_1;\theta)\times\dots\times f_x(x_N;\theta) = \prod_{i=1}^{N} f_x(x_i;\theta)$$
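As a quick numerical illustration (a sketch only, assuming a univariate normal population with arbitrary parameter values), the joint density of an i.i.d. sample is the product of the marginal densities, or equivalently the exponential of the sum of the log-densities:

```python
import numpy as np
from scipy.stats import norm

# Illustrative (hypothetical) parameter values and a small i.i.d. sample.
mu, sigma = 1.0, 2.0
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=5)

# Joint density of the sample = product of the marginal densities f_x(x_i; theta).
joint_density = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Equivalent, numerically safer form: sum of log-densities.
log_joint = np.sum(norm.logpdf(x, loc=mu, scale=sigma))

print(np.isclose(joint_density, np.exp(log_joint)))  # True
```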
Definition 5
Statistic. A function of the N random variables, vectors or matrices
that are specific to each i-th unit of observation and that generate a
sample is called a statistic. Any statistic is itself a random variable,
vector or matrix.
Definition 6
Sampling distribution. The probability distribution of a statistic is
called its sampling distribution.
Sample mean
The two most common and important sample statistics are defined next.
Definition 7
Sample mean. In samples derived from random vectors, the sample mean is a vector-valued statistic usually denoted as x̄ and defined as follows.
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
This definition specializes to samples drawn from univariate random variables, in which case the usual notation is X̄:
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$
Definition 8
Sample variance-covariance. In samples derived from random vectors, the sample variance-covariance matrix is the statistic
$$S = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T$$
while in univariate samples the sample variance is
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2.$$
In this specific case, the square root of the sample variance is written S = √S² and is called the standard deviation. In order to extend these definitions to sampling from random matrices, it would be necessary to introduce three-dimensional arrays.
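A brief computational sketch of these definitions (simulated data; names and sizes chosen purely for illustration): numpy's `cov` with `ddof=1` uses the same N − 1 normalization as the sample variance-covariance above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 3
X = rng.normal(size=(N, K))          # N realizations of a K-dimensional random vector

x_bar = X.mean(axis=0)               # sample mean (a K-vector)
S = np.cov(X, rowvar=False, ddof=1)  # sample variance-covariance, with N - 1 in the denominator

# Direct computation from the definition, for comparison.
dev = X - x_bar
print(np.allclose(S, dev.T @ dev / (N - 1)))  # True
```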
Properties of key sample statistics, I (1/3)
In what follows, some key results about the sample mean and the sample variance-covariance are presented. They are crucial for deriving the moments and, sometimes, the distribution of these statistics.
Theorem 1
Properties of simple sample statistics (1). Consider a sample {x_i}_{i=1}^N, its sample mean x̄, and its sample variance-covariance S. The following two properties are true:
a. $\bar{x} = \arg\min_{a\in\mathbb{R}^K}\sum_{i=1}^{N}\left(x_i - a\right)^T\left(x_i - a\right)$;
b. $(N-1)\,S = \sum_{i=1}^{N} x_i x_i^T - N\cdot\bar{x}\bar{x}^T$.
Proof.
(Continues. . . )
Properties of key sample statistics, I (2/3)
Theorem 1
Proof.
(Continued.) To show point a. note that:
$$\begin{aligned}
\sum_{i=1}^{N}\left(x_i - a\right)^T\left(x_i - a\right) &= \sum_{i=1}^{N}\left(x_i - \bar{x} + \bar{x} - a\right)^T\left(x_i - \bar{x} + \bar{x} - a\right)\\
&= \sum_{i=1}^{N}\left(x_i - \bar{x}\right)^T\left(x_i - \bar{x}\right) + \sum_{i=1}^{N}\left(\bar{x} - a\right)^T\left(\bar{x} - a\right)\\
&\quad + \sum_{i=1}^{N}\left(x_i - \bar{x}\right)^T\left(\bar{x} - a\right) + \sum_{i=1}^{N}\left(\bar{x} - a\right)^T\left(x_i - \bar{x}\right)\\
&= \sum_{i=1}^{N}\left(x_i - \bar{x}\right)^T\left(x_i - \bar{x}\right) + \sum_{i=1}^{N}\left(\bar{x} - a\right)^T\left(\bar{x} - a\right)
\end{aligned}$$
where two terms in the second line are both equal to zero by definition
of sample mean; in the last line, the first term does not depend on a
while the second is minimized at a = x̄. (Continues. . . )
Properties of key sample statistics, I (3/3)
Theorem 1
Proof.
(Continued.) To show b. simply note that:
$$\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T = \sum_{i=1}^{N} x_i x_i^T - \sum_{i=1}^{N} x_i\bar{x}^T - \sum_{i=1}^{N}\bar{x}x_i^T + N\cdot\bar{x}\bar{x}^T$$
and the result again follows from the definition of the sample mean, since $\sum_{i=1}^{N} x_i = N\bar{x}$.
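Both identities are purely algebraic, so they hold for any simulated sample; a minimal sketch (illustrative sizes and distribution):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 50, 2
X = rng.exponential(size=(N, K))     # any distribution works: Theorem 1 is purely algebraic

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, ddof=1)

def ssq(a):
    # objective of property a.: sum_i (x_i - a)'(x_i - a)
    return np.sum((X - a) ** 2)

# Property a.: x_bar minimizes the sum of squared deviations over a.
print(ssq(x_bar) < ssq(x_bar + 0.1))                                   # True

# Property b.: (N - 1) S = sum_i x_i x_i' - N x_bar x_bar'
print(np.allclose((N - 1) * S, X.T @ X - N * np.outer(x_bar, x_bar)))  # True
```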
Properties of key sample statistics, II (1/3)
Theorem 2
Properties of simple sample statistics (2). Consider a random sample {x_i}_{i=1}^N drawn from a random vector x, a transformation of this vector y = g(x), and suppose that all the moments expressed in the mean vector E[y] and in the variance-covariance matrix Var[y] are defined. The following two properties are true:
a. $\mathrm{E}\left[\sum_{i=1}^{N} y_i\right] = N\cdot\mathrm{E}[y_i]$;
b. $\mathrm{Var}\left[\sum_{i=1}^{N} y_i\right] = N\cdot\mathrm{Var}[y_i]$.
Proof.
To show a. simply observe that:
$$\mathrm{E}\left[\sum_{i=1}^{N} y_i\right] = \sum_{i=1}^{N}\mathrm{E}[y_i] = N\cdot\mathrm{E}[y_i]$$
which follows from the linear properties of expectations and from the
moments of yi for i = 1, . . . , N being identical. (Continues. . . )
Properties of key sample statistics, II (2/3)
Theorem 2
Proof.
(Continued.) The demonstration of b. is as follows.
$$\begin{aligned}
\mathrm{Var}\left[\sum_{i=1}^{N} y_i\right] &= \mathrm{E}\left[\left(\sum_{i=1}^{N} y_i - \mathrm{E}\left[\sum_{i=1}^{N} y_i\right]\right)\left(\sum_{i=1}^{N} y_i - \mathrm{E}\left[\sum_{i=1}^{N} y_i\right]\right)^T\right]\\
&= \mathrm{E}\left[\left(\sum_{i=1}^{N}\left(y_i - \mathrm{E}[y_i]\right)\right)\left(\sum_{i=1}^{N}\left(y_i - \mathrm{E}[y_i]\right)\right)^T\right]\\
&= \mathrm{E}\left[\sum_{i=1}^{N}\left(y_i - \mathrm{E}[y_i]\right)\left(y_i - \mathrm{E}[y_i]\right)^T\right]\\
&= \sum_{i=1}^{N}\mathrm{E}\left[\left(y_i - \mathrm{E}[y_i]\right)\left(y_i - \mathrm{E}[y_i]\right)^T\right]\\
&= N\cdot\mathrm{Var}[y_i]
\end{aligned}$$
(Continues. . . )
Properties of key sample statistics, II (3/3)
Theorem 2
Proof.
(Continued.) In the above derivation for b. the first line is just the definition of variance for $\sum_{i=1}^{N} y_i$, the second line applies the linear properties of expectations while also rearranging terms, and the third line rearranges terms again after observing that, for i ≠ j (by independence across the observations of a random sample):
$$\mathrm{E}\left[\left(y_i - \mathrm{E}[y_i]\right)\left(y_j - \mathrm{E}[y_j]\right)^T\right] = 0$$
Properties of key sample statistics, III (1/3)
Theorem 3
Properties of simple sample statistics (3). Consider a random sample {x_i}_{i=1}^N drawn from a random vector x whose mean E[x] and variance-covariance Var[x] are defined, together with the statistics x̄ and S. The following three properties are true:
a. E[x̄] = E[x];
b. Var[x̄] = Var[x]/N;
c. E[S] = Var[x].
Proof.
To show a. it is sufficient to apply Theorem 2, point a. for y = x:
$$\mathrm{E}[\bar{x}] = \mathrm{E}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N}\sum_{i=1}^{N}\mathrm{E}[x_i] = \frac{1}{N}\cdot N\cdot\mathrm{E}[x_i] = \mathrm{E}[x]$$
(Continues. . . )
Properties of key sample statistics, III (2/3)
Theorem 3
Proof.
(Continued.) Point b. proceeds similarly.
$$\begin{aligned}
\mathrm{Var}[\bar{x}] &= \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right]\\
&= \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}[x_i]\\
&= \frac{1}{N^2}\cdot N\cdot\mathrm{Var}[x_i]\\
&= \frac{\mathrm{Var}[x]}{N}
\end{aligned}$$
(Continues. . . )
Properties of key sample statistics, III (3/3)
Theorem 3
Proof.
(Continued.) The proof of point c. is as follows:
$$\begin{aligned}
\mathrm{E}[S] &= \mathrm{E}\left[\frac{1}{N-1}\left(\sum_{i=1}^{N} x_i x_i^T - N\cdot\bar{x}\bar{x}^T\right)\right]\\
&= \frac{1}{N-1}\left(\sum_{i=1}^{N}\mathrm{E}\left[x_i x_i^T\right] - N\cdot\mathrm{E}\left[\bar{x}\bar{x}^T\right]\right)\\
&= \frac{1}{N-1}\left(N\cdot\mathrm{Var}[x_i] - N\cdot\mathrm{Var}[\bar{x}]\right)\\
&= \frac{N}{N-1}\left(1 - \frac{1}{N}\right)\mathrm{Var}[x]\\
&= \mathrm{Var}[x]
\end{aligned}$$
where the third line follows after adding and subtracting $N\cdot\mathrm{E}[x]\,\mathrm{E}[x]^T$.
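A small Monte Carlo sketch of the three properties (parameter values are purely illustrative); the simulated averages should approximate the stated population quantities:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
N, reps = 20, 20000

means = np.empty((reps, 2))
covs = np.empty((reps, 2, 2))
for r in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=N)
    means[r] = X.mean(axis=0)
    covs[r] = np.cov(X, rowvar=False, ddof=1)

print(means.mean(axis=0))        # close to mu            (point a.: E[x_bar] = E[x])
print(N * means.var(axis=0))     # close to diag(Sigma)   (point b.: Var[x_bar] = Var[x]/N)
print(covs.mean(axis=0))         # close to Sigma         (point c.: E[S] = Var[x])
```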
Normal sampling
• While performing statistical estimation and inference, it is
often useful to know the exact sampling distribution of
selected statistics like x̄ and S.
• This is usually possible only in selected cases, like the one
where the sample is drawn from a normal distribution.
• Let the sample {x_i}_{i=1}^N be drawn from X ∼ N(µ, σ²); then:
$$\bar{X} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{N}\right)$$
$$\sqrt{N}\,\frac{\bar{X} - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
The t-statistic
• The "standardized" statistic √N(X̄ − µ)/σ is appealing for the sake of testing hypotheses about µ.
• However, it has a shortcoming: usually, σ is unknown.
Definition 9
The t-statistic. Given a univariate sample {x_i}_{i=1}^N of size N drawn from a sequence of random variables X_1, . . . , X_N, a t-statistic is defined as the following quantity:
$$t = \sqrt{N}\,\frac{\bar{X} - \mu}{S}$$
where X̄ is the sample mean whose expectation is µ = E[X̄], and S is the sample standard deviation.
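A minimal sketch of computing the t-statistic on one simulated sample (illustrative parameters; scipy's `ttest_1samp` reports the same quantity and is used here only as a cross-check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu_0 = 0.0                                   # hypothesized value of mu
x = rng.normal(loc=0.3, scale=1.5, size=25)  # one simulated sample

N = x.size
t = np.sqrt(N) * (x.mean() - mu_0) / x.std(ddof=1)

res = stats.ttest_1samp(x, popmean=mu_0)     # reports the same t-statistic
print(np.isclose(t, res.statistic))          # True
```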
Properties of normal sampling (1/6)
The exact sampling distribution of the t-statistic is well-known,
but deriving it requires a few steps.
Theorem 4
Sampling from the Normal Distribution. Consider a random sample {x_i}_{i=1}^N drawn from a random variable following the normal distribution X ∼ N(µ, σ²), and the random variables corresponding to the two sample statistics X̄ and S². The following three properties are true:
a. X̄ and S² are independent;
b. X̄ ∼ N(µ, σ²/N);
c. (N − 1)S²/σ² ∼ χ²_{N−1}.
Proof.
Point b. is straightforward, point c. is quite easy to show, but point a.
requires some more effort. (Continues. . . )
Properties of normal sampling (2/6)
Theorem 4
Proof.
(Continued.) Start with the observation that the sample variance can be expressed in terms of only N − 1 of the original random variables:
$$\begin{aligned}
S^2 &= \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2\\
&= \frac{1}{N-1}\left[\left(X_1 - \bar{X}\right)^2 + \sum_{i=2}^{N}\left(X_i - \bar{X}\right)^2\right]\\
&= \frac{1}{N-1}\left[\left(\sum_{i=2}^{N}\left(X_i - \bar{X}\right)\right)^2 + \sum_{i=2}^{N}\left(X_i - \bar{X}\right)^2\right]
\end{aligned}$$
where the last line follows from $\sum_{i=1}^{N}\left(X_i - \bar{X}\right) = 0$. Hence, proving that the sample mean is independent of the sample variance amounts to showing that it is independent of N − 1 out of N normally distributed random variables, say X_2 − X̄, . . . , X_N − X̄. (Continues. . . )
Properties of normal sampling (3/6)
Theorem 4
Proof.
(Continued.) Work with the standardization Z_i = (X_i − µ)/σ for i = 1, . . . , N, and let z = (Z_1, . . . , Z_N). Define the following random vector z̃ of length N as a function of z.
$$\tilde{z} = \begin{pmatrix}\bar{Z}\\ \tilde{Z}_2\\ \vdots\\ \tilde{Z}_N\end{pmatrix} = \begin{pmatrix}\bar{Z}\\ Z_2 - \bar{Z}\\ \vdots\\ Z_N - \bar{Z}\end{pmatrix} = \begin{pmatrix}N^{-1} & N^{-1} & \dots & N^{-1}\\ -N^{-1} & 1 - N^{-1} & \dots & -N^{-1}\\ \vdots & \vdots & \ddots & \vdots\\ -N^{-1} & -N^{-1} & \dots & 1 - N^{-1}\end{pmatrix} z$$
(Continues. . . )
Properties of normal sampling (6/6)
Theorem 4
Proof.
(Continued.) The statistic (N − 1)S²/σ² is shown to be the sum of the squares of N independent random variables, all of which follow the standard normal distribution, minus the square of another random variable that also follows the standard normal distribution. By the result in a. the latter is independent of the former. Note that, using m.g.f.s:
$$M_{\bar{\bar{Z}}^2}(t)\,M_{(N-1)S^2/\sigma^2}(t) = \prod_{i=1}^{N} M_{Z_i^2}(t)$$
where $\bar{\bar{Z}} \equiv \sqrt{N}\left(\bar{X} - \mu\right)/\sigma$ and $Z_i \equiv \left(X_i - \mu\right)/\sigma$, or equivalently:
$$M_{(N-1)S^2/\sigma^2}(t) = \frac{1}{M_{\bar{\bar{Z}}^2}(t)}\prod_{i=1}^{N} M_{Z_i^2}(t) = (1-2t)^{-\frac{1}{2}(N-1)}$$
which is recognized as the m.g.f. of the χ²_{N−1} distribution, establishing point c.
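A quick Monte Carlo sketch of points a. and c. under assumed parameter values: the correlation between X̄ and S² should be near zero, and the scaled sample variance should match the first two moments of χ²_{N−1}.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, N, reps = 2.0, 3.0, 10, 50000

X = rng.normal(mu, sigma, size=(reps, N))
x_bar = X.mean(axis=1)
s2 = X.var(axis=1, ddof=1)
q = (N - 1) * s2 / sigma**2

print(np.corrcoef(x_bar, s2)[0, 1])   # near 0: consistent with point a. (independence)
print(q.mean(), q.var())              # near N - 1 = 9 and 2(N - 1) = 18: chi-squared moments (point c.)
```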
Definition 10
Normal variance ratio. Consider two univariate random samples {x_i}_{i=1}^{N_X} and {y_i}_{i=1}^{N_Y} of sizes N_X and N_Y respectively, each drawn from two independent sequences of random variables (X_1, . . . , X_{N_X}) and (Y_1, . . . , Y_{N_Y}) whose distributions are as follows.
$$X_i \sim \mathcal{N}\!\left(\mu_X, \sigma^2_X\right)\quad\text{for } i = 1, \dots, N_X$$
$$Y_j \sim \mathcal{N}\!\left(\mu_Y, \sigma^2_Y\right)\quad\text{for } j = 1, \dots, N_Y$$
The normal variance ratio is the statistic
$$F = \frac{S^2_X/\sigma^2_X}{S^2_Y/\sigma^2_Y}$$
where S²_X and S²_Y are the sample variances of the two random samples.
An F-distribution for the F-statistic
• The F-statistic is used to test whether any two populations
have identical/similar variance.
Definition 11
The u-statistic. Given a multivariate sample {x_i}_{i=1}^N of size N drawn from a sequence of random vectors x_1, . . . , x_N with x ∼ N(µ, Σ), the u-statistic is defined as the random variable:
$$\begin{aligned}
u &= N\left(\bar{x} - \mu\right)^T\Sigma^{-1}\left(\bar{x} - \mu\right)\\
&= N\sum_{k=1}^{K}\sum_{\ell=1}^{K}\sigma^{*-1}_{k\ell}\left(\bar{X}_k - \mu_k\right)\left(\bar{X}_\ell - \mu_\ell\right)
\end{aligned}$$
where x̄ is the sample mean, whose expectation is µ = E[x_i] and whose variance is Var[x̄] = Σ/N, and where σ^{*-1}_{kℓ} is the kℓ-th element of Σ^{-1}.
To better interpret the u-statistic, one might analyze its expression as
a quadratic form in the second line of the definition: the statistic is a
second degree polynomial of the K deviations of all univariate sample
means from their respective mean parameters, normalized through the
population variance-covariance.
The distribution of the u-statistic (1/2)
One more time, using this statistic for inference purposes requires the
derivation of its exact distribution.
Theorem 5
Sampling from the Multivariate Normal Distribution. Let a
random sample {x_i}_{i=1}^N be drawn from some K-dimensional random
vector following the multivariate normal distribution, x ∼ N (µ, Σ).
In this environment, the u-statistic follows the chi-squared distribution
with K degrees of freedom.
$$u = N\left(\bar{x} - \mu\right)^T\Sigma^{-1}\left(\bar{x} - \mu\right) \sim \chi^2_K$$
Proof.
As usual, the result requires finding the m.g.f. of the random variable of interest: here, the u-statistic. (Continues. . . )
The distribution of the u-statistic (2/2)
Theorem 5
Proof.
(Continued.) This requires some linear algebra.
$$\begin{aligned}
M_u(t) &= \int_{\mathbb{R}^K}\exp\!\left(N\left(\bar{x} - \mu\right)^T\Sigma^{-1}\left(\bar{x} - \mu\right)t\right)f_{\bar{x}}\!\left(\bar{x};\mu,\frac{\Sigma}{N}\right)d\bar{x}\\
&= \int_{\mathbb{R}^K}\sqrt{\frac{N^K}{(2\pi)^K\left|\Sigma\right|}}\exp\!\left(-\frac{N}{2}\left(\bar{x} - \mu\right)^T\left(1 - 2t\right)\Sigma^{-1}\left(\bar{x} - \mu\right)\right)d\bar{x}\\
&= \frac{1}{\sqrt{(1 - 2t)^K}}\times\int_{\mathbb{R}^K}\sqrt{\frac{\left[\left(1 - 2t\right)N\right]^K}{(2\pi)^K\left|\Sigma\right|}}\exp\!\left(-\frac{N}{2}\left(\bar{x} - \mu\right)^T\left(1 - 2t\right)\Sigma^{-1}\left(\bar{x} - \mu\right)\right)d\bar{x}\\
&= (1 - 2t)^{-\frac{K}{2}}
\end{aligned}$$
Note: the integral in the third line disappears since its integrand is the p.d.f. of a multivariate normal distribution with mean µ and variance-covariance Σ/[(1 − 2t)N], which integrates to one. By recognizing the m.g.f. it follows that u ∼ χ²_K.
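A simulation sketch of Theorem 5 (µ and Σ below are illustrative): the simulated u-statistics should match the χ²_K moments, namely mean K and variance 2K.

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.4],
                  [0.0, 0.4, 1.5]])
Sigma_inv = np.linalg.inv(Sigma)
K, N, reps = 3, 15, 50000

u = np.empty(reps)
for r in range(reps):
    d = rng.multivariate_normal(mu, Sigma, size=N).mean(axis=0) - mu
    u[r] = N * d @ Sigma_inv @ d       # u = N (x_bar - mu)' Sigma^{-1} (x_bar - mu)

print(u.mean(), u.var())               # close to K = 3 and 2K = 6, the chi-squared moments
```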
Hotelling’s t-squared statistic
The same problem arises again! As in the univariate case, if Σ is unknown the u-statistic is of little practical use. What if Σ is replaced by its sample analog S?
Definition 12
Hotelling's "t-squared" statistic. Given some multivariate sample {x_i}_{i=1}^N of size N drawn from a sequence of random vectors x_1, . . . , x_N, Hotelling's t-squared statistic is defined as the random variable:
$$\begin{aligned}
t^2 &= N\left(\bar{x} - \mu\right)^T S^{-1}\left(\bar{x} - \mu\right)\\
&= N\sum_{k=1}^{K}\sum_{\ell=1}^{K} S^{*-1}_{k\ell}\left(\bar{X}_k - \mu_k\right)\left(\bar{X}_\ell - \mu_\ell\right)
\end{aligned}$$
where S^{*-1}_{kℓ} denotes the kℓ-th element of S^{-1}.
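A sketch of computing Hotelling's t² on one simulated sample (parameter values are illustrative); note that only the sample variance-covariance S, not the population Σ, enters the formula:

```python
import numpy as np

rng = np.random.default_rng(7)
mu_0 = np.zeros(3)                      # hypothesized mean vector
X = rng.multivariate_normal(mu_0 + 0.2, np.eye(3), size=30)

N = X.shape[0]
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, ddof=1)

d = x_bar - mu_0
t2 = N * d @ np.linalg.solve(S, d)      # N (x_bar - mu)' S^{-1} (x_bar - mu)
print(t2)
```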
Definition 13
Order statistics. Consider a sample {x_i}_{i=1}^N of realizations obtained from univariate random variables, {X_i}_{i=1}^N. Suppose that these values are placed in ascending order, where subscripts surrounded by parentheses denote one observation's position in the order:
$$X_{(1)} \le X_{(2)} \le \dots \le X_{(N)}$$
Definition 14
Sample Minimum. The sample minimum is the first order statistic, X_{(1)}.
Definition 15
Sample Maximum. The sample maximum is the N-th order statistic, X_{(N)}.
Definition 16
Sample Range. The sample range R is the difference between the
sample maximum and the sample minimum: R = X(N ) − X(1) .
Definition 17
Sample Median. The sample median M is a function of a sample’s
most central order statistics.
$$M = \begin{cases} X_{\left(\frac{N+1}{2}\right)} & \text{if } N \text{ is odd}\\[4pt] \dfrac{1}{2}\left(X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}\right) & \text{if } N \text{ is even}\end{cases}$$
Which sampling distribution for order statistics?
• One may want to provide a distribution for order statistics
(e.g. to model the probabilities for minima and maxima).
• Some relevant cases exist: namely, order statistics of uniform distributions, and minima or maxima of selected distributions.
Sampling distribution of order statistics (1/5)
Theorem 6
Sampling distribution of order statistics in random samples.
In a univariate random sample, the c.d.f. of the j-th order statistic is
based on the binomial distribution:
$$F_{X_{(j)}}(x) = \sum_{k=j}^{N}\binom{N}{k}\left[F_X(x)\right]^k\left[1 - F_X(x)\right]^{N-k}$$
where F_X(x) is the c.d.f. of the random variable X that generates the sample. Two particular cases are the c.d.f.s of the minimum and the maximum, which are as follows.
$$F_{X_{(1)}}(x) = 1 - \left[1 - F_X(x)\right]^N$$
$$F_{X_{(N)}}(x) = \left[F_X(x)\right]^N$$
Proof.
(Continues. . . )
Sampling distribution of order statistics (2/5)
Theorem 6
Proof.
(Continued.) For at least j realizations to be less than or equal to x, the event defined as X_i ≤ x must occur an integer number k of times, with j ≤ k ≤ N, whereas the complementary event X_i > x must instead occur N − k times. If the sample is random (i.i.d.), these two events occur with probabilities that are constant across all realizations.
$$P\left(X_i \le x\right) = F_X(x)$$
$$P\left(X_i > x\right) = 1 - F_X(x)$$
Hence the number of realizations not exceeding x is a binomial random variable with parameters N and F_X(x), and F_{X_{(j)}}(x) is the probability that it takes a value of at least j, which yields the expression above.
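A simulation sketch of the minimum and maximum formulas, using a standard normal population purely for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
N, reps, x0 = 5, 100_000, 0.5

X = rng.normal(size=(reps, N))
print((X.min(axis=1) <= x0).mean(), 1 - (1 - norm.cdf(x0)) ** N)  # empirical vs. F_{X_(1)}(x0)
print((X.max(axis=1) <= x0).mean(), norm.cdf(x0) ** N)            # empirical vs. F_{X_(N)}(x0)
```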
Corollary
(Theorem 6.) If X is a continuous distribution with p.d.f. f_X(x), the p.d.f. of the j-th order statistic is the following.
$$f_{X_{(j)}}(x) = \frac{N!}{(j-1)!\,(N-j)!}\,f_X(x)\left[F_X(x)\right]^{j-1}\left[1 - F_X(x)\right]^{N-j}$$
Proof.
The result obtains by taking the first derivative of F_{X_{(j)}}(x), followed by some manipulation. The initial differentiation is shown in the next slide; it applies the chain rule. The third line in the following display obtains by isolating the term corresponding to k = j. One must then show that the two sums that are left over cancel out against one another.
(Continues. . . )
Sampling distribution of order statistics (4/5)
Proof.
(Continued.) The initial operations are as follows.
$$\begin{aligned}
f_{X_{(j)}}(x) &= \frac{dF_{X_{(j)}}(x)}{dx}\\
&= \sum_{k=j}^{N}\binom{N}{k}\Big[k\left[F_X(x)\right]^{k-1}\left[1 - F_X(x)\right]^{N-k}f_X(x)\\
&\qquad\qquad - (N - k)\left[F_X(x)\right]^{k}\left[1 - F_X(x)\right]^{N-k-1}f_X(x)\Big]\\
&= \frac{N!}{(j-1)!\,(N-j)!}\,f_X(x)\left[F_X(x)\right]^{j-1}\left[1 - F_X(x)\right]^{N-j}\\
&\quad + \sum_{k=j+1}^{N}\binom{N}{k}k\left[F_X(x)\right]^{k-1}\left[1 - F_X(x)\right]^{N-k}f_X(x)\\
&\quad - \sum_{k=j}^{N}\binom{N}{k}(N - k)\left[F_X(x)\right]^{k}\left[1 - F_X(x)\right]^{N-k-1}f_X(x)
\end{aligned}$$
(Continues. . . )
Sampling distribution of order statistics (5/5)
Proof.
(Continued.) Re-index the summation of the second term above and note that, in the summation of the third term, the element for k = N is zero. Thus, the p.d.f. can be rewritten as follows.
$$\begin{aligned}
f_{X_{(j)}}(x) &= \frac{N!}{(j-1)!\,(N-j)!}\,f_X(x)\left[F_X(x)\right]^{j-1}\left[1 - F_X(x)\right]^{N-j}\\
&\quad + \sum_{k=j}^{N-1}\binom{N}{k+1}(k+1)\left[F_X(x)\right]^{k}\left[1 - F_X(x)\right]^{N-k-1}f_X(x)\\
&\quad - \sum_{k=j}^{N-1}\binom{N}{k}(N - k)\left[F_X(x)\right]^{k}\left[1 - F_X(x)\right]^{N-k-1}f_X(x)
\end{aligned}$$
Since $\binom{N}{k+1}(k+1) = \frac{N!}{k!\,(N-k-1)!} = \binom{N}{k}(N-k)$, the last two sums cancel out against one another, which completes the proof.
Observation 1
Consider a random sample obtained from the standard continuous uniform distribution, X ∼ U(0, 1). The j-th order statistic is such that X_{(j)} ∼ Beta(j, N − j + 1).
Proof.
Since F_X(x) = x and f_X(x) = 1 for x ∈ (0, 1), while f_X(x) = 0 outside this interval, the density function of X_{(j)} is, for x ∈ (0, 1):
$$\begin{aligned}
f_{X_{(j)}}(x) &= \frac{N!}{(j-1)!\,(N-j)!}\,x^{j-1}\left(1 - x\right)^{N-j}\\
&= \frac{\Gamma(N+1)}{\Gamma(j)\,\Gamma(N-j+1)}\,x^{j-1}\left(1 - x\right)^{(N-j+1)-1}\\
&= \frac{1}{B(j, N-j+1)}\cdot x^{j-1}\left(1 - x\right)^{(N-j+1)-1}
\end{aligned}$$
which is the p.d.f. of the Beta(j, N − j + 1) distribution.
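A quick sketch comparing the simulated j-th order statistic of U(0, 1) samples with the Beta(j, N − j + 1) distribution (N and j are illustrative):

```python
import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(9)
N, j, reps = 10, 3, 20000

U = rng.uniform(size=(reps, N))
x_j = np.sort(U, axis=1)[:, j - 1]     # the j-th order statistic of each sample

# Kolmogorov-Smirnov comparison with Beta(j, N - j + 1); a large p-value means no evidence against it.
print(kstest(x_j, beta(j, N - j + 1).cdf).pvalue)
```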
Definition 18
Extreme order statistics (minimum-maximum) stability. Consider a random sample drawn from some known distribution. If the sample minimum (maximum) follows another distribution of the same family, that distribution is said to be min-stable (max-stable).
Observation 2
Consider a random sample drawn from the exponential distribution
with parameter λ, X ∼ Exp (λ). In this case, the first order statistic
(the minimum) is such that X_{(1)} ∼ Exp(N^{-1}λ).
Proof.
By applying the formula for the distribution of the minimum:
$$\begin{aligned}
F_{X_{(1)}}(x; \lambda, N) &= 1 - \left[1 - F_X(x; \lambda)\right]^N\\
&= 1 - \left[\exp\!\left(-\frac{1}{\lambda}x\right)\right]^N\\
&= 1 - \exp\!\left(-\frac{N}{\lambda}x\right)
\end{aligned}$$
which is the c.d.f. of an exponential random variable with parameter N^{-1}λ. An analogous argument, based on the formula for the sample maximum, allows one to show both the Fréchet and the reverse Weibull results.
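A simulation sketch of Observation 2 (λ is the scale parameter, as in the c.d.f. above; values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
lam, N, reps = 2.0, 8, 100_000

minima = rng.exponential(scale=lam, size=(reps, N)).min(axis=1)

# The minimum should behave like an exponential with scale lambda / N.
print(minima.mean(), lam / N)          # both close to 0.25
print(minima.var(), (lam / N) ** 2)    # both close to 0.0625
```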
Min-stability of the Weibull distribution
Observation 5
Consider a random sample that is drawn from the traditional Weibull distribution with parameters α, µ, and σ, W ∼ Weibull(α, µ, σ). The sample minimum is such that W_{(1)} ∼ Weibull(α, µ, σN^{-1/α}).
Proof.
Things proceed similarly as in the Fréchet and reverse Weibull cases.
$$\begin{aligned}
F_{W_{(1)}}(w; \alpha, \mu, \sigma, N) &= 1 - \left[1 - F_W(w; \alpha, \mu, \sigma)\right]^N\\
&= 1 - \left[\exp\!\left(-\left(\frac{w - \mu}{\sigma}\right)^{\alpha}\right)\right]^N\\
&= 1 - \exp\!\left(-\left(\frac{w - \mu}{\sigma N^{-1/\alpha}}\right)^{\alpha}\right)
\end{aligned}$$
Note that here, the formula for the minimum is applied instead.
Sufficient Statistics
The final kind of statistic introduced in this lecture serves as a first step for the study of estimation (Lecture 5).
Definition 19
Sufficient statistics. Consider a given sample generated by a list of random vectors (x_1, . . . , x_N). Suppose that the joint distribution of the sample depends, among other things, on some parameter θ; write the associated p.m.f. or p.d.f. as f_{x_1,...,x_N}(x_1, . . . , x_N; θ). A statistic T = T(x_1, . . . , x_N) is said to be sufficient if the joint distribution of the sample, conditional on it, does not depend on θ:
$$f_{x_1,\dots,x_N\mid T}\left(x_1,\dots,x_N\mid T;\theta\right) = f_{x_1,\dots,x_N\mid T}\left(x_1,\dots,x_N\mid T\right)$$
Example: X̄ suffices for the normal's µ (1/2)
• For a univariate normal sample, X ∼ N(µ, σ²), the claim is that the sample mean X̄ is sufficient for µ; its realization x̄ follows.
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
• In the application that follows next, the following property is applied (as in the proof of point c. of Theorem 4).
$$\sum_{i=1}^{N}\left(x_i - \mu\right)^2 = \sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 + N\left(\bar{x} - \mu\right)^2 - 2\left(\bar{x} - \mu\right)\underbrace{\sum_{i=1}^{N}\left(x_i - \bar{x}\right)}_{=0}$$
Example: X̄ suffices for the normal’s µ (2/2)
The derivation follows suit; note how the property above is applied in the last step.
$$\begin{aligned}
\frac{f_{X_1,\dots,X_N}\left(x_1,\dots,x_N;\mu,\sigma^2\right)}{q_{\bar{X}}\left(\bar{x};\mu,\sigma^2/N\right)} &= \frac{\displaystyle\prod_{i=1}^{N}\sqrt{(2\pi\sigma^2)^{-1}}\,\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{-1}N}\,\exp\!\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)}\\[6pt]
&= \frac{\sqrt{(2\pi\sigma^2)^{-N}}\,\exp\!\left(-\displaystyle\sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{-1}N}\,\exp\!\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)}\\[6pt]
&= \frac{\exp\!\left(-\displaystyle\sum_{i=1}^{N}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{N-1}N}}
\end{aligned}$$
Since this ratio does not depend on µ, the sample mean X̄ is sufficient for µ.
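A numerical sketch of the same point (illustrative values; the helper `ratio` is not part of the slides): evaluating the ratio above at two different values of µ returns the same number, because µ cancels out.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
sigma, N = 1.5, 12
x = rng.normal(2.0, sigma, size=N)     # sigma is treated as known here
x_bar = x.mean()

def ratio(mu):
    # joint density of the sample divided by the density of the sample mean
    joint = np.prod(norm.pdf(x, loc=mu, scale=sigma))
    q_xbar = norm.pdf(x_bar, loc=mu, scale=sigma / np.sqrt(N))
    return joint / q_xbar

print(np.isclose(ratio(0.0), ratio(5.0)))   # True: mu has cancelled out
```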
Example: a sufficient order statistic
• Consider a random sample drawn from X ∼ U(0, θ). In this case it is an order statistic, the sample maximum X_{(N)}, that is sufficient for θ.
Theorem 7
Fisher-Neyman's Factorization Theorem. Consider a sample generated by a sequence of random vectors (x_1, . . . , x_N), whose joint distribution has a p.m.f. or a p.d.f. f_{x_1,...,x_N}(x_1, . . . , x_N; θ) that also depends on some parameter θ. Then, a statistic T = T(x_1, . . . , x_N) is sufficient for θ if and only if it is possible to identify two functions g(T(x_1, . . . , x_N); θ) and h(x_1, . . . , x_N) such that the following holds.
$$f_{x_1,\dots,x_N}\left(x_1,\dots,x_N;\theta\right) = g\left(T\left(x_1,\dots,x_N\right);\theta\right)\cdot h\left(x_1,\dots,x_N\right)$$
(Continues. . . )
The factorization theorem (3/7)
Theorem 7
Proof.
(Continued.) Since T(y_1, . . . , y_N) is constant in A_T(x_1, . . . , x_N), it is:
$$q_T\left(T\left(x_1,\dots,x_N\right);\theta\right) = g\left(T\left(x_1,\dots,x_N\right);\theta\right)\sum_{y_1,\dots,y_N\in A_T} h\left(y_1,\dots,y_N\right)$$
where |J*| is shorthand for the absolute value of the Jacobian of the inverse transformation, and the second line follows by hypothesis. It is obvious that the marginal distribution of Y_{11}, that is, the density function q_T(T(x_1, . . . , x_N); θ) of the statistic of interest T, inherits a factorization analogous to the one above. Since y_{11} = T(x_1, . . . , x_N), it can then be shown that the ratio between the joint p.d.f. of the sample and the p.d.f. of T does not depend on θ, hence T is sufficient.
(Continues. . . )
The factorization theorem (7/7)
Theorem 7
Proof.
(Continued.) In order to show the "sufficiency" part of the Theorem (if T is sufficient, then a proper factorization can be expressed), apply the definition of conditional density function to factor the joint density of the transformed sample into the marginal density of Y_{11} and the conditional density of the remaining elements, where the notation {·} \ Y_{11} denotes a list that excludes Y_{11}. Dividing both sides of the above by |J*| returns the desired factorization.
For the univariate normal distribution, for instance, choosing
$$g\left(\bar{x}, s^2; \mu, \sigma^2\right) = \left(\frac{1}{\sigma^2}\right)^{\frac{N}{2}}\exp\!\left(-\frac{N\left(\bar{x} - \mu\right)^2 + (N-1)\,s^2}{2\sigma^2}\right)$$
and h(x_1, . . . , x_N) = (2π)^{-N/2}, the theorem applies nicely.
Example: sufficiency for the multivariate normal
• These results extend to the multivariate case, where x̄, with realization x̄ = (1/N) Σ_{i=1}^N x_i, is sufficient for µ.
$$\frac{f_{x_1,\dots,x_N}\left(x_1,\dots,x_N;\mu,\Sigma\right)}{q_{\bar{x}}\left(\bar{x};\mu,\Sigma/N\right)} = \frac{\exp\!\left(-\dfrac{1}{2}\displaystyle\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^T\Sigma^{-1}\left(x_i-\bar{x}\right)\right)}{\sqrt{\left[(2\pi)^K\left|\Sigma\right|\right]^{N-1} N^K}}$$
Definition 20
Exponential (Macro-)family. A family of probability distributions
expressed by a vector of parameters θ = (θ1 , . . . , θJ ) is said to belong
to the exponential (macro)-family if the associated p.m.f.s or p.d.f.s can
be written, for J ≤ L, as follows.
$$f_X(x;\theta) = h(x)\,c(\theta)\exp\!\left(\sum_{\ell=1}^{L} w_\ell(\theta)\,t_\ell(x)\right)$$
Here h(x) and t_ℓ(x) are functions of the realizations x, while c(θ) ≥ 0 and w_ℓ(θ) are functions of the parameters θ, with ℓ = 1, . . . , L.
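As a concrete sketch (not in the slides): the N(µ, σ²) family fits this template with L = 2, t₁(x) = x, t₂(x) = x², w₁(θ) = µ/σ², w₂(θ) = −1/(2σ²), c(θ) = (2πσ²)^{-1/2} exp(−µ²/(2σ²)) and h(x) = 1. A quick numerical check with illustrative parameter values:

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.0, 4.0   # illustrative parameter values

def expfam_pdf(x):
    # N(mu, sigma2) density written as h(x) c(theta) exp(w1*t1(x) + w2*t2(x))
    h = 1.0
    c = (2 * np.pi * sigma2) ** -0.5 * np.exp(-mu**2 / (2 * sigma2))
    w = np.array([mu / sigma2, -1 / (2 * sigma2)])
    t = np.array([x, x**2])
    return h * c * np.exp(w @ t)

x0 = 0.7
print(np.isclose(expfam_pdf(x0), norm.pdf(x0, loc=mu, scale=np.sqrt(sigma2))))  # True
```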