
MATH 350: Statistical Inference Tanujit Chakraborty

Chapter 1: Reviews of Probability & Sampling Distributions

1 Random Variables

Recall the “fundamental principle”: data is a realization of a random process. Throughout this course, we will model the data using random variables. The goal of this lecture is to review, with a statistical focus, relevant concepts concerning random variables and their distributions.

Definition 1.1. A probability space is the triplet (Ω, F, P ), where Ω is the sample
space, F is the collection of events (σ-algebra), and P is a probability measure
defined over F, with P (Ω) = 1.

Definition 1.2. (Measurable function). Let f be a function from a measurable space (Ω, F) into the real numbers. We say that the function is measurable if for each Borel set B ∈ B, the set {ω ∈ Ω : f(ω) ∈ B} ∈ F.

Definition 1.3. (Random variable). A random variable X is a measurable function from a probability space (Ω, F, P) into the real numbers R (or a subset).
Example: the number of heads (X) obtained when tossing a coin three times.

A discrete random variable X can take a finite or countably infinite number of possible
values. We use discrete random variables to model categorical data (for example, which
presidential candidate a voter supports) and count data (for example, how many cups
of coffee a graduate student drinks in a day). The distribution of X is specified by its
probability mass function (PMF):

f_X(x) = P[X = x].

Then for any set A of values that X can take,

P[X ∈ A] = ∑_{x ∈ A} f_X(x).

A continuous random variable X takes values in R and models continuous real-valued data (for example, the height of a person). For any single value x ∈ R, P[X = x] = 0.

Instead, the distribution of X is specified by its probability density function (PDF) f_X(x), which satisfies, for any set A ⊆ R,

P[X ∈ A] = ∫_A f_X(x) dx.

In both cases, when it is clear which random variable is being referred to, we will
simply write f (x) for fX (x).
For any random variable X and real-valued function g, the expectation or mean of g(X) is its “average value”. If X is discrete with PMF f_X(x), then

E[g(X)] = ∑_x g(x) f_X(x),

where the sum is over all possible values of X. If X is continuous with PDF f_X(x), then

E[g(X)] = ∫_R g(x) f_X(x) dx.

The expectation is linear: for any random variables X_1, ..., X_n (not necessarily independent) and any c ∈ R,

E[X_1 + ... + X_n] = E[X_1] + ... + E[X_n],   E[cX] = c E[X].

If X_1, ..., X_n are independent, then

E[X_1 ⋯ X_n] = E[X_1] ⋯ E[X_n].
The variance of X is defined by the two equivalent expressions

Var[X] = E[(X − E[X])²] = E[X²] − (E[X])².

For any c ∈ R, Var[cX] = c² Var[X]. If X_1, ..., X_n are independent, then

Var[X_1 + ... + X_n] = Var[X_1] + ... + Var[X_n].

If X_1, ..., X_n are not independent, then this is not true: Var[X_1 + ... + X_n] will also depend on the covariance between each pair of variables. The standard deviation of X is √(Var[X]).

The distribution of X can also be specified by its cumulative distribution function (CDF) F_X(x) = P[X ≤ x]. In the discrete and continuous cases, respectively, this is given by

F_X(x) = ∑_{y: y ≤ x} f_X(y),   F_X(x) = ∫_{−∞}^{x} f_X(y) dy.

In the continuous case, the fundamental theorem of calculus implies

f_X(x) = (d/dx) F_X(x).

By definition, F_X is monotonically non-decreasing: F_X(x) ≤ F_X(y) if x < y. If F_X is continuous and strictly increasing, meaning F_X(x) < F_X(y) for all x < y, then F_X has an inverse function F_X^{−1} : (0, 1) → R called the quantile function: for any t ∈ (0, 1), F_X^{−1}(t) is the t-th quantile of the distribution of X, i.e. the value below which X falls with probability exactly t.
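As a quick illustration (a small scipy.stats sketch, not part of the original notes), the quantile function of the N(0, 1) distribution inverts its CDF:

```python
from scipy.stats import norm

# 0.975-quantile of the standard normal: the value z with P(Z <= z) = 0.975
z = norm.ppf(0.975)      # approximately 1.96
print(z)

# Applying the CDF recovers the probability level: F(F^{-1}(t)) = t
print(norm.cdf(z))       # approximately 0.975
```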

1.1 Conditional Probability

Definition 1.4. Two events A and B are independent if and only if the probability
of their intersection equals the product of their individual probabilities, that is

P (A ∩ B) = P (A)P (B).

Definition 1.5. Given two events A and B, with P(B) > 0, the conditional probability of A given B, denoted P(A | B), is defined by the relation

P(A | B) = P(A ∩ B) / P(B) = P(A, B) / P(B).

In connection with these definitions, the following result holds. Let {C_j : j = 1, ..., n} be a partition of Ω, that is, Ω = ∪_{j=1}^{n} C_j and C_i ∩ C_k = ∅ for i ≠ k. Let also A be an event. The Law of Total Probability states that

P(A) = ∑_{j=1}^{n} P(A | C_j) P(C_j).

Definition 1.6. Let X and Y be discrete, jointly distributed random variables. For P(X = x) > 0, the conditional probability function of Y given that X = x is

p_{Y|X=x}(y) = P(Y = y | X = x) = P(X = x, Y = y) / P(X = x),

and the conditional cumulative distribution function of Y given that X = x is

F_{Y|X=x}(y) = P(Y ≤ y | X = x) = ∑_{z ≤ y} p_{Y|X=x}(z).

Definition 1.7. Let X and Y have a joint continuous distribution. For f_X(x) > 0, the conditional density function of Y given that X = x is

f_{Y|X=x}(y) = f_{X,Y}(x, y) / f_X(x),

where f_{X,Y} is the joint probability density function of X and Y, and f_X is the marginal probability density function of X. The conditional cumulative distribution function of Y given that X = x is

F_{Y|X=x}(y) = ∫_{−∞}^{y} f_{Y|X=x}(z) dz.

Remark 1. (The law of total probability.) Let X and Y have a joint continuous distribution. Suppose that f_X(x) > 0, and let f_{Y|X=x}(y) be the conditional density function of Y given that X = x. The law of total probability states that

f_Y(y) = ∫_{−∞}^{∞} f_{Y|X=x}(y) f_X(x) dx.

1.2 Families of Distributions


The distributions presented here are parametric distributions. A parametric distri-
bution is a distribution that has one or more parameters (also known as statistical
parameters). Finally, a parameter (or statistical parameter) is a numerical character-
istic that indexes a family of probability distributions.
Discrete distributions: A random variable X is said to be discrete if the range of X
is countable. Some examples of discrete variables and their corresponding probability
mass functions are presented below.
Example 1.1. The Bernoulli distribution. The simplest example of a discrete random
variable corresponds to the case where the range of X is the set {0, 1}. The distribution
of X is:

P(X = x) = p for x = 1,   P(X = x) = 1 − p for x = 0.

This distribution is known as the Bernoulli distribution, and it is often denoted as X ∼ Bernoulli(p). Often, the event {x = 1} is called a “success”, and the event {x = 0} is called a “failure”. Thus, the parameter p is known as “the probability of success”. It follows that:

E[X] = 1 · p + 0 · (1 − p) = p,
Var[X] = (1 − p)² · p + (0 − p)² · (1 − p) = p(1 − p).

A more particular example is the case where X is the outcome observed from tossing a fair coin once. In this case P(X = heads) = P(X = tails) = 1/2.
Bernoulli random variables are used in many contexts, and they are often referred
to as Bernoulli trials. A Bernoulli trial is an experiment with two, and only two,
possible outcomes. Parameters: 0 ≤ p ≤ 1.
Example 1.2. The Binomial distribution. A Binomial random variable X is the total
number of successes in n Bernoulli trials. Consequently, the range of X is the set
{0, 1, 2, ..., n}. The probability of each outcome is given by:

P(X = x) = (n choose x) p^x (1 − p)^{n−x},

where (n choose x) = n! / (x!(n − x)!) is the Binomial coefficient (also known as a combination). If a random variable X has a Binomial distribution, it is denoted as X ∼ Binomial(n, p), where n is the number of trials and p is the probability of success.
The mean and variance of a Binomial random variable are E[X] = np and Var[X] = np(1 − p).
A more particular example is the case where X is the number of heads in n = 10 fair coin tosses. Then, for x = 0, 1, ..., 10:

P(X = x) = (10 choose x) 0.5^x (0.5)^{10−x} = (10 choose x) (0.5)^{10}.

Parameters: n ∈ Z+ and 0 ≤ p ≤ 1.
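As an illustration (a small scipy/numpy sketch, not from the original notes), the Binomial(10, 0.5) PMF above can be evaluated exactly and compared with a Monte Carlo estimate:

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.5
rng = np.random.default_rng(0)

# Exact PMF value P(X = 5) for X ~ Binomial(10, 0.5)
exact = binom.pmf(5, n, p)

# Monte Carlo estimate of the same probability
samples = rng.binomial(n, p, size=100_000)
estimate = np.mean(samples == 5)

print(exact, estimate)   # both close to 0.246
```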
Example 1.3. The Poisson distribution. A random variable X has a Poisson dis-
tribution if it takes values in the non-negative integers and its distribution is given
by:
P(X = x; λ) = (λ^x / x!) e^{−λ},   x = 0, 1, ...
It is possible to show that E[X] = Var[X] = λ. Parameters: λ > 0.
Example 1.4. The Multinomial distribution. The Multinomial distribution is a generalization of the binomial distribution. For n independent trials, each of which produces an outcome (success) in one of k ≥ 2 categories, where each category has a given fixed success probability θ_i, i = 1, ..., k, the Multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories. Thus, the pmf is

p(x_1, ..., x_k; θ_1, ..., θ_k) = (n! / (x_1! ⋯ x_k!)) θ_1^{x_1} ⋯ θ_k^{x_k},

where the parameters satisfy θ_i ≥ 0 for i = 1, ..., k and ∑_{i=1}^{k} θ_i = 1.

There are many other discrete probability distributions of practical interest, for
example, the hypergeometric distribution, and the discrete uniform distribution, among
others.

Continuous distributions: A random variable X is said to be continuous if its range is uncountable and its cumulative distribution function is continuous everywhere. Moreover, a random variable X is said to be absolutely continuous if there exists a nonnegative function f such that for any open set B:

P(X ∈ B) = ∫_B f(x) dx.

The function f is called the probability density function of X. This definition can be used to link the probability density function and the cumulative distribution function F as follows:

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt.

Definition 1.8. Suppose that f : D → R_+ is a density function. Then, the support of f, supp(f), is the set of points where f is positive:

supp(f) = {x ∈ D : f(x) > 0}.

Definition 1.9. A random variable X with cdf F has the characteristic function

φ_X(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} dF(x).

If the pdf f exists, then

φ_X(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} f(x) dx.

Another feature of continuous distributions is that P(X = x) = 0 for all x in the range of X. Some examples of continuous distributions are presented below.
Example 1.5. The Beta distribution. The probability density function of a Beta random variable X ∈ (0, 1) is:

f(x; a, b) = x^{a−1} (1 − x)^{b−1} / B(a, b),

where a, b > 0 are shape parameters and B(a, b) = Γ(a)Γ(b) / Γ(a + b) is the Beta function. It is important to distinguish the Beta distribution from the Beta special function.

Definition 1.10. The Beta function and Gamma function are special functions defined as follows:

B(a, b) = ∫_0^1 t^{a−1} (1 − t)^{b−1} dt,

Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt.

The mean of the Beta distribution is E[X] = a/(a + b), and the mode (maximum) is Mode(X) = (a − 1)/(a + b − 2) for a, b > 1. The variance is Var[X] = ab / ((a + b)²(a + b + 1)). Parameters: a > 0 and b > 0.
The uniform distribution is a special case of the Beta distribution, obtained when a = b = 1.
Example 1.6. The Normal or Gaussian distribution. The probability density function of a Gaussian random variable X ∈ R is:

f(x; µ, σ) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²)),

where −∞ < µ < ∞ and σ > 0 are parameters of this density function. In fact, E[X] = µ and Var[X] = σ². If a random variable X has a normal distribution with mean µ and variance σ², we will denote it X ∼ N(µ, σ²). This is one of the most popular distributions in applications and it appears in a number of statistical and probability models. Parameters: −∞ < µ < ∞ and σ > 0.
The Normal distribution has many interesting properties. One of them is that it is closed under summation, meaning that the sum of independent normal random variables is normally distributed. That is, let X_1, ..., X_n be i.i.d. random variables with distribution N(µ, σ²), and let Y = ∑_{j=1}^{n} X_j and Z = (1/n) ∑_{j=1}^{n} X_j. Then Y ∼ N(nµ, nσ²) and Z ∼ N(µ, σ²/n).
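A quick numerical check of this closure property (a hedged numpy sketch, not part of the original notes): the sample mean of n i.i.d. N(µ, σ²) draws should have mean µ and variance σ²/n.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 25, 200_000

# Each row is one sample of size n; each row mean is one realization of Z
samples = rng.normal(mu, sigma, size=(reps, n))
Z = samples.mean(axis=1)

print(Z.mean())                  # close to mu = 2.0
print(Z.var(), sigma**2 / n)     # both close to 0.36
```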

Example 1.7. The Logistic distribution. The probability density function of a logistic random variable X ∈ R is:

f(x; µ, σ) = exp(−(x − µ)/σ) / (σ [1 + exp(−(x − µ)/σ)]²),

where −∞ < µ < ∞ and σ > 0 are location and scale parameters, respectively. The mean and variance of X are given by E[X] = µ and Var[X] = π²σ²/3. The cdf of a logistic random variable is

F(x; µ, σ) = exp((x − µ)/σ) / (1 + exp((x − µ)/σ)).

Parameters: −∞ < µ < ∞ and σ > 0. This distribution is very popular in practice
as well. In particular, the so-called “logistic regression model” is based on this dis-
tribution, as well as some Machine Learning algorithms (logistic sigmoidal activation
function in neural networks).

Example 1.8. The Exponential distribution. The probability density function of an Exponential random variable X > 0 is:

f (x; λ) = λ exp{−λx}

where λ > 0 is a rate parameter. The mean and variance are given by E[X] = 1/λ and Var[X] = 1/λ². Parameters: λ > 0. This distribution is widely used in engineering and the analysis of survival times.

Example 1.9. The Gamma distribution. The probability density function of a Gamma random variable X > 0 is:

f(x; κ, θ) = (1 / (Γ(κ) θ^κ)) x^{κ−1} exp(−x/θ),

where κ > 0 is a shape parameter, θ > 0 is a scale parameter, and Γ(z) = ∫_0^∞ s^{z−1} e^{−s} ds is the Gamma function (for positive integers n, Γ(n) = (n − 1)!). Parameters: κ > 0 and θ > 0. The mean and variance of X are given by E[X] = κθ and Var[X] = κθ². This distribution is widely used in engineering and the analysis of survival times.

1.3 Joint distributions


If random variables X1 , . . . , Xk are independent, then their distribution may be speci-
fied by specifying the individual distribution of each variable. If they are not indepen-
dent, then we need to specify their joint distribution. In the discrete case, the joint
distribution is specified by a joint PMF

fX1 ,...,Xk (x1 , . . . , xk ) = P [X1 = x1 , . . . , Xk = xk ] .

In the continuous case, it is specified by a joint PDF f_{X_1,...,X_k}(x_1, ..., x_k), which satisfies, for any set A ⊆ R^k,

P[(X_1, ..., X_k) ∈ A] = ∫_A f_{X_1,...,X_k}(x_1, ..., x_k) dx_1 ... dx_k.

When it is clear which random variables are being referred to, we will simply write
f (x1 , . . . , xk ) for fX1 ,...,Xk (x1 , . . . , xk ).

Example 1.10. (X_1, ..., X_k) have a multinomial distribution,

(X_1, ..., X_k) ∼ Multinomial(n, (p_1, ..., p_k)),

if these random variables take nonnegative integer values summing to n, with joint PMF

f(x_1, ..., x_k) = (n choose x_1, ..., x_k) p_1^{x_1} p_2^{x_2} ⋯ p_k^{x_k}.

Here, p_1, ..., p_k are values in [0, 1] that satisfy p_1 + ... + p_k = 1 (representing the probabilities of k different mutually exclusive outcomes), and (n choose x_1, ..., x_k) = n! / (x_1! x_2! ⋯ x_k!) is the multinomial coefficient. (It is understood that the above formula holds only for x_1, ..., x_k ≥ 0 such that x_1 + ... + x_k = n; otherwise f(x_1, ..., x_k) = 0.) X_1, ..., X_k describe the numbers of samples belonging to each of k different outcomes, if there are n total samples each independently belonging to outcomes 1, ..., k with probabilities p_1, ..., p_k. For example, if I roll a standard six-sided die 100 times and let X_1, ..., X_6 denote the numbers of 1's to 6's obtained, then (X_1, ..., X_6) ∼ Multinomial(100, (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)).

A second example of a joint distribution is the Multivariate Normal distribution (to be discussed later).
A third example is the Dirichlet distribution, a distribution of continuous random variables related to the Multinomial distribution. Sampling from a Dirichlet distribution yields a random vector of length k whose elements are non-negative and sum to 1; that is, it generates a random probability vector.

Example 1.11. The Dirichlet distribution is a multivariate distribution over the simplex ∑_{i=1}^{k} x_i = 1, x_i ≥ 0. Its probability density function is

p(x_1, ..., x_k; α_1, ..., α_k) = (1 / B(α)) ∏_{i=1}^{k} x_i^{α_i − 1},

where B(α) = (∏_{i=1}^{k} Γ(α_i)) / Γ(∑_{i=1}^{k} α_i), with Γ(a) being the Gamma function, and α = (α_1, ..., α_k) are the parameters of this distribution.

You can view it as a generalization of the Beta distribution. For Z = (Z_1, ..., Z_k) ∼ Dirichlet(α_1, ..., α_k), E(Z_i) = α_i / ∑_{j=1}^{k} α_j, and the mode of Z_i is (α_i − 1) / (∑_{j=1}^{k} α_j − k), so each parameter α_i determines the relative importance of category (state) i. Because it puts probability over k categories, the Dirichlet distribution is very popular in the social sciences and in linguistic analysis.
The Dirichlet distribution is often used as a prior distribution for the multinomial parameters p_1, ..., p_k in Bayesian inference.
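As a brief illustration (a hedged numpy sketch, not from the original notes), draws from a Dirichlet distribution are non-negative vectors summing to 1, with mean α_i / ∑_j α_j:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = [2.0, 3.0, 5.0]

# 10,000 draws from Dirichlet(2, 3, 5); each row is a random probability vector
Z = rng.dirichlet(alpha, size=10_000)

print(Z[0], Z[0].sum())    # one draw; its entries sum to 1
print(Z.mean(axis=0))      # close to alpha / sum(alpha) = [0.2, 0.3, 0.5]
```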

The covariance between two random variables X and Y is defined by the two equivalent expressions

Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y].

So Cov[X, X] = Var[X], and Cov[X, Y] = 0 if X and Y are independent. The covariance is bilinear: for any constants a_1, ..., a_k, b_1, ..., b_m ∈ R and any random variables X_1, ..., X_k and Y_1, ..., Y_m (not necessarily independent),

Cov[a_1 X_1 + ... + a_k X_k, b_1 Y_1 + ... + b_m Y_m] = ∑_{i=1}^{k} ∑_{j=1}^{m} a_i b_j Cov[X_i, Y_j].

The correlation between X and Y is their covariance normalized by the product of their standard deviations:

corr(X, Y) = Cov[X, Y] / (√(Var[X]) √(Var[Y])).

For any a, b > 0, we have Cov[aX, bY] = ab Cov[X, Y]. On the other hand, the correlation is invariant to rescaling: corr(aX, bY) = corr(X, Y), and it always satisfies −1 ≤ corr(X, Y) ≤ 1.
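A small numerical check (a hedged numpy sketch, not part of the original notes) of the rescaling properties above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50_000)
y = 0.6 * x + 0.8 * rng.normal(size=50_000)   # correlated with x, corr about 0.6

# Covariance scales by a*b; correlation is invariant to rescaling
print(np.cov(x, y)[0, 1], np.cov(2 * x, 3 * y)[0, 1] / 6)        # approximately equal
print(np.corrcoef(x, y)[0, 1], np.corrcoef(2 * x, 3 * y)[0, 1])  # equal, about 0.6
```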

1.4 Moment generating functions


A tool that will be particularly useful for us is the moment generating function (MGF) of a random variable X. This is a function of a single argument t ∈ R, defined as

M_X(t) = E[e^{tX}].

Depending on the random variable X, M_X(t) might be infinite for some values of t. Here are two examples:
Example 1.12. (Normal MGF). Suppose X ∼ N(0, 1). Then

M_X(t) = E[e^{tX}] = ∫ e^{tx} (1/√(2π)) e^{−x²/2} dx = ∫ (1/√(2π)) e^{(−x² + 2tx)/2} dx.

To compute this integral, we complete the square:

∫ (1/√(2π)) e^{(−x² + 2tx)/2} dx = ∫ (1/√(2π)) e^{(−x² + 2tx − t²)/2 + t²/2} dx = e^{t²/2} ∫ (1/√(2π)) e^{−(x−t)²/2} dx.

The quantity inside the last integral above is the PDF of the N(t, 1) distribution, hence it must integrate to 1. Then M_X(t) = e^{t²/2}.

Now suppose X ∼ N(µ, σ²). Then X = µ + σZ, where Z ∼ N(0, 1). So

M_X(t) = E[e^{tX}] = E[e^{µt + σtZ}] = e^{µt} E[e^{σtZ}] = e^{µt} M_Z(σt) = e^{µt + σ²t²/2}.

For a normal random variable X, M_X(t) is finite for all t ∈ R.


Example 1.13. (Gamma MGF). Suppose X ∼ Gamma(α, β), for α, β > 0 (shape α, rate β). Then

M_X(t) = E[e^{tX}] = ∫_0^∞ e^{tx} (β^α / Γ(α)) x^{α−1} e^{−βx} dx = ∫_0^∞ (β^α / Γ(α)) x^{α−1} e^{(t−β)x} dx.

If t > β, then lim_{x→∞} x^{α−1} e^{(t−β)x} = ∞, so certainly the integral above is infinite. If t = β, note that ∫_0^∞ x^{α−1} dx = [x^α / α]_0^∞ = ∞, since α > 0. Hence M_X(t) = ∞ for any t ≥ β. For t < β, let us rewrite the above to isolate the PDF of the Gamma(α, β − t) distribution:

M_X(t) = (β^α / (β − t)^α) ∫_0^∞ ((β − t)^α / Γ(α)) x^{α−1} e^{−(β−t)x} dx.

As the PDF of the Gamma(α, β − t) distribution integrates to 1, we obtain finally

M_X(t) = ∞ for t ≥ β, and M_X(t) = β^α / (β − t)^α = (1 − β^{−1} t)^{−α} for t < β.

If the MGF of a random variable X is finite in an interval that contains 0 as an interior point, as in the above two examples, then (like the PDF or CDF) it completely specifies the distribution of X. This is the content of the following theorem (which we will not prove in this class):

Theorem 1.1. Let X and Y be two random variables such that, for some h > 0
and every t ∈ (−h, h), both MX (t) and MY (t) are finite and MX (t) = MY (t).
Then X and Y have the same distribution.

The reason why the MGF will be useful for us is that if X_1, ..., X_n are independent, then the MGF of their sum satisfies

M_{X_1 + ... + X_n}(t) = E[e^{t(X_1 + ... + X_n)}] = E[e^{tX_1}] × ... × E[e^{tX_n}] = M_{X_1}(t) ⋯ M_{X_n}(t).

This gives us a very simple tool to understand the distributions of sums of independent random variables.
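As a quick sanity check (a hedged numpy sketch, not part of the original notes), the product rule for MGFs of independent variables can be verified by Monte Carlo, here for two independent Exponential(1) variables whose MGF is 1/(1 − t) for t < 1:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 0.3
x = rng.exponential(scale=1.0, size=500_000)   # X ~ Exponential(rate 1)
y = rng.exponential(scale=1.0, size=500_000)   # Y ~ Exponential(rate 1), independent of X

# Monte Carlo estimates of M_{X+Y}(t) and M_X(t) * M_Y(t)
lhs = np.mean(np.exp(t * (x + y)))
rhs = np.mean(np.exp(t * x)) * np.mean(np.exp(t * y))

print(lhs, rhs)   # both close to (1 - t)^(-2), about 2.04
```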

2 Sampling Distributions

For data X_1, ..., X_n, a statistic T(X_1, ..., X_n) is any real-valued function of the data. In other words, it is any number that you can compute from the data. For example, the sample mean

X̄ = (1/n)(X_1 + ... + X_n),

the sample variance

S² = (1/(n − 1)) [(X_1 − X̄)² + ... + (X_n − X̄)²],

and the range

R = max(X_1, ..., X_n) − min(X_1, ..., X_n)
are all statistics. Since the data X1 , . . . , Xn are realizations of random variables, a
statistic is also a (realization of a) random variable. A major use of probability in
this course will be to understand the distribution of a statistic, called its sampling
distribution, based on the distribution of the original data X1 , . . . , Xn . Let’s work
through some examples:
Example 2.1. (Sample mean of IID normals). Suppose X_1, ..., X_n are i.i.d. N(µ, σ²). The sample mean X̄ is actually a special case of the quantity a_1 X_1 + ... + a_n X_n from Example 5.1, where a_i = 1/n, µ_i = µ, and σ_i² = σ² for all i = 1, ..., n. Then, from that Example,

X̄ ∼ N(µ, σ²/n).
Example 2.2. (Chi-squared distribution). Suppose X_1, ..., X_n are i.i.d. N(0, 1). Let's derive the distribution of the statistic X_1² + ... + X_n².

By independence of X_1², ..., X_n²,

M_{X_1² + ... + X_n²}(t) = M_{X_1²}(t) × ... × M_{X_n²}(t).

We may compute, for each X_i, its MGF

M_{X_i²}(t) = E[e^{tX_i²}] = ∫ e^{tx²} (1/√(2π)) e^{−x²/2} dx = ∫ (1/√(2π)) e^{(t − 1/2)x²} dx.

If t ≥ 1/2, then M_{X_i²}(t) = ∞. Otherwise,

M_{X_i²}(t) = (1/√(1 − 2t)) ∫ (√(1 − 2t)/√(2π)) e^{−(1 − 2t)x²/2} dx.

We recognize the quantity inside this integral as the PDF of the N(0, 1/(1 − 2t)) distribution, and hence the integral equals 1. Then

M_{X_i²}(t) = ∞ for t ≥ 1/2, and M_{X_i²}(t) = (1 − 2t)^{−1/2} for t < 1/2.

This is the MGF of the Gamma(1/2, 1/2) distribution, so X_i² ∼ Gamma(1/2, 1/2). This is also called the chi-squared distribution with 1 degree of freedom, denoted χ²_1.
Going back to the sum,

M_{X_1² + ... + X_n²}(t) = M_{X_1²}(t) × ... × M_{X_n²}(t) = ∞ for t ≥ 1/2, and = (1 − 2t)^{−n/2} for t < 1/2.

This is the MGF of the Gamma(n/2, 1/2) distribution, so X_1² + ... + X_n² ∼ Gamma(n/2, 1/2). This is called the chi-squared distribution with n degrees of freedom, denoted χ²_n.
The mean and variance of the χ²_n can be obtained from the MGF:

(d/dt)[M_{χ²_n}(t)]|_{t=0} = (d/dt)(1 − 2t)^{−n/2}|_{t=0} = n, so E[χ²_n] = n;

(d²/dt²)[M_{χ²_n}(t)]|_{t=0} = (d²/dt²)(1 − 2t)^{−n/2}|_{t=0} = n(n + 2), so Var[χ²_n] = n(n + 2) − n² = 2n.
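A quick simulation (a hedged numpy sketch, not from the original notes) comparing the sum of n squared standard normals with the χ²_n mean and variance derived above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000

# Each row holds n iid N(0, 1) draws; the row-wise sum of squares is one chi-squared_n draw
X = rng.standard_normal(size=(reps, n))
chi2_samples = (X**2).sum(axis=1)

print(chi2_samples.mean(), n)       # mean close to n = 5
print(chi2_samples.var(), 2 * n)    # variance close to 2n = 10
```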

Example 2.3. The Chi-square distribution. The probability density function of the chi-square (χ²_n) distribution with n degrees of freedom is

f(x; n) = x^{n/2 − 1} e^{−x/2} / (2^{n/2} Γ(n/2)),   x > 0.

The mean of the chi-square distribution is n and the variance is 2n. Parameters: n > 0.
Example 2.4. Student's t-distribution. Student's t-distribution has pdf

f(x) = (Γ((ν + 1)/2) / (σ √(νπ) Γ(ν/2))) [1 + (1/ν)((x − µ)/σ)²]^{−(ν+1)/2},

where x ∈ R, µ ∈ R, σ > 0, and ν > 0. The mean of the Student's t-distribution is µ for degrees of freedom ν > 1, and is undefined for ν ≤ 1. The variance of this distribution is σ²ν/(ν − 2) for ν > 2, and is undefined or infinite for ν ≤ 2. Moreover, the Student's t-distribution converges to the normal distribution (pointwise) as ν → ∞. Parameters: µ ∈ R, σ > 0, and ν > 0.

Relationship with other distributions:
• Let X_1, ..., X_n be i.i.d. random variables with distribution N(0, 1). Let Y = ∑_{j=1}^{n} X_j². Then Y ∼ χ²_n.
• Let X ∼ N(0, 1) and W ∼ χ²_n be independent. Then Y = X/√(W/n) has Student's t-distribution (µ = 0, σ = 1) with n degrees of freedom.
Example 2.5. F distribution. Suppose X and Y are independently distributed as chi-squared with m and n degrees of freedom, respectively. Then the ratio

F_{m,n} = (X/m) / (Y/n)

is said to have the F distribution with (m, n) degrees of freedom.

Remark 2. The “degrees of freedom” in the χ², t, and F distributions is the number of unrestricted variables in the corresponding expression. For example, ∑_{i=1}^{n} (X_i − X̄)² has (n − 1) degrees of freedom, since out of the n variables one restriction, ∑_{i=1}^{n} (X_i − X̄) = 0, is imposed. As a rule of thumb, degrees of freedom = (the number of quantities involved in the expression) − (the number of linear restrictions).

3 Kernels and Parameters

Definition 3.1. The q-th quantile of the distribution of a random variable X is the value x such that P(X ≤ x) = q. If q = 0.5, the value is called the median. The cases q = 0.25 and q = 0.75 correspond to the lower quartile and upper quartile, respectively.

Definition 3.2. The kernel of a probability density function (pdf) or probability mass function (pmf) is the factor of the pdf or pmf in which any factors that are not functions of any of the variables in the domain are omitted.

For example, the kernel of the Beta distribution is:

K(x; a, b) = xa−1 (1 − x)b−1 .

The kernel of the Gaussian (Normal) distribution is:

K(x; µ, σ) = exp(−(x − µ)² / (2σ²)).
Types of parameters
The parameters of a distribution are classified into three types: location parameters,
scale parameters, and shape parameters.

Definition 3.3. A parameter µ of a distribution function F(x; µ) is called a location parameter if F(x; µ) = F(x − µ; 0). An equivalent definition can be made using the probability density function. That is, a parameter µ of a density function f(x; µ) is called a location parameter if f(x; µ) = f(x − µ; 0).

An example of a location parameter is the parameter µ in the Gaussian distribution.

Definition 3.4. A parameter σ > 0 of a distribution function F(x; σ) is called a scale parameter if F(x; σ) = F(x/σ; 1). An equivalent definition can be made using the probability density function. That is, a parameter σ of a density function f(x; σ) is called a scale parameter if f(x; σ) = (1/σ) f(x/σ; 1).

An example of a scale parameter is the parameter σ in the Gaussian distribution.

Definition 3.5. A shape parameter is a parameter that is neither a location nor a scale parameter. This kind of parameter controls the shape of a distribution (equivalently, density) function; e.g., Lehmann alternatives.

An example of a shape parameter is the parameter κ in the Gamma distribution.

Definition 3.6. A distribution is said to belong to the Location-Scale family of distributions if it is parameterized in terms of a location and a scale parameter. Examples of members of the location-scale family are the Gaussian and Logistic distributions. The Gamma distribution is not a member of this family (why?).

Definition 3.7. A distribution F(x; θ) is said to be identifiable if F(x; θ_1) = F(x; θ_2) for all x implies that θ_1 = θ_2, for all possible values of θ_1, θ_2.

4 Bivariate Normal Distribution

Definition 4.1. A bivariate random variable (X, Y) is said to have a bivariate normal distribution if the pdf of (X, Y) is of the following form:

f(x, y) = (1 / (2πσ_1σ_2√(1 − ρ²))) exp{ −(1/(2(1 − ρ²))) [ ((x − µ_1)/σ_1)² − 2ρ((x − µ_1)/σ_1)((y − µ_2)/σ_2) + ((y − µ_2)/σ_2)² ] },   (x, y) ∈ R²,

where µ_1, µ_2 ∈ R, σ_1, σ_2 > 0, |ρ| < 1. Then, we write (X, Y) ∼ BN(µ_1, µ_2, σ_1², σ_2², ρ).

4.1 Marginal Distribution:

Note that

(1/(1 − ρ²)) [ ((x − µ_1)/σ_1)² − 2ρ((x − µ_1)/σ_1)((y − µ_2)/σ_2) + ((y − µ_2)/σ_2)² ]
= {y − µ_2 − ρ(σ_2/σ_1)(x − µ_1)}² / (σ_2²(1 − ρ²)) + ((x − µ_1)/σ_1)²
= (y − β_x)²/σ_{2·1}² + (x − µ_1)²/σ_1²,

where β_x = µ_2 + ρ(σ_2/σ_1)(x − µ_1) and σ_{2·1}² = σ_2²(1 − ρ²). The marginal pdf of X is therefore

f_X(x) = (1/(σ_1√(2π))) exp[−(1/2)((x − µ_1)/σ_1)²] × ∫_{−∞}^{∞} (1/(√(2π) σ_{2·1})) e^{−(y − β_x)²/(2σ_{2·1}²)} dy
       = (1/(σ_1√(2π))) e^{−(1/2)((x − µ_1)/σ_1)²},   x ∈ R.

Hence, X ∼ N(µ_1, σ_1²). Similarly, it can be shown that Y ∼ N(µ_2, σ_2²).

4.2 Conditional Distribution:

The PDF of Y given X = x is

f_{Y|X}(y|x) = f_{X,Y}(x, y) / f_X(x)
= [ (1/(2πσ_1σ_2√(1 − ρ²))) e^{−(1/2)[((x − µ_1)/σ_1)² + ((y − β_x)/σ_{2·1})²]} ] / [ (1/(σ_1√(2π))) e^{−(1/2)((x − µ_1)/σ_1)²} ]
= (1/(σ_{2·1}√(2π))) e^{−(1/2)((y − β_x)/σ_{2·1})²},   y ∈ R.

Hence, Y | X = x ∼ N(β_x, σ_{2·1}²), i.e. Y | X = x ∼ N(µ_2 + ρ(σ_2/σ_1)(x − µ_1), σ_2²(1 − ρ²)).
Similarly, it can be shown that X | Y = y ∼ N(µ_1 + ρ(σ_1/σ_2)(y − µ_2), σ_1²(1 − ρ²)).
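A brief simulation (a hedged numpy sketch, not part of the original notes) of the conditional-mean formula E(Y | X = x) = µ_2 + ρ(σ_2/σ_1)(x − µ_1):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 2.0, 3.0, 0.5
cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]

xy = rng.multivariate_normal([mu1, mu2], cov, size=500_000)
x, y = xy[:, 0], xy[:, 1]

# Average of Y over draws with X near x0, versus the theoretical conditional mean
x0 = 2.0
near = np.abs(x - x0) < 0.05
print(y[near].mean(), mu2 + rho * (s2 / s1) * (x0 - mu1))   # both close to -1.25
```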

Remark 3.

1. Note that E(Y | X = x) = µ_2 + ρ(σ_2/σ_1)(x − µ_1) and Var(Y | X = x) = σ_2²(1 − ρ²). Hence, the regression of Y on X is linear and the conditional distribution is homoscedastic.

2. E(XY) = E[E(XY | X)] = E[X · E(Y | X)] = E[X{µ_2 + ρ(σ_2/σ_1)(X − µ_1)}] = µ_1µ_2 + ρ(σ_2/σ_1)σ_1²   [since E{X(X − µ_1)} = E(X − µ_1)² = σ_1²],
   so E(XY) = µ_1µ_2 + ρσ_1σ_2, hence (E(XY) − µ_1µ_2)/(σ_1σ_2) = ρ, i.e. ρ_{XY} = ρ.

3. If ρ² = 1, the PDF above becomes undefined. But if ρ = ±1, then P[αX + βY + γ = 0] = 1 for some non-null (α, β), which is known as the singular or degenerate bivariate normal distribution.

5 Multivariate Normal Distribution

Definition 5.1. A p-dimensional random vector X = (X_1, ..., X_p) is said to be distributed as a multivariate Normal if and only if its probability density function is:

φ_X(x_1, ..., x_p; µ, Σ) = (1 / √((2π)^p |Σ|)) exp( −(1/2)(x − µ)^⊤ Σ^{−1} (x − µ) ),

where µ = E[X] is the location parameter and Σ = E[(X − µ)(X − µ)^⊤] is the covariance matrix. We denote it as X ∼ N_p(µ, Σ).

The Multivariate Normal distribution of dimension p is a distribution for p random variables X_1, ..., X_p which generalizes the normal distribution for a single variable. It is parametrized by a mean vector µ ∈ R^p and a symmetric covariance matrix Σ ∈ R^{p×p}, and we write

(X_1, ..., X_p) ∼ N(µ, Σ).

Rather than writing down the general formula for its joint PDF (which we will not use in this course), let's define this distribution by the following properties:

Definition 5.2. (X1 , . . . , Xp ) have a multivariate normal distribution if, for every
choice of constants a1 , . . . , ap ∈ R, the linear combination a1 X1 + . . . + ap Xp has
a (univariate) normal distribution. (X1 , . . . , Xp ) have the specific multivariate
normal distribution N (µ, Σ) when, in addition,

1. E [Xi ] = µi and Var [Xi ] = Σii for every i = 1, . . . , p, and

2. Cov [Xi , Xj ] = Σij for every pair i ̸= j.

When (X_1, ..., X_p) are multivariate normal, each X_i has a (univariate) normal distribution, as may be seen by taking a_i = 1 and all other a_j = 0 in the above definition. The vector µ specifies the means of these individual normal variables, the diagonal elements of Σ specify their variances, and the off-diagonal elements of Σ specify their pairwise covariances.
Example 5.1. If X_1, ..., X_p are normal and independent, then a_1 X_1 + ... + a_p X_p has a normal distribution for any a_1, ..., a_p ∈ R. To show this, we can use the MGF: suppose X_i ∼ N(µ_i, σ_i²). Then a_i X_i ∼ N(a_i µ_i, a_i² σ_i²), so (from Example 1.12) a_i X_i has MGF

M_{a_i X_i}(t) = e^{a_i µ_i t + a_i² σ_i² t² / 2}.

As a_1 X_1, ..., a_p X_p are independent, the MGF of their sum is the product of their MGFs:

M_{a_1 X_1 + ... + a_p X_p}(t) = M_{a_1 X_1}(t) × ... × M_{a_p X_p}(t)
= e^{a_1 µ_1 t + a_1² σ_1² t² / 2} × ... × e^{a_p µ_p t + a_p² σ_p² t² / 2}
= e^{(a_1 µ_1 + ... + a_p µ_p)t + (a_1² σ_1² + ... + a_p² σ_p²)t²/2}.

But this is the MGF of a N(a_1 µ_1 + ... + a_p µ_p, a_1² σ_1² + ... + a_p² σ_p²) random variable! As the MGF uniquely determines the distribution, this implies a_1 X_1 + ... + a_p X_p has this normal distribution.
Then, by definition, (X_1, ..., X_p) are multivariate normal. More specifically, in this case we must have (X_1, ..., X_p) ∼ N(µ, Σ) where µ_i = E[X_i], Σ_{ii} = Var[X_i], and Σ_{ij} = 0 for all i ≠ j.
Example 5.2. Suppose (X1 , . . . , Xp ) have a multivariate normal distribution, and
(Y1 , . . . , Ym ) are such that each Yj (j = 1, . . . , m) is a linear combination of X1 , . . . , Xp :

Yj = aj1 X1 + . . . + ajp Xp

for some constants aj1 , . . . , ajp ∈ R. Then any linear combination of (Y1 , . . . , Ym )
is also a linear combination of (X1 , . . . , Xp ), and hence is normally distributed. So
(Y1 , . . . , Ym ) also have a multivariate normal distribution.

For two arbitrary random variables X and Y , if they are independent, then
corr(X, Y ) = 0. The converse is in general not true: X and Y can be uncorrelated with-
out being independent. But this converse is true in the special case of the multivariate
normal distribution; more generally, we have the following:

Theorem 5.1. Suppose X is multivariate normal and can be written as X = (X_1, X_2), where X_1 and X_2 are subvectors of X such that each entry of X_1 is uncorrelated with each entry of X_2. Then X_1 and X_2 are independent.

To visualize what the joint PDF of the multivariate normal distribution looks like, let's consider the two-dimensional setting p = 2, where we obtain the special case of a Bivariate Normal distribution for two random variables X, Y. In this case, the distribution is specified by the means µ_1 and µ_2 of X and Y, the variances σ_1² and σ_2² of X and Y, and the correlation ρ between X and Y. Consider the contours of the joint PDF of X and Y when σ_1² = σ_2² = 1 and µ_1 = µ_2 = 0, for ρ = 0 and for ρ = 0.75.

When ρ = 0, X and Y are independent standard normal variables, and these contours are circular; the joint PDF has a peak at 0 and decays radially away from 0. When ρ = 0.75, the contours are ellipses. As ρ increases to 1, the contours concentrate more and more around the line y = x. (In the general p-dimensional setting and for general µ and Σ, the joint PDF has a single peak at the mean µ ∈ R^p, and it decays away from µ with contours that are ellipsoids around µ, with their shape depending on Σ.)
Example 5.3. Multivariate t Distribution. The probability density function of the d-dimensional multivariate Student's t distribution is given by

f(x, Σ, ν) = (1/|Σ|^{1/2}) (1/√((νπ)^d)) (Γ((ν + d)/2)/Γ(ν/2)) [1 + x′Σ^{−1}x/ν]^{−(ν+d)/2},
where x is a 1 × d vector, Σ is a d × d symmetric, positive definite matrix, and ν is a
positive scalar. While it is possible to define the multivariate Student’s t for singular
Σ, the density cannot be written as above.
The multivariate Student’s t distribution is a generalization of the univariate Stu-
dent’s t to two or more variables. It is a distribution for random vectors of correlated
variables, each element of which has a univariate Student’s t distribution. In the same
way, as the univariate Student’s t distribution can be constructed by dividing a stan-
dard univariate normal random variable by the square root of a univariate chi-square
random variable, the multivariate Student’s t distribution can be constructed by di-
viding a multivariate normal random vector having zero mean and unit variances by
a univariate chi-square random variable. The multivariate Student’s t distribution is
parameterized with a correlation matrix, Σ, and a positive scalar degree of freedom
parameter, ν. ν is analogous to the degrees of freedom parameter of a univariate Stu-
dent’s t distribution. The off-diagonal elements of Σ contain the correlations between
variables. Note that when Σ is the identity matrix, variables are uncorrelated; however,
they are not independent.
The multivariate Student’s t distribution is often used as a substitute for the mul-
tivariate normal distribution in situations where it is known that the marginal distri-
butions of the individual variables have fatter tails than the normal.

6 Modes of Convergence

Definition 6.1. Let X_1, X_2, ... be a sequence of random variables. X_n converges almost surely (a.s.) to the random variable X, as n → ∞, if and only if

P({ω ∈ Ω : X_n(ω) → X(ω) as n → ∞}) = 1.

Notation: X_n →^{a.s.} X as n → ∞. Almost sure convergence is often referred to as strong convergence.

Definition 6.2. Let X_1, X_2, ... be a sequence of random variables. X_n converges in probability to the random variable X, as n → ∞, if and only if, for all ε > 0:

P(|X_n − X| > ε) → 0 as n → ∞.

Notation: X_n →^P X as n → ∞.

Recall now that the expectation (or the mean) of a continuous random variable X with probability density function f is defined as:

E[X] = ∫_{−∞}^{∞} x f(x) dx,

the n-th moment of the random variable X is defined as:

E[X^n] = ∫_{−∞}^{∞} x^n f(x) dx,

and the n-th absolute moment of the random variable X is defined as:

E|X|^n = ∫_{−∞}^{∞} |x|^n f(x) dx.

Definition 6.3. Let X_1, X_2, ... be a sequence of random variables. X_n converges in r-th mean to the random variable X, as n → ∞, if and only if

E|X_n − X|^r → 0 as n → ∞.

Notation: X_n →^r X as n → ∞.

Definition 6.4. Let X_1, X_2, ... be a sequence of random variables. X_n converges in distribution to the random variable X, as n → ∞, if and only if

F_{X_n}(x) → F_X(x) as n → ∞ for all x ∈ C(F_X),

where F_{X_n} and F_X are the cumulative distribution functions of X_n and X, respectively, and C(F_X) is the continuity set of F_X (that is, the points where F_X is continuous). Notation: X_n →^d X as n → ∞.

Theorem 6.1. (Slutsky's theorem). Let X_1, X_2, ... and Y_1, Y_2, ... be sequences of random variables. Suppose that

X_n →^d X and Y_n →^P a, as n → ∞,

where a is some constant. Then, as n → ∞,

X_n + Y_n →^d X + a,
X_n − Y_n →^d X − a,
X_n · Y_n →^d X · a,
X_n / Y_n →^d X / a, for a ≠ 0.

Theorem 6.2. (Convergence of sums of sequences of random variables).

1. Let X_1, X_2, ... and Y_1, Y_2, ... be sequences of random variables. Suppose that

   X_n →^{a.s.} X and Y_n →^{a.s.} Y, as n → ∞.

   Then, X_n + Y_n →^{a.s.} X + Y as n → ∞.

2. Let X_1, X_2, ... and Y_1, Y_2, ... be sequences of random variables. Suppose that

   X_n →^P X and Y_n →^P Y, as n → ∞.

   Then, X_n + Y_n →^P X + Y as n → ∞.

6.1 The Law of Large Numbers and the Central Limit Theorem

Definition 6.5. We say that two random variables X and Y are identically dis-
tributed if and only if P (X ≤ x) = P (Y ≤ x), for all x. If two variables are
independent and identically distributed, we say that they are “i.i.d.”.

Theorem 6.3. (The weak law of large numbers). Let X_1, X_2, ... be a sequence of i.i.d. random variables with finite mean µ, and set S_n = X_1 + X_2 + ... + X_n, n ≥ 1. Then

X̄_n = S_n / n →^P µ as n → ∞.

Alternatively, for any fixed ε > 0, as n → ∞,

P[|X̄_n − µ| > ε] → 0.

Corollary 6.1. Let h be a measurable function and X_1, ..., X_n be a sequence of i.i.d. random variables with distribution F. Suppose that E[h(X)] < ∞ for X ∼ F. Then, by the law of large numbers,

(1/n) ∑_{i=1}^{n} h(X_i) →^P E[h(X)] as n → ∞.

Theorem 6.4. (The strong law of large numbers). Let X_1, X_2, ... be a sequence of i.i.d. random variables with finite mean µ and finite variance, and set S_n = X_1 + X_2 + ... + X_n, n ≥ 1. Then

X̄_n = S_n / n →^{a.s.} µ as n → ∞.

Theorem 6.5. (The central limit theorem, univariate case). Let X_1, X_2, ... be a sequence of i.i.d. random variables with finite mean µ and finite variance σ², and set S_n = X_1 + X_2 + ... + X_n, n ≥ 1. Then

(S_n − nµ) / (σ√n) →^d N(0, 1) as n → ∞.

Alternatively, for any fixed x ∈ R, as n → ∞,

P[ √n (X̄_n − µ) / σ ≤ x ] → Φ(x),

where Φ is the CDF of the N(0, 1) distribution.
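A quick illustration (a hedged numpy sketch, not part of the original notes) of the CLT for i.i.d. Exponential(1) data, for which µ = σ = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 200_000

# Standardized sample means sqrt(n) * (X_bar - mu) / sigma, with mu = sigma = 1
X = rng.exponential(scale=1.0, size=(reps, n))
Z = np.sqrt(n) * (X.mean(axis=1) - 1.0) / 1.0

# Compare P(Z <= 1) with Phi(1), approximately 0.8413
print(np.mean(Z <= 1.0))
```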

Theorem 6.6. (The central limit theorem, multivariate case). Let X_1, X_2, ... be a sequence of i.i.d. p-dimensional random vectors, where X_i = (X_{i1}, ..., X_{ip}). Suppose that

E∥X_1∥² = E[X_{11}² + ... + X_{1p}²] < ∞,

and set X̄_n = (X_1 + X_2 + ... + X_n)/n. The central limit theorem asserts that

√n (X̄_n − E[X_1]) →^d N(0, Cov[X_1]) as n → ∞,

where Cov[X_1] is the p × p covariance matrix of the random vector X_1.

The LLN and CLT can be used as building blocks to understand other statistics, via the Continuous Mapping Theorem:

Theorem 6.7. (Continuous mapping theorem). Let X_1, X_2, ... be a sequence of random variables on R. Suppose that g : R → R is a continuous function (almost surely). Then,

X_n →^d X implies g(X_n) →^d g(X),
X_n →^P X implies g(X_n) →^P g(X),
X_n →^{a.s.} X implies g(X_n) →^{a.s.} g(X).

7 Order Statistics

Introduction: Let X_1, X_2, ..., X_n be a random sample of size n drawn from a population with distribution function F. If the observations X_1, X_2, ..., X_n are arranged in increasing order of magnitude, then the rearranged random variables X_(1) ≤ X_(2) ≤ ... ≤ X_(n) are called the order statistics of the sample. In the case of sampling from a continuous population we have X_(1) < X_(2) < ... < X_(n) with probability 1. Note that the order statistics X_(i) are dependent even though the original observations X_1, X_2, ..., X_n are independent. The r-th order statistic (the r-th smallest observation in the sample) is denoted by X_(r).

7.1 Exact sampling distribution of order statistic


(a) Distribution of X(1) : Let us consider a random sample X1 , X2 , . . . , Xn drawn
from a population having distribution function F (·). X(1) is the first order statistic,
the distribution function of X(1) is given by
FX(1) (x) = 1 − P X(1) > x
h i

= 1 − P [X1 > x, X2 > x, . . . ., Xn > x]


n
=1− P [Xi > x] [∵ X1 , . . . , Xn are independent ]
Y

i=1
= 1 − {P [X1 > x]}n
= 1 − (1 − F (x))n .
∴ The pdf of X(1) is given by.
fX(1) (x) = n(1 − F (x))n−1 f (x).

(b) Distribution of X_(n): The distribution function of X_(n) is given by

F_{X_(n)}(x) = P[X_(n) ≤ x]
            = P[X_1 ≤ x, X_2 ≤ x, ..., X_n ≤ x]
            = (P[X_1 ≤ x])^n   [since the X_i are i.i.d.]
            = [F(x)]^n.

Therefore, the pdf of X_(n) is

f_{X_(n)}(x) = n{F(x)}^{n−1} f(x).

(c) Distribution of X_(r), the general case: Let X_(1), X_(2), ..., X_(n) be the order statistics corresponding to the sample observations (X_1, X_2, ..., X_n) having joint p.d.f.

f_θ(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} f_θ(x_i).

If g(y_1, y_2, ..., y_n) is the p.d.f. of (X_(1), X_(2), ..., X_(n)), then

g(y_1, y_2, ..., y_n) = n! f(y_1, y_2, ..., y_n),   −∞ < y_1 < y_2 < ... < y_n < ∞.

Let F_r(y) be the distribution function of the r-th order statistic X_(r). Then,

F_r(y) = P[X_(r) ≤ y]
       = P[at least r of the n sample observations are ≤ y]
       = ∑_{t=r}^{n} (n choose t) [F(y)]^t [1 − F(y)]^{n−t},   where F is the d.f. of X,
       = 1 − ∑_{t=0}^{r−1} (n choose t) [F(y)]^t [1 − F(y)]^{n−t}
       = 1 − (1/B(n − r + 1, r)) ∫_0^{1−F(y)} z^{n−r} (1 − z)^{r−1} dz.

So the pdf of X_(r) is

f_r(y) = (d/dy) F_r(y) = (1/B(n − r + 1, r)) [F(y)]^{r−1} [1 − F(y)]^{n−r} f(y)
       = (n! / ((r − 1)!(n − r)!)) [F(y)]^{r−1} [1 − F(y)]^{n−r} f(y).

Particular cases: for r = 1, the p.d.f. of the minimum order statistic is f_1(y) = n[1 − F(y)]^{n−1} f(y); for r = n, the p.d.f. of the maximum order statistic is f_n(y) = n[F(y)]^{n−1} f(y).

1; 0 <

x<θ
Example 7.1. Let X ∼ R(0, θ) with pdf fθ (x) = θ .
0, ow

∴ The pdf of X(n) is


  n−1
n y
· 1
,0 < y < θ
fn (y) = 

θ θ
0 , otherwise
and the pdf of X(1) is
y n−1

n 1− ,0 < y < θ
h i
1
·
f1 (y) = 

θ θ
0 , otherwise
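A small check (a hedged numpy sketch, not part of the original notes): from f_n(y) above, the expected maximum of n i.i.d. R(0, θ) draws is E[X_(n)] = nθ/(n + 1).

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 10, 200_000

# Sample maximum of n iid Uniform(0, theta) draws, repeated many times
U = rng.uniform(0.0, theta, size=(reps, n))
Xn = U.max(axis=1)

print(Xn.mean(), n * theta / (n + 1))   # both close to 20/11, about 1.818
```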

Example 7.2. Let X ∼ Exp(θ, 1), with pdf

f_θ(x) = e^{−(x−θ)} if x > θ, and 0 otherwise.

The PDF of X_(1) is

f_1(y) = n [1 − ∫_θ^y e^{−(x−θ)} dx]^{n−1} e^{−(y−θ)} = n e^{−n(y−θ)} for θ < y < ∞, and 0 otherwise.

(d) Joint distribution of X_(1) and X_(n):

The joint distribution function of X_(1) and X_(n) is given by

F_{X_(1),X_(n)}(x, y) = P[X_(1) ≤ x, X_(n) ≤ y]
  = P[X_(n) ≤ y] − P[X_(1) > x, X_(n) ≤ y]
  = P[X_1, X_2, ..., X_n ≤ y] − P[x < X_1, X_2, ..., X_n ≤ y]
  = [F(y)]^n − [F(y) − F(x)]^n,   for x < y.

Therefore, the joint pdf of X_(1) and X_(n) is given by

f_{X_(1),X_(n)}(x, y) = (∂²/∂x∂y) F_{X_(1),X_(n)}(x, y) = n(n − 1)[F(y) − F(x)]^{n−2} f(x) f(y),   x < y.

(e) Joint pdf of X_(r) and X_(s), the general case:

For r < s, the joint pdf of X_(r) and X_(s) is given by

f_{X_(r),X_(s)}(x, y) = lim_{h↓0, k↓0} (1/(hk)) P[x − h/2 < X_(r) < x + h/2, y − k/2 < X_(s) < y + k/2]

= lim_{h↓0, k↓0} (1/(hk)) P[(r − 1) obs. < x − h/2, one obs. ∈ (x − h/2, x + h/2), (s − r − 1) obs. ∈ (x + h/2, y − k/2), one obs. ∈ (y − k/2, y + k/2), (n − s) obs. > y + k/2]

= lim_{h↓0, k↓0} (1/(hk)) (n!/((r − 1)!(s − r − 1)!(n − s)!)) {F(x − h/2)}^{r−1} {h f(x)} {F(y − k/2) − F(x + h/2)}^{s−r−1} {k f(y)} {1 − F(y + k/2)}^{n−s}

= (n!/((r − 1)!(s − r − 1)!(n − s)!)) {F(x)}^{r−1} {F(y) − F(x)}^{s−r−1} {1 − F(y)}^{n−s} f(x) f(y).

Sample Median & Sample Range: Let X_(1) ≤ X_(2) ≤ ... ≤ X_(n) denote the order statistics of a random sample X_1, X_2, ..., X_n from a density f(·). The sample median is defined to be the middle order statistic if n is odd and the average of the middle two order statistics if n is even. The sample range is defined to be X_(n) − X_(1), and the sample mid-range is defined to be (X_(n) + X_(1))/2.

Example 7.3. Let X_1, X_2, ..., X_n be i.i.d. random variables with common pdf

f(x) = 1 if 0 < x < 1, and 0 otherwise.

Find the distribution of the sample range.

Solution: The joint PDF of X_(1) and X_(n) is given by

f_{X_(1),X_(n)}(x, y) = n(n − 1)(F(y) − F(x))^{n−2} f(x) f(y),   0 < x < y < 1.

Now, F(y) = ∫_0^y 1 dx = y and, similarly, F(x) = x, so

f_{X_(1),X_(n)}(x, y) = n(n − 1)(y − x)^{n−2},   0 < x < y < 1.

Let us consider the transformation (X_(1), X_(n)) → (X_(1), R) with R = X_(n) − X_(1). The Jacobian is

|J| = |∂X_(n)/∂R| = 1.

Here, r = y − x ⇒ y = x + r; 0 < y < 1 ⇒ 0 < x < 1 − r, 0 < r < 1.
The joint PDF of X_(1) and R is given by

f_{X_(1),R}(x, r) = n(n − 1)r^{n−2},   0 < x < 1 − r, 0 < r < 1.

Therefore, the PDF of R is given by

f_R(r) = ∫_0^{1−r} n(n − 1)r^{n−2} dx = n(n − 1)r^{n−2}(1 − r),   0 < r < 1.

Hence the answer.
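A quick Monte Carlo check (a hedged numpy sketch, not part of the original notes): under f_R(r) = n(n − 1)r^{n−2}(1 − r), the mean range is E[R] = (n − 1)/(n + 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000

# Sample range of n iid Uniform(0, 1) draws
U = rng.uniform(size=(reps, n))
R = U.max(axis=1) - U.min(axis=1)

print(R.mean(), (n - 1) / (n + 1))   # both close to 9/11, about 0.818
```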

8 Additional tools

Higher moments can help us understand tail behavior, as seen in Markov's and Chebyshev's inequalities. The Markov bound gets better as we take higher-order moments.
• Markov's inequality: P[|X| ≥ α] ≤ (1/α^k) E[|X|^k], where α > 0.

• Chebyshev's inequality: Let m = E[X] and α > 0. Then

  P[|X − m| ≥ α] ≤ (1/α²) Var[X].

• Chernoff bound: Suppose X is a random variable whose moment generating function is M_X(t) and a ∈ R. Then

  P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ e^{−ta} M_X(t) for any t > 0.

• Jensen’s inequality. Let φ be a convex function on an interval containing the range


of X. Then,
φ(E[X]) ≤ E[φ(X)]
The opposite is true for concave distributions.
• Holder’s inequality. Let p > 1, q > 1 such that 1
p + 1
q =1
1 1
E∥XY ∥ ≤ E [|X|p ] p · E [|Y |q ] q .

The case p = q = 2 is known as the Cauchy-Schwarz inequality (show |rXY | ≤ 1!).


• The rank of a matrix M is the dimension of the vector space generated by its columns. This corresponds to the maximal number of linearly independent columns of M. The rank is commonly denoted rank(M) or Rank(M). The square root of a non-singular matrix M, denoted M^{1/2}, is a matrix that satisfies M = M^{1/2} M^{1/2}.
• Reparameterisation.

Definition 8.1. Let f(x; θ) be a pdf with parameters θ = (θ_1, ..., θ_p)^⊤ ∈ Θ ⊂ R^p, x ∈ D ⊂ R^n. A reparameterisation η = φ(θ) is a change of variables θ_j ↦ η_j, j = 1, ..., p, via a one-to-one function φ such that, for each θ ∈ Θ, there exists η ∈ φ(Θ) such that f(x; θ) = f(x; φ^{−1}(η)). Analogously, for each η ∈ φ(Θ), there exists θ ∈ Θ such that f(x; η) = f(x; φ(θ)).

The use of reparameterization is very common in statistics. For instance, the Exponential distribution is often parameterized in terms of the rate parameter λ or in terms of the mean β = 1/λ. Another example is the Normal distribution, which is often parameterized in terms of the mean µ and the standard deviation σ; or in terms of the mean µ and the variance σ²; or in terms of the mean µ and the precision τ = 1/σ². They are all equivalent, as there exists a one-to-one function between the different parameterizations.
• Indicator function.

Definition 8.2. The indicator function of the set A is the function I_A : X → {0, 1} defined as:

I_A(x) = 1 if x ∈ A, and I_A(x) = 0 if x ∉ A.

• Method of Lagrange's Multipliers: Suppose we wish to minimize or maximize a function of two variables z = f(x, y) where (x, y) is constrained to satisfy g(x, y) = 0. Assuming that these functions have continuous derivatives, we can visualize g(x, y) = 0 as a curve along with the level curves of z = f(x, y).

Intuitively, if we move the level curve in the direction of increasing z, the largest or smallest z occurs at a point where a level curve touches g(x, y) = 0. The gradients of f and g there point in the same or opposite direction, so ∇f = −λ∇g for some constant λ, i.e. ∇{f + λg} = 0.
Proof. On g(x, y) = 0, dy/dx = −g_x/g_y, and on a level curve f(x, y) = c, dy/dx = −f_x/f_y.

At the point of tangency, −f_x/f_y = dy/dx = −g_x/g_y, so f_x/g_x = f_y/g_y = −λ (say), and therefore (f_x, f_y) = −λ(g_x, g_y).

Hence, to find the maximum or minimum of f(x, y) subject to g(x, y) = 0, we find all the solutions of the equations

∇{f + λg} = 0 and g(x, y) = 0,

i.e. ∂F/∂x = 0 = ∂F/∂y and g(x, y) = 0, where F(x, y) = f(x, y) + λ g(x, y).

Local maxima and minima will be among the solutions. If the curve g(x, y) = 0 is closed and bounded, then the absolute maximum and minimum of f(x, y) exist and are among these solutions.
General Case: To maximize or minimize z = f(x_1, x_2, ..., x_n) subject to the constraints g_i(x_1, x_2, ..., x_n) = 0, i = 1, ..., k, solve the following equations simultaneously:

∇( f + ∑_{i=1}^{k} λ_i g_i ) = 0 and g_i(x_1, x_2, ..., x_n) = 0, i = 1, ..., k.

The numbers λ1 , λ2 , . . . , λk are called the Lagrange’s multipliers. The method for
finding the extrema of a function subject to some constraints is called the “method
of Lagrange’s Multipliers”.

Example 8.1. Maximize f(x, y) = x²y subject to x² + xy = 12.

We let F(x, y) = x²y + λ(x² + xy − 12), so that

0 = ∂F/∂x = 2xy + λ(2x + y)   (i),
0 = ∂F/∂y = x² + λx   (ii),
x² + xy = 12   (iii).

From (ii), x(x + λ) = 0 ⇒ x = −λ, as x = 0 is not a solution of x² + xy = 12.
From (i), −2λy + λ·2(−λ) + λy = 0 ⇒ −λy = 2λ² ⇒ y = −2λ.
From (iii), with x = −λ and y = −2λ, x² + xy = 3λ² = 12 gives λ = ±2.
Hence (x, y) = (−2, −4) or (2, 4), so max{x²y} = 16 and min{x²y} = −16.
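For reference (a hedged sympy sketch, not part of the original notes), the stationarity equations of Example 8.1 can be solved symbolically:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 * y
g = x**2 + x * y - 12

# Solve grad(f + lam*g) = 0 together with the constraint g = 0
F = f + lam * g
solutions = sp.solve([sp.diff(F, x), sp.diff(F, y), g], [x, y, lam], dict=True)

for s in solutions:
    print(s, 'f =', f.subs(s))   # (x, y) = (-2, -4) or (2, 4); f = -16 or 16
```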
