SI Chapter-1
1 Random Variables
Definition 1.1. A probability space is the triplet (Ω, F, P ), where Ω is the sample
space, F is the collection of events (σ-algebra), and P is a probability measure
defined over F, with P (Ω) = 1.
A discrete random variable X can take a finite or countably infinite number of possible
values. We use discrete random variables to model categorical data (for example, which
presidential candidate a voter supports) and count data (for example, how many cups
of coffee a graduate student drinks in a day). The distribution of X is specified by its
probability mass function (PMF):
P[X ∈ A] = Σ_{x ∈ A} f_X(x).
A continuous random variable X takes values in a continuum, and the probability that it equals any single value is zero. Instead, the distribution of X is specified by its probability density function (PDF) f_X(x), which satisfies, for any set A ⊆ R,
P[X ∈ A] = ∫_A f_X(x) dx.
In both cases, when it is clear which random variable is being referred to, we will
simply write f (x) for fX (x).
For any random variable X and real-valued function g, the expectation or mean
of g(X) is its “average value”. If X is discrete with PMF fX (x), then
E[g(X)] = Σ_x g(x) f_X(x),
where the sum is over all possible values of X. If X is continuous with PDF f_X(x), then
E[g(X)] = ∫_R g(x) f_X(x) dx.
The expectation is linear: For any random variables X1 , . . . , Xn (not necessarily inde-
pendent) and any c ∈ R
E [X1 + . . . + Xn ] = E [X1 ] + . . . + E [Xn ] , E[cX] = cE[X]
If X1 , . . . , Xn are independent, then
E [X1 . . . Xn ] = E [X1 ] . . . E [Xn ]
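As a quick numerical illustration of these definitions (a minimal Python sketch using numpy/scipy; the Binomial and Exponential choices below are assumed examples, not taken from the text), E[g(X)] is a weighted sum of g(x) in the discrete case and a weighted integral in the continuous case:

# Minimal numerical sketch of E[g(X)] (assumed examples: Binomial(10, 0.3) and an
# Exponential density with rate 2; any g can be plugged in the same way).
import numpy as np
from scipy import stats, integrate

g = lambda x: x**2                      # the function whose mean we want

# Discrete case: E[g(X)] = sum over x of g(x) f_X(x)
n, p = 10, 0.3
xs = np.arange(n + 1)
E_discrete = np.sum(g(xs) * stats.binom.pmf(xs, n, p))

# Continuous case: E[g(X)] = integral of g(x) f_X(x) dx
lam = 2.0
E_continuous, _ = integrate.quad(lambda x: g(x) * lam * np.exp(-lam * x), 0, np.inf)

print(E_discrete)    # np(1-p) + (np)^2 = 2.1 + 9 = 11.1
print(E_continuous)  # 2 / lam^2 = 0.5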
The variance of X is defined by the two equivalent expressions
Var[X] = E[(X − E[X])²] = E[X²] − (E[X])².
The cumulative distribution function (CDF) of X is F_X(x) = P[X ≤ x], given by Σ_{y: y ≤ x} f_X(y) in the discrete case and ∫_{−∞}^{x} f_X(y) dy in the continuous case.
By definition, F_X is monotonically increasing: F_X(x) ≤ F_X(y) if x < y. If F_X is continuous and strictly increasing, meaning F_X(x) < F_X(y) for all x < y, then F_X has an inverse function F_X^{−1} : (0, 1) → R called the quantile function: for any t ∈ (0, 1), F_X^{−1}(t) is the t-th quantile of the distribution of X, that is, the value below which X falls with probability exactly t.
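In scientific Python the quantile function F_X^{−1} is exposed as the ppf ("percent point function") method of scipy.stats distributions. A minimal sketch for the standard normal (an assumed example):

# Quantile function of N(0, 1): scipy calls F^{-1} the "ppf".
from scipy import stats

t = 0.975
q = stats.norm.ppf(t)        # F^{-1}(0.975), roughly 1.96
print(q)
print(stats.norm.cdf(q))     # applying F recovers 0.975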
Definition 1.4. Two events A and B are independent if and only if the probability
of their intersection equals the product of their individual probabilities, that is
P (A ∩ B) = P (A)P (B).
Definition 1.5. Given two events A and B, with P (B) > 0, the conditional
probability of A given B, denoted P (A | B), is defined by the relation
P(A | B) = P(A ∩ B)/P(B) = P(A, B)/P(B).
In connection with these definitions, the following result holds. Let {C_j : j = 1, . . . , n} be a partition of Ω, that is, Ω = ∪_{j=1}^{n} C_j and C_i ∩ C_k = ∅ for i ≠ k. Let also A be an event. The Law of Total Probability states that
P(A) = Σ_{j=1}^{n} P(A | C_j) P(C_j).
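A minimal numerical sketch of the Law of Total Probability, on an assumed toy partition with three events:

# Law of Total Probability on an assumed partition {C_1, C_2, C_3}.
import numpy as np

P_C = np.array([0.5, 0.3, 0.2])          # P(C_j); sums to 1 since the C_j partition Omega
P_A_given_C = np.array([0.1, 0.4, 0.8])  # P(A | C_j)

P_A = np.sum(P_A_given_C * P_C)          # P(A) = sum_j P(A | C_j) P(C_j)
print(P_A)                               # 0.05 + 0.12 + 0.16 = 0.33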
Definition 1.6. Let X and Y have a joint discrete distribution. For P(X = x) > 0, the conditional probability mass function of Y given that X = x is
p_{Y|X=x}(y) = P(Y = y | X = x) = P(X = x, Y = y)/P(X = x),
and the conditional cumulative distribution function of Y given that X = x is
F_{Y|X=x}(y) = Σ_{z ≤ y} p_{Y|X=x}(z).
Definition 1.7. Let X and Y have a joint continuous distribution. For fX (x) >
0, the conditional density function of Y given that X = x is
f_{Y|X=x}(y) = f_{X,Y}(x, y)/f_X(x),
where fX,Y is the joint probability density function of X and Y , and fX is the
marginal probability density function of X. The conditional cumulative distribu-
tion function of Y given that X = x is
F_{Y|X=x}(y) = ∫_{−∞}^{y} f_{Y|X=x}(z) dz.
Remark 1. The law of total probability. Let X and Y have a joint continuous
distribution. Suppose that fX (x) > 0, and let fY |X=x (y) be the conditional density
function of Y given that X = x. The law of total probability states that
f_Y(y) = ∫_{−∞}^{∞} f_{Y|X=x}(y) f_X(x) dx.
Example 1.1. The Bernoulli distribution. A Bernoulli random variable X takes the value 1 ("success") with probability p and the value 0 ("failure") with probability 1 − p, so that E[X] = p and Var[X] = p(1 − p). A more particular example is the case where X is the outcome observed from tossing a fair coin once. In this case P(X = heads) = P(X = tails) = 1/2.
Bernoulli random variables are used in many contexts, and they are often referred
to as Bernoulli trials. A Bernoulli trial is an experiment with two, and only two,
possible outcomes. Parameters: 0 ≤ p ≤ 1.
Example 1.2. The Binomial distribution. A Binomial random variable X is the total
number of successes in n Bernoulli trials. Consequently, the range of X is the set
{0, 1, 2, . . . , n}. The probability of each outcome is given by:
P(X = x) = (n choose x) p^x (1 − p)^{n−x},
where (n choose x) = n!/(x!(n − x)!) is the Binomial coefficient (also known as a combination). If a
random variable X has Binomial distribution, it is denoted as X ∼ Binomial(n, p),
where n is the number of trials and p is the probability of success.
The mean and variance of a Binomial random variable are E[X] = np and Var[X] =
np(1 − p).
A more particular example is the case where X is the number of heads in n = 10
fair coin tosses. Then, for x = 0, 1, . . . , 10,
P(X = x) = (10 choose x) (1/2)^{10}.
Parameters: n ∈ Z+ and 0 ≤ p ≤ 1.
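The coin-toss example can be reproduced with scipy.stats.binom (a minimal sketch; the only inputs are n = 10 and p = 1/2 from the example above):

# X = number of heads in 10 fair coin tosses, X ~ Binomial(10, 1/2).
import numpy as np
from scipy import stats

n, p = 10, 0.5
xs = np.arange(n + 1)
pmf = stats.binom.pmf(xs, n, p)          # P(X = x) = (10 choose x) (1/2)^10

print(pmf[5])                            # 252/1024, about 0.2461
print(stats.binom.mean(n, p))            # np = 5
print(stats.binom.var(n, p))             # np(1-p) = 2.5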
Example 1.3. The Poisson distribution. A random variable X has a Poisson dis-
tribution if it takes values in the non-negative integers and its distribution is given
by:
P(X = x; λ) = (λ^x / x!) e^{−λ},   x = 0, 1, . . .
It is possible to show that E[X] = Var[X] = λ. Parameters: λ > 0.
Example 1.4. The Multinomial distribution. The Multinomial distribution is a gener-
alization of the binomial distribution. For n independent trials, each of which produces
an outcome (success) in one of k ≥ 2 categories, where each category has a given fixed
success probability θi , i = 1, . . . , k, the Multinomial distribution gives the probability of
any particular combination of numbers of successes for the various categories. Thus,
the pmf is
p(x1, . . . , xk; θ1, . . . , θk) = (n!/(x1! · · · xk!)) θ1^{x1} · · · θk^{xk},
where the parameters satisfy θi ≥ 0 for i = 1, . . . , k and Σ_{i=1}^{k} θi = 1.
There are many other discrete probability distributions of practical interest, for
example, the hypergeometric distribution, and the discrete uniform distribution, among
others.
We now turn to continuous random variables. The function f appearing in P(X ∈ A) = ∫_A f(x) dx is called the probability density function of X. This definition can be used to link the probability density function and the cumulative distribution function F as follows:
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt.
Definition 1.9. A random variable X with cdf F has the characteristic function
φ_X(t) = E[e^{itX}] = ∫_{−∞}^{∞} e^{itx} dF(x).
Definition 1.10. The Beta function and Gamma function are special functions
defined as follows:
B(a, b) = ∫_0^1 t^{a−1} (1 − t)^{b−1} dt,
Γ(z) = ∫_0^∞ t^{z−1} e^{−t} dt,
where a, b > 0.
The uniform distribution is a special case of the Beta distribution for the case
a = b = 1.
Example 1.6. Normal or Gaussian Distribution. The probability density function of
a Gaussian random variable X ∈ R is:
f(x; µ, σ) = (1/(√(2π) σ)) exp{−(x − µ)²/(2σ²)},
where −∞ < µ < ∞ and σ > 0 are parameters of this density function. In fact,
E[X] = µ and Var[X] = σ 2 . If a random variable X has normal distribution with mean
µ and variance σ 2 , we will denote it X ∼ N (µ, σ 2 ). This is one of the most popular
distributions in applications and it appears in a number of statistical and probability
models. Parameters: −∞ < µ < ∞ and σ > 0.
The Normal distribution has many interesting properties. One of them is that it
is closed under summation, meaning that the sum of normal random variables is nor-
mally distributed. That is, let X1, . . . , Xn be i.i.d. random variables with distribution N(µ, σ²), and let Y = Σ_{j=1}^{n} Xj and Z = (1/n) Σ_{j=1}^{n} Xj. Then Y ∼ N(nµ, nσ²) and Z ∼ N(µ, σ²/n).
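A short simulation sketch of this closure property (the values of µ, σ, and n below are assumed, chosen only for illustration):

# Sum and average of iid N(mu, sigma^2) draws: Y should look like N(n*mu, n*sigma^2)
# and Z like N(mu, sigma^2/n).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 100_000

X = rng.normal(mu, sigma, size=(reps, n))
Y, Z = X.sum(axis=1), X.mean(axis=1)

print(Y.mean(), Y.var())   # close to n*mu = 20 and n*sigma^2 = 90
print(Z.mean(), Z.var())   # close to mu = 2 and sigma^2/n = 0.9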
Example 1.7. The Logistic distribution. The probability density function of a logistic
random variable X ∈ R is:
f(x; µ, σ) = exp{−(x − µ)/σ} / (σ [1 + exp{−(x − µ)/σ}]²),
where −∞ < µ < ∞ and σ > 0 are location and scale parameters, respectively. The mean and variance of X are given by E[X] = µ and Var[X] = π²σ²/3. The cdf of a logistic random variable is
F(x; µ, σ) = exp{(x − µ)/σ} / (1 + exp{(x − µ)/σ}).
Parameters: −∞ < µ < ∞ and σ > 0. This distribution is very popular in practice
as well. In particular, the so-called “logistic regression model” is based on this dis-
tribution, as well as some Machine Learning algorithms (logistic sigmoidal activation
function in neural networks).
Example 1.8. The Exponential distribution. The probability density function of an exponential random variable X ≥ 0 is:
f(x; λ) = λ exp{−λx},
where λ > 0 is a rate parameter. The mean and variance are given by E[X] = 1/λ and Var[X] = 1/λ². Parameters: λ > 0. This distribution is widely used in engineering
and the analysis of survival times.
Example 1.9. The Gamma distribution. The probability density function of a Gamma
random variable X > 0 is:
f(x; κ, θ) = (1/(Γ(κ) θ^κ)) x^{κ−1} exp{−x/θ},
where κ > 0 is a shape parameter, θ > 0 is a scale parameter, and Γ(z) = ∫_0^∞ s^{z−1} e^{−s} ds is the Gamma function (for positive integers n, Γ(n) = (n − 1)!). Parameters: κ > 0 and θ > 0. The mean and variance of X are given by E[X] = κθ and Var[X] = κθ².
This distribution is widely used in engineering and the analysis of survival times.
In the continuous case, the joint distribution of random variables X1, . . . , Xk is specified by a joint PDF f_{X1,...,Xk}(x1, . . . , xk), which satisfies, for any set A ⊆ R^k,
P[(X1, . . . , Xk) ∈ A] = ∫_A f_{X1,...,Xk}(x1, . . . , xk) dx1 . . . dxk.
When it is clear which random variables are being referred to, we will simply write
f (x1 , . . . , xk ) for fX1 ,...,Xk (x1 , . . . , xk ).
Example 1.10. (X1 , . . . , Xk ) have a multinomial distribution,
if these random variables take nonnegative integer values summing to n, with joint
PMF
f(x1, . . . , xk) = (n!/(x1! · · · xk!)) p1^{x1} p2^{x2} . . . pk^{xk}.
Example 1.11. The Dirichlet distribution is a multivariate distribution over the sim-
plex {(x1, . . . , xk) : Σ_{i=1}^{k} xi = 1, xi ≥ 0}. Its probability density function is
p(x1, · · · , xk; α1, · · · , αk) = (1/B(α)) Π_{i=1}^{k} xi^{αi − 1},
where B(α) = Π_{i=1}^{k} Γ(αi) / Γ(Σ_{i=1}^{k} αi), with Γ(a) being the Gamma function, and α = (α1, · · · , αk) are the parameters of this distribution.
As a distribution putting probability over k categories, the Dirichlet distribution is very popular in the social sciences and in linguistic analysis.
The Dirichlet distribution is often used as a prior distribution for the multinomial
parameter p1 , · · · , pk in Bayesian inference.
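A minimal sampling sketch (the parameter vector α below is an assumed example): Dirichlet draws lie on the simplex and have mean α_i / Σ_j α_j, which is one reason it pairs naturally with the multinomial likelihood in Bayesian models.

# Draws from Dirichlet(alpha): nonnegative entries summing to 1, mean alpha / sum(alpha).
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
samples = rng.dirichlet(alpha, size=50_000)

print(samples.sum(axis=1)[:3])   # each row sums to 1
print(samples.mean(axis=0))      # close to alpha / alpha.sum() = [0.2, 0.3, 0.5]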
The covariance between two random variables X and Y is defined by the two
equivalent expressions
Cov[X, Y ] = E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ].
So Cov[X, X] = Var[X], and Cov[X, Y ] = 0 if X and Y are independent. The covari-
ance is bilinear: For any constants a1 , . . . , ak , b1 , . . . , bm ∈ R and any random variables
X1 , . . . , Xk and Y1 , . . . , Ym (not necessarily independent),
Cov[a1X1 + . . . + akXk, b1Y1 + . . . + bmYm] = Σ_{i=1}^{k} Σ_{j=1}^{m} ai bj Cov[Xi, Yj].
For any a, b > 0, we have Cov[aX, bY ] = ab Cov[X, Y ]. On the other hand, the
correlation is invariant to rescaling: corr(aX, bY ) = corr(X, Y ), and satisfies always
−1 ≤ corr(X, Y ) ≤ 1.
The moment generating function (MGF) of X is defined as M_X(t) = E[e^{tX}], t ∈ R. Depending on the random variable X, M_X(t) might be infinite for some values of t. Here are two examples:
Example 1.12. (Normal MGF). Suppose X ∼ N (0, 1). Then
M_X(t) = E[e^{tX}] = ∫ e^{tx} (1/√(2π)) e^{−x²/2} dx = ∫ (1/√(2π)) e^{(−x² + 2tx)/2} dx.
To compute this integral, we complete the square:
∫ (1/√(2π)) e^{(−x² + 2tx)/2} dx = ∫ (1/√(2π)) e^{(−x² + 2tx − t²)/2 + t²/2} dx = e^{t²/2} ∫ (1/√(2π)) e^{−(x − t)²/2} dx.
The quantity inside the last integral above is the PDF of the N(t, 1) distribution, hence it must integrate to 1. Then M_X(t) = e^{t²/2}.
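A Monte Carlo sanity check of this formula (a sketch; the sample size and values of t are arbitrary): the empirical mean of e^{tX} over standard normal draws should be close to e^{t²/2}.

# Compare a Monte Carlo estimate of M_X(t) = E[e^{tX}] with e^{t^2/2} for X ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

for t in (0.5, 1.0, 1.5):
    print(t, np.exp(t * x).mean(), np.exp(t**2 / 2))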
The second example concerns the Gamma distribution with shape α and rate β (i.e. scale θ = 1/β), whose PDF is (β^α/Γ(α)) x^{α−1} e^{−βx} for x > 0, so that M_X(t) = ∫_0^∞ e^{tx} (β^α/Γ(α)) x^{α−1} e^{−βx} dx. For t < β, let us rewrite the above to isolate the PDF of the Gamma(α, β − t) distribution:
M_X(t) = (β^α/(β − t)^α) ∫_0^∞ ((β − t)^α/Γ(α)) x^{α−1} e^{−(β−t)x} dx.
As the PDF of the Gamma(α, β − t) distribution integrates to 1, we obtain finally
M_X(t) = ∞ for t ≥ β,   and   M_X(t) = β^α/(β − t)^α = (1 − β^{−1}t)^{−α} for t < β.
Theorem 1.1. Let X and Y be two random variables such that, for some h > 0
and every t ∈ (−h, h), both MX (t) and MY (t) are finite and MX (t) = MY (t).
Then X and Y have the same distribution.
The reason why the MGF will be useful for us is because if X1 , . . . , Xn are inde-
pendent, then the MGF of their sum satisfies
M_{X1+...+Xn}(t) = E[e^{t(X1+...+Xn)}] = E[e^{tX1}] × . . . × E[e^{tXn}] = M_{X1}(t) . . . M_{Xn}(t).
This gives us a very simple tool to understand the distributions of sums of independent
random variables.
2 Sampling Distributions
A statistic is any quantity computed from the data X1, . . . , Xn. For example, the sample mean X̄ = (1/n) Σ_{i=1}^{n} Xi, the sample variance S² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)², and the range
R = max(X1, . . . , Xn) − min(X1, . . . , Xn)
are all statistics. Since the data X1 , . . . , Xn are realizations of random variables, a
statistic is also a (realization of a) random variable. A major use of probability in
this course will be to understand the distribution of a statistic, called its sampling
distribution, based on the distribution of the original data X1 , . . . , Xn . Let’s work
through some examples:
Example 2.1. (Sample mean of IID normals). Suppose X1, . . . , Xn are i.i.d. N(µ, σ²).
The sample mean X̄ is actually a special case of the quantity a1 X1 + . . . + an Xn from
Example 5.1, where ai = n1 , µi = µ, and σi2 = σ 2 for all i = 1, . . . , n. Then from that
Example,
X̄ ∼ N(µ, σ²/n).
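A simulation sketch of this sampling distribution (the values µ = 1, σ = 2, n = 25 are assumed, for illustration only):

# The sample mean of n iid N(mu, sigma^2) draws should behave like N(mu, sigma^2/n).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 25, 200_000

xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print(xbar.mean(), xbar.var())   # close to mu = 1 and sigma^2/n = 0.16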
Example 2.2. (Chi-squared distribution). Suppose X1, . . . , Xn are i.i.d. N(0, 1). Let's derive the distribution of the statistic X1² + . . . + Xn². For a single term,
M_{Xi²}(t) = E[e^{tXi²}] = ∫ e^{tx²} (1/√(2π)) e^{−x²/2} dx = ∫ (1/√(2π)) e^{(t − 1/2)x²} dx.
For t < 1/2, the quantity inside this integral is proportional to the PDF of the N(0, 1/(1 − 2t)) distribution, and the integral equals (1 − 2t)^{−1/2}; for t ≥ 1/2 the integral is infinite. The distribution of Xi² is also called the chi-squared distribution with 1 degree of freedom, denoted χ²_1.
Going back to the sum,
M_{X1²+...+Xn²}(t) = M_{X1²}(t) × . . . × M_{Xn²}(t) = ∞ if t ≥ 1/2,   and   (1 − 2t)^{−n/2} if t < 1/2.
This is the MGF of the chi-squared distribution with n degrees of freedom, denoted χ²_n. Differentiating it at t = 0 gives the moments:
E[χ²_n] = d/dt (1 − 2t)^{−n/2} |_{t=0} = n,   E[(χ²_n)²] = d²/dt² (1 − 2t)^{−n/2} |_{t=0} = n(n + 2),
∴ Var[χ²_n] = n(n + 2) − n² = 2n.
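A simulation sketch confirming the mean n, the variance 2n, and the fact that the sum of squared standard normals follows the χ²_n distribution (n = 5 is an assumed choice):

# Sum of n squared standard normals vs. the chi-squared distribution with n df.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 5, 200_000

s = (rng.standard_normal(size=(reps, n)) ** 2).sum(axis=1)
print(s.mean(), s.var())                                 # close to n = 5 and 2n = 10
print(stats.kstest(s, stats.chi2(df=n).cdf).statistic)   # small KS distance from chi2_n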
Example 2.3. The Chi-square distribution. The probability density function of the chi-square (χ²_n) distribution with n degrees of freedom is
f(x; n) = x^{n/2 − 1} e^{−x/2} / (2^{n/2} Γ(n/2)),   x > 0.
The mean of the chi-square distribution is n and the variance is 2n. Parameters:
n > 0.
Example 2.4. Student’s t-distribution. Student’s t-distribution has pdf
f(x) = (Γ((ν + 1)/2) / (σ √(νπ) Γ(ν/2))) [1 + (1/ν) ((x − µ)/σ)²]^{−(ν+1)/2},
where µ ∈ R is a location parameter, σ > 0 is a scale parameter, and ν > 0 is the degrees of freedom.
Relationship with other distributions:
• Let X1, . . . , Xn be i.i.d. random variables with distribution N(0, 1). Let Y = Σ_{j=1}^{n} Xj²; then Y ∼ χ²_n.
• If X1, . . . , Xn are i.i.d. N(µ, σ²) with sample mean X̄, then Σ_{i=1}^{n} (Xi − X̄)²/σ² ∼ χ²_{n−1}; the sum of squares has (n − 1) degrees of freedom since, out of the n variables, one restriction Σ_{i=1}^{n} (Xi − X̄) = 0 holds.
K(x; µ, σ) = exp{−(x − µ)²/(2σ²)}.
Types of parameters
The parameters of a distribution are classified into three types: location parameters,
scale parameters, and shape parameters.
4 Bivariate Normal Distribution
Let (X, Y) have the bivariate normal distribution with means µ1, µ2, variances σ1², σ2², and correlation ρ. Note that the quadratic form in the exponent of the joint PDF satisfies
(1/(1 − ρ²)) [((x − µ1)/σ1)² − 2ρ ((x − µ1)/σ1)((y − µ2)/σ2) + ((y − µ2)/σ2)²]
= {y − µ2 − ρ(σ2/σ1)(x − µ1)}² / (σ2²(1 − ρ²)) + (x − µ1)²/σ1².
Hence, writing σ²_{2·1} = σ2²(1 − ρ²) for the conditional variance,
Y | X = x ∼ N(µ2 + ρ(σ2/σ1)(x − µ1), σ2²(1 − ρ²)).
Similarly, it can be shown that X | Y = y ∼ N(µ1 + ρ(σ1/σ2)(y − µ2), σ1²(1 − ρ²)).
Remark 3. 1. Note that E(Y | X = x) = µ2 + ρ(σ2/σ1)(x − µ1) and Var(Y | X = x) = σ2²(1 − ρ²). Hence, the regression of Y on X is linear and the conditional distribution is homoscedastic.
2. E(XY) = E[E(XY | X)] = E[X · E(Y | X)]
= E[X {µ2 + ρ(σ2/σ1)(X − µ1)}]
= µ1 µ2 + ρ(σ2/σ1) σ1²   [∵ E{X(X − µ1)} = E(X − µ1)² = σ1²]
⇒ E(XY) = µ1 µ2 + ρ σ1 σ2 ⇒ (E(XY) − µ1 µ2)/(σ1 σ2) = ρ ⇒ ρ_XY = ρ.
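A simulation sketch of the conditional distribution Y | X = x (all parameter values and the conditioning window below are assumed choices): among draws whose X falls near x, the Y values should have mean µ2 + ρ(σ2/σ1)(x − µ1) and variance σ2²(1 − ρ²).

# Conditioning a bivariate normal sample on X being close to x0.
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, s1, s2, rho = 1.0, -1.0, 2.0, 3.0, 0.6
cov = [[s1**2, rho * s1 * s2], [rho * s1 * s2, s2**2]]

X, Y = rng.multivariate_normal([mu1, mu2], cov, size=2_000_000).T
x0 = 2.0
Y_cond = Y[np.abs(X - x0) < 0.02]    # crude conditioning on X ~ x0

print(Y_cond.mean(), mu2 + rho * (s2 / s1) * (x0 - mu1))   # both close to -0.1
print(Y_cond.var(), s2**2 * (1 - rho**2))                  # both close to 5.76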
Definition 5.2. (X1 , . . . , Xp ) have a multivariate normal distribution if, for every
choice of constants a1 , . . . , ap ∈ R, the linear combination a1 X1 + . . . + ap Xp has
a (univariate) normal distribution. (X1 , . . . , Xp ) have the specific multivariate
normal distribution N(µ, Σ) when, in addition, E[Xj] = µj for every j and Cov(Xi, Xj) = Σij for every pair i, j.
For independent Xj ∼ N(µj, σj²), the MGF of a1X1 + . . . + apXp is that of a N(a1µ1 + . . . + apµp, a1²σ1² + . . . + ap²σp²) random variable, so such linear combinations are indeed normally distributed. Now suppose each Yj is defined by
Yj = aj1X1 + . . . + ajpXp
for some constants aj1, . . . , ajp ∈ R. Then any linear combination of (Y1, . . . , Ym)
is also a linear combination of (X1 , . . . , Xp ), and hence is normally distributed. So
(Y1 , . . . , Ym ) also have a multivariate normal distribution.
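A minimal numerical sketch of this closure property (the mean vector, covariance matrix, and weights a below are assumed examples): any linear combination a1X1 + . . . + apXp of a multivariate normal vector should be univariate normal with mean a′µ and variance a′Σa.

# Linear combination of a multivariate normal vector is univariate normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
a = np.array([1.0, -2.0, 0.5])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
lin = X @ a

print(lin.mean(), a @ mu)          # means agree
print(lin.var(), a @ Sigma @ a)    # variances agree
print(stats.kstest(lin, stats.norm(a @ mu, np.sqrt(a @ Sigma @ a)).cdf).statistic)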
For two arbitrary random variables X and Y , if they are independent, then
corr(X, Y ) = 0. The converse is in general not true: X and Y can be uncorrelated with-
out being independent. But this converse is true in the special case of the multivariate
normal distribution; more generally, we have the following:
To visualize what the joint PDF of the multivariate normal distribution looks like,
let’s just consider the two-dimensional setting k = 2, where we obtain the special case
of a Bivariate Normal distribution for two random variables X, Y . In this case, the
distribution is specified by the means µ1 and µ2 of X and Y, the variances σ1² and σ2² of X and Y, and the correlation ρ between X and Y. When σ1² = σ2² = 1 and µ1 = µ2 = 0, the contours of the joint PDF of X and Y are circles for ρ = 0 and become ellipses tilted along the diagonal for ρ = 0.75.
The probability density function of the d-dimensional multivariate Student's t distribution is
f(x; Σ, ν) = (Γ((ν + d)/2) / (|Σ|^{1/2} √((νπ)^d) Γ(ν/2))) [1 + (x Σ^{−1} x′)/ν]^{−(ν+d)/2},
where x is a 1 × d vector, Σ is a d × d symmetric, positive definite matrix, and ν is a
positive scalar. While it is possible to define the multivariate Student’s t for singular
Σ, the density cannot be written as above.
The multivariate Student’s t distribution is a generalization of the univariate Stu-
dent’s t to two or more variables. It is a distribution for random vectors of correlated
variables, each element of which has a univariate Student’s t distribution. In the same
way, as the univariate Student’s t distribution can be constructed by dividing a stan-
dard univariate normal random variable by the square root of a univariate chi-square
random variable, the multivariate Student’s t distribution can be constructed by di-
viding a multivariate normal random vector having zero mean and unit variances by
a univariate chi-square random variable. The multivariate Student’s t distribution is
parameterized with a correlation matrix, Σ, and a positive scalar degree of freedom
parameter, ν. ν is analogous to the degrees of freedom parameter of a univariate Stu-
dent’s t distribution. The off-diagonal elements of Σ contain the correlations between
variables. Note that when Σ is the identity matrix, variables are uncorrelated; however,
they are not independent.
The multivariate Student’s t distribution is often used as a substitute for the mul-
tivariate normal distribution in situations where it is known that the marginal distri-
butions of the individual variables have fatter tails than the normal.
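A sketch of this construction in code (ν, Σ, and the sample size are assumed choices): divide a zero-mean multivariate normal draw by the square root of an independent χ²_ν variable over ν.

# Multivariate t_nu draws built from multivariate normal and chi-squared draws.
import numpy as np

rng = np.random.default_rng(0)
nu, reps = 5, 200_000
Sigma = np.array([[1.0, 0.7], [0.7, 1.0]])      # a correlation matrix

Z = rng.multivariate_normal(np.zeros(2), Sigma, size=reps)
W = rng.chisquare(nu, size=reps)
T = Z / np.sqrt(W / nu)[:, None]

print(np.corrcoef(T.T))              # off-diagonal close to 0.7
print(T.var(axis=0), nu / (nu - 2))  # marginal variances close to nu/(nu-2)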
6 Modes of Convergence
Recall now that the expectation (or the mean) of a continuous random variable X
with probability density function f is defined as:
E[X] = ∫_{−∞}^{∞} x f(x) dx,
the n-th moment of the random variable X is defined as:
E[X^n] = ∫_{−∞}^{∞} x^n f(x) dx,
and the n-th absolute moment of the random variable X is defined as:
E|X|^n = ∫_{−∞}^{∞} |x|^n f(x) dx.
We say that a sequence of random variables Xn converges in distribution to X if F_{Xn}(x) → F_X(x) as n → ∞ for every x ∈ C(F_X), where F_{Xn} and F_X are the cumulative distribution functions of Xn and X, respectively, and C(F_X) is the continuity set of F_X (that is, the points where F_X is continuous). Notation: Xn →d X as n → ∞.
Theorem 6.2. Convergence of sums of sequences of random variables. Suppose Xn → X and Yn → Y almost surely as n → ∞. Then
Xn + Yn → X + Y almost surely, as n → ∞.
Similarly, if Xn → X and Yn → Y in probability as n → ∞, then
Xn + Yn → X + Y in probability, as n → ∞.
6.1 The Law of Large Numbers and the Central Limit Theorem
Definition 6.5. We say that two random variables X and Y are identically dis-
tributed if and only if P (X ≤ x) = P (Y ≤ x), for all x. If two variables are
independent and identically distributed, we say that they are “i.i.d.”.
The weak law of large numbers states that if X1, X2, . . . are i.i.d. random variables with finite mean µ and X̄n = (X1 + · · · + Xn)/n, then for every ε > 0, as n → ∞,
P[|X̄n − µ| > ε] → 0.
Theorem 6.4. The strong law of large numbers. Let X1 , X2 , . . . be a sequence
of i.i.d. random variables with finite mean µ and finite variance, and set Sn =
X1 + X2 + · · · + Xn , n ≥ 1. Then
X̄n = Sn/n → µ almost surely, as n → ∞.
The central limit theorem states that if X1, X2, . . . are i.i.d. random variables with mean µ and finite variance σ² > 0, then for every x, as n → ∞,
P[√n (X̄n − µ)/σ ≤ x] → Φ(x),
where Φ denotes the cdf of the standard normal distribution.
The LLN and CLT can be used as building blocks to understand other statistics, via the Continuous Mapping Theorem: if Xn → X (almost surely, in probability, or in distribution) and g is a continuous function, then g(Xn) → g(X) in the same mode of convergence.
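A simulation sketch of the CLT (Exponential(1) is an assumed, deliberately non-normal choice; n and the number of replications are arbitrary):

# Standardized means of iid Exponential(1) draws approach the standard normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 100, 100_000
mu = sigma = 1.0                                   # Exponential(1): mean 1, sd 1

xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - mu) / sigma

print(z.mean(), z.var())                           # close to 0 and 1
print(stats.kstest(z, stats.norm.cdf).statistic)   # small distance from Phi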
7 Order Statistics
Let X1, . . . , Xn be i.i.d. random variables with cdf F and pdf f, and let X(1) ≤ X(2) ≤ · · · ≤ X(n) denote the corresponding order statistics.
(a) Distribution of X(1) = min(X1, . . . , Xn):
F_{X(1)}(x) = P[X(1) ≤ x] = 1 − P[X1 > x, . . . , Xn > x]
= 1 − Π_{i=1}^{n} P[Xi > x]   [∵ Xi's are i.i.d.]
= 1 − {P[X1 > x]}^n
= 1 − (1 − F(x))^n.
∴ The pdf of X(1) is given by
f_{X(1)}(x) = n(1 − F(x))^{n−1} f(x).
(b) Distribution of X(n) = max(X1, . . . , Xn):
F_{X(n)}(x) = P[X(n) ≤ x] = P[X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x]
= (P[X1 ≤ x])^n   [∵ Xi's are i.i.d.]
= [F(x)]^n.
∴ The pdf of X(n) is given by
f_{X(n)}(x) = n{F(x)}^{n−1} f(x).
(c) Distribution of X(r), the general case: Let X(1), X(2), · · · , X(n) be the order statistics of the sample. If g(y1, y2, . . . , yn) is the joint p.d.f. of (X(1), X(2), · · · , X(n)), then
g(y1, y2, . . . , yn) = n! Π_{i=1}^{n} f(yi),   y1 < y2 < · · · < yn.
Let F_r(y) be the distribution function of the r-th order statistic X(r). Then,
F_r(y) = P[X(r) ≤ y]
= P[at least r of the n sample observations are ≤ y]
= Σ_{t=r}^{n} (n choose t) [F(y)]^t [1 − F(y)]^{n−t},   where F is the d.f. of X,
= 1 − Σ_{t=0}^{r−1} (n choose t) [F(y)]^t [1 − F(y)]^{n−t}
= 1 − (1/β(n − r + 1, r)) ∫_0^{1−F(y)} z^{n−r} (1 − z)^{r−1} dz.
So the pdf of X(r) is
f_r(y) = d/dy F_r(y) = (1/β(n − r + 1, r)) [1 − F(y)]^{n−r} [F(y)]^{r−1} f(y)
= (n!/((r − 1)!(n − r)!)) [F(y)]^{r−1} [1 − F(y)]^{n−r} f(y).
Particular cases: For r = 1, the p.d.f. of the minimum order statistic is f_1(y) = n[1 − F(y)]^{n−1} f(y); for r = n, the p.d.f. of the maximum order statistic is f_n(y) = n[F(y)]^{n−1} f(y).
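A simulation sketch of the maximum-order-statistic density (Uniform(0, 1) samples and n = 5 are assumed choices, for which n F(x)^{n−1} f(x) reduces to n x^{n−1}):

# Empirical density of the maximum of n Uniform(0, 1) draws vs. n x^(n-1).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000

max_stat = rng.uniform(0, 1, size=(reps, n)).max(axis=1)
hist, edges = np.histogram(max_stat, bins=20, range=(0, 1), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])

print(np.abs(hist - n * mid ** (n - 1)).max())   # small deviation from n x^(n-1)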
Example 7.1. Let X ∼ R(0, θ), the uniform distribution on (0, θ), with pdf f_θ(x) = 1/θ for 0 < x < θ, and 0 otherwise.
For a sample from the shifted exponential distribution with pdf f(x) = e^{−(x−θ)} for x > θ, the PDF of X(1) is
f_1(y) = n [1 − ∫_θ^{y} e^{−(x−θ)} dx]^{n−1} e^{−(y−θ)} = n e^{−n(y−θ)},   θ < y < ∞,
and 0 otherwise.
f_{X(1),X(n)}(x, y) = ∂²/(∂x∂y) F_{X(1),X(n)}(x, y) = n(n − 1)[F(y) − F(x)]^{n−2} f(x) f(y).
For r < s, the joint pdf of X(r) and X(s) can be obtained as
f_{X(r),X(s)}(x, y) = lim_{h↓0, k↓0} (1/hk) P[x − h/2 < X(r) < x + h/2,  y − k/2 < X(s) < y + k/2]
= lim_{h↓0, k↓0} (1/hk) P[(r − 1) obs. < x − h/2,  one obs. ∈ (x − h/2, x + h/2),  (s − r − 1) obs. ∈ (x + h/2, y − k/2),  one obs. ∈ (y − k/2, y + k/2),  (n − s) obs. > y + k/2]
= lim_{h↓0, k↓0} (n!/((r − 1)!(s − r − 1)!(n − s)!)) (1/hk) {F(x − h/2)}^{r−1} · h f(x) · {F(y − k/2) − F(x + h/2)}^{s−r−1} · k f(y) · {1 − F(y + k/2)}^{n−s}
= (n!/((r − 1)!(s − r − 1)!(n − s)!)) {F(x)}^{r−1} {F(y) − F(x)}^{s−r−1} {1 − F(y)}^{n−s} f(x) f(y).
Sample Median & Sample Range: Let X(1) ≤ X(2) ≤ · · · ≤ X(n) denote the
order statistics of a random sample X1 , X2 , . . . , Xn from a density f (·). The sample
median is defined to be the middle order statistic if n is odd and the average of the
middle two order statistics if n is even. The sample range is defined to be X(n) − X(1), and the sample mid-range is defined to be {X(n) + X(1)}/2.
For a random sample from the Uniform(0, 1) distribution, the joint pdf of the minimum and maximum is
f_{X(1),X(n)}(x, y) = n(n − 1)(F(y) − F(x))^{n−2},   0 < x < y < 1.
Now, F(y) = ∫_0^{y} 1 dx = y and similarly F(x) = x, so
f_{X(1),X(n)}(x, y) = n(n − 1)(y − x)^{n−2},   0 < x < y < 1.
Let us consider the following transformation: (X(1), X(n)) → (X(1), R), where R = X(n) − X(1). The Jacobian of the transformation is
|J| = |∂X(n)/∂R| = 1.
Here, r = y − x ⇒ y = x + r; 0 < y < 1 ⇒ 0 < x < 1 − r, 0 < r < 1.
The joint PDF of X(1) and R is therefore given by
f_{X(1),R}(x, r) = n(n − 1) r^{n−2},   0 < x < 1 − r,  0 < r < 1.
8 Additional tools
Higher moments can help us understand tail behavior, as seen in Markov's and Chebyshev's inequalities; these bounds typically get better as we take higher-order moments.
• Markov's inequality: P[|X| ≥ α] ≤ E[|X|^k]/α^k, where α > 0 and k ≥ 1.
• Chebyshev's inequality: P[|X − E[X]| ≥ α] ≤ Var[X]/α², where α > 0.
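A numerical sketch of both inequalities on an assumed Exponential(1) example (any distribution with the required moments would do):

# Markov's and Chebyshev's inequalities checked by simulation.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(1.0, size=1_000_000)
alpha = 5.0

print(np.mean(np.abs(x) >= alpha))                # true tail probability, about e^-5
for k in (1, 2, 3):
    print(k, np.mean(np.abs(x) ** k) / alpha**k)  # Markov bound; tighter as k grows here
print(np.mean(np.abs(x - 1.0) >= alpha), x.var() / alpha**2)   # Chebyshev bound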
For example, the Exponential distribution can be parameterized in terms of the rate λ or in terms of the mean β = 1/λ. Another example is the Normal distribution, which is often parameterized in terms of the mean µ and the standard deviation σ; or in terms of the mean µ and the variance σ²; or in terms of the mean µ and the precision τ = 1/σ². They are all equivalent, as there exists a one-to-one function between the different parameterizations.
• Indicator function: the indicator of a set A is defined as 1_A(x) = 1 if x ∈ A and 0 otherwise.
Method of Lagrange multipliers. Suppose we wish to maximize or minimize z = f(x, y) subject to the constraint g(x, y) = 0. Intuitively, if we move the level curve of f in the direction of increasing z, the largest or smallest z occurs at a point where a level curve touches g(x, y) = 0. At such a point the gradients of f and g point in the same or opposite direction, so ∇f = −λ∇g for some constant λ, i.e. ∇{f + λg} = 0.
Proof. Along g(x, y) = 0, dy/dx = −g_x/g_y, and along a level curve f(x, y) = c, dy/dx = −f_x/f_y. At the point of tangency,
−f_x/f_y = dy/dx = −g_x/g_y ⇒ f_x/g_x = f_y/g_y = −λ (say),
∴ (f_x, f_y) = −λ (g_x, g_y).
Hence, to find the maximum or minimum of f(x, y) subject to g(x, y) = 0, we find all the solutions of the equations
∇{f + λg} = 0 and g(x, y) = 0.
Local maxima and minima will be among the solutions. If the curve g(x, y) = 0 is closed and bounded, then the absolute maximum and minimum of f(x, y) exist and are among these solutions.
General case: To maximize or minimize z = f(x1, x2, . . . , xn) subject to the constraints g_i(x1, x2, . . . , xn) = 0, i = 1, . . . , k, solve the following equations simultaneously:
∇{f + Σ_{i=1}^{k} λ_i g_i} = 0 and g_i(x1, x2, . . . , xn) = 0, i = 1, . . . , k.
The numbers λ1, λ2, . . . , λk are called the Lagrange multipliers. The method for finding the extrema of a function subject to constraints is called the "method of Lagrange multipliers".
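A minimal symbolic sketch of the method (the objective f(x, y) = xy and constraint g(x, y) = x + y − 1 = 0 are assumed examples): solve ∇{f + λg} = 0 together with g = 0.

# Lagrange multiplier system for maximizing f = x*y subject to x + y = 1.
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x * y
g = x + y - 1

L = f + lam * g
solutions = sp.solve([sp.diff(L, x), sp.diff(L, y), g], [x, y, lam], dict=True)
print(solutions)   # [{x: 1/2, y: 1/2, lam: -1/2}]; the constrained maximum is f = 1/4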