VIP Cheatsheet: Probability

Afshine Amidi and Shervine Amidi

September 8, 2020

Introduction to Probability and Combinatorics
❒ Sample space – The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.

❒ Event – Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.

❒ Axioms of probability – For each event E, we denote P(E) as the probability of event E occurring. By noting E_1, ..., E_n mutually exclusive events, we have the 3 following axioms:

(1) $0 \leqslant P(E) \leqslant 1$   (2) $P(S) = 1$   (3) $P\left(\bigcup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i)$
❒ Permutation – A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n, r), defined as:

$P(n, r) = \frac{n!}{(n - r)!}$

❒ Combination – A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n, r), defined as:

$C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!(n - r)!}$

Remark: we note that for $0 \leqslant r \leqslant n$, we have $P(n, r) \geqslant C(n, r)$.
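
Both counts can be sanity-checked directly in Python with the standard library; a minimal sketch, with n = 5 and r = 3 chosen purely as example values:

```python
from math import comb, factorial, perm

n, r = 5, 3  # arbitrary example values

# P(n, r) = n! / (n - r)!  -- ordered arrangements
p_formula = factorial(n) // factorial(n - r)
assert p_formula == perm(n, r) == 60

# C(n, r) = n! / (r! (n - r)!)  -- unordered selections
c_formula = factorial(n) // (factorial(r) * factorial(n - r))
assert c_formula == comb(n, r) == 10

# Remark: P(n, r) >= C(n, r) for 0 <= r <= n
assert all(perm(n, k) >= comb(n, k) for k in range(n + 1))
```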

Conditional Probability

❒ Bayes' rule – For events A and B such that P(B) > 0, we have:

$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

Remark: we have $P(A \cap B) = P(A)P(B|A) = P(A|B)P(B)$.

❒ Partition – Let $\{A_i, i \in [\![1,n]\!]\}$ be such that for all i, $A_i \neq \varnothing$. We say that $\{A_i\}$ is a partition if we have:

$\forall i \neq j,\ A_i \cap A_j = \varnothing \quad \textrm{and} \quad \bigcup_{i=1}^{n} A_i = S$

Remark: for any event B in the sample space, we have $P(B) = \sum_{i=1}^{n} P(B|A_i)P(A_i)$.

❒ Extended form of Bayes' rule – Let $\{A_i, i \in [\![1,n]\!]\}$ be a partition of the sample space. We have:

$P(A_k|B) = \frac{P(B|A_k)P(A_k)}{\sum_{i=1}^{n} P(B|A_i)P(A_i)}$

❒ Independence – Two events A and B are independent if and only if we have:

$P(A \cap B) = P(A)P(B)$
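
To make the extended form of Bayes' rule concrete, here is a small sketch for a hypothetical diagnostic-test scenario; the prevalence and error rates are made-up illustration numbers, not values from the cheatsheet:

```python
# Hypothetical numbers: P(D) = 0.01 (prevalence),
# P(+|D) = 0.95 (sensitivity), P(+|not D) = 0.05 (false positive rate).
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Law of total probability over the partition {D, not D}:
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Extended Bayes' rule: P(D|+) = P(+|D) P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # ~0.161: a positive test is far from conclusive here
```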

Random Variables

❒ Random variable – A random variable, often noted X, is a function that maps every element in a sample space to a real line.

❒ Cumulative distribution function (CDF) – The cumulative distribution function F, which is monotonically non-decreasing and is such that $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to +\infty} F(x) = 1$, is defined as:

$F(x) = P(X \leqslant x)$

Remark: we have $P(a < X \leqslant b) = F(b) - F(a)$.

❒ Probability density function (PDF) – The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.

❒ Relationships involving the PDF and CDF – Here are the important properties to know in the discrete (D) and the continuous (C) cases:

| Case | CDF F | PDF f | Properties of PDF |
| (D) | $F(x) = \sum_{x_i \leqslant x} P(X = x_i)$ | $f(x_j) = P(X = x_j)$ | $0 \leqslant f(x_j) \leqslant 1$ and $\sum_j f(x_j) = 1$ |
| (C) | $F(x) = \int_{-\infty}^{x} f(y)\,dy$ | $f(x) = \frac{dF}{dx}$ | $f(x) \geqslant 0$ and $\int_{-\infty}^{+\infty} f(x)\,dx = 1$ |

❒ Expectation and Moments of the Distribution – Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[X^k] and characteristic function ψ(ω) for the discrete and continuous cases:

| Case | E[X] | E[g(X)] | E[X^k] | ψ(ω) |
| (D) | $\sum_{i=1}^{n} x_i f(x_i)$ | $\sum_{i=1}^{n} g(x_i) f(x_i)$ | $\sum_{i=1}^{n} x_i^k f(x_i)$ | $\sum_{i=1}^{n} f(x_i) e^{i\omega x_i}$ |
| (C) | $\int_{-\infty}^{+\infty} x f(x)\,dx$ | $\int_{-\infty}^{+\infty} g(x) f(x)\,dx$ | $\int_{-\infty}^{+\infty} x^k f(x)\,dx$ | $\int_{-\infty}^{+\infty} f(x) e^{i\omega x}\,dx$ |
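
In the discrete case, every entry of the table above is a plain weighted sum; a minimal numpy sketch on an arbitrary made-up pmf:

```python
import numpy as np

# Arbitrary discrete pmf over the values 0, 1, 2, 3 (illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0])
f = np.array([0.1, 0.2, 0.3, 0.4])
assert np.isclose(f.sum(), 1.0)           # property of a PDF: probabilities sum to 1

e_x = np.sum(x * f)                       # E[X]
e_g = np.sum(np.exp(x) * f)               # E[g(X)] with g(x) = exp(x)
e_x2 = np.sum(x**2 * f)                   # second moment E[X^2]
omega = 0.5
psi = np.sum(f * np.exp(1j * omega * x))  # characteristic function psi(omega)

print(e_x, e_g, e_x2)                     # expectation, E[exp(X)], second moment
print(e_x2 - e_x**2)                      # Var(X) = E[X^2] - E[X]^2
print(psi)                                # complex-valued psi at omega = 0.5
```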
❒ Variance – The variance of a random variable, often noted Var(X) or σ², is a measure of the spread of its distribution function. It is determined as follows:

$\textrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$

❒ Standard deviation – The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:

$\sigma = \sqrt{\textrm{Var}(X)}$

❒ Transformation of random variables – Let the variables X and Y be linked by some function. By noting $f_X$ and $f_Y$ the distribution functions of X and Y respectively, we have:

$f_Y(y) = f_X(x)\left|\frac{dx}{dy}\right|$

❒ Leibniz integral rule – Let g be a function of x and potentially c, and a, b boundaries that may depend on c. We have:

$\frac{\partial}{\partial c}\left(\int_a^b g(x)\,dx\right) = \frac{\partial b}{\partial c}\cdot g(b) - \frac{\partial a}{\partial c}\cdot g(a) + \int_a^b \frac{\partial g}{\partial c}(x)\,dx$
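
The change-of-variable formula for transformed random variables above can be sanity-checked by simulation; a sketch assuming, purely as an example, X ∼ U(0, 1) and Y = X², so that f_Y(y) = 1/(2√y) on (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)
y = x**2                                    # Y = X^2

# Empirical density of Y vs. the theoretical f_Y(y) = 1 / (2 sqrt(y))
hist, edges = np.histogram(y, bins=20, range=(0.0, 1.0), density=True)
centers = (edges[:-1] + edges[1:]) / 2
theory = 1.0 / (2.0 * np.sqrt(centers))

# Skip the first bin, where f_Y blows up near y = 0 and bin averaging is coarse
print(np.max(np.abs(hist[1:] - theory[1:])))  # small: simulation matches the formula
```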
❒ Chebyshev's inequality – Let X be a random variable with expected value µ and standard deviation σ. For k, σ > 0, we have the following inequality:

$P(|X - \mu| \geqslant k\sigma) \leqslant \frac{1}{k^2}$
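
A quick empirical illustration of Chebyshev's inequality, assuming an exponential distribution purely as an example (the bound holds for any distribution with finite variance):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
x = rng.exponential(scale=1.0 / lam, size=1_000_000)  # E[X] = 1/lambda, Var(X) = 1/lambda^2

mu, sigma = 1.0 / lam, 1.0 / lam
for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(x - mu) >= k * sigma)  # estimate of P(|X - mu| >= k sigma)
    print(k, empirical, "<=", 1.0 / k**2)             # always below the Chebyshev bound 1/k^2
```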
Jointly Distributed Random Variables

❒ Conditional density – The conditional density of X with respect to Y, often noted $f_{X|Y}$, is defined as follows:

$f_{X|Y}(x) = \frac{f_{XY}(x,y)}{f_Y(y)}$

❒ Independence – Two random variables X and Y are said to be independent if we have:

$f_{XY}(x,y) = f_X(x) f_Y(y)$

❒ Marginal density and cumulative distribution – From the joint density probability function $f_{XY}$, we have:

| Case | Marginal density | Cumulative function |
| (D) | $f_X(x_i) = \sum_j f_{XY}(x_i, y_j)$ | $F_{XY}(x,y) = \sum_{x_i \leqslant x} \sum_{y_j \leqslant y} f_{XY}(x_i, y_j)$ |
| (C) | $f_X(x) = \int_{-\infty}^{+\infty} f_{XY}(x,y)\,dy$ | $F_{XY}(x,y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{XY}(x',y')\,dx'\,dy'$ |

❒ Covariance – The covariance of two random variables X and Y, noted $\sigma_{XY}^2$ or Cov(X,Y), is defined as follows:

$\textrm{Cov}(X,Y) \triangleq \sigma_{XY}^2 = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - \mu_X \mu_Y$

❒ Correlation – By noting $\sigma_X$, $\sigma_Y$ the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted $\rho_{XY}$, as follows:

$\rho_{XY} = \frac{\sigma_{XY}^2}{\sigma_X \sigma_Y}$

Remarks: for any X, Y, we have $\rho_{XY} \in [-1,1]$. If X and Y are independent, then $\rho_{XY} = 0$.

❒ Main distributions – Here are the main distributions to have in mind:

| Type | Distribution | PDF | ψ(ω) | E[X] | Var(X) |
| (D) | Binomial: $X \sim B(n, p)$, $x \in [\![0,n]\!]$ | $P(X = x) = \binom{n}{x} p^x q^{n-x}$ | $(pe^{i\omega} + q)^n$ | $np$ | $npq$ |
| (D) | Poisson: $X \sim \textrm{Po}(\mu)$, $x \in \mathbb{N}$ | $P(X = x) = \frac{\mu^x}{x!}e^{-\mu}$ | $e^{\mu(e^{i\omega} - 1)}$ | $\mu$ | $\mu$ |
| (C) | Uniform: $X \sim U(a, b)$, $x \in [a,b]$ | $f(x) = \frac{1}{b - a}$ | $\frac{e^{i\omega b} - e^{i\omega a}}{(b - a)i\omega}$ | $\frac{a + b}{2}$ | $\frac{(b - a)^2}{12}$ |
| (C) | Gaussian: $X \sim N(\mu, \sigma)$, $x \in \mathbb{R}$ | $f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$ | $e^{i\omega\mu - \frac{1}{2}\omega^2\sigma^2}$ | $\mu$ | $\sigma^2$ |
| (C) | Exponential: $X \sim \textrm{Exp}(\lambda)$, $x \in \mathbb{R}_+$ | $f(x) = \lambda e^{-\lambda x}$ | $\frac{1}{1 - \frac{i\omega}{\lambda}}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ |
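
Covariance and correlation as defined above can be estimated from samples with numpy; a small sketch on synthetic, made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)
y = 0.6 * x + 0.8 * rng.normal(size=10_000)        # correlated with x by construction

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # E[(X - mu_X)(Y - mu_Y)]
rho = cov_xy / (x.std() * y.std())                 # Cov(X,Y) / (sigma_X sigma_Y)

print(cov_xy, rho)
print(np.cov(x, y, bias=True)[0, 1], np.corrcoef(x, y)[0, 1])  # numpy's estimates agree
```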
VIP Cheatsheet: Statistics

Afshine Amidi and Shervine Amidi

September 8, 2020

Parameter estimation

❒ Random sample – A random sample is a collection of n random variables X_1, ..., X_n that are independent and identically distributed with X.

❒ Estimator – An estimator $\hat{\theta}$ is a function of the data that is used to infer the value of an unknown parameter θ in a statistical model.

❒ Bias – The bias of an estimator $\hat{\theta}$ is defined as being the difference between the expected value of the distribution of $\hat{\theta}$ and the true value, i.e.:

$\textrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$

Remark: an estimator is said to be unbiased when we have $E[\hat{\theta}] = \theta$.
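
A small simulation can make bias concrete. Assuming, purely as an example, X_i ∼ U(0, θ): the estimator $\hat{\theta} = \max_i X_i$ systematically underestimates θ, i.e. it is biased:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, trials = 1.0, 10, 100_000

samples = rng.uniform(0.0, theta, size=(trials, n))
theta_hat = samples.max(axis=1)   # estimator: maximum of the sample

bias = theta_hat.mean() - theta   # Bias(theta_hat) ~ E[theta_hat] - theta
print(bias)                       # close to -theta/(n+1) ~ -0.091, hence biased
```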
❒ Sample mean and variance – The sample mean and the sample variance of a random sample are used to estimate the true mean µ and the true variance σ² of a distribution, are noted $\overline{X}$ and $s^2$ respectively, and are such that:

$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \textrm{and} \quad s^2 = \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})^2$

❒ Central Limit Theorem – Let us have a random sample X_1, ..., X_n following a given distribution with mean µ and variance σ²; then we have:

$\overline{X} \underset{n \to +\infty}{\sim} N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$
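
A minimal sketch of the Central Limit Theorem in action, assuming an exponential population purely for illustration; the standardized sample mean behaves like a standard Gaussian once n is moderately large:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, trials = 1.0, 1.0, 50, 20_000   # Exp(1): mean 1, standard deviation 1

data = rng.exponential(scale=1.0, size=(trials, n))
xbar = data.mean(axis=1)                      # one sample mean per trial
s2 = data.var(axis=1, ddof=1)                 # unbiased sample variances, 1/(n-1)

z = (xbar - mu) / (sigma / np.sqrt(n))        # standardized sample mean
print(z.mean(), z.std())                      # ~0 and ~1, as N(0,1) predicts
print(np.mean(np.abs(z) <= 1.96))             # ~0.95, matching the Gaussian 95% band
```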

Confidence intervals

❒ Confidence level – A confidence interval $CI_{1-\alpha}$ with confidence level 1 − α of a true parameter θ is such that 1 − α of the time, the true value is contained in the confidence interval:

$P(\theta \in CI_{1-\alpha}) = 1 - \alpha$

❒ Confidence interval for the mean – When determining a confidence interval for the mean µ, different test statistics have to be computed depending on which case we are in. The following table sums it up:

| Distribution of $X_i$ | Sample size | σ | Statistic | 1 − α confidence interval |
| $X_i \sim N(\mu, \sigma)$ | any | known | $\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\right]$ |
| $X_i \sim N(\mu, \sigma)$ | small | unknown | $\frac{\overline{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}$ | $\left[\overline{X} - t_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}},\ \overline{X} + t_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}\right]$ |
| $X_i \sim$ any | large | known | $\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{n}}\right]$ |
| $X_i \sim$ any | large | unknown | $\frac{\overline{X} - \mu}{s/\sqrt{n}} \sim N(0,1)$ | $\left[\overline{X} - z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}},\ \overline{X} + z_{\frac{\alpha}{2}}\frac{s}{\sqrt{n}}\right]$ |
| $X_i \sim$ any | small | any | Go home! | Go home! |

❒ Confidence interval for the variance – The single-line table below sums up the test statistic to compute when determining the confidence interval for the variance:

| Distribution | Sample size | µ | Statistic | 1 − α confidence interval |
| $X_i \sim N(\mu, \sigma)$ | any | any | $\frac{s^2(n-1)}{\sigma^2} \sim \chi^2_{n-1}$ | $\left[\frac{s^2(n-1)}{\chi^2_2},\ \frac{s^2(n-1)}{\chi^2_1}\right]$ |
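
A sketch of the two intervals above on a small synthetic normal sample: the t-based interval for µ (σ unknown) and the χ²-based interval for σ², with scipy supplying the quantiles:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(loc=10.0, scale=2.0, size=15)       # small synthetic sample, n = 15
n, alpha = len(x), 0.05
xbar, s2 = x.mean(), x.var(ddof=1)

# Mean: (Xbar - mu) / (s / sqrt(n)) ~ t_{n-1}
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
half = t_crit * np.sqrt(s2 / n)
print("mu CI:", (xbar - half, xbar + half))

# Variance: s^2 (n-1) / sigma^2 ~ chi2_{n-1}
chi2_lo = stats.chi2.ppf(alpha / 2, df=n - 1)      # lower quantile (chi2_1)
chi2_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)  # upper quantile (chi2_2)
print("sigma^2 CI:", (s2 * (n - 1) / chi2_hi, s2 * (n - 1) / chi2_lo))
```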

Hypothesis testing

❒ Errors – In a hypothesis test, we note α and β the type I and type II errors respectively. By noting T the test statistic and R the rejection region, we have:

$\alpha = P(T \in R \,|\, H_0 \textrm{ true}) \quad \textrm{and} \quad \beta = P(T \notin R \,|\, H_1 \textrm{ true})$

❒ p-value – In a hypothesis test, the p-value is the probability under the null hypothesis of having a test statistic T at least as extreme as the one that we observed $T_0$. We have:

| Case | Left-sided | Right-sided | Two-sided |
| p-value | $P(T \leqslant T_0 \,|\, H_0 \textrm{ true})$ | $P(T \geqslant T_0 \,|\, H_0 \textrm{ true})$ | $P(|T| \geqslant |T_0| \,|\, H_0 \textrm{ true})$ |

❒ Sign test – The sign test is a non-parametric test used to determine whether the median of a sample is equal to the hypothesized median. By noting V the number of samples falling to the right of the hypothesized median, we have:

| Statistic when np < 5 | Statistic when np ≥ 5 |
| $V \underset{H_0}{\sim} B\left(n, p = \frac{1}{2}\right)$ | $Z = \frac{V - \frac{n}{2}}{\frac{\sqrt{n}}{2}} \underset{H_0}{\sim} N(0,1)$ |

❒ Testing for the difference in two means – The table below sums up the test statistic to compute when performing a hypothesis test where the null hypothesis is:

$H_0: \mu_X - \mu_Y = \delta$

| Distribution of $X_i$, $Y_i$ | $n_X$, $n_Y$ | $\sigma_X^2$, $\sigma_Y^2$ | Statistic |
| any | any | known | $\frac{(\overline{X} - \overline{Y}) - \delta}{\sqrt{\frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y}}} \underset{H_0}{\sim} N(0,1)$ |
| Normal | large | unknown | $\frac{(\overline{X} - \overline{Y}) - \delta}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} \underset{H_0}{\sim} N(0,1)$ |
| Normal | small | unknown, $\sigma_X = \sigma_Y$ | $\frac{(\overline{X} - \overline{Y}) - \delta}{s\sqrt{\frac{1}{n_X} + \frac{1}{n_Y}}} \underset{H_0}{\sim} t_{n_X + n_Y - 2}$ |
| Normal, paired ($D_i = X_i - Y_i$, $n_X = n_Y$) | any | unknown | $\frac{\overline{D} - \delta}{\frac{s_D}{\sqrt{n}}} \underset{H_0}{\sim} t_{n-1}$ |
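
The third row of the table (small normal samples, equal unknown variances) corresponds to the classical pooled two-sample t-test; a sketch on synthetic data, cross-checked against scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(5.0, 1.0, size=12)   # small synthetic samples
y = rng.normal(4.2, 1.0, size=10)

nx, ny = len(x), len(y)
sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)  # pooled variance
t_stat = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))            # delta = 0 under H0
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=nx + ny - 2))                 # two-sided p-value

print(t_stat, p_value)
print(stats.ttest_ind(x, y, equal_var=True))  # scipy's pooled t-test gives the same numbers
```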
❒ χ² goodness of fit test – By noting k the number of bins, n the total number of samples, $p_i$ the probability of success in each bin and $Y_i$ the associated number of samples, we can use the test statistic T defined below to test whether or not there is a good fit. If $np_i \geqslant 5$, we have:

$T = \sum_{i=1}^{k} \frac{(Y_i - np_i)^2}{np_i} \underset{H_0}{\sim} \chi^2_{df} \quad \textrm{with} \quad df = (k - 1) - \#(\textrm{estimated parameters})$

❒ Test for arbitrary trends – Given a sequence, the test for arbitrary trends is a non-parametric test whose aim is to determine whether the data suggest the presence of an increasing trend:

$H_0: \textrm{no trend} \quad \textrm{versus} \quad H_1: \textrm{there is an increasing trend}$

If we note x the number of transpositions in the sequence, the p-value is computed as:

$\textrm{p-value} = P(T \leqslant x)$

❒ Sum of squared errors – In the simple linear regression setting where each response $Y_i$ is fitted by $\hat{Y}_i = A + Bx_i$, we define the sum of squared errors, also known as SSE, as follows:

$\textrm{SSE} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left(Y_i - (A + Bx_i)\right)^2 = S_{YY} - BS_{XY}$

❒ Least-squares estimates – When estimating the coefficients α, β with the least-squares method, which is done by minimizing the SSE, we obtain the estimates A, B defined as follows:

$A = \overline{Y} - \frac{S_{XY}}{S_{XX}}\overline{x} \quad \textrm{and} \quad B = \frac{S_{XY}}{S_{XX}}$

❒ Key results – When σ is unknown, this parameter is estimated by the unbiased estimator $s^2$ defined as follows:

$s^2 = \frac{S_{YY} - BS_{XY}}{n - 2} \quad \textrm{and we have} \quad \frac{s^2(n-2)}{\sigma^2} \sim \chi^2_{n-2}$

The table below sums up the properties surrounding the least-squares estimates A, B when σ is known or not:

| Coeff | σ | Statistic | 1 − α confidence interval |
| A | known | $\frac{A - \alpha}{\sigma\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}}} \sim N(0,1)$ | $\left[A - z_{\frac{\alpha}{2}}\sigma\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}},\ A + z_{\frac{\alpha}{2}}\sigma\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}}\right]$ |
| A | unknown | $\frac{A - \alpha}{s\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}}} \sim t_{n-2}$ | $\left[A - t_{\frac{\alpha}{2}}s\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}},\ A + t_{\frac{\alpha}{2}}s\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{S_{XX}}}\right]$ |
| B | known | $\frac{B - \beta}{\frac{\sigma}{\sqrt{S_{XX}}}} \sim N(0,1)$ | $\left[B - z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{S_{XX}}},\ B + z_{\frac{\alpha}{2}}\frac{\sigma}{\sqrt{S_{XX}}}\right]$ |
| B | unknown | $\frac{B - \beta}{\frac{s}{\sqrt{S_{XX}}}} \sim t_{n-2}$ | $\left[B - t_{\frac{\alpha}{2}}\frac{s}{\sqrt{S_{XX}}},\ B + t_{\frac{\alpha}{2}}\frac{s}{\sqrt{S_{XX}}}\right]$ |
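
A minimal sketch of the least-squares estimates A and B computed from the $S_{XX}$ and $S_{XY}$ sums, on synthetic data, cross-checked against numpy's polyfit:

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)  # synthetic Y_i = alpha + beta x_i + noise

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean()) ** 2)

B = sxy / sxx                    # slope estimate B = S_XY / S_XX
A = y.mean() - B * x.mean()      # intercept estimate A = Ybar - B * xbar
sse = syy - B * sxy              # SSE = S_YY - B S_XY
s2 = sse / (len(x) - 2)          # unbiased estimate of sigma^2

print(A, B, s2)
print(np.polyfit(x, y, deg=1))   # [slope, intercept] from numpy agrees with (B, A)
```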