1. Statistical models involve a sample space E containing the possible outcomes of a random variable X, and a family of probability distributions {Pθ} parameterized by θ. The goal is to estimate the true parameter θ from observations.
2. To construct a confidence interval for the mean µ of a distribution, the central limit theorem can be used: as the sample size n increases, the distribution of the sample mean X̄ approaches a normal distribution, which allows constructing a 1 − α confidence interval for µ.
3. The likelihood ratio test compares the likelihood of the data under the null hypothesis H0 to that under the alternative hypothesis H1 using a test statistic; if the test statistic exceeds a threshold, the null hypothesis is rejected.


Capstone-Cheatsheet Statistics 1
by Blechturm, Page 1 of 4

1 Statistical models
(E, {Pθ}θ∈Θ)
E is a sample space for X, i.e. a set that contains all possible outcomes of X.
{Pθ}θ∈Θ is a family of probability distributions on E.
Θ is a parameter set, i.e. a set consisting of some possible values of θ.
θ is the true parameter and is unknown.
In a parametric model we assume that Θ ⊂ ℝ^d for some d ≥ 1.

1.1 Identifiability
θ ≠ θ′ ⇒ Pθ ≠ Pθ′
Pθ = Pθ′ ⇒ θ = θ′
A model is well specified if: ∃θ s.t. P = Pθ.

2 Estimators
A statistic is any measurable function calculated from the data (X̄n, max(Xi), etc.).
An estimator θ̂n of θ is any statistic which does not depend on θ.
Estimators are random variables, since they depend on the data (= realizations of random variables).
An estimator θ̂n is weakly consistent if limn→∞ θ̂n = θ in probability, i.e. θ̂n →(P) θ as n → ∞. If the convergence is almost sure, the estimator is strongly consistent.
Asymptotic normality of an estimator:
  √n (θ̂n − θ) →(d) N(0, σ²) as n → ∞
σ² is called the asymptotic variance of the estimator θ̂n. In the case of the sample mean it is the same as the variance of a single Xi.
If the estimator is a function of the sample mean, the Delta method is needed to compute the asymptotic variance. Asymptotic variance ≠ variance of an estimator.
Bias of an estimator: Bias(θ̂n) = E[θ̂n] − θ
Quadratic risk of an estimator: R(θ̂n) = E[(θ̂n − θ)²] = Bias² + Variance

3 LLN and CLT
Let X1, ..., Xn ~iid Pµ, where E(Xi) = µ and Var(Xi) = σ² for all i = 1, 2, ..., n, and let X̄n = (1/n) Σᵢ Xi.
Law of large numbers:
  X̄n →(P, a.s.) µ as n → ∞
  (1/n) Σᵢ g(Xi) →(P, a.s.) E[g(X)] as n → ∞
Central Limit Theorem for the mean:
  √n (X̄n − µ)/√(σ²) →(d) N(0, 1)
  √n (X̄n − µ) →(d) N(0, σ²)
Central Limit Theorem for sums:
  Σᵢ₌₁ⁿ Xi →(d) N(nµ, nσ²)
  (Σᵢ Xi − nµ)/√(nσ²) →(d) N(0, 1)
Variance of the mean: Var(X̄n) = (1/n)² Var(X1 + X2 + ... + Xn) = σ²/n
Expectation of the mean: E[X̄n] = (1/n) E[X1 + X2 + ... + Xn] = µ

4 Quantiles of a Distribution
Let α ∈ (0, 1). The quantile of order 1 − α of a random variable X is the number qα such that:
  P(X ≤ qα) = 1 − α
  P(X ≥ qα) = α
  FX(qα) = 1 − α
  FX⁻¹(1 − α) = qα
If the distribution is standard normal, X ~ N(0, 1): P(|X| > q_{α/2}) = α, i.e. 2(1 − Φ(q_{α/2})) = α.
Use standardization if a Gaussian X ~ N(µ, σ²) with unknown mean and variance is involved, to get the quantiles from Z-tables (standard normal tables):
  P(X ≤ t) = P(Z ≤ (t − µ)/σ) = Φ((t − µ)/σ)
Pivot: Z = (X − µ)/σ ~ N(0, 1)

5 Confidence intervals
Confidence intervals follow the form:
  (statistic) ± (critical value)·(estimated standard deviation of the statistic)
Let (E, (Pθ)θ∈Θ) be a statistical model based on observations X1, ..., Xn and assume Θ ⊆ ℝ. Let α ∈ (0, 1).
Non-asymptotic confidence interval of level 1 − α for θ: any random interval I, depending on the sample X1, ..., Xn but not on θ, such that:
  Pθ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ
Confidence interval of asymptotic level 1 − α for θ: any random interval I whose boundaries do not depend on θ and such that:
  limn→∞ Pθ[I ∋ θ] ≥ 1 − α, ∀θ ∈ Θ

5.1 Two-sided asymptotic CI
Let X1, ..., Xn = X̃ ~iid Pθ. A two-sided CI is a function of X̃ giving an upper and a lower bound between which the estimated parameter lies, I = [l(X̃), u(X̃)], with a certain probability P(θ ∈ I) ≥ 1 − α and conversely P(θ ∉ I) ≤ α.
Since the estimator is a r.v. depending on X̃, it has a variance Var(θ̂n) and a mean E[θ̂n]. Since the CLT is valid for every distribution, standardizing the distribution and massaging the expression yields an asymptotic CI:
  I = [θ̂n − q_{α/2} √(Var(Xi))/√n , θ̂n + q_{α/2} √(Var(Xi))/√n]
This expression depends on the real variance Var(Xi) of the r.v.s, so the variance has to be estimated. Three possible methods: plugin (use the sample mean or empirical variance), solve (solve the quadratic inequality), conservative (use the theoretical maximum of the variance). (A numeric sketch of the plugin construction follows at the end of this page.)

5.2 Sample Mean and Sample Variance
Let X1, ..., Xn ~iid Pµ, where E(Xi) = µ and Var(Xi) = σ² for all i = 1, 2, ..., n.
Sample mean: X̄n = (1/n) Σᵢ Xi
Sample variance: Sn = (1/n) Σᵢ (Xi − X̄n)² = (1/n) Σᵢ Xi² − X̄n²
Unbiased estimator of the sample variance: S̃n = (1/(n−1)) Σᵢ (Xi − X̄n)² = n/(n−1) · Sn

5.3 Delta Method
Used to find the asymptotic CI if the estimator is a function of the mean. The goal is to find an expression that converges, via a function of the mean, using the CLT.
Let Zn be a sequence of r.v.s with √n (Zn − θ) →(d) N(0, σ²) and let g: ℝ → ℝ be continuously differentiable at θ. Then:
  √n (g(Zn) − g(θ)) →(d) N(0, g′(θ)² σ²)
Example: let X1, ..., Xn ~ Exp(λ), λ > 0, and let X̄n = (1/n) Σᵢ Xi denote the sample mean. By the CLT, √n (X̄n − 1/λ) →(d) N(0, σ²) for some value of σ² that depends on λ. If we set g: ℝ → ℝ, x ↦ 1/x, then by the Delta method:
  √n (g(X̄n) − g(1/λ)) →(d) N(0, g′(E[X])² Var(X)) = N(0, g′(1/λ)² · 1/λ²) = N(0, λ²)
Pivot: let Tn be a function of the random samples X1, ..., Xn and θ. If the distribution of Tn is the same for all θ, then Tn is called a pivotal quantity or a pivot.
Example: let X be a random variable with mean µ and variance σ², and let X1, ..., Xn be iid samples of X. Then (X̄n − µ)/σ is a pivot, with θ = [µ, σ²]ᵀ being the parameter vector (not the same set of parameters that we use to define a statistical model).

6 Asymptotic Hypothesis tests
Two hypotheses (Θ0 a set disjoint from Θ1):
  H0: θ ∈ Θ0
  H1: θ ∈ Θ1
The goal is to reject H0 using a test statistic. A hypothesis test has the form
  ψ = 1{Tn ≥ c}
for some test statistic Tn and threshold c ∈ ℝ. The threshold c is usually q_{α/2}. If asymptotic, create a normalized Tn using the parameters from H0, then use Tn to get to probabilities.
Rejection region: Rψ = {Tn > c}
Two-sided test (H1: θ ≠ θ0): symmetric about zero, with an acceptance-region interval:
  ψ = 1{|Tn| − c > 0}, i.e. 1(|Tn| > q_{α/2})
One-sided tests:
  H1: θ > θ0: 1(Tn > qα)
  H1: θ < θ0: 1(Tn < −qα)
Type 1 error: the test rejects the null hypothesis (ψ = 1) although it is actually true (H0 = TRUE). Its probability αψ(θ) is also known as the level of the test.
Type 2 error: the test does not reject the null hypothesis (ψ = 0) although the alternative hypothesis is true (H1 = TRUE). Its probability is βψ(θ).
Power of the test: πψ = inf_{θ∈Θ1} (1 − βψ(θ)), where βψ(θ) is the probability of a Type 2 error and inf denotes the infimum (worst case over Θ1).
A test ψ has level α if αψ(θ) ≤ α, ∀θ ∈ Θ0, and asymptotic level α if limn→∞ Pθ(ψ = 1) ≤ α for all θ ∈ Θ0.
Example: let X1, ..., Xn ~iid Ber(p*). Question: is p* = 1/2?
  H0: p* = 1/2; H1: p* ≠ 1/2
For asymptotic level α we need to standardize the estimated parameter p̂ = X̄n first:
  Tn = √n |X̄n − 0.5| / √(0.5(1 − 0.5))
  ψn = 1(Tn > q_{α/2})
where q_{α/2} denotes the q_{α/2} quantile of a standard Gaussian and α is determined by the required level of ψ. Note the absolute value in Tn for this two-sided test. (See the numeric sketch at the end of this page.)

6.1 P-Value
The (asymptotic) p-value of a test ψα is the smallest (asymptotic) level α at which ψα rejects H0. It is random since it depends on the sample. It can also be interpreted as the probability that the test statistic Tn is realized under the null hypothesis.
If pvalue ≤ α, H0 is rejected by ψα at the (asymptotic) level α. The smaller the p-value, the more confidently one can reject H0.
Left-tailed p-values: pvalue = P(X ≤ x | H0) = P(Z < T_{n,θ0}(Xⁿ)) = Φ(T_{n,θ0}(Xⁿ))
Right-tailed p-values: pvalue = P(X ≥ x | H0)
Two-sided p-values: pvalue = 2 min{P(X ≤ x | H0), P(X ≥ x | H0)}; with Z ~ N(0, 1),
  P(|Z| > |T_{n,θ0}(Xⁿ)|) = 2(1 − Φ(|T_{n,θ0}(Xⁿ)|))

6.2 Comparisons of two proportions
Let X1, ..., Xn ~iid Bern(px) and Y1, ..., Yn ~iid Bern(py), with X independent of Y.
  p̂x = (1/n) Σᵢ Xi and p̂y = (1/n) Σᵢ Yi
  H0: px = py; H1: px ≠ py
To get the asymptotic variance, use the multivariate Delta method. Consider p̂x − p̂y = g(p̂x, p̂y) with g(x, y) = x − y. Then
  √n (g(p̂x, p̂y) − g(px, py)) →(d) N(0, ∇g(px, py)ᵀ Σ ∇g(px, py)) = N(0, px(1 − px) + py(1 − py))

7 Non-asymptotic Hypothesis tests

7.1 Chi squared
The χ²_d distribution with d degrees of freedom is given by the distribution of Z1² + Z2² + ... + Zd², where Z1, ..., Zd ~iid N(0, 1).
If V ~ χ²_d:
  E[V] = E[Z1²] + E[Z2²] + ... + E[Zd²] = d
  Var(V) = Var(Z1²) + Var(Z2²) + ... + Var(Zd²) = 2d
Cochran's Theorem: if X1, ..., Xn ~iid N(µ, σ²), then the sample mean X̄n and the sample variance Sn are independent, and the sum of squares follows a chi-squared distribution with (n − 1) degrees of freedom:
  nSn/σ² ~ χ²_{n−1}
If the formula for the unbiased sample variance is used: (n − 1)S̃n/σ² ~ χ²_{n−1}

7.2 Student's T Test
Non-asymptotic hypothesis test for small samples (works on large samples too); the data must be Gaussian.
Student's T distribution with d degrees of freedom: t_d := Z/√(V/d), where Z ~ N(0, 1) and V ~ χ²_d are independent.
Test statistic:
  Tn = √n (X̄n − µ0)/√(S̃n) = [√n (X̄n − µ0)/σ] / √(S̃n/σ²) ~ t_{n−1}
This works because under H0 the numerator is ~ N(0, 1), the denominator satisfies S̃n ~ σ²/(n−1) · χ²_{n−1}, and the two are independent by Cochran's Theorem.
Student's T test at level α (one sample, two-sided): let X1, ..., Xn ~iid N(µ, σ²) and suppose we want to test H0: µ = µ0 = 0 vs. H1: µ ≠ 0. The test statistic follows a Student's T distribution:
  ψα = 1{|Tn| > q_{α/2}(t_{n−1})}
Student's T test (one sample, one-sided):
  ψα = 1{Tn > qα(t_{n−1})}
Student's T test (two samples, two-sided): let X1, ..., Xn ~iid N(µX, σX²) and Y1, ..., Ym ~iid N(µY, σY²), and suppose we want to test H0: µX = µY vs H1: µX ≠ µY.
Test statistic under H0:
  T_{n,m} = (X̄n − Ȳm) / √(σ̂X²/n + σ̂Y²/m)
When the samples have different sizes we need to find the Student's T distribution of T_{n,m} ~ t_N.
Welch-Satterthwaite formula: calculate the degrees of freedom for t_N with:
  N = (σ̂X²/n + σ̂Y²/m)² / (σ̂X⁴/(n²(n − 1)) + σ̂Y⁴/(m²(m − 1))) ≥ min(n, m)
N should be rounded down.

7.3 Wald's Test
Squared distance of θ̂n^MLE to the true θ0, using the Fisher information I(θ̂n^MLE) as the metric.
Let X1, ..., Xn ~iid Pθ* for some true parameter θ* ∈ ℝ^d, with maximum likelihood estimator θ̂n^MLE for θ*. Test H0: θ* = 0 vs H1: θ* ≠ 0.
Under H0, the asymptotic normality of the MLE θ̂n^MLE implies that:
  ‖√n I(0)^{1/2} (θ̂n^MLE − 0)‖² →(d) χ²_d
Test statistic (and its asymptotic distribution):
  Tn = n (θ̂n^MLE − θ0)ᵀ I(θ̂n^MLE) (θ̂n^MLE − θ0) →(d) χ²_d
Wald test of level α:
  ψα = 1{Tn > qα(χ²_d)}

7.4 Likelihood Ratio Test
Parameter space Θ ⊆ ℝ^d; H0 states that the parameters θ_{r+1} through θ_d have values θ⁰_{r+1} through θ⁰_d, leaving the other r unspecified. That is:
  H0: (θ_{r+1}, ..., θ_d)ᵀ = θ⁰_{r+1...d}
Construct two estimators:
  θ̂n^MLE = argmax_{θ∈Θ} ℓn(θ)
  θ̂n^c  = argmax_{θ∈Θ0} ℓn(θ)
Test statistic:
  Tn = 2(ℓ(X1, ..., Xn | θ̂n^MLE) − ℓ(X1, ..., Xn | θ̂n^c))
Wilks' Theorem: under H0, if the MLE conditions are satisfied:
  Tn →(d) χ²_{d−r}
Likelihood ratio test at level α:
  ψα = 1{Tn > qα(χ²_{d−r})}

7.5 Implicit Testing
Todo

7.6 Goodness of Fit, Discrete Distributions
Let X1, ..., Xn be iid samples from a categorical distribution. Test H0: p = p0 against H1: p ≠ p0.
Example: against the uniform distribution p0 = (1/K, ..., 1/K)ᵀ.
Test statistic:
  Tn = n Σ_{k=1}^{K} (p̂k − p⁰k)² / p⁰k →(d) χ²_{K−1}
Test at level α:
  ψα = 1{Tn > qα(χ²_{K−1})}
(A numeric sketch follows at the end of this page.)

7.7 Kolmogorov-Smirnov test
7.8 Kolmogorov-Lilliefors test
7.9 QQ plots
Heavier tails: below > above the diagonal.
Lighter tails: above > below the diagonal.
Right-skewed: above > below > above the diagonal.
Left-skewed: below > above > below the diagonal.

8 Distances between distributions

8.1 Total variation distance
The total variation distance TV between the probability measures P and Q with a sample space E is defined as:
  TV(P, Q) = max_{A⊂E} |P(A) − Q(A)|
Calculation with pmfs/pdfs f and g:
  TV(P, Q) = (1/2) Σ_{x∈E} |f(x) − g(x)|   (discrete)
  TV(P, Q) = (1/2) ∫_{x∈E} |f(x) − g(x)| dx   (continuous)
Symmetry: TV(P, Q) = TV(Q, P)
Positive: TV(P, Q) ≥ 0
Definite: TV(P, Q) = 0 ⟺ P = Q
Triangle inequality: TV(P, V) ≤ TV(P, Q) + TV(Q, V)
If the supports of P and Q are disjoint: TV(P, Q) = 1
TV between a continuous and a discrete r.v.: TV(P, Q) = 1
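The following is a small numeric sketch, not part of the original sheet, of the plugin asymptotic CI from 5.1 and the two-sided Bernoulli test and p-value from 6 and 6.1. The sample is simulated and all numbers (sample size, true p*, level) are assumptions made for the illustration.

```python
# Illustrative sketch (assumed data): plugin asymptotic CI and two-sided test
# for a Bernoulli proportion, following Sections 5.1, 6 and 6.1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.55, size=200)    # hypothetical sample, assumed true p* = 0.55
n = len(x)
alpha = 0.05

p_hat = x.mean()                       # estimator p̂ = X̄n
se = np.sqrt(p_hat * (1 - p_hat) / n)  # plugin estimate of sqrt(Var(Xi))/sqrt(n)
q = norm.ppf(1 - alpha / 2)            # q_{alpha/2} quantile of N(0, 1)
ci = (p_hat - q * se, p_hat + q * se)  # asymptotic level 1-alpha CI

# Two-sided test of H0: p* = 1/2 (Section 6), standardized under H0
tn = np.sqrt(n) * abs(p_hat - 0.5) / np.sqrt(0.5 * 0.5)
reject = tn > q                        # psi = 1{|Tn| > q_{alpha/2}}
pvalue = 2 * (1 - norm.cdf(tn))        # two-sided asymptotic p-value (Section 6.1)

print(f"p_hat={p_hat:.3f}, CI=({ci[0]:.3f}, {ci[1]:.3f}), "
      f"Tn={tn:.2f}, reject={reject}, p-value={pvalue:.4f}")
```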
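Along the same lines, a minimal sketch of the χ² goodness-of-fit test against a uniform p0 from 7.6; the category counts are invented for the example.

```python
# Illustrative sketch (assumed data): chi-squared goodness-of-fit test against
# the uniform distribution p0 = (1/K, ..., 1/K), as in Section 7.6.
import numpy as np
from scipy.stats import chi2

counts = np.array([18, 25, 22, 15, 20])     # hypothetical category counts
n, K = counts.sum(), len(counts)
p_hat = counts / n                          # empirical pmf
p0 = np.full(K, 1 / K)                      # uniform null hypothesis

tn = n * np.sum((p_hat - p0) ** 2 / p0)     # Tn -> chi2 with K-1 df under H0
alpha = 0.05
reject = tn > chi2.ppf(1 - alpha, df=K - 1) # psi = 1{Tn > q_alpha(chi2_{K-1})}
pvalue = chi2.sf(tn, df=K - 1)
print(f"Tn={tn:.2f}, reject={reject}, p-value={pvalue:.3f}")
```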
Capstone-Cheatsheet Statistics 1
by Blechturm, Page 2 of 4

8.2 KL divergence
The KL divergence (aka relative entropy) KL between probability measures P and Q with common sample space E and pmf/pdf functions p and q is defined as:
  KL(P, Q) = Σ_{x∈E} p(x) ln(p(x)/q(x))   (discrete)
  KL(P, Q) = ∫_{x∈E} p(x) ln(p(x)/q(x)) dx   (continuous)
The KL divergence is not a distance measure! Always sum over the support of P!
Asymmetric in general: KL(P, Q) ≠ KL(Q, P)
Nonnegative: KL(P, Q) ≥ 0
Definite: if P = Q then KL(P, Q) = 0
Does not satisfy the triangle inequality in general: KL(P, V) ≰ KL(P, Q) + KL(Q, V)
Estimator of the KL divergence:
  KL(Pθ*, Pθ) = Eθ*[ln(pθ*(X)/pθ(X))]
  KL̂(Pθ*, Pθ) = const − (1/n) Σᵢ log pθ(Xi)

9 Maximum likelihood estimation
Let (E, (Pθ)θ∈Θ) be a statistical model associated with a sample of i.i.d. random variables X1, X2, ..., Xn. Assume that there exists θ* ∈ Θ such that Xi ~ Pθ*.
The likelihood of the model is the product of the pdf/pmf over the n samples:
  Ln(X1, ..., Xn, θ) = Πᵢ₌₁ⁿ pθ(Xi)  if E is discrete
  Ln(X1, ..., Xn, θ) = Πᵢ₌₁ⁿ fθ(Xi)  if E is continuous
The maximum likelihood estimator is the (unique) θ that minimizes KL̂(Pθ*, Pθ) over the parameter space. (The minimizer of the KL divergence is unique because KL is strictly convex in the space of distributions once Pθ* is fixed.)
  θ̂n^MLE = argmin_{θ∈Θ} KL̂n(Pθ*, Pθ) = argmax_{θ∈Θ} ln Πᵢ₌₁ⁿ pθ(Xi) = argmax_{θ∈Θ} Σᵢ₌₁ⁿ ln pθ(Xi)
Since taking derivatives of products is hard but easy for sums, and exp() is very common in pdfs, we usually take the log of the likelihood function before maximizing it:
  ℓ(X1, ..., Xn, θ) = ln Ln(X1, ..., Xn, θ) = Σᵢ₌₁ⁿ ln Li(Xi, θ)
Cookbook: set up the likelihood function, take the log of the likelihood function, take the partial derivative(s) of the log-likelihood with respect to the parameter(s), set the partial derivative(s) to zero and solve for the parameter(s).
If an indicator function in the pdf/pmf does not depend on the parameter, it can be ignored. If it depends on the parameter it cannot be ignored, because there is a discontinuity in the log-likelihood function; the maximum/minimum of the Xi is then the maximum likelihood estimator.

9.1 Fisher Information
The Fisher information is the covariance matrix of the gradient of the log-likelihood function. It is equal to the negative expectation of the Hessian of the log-likelihood function and captures the negative of the expected curvature of the log-likelihood.
Let θ ∈ Θ ⊂ ℝ^d and let (E, {Pθ}θ∈Θ) be a statistical model. Let fθ(x) be the pdf of the distribution Pθ. Then the Fisher information of the statistical model is:
  I(θ) = Cov(∇ℓ(θ)) = E[∇ℓ(θ)∇ℓ(θ)ᵀ] − E[∇ℓ(θ)]E[∇ℓ(θ)]ᵀ = −E[Hℓ(θ)]
where ℓ(θ) = ln fθ(X). If ∇ℓ(θ) ∈ ℝ^d, I(θ) is a d × d matrix. The definition when the distribution has a pmf pθ(x) is the same, with the expectation taken with respect to the pmf.
Let (ℝ, {Pθ}θ∈ℝ) denote a continuous statistical model and fθ(x) the pdf of the continuous distribution Pθ, assumed twice differentiable as a function of the parameter θ. Formula for the calculation of the Fisher information of X:
  I(θ) = ∫_{−∞}^{∞} (∂fθ(x)/∂θ)² / fθ(x) dx
Models with one parameter (e.g. Bernoulli):
  I(θ) = Var(ℓ′(θ)) = −E(ℓ″(θ))
Models with multiple parameters (e.g. Gaussians):
  I(θ) = −E[Hℓ(θ)]
Cookbook:
• Find the log-likelihood.
• Take the second derivative (= the Hessian if multivariate).
• Massage the second derivative or Hessian (isolate functions of the Xi to use with −E(ℓ″(θ)) or −E[Hℓ(θ)]).
• Find the expectations of the functions of the Xi and substitute them back into the Hessian or the second derivative. Be extra careful to substitute the right power back: E[Xi] ≠ E[Xi²].
• Don't forget the minus sign!
It is usually better to use the 2nd derivative.

9.2 Asymptotic normality of the maximum likelihood estimator
Under certain conditions the MLE is asymptotically normal and consistent. This applies even if the MLE is not the sample average.
Let the true parameter be θ* ∈ Θ. Necessary assumptions:
• The parameter is identifiable.
• For all θ ∈ Θ, the support of Pθ does not depend on θ (unlike, e.g., Unif(0, θ)).
• θ* is not on the boundary of Θ.
• The Fisher information I(θ) is invertible in a neighborhood of θ*.
• A few more technical conditions.
The asymptotic variance of the MLE is the inverse of the Fisher information:
  √n (θ̂n^MLE − θ*) →(d) N_d(0, I(θ*)⁻¹)

10 Method of Moments
Let X1, ..., Xn ~iid Pθ*, associated with the model (E, {Pθ}θ∈Θ), with E ⊆ ℝ and Θ ⊆ ℝ^d, for some d ≥ 1.
Population moments: mk(θ) = Eθ[X1^k], 1 ≤ k ≤ d
Empirical moments: m̂k = (1/n) Σᵢ Xi^k
Convergence of empirical moments:
  m̂k →(P, a.s.) mk
  (m̂1, ..., m̂d) →(P, a.s.) (m1, ..., md)
The MOM estimator uses a map M from the parameters of a model to the moments of its distribution. This map is invertible (i.e. it yields a system of equations that can be solved for the true parameter vector θ*). Find the moments (as many as parameters), set up the system of equations, solve for the parameters, and use the empirical moments to estimate:
  M: θ ↦ (m1(θ), m2(θ), ..., md(θ))
  M⁻¹(m1(θ*), m2(θ*), ..., md(θ*)) = θ*
The MOM estimator plugs the empirical moments into M⁻¹:
  θ̂n^MM = M⁻¹((1/n)Σᵢ Xi, (1/n)Σᵢ Xi², ..., (1/n)Σᵢ Xi^d)
Assuming M⁻¹ is continuously differentiable at M(θ), the asymptotic variance of the MOM estimator is given by:
  √n (θ̂n^MM − θ) →(d) N(0, Γ(θ))
where
  Γ(θ) = [∂M⁻¹/∂θ (M(θ))]ᵀ Σ(θ) [∂M⁻¹/∂θ (M(θ))] = ∇θ(M⁻¹)ᵀ Σ ∇θ(M⁻¹)
and Σ(θ) is the covariance matrix of the random vector of moments (X1¹, X1², ..., X1^d).

11 Bayesian Statistics
Bayesian inference conceptually amounts to weighting the likelihood Ln(θ) by prior knowledge we might have on θ. Given a statistical model, we technically model our parameter θ as if it were a random variable. We therefore define the prior distribution (PDF): π(θ).
Given X1, ..., Xn, we write Ln(X1, ..., Xn | θ) for the joint probability distribution of X1, ..., Xn conditioned on θ, where θ ~ π. This is exactly the likelihood from the frequentist approach.

11.1 Bayes' formula
The posterior distribution verifies:
  ∀θ ∈ Θ, π(θ | X1, ..., Xn) ∝ π(θ) Ln(X1, ..., Xn | θ)
The constant is the normalization factor that ensures the result is a proper distribution; it does not depend on θ:
  π(θ | X1, ..., Xn) = π(θ) Ln(X1, ..., Xn | θ) / ∫ π(θ) Ln(X1, ..., Xn | θ) dθ
We can often use an improper prior, i.e. a prior that is not a proper probability distribution (its integral diverges), and still get a proper posterior. For example, the improper prior π(θ) = 1 on Θ gives the likelihood as the posterior.

11.2 Jeffreys Prior
  πJ(θ) ∝ √(det I(θ))
where I(θ) is the Fisher information. This prior is invariant under reparameterization, which means that if we have η = φ(θ), then the same prior gives us a probability distribution for η verifying:
  π̃J(η) ∝ √(det Ĩ(η))
The change of parameter follows the formula:
  π̃J(η) = det(∇φ⁻¹(η)) πJ(φ⁻¹(η))

11.3 Bayesian confidence region
Let α ∈ (0, 1). A *Bayesian confidence region with level α* is a random subset R ⊂ Θ depending on X1, ..., Xn (and the prior π) such that:
  P[θ ∈ R | X1, ..., Xn] ≥ 1 − α
Bayesian confidence regions and confidence intervals are distinct notions. The Bayesian framework can also be used to estimate the true underlying parameter; in that case it is used to build a new class of estimators, based on the posterior distribution.

11.4 Bayes estimator
Posterior mean:
  θ̂(π) = ∫_Θ θ π(θ | X1, ..., Xn) dθ
Maximum a posteriori estimator (MAP):
  θ̂(π)^MAP = argmax_{θ∈Θ} π(θ | X1, ..., Xn)
The MAP is equivalent to the MLE if the prior is uniform. (A numeric conjugate-prior sketch follows at the end of this page.)

12 OLS
Given two random variables X and Y, how can we predict the values of Y given X? Let us consider (X1, Y1), ..., (Xn, Yn) ~iid P, where P is an unknown joint distribution. P can be described entirely by:
  g(x) = ∫ f(x, y) dy
  h(y | X = x) = f(x, y)/g(x)
where f is the joint PDF, g the marginal density of X and h the conditional density. What we are interested in is h(Y | X).
Regression function: for a partial description, we can consider instead the conditional expectation of Y given X = x:
  x ↦ f(x) = E[Y | X = x] = ∫ y h(y|x) dy
We can also consider different descriptions of the distribution, like the median, quantiles or the variance.
Linear regression: trying to fit an arbitrary function to E[Y | X = x] is a nonparametric problem; therefore, we restrict the problem to the tractable one of a linear function:
  f: x ↦ a + bx
Theoretical linear regression: let X, Y be two random variables with two moments and such that V[X] > 0. The theoretical linear regression of Y on X is the line a* + b*x where
  (a*, b*) = argmin_{(a,b)∈ℝ²} E[(Y − a − bX)²]
which gives:
  b* = Cov(X, Y)/V[X],  a* = E[Y] − b*E[X]
Noise: we model the noise of Y around the regression line by a random variable ε = Y − a* − b*X, such that:
  E[ε] = 0, Cov(X, ε) = 0
**Statistical inference**: we have to estimate a* and b* from the data. We have n random pairs (X1, Y1), ..., (Xn, Yn) ~iid (X, Y) such that:
  Yi = a* + b*Xi + εi
The Least Squares Estimator (LSE) of (a*, b*) is the minimizer of the squared sum:
  (ân, b̂n) = argmin_{(a,b)∈ℝ²} Σᵢ₌₁ⁿ (Yi − a − bXi)²
The estimators are given by:
  b̂n = ((1/n)Σᵢ XiYi − X̄n·Ȳn) / ((1/n)Σᵢ Xi² − X̄n²),  ân = Ȳn − b̂n X̄n
The multivariate regression is given by:
  Yi = Σ_{j=1}^{p} Xi^(j) βj* + εi = Xiᵀ β*  + εi   (Xiᵀ is 1×p, β* is p×1)
We can assume that Xi^(1) = 1 for the intercept.
• If β* = (a*, b*)ᵀ, then β1* = a* is the intercept.
• εi is the noise, satisfying Cov(Xi, εi) = 0.
The Multivariate Least Squares Estimator (LSE) of β* is the minimizer of the sum of squared errors:
  β̂ = argmin_{β∈ℝ^p} Σᵢ₌₁ⁿ (Yi − Xiᵀβ)²
Matrix form: we can rewrite these expressions. Let Y = (Y1, ..., Yn)ᵀ ∈ ℝⁿ, ε = (ε1, ..., εn)ᵀ and
  X = (X1ᵀ, ..., Xnᵀ)ᵀ ∈ ℝ^{n×p}
X is called the **design matrix**. The regression is given by:
  Y = Xβ* + ε
and the LSE is given by:
  β̂ = argmin_{β∈ℝ^p} ‖Y − Xβ‖₂²
Let us suppose n ≥ p and rank(X) = p. If we write:
  F(β) = ‖Y − Xβ‖₂² = (Y − Xβ)ᵀ(Y − Xβ)
then:
  ∇F(β) = −2Xᵀ(Y − Xβ)
Least squares estimator: setting ∇F(β) = 0 gives us the expression of β̂:
  β̂ = (XᵀX)⁻¹XᵀY
**Geometric interpretation**: Xβ̂ is the orthogonal projection of Y onto the subspace spanned by the columns of X:
  Xβ̂ = PY, where P = X(XᵀX)⁻¹Xᵀ is the expression of the projector.
(A numeric sketch of the matrix-form LSE follows at the end of this page.)
**Statistical inference**: let us suppose that:
• The design matrix X is deterministic and rank(X) = p.
• The model is **homoscedastic**: ε1, ..., εn are i.i.d.
• The noise is Gaussian: ε ~ N_n(0, σ²I_n).
We therefore have: Y ~ N_n(Xβ*, σ²I_n).
Properties of the LSE:
  β̂ ~ N_p(β*, σ²(XᵀX)⁻¹)
The quadratic risk of β̂ is given by:
  E[‖β̂ − β*‖₂²] = σ² Tr((XᵀX)⁻¹)
The prediction error is given by:
  E[‖Y − Xβ̂‖₂²] = σ²(n − p)
The unbiased estimator of σ² is:
  σ̂² = ‖Y − Xβ̂‖₂² / (n − p) = (1/(n − p)) Σᵢ ε̂i²
By **Cochran's Theorem**:
  (n − p) σ̂²/σ² ~ χ²_{n−p}, and β̂ ⊥ σ̂²
**Significance test**: let us test H0: βj = 0 against H1: βj ≠ 0. Let us call
  γj = ((XᵀX)⁻¹)_{jj} > 0
then:
  (β̂j − βj)/√(σ̂² γj) ~ t_{n−p}
We can define the test statistic for our test:
  Tn^(j) = β̂j / √(σ̂² γj)
The test with non-asymptotic level α is given by:
  ψα^(j) = 1{|Tn^(j)| > q_{α/2}(t_{n−p})}
**Bonferroni's test**: if we want to test the significance of multiple coefficients at the same time, we cannot use the same level α for each of them; we must use a stricter test for each. Let us consider S ⊆ {1, ..., p} and test
  H0: ∀j ∈ S, βj = 0  vs.  H1: ∃j ∈ S, βj ≠ 0
The *Bonferroni's test* with significance level α is given by:
  ψα^(S) = max_{j∈S} ψ_{α/K}^(j)
where K = |S|. The rejection region is the union of the individual rejection regions:
  Rα^(S) = ∪_{j∈S} R_{α/K}^(j)
This test has non-asymptotic level at most α:
  P_{H0}(Rα^(S)) ≤ Σ_{j∈S} P_{H0}(R_{α/K}^(j)) = α
This test also works for implicit testing (for example, β1 ≥ β2).

13 Generalized Linear Models
We relax the assumption that µ is linear. Instead, we assume that g ∘ µ is linear, for some function g:
  g(µ(x)) = xᵀβ
The function g is assumed to be known and is referred to as the link function. It maps the domain of the dependent variable to the entire real line. It has to be strictly increasing, it has to be continuously differentiable, and its range is all of ℝ.

13.1 The Exponential Family
A family of distributions {Pθ: θ ∈ Θ}, where the parameter space Θ ⊂ ℝ^k is k-dimensional, is called a k-parameter exponential family on ℝ^q if the pmf or pdf fθ: ℝ^q → ℝ of Pθ can be written in the form:
  fθ(y) = h(y) exp(η(θ) · T(y) − B(θ))
where
  η(θ) = (η1(θ), ..., ηk(θ))ᵀ : ℝ^k → ℝ^k
  T(y) = (T1(y), ..., Tk(y))ᵀ : ℝ^q → ℝ^k
  B(θ) : ℝ^k → ℝ
  h(y) : ℝ^q → ℝ
If k = 1 it reduces to:
  fθ(y) = h(y) exp(η(θ)T(y) − B(θ))

14 Expectation
  E[X] = ∫_{−∞}^{+∞} x · fX(x) dx
  E[g(X)] = ∫_{−∞}^{+∞} g(x) · fX(x) dx
  E[X | Y = y] = ∫_{−∞}^{+∞} x · f_{X|Y}(x|y) dx
Integration limits only have to cover the support of the pdf. Discrete r.v.: same as continuous, but with sums and pmfs.
Total expectation theorem:
  E[X] = ∫_{−∞}^{+∞} fY(y) · E[X | Y = y] dy
Law of iterated expectation:
  E[Y] = E[E[Y|X]]
Expectation of a constant a: E[a] = a
Product of independent r.v.s X and Y:
  E[X·Y] = E[X]·E[Y]
Product of dependent r.v.s X and Y:
  E[X·Y] ≠ E[X]·E[Y] in general
  E[X·Y] = E[E[X·Y | Y]] = E[Y·E[X|Y]]
Linearity of expectation, where a and c are given scalars:
  E[aX + cY] = aE[X] + cE[Y]
If the variance of X is known:
  E[X²] = Var(X) + (E[X])²
 
Capstone-Cheatsheet Statistics 1
by Blechturm, Page 3 of 4

15 Variance
Variance is the expected squared distance from the mean.
  Var(X) = E[(X − E(X))²] = E[X²] − (E[X])²
Variance of a product with a constant a:
  Var(aX) = a² Var(X)
Variance of the sum of two dependent r.v.s:
  Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Variance of the sum/difference of two independent r.v.s:
  Var(X + Y) = Var(X) + Var(Y)
  Var(X − Y) = Var(X) + Var(Y)

16 Covariance
The covariance is a measure of how much the values of two correlated random variables determine each other.
  Cov(X, Y) = E[(X − µX)(Y − µY)] = E[XY] − E[X]E[Y] = E[X(Y − µY)]
Possible notations: Cov(X, Y) = σ(X, Y) = σ_{X,Y}
Covariance is commutative: Cov(X, Y) = Cov(Y, X)
Covariance of a r.v. with itself is the variance:
  Cov(X, X) = E[(X − µX)²] = Var(X)
Useful properties:
  Cov(aX + h, bY + c) = ab·Cov(X, Y)
  Cov(X, X + Y) = Var(X) + Cov(X, Y)
  Cov(aX + bY, Z) = a·Cov(X, Z) + b·Cov(Y, Z)
If Cov(X, Y) = 0, we say that X and Y are uncorrelated. If X and Y are independent, their covariance is zero. The converse is not always true: it is only true if X and Y form a Gaussian vector, i.e. any linear combination αX + βY is Gaussian for all (α, β) ∈ ℝ² without {0, 0}.

17 Correlation coefficient
  ρ(X, Y) = Cov(X, Y)/√(Var(X)·Var(Y))

18 Important probability distributions

Bernoulli
Parameter p ∈ [0, 1], discrete.
  px(k) = p if k = 1, (1 − p) if k = 0
  E[X] = p, Var(X) = p(1 − p)
Likelihood (n trials):
  Ln(X1, ..., Xn, p) = p^{Σᵢ Xi} (1 − p)^{n − Σᵢ Xi}
Loglikelihood (n trials):
  ℓn(p) = ln(p) Σᵢ Xi + (n − Σᵢ Xi) ln(1 − p)
MLE: p̂^MLE = (1/n) Σᵢ Xi
Fisher information: I(p) = 1/(p(1 − p))
Canonical exponential form:
  fθ(y) = exp(yθ − ln(1 + e^θ) + 0), with θ = ln(p/(1 − p)), φ = 1, b(θ) = ln(1 + e^θ), c(y, φ) = 0

Binomial
Parameters p and n, discrete. Describes the number of successes in n independent Bernoulli trials.
  px(k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, ..., n
  E[X] = np, Var(X) = np(1 − p)
Likelihood (n samples from a binomial with K trials and success probability θ):
  Ln(X1, ..., Xn, θ) = Πᵢ (K choose Xi) · θ^{Σᵢ Xi} (1 − θ)^{nK − Σᵢ Xi}
Loglikelihood:
  ℓn(θ) = C + Σᵢ Xi log θ + (nK − Σᵢ Xi) log(1 − θ)

Multinomial
Parameters n > 0 and p1, ..., pr.
  px(x) = n!/(x1!···xr!) · p1^{x1} ··· pr^{xr}
  E[Xi] = n·pi, Var(Xi) = n·pi(1 − pi)
Likelihood (n trials):
  px(x) = Πⱼ pj^{Tj}, where Tj = Σᵢ 1(Xi = j) is the count of how often outcome j is seen in n trials
Loglikelihood:
  ℓn = Σⱼ Tj ln pj

Uniform
Parameters a and b, continuous.
  fx(x) = 1/(b − a) if a < x < b, 0 otherwise
  Fx(x) = 0 for x ≤ a; (x − a)/(b − a) for x ∈ [a, b); 1 for x ≥ b
  E[X] = (a + b)/2, Var(X) = (b − a)²/12
Likelihood (for Unif(0, b)):
  L(x1, ..., xn; b) = 1(maxᵢ xi ≤ b) / bⁿ

Poisson
Parameter λ, discrete; approximates the binomial PMF when n is large, p is small, and λ = np.
  px(k) = exp(−λ) λ^k / k!  for k = 0, 1, ...
  E[X] = λ, Var(X) = λ
Likelihood:
  Ln(x1, ..., xn, λ) = Πᵢ (λ^{xi}/xi!) · e^{−nλ}
Loglikelihood:
  ℓn(λ) = −nλ + log(λ) Σᵢ xi − log(Πᵢ xi!)
MLE: λ̂^MLE = (1/n) Σᵢ Xi
Fisher information: I(λ) = 1/λ
Canonical exponential form:
  fθ(y) = exp(yθ − e^θ − ln y!), with θ = ln λ, φ = 1, b(θ) = e^θ, c(y, φ) = −ln y!
Poisson process, k arrivals in t slots:
  px(k, t) = P(Nt = k) = e^{−λt}(λt)^k/k!
  E[Nt] = λt, Var(Nt) = λt

Exponential
Parameter λ, continuous.
  fx(x) = λ exp(−λx) if x ≥ 0, 0 otherwise
  Fx(x) = 1 − exp(−λx) if x ≥ 0, 0 otherwise
  P(X > a) = exp(−λa)
  E[X] = 1/λ, E[X²] = 2/λ², Var(X) = 1/λ²
Likelihood:
  L(X1, ..., Xn; λ) = λⁿ exp(−λ Σᵢ Xi)
Loglikelihood:
  ℓn(λ) = n ln(λ) − λ Σᵢ Xi
MLE: λ̂^MLE = n / Σᵢ Xi
Fisher information: I(λ) = 1/λ²
Canonical exponential form:
  fθ(y) = exp(yθ − (−ln(−θ)) + 0), with θ = −λ = −1/µ, φ = 1, b(θ) = −ln(−θ), c(y, φ) = 0

Shifted Exponential
Parameters λ, a ∈ ℝ, continuous.
  fx(x) = λ exp(−λ(x − a)) if x ≥ a, 0 otherwise
  Fx(x) = 1 − exp(−λ(x − a)) if x ≥ a, 0 otherwise
  E[X] = a + 1/λ, Var(X) = 1/λ²
Likelihood:
  L(X1, ..., Xn; λ, a) = λⁿ exp(−λ Σᵢ (Xi − a)) · 1(minᵢ Xi ≥ a)
Loglikelihood:
  ℓ(λ, a) = n ln λ − λ Σᵢ Xi + nλa   (for a ≤ minᵢ Xi)
MLE: â^MLE = minᵢ Xi,  λ̂^MLE = 1/(X̄n − â)

Geometric
Number of trials T up to (and including) the first success. Parameter p, discrete.
  pT(t) = p(1 − p)^{t−1}, t = 1, 2, ...
  E[T] = 1/p, Var(T) = (1 − p)/p²

Pascal
The negative binomial or Pascal distribution is a generalization of the geometric distribution. It relates to the random experiment of repeated independent trials until observing k successes, i.e. the time of the kth arrival.
  Yk = T1 + ... + Tk, with Ti ~iid Geometric(p)
  pYk(t) = (t−1 choose k−1) p^k (1 − p)^{t−k}, t = k, k + 1, ...
  E[Yk] = k/p, Var(Yk) = k(1 − p)/p²

Cauchy
Continuous, parameter m.
  fm(x) = (1/π) · 1/(1 + (x − m)²)
  E[X] = not defined!  Var(X) = not defined!
  med(X) = m:  P(X > m) = P(X < m) = 1/2 = ∫_m^∞ (1/π) · 1/(1 + (x − m)²) dx

Univariate Gaussians
Parameters µ and σ² > 0, continuous.
  f(x) = 1/√(2πσ²) · exp(−(x − µ)²/(2σ²))
  E[X] = µ, Var(X) = σ²
CDF of the standard Gaussian:
  Φ(z) = ∫_{−∞}^{z} (1/√(2π)) e^{−x²/2} dx
Likelihood:
  L(x1, ..., xn; µ, σ²) = 1/(σ√(2π))ⁿ · exp(−(1/(2σ²)) Σᵢ (Xi − µ)²)
Loglikelihood:
  ℓn(µ, σ²) = −n log(σ√(2π)) − (1/(2σ²)) Σᵢ (Xi − µ)²
MLE: µ̂^MLE = X̄n,  σ̂²^MLE = (1/n) Σᵢ (Xi − X̄n)²
Fisher information:
  I(µ, σ²) = [[1/σ², 0], [0, 1/(2σ⁴)]]
Gaussians are invariant under affine transformation:
  aX + b ~ N(aµ + b, a²σ²)
Sum of independent Gaussians: let X ~ N(µX, σX²) and Y ~ N(µY, σY²) be independent.
  If W = X + Y, then W ~ N(µX + µY, σX² + σY²)
  If U = X − Y, then U ~ N(µX − µY, σX² + σY²)
If X ~ N(0, σ²), then −X ~ N(0, σ²) and P(|X| > x) = 2P(X > x).
Standardization:
  Z = (X − µ)/σ ~ N(0, 1)
  P(X ≤ t) = P(Z ≤ (t − µ)/σ)
Higher moments:
  E[X²] = µ² + σ²
  E[X³] = µ³ + 3µσ²
  E[X⁴] = µ⁴ + 6µ²σ² + 3σ⁴

Chi squared
The χ²_d distribution with d degrees of freedom is given by the distribution of Z1² + Z2² + ... + Zd², where Z1, ..., Zd ~iid N(0, 1).
If V ~ χ²_d: E[V] = d, Var(V) = 2d.

Student's T Distribution
  Tn := Z/√(V/n), where Z ~ N(0, 1), V ~ χ²_n, and Z and V are independent.

18.1 Useful to know
18.1.1 Min of iid exponential r.v.
Let X1, ..., Xn be i.i.d. Exp(λ) random variables. Distribution of minᵢ(Xi):
  P(minᵢ(Xi) ≤ t) = 1 − P(minᵢ(Xi) ≥ t) = 1 − P(X1 ≥ t)·P(X2 ≥ t)···P(Xn ≥ t) = 1 − (1 − FX(t))ⁿ = 1 − e^{−nλt}
Differentiate w.r.t. t to get the pdf of minᵢ(Xi):
  fmin(t) = (nλ) e^{−(nλ)t}
i.e. the minimum is Exp(nλ).

18.1.2 Counting Committees
"Out of 2n people, we want to choose a committee of n people, one of whom will be its chair. In how many different ways can this be done?"
  n·(2n choose n) = 2n·(2n−1 choose n−1)
"In a group of 2n people, consisting of n boys and n girls, we want to select a committee of n people. In how many ways can this be done?"
  (2n choose n) = Σ_{i=0}^{n} (n choose i)(n choose n−i)
"How many subsets does a set with 2n elements have?"
  2^{2n} = Σ_{k=0}^{2n} (2n choose k)
"Out of n people, we want to form a committee consisting of a chair and other members. We allow the committee size to be any integer in the range 1, 2, ..., n. How many choices do we have in selecting a committee-chair combination?"
  n·2^{n−1} = Σ_{i=1}^{n} i·(n choose i)

18.2 Finding Joint PDFs
  fX,Y(x, y) = fX(x)·fY|X(y | x)

19 Random Vectors
A random vector X = (X^(1), ..., X^(d))ᵀ of dimension d × 1 is a vector-valued function from a probability space Ω to ℝ^d:
  X: Ω → ℝ^d,  ω ↦ (X^(1)(ω), X^(2)(ω), ..., X^(d)(ω))ᵀ
where each X^(k) is a (scalar) random variable on Ω.
PDF of X: the joint distribution of its components X^(1), ..., X^(d).
CDF of X: ℝ^d → [0, 1],  x ↦ P(X^(1) ≤ x^(1), ..., X^(d) ≤ x^(d)).
The sequence X1, X2, ... converges in probability to X if and only if each component of the sequence converges in probability to the corresponding component X^(k) of X.
Expectation of a random vector: the expectation of a random vector is the elementwise expectation. Let X be a random vector of dimension d × 1:
  E[X] = (E[X^(1)], ..., E[X^(d)])ᵀ
The expectation of a random matrix is the expected value of each of its elements. Let X = {Xij} be an n × p random matrix; then E[X] is the n × p matrix of numbers (if they exist) with entries E[Xij].
Let X and Y be random matrices of the same dimension, and let A and B be conformable matrices of constants:
  E[X + Y] = E[X] + E[Y]
  E[AXB] = A·E[X]·B
Covariance Matrix
Let X be a random vector of dimension d × 1 with expectation µX. Matrix outer products!
  Σ = Cov(X) = E[(X − µX)(X − µX)ᵀ]
Alternative forms:
  Σ = E[XXᵀ] − E[X]E[X]ᵀ = E[XXᵀ] − µX µXᵀ
The covariance matrix Σ is a d × d matrix with entries σij. It is a table of the pairwise covariances of the elements of the random vector: the diagonal elements are the variances of the elements of the random vector, the off-diagonal elements are their covariances. Note that the covariance is commutative, e.g. σ12 = σ21.
Let the random vector X ∈ ℝ^d and A and B be conformable matrices of constants:
  Cov(AX + B) = Cov(AX) = A·Cov(X)·Aᵀ = AΣAᵀ
Every covariance matrix is positive semi-definite: Σ ⪰ 0.
Gaussian Random Vectors
A random vector X = (X^(1), ..., X^(d))ᵀ is a Gaussian vector, or multivariate Gaussian or normal variable, if any linear combination of its components is a (univariate) Gaussian variable or a constant (a "Gaussian" variable with zero variance), i.e. if αᵀX is (univariate) Gaussian or constant for any constant non-zero vector α ∈ ℝ^d.
Multivariate Gaussians
The distribution of X, the d-dimensional Gaussian or normal distribution, is completely specified by the vector mean µ = E[X] = (E[X^(1)], ..., E[X^(d)])ᵀ and the d × d covariance matrix Σ. If Σ is invertible, then the pdf of X is:
  fX(x) = 1/√((2π)^d det(Σ)) · exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)),  x ∈ ℝ^d
where det(Σ) is the determinant of Σ, which is positive when Σ is invertible.
If µ = 0 and Σ is the identity matrix, then X is called a standard normal random vector.
If the covariance matrix Σ is diagonal, the pdf factors into pdfs of univariate Gaussians, and hence the components are independent.
The linear transform of a Gaussian X ~ N_d(µ, Σ) with conformable matrices A and b is a Gaussian:
  AX + b ~ N_d(Aµ + b, AΣAᵀ)
(A numeric sketch follows at the end of this page.)
Multivariate CLT
Let X1, ..., Xn ∈ ℝ^d be independent copies of a random vector X such that E[X] = µ (d × 1 vector of expectations) and Cov(X) = Σ:
  √n (X̄n − µ) →(d) N(0, Σ)
  √n Σ^{−1/2} (X̄n − µ) →(d) N(0, I_d)
where Σ^{−1/2} is the d × d matrix such that Σ^{−1/2}Σ^{−1/2} = Σ⁻¹ and I_d is the identity matrix.
Multivariate Delta Method
Analogously to the univariate case: if √n (Zn − θ) →(d) N(0, Σ) and g: ℝ^d → ℝ is continuously differentiable at θ, then √n (g(Zn) − g(θ)) →(d) N(0, ∇g(θ)ᵀ Σ ∇g(θ)), as used in 6.2.

20 Algebra
Absolute value inequalities:
  |f(x)| < a ⇒ −a < f(x) < a
  |f(x)| > a ⇒ f(x) > a or f(x) < −a
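A small numeric sketch for 19, not part of the original sheet: sampling a Gaussian random vector, estimating its covariance matrix, and checking the transformation rule Cov(AX + b) = AΣAᵀ empirically. The mean, covariance and transform are made-up values.

```python
# Illustrative sketch (assumed data): Gaussian random vector, empirical covariance,
# and the linear-transform rule AX + b ~ N(A mu + b, A Sigma A^T) from Section 19.
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=100_000)   # rows are iid copies of X
print(np.cov(X, rowvar=False))                          # close to Sigma

A = np.array([[1.0, 1.0],
              [2.0, -1.0]])
b = np.array([0.5, 0.0])
Y = X @ A.T + b                                          # Y = AX + b, applied row-wise
print(np.cov(Y, rowvar=False))                           # close to A Sigma A^T
print(A @ Sigma @ A.T)
```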
Capstone-Cheatsheet Statistics 1
by Blechturm, Page 4 of 4

21 Matrix algebra
  ‖Ax‖² = (Ax)ᵀ(Ax) = xᵀAᵀAx

22 Calculus
Differentiation under the integral sign:
  d/dx ∫_{a(x)}^{b(x)} f(x, t) dt = f(x, b(x))·b′(x) − f(x, a(x))·a′(x) + ∫_{a(x)}^{b(x)} ∂f/∂x (x, t) dt
Concavity in 1 dimension
If g: I → ℝ is twice differentiable on the interval I:
  concave: if and only if g″(x) ≤ 0 for all x ∈ I
  strictly concave: if g″(x) < 0 for all x ∈ I
  convex: if and only if g″(x) ≥ 0 for all x ∈ I
  strictly convex: if g″(x) > 0 for all x ∈ I
Multivariate Calculus
The gradient ∇ of a twice differentiable function f is defined as:
  ∇f: ℝ^d → ℝ^d,  θ = (θ1, ..., θd)ᵀ ↦ (∂f/∂θ1, ..., ∂f/∂θd)ᵀ
Hessian
The Hessian of f is the symmetric matrix of second partial derivatives of f:
  Hf(θ) = ∇²f(θ) = [∂²f/(∂θi ∂θj)(θ)]_{i,j=1,...,d} ∈ ℝ^{d×d}
A symmetric (real-valued) d × d matrix A is:
  Positive semi-definite: xᵀAx ≥ 0 for all x ∈ ℝ^d.
  Positive definite: xᵀAx > 0 for all non-zero vectors x ∈ ℝ^d.
  Negative semi-definite (resp. negative definite): xᵀAx ≤ 0 for all x ∈ ℝ^d (resp. xᵀAx < 0 for all x ∈ ℝ^d − {0}).
Positive (or negative) definiteness implies positive (or negative) semi-definiteness.
If the Hessian is positive definite at a, then f attains a local minimum at a (convex).
If the Hessian is negative definite at a, then f attains a local maximum at a (concave).
If the Hessian has both positive and negative eigenvalues, then a is a saddle point for f. (A small numeric check follows below.)

23 Covariance Matrix
Let X be a random vector of dimension d × 1 with expectation µX. Matrix outer products!
  Σ = E[(X − µX)(X − µX)ᵀ] = E[XXᵀ] − E[X]E[X]ᵀ = E[XXᵀ] − µX µXᵀ
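A short numeric check tying 22 and 23 together, not from the original sheet: classifying a symmetric matrix by the signs of its eigenvalues and confirming that an empirical covariance matrix is positive semi-definite. The example Hessian and the data are invented.

```python
# Illustrative sketch (assumed data): classify symmetric matrices via eigenvalues
# (Section 22) and confirm a covariance matrix is positive semi-definite (Section 23).
import numpy as np

def classify(A, tol=1e-10):
    """Return a definiteness label for a symmetric matrix A."""
    eig = np.linalg.eigvalsh(A)          # real eigenvalues of a symmetric matrix
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig >= -tol):
        return "positive semi-definite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig <= tol):
        return "negative semi-definite"
    return "indefinite (saddle point if A is a Hessian)"

hessian = np.array([[2.0, 0.0], [0.0, -3.0]])   # example Hessian with mixed eigenvalue signs
print(classify(hessian))                         # indefinite -> saddle point

rng = np.random.default_rng(4)
data = rng.normal(size=(500, 3))                 # hypothetical data matrix
print(classify(np.cov(data, rowvar=False)))      # covariance matrix: positive (semi-)definite
```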
