Statistical Inference Notes Melon University
1 Statistical Inference
A central concern of statistics and machine learning is to estimate things about some underlying population on the basis of samples. Formally, given a sample
$$X_1, \ldots, X_n \sim F,$$
we would like to estimate the distribution $F$, or some aspect of it.
To make meaningful inferences about $F$ from samples we typically restrict $F$ in some natural way. A statistical model is a set of distributions $\mathcal{F}$. Broadly, there are two possibilities: parametric models, which can be indexed by a finite-dimensional parameter, and nonparametric models, which cannot. Some examples of parametric models:
(a) A Gaussian model: This is a simple two parameter model. Here we suppose that:
$$\mathcal{F} = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) : \mu \in \mathbb{R}, \ \sigma > 0 \right\}.$$
(b) A Bernoulli model: Here we suppose that:
$$\mathcal{F} = \left\{ p_\theta(x) = \theta^x (1 - \theta)^{1 - x} : 0 \le \theta \le 1 \right\}.$$
Nonparametric models, in contrast, cannot be indexed by a finite-dimensional parameter. Some examples:
(a) Estimating the CDF: Here the model consists of any valid CDF, i.e. a function that is between 0 and 1, is monotonically increasing, right-continuous, and equal to 0 at $-\infty$ and 1 at $\infty$. We are given samples $X_1, \ldots, X_n \sim F$ and the goal is to estimate $F$ (a small sketch of a natural estimator appears after these examples).
(b) Density estimation: In density estimation, we are given samples $X_1, \ldots, X_n \sim f_X$, where $f_X$ is an unknown density that we would like to estimate. It turns out that
the class of all possible densities is too big for this problem to be well posed so
we need to assume some smoothness on the density. A typical assumption is that
the model is given by:
$$\mathcal{F} = \left\{ f : \int (f''(x))^2 \, dx < \infty, \ \int f(x) \, dx = 1, \ f(x) \ge 0 \right\}.$$
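As a concrete illustration of problem (a) above (a minimal sketch, not part of the original notes, assuming Python with NumPy and simulated standard normal data), the empirical CDF $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(X_i \le x)$ is a natural estimator of $F$:

```python
import numpy as np

def empirical_cdf(samples, x):
    """Evaluate the empirical CDF F_n(x) = (1/n) * #{i : X_i <= x} at the points x."""
    samples = np.asarray(samples)
    x = np.atleast_1d(x)
    return np.mean(samples[:, None] <= x[None, :], axis=0)

# Illustrative setup: the true F is the standard normal CDF (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(empirical_cdf(X, [-1.0, 0.0, 1.0]))  # roughly Phi(-1), Phi(0), Phi(1) = 0.16, 0.50, 0.84
```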
2 Point Estimation
Point estimation in statistics refers to calculating a single “best guess” of the value of an unknown quantity of interest. The quantity of interest could be a parameter or, for instance, a density function. Typically, we will use $\hat{\theta}$ or $\hat{\theta}_n$ to denote a point estimator. A point estimator is a function of the data $X_1, \ldots, X_n$:
$$\hat{\theta}_n = g(X_1, \ldots, X_n).$$
The bias of an estimator is defined as:
$$b(\hat{\theta}_n) = E_\theta(\hat{\theta}_n) - \theta,$$
and its variance as:
$$v(\hat{\theta}_n) = E_\theta(\hat{\theta}_n - \overline{\theta}_n)^2,$$
where $\overline{\theta}_n = E_\theta(\hat{\theta}_n)$. The standard error is defined to be $\mathrm{se} = \sqrt{v(\hat{\theta}_n)}$.
In the olden days, there was a lot of emphasis on unbiased estimators, and the goal was to find unbiased estimators with small (or minimal) variance. In modern statistics, we often use biased estimators because the reduction in variance often justifies the bias.
We call an estimator of a parameter consistent if the estimator converges to the true parameter in probability, i.e. for any $\epsilon > 0$:
$$P_\theta(|\hat{\theta}_n - \theta| \ge \epsilon) \to 0,$$
as $n \to \infty$. In other words, $\hat{\theta}_n \xrightarrow{P} \theta$, or $\hat{\theta}_n - \theta = o_P(1)$.
One way to measure the quality of an estimator is via its mean squared error:
$$\mathrm{MSE} = E_\theta(\theta - \hat{\theta}_n)^2.$$
The MSE can be decomposed as the sum of the squared bias and the variance, i.e.:
$$\mathrm{MSE} = E_\theta(\theta - \hat{\theta}_n)^2 = E_\theta(\theta - \overline{\theta}_n + \overline{\theta}_n - \hat{\theta}_n)^2 = b(\hat{\theta}_n)^2 + v(\hat{\theta}_n),$$
where the cross term vanishes because $\theta - \overline{\theta}_n$ is non-random and $E_\theta(\overline{\theta}_n - \hat{\theta}_n) = 0$.
A simple consequence of this decomposition is: if $b(\hat{\theta}_n) \to 0$ and $v(\hat{\theta}_n) \to 0$ then $\hat{\theta}_n \xrightarrow{qm} \theta$ and hence $\hat{\theta}_n \xrightarrow{P} \theta$.
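To see concretely both the decomposition above and the earlier remark that a biased estimator can have smaller mean squared error, here is a minimal simulation sketch (not part of the original notes; the Gaussian setup with $\mu = 0.2$, $\sigma = 1$, $n = 20$ and the shrinkage factor $0.5$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 0.2, 1.0, 20, 100_000   # assumed true mean, sd, sample size, repetitions

X_bar = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)   # unbiased sample mean
shrunk = 0.5 * X_bar                                           # biased: shrink toward zero

for name, est in [("sample mean", X_bar), ("0.5 * sample mean", shrunk)]:
    bias = est.mean() - mu
    var = est.var()
    mse = np.mean((est - mu) ** 2)
    # MSE should match bias^2 + variance, up to Monte Carlo error.
    print(f"{name:>17s}: bias^2 + var = {bias**2 + var:.5f}   MSE = {mse:.5f}")

# The shrunk estimator wins here only because mu is small relative to sigma / sqrt(n);
# the point is that trading bias for variance can reduce the MSE.
```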
3 An Example
Example: Suppose that $X_1, \ldots, X_n \sim \mathrm{Ber}(p)$ and consider the estimator $\hat{p}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. What is the bias of this estimator? What is its variance? Is the estimator consistent?
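As a quick empirical check of these questions (a minimal simulation sketch, not part of the original notes; the true value $p = 0.3$ and the tolerance $\epsilon = 0.05$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
p, eps, trials = 0.3, 0.05, 5000   # assumed true parameter, tolerance, Monte Carlo repetitions

for n in [10, 100, 1000]:
    # Each row is one repetition of the experiment; each repetition yields one estimate p_hat.
    p_hat = rng.binomial(1, p, size=(trials, n)).mean(axis=1)
    bias = p_hat.mean() - p                    # close to 0: the estimator is unbiased
    var = p_hat.var()                          # close to p(1 - p) / n
    miss = np.mean(np.abs(p_hat - p) >= eps)   # P(|p_hat - p| >= eps) shrinks with n: consistency
    print(f"n={n:5d}  bias={bias:+.4f}  var={var:.5f}  p(1-p)/n={p * (1 - p) / n:.5f}  miss={miss:.3f}")
```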
4 Asymptotic Normality
Often estimators that we study will have an asymptotically normal distribution. This means that:
$$\frac{\hat{\theta}_n - \theta}{\mathrm{se}} \rightsquigarrow N(0, 1).$$
We will refer to this property as asymptotic normality.
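For instance (a minimal simulation sketch, not part of the original notes, reusing the Bernoulli example with an assumed $p = 0.3$ and $n = 500$), we can standardize $\hat{p}_n$ by its standard error and check that the result looks like $N(0, 1)$:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, trials = 0.3, 500, 10_000          # assumed true parameter, sample size, repetitions

p_hat = rng.binomial(1, p, size=(trials, n)).mean(axis=1)
se = np.sqrt(p * (1 - p) / n)            # standard error of p_hat
z = (p_hat - p) / se                     # should be approximately N(0, 1)

print("mean:", round(z.mean(), 3), " std:", round(z.std(), 3))
print("fraction in [-1.96, 1.96]:", np.mean(np.abs(z) <= 1.96))  # should be close to 0.95
```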
5 Confidence Sets
In general, for a parameter $\theta$ we define a $1 - \alpha$ confidence set $C_n$ to be any random set which has the property that:
$$P_\theta(\theta \in C_n) \ge 1 - \alpha$$
for all $\theta$. We refer to $P_\theta(\theta \in C_n)$ as the coverage of the confidence set $C_n$. The confidence set $C_n$ is a random set (and $\theta$ is a fixed parameter).
One can think about the coverage guarantee in the following way:
You repeat the experiment many times, each time constructing a confidence set $C_n$ (a different set each time, since it depends on the data). Then a fraction $1 - \alpha$ of these sets will contain the corresponding true parameter. Notice that the true parameter does not have to be fixed, so in some sense the experiment you conduct can be different each time.
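To make the repeated-experiment interpretation concrete, here is a minimal simulation sketch (not part of the original notes; the setting, a Gaussian mean with known standard deviation, and the values $\mu = 5$, $\sigma = 2$, $n = 50$, $\alpha = 0.05$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, alpha, trials = 5.0, 2.0, 50, 0.05, 10_000
z = 1.96                                       # Phi^{-1}(1 - alpha/2) for alpha = 0.05

covered = 0
for _ in range(trials):
    X = rng.normal(mu, sigma, size=n)          # one repetition of the experiment
    half_width = z * sigma / np.sqrt(n)        # known-variance interval for the mean
    covered += (X.mean() - half_width <= mu <= X.mean() + half_width)

print("empirical coverage:", covered / trials)  # should be close to 1 - alpha = 0.95
```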
We already saw a way to construct confidence intervals for a Bernoulli parameter using
Hoeffding’s inequality. More generally, we can always use concentration inequalities to con-
struct confidence intervals. These confidence intervals are often loose and we instead resort
to approximate (asymptotic) confidence intervals.
In many cases, the quantity $(\hat{\theta}_n - \theta)/\sqrt{v(\hat{\theta}_n)}$ is asymptotically $N(0, 1)$. In these cases we have that $\hat{\theta}_n \approx N(\theta, v(\hat{\theta}_n))$. Define $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$. Then we would construct the confidence interval:
$$C_n = \left( \hat{\theta}_n - z_{\alpha/2} \sqrt{v(\hat{\theta}_n)}, \ \hat{\theta}_n + z_{\alpha/2} \sqrt{v(\hat{\theta}_n)} \right).$$
This interval is asymptotically valid, i.e.:
$$P_\theta(\theta \in C_n) \to 1 - \alpha,$$
as $n \to \infty$.
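In code, the construction is direct (a minimal sketch, not part of the original notes; the function name and the example numbers are illustrative):

```python
from scipy.stats import norm

def asymptotic_ci(theta_hat, se_hat, alpha=0.05):
    """Normal-approximation interval: theta_hat +/- z_{alpha/2} * (estimated) standard error."""
    z = norm.ppf(1 - alpha / 2)
    return theta_hat - z * se_hat, theta_hat + z * se_hat

# Illustrative numbers: an estimate of 0.42 with estimated standard error 0.05.
print(asymptotic_ci(0.42, 0.05))
```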
Example: Let us return to the Bernoulli example, with $\hat{p}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$.
1. We previously constructed confidence sets using Hoeffding’s inequality. They took the form:
$$C_n = \left( \hat{p}_n - \sqrt{\frac{\log(2/\alpha)}{2n}}, \ \hat{p}_n + \sqrt{\frac{\log(2/\alpha)}{2n}} \right).$$
2. If we instead use the normal approximation: we first note that the variance of our estimator is:
$$v(\hat{\theta}_n) = \frac{p(1 - p)}{n}.$$
However, this variance depends on the unknown parameter $p$, so we cannot use it directly to create our confidence set; we instead estimate the variance as:
$$\hat{v}(\hat{\theta}_n) = \frac{\hat{p}_n(1 - \hat{p}_n)}{n}.$$
With this we would use the confidence interval:
$$C_n = \left( \hat{p}_n - z_{\alpha/2} \sqrt{\hat{v}(\hat{\theta}_n)}, \ \hat{p}_n + z_{\alpha/2} \sqrt{\hat{v}(\hat{\theta}_n)} \right).$$
It is easy to verify that this interval is always shorter than the Hoeffding interval, but it is only asymptotically correct; the sketch below compares the two numerically.
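The following minimal sketch (not part of the original notes; the values $p = 0.3$, $n = 100$, $\alpha = 0.05$ are illustrative assumptions) computes both intervals on one simulated data set:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
p, n, alpha = 0.3, 100, 0.05                    # assumed true parameter, sample size, level

X = rng.binomial(1, p, size=n)
p_hat = X.mean()
z = norm.ppf(1 - alpha / 2)                     # z_{alpha/2}

h_hoeff = np.sqrt(np.log(2 / alpha) / (2 * n))  # Hoeffding half-width: valid for every n
h_wald = z * np.sqrt(p_hat * (1 - p_hat) / n)   # normal-approximation half-width: asymptotic

print(f"p_hat = {p_hat:.3f}")
print(f"Hoeffding interval:     ({p_hat - h_hoeff:.3f}, {p_hat + h_hoeff:.3f}), half-width {h_hoeff:.3f}")
print(f"Normal-approx interval: ({p_hat - h_wald:.3f}, {p_hat + h_wald:.3f}), half-width {h_wald:.3f}")
```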
6 Hypothesis testing
Typically, the way that statistical hypothesis testing proceeds is by defining a so-called null
hypothesis. We then collect data, and typically the question we ask is whether the data
provides enough evidence to reject the null hypothesis.
Example: Suppose $X_1, \ldots, X_n \sim \mathrm{Ber}(p)$, and we want to test whether the coin is fair. In this case the null hypothesis would be:
$$H_0 : p = 1/2.$$
We typically also specify an alternative hypothesis. In this case, the alternative hypothesis is:
$$H_1 : p \ne 1/2.$$
Typically, hypothesis testing proceeds by defining a test statistic. In this case, a natural test statistic might be:
$$T = \left| \frac{1}{n} \sum_{i=1}^{n} X_i - p \right|,$$
where $p = 1/2$ is the value specified by the null hypothesis. It might make sense to reject the null hypothesis if $T$ is large. We will be more precise about this later on, particularly by defining the different types of errors, and how to set the threshold for $T$.
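As a preview of how such a threshold might be chosen, here is a minimal simulation sketch (not part of the original notes; the sample size $n = 100$ and the 95th-percentile cutoff are illustrative assumptions) of the distribution of $T$ under the null hypothesis $p = 1/2$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 100, 20_000                     # assumed sample size and number of null simulations

# Simulate T = |mean(X) - 1/2| when the null hypothesis is true (p = 1/2).
T_null = np.abs(rng.binomial(1, 0.5, size=(trials, n)).mean(axis=1) - 0.5)

# One illustrative choice of threshold: the 95th percentile of T under the null.
# Observing a larger T would then be surprising if the coin really were fair.
print("95th percentile of T under H0:", np.quantile(T_null, 0.95))
```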