LECTURE NOTES 8

1 Statistical Inference

A central concern of statistics and machine learning is to estimate things about some underlying population on the basis of samples. Formally, given a sample,

X1 , . . . , Xn ∼ F,

what can we infer about F ?

To make meaningful inferences about F from samples we typically restrict F in some natural
way. A statistical model is a set of distributions F. Broadly, there are two possibilities:

1. Parametric model: In a parametric model, the set of possible distributions F can


be described by a finite number of parameters. Here are a few examples:

(a) A Gaussian model: This is a simple two parameter model. Here we suppose that:

F = { f(x; µ, σ) = (1/(σ√(2π))) exp(−(x − µ)^2/(2σ^2)) : µ ∈ R, σ > 0 }.

(b) The Bernoulli model: This is a one parameter model where:

F = { pθ(x) = θ^x (1 − θ)^(1−x) : 0 ≤ θ ≤ 1 }.


2. Non-parametric model: A non-parametric model is one where F cannot be


parameterized by a finite number of parameters. Here are a few popular examples:

(a) Estimating the CDF: Here the model consists of any valid CDF, i.e. a function
that is between 0 and 1, non-decreasing, right-continuous, and tends to 0 at
−∞ and to 1 at ∞. We are given samples X1, . . . , Xn ∼ F and the goal is
to estimate F.
(b) Density estimation: In density estimation, we are given samples X1 , . . . , Xn ∼ fX ,
where fX is an unknown density that we would like to estimate. It turns out that
the class of all possible densities is too big for this problem to be well posed so
we need to assume some smoothness on the density. A typical assumption is that
the model is given by:

F = { f : ∫ (f''(x))^2 dx < ∞, ∫ f(x) dx = 1, f(x) ≥ 0 }.
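
To make these two non-parametric examples concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available; the generated sample, the grid, and the Gaussian data-generating distribution are purely illustrative). The empirical CDF estimates F directly, and a kernel density estimate is included as one common smoothed density estimator; the smoothness class above does not single out this particular estimator.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative sample from a distribution we pretend is unknown.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

# (a) Empirical CDF: F_hat(x) = fraction of samples <= x.
def ecdf(x, data):
    return np.mean(data[:, None] <= x, axis=0)

grid = np.linspace(-3.0, 3.0, 7)
print(ecdf(grid, samples))

# (b) A kernel density estimate, one common smoothed estimator of the density.
kde = gaussian_kde(samples)
print(kde(grid))
```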

2 Point Estimation

Point estimation in statistics refers to calculating a single “best guess” of the value of an
unknown quantity of interest. The quantity of interest could be a parameter or, for instance,
a density function. Typically, we will use θ̂ or θ̂n to denote a point estimator. A point
estimator is a function of the data X1, . . . , Xn:

θ̂n = g(X1, . . . , Xn),

so that θ̂n is a random variable. The bias of an estimator is written as:

b(θ̂n) = Eθ(θ̂n) − θ.

Similarly, the variance of an estimator is given by:

v(θ̂n) = Eθ(θ̂n − θn)^2,

where θn = Eθ(θ̂n). The standard error is defined to be se = √(v(θ̂n)).

In the olden days, there was a lot of emphasis on unbiased estimators, and we wanted to find
unbiased estimators with small (or minimal) variance. In modern statistics, we often use
biased estimators because the reduction in variance often justifies the bias.

We call an estimator of a parameter consistent if the estimator converges to the true parameter
in probability, i.e. for any ε > 0:

Pθ(|θ̂n − θ| ≥ ε) → 0,

as n → ∞. In other words, θ̂n →P θ, or θ̂n − θ = oP(1).

3 The Bias-Variance decomposition

One way to measure the quality of an estimator is via its mean squared error:

MSE = Eθ(θ − θ̂n)^2.

The MSE can be decomposed as the sum of the squared bias and variance, i.e.:

MSE = Eθ(θ − θ̂n)^2
    = Eθ(θ − θn + θn − θ̂n)^2
    = (θ − θn)^2 + Eθ(θn − θ̂n)^2
    = b(θ̂n)^2 + v(θ̂n),

where the cross term 2(θ − θn) Eθ(θn − θ̂n) vanishes because Eθ(θ̂n) = θn, and (θ − θn)^2 = b(θ̂n)^2.

A simple consequence of this decomposition is: if b(θ̂n) → 0 and v(θ̂n) → 0, then θ̂n →qm θ (convergence in quadratic mean) and hence θ̂n →P θ.
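
As a quick numerical sanity check of this decomposition (not part of the notes: the true mean, sample size, and the deliberately biased estimator 0.9 · X̄ are arbitrary illustrative choices), one can simulate many datasets and compare the empirical MSE to the empirical squared bias plus variance:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, trials = 2.0, 50, 100_000   # illustrative values, not from the notes

# Draw many datasets and apply a deliberately biased estimator: 0.9 * sample mean.
X = rng.normal(loc=theta, scale=1.0, size=(trials, n))
est = 0.9 * X.mean(axis=1)

bias = est.mean() - theta
var = est.var()
mse = np.mean((est - theta) ** 2)
print(mse, bias ** 2 + var)   # the two numbers agree up to simulation noise
```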

Example: Suppose X1, . . . , Xn ∼ Ber(p), and our estimator is:

p̂n = (1/n) Σ_{i=1}^{n} Xi.

What is the bias of this estimator? What is its variance? Is the estimator consistent?
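
A small simulation sketch suggests the answers: p̂n is unbiased, its variance matches p(1 − p)/n (so it shrinks to 0), and Pp(|p̂n − p| ≥ ε) goes to 0, so the estimator is consistent. The particular p, sample sizes, and trial count below are illustrative, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)
p, trials, eps = 0.3, 10_000, 0.05   # illustrative values

for n in (10, 100, 1000):
    X = rng.binomial(1, p, size=(trials, n))
    p_hat = X.mean(axis=1)
    bias = p_hat.mean() - p                    # close to 0: the estimator is unbiased
    var = p_hat.var()                          # close to p * (1 - p) / n
    miss = np.mean(np.abs(p_hat - p) >= eps)   # P(|p_hat - p| >= eps), shrinks with n
    print(n, bias, var, p * (1 - p) / n, miss)
```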

4 Asymptotic Normality

Often the estimators that we study will have an asymptotically normal distribution. This means
that (θ̂n − θ)/se converges in distribution to N(0, 1). We will refer to this property as asymptotic normality.
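
For example, for the Bernoulli sample mean with se = √(p(1 − p)/n), one can check this claim empirically. The sketch below (illustrative only; p, n, and the trial count are arbitrary choices) compares quantiles of the standardized estimator with the corresponding N(0, 1) quantiles:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
p, n, trials = 0.3, 200, 50_000   # illustrative values

X = rng.binomial(1, p, size=(trials, n))
p_hat = X.mean(axis=1)
se = np.sqrt(p * (1 - p) / n)
Z = (p_hat - p) / se

# Compare a few quantiles of the standardized estimator with standard normal quantiles.
qs = [0.025, 0.5, 0.975]
print(np.quantile(Z, qs))   # simulated quantiles
print(norm.ppf(qs))         # approximately [-1.96, 0.0, 1.96]
```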

5 Confidence Sets

In general, for a parameter θ we define a 1 − α confidence set Cn to be any random set which
has the property that:
Pθ (θ ∈ Cn ) ≥ 1 − α
for all θ. We refer to Pθ (θ ∈ Cn ) as the coverage of the confidence set Cn . The confidence
set Cn is a random set (and θ is a fixed parameter).

One can think about the coverage guarantee in the following way:

You repeat the experiment many times, each time constructing a different confidence interval
Cn. Then a fraction 1 − α of these sets will contain the corresponding true parameter. Notice
that the true parameter does not have to be fixed, so in some sense the experiment you
conduct can be different each time.

We already saw a way to construct confidence intervals for a Bernoulli parameter using
Hoeffding’s inequality. More generally, we can always use concentration inequalities to construct
confidence intervals. These intervals are often loose, however, so we instead resort to
approximate (asymptotic) confidence intervals.

It is often the case that:

(θ̂n − θ)/√(v(θ̂n))

is asymptotically N(0, 1). In these cases we have that θ̂n ≈ N(θ, v(θ̂n)). Define zα/2 =
Φ^(−1)(1 − α/2). Then we would construct the confidence interval:

Cn = ( θ̂n − zα/2 √(v(θ̂n)), θ̂n + zα/2 √(v(θ̂n)) ).

We now need to verify that:

Pθ(θ ∈ Cn) → 1 − α,

as n → ∞, which is what it means to be an asymptotic confidence interval.

Pθ(θ ∈ Cn) = Pθ( θ̂n − zα/2 √(v(θ̂n)) ≤ θ ≤ θ̂n + zα/2 √(v(θ̂n)) )
           = Pθ( −zα/2 ≤ (θ̂n − θ)/√(v(θ̂n)) ≤ zα/2 )
           → P(−zα/2 ≤ Z ≤ zα/2) = 1 − α,

where Z ∼ N(0, 1).
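
The coverage statement can also be checked by simulation. The sketch below is illustrative only: it takes θ̂n to be the sample mean of an exponential distribution and plugs in the sample standard error, choices that are not taken from the notes, then repeats the experiment many times and reports how often the interval contains the true mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
theta, n, trials, alpha = 1.0, 200, 50_000, 0.05   # illustrative values
z = norm.ppf(1 - alpha / 2)

X = rng.exponential(scale=theta, size=(trials, n))   # true mean is theta
theta_hat = X.mean(axis=1)
se_hat = X.std(axis=1, ddof=1) / np.sqrt(n)          # plug-in standard error

covered = (theta_hat - z * se_hat <= theta) & (theta <= theta_hat + z * se_hat)
print(covered.mean())   # close to 1 - alpha = 0.95
```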

Example: Bernoulli confidence sets:

1. We previously constructed confidence sets using Hoeffding’s inequality. They took the
form:

Cn = ( p̂n − √(log(2/α)/(2n)), p̂n + √(log(2/α)/(2n)) ).

2. If we instead use the normal approximation: we first note that the variance of our
estimator is:

v(p̂n) = p(1 − p)/n.

However, we cannot use this variance directly to create our confidence set, since it
depends on the unknown p, so we instead estimate the variance as:

v̂(p̂n) = p̂n(1 − p̂n)/n.

With this we would use the confidence interval:

Cn = ( p̂n − zα/2 √(v̂(p̂n)), p̂n + zα/2 √(v̂(p̂n)) ).

It is easy to verify that this interval is always shorter than the Hoeffding interval but
it is only asymptotically correct.
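
Computing both intervals for a single simulated sample makes the comparison concrete; the sketch below is illustrative (the values of p, n, and α are arbitrary) and simply prints the two intervals so their lengths can be compared.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
p, n, alpha = 0.3, 200, 0.05   # illustrative values

X = rng.binomial(1, p, size=n)
p_hat = X.mean()

# 1. Hoeffding interval: half-width sqrt(log(2 / alpha) / (2 n)), valid for every n.
hoeff = np.sqrt(np.log(2 / alpha) / (2 * n))

# 2. Normal-approximation interval: half-width z_{alpha/2} * sqrt(p_hat (1 - p_hat) / n).
z = norm.ppf(1 - alpha / 2)
wald = z * np.sqrt(p_hat * (1 - p_hat) / n)

print(p_hat - hoeff, p_hat + hoeff)   # wider, but non-asymptotic
print(p_hat - wald, p_hat + wald)     # shorter, but only asymptotically correct
```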

6 Hypothesis testing

Typically, the way that statistical hypothesis testing proceeds is by defining a so-called null
hypothesis. We then collect data, and typically the question we ask is whether the data
provides enough evidence to reject the null hypothesis.

Example: Suppose X1 , . . . , Xn ∼ Ber(p), and we want to test if the coin is fair. In this
case the null hypothesis would be:

H0 : p = 1/2.

We typically also specify an alternative hypothesis. In this case, the alternative hypothesis
is:

H1 : p ≠ 1/2.

Typically, hypothesis testing proceeds by defining a test statistic. In this case, a natural
statistic might be:

T = | (1/n) Σ_{i=1}^{n} Xi − 1/2 |.

It might make sense to reject the null hypothesis if T is large. We will be more precise
about this later on, particularly by defining the different types of errors and how to set the
threshold for T.
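
A minimal sketch of computing T on simulated data follows. The rejection threshold shown uses the normal approximation under H0 and is just one natural choice, anticipating the more careful treatment of error types and thresholds promised above; the sample size, α, and the data-generating p are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
n, alpha = 200, 0.05                 # illustrative values

X = rng.binomial(1, 0.5, size=n)     # data generated under the null p = 1/2
T = abs(X.mean() - 0.5)

# Under H0, (p_hat - 1/2) / sqrt(0.25 / n) is approximately N(0, 1), so one natural
# rule is to reject when T exceeds z_{alpha/2} * sqrt(0.25 / n).
threshold = norm.ppf(1 - alpha / 2) * np.sqrt(0.25 / n)
print(T, threshold, T > threshold)
```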
