
CS 215

Data Analysis and Interpretation


Estimation
Suyash P. Awate
Sample
• Definition:
If random variables X1, …, XN are i.i.d.,
then they constitute a random sample of size N from the common
distribution
• N = “sample size”
• One set of observed data is one instance/realization of the sample
• i.e., {x1, …, xN}
• The common distribution from which data was “drawn” is usually
unknown
Statistic
• Definition:
Let X1, …, XN denote a sample associated with random variable X
(i.e., all of X1, …, XN have the same distribution as X).
Let T(X1, …, XN) be a function of the sample.
Then, the random variable T is called a statistic.
• For the drawn sample {x1, …, xN},
the value t := T(x1, …, xN) is an instance of the statistic
Model
• Statistical model
• Typically, a probabilistic description of real-world phenomena
• Description involves a distribution that may involve some parameters
• e.g., P(X; θ)
• Describes/represents a data-generation process
• Designed by people
• Unlike data that is observed/measured/acquired
• Nature doesn’t generate models
Estimation
• Estimation theory
• A branch of statistics that deals with estimating the values of parameters
(underlying a statistical model) based on measured/empirical data
• While data generation starts with parameters and leads to data,
estimation starts with data and leads to parameters
• Estimation problem
• Given: Data
• Assumption: Data was generated from a parametric family of distributions
(i.e., a family of models)
• Goal: To infer the distribution parameters
(i.e., the distribution/model instance from the family of distributions/models)
that the data was generated from
Estimator, Estimate
• Estimator
• A deterministic (not stochastic) rule/formula/function/algorithm
for calculating/computing an estimate of a given quantity
(e.g., a parameter value)
based on observed data
• Sometimes the estimator is obtained as a closed-form expression
• But not always
• An estimator T(X1, …, XN) is also a statistic
• Estimate
• A value resulting from applying the estimator to data
Estimator Mean, Variance, Bias
• Let X1, …, XN be a sample on a random variable X with PDF/PMF P(X; θ)
• Let T(X1, …, XN) be an estimator for a parameter whose true value is θ
• Mean of the estimator (definition):
Expected value of T, i.e., E[T]
• Bias of the estimator (definition)
Bias(T) := E[T] – θ
• Unbiased estimator (definition)
is one where Bias(T) = 0, i.e., E[T] = θ
• Variance of the estimator (definition)
Var(T) := E[(T – E[T])2]
• Mean squared error (MSE) of the estimator (definition)
• Expected value of the squared error, MSE(T) := E[(T – θ)²] (a small simulation sketch follows this list)
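A minimal Monte Carlo sketch of these definitions (not from the slides; it assumes NumPy, and the Gaussian model, sample size, and seed are illustrative choices): it repeatedly draws samples, applies the sample-mean estimator, and reports the empirical mean, bias, variance, and MSE.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0           # true parameter (here: the mean of a Gaussian)
N, trials = 25, 100_000    # sample size and number of Monte Carlo repetitions

# Estimator T(X1, ..., XN): the sample mean, applied to each simulated sample.
samples = rng.normal(loc=theta_true, scale=1.0, size=(trials, N))
T = samples.mean(axis=1)

mean_T = T.mean()                        # E[T]
bias_T = mean_T - theta_true             # Bias(T) = E[T] - theta
var_T = T.var()                          # Var(T) = E[(T - E[T])^2]
mse_T = np.mean((T - theta_true) ** 2)   # MSE(T) = E[(T - theta)^2]

print(f"mean {mean_T:.4f}  bias {bias_T:.4f}  var {var_T:.4f}  mse {mse_T:.4f}")
# The printed MSE should be close to var + bias^2 (next slide's decomposition).
```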
Estimator MSE, Bias, Variance
• MSE(T) := E[(T – θ)²]
= E[(T – E[T] + E[T] – θ)²]
= E[(T – E[T])²] + E[(E[T] – θ)²] + E[2(T – E[T])(E[T] – θ)]
= Var(T) + (Bias(T))² + 0
• So, MSE = Variance + Bias²
• Bias-variance decomposition/“tradeoff”:
• If two estimators T1 and T2 have the same MSE,
then if one estimator (say, T1) has a smaller bias magnitude,
it (i.e., T1) also has a larger variance
Estimator Mean, Variance, Bias
• Let X1, …, XN be a sample on a random variable X with PDF/PMF P(X; θ)
• Let T(X1, …, XN) be an estimator for a parameter whose true value is θ
• Consistent estimator (definition)
• Estimator TN = T(X1, …, XN) is consistent if ∀ε > 0, lim_{N→∞} P(|TN – θ| ≥ ε) = 0
• Thus, TN is said to “converge in probability” to θ (a small simulation sketch follows below)

Law of large numbers: For all ε > 0, as n→∞, P(|X̄n – μ| ≥ ε) → 0
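A small sketch of this definition of consistency (assumptions: NumPy is available; the Bernoulli(0.3) model, ε = 0.05, and the trial counts are illustrative choices): it estimates P(|TN – θ| ≥ ε) empirically for growing N and shows it shrinking toward 0.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, eps, trials = 0.3, 0.05, 2_000

for N in (10, 100, 1_000, 10_000):
    # T_N = sample mean of N Bernoulli(theta) draws, repeated `trials` times.
    T_N = rng.binomial(1, theta, size=(trials, N)).mean(axis=1)
    prob = np.mean(np.abs(T_N - theta) >= eps)   # empirical P(|T_N - theta| >= eps)
    print(N, prob)                               # should decrease toward 0 as N grows
```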


Likelihood Function
• Let X1, …, XN be a sample on a random variable X with PDF/PMF P(X; θ)
• Definition: Likelihood function L(θ; X1, …, XN) := ∏_{i=1}^{N} P(Xi; θ) (an evaluation sketch follows this slide)
• We want to use the likelihood function to estimate θ from the sample
• Sometimes, analysis relies on log(L(θ; X1, …, XN)),
leveraging that log(.) is strictly monotonically increasing within (0,∞)
• Some assumptions (#)
1. Different values of θ correspond to different CDFs associated with P(X; θ)
• i.e., parameter θ identifies a unique CDF
2. All PMFs/PDFs have common support for all parameters θ
• i.e., support of X cannot depend on θ
• Under these assumptions, the likelihood function has a nice property
(as discussed next)
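A small evaluation sketch of the likelihood function (assumptions: NumPy and SciPy are available; the exponential model with true scale 2.0 and the θ grid are illustrative choices, not from the slides): it evaluates log L(θ; x1, …, xN) = Σ_i log P(xi; θ) on a grid and picks the grid maximizer.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=200)   # sample drawn with "true" scale 2.0

# log L(theta; x1..xN) = sum_i log P(x_i; theta), for an exponential model with
# scale parameter theta, evaluated on a grid of candidate theta values.
thetas = np.linspace(0.5, 5.0, 200)
loglik = np.array([stats.expon.logpdf(x, scale=t).sum() for t in thetas])

print("theta maximizing the log-likelihood on the grid:", thetas[np.argmax(loglik)])
# Using log(.) does not change the maximizer, since log is monotonically increasing.
```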
Likelihood Function
• Theorem: Let θtrue be the parameter value that led to sample X1, …, XN.
Assume E_{P(X;θtrue)}[ P(X; θ) / P(X; θtrue) ] exists (e.g., it is finite). Then,
lim_{N→∞} P( L(θtrue; X1, …, XN) > L(θ; X1, …, XN) ; θtrue ) = 1, ∀θ ≠ θtrue
• Proof:
• Event L(θtrue; X1, …, XN) > L(θ; X1, …, XN) ≡ (1/N) Σ_{i=1}^{N} log[ P(Xi; θ) / P(Xi; θtrue) ] < 0
• We want to show that, as N→∞, this event (with strict inequality) has prob. 1
• By the law of large numbers (for all ε > 0, as n→∞, P(|Ȳn – μ| ≥ ε) → 0):
(1/N) Σ_{i=1}^{N} log[ P(Xi; θ) / P(Xi; θtrue) ] → E_{P(X;θtrue)}[ log( P(X; θ) / P(X; θtrue) ) ] as N→∞ (convergence in probability)
• Common support implies the probability ratio is > 0 and < ∞, so the sum and the expectation exist.
Then, log(.) is strictly concave within (0,∞). Then, Jensen’s inequality
(when g(.) is strictly concave, E_{P(X)}[g(h(X))] < g(E_{P(X)}[h(X)]))
makes the above expectation strictly < log E_{P(X;θtrue)}[ P(X; θ) / P(X; θtrue) ]
Likelihood Function
• Theorem: Let θtrue be the parameter value that led to sample X1, …, XN.
Assume E_{P(X;θtrue)}[ P(X; θ) / P(X; θtrue) ] exists (e.g., it is finite). Then,
lim_{N→∞} P( L(θtrue; X1, …, XN) > L(θ; X1, …, XN) ; θtrue ) = 1, ∀θ ≠ θtrue
• Proof:
• Consider the summation/integration underlying log E_{P(X;θtrue)}[ P(X; θ) / P(X; θtrue) ]
• The expectation sums/integrates only over the support of P(X; θtrue).
Thinking empirically, instances of x ∼ P(X; θtrue) never lie outside the support of the PMF/PDF.
The first P(X; θtrue) term indicates a PMF/PDF; the second one indicates a transformation.
• When the support of P(X; θtrue) is a superset of the support of P(X; θ),
the summation/integral underlying the expectation evaluates to 1,
and log E_{P(X;θtrue)}[ P(X; θ) / P(X; θtrue) ] = log 1 = 0
• If, ∀θ ≠ θtrue, we want the expectation to evaluate to 1,
then all PMFs/PDFs P(X; θ) need to have the same support.
Maximum Likelihood (ML) Estimation
• Definition:
An estimator T = T(X1, …, XN) is a “maximum likelihood (ML) estimator”
if T := arg maxθ L(θ; X1, …, XN)
• “arg maxθ g(θ)”: the argument (i.e., θ) that maximizes the function g(.)
• “maxθ g(θ)”: the maximum possible value of the function g(.) across all θ
• Properties of ML estimation
• Sometimes, ML estimator may not exist, or it may not be unique
• When assumptions (#) hold, and max of likelihood function exists & is unique,
then ML estimator is a consistent estimator
• When the sample size is finite, the convergence guarantee is lost
• When the sample size is finite, this behavior holds for most methods,
unless very strong assumptions (usually not holding in practice) are made on the data
• In practice, a large enough sample size takes the ML estimate T sufficiently close to
θtrue so that the ML estimate T is still useful
MLE for Bernoulli
• Let θ := probability of success
• θ must lie within [0,1]
• Likelihood function L(θ) := ∏_{i=1}^{N} θ^{Xi} (1 – θ)^{(1 – Xi)}
• ML estimate for θ is what?
• At the maximum of L(θ):
• First derivative must be zero
• This gives one equation in one unknown θ
• Second derivative must be negative
• ML estimate is the sample mean, i.e., Σ_{i=1}^{N} Xi / N (see the sketch below)
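A quick numerical check of this closed form (assumptions: NumPy; the true θ = 0.7, sample size, and grid are illustrative): it compares the sample mean against a grid maximization of the Bernoulli log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.7, size=500)      # Bernoulli data with true theta = 0.7

thetas = np.linspace(1e-3, 1 - 1e-3, 999)
# log L(theta) = sum_i [X_i log(theta) + (1 - X_i) log(1 - theta)]
loglik = x.sum() * np.log(thetas) + (len(x) - x.sum()) * np.log(1 - thetas)

print("grid argmax of log-likelihood:", thetas[np.argmax(loglik)])
print("sample mean:                  ", x.mean())   # the two should agree closely
```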
MLE for Binomial
• Let θ := probability of success; P(X=k; θ, M) = C(M,k) θ^k (1–θ)^(M–k)
• θ must lie within [0,1]
• Let M := number of Bernoulli tries for each Binomial random variable
• Let { Xi : i = 1, …, N} model repeated draws from Binomial, where
Xi models number of successes in i-th draw from Binomial
• ML estimate for θ is Σ_{i=1}^{N} Xi / (NM)
• Interpretation:
• N independent Binomial draws,
where each Binomial has M independent Bernoulli draws,
is equivalent to NM independent Bernoulli draws
• Total number of successes in the NM Bernoulli trials is Σ_{i=1}^{N} Xi
MLE for Poisson
• Parameter is the average rate of arrivals/hits λ; P(X=k; λ) = λ^k e^{–λ} / k!
• ML estimate is the sample mean Σ_{i=1}^{N} Xi / N
• Note that λ is both the mean and the variance of the Poisson random variable
• So, the sample variance can also estimate λ (see the sketch below)
• But computing the sample variance needs computing the sample mean anyway
• Also, the sample mean is an “efficient” estimator (more on this later)
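A small sketch comparing the two estimators of λ mentioned above (assumptions: NumPy; λ = 4 and the sample size are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(4)
lam_true = 4.0
x = rng.poisson(lam_true, size=10_000)

print("ML estimate (sample mean):    ", x.mean())
print("alternative (sample variance):", x.var())
# Both are close to lambda; the sample mean is the ML (and efficient) estimator.
```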
Sample-Variance Estimator
• Sample variance estimate for σ² is biased:
with S² := (1/n) Σ_{i=1}^{n} (Xi – X̄)², one can show E[S²] = ((n–1)/n) σ²
• Asymptotically (as n→∞) unbiased
• So, the (corrected) estimator of variance is Sc := S²·n/(n–1), which is unbiased (see the sketch below)
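A sketch exposing the bias and its correction (assumptions: NumPy; the Gaussian data, n = 5, and the number of repetitions are illustrative): it averages the uncorrected (ddof=0) and corrected (ddof=1) sample variances over many small samples.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2_true, n, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, n))
s2_uncorrected = samples.var(axis=1, ddof=0)   # S^2: divides by n
s2_corrected = samples.var(axis=1, ddof=1)     # S_c = S^2 * n/(n-1): divides by n-1

print("average S^2:", s2_uncorrected.mean())   # ~ sigma2 * (n-1)/n = 3.2 (biased)
print("average S_c:", s2_corrected.mean())     # ~ 4.0 (unbiased)
```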
Sample-Variance Estimator
• What about an estimator of the standard deviation σ defined as σ̂ := √(Sc) ?
• Is E[σ̂] = σ ?
• Sqrt(.) is a strictly concave function within (0,∞)
• Apply Jensen’s inequality:
E[√(Sc)] < √(E[Sc]) = σ
• Excepting the degenerate case when the distribution has variance 0
Sample-Variance Estimator
• Variance of sample variance
• Variance of (uncorrected or corrected) sample-variance
tends to zero asymptotically (as N→∞)
• When (finite-variance) conditions underlying the law of large numbers hold
• https://en.wikipedia.org/wiki/Variance#Distribution_of_the_sample_variance
• https://mathworld.wolfram.com/SampleVarianceDistribution.html
• Then, (uncorrected or corrected) sample variance is a consistent estimator
Sample-Covariance Estimator
• Consider a joint PDF/PMF P(X,Y) with Cov(X,Y) = E[XY] – E[X]E[Y]
• Let E[XY] = μxy , E[X] = μx , E[Y] = μy
• Let (Xi,Yi) and (Xj,Yj) be i.i.d. (e.g., Xi independent of Xj and Yj for all i≠j)
• Sample-covariance estimator Ĉ := (1/n) Σ_{i=1}^{n} Xi Yi – [(1/n) Σ_{i=1}^{n} Xi][(1/n) Σ_{i=1}^{n} Yi]
• E[(1/n) Σ_i Xi Yi] = (1/n) Σ_i E[Xi Yi] = (1/n) n μxy = μxy
• E[((1/n) Σ_i Xi)((1/n) Σ_i Yi)] = (1/n²) Σ_i E[Xi Yi] + (1/n²) Σ_{i≠j} E[Xi]E[Yj]
= (1/n²) n μxy + (1/n²) n(n–1) μx μy = (1/n) μxy + ((n–1)/n) μx μy
• So, expectation of the sample-covariance = ((n–1)/n)(μxy – μx μy)
• Asymptotically unbiased. The corrected version (scaled by n/(n–1)) will be unbiased (see the sketch below).
• Can be shown to be consistent
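A sketch checking the (n–1)/n bias factor derived above (assumptions: NumPy; the bivariate Gaussian with Cov(X,Y) = 1.5 and n = 4 are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(6)
cov_true, n, trials = 1.5, 4, 200_000
Sigma = np.array([[2.0, cov_true], [cov_true, 3.0]])   # true covariance matrix

xy = rng.multivariate_normal([0.0, 0.0], Sigma, size=(trials, n))
x, y = xy[..., 0], xy[..., 1]

# Uncorrected sample covariance: (1/n) sum_i x_i y_i - xbar * ybar
c_hat = (x * y).mean(axis=1) - x.mean(axis=1) * y.mean(axis=1)

print("average C_hat:          ", c_hat.mean())                  # ~ cov_true*(n-1)/n
print("average corrected C_hat:", (c_hat * n / (n - 1)).mean())   # ~ cov_true
```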
MLE for Gaussian
• Parameters are mean μ and standard deviation 𝜎
• Likelihood function L(μ,𝜎) is a function of 2 variables
• Maximizing likelihood function L(μ,𝜎) is equivalent to
maximizing log-likelihood function log(L(μ,𝜎))
• Because the log(.) function is (strictly) monotonically increasing within (0,∞)
• Need to solve for 2 equations in 2 unknowns
• ML estimate for μ is the sample mean
• ML estimate for σ² is the (uncorrected) sample variance (see the sketch below)
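A sketch of the closed-form Gaussian ML estimates (assumptions: NumPy and SciPy are available; scipy.stats.norm.fit, which also returns ML estimates of loc and scale, is used only as a cross-check).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(loc=10.0, scale=3.0, size=2_000)

mu_ml = x.mean()                    # ML estimate of mu: the sample mean
sigma_ml = np.sqrt(x.var(ddof=0))   # ML estimate of sigma: sqrt of uncorrected sample variance

print("closed form:", mu_ml, sigma_ml)
print("scipy fit:  ", stats.norm.fit(x))   # (loc, scale) estimates; should match closely
```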
MLE for Half-Normal
• PDF: P(x; σ) = √(2/(πσ²)) exp(–x²/(2σ²)) for x ≥ 0
• ML estimate is: σ̂ = √((1/N) Σ_{i=1}^{N} Xi²)
• This isn’t the sample mean,
isn’t the sample std. dev.,
isn’t the sample median
MLE for Laplace
• PDF: P(x; μ, b) = (1/(2b)) exp(–|x – μ|/b)
• ML estimates (see the sketch below)
• For the location parameter μ:
sample median
• For the scale parameter b:
mean/average absolute deviation
(MAD/AAD)
from the median
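A sketch of these Laplace estimates (assumptions: NumPy and SciPy; scipy.stats.laplace.fit is used only as a cross-check and should approximately agree): location = sample median, scale = mean absolute deviation from the median.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.laplace(loc=1.0, scale=2.0, size=5_000)

loc_ml = np.median(x)                    # ML estimate of the location parameter
scale_ml = np.mean(np.abs(x - loc_ml))   # mean absolute deviation from the median

print("closed form:", loc_ml, scale_ml)
print("scipy fit:  ", stats.laplace.fit(x))
```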
MLE for Uniform Distribution (Continuous)
• Parameters are: lower limit ‘a’ and upper limit ‘b’ (a < b)
• Support of PDF depends on parameters
• Let data from U(a,b) be {x1, …, xN}, sorted in increasing order, & x1 < xN
• What are ML estimates ?
• First, data must lie within [a,b]
• a ≤ x1 , else likelihood function = 0
• b ≥ xN , else likelihood function = 0
• Likelihood function L(a,b; {x1, …, xN}) := (1/(b–a))N
• Log-likelihood function log(L(a,b; {x1, …, xN})) = –N·log(b–a)
• Partial derivative w.r.t. ‘a’ is N/(b–a) > 0
• Partial derivative w.r.t. ‘b’ is (–N/(b–a)) < 0
• L(a,b) is maximized when a = x1 and b = xN (see the sketch below)
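A sketch of the uniform ML estimates (assumptions: NumPy; a = 2, b = 7, and the sample size are illustrative): â is the sample minimum and b̂ is the sample maximum.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(low=2.0, high=7.0, size=1_000)

a_ml, b_ml = x.min(), x.max()   # ML estimates: sample minimum and sample maximum
print(a_ml, b_ml)               # always slightly inside [2, 7]: min >= a and max <= b
```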
MLE for Uniform Distribution (Continuous)
• Parameters are: lower limit ‘a’ and upper limit ‘b’ (a < b)
• Let data from U(a,b) be {x1, …, xN}, sorted in increasing order, & x1<xN
• Analysis of consistency
(recall: estimator TN = T(X1, …, XN) is consistent if ∀ε > 0, lim_{N→∞} P(|TN – θ| ≥ ε) = 0)
• For the estimator of ‘b’: ∀ε > 0 with ε < (b–a), consider P(b – max_{i=1,…,N} xi ≥ ε)
= P(b – x1 ≥ ε) P(b – x2 ≥ ε) ⋯ P(b – xN ≥ ε)
= P(x1 ≤ b – ε) ⋯ P(xN ≤ b – ε) = ((b–ε–a)/(b–a))^N
which → 0 as N→∞
• For the estimator of ‘a’: ∀ε > 0 with ε < (b–a), consider P(min_{i=1,…,N} xi – a ≥ ε)
= P(x1 ≥ a + ε) P(x2 ≥ a + ε) ⋯ P(xN ≥ a + ε)
= (1 – P(x1 ≤ a + ε)) ⋯ (1 – P(xN ≤ a + ε)) = (1 – ε/(b–a))^N = ((b–a–ε)/(b–a))^N
which → 0 as N→∞
MLE for Uniform Distribution (Continuous)
• Parameters are: lower limit ‘a’ and upper limit ‘b’ (a < b)
• Let data from U(a,b) be {x1, …, xN}, sorted in increasing order, & x1<xN
• Analysis of bias (Bias(T) := E[T] – θ)
• Without loss of generality, let a ≥ 0 (shifted random variable)
• For a non-negative random variable, apply the tail-sum formula (simulation check below):
E[max_{i=1,…,N} xi] = ∫_{t=0}^{∞} (1 – P(max_{i=1,…,N} xi ≤ t)) dt
= ∫_{t=0}^{a} 1 dt + ∫_{t=a}^{b} (1 – P(max_i xi ≤ t)) dt + ∫_{t=b}^{∞} (1 – 1) dt
= a + ∫_{t=a}^{b} (1 – ((t–a)/(b–a))^N) dt
= a + (b–a) – (b–a)/(N+1) = b – (b–a)/(N+1)   (check that this makes sense for N=1)
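A quick simulation check of E[max_i xi] = b – (b–a)/(N+1) (assumptions: NumPy; a = 2, b = 7, N = 10 are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(10)
a, b, N, trials = 2.0, 7.0, 10, 200_000

max_vals = rng.uniform(a, b, size=(trials, N)).max(axis=1)
print("empirical E[max]:       ", max_vals.mean())
print("formula b - (b-a)/(N+1):", b - (b - a) / (N + 1))   # 7 - 5/11 ~ 6.545
```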
Linear Regression
• Given: Data {(xi, yi)}_{i=1}^{n}
• Linear Model: Yi = αtrue + βtrue Xi + ηi,
where the errors ηi (in measuring Yi; not Xi)
are zero-mean i.i.d. Gaussian random variables
• Goal: Estimate αtrue, βtrue (a numerical sketch follows this slide)
• Log-likelihood function
• L(α, β; {(xi, yi)}_{i=1}^{n}) := log ∏_i G(yi; α + βxi, σ²)
• Partial derivative w.r.t. α is 0 implies: α = ȳ – βx̄ (bar denotes sample mean)
• Partial derivative w.r.t. β is 0 implies: Σ_i (yi – α – βxi) xi = 0
• Substituting the expression for α gives:
β = [Σ_i (yi – ȳ) xi] / [Σ_i (xi – x̄) xi] = ((1/n) Σ_i xi yi – x̄ ȳ) / ((1/n) Σ_i xi² – x̄²) = SampleCov(X,Y) / SampleVar(X)
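A numerical sketch of these closed-form estimates (assumptions: NumPy; the true α, β, noise level, and sample size are illustrative; np.polyfit is used only as a least-squares cross-check, which coincides with the ML solution under the Gaussian-noise model).

```python
import numpy as np

rng = np.random.default_rng(11)
n, alpha_true, beta_true, sigma = 200, 1.0, 2.5, 0.5
x = rng.uniform(0.0, 10.0, size=n)
y = alpha_true + beta_true * x + rng.normal(0.0, sigma, size=n)   # model: y = a + b*x + noise

beta_hat = np.mean((x - x.mean()) * (y - y.mean())) / np.var(x)   # SampleCov(X,Y)/SampleVar(X)
alpha_hat = y.mean() - beta_hat * x.mean()                        # ybar - beta_hat * xbar

print("estimates (alpha, beta): ", alpha_hat, beta_hat)
print("np.polyfit [beta, alpha]:", np.polyfit(x, y, deg=1))       # highest-degree coefficient first
```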
Linear Regression
(Slope m := Cov(X,Y) / Var(X); Intercept c := E[Y] – Cov(X,Y) E[X] / Var(X))
• Analysis of estimates
• Slope β = SampleCov(X,Y) / SampleVar(X)
• Unbiased (see next slide)
(ratio of sample-covariance and sample-variance is same with/without correction)
• Can be shown to be consistent (see next slide)
• Intercept α = ȳ – βx̄
• We already know that ȳ and x̄ are unbiased and consistent estimators of E[Y] and E[X]
• Unbiased, if β is unbiased
• Can be shown to be consistent, if β is consistent
Linear Regression
• β = [(1/n) Σ_i (xi – x̄)(yi – ȳ)] / SampleVar(X)
= [(1/n) Σ_i (xi – x̄) yi – (1/n) Σ_i (xi – x̄) ȳ] / SampleVar(X)
= [(1/n) Σ_i (xi – x̄) yi] / SampleVar(X)   (since Σ_i (xi – x̄) = 0)
• But, as per the model, yi = αtrue + βtrue xi + ηi. Substituting yi gives:
• β = [(1/n) Σ_i (xi – x̄)(αtrue + βtrue xi + ηi)] / SampleVar(X)
= [(1/n) Σ_i (xi – x̄)(βtrue xi + ηi)] / SampleVar(X)
• = [(1/n) Σ_i (xi – x̄) βtrue (xi – x̄) + (1/n) Σ_i (xi – x̄) βtrue x̄ + (1/n) Σ_i (xi – x̄) ηi] / SampleVar(X)
• = βtrue + [(1/n) Σ_i (xi – x̄) ηi] / SampleVar(X)
• So, E[β] = βtrue, because E[ηi] = 0. So, unbiased.
• Var(β) = [Σ_i (xi – x̄)² Var(ηi)] / (n² SampleVar(X)²) = [n SampleVar(X) σ²] / (n² SampleVar(X)²) = σ² / (n SampleVar(X))
• So, consistent (using Chebyshev’s inequality)
Linear Regression
• Interpretation of estimates
• Line passes through (x̄, ȳ)
• If x := x̄, then y = α + βx̄ = ȳ – βx̄ + βx̄ = ȳ
• “Residuals” ηi sum to 0
• Σ_i ηi = Σ_i (yi – α – βxi) = nȳ – n(ȳ – βx̄) – βnx̄ = 0
• Slope β = SampleCov(X,Y) / SampleVar(X)
• “Centering” the data
• β is a weighted average of the “slope” for specific points, (yi – ȳ)/(xi – x̄)
• Larger weight for datum (xi, yi) if its xi coordinate is farther from the center x̄
• Weights are non-negative and sum to 1 (convex combination)
• Intercept α = ȳ – βx̄
• From the center (x̄, ȳ), the line with estimated slope β intersects the ‘y’ axis at ȳ – βx̄
Linear Regression
• Effect of outliers
A Poem on MLE
• https://www.math.utep.edu/faculty/lesser/MLE.html
On Preparation for Events (Exams) in Life
• From the Iron Man
• “I don’t really prepare for anything like an event.”
• “The goal is to be at a certain level of fitness.”
• “I should be able to run a full marathon whenever I want.”
• “That is the constant level of fitness that I aspire to.”
• “I keep my fitness level as a goal, not an event as a goal.”
• “There is no such thing as a good shortcut.”
• “If you want to be healthy,
and you want to be fit,
and you want to be happy,
you have to work hard.”
• https://youtu.be/x_96xVfdzu0?t=303
