Intro To Data Science Lecture 2

9/6/2022

Introduction to Data Science for Civil Engineers
Lecture 1b. Review of Probability and Statistics
Fall 2022

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 1st Edition, 2013; 2nd Edition, 2021) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

OUTLINE
 1. Probability and Random Variable
 2. Probability Distributions
 3. Estimation
    Desirable Properties of Estimator
    Maximum Likelihood Estimator
 4. Hypothesis Testing and Confidence Intervals

1. PROBABILITY AND RANDOM VARIABLE

PROBABILITY
 Probability
    A measure of the expectation that an event will occur. The probability P(E) of an event E is a real number in the range of 0 to 1.
 Definition based on relative frequency
    An experiment is repeated n trials, and y_i is the outcome of the i-th trial. Let y be defined on 𝒴, and B be a subset of 𝒴.
    Define N(B) = frequency of B, that is, the number of outcomes y_i that are in B.
    Relative frequency: P_n(B) = N(B)/n
    Law of relative frequency: P_n(B) ≈ P(B) when n ≫ 1
    P(B) is defined as the probability of B, that is, the probability that the outcome of a single trial will be in B.
    P is a distribution on 𝒴.
 Axioms of probability [Figure: Venn diagram of subsets A and B in a universal set 𝒴]
    0 ≤ P(B) ≤ 1
    P(𝒴) = 1
 Set operations
    Let 𝒴 be a universal set, and A and B be two subsets of 𝒴.
    Union of A and B: A ∪ B
    Intersection of A and B: A ∩ B
    Complement of A: A̅ = 𝒴 − A
 Inclusion-exclusion principle
    Let |A| be the number of members in set A. For two sets, |A ∪ B| = |A| + |B| − |A ∩ B|
    For 3 finite sets A, B, and C, we have |A ∪ B ∪ C| = |A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|
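A minimal Python sketch (not on the slides) of the law of relative frequency; the die-rolling experiment, the event B, and the seed are hypothetical choices for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    y = rng.integers(1, 7, size=n)          # n trials: rolls of a fair die
    rel_freq = np.isin(y, [1, 2]).mean()    # N(B)/n for the event B = {1, 2}
    print(rel_freq)                         # ~1/3 = P(B) when n >> 1

As n grows, N(B)/n settles near P(B), which is exactly the relative-frequency definition above.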

SET OPERATIONS
 DeMorgan's Law
    For any sets (e.g., A and B), we have
    the complement of A ∪ B is A̅ ∩ B̅
    the complement of A ∩ B is A̅ ∪ B̅
 Probability based on set operations
    P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
    P(A ∪ B) ≤ P(A) + P(B)
    If A and B are disjoint, P(A ∩ B) = 0, so P(A ∪ B) = P(A) + P(B)
    P(A ∩ B) ≤ min(P(A), P(B))
    P(A̅) = P(𝒴 − A) = 1 − P(A)
    If A ⊆ B, then P(A) ≤ P(B)

RANDOM VARIABLE
 Random Variable (R.V.)
    A variable that has alternative values with a probability distribution.
    Discrete or continuous
 Expected value of a R.V.
    Also known as expectation, mean, or first moment
    Let X be a R.V.; its expected value is
    E[X] = μ = Σ_i P(X_i) X_i when X is a discrete R.V. (P is a probability mass function)
    E[X] = μ = ∫ x f(x) dx when X is a continuous R.V. (f(x) is a probability density function)
 Variance
    Var(X) = σ² = E[(X − μ)²] = E[(X − E[X])²] = E[X²] − (E[X])²
    σ is the standard deviation of X.
 Jensen's inequality: E[X²] > (E[X])² (because Var(X) > 0)
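A short Python sketch (not on the slides) computing the expectation and variance of a discrete R.V. directly from its p.m.f.; the values and probabilities are hypothetical:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])      # values of a discrete R.V.
    p = np.array([0.2, 0.5, 0.3])      # p.m.f.; must sum to 1

    mean = (p * x).sum()               # E[X] = sum_i P(X_i) X_i
    var = (p * (x - mean)**2).sum()    # Var(X) = E[(X - mu)^2]
    ex2 = (p * x**2).sum()             # E[X^2]
    print(mean, var, ex2 - mean**2)    # the two variance forms agree
    print(ex2 > mean**2)               # Jensen's inequality: True since Var(X) > 0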

PROPERTIES OF EXPECTATIONS
 If a and b are constants, and X and Y are random variables:
 E(a) = a, Var(a) = 0
 E(aX + b) = aE(X) + b
 E(X + Y) = E(X) + E(Y)
 E(X − Y) = E(X) − E(Y)
 E(X − μ) = 0, or E(X − E(X)) = 0
 E((aX)²) = a²E(X²)

MORE PROPERTIES OF EXPECTATIONS
 Var(X) = E(X²) − μ² = E(X²) − (E(X))²
 Var(aX + b) = a²Var(X)
 Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
 Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y)
 If X and Y are uncorrelated, Var(X + Y) = Var(X) + Var(Y)

JOINT PROBABILITY AND CORRELATION
 Joint probability distribution
    Probability distribution for two or more R.V.s.
    P(X_i, Y_j) = P_ij represents the joint probability of event X_i and event Y_j (i.e., the probability of both events happening)
 Covariance between two random variables, X and Y
    Cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
    Covariance is a measure of the linear association between X and Y.
 Correlation coefficient
    ρ(X, Y) = Cov(X, Y) / (σ_X σ_Y)
    ρ(X, Y) is in the range of −1 to 1.

STATISTICAL INDEPENDENCE
 Statistical independence
    If P(X, Y) = P(X)P(Y), then X and Y are statistically independent.
    If two R.V.s X and Y are independent, then E[XY] = E[X]E[Y], so Cov(X, Y) = 0.
    However, if Cov(X, Y) = 0, we cannot say X and Y are independent. Why?
 "Statistical Independence" versus "Linear Independence"
    In a set of vectors (e.g., X and Y), if no vector can be written as a linear combination of the others, these vectors are linearly independent.
    Linear combination of vectors X and Y: aX + bY (a, b are real numbers)
    aX + bY = 0 if and only if a = b = 0 ⇔ X and Y are linearly independent.
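One standard answer to the "why?" above, sketched in Python (not on the slides; a symmetric X with Y = X² is a textbook counterexample): Y is completely determined by X, yet their covariance is zero.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=100_000)   # X ~ N(0, 1), symmetric about 0
    y = x**2                       # Y is a deterministic function of X

    # Cov(X, Y) = E[X^3] - E[X]E[X^2] = 0 for a symmetric X,
    # even though X and Y are clearly dependent.
    print(np.cov(x, y)[0, 1])      # ~0 up to sampling noise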

2. PROBABILITY DISTRIBUTIONS
 Discrete
    Bernoulli Distribution
    Binomial Distribution
    Negative Binomial Distribution
    Hypergeometric Distribution
    Poisson Distribution
 Continuous
    Normal Distribution
    Chi-square Distribution
    t Distribution
    F Distribution
    Exponential Distribution
    Logistic Distribution

BERNOULLI DISTRIBUTION
 A {0, 1}-valued random variable is commonly referred to as an indicator or Bernoulli random variable.
 Let X be such a random variable and set p = P(X = 1). Then the p.m.f. of X is given by P(0) = 1 − p and P(1) = p, which can be written as
    P(x) = p^x (1 − p)^(1−x), x ∈ {0, 1}
 A Bernoulli random variable has
    a mean of p
    a variance of p(1 − p)
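A one-line check of the stated mean and variance with scipy.stats (a sketch, assuming SciPy is available; p = 0.3 is an arbitrary choice):

    from scipy.stats import bernoulli

    X = bernoulli(0.3)             # p = P(X = 1) = 0.3
    print(X.pmf(0), X.pmf(1))      # 1 - p = 0.7, p = 0.3
    print(X.mean(), X.var())       # p = 0.3 and p(1 - p) = 0.21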

NORMAL DISTRIBUTION
 A continuous R.V. X that has a p.d.f. f(x) given by
    f(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²)), σ > 0
 is normally distributed with a mean μ and a standard deviation σ. We write it as X ~ N(μ, σ²).
 When μ = 0 and σ = 1, the normal distribution p.d.f. is reduced to the form
    f(x) = (1 / √(2π)) exp(−x² / 2)
 The corresponding random variable is said to follow a standardized or unit normal distribution.
 If X is normally distributed with mean μ and standard deviation σ, then Z = (X − μ)/σ follows a standardized normal distribution. We write it as Z ~ N(0, 1).

STANDARDIZED NORMAL DISTRIBUTION
 The p.d.f. of a standardized normal distribution is symmetric about 0, and is bell-shaped. [Figure: standard normal p.d.f., f(x) versus x over −4 ≤ x ≤ 4]
 Therefore, we have
    P(X ≤ −x) = P(X ≥ x)
    F(−x) = P(X ≥ x) = 1 − P(X ≤ x) = 1 − F(x)
 F(x) is the cumulative distribution function (c.d.f.) of the random variable X.
    F(x) = P(X ≤ x) = Σ_{x_i ≤ x} P(x_i), if X is discrete
    F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt, if X is continuous
    F′(x) = f(x) at all continuity points x of f, where F′(x) is the derivative of F at x.
    F(x) is a nondecreasing function of x, and 0 ≤ F(x) ≤ 1.
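A small sketch (not on the slides) verifying the symmetry relation F(−x) = 1 − F(x) and the standardization Z = (X − μ)/σ with scipy.stats.norm; μ, σ, and the evaluation points are hypothetical:

    from scipy.stats import norm

    x = 1.5
    print(norm.cdf(-x), 1 - norm.cdf(x))    # equal: symmetry of the unit normal

    # P(X <= a) for X ~ N(mu, sigma^2) equals P(Z <= (a - mu)/sigma)
    mu, sigma, a = 10.0, 2.0, 13.0
    print(norm.cdf(a, loc=mu, scale=sigma), norm.cdf((a - mu) / sigma))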

PROPERTIES OF NORMAL DISTRIBUTIONS
 If X ~ N(μ, σ²), then aX + b ~ N(aμ + b, a²σ²)
 A linear combination of independent, identically distributed (i.i.d.) normal random variables will also be normally distributed.
 If Y_1, Y_2, ..., Y_n are i.i.d. and ~ N(μ, σ²), then Y̅ ~ N(μ, σ²/n)

CHI-SQUARE DISTRIBUTION
 A chi-square distribution is defined as the sum of the squares of n independent unit normal random variables (n > 0), and n is named the degrees of freedom of the chi-square distribution. That is, if
    X = Σ_{i=1}^{n} Z_i²
 in which Z_1, ..., Z_n are independent, unit normal random variables, then X follows a chi-square distribution with n degrees of freedom, denoted as X ~ χ²_n.
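A quick simulation sketch of this definition (not on the slides; n and the replication count are arbitrary): summing the squares of n unit normals reproduces the mean n and variance 2n quoted on the next slide.

    import numpy as np

    rng = np.random.default_rng(2)
    n, reps = 5, 200_000
    z = rng.normal(size=(reps, n))    # independent unit normal Z_i's
    x = (z**2).sum(axis=1)            # X = sum of squared Z's ~ chi-square(n)
    print(x.mean(), x.var())          # ~n = 5 and ~2n = 10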

CHI-SQUARE DISTRIBUTION
 Some characteristics of chi-square distributions:
    The distribution is over the nonnegative region.
    As the number of degrees of freedom increases, the distribution shifts to the right (larger values) and tends to be more symmetric and bell-shaped (towards a normal distribution, based on the central limit theorem).
    A chi-square distribution with n degrees of freedom has a mean n and a variance 2n.
    If X follows a chi-square distribution with n degrees of freedom, Y follows a chi-square distribution with m degrees of freedom, and X and Y are independent, then X + Y follows a chi-square distribution with n + m degrees of freedom.

CENTRAL LIMIT THEOREM
 If Y is the sum of n independent and identically distributed (i.i.d.) random variables X_1, X_2, ..., X_n, then the distribution of Y can be well-approximated by a normal distribution when n is large.
    If X_i (i = 1, 2, ..., n) has a mean μ and variance σ², then Y = Σ_{i=1}^{n} X_i has a mean nμ and a variance nσ².
    When Y is normally distributed, Y̅ = Y/n = (Σ X_i)/n (n is a nonzero real number) is also normally distributed. If Y has a mean nμ and a variance nσ², Y̅ has a mean μ and a variance σ²/n. This tells us that the average of n independent observations tends to become normally distributed as n increases.
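A CLT sketch in Python (not on the slides): the population is deliberately non-normal (exponential, a hypothetical choice with mean and standard deviation both 1), yet the standardized sample mean behaves approximately like N(0, 1) for large n.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    n, reps = 500, 100_000
    mu, sigma = 1.0, 1.0                     # Exponential(1): mean 1, std 1
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    z = (xbar - mu) / (sigma / np.sqrt(n))   # standardized sample means
    print((z > 1.96).mean(), 1 - norm.cdf(1.96))   # approximately equal, ~0.025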

T DISTRIBUTION
 A random variable T that follows a t-distribution (sometimes also called Student's t-distribution) with n degrees of freedom can be written as the quotient of two independent variables Z and R, where Z is unit normal and R is the root mean square of n other independent unit normal variables; that is,
    t_n = Z / R = Z / √(χ²_n / n)
 Note that χ²_n is a chi-square random variable with n degrees of freedom; Z and χ²_n are statistically independent.
 The characteristics of the t-distribution similar to the standardized normal distribution:
    Its p.d.f. is bell-shaped, symmetric about the mean (0).
    The mean, median, and mode of a t-distribution equal 0.
    Its p.d.f. curve never touches the x axis.
 Characteristics of the t-distribution that differ from the standardized normal distribution:
    The variance of a t-distribution is greater than 1.
    The t-distribution is a family of curves based on the concept of degrees of freedom.
    As the degrees of freedom increase, the t-distribution approaches the standardized normal distribution. Why?
 [Figure: p.d.f.s of the standard normal and of t-distributions with n = 1, 2, 5, 10]

COMMON USE OF T DISTRIBUTION
 Suppose a simple random sample of size n, (X_1, X_2, ..., X_n), is taken from a population which follows a normal distribution with mean μ and variance σ². The sample mean X̄ = (1/n)(X_1 + X_2 + ... + X_n) then follows a normal distribution with mean μ and variance σ²/n. Let Z = (X̄ − μ)/(σ/√n); then Z ~ N(0, 1).
 If σ² is known, we may make statistical inference regarding μ or X̄ using the standardized normal distribution.
 However, generally σ² is unknown. We may make an unbiased estimate of the variance σ² with the so-called sample variance S² = (1/(n−1)) Σ (X_i − X̄)² from the observed sample.
 When the unknown σ is replaced with S,
    Z = (X̄ − μ)/(σ/√n) becomes T = (X̄ − μ)/(S/√n)
 It can be proven that T follows a t-distribution with n − 1 degrees of freedom.
 Then we may make statistical inference regarding μ or X̄ using the t-distribution.
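A simulation sketch of this fact (not on the slides; all numbers are hypothetical): with a small n, the statistic built with S has noticeably fatter tails than N(0, 1), and they match the t-distribution with n − 1 degrees of freedom.

    import numpy as np
    from scipy.stats import t, norm

    rng = np.random.default_rng(4)
    mu, sigma, n, reps = 10.0, 2.0, 6, 200_000
    x = rng.normal(mu, sigma, size=(reps, n))
    T = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

    print((np.abs(T) > 2.5).mean())          # empirical two-sided tail
    print(2 * (1 - t.cdf(2.5, df=n - 1)))    # t with n-1 d.o.f.: matches
    print(2 * (1 - norm.cdf(2.5)))           # normal tail: much smaller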

F DISTRIBUTION
 A random variable F that follows an F distribution with degrees of freedom of n_1 and n_2 can be written as the quotient of two independent variables R_1 and R_2, where R_1 is the mean square of n_1 independent unit normal variables, and R_2 is the mean square of n_2 other independent unit normal variables; that is,
    F_{n_1, n_2} = R_1 / R_2 = (χ²_{n_1} / n_1) / (χ²_{n_2} / n_2)
 Note that χ²_{n_1} and χ²_{n_2} are chi-square random variables with n_1 and n_2 degrees of freedom, respectively; χ²_{n_1} and χ²_{n_2} are statistically independent.
 In hypothesis testing of the parameters of a linear regression model, the F distribution is used for testing a hypothesis involving multiple parameters.
    In contrast, the t-distribution is used for testing a hypothesis involving one parameter.

3. ESTIMATION
 For a population with unknown distribution parameters (e.g., mean μ and variance σ²), we take a sample of N data points and obtain estimates of population characteristics from the sample.
    Estimator: a rule that gives a sample estimate
    Estimate: a number that is calculated from the sample based on the estimator
 For example, let X represent the roughness of pavement in Florida. X can be treated as a R.V. (why?)
    We are interested in the mean of X (i.e., E[X] = μ = ?)
    We may measure the roughness at N randomly selected locations across the Florida pavement network, and get N samples of roughness, (X_1, X_2, ..., X_N).
    X_1, X_2, ..., X_N are i.i.d. (why?)

3. ESTIMATION
 Estimator
 Desirable properties of an estimator
    (1). Unbiasedness
    (2). Efficiency
    (3). Minimum Mean Square Error
    (4). Large Sample Properties (Consistency and Asymptotic Efficiency)
 Maximum Likelihood Estimation (MLE)

AN EXAMPLE OF ESTIMATOR
 We construct an estimator of μ as follows:
    μ̂ = X̄ = (1/N) Σ_{i=1}^{N} X_i (this is known as the sample mean)
 μ̂ (or X̄) is an estimator of μ.
 μ̂ (or X̄) is a random variable! So it has a probability distribution with a mean and a variance.
 How are its mean and variance compared to those of X?
 We claim E[X̄] = μ and var(X̄) = var(X)/N = σ²/N.
 Proof:
    E[X̄] = E[(1/N) Σ X_i] = (1/N) E[Σ X_i] = (1/N) Σ E[X_i] = (1/N) Σ μ = (1/N)(Nμ) = μ
    var(X̄) = var((1/N) Σ X_i) = (1/N²) var(Σ X_i) = (1/N²) Σ var(X_i) = (1/N²) Σ σ² = (1/N²)(Nσ²) = σ²/N
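A simulation sketch (not on the slides) of the claim just proved; μ, σ, and N are hypothetical:

    import numpy as np

    rng = np.random.default_rng(5)
    mu, sigma, N, reps = 5.0, 3.0, 25, 100_000
    xbar = rng.normal(mu, sigma, size=(reps, N)).mean(axis=1)
    print(xbar.mean())                  # ~mu = 5.0
    print(xbar.var(), sigma**2 / N)     # ~sigma^2 / N = 0.36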

DESIRABLE PROPERTIES OF ESTIMATOR
 (1). Unbiasedness
 (2). Efficiency
 (3). Minimum Mean Square Error
 (4). Large Sample Properties (Consistency and Asymptotic Efficiency)
 Let
    θ be a population parameter to be estimated
    θ̂ be an estimator of θ

(1). UNBIASEDNESS
 θ̂ is an unbiased estimator of θ if E[θ̂] = θ
 For example, we have the following estimators of μ:
    1. μ̂_1 = X̄ = (1/N) Σ X_i (unbiased, as proved before)
    2. μ̂_2 = median(X_1, X_2, ..., X_N) (only unbiased for a symmetric distribution)
    3. μ̂_3 = (1/(N−1)) Σ X_i (always overestimates μ; however, the bias tends to disappear as N → ∞, so lim_{N→∞} E[μ̂_3] = μ, i.e., asymptotically unbiased)

AN UNBIASED ESTIMATOR OF POPULATION VARIANCE
 Sample variance S² computed as follows is an unbiased estimator of the population variance σ²:
    S² = (1/(N−1)) Σ_{i=1}^{N} (X_i − X̄)²
 Note it is divided by (N−1) instead of N. Why?
    One degree of freedom is used in computing the sample mean.

(2). EFFICIENCY
 θ̂ is an efficient estimator of θ if its variance is the smallest among all unbiased estimators.
    For example, the sample mean estimator of μ, X̄ = (1/N) Σ X_i, is the efficient estimator among all unbiased estimators of μ (minimum variance).
    Var(θ̂) = E[(θ̂ − E[θ̂])²] = E[θ̂²] − (E[θ̂])²
 Cramer-Rao Lower Bound (CRLB): For an unbiased estimator θ̂ of θ, which is a parameter of a distribution f(x|θ), using a sample of size N, the variance of the estimator, var(θ̂), has a lower bound:
    var(θ̂) ≥ 1 / (N E[(∂ ln f(x|θ) / ∂θ)²])
    var(θ̂) ≥ −1 / (N E[∂² ln f(x|θ) / ∂θ²])
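A quick sketch (not on the slides) of why the (N−1) divisor is used; σ², N, and the replication count are hypothetical:

    import numpy as np

    rng = np.random.default_rng(6)
    sigma2, N, reps = 4.0, 10, 200_000
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, N))
    print(x.var(axis=1, ddof=1).mean())   # ~4.0: divide by N-1, unbiased
    print(x.var(axis=1, ddof=0).mean())   # ~3.6: divide by N, biased low by (N-1)/N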

(2). EFFICIENCY
 Let's check if the sample mean is an efficient estimator.
 Let X be a Bernoulli random variable and set p = P(X = 1). Then the p.m.f. of X is given by P(0) = 1 − p and P(1) = p, which can be written as
    f(x) = p^x (1 − p)^(1−x), x ∈ {0, 1}
 It is easy to prove that E(X) = p and Var(X) = p(1 − p).
 We are interested in p. The sample mean estimator of p is p̂ = (1/N) Σ X_i, and its variance is
    var(p̂) = var(X)/N = p(1 − p)/N
 Let's calculate the CRLB. Since
    ln f(x|p) = x ln p + (1 − x) ln(1 − p)
    ∂ ln f(x|p)/∂p = x/p − (1 − x)/(1 − p)
    E[(∂ ln f(x|p)/∂p)²] = 1/p + 1/(1 − p) = 1/(p(1 − p))
 we have CRLB = p(1 − p)/N.
 The variance of p̂ equals the CRLB, so the sample mean is indeed an efficient estimator of p.

(3). MINIMUM MEAN SQUARE ERROR
 Let θ̂ be an estimator of θ; define Bias(θ̂) = E[θ̂] − θ
 Mean Square Error (MSE) is
    MSE = E[(θ̂ − θ)²] = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²] = E[(θ̂ − E[θ̂])²] + (E[θ̂] − θ)² = var(θ̂) + [Bias(θ̂)]²
 To minimize MSE, both variance and bias are taken into account.
    A trade-off of bias and variance of the estimator
    The goal is to maximize the precision of prediction.
    For example, a (biased) estimator with very low variance and some bias may be more desirable than an unbiased estimator with high variance.
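A simulation sketch (not on the slides) of the decomposition MSE = var + bias², using the biased /N variance estimator from a few slides back; all numbers are hypothetical:

    import numpy as np

    rng = np.random.default_rng(7)
    sigma2, N, reps = 4.0, 10, 200_000
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, N))
    est = x.var(axis=1, ddof=0)              # biased /N variance estimator
    mse = ((est - sigma2)**2).mean()         # direct Monte Carlo MSE
    bias = est.mean() - sigma2
    print(mse, est.var() + bias**2)          # the two agree: MSE = var + bias^2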

(4). LARGE SAMPLE PROPERTIES
 Let θ̂_N be an estimator of θ based on N samples (X_1, X_2, ..., X_N):
    θ̂_N = f(X_1, X_2, ..., X_N)
 A large sample property is a property of the estimator θ̂_N when N goes to infinity.
 1. Convergence in distribution (Central Limit Theorem)
 2. Consistency
    Bias goes to 0 when N → ∞.
    θ̂_N is a consistent estimator of θ if the probability limit of θ̂_N is θ:
    lim_{N→∞} Prob(|θ̂_N − θ| < ε) = 1, or written as plim_{N→∞} θ̂_N = θ
    MSE consistency:
    lim_{N→∞} MSE(θ̂_N) = 0
    which indicates var(θ̂_N) → 0 and [Bias(θ̂_N)]² → 0 when N → ∞.
 3. Asymptotic Efficiency
    The variance of the estimator θ̂_N goes to the CRLB when N → ∞:
    lim_{N→∞} var(θ̂_N) = CRLB

MAXIMUM LIKELIHOOD ESTIMATION (MLE)
 A method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data are most probable.
 For example, consider N samples (X_1, X_2, ..., X_N) taken from a population with a p.d.f. of f(X; θ), in which θ is a parameter to be estimated.
 The likelihood of the N independent samples is
    L(X; θ) = Π_{i=1}^{N} f(X_i; θ)
 MLE is to find a value θ̂ for θ so that the likelihood L(X; θ) is maximized. [Figure: likelihood curve L(X; θ) peaking at θ̂]

MLE
 Instead of maximizing the likelihood (a product), it is easier to maximize the logarithm of the likelihood (a sum). (why?)
    ℒ(X; θ) = log L(X; θ) = Σ_{i=1}^{N} log f(X_i; θ)
 Since the logarithm transformation is monotonic, the θ̂ determined based on the maximum of the log-likelihood is the same as the value determined based on the maximum likelihood. [Figure: L(X; θ) and log L(X; θ) peak at the same θ̂]
 How to find θ̂? Solve
    dℒ(X; θ)/dθ = 0
 and check that
    d²ℒ(X; θ)/dθ² < 0

MLE EXAMPLE – BERNOULLI DISTRIBUTION
 For a large population of pavement sections, we want to estimate the percentage of sections that have failed. We take N random samples of pavement sections from the population and find N_1 have not failed and N − N_1 have failed. We use X_1, X_2, ..., X_N to represent the conditions of the N samples. X_i (i = 1, ..., N) is a Bernoulli random variable:
    X_i = 0 if pavement i has failed, with probability of 1 − p
    X_i = 1 if pavement i has not failed, with probability of p
 The likelihood of observing X_i is P(X_i) = p^{X_i} (1 − p)^{1−X_i}
 The likelihood of observing X_1, X_2, ..., X_N is
    L(X; p) = Π_{i=1}^{N} p^{X_i} (1 − p)^{1−X_i} = p^{Σ X_i} (1 − p)^{N − Σ X_i} = p^{N_1} (1 − p)^{N − N_1}
 The log-likelihood is ℒ(X; p) = log L(X; p) = N_1 log p + (N − N_1) log(1 − p)
    dℒ(X; p)/dp = N_1/p − (N − N_1)/(1 − p) = 0 ⇒ p̂ = N_1/N = (1/N) Σ X_i = X̄
 The MLE estimator happens to be the sample mean, which is unbiased and efficient.

MLE EXAMPLE – NORMAL DISTRIBUTION
 Now let us assume that the sample points came from a population with a normal distribution with unknown mean and variance. Let us assume that we have n observations, y = (y_1, y_2, ..., y_n). We want to estimate the population mean and variance. Then the log-likelihood function has the form:
    l(y_1, y_2, ..., y_n | μ, σ²) = Σ_{i=1}^{n} ln[(1/√(2πσ²)) exp(−(y_i − μ)²/(2σ²))] = −(n/2) ln(2π) − (n/2) ln(σ²) − Σ_{i=1}^{n} (y_i − μ)²/(2σ²)
 If we take the derivative of this function w.r.t. the mean value and the variance, then we can write:
    dl/dμ = (1/σ²) Σ_{i=1}^{n} (y_i − μ) = 0 ⇒ μ̂ = (Σ_{i=1}^{n} y_i)/n = ȳ
    dl/d(σ²) = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^{n} (y_i − μ)² = 0
 Fortunately, the first of these equations can be solved without knowledge of the second one. If we then use the result from the first solution in the second (substitute μ by its estimate), we can solve the second equation also. The result is the sample variance:
    s² = (1/n) Σ_{i=1}^{n} (y_i − μ̂)²
 Note it is divided by n, not (n − 1), so this is a biased estimator!
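A numeric sketch (not on the slides) of the normal-distribution MLE result; the true μ, σ², and n are hypothetical:

    import numpy as np

    rng = np.random.default_rng(8)
    mu, sigma2, n = 3.0, 4.0, 500
    y = rng.normal(mu, np.sqrt(sigma2), size=n)

    mu_hat = y.mean()                     # MLE of the mean: the sample mean
    s2_hat = ((y - mu_hat)**2).mean()     # MLE of the variance: divides by n
    print(mu_hat, s2_hat)
    print(y.var(ddof=0), y.var(ddof=1))   # /n (the biased MLE) vs /(n-1) (unbiased)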

MLE PROPERTIES
 Properties of MLE estimators
    Not necessarily unbiased
    Not necessarily efficient
    But consistent and asymptotically efficient

4. HYPOTHESIS TESTING AND CONFIDENCE INTERVALS
 The discussion here is in the context of estimation.
 θ̂ is an estimator of θ. Now we have calculated a value (estimate) of θ̂ based on N samples (X_1, X_2, ..., X_N). We want to make some guess about the unknown parameter θ (e.g., θ = 0). How reliable is our guess?
 We will answer this question in a hypothesis testing approach.
    Null hypothesis H_0: θ = 0
    Alternative hypothesis H_1: θ ≠ 0
 We will use θ̂ and its estimate to test our hypothesis.
    Remember θ̂ is a random variable, so it has a prob. distribution.
    If the estimate falls in a region with very small probability based on the prob. distribution under the null hypothesis, we think our null hypothesis is unlikely, so we reject it and accept the alternative hypothesis.

4A. HYPOTHESIS TESTING
 [Figure: sampling distribution f(θ̂|θ) under the null hypothesis θ = 0, with rejection thresholds at −θ_c and θ_c]
 Under the null hypothesis, θ̂ follows the distribution f(θ̂|θ = 0).
 If the estimate is less than −θ_c or greater than θ_c, we reject the null hypothesis θ = 0, and accept the alternative hypothesis.
 The probability of such a decision (i.e., making a Type I error, that is, rejecting the null hypothesis when it is true) is the sum of the two shaded tail areas, which is called the level of statistical significance, typically taken as 0.01 or 0.05.
    P(reject H_0 | H_0 is true) = 0.01 or 0.05
 Note: Accepting the null hypothesis does not necessarily mean the null hypothesis is true. It only means that the observed data are compatible with the null hypothesis.
 Example:
    We want to estimate the mean of a population that follows a normal distribution N(μ, σ²).
    We have N samples (X_1, X_2, ..., X_N) that are i.i.d.
    We know the sample mean is an unbiased and efficient estimator, μ̂ = X̄.
    Our null hypothesis is H_0: μ = 10
    Alternative hypothesis: H_1: μ ≠ 10
    Since X̄ ~ N(μ, σ²/N) (why?), we use the test statistic Z = (X̄ − 10)/(σ/√N), and know Z ~ N(0, 1) under H_0.

4A. HYPOTHESIS TESTING
 [Figure: standard normal p.d.f. f(Z) = (1/√(2π)) exp(−Z²/2), with α/2 = 0.005 shaded in each tail beyond ±Z_{α/2}]
 If we select a level of statistical significance of α = 0.01, from a unit normal distribution table we may find that Z_{α/2} = 2.5758.
 We compute the sample mean X̄ from the samples (X_1, X_2, ..., X_N).
 If Z = (X̄ − 10)/(σ/√N) is in the acceptance region [−Z_{α/2}, Z_{α/2}], we accept H_0; otherwise we reject H_0.
 But... usually σ is unknown. What to do? We may use the sample variance S² to replace the population variance σ².
 Sample variance S² is computed as (why divided by N − 1 instead of N?)
    S² = (1/(N−1)) Σ_{i=1}^{N} (X_i − X̄)²
 Remember our test statistic is Z = (X̄ − 10)/(σ/√N). If we replace σ with S, the test statistic does not follow a standardized normal distribution any more. Instead, it follows a t-distribution with N − 1 degrees of freedom.
 We write (X̄ − 10)/(S/√N) = t_{N−1} and use the t-distribution for our hypothesis testing.
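This procedure with S in place of σ is exactly a one-sample t-test; a sketch with scipy.stats (the data are simulated here as a hypothetical example):

    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(9)
    x = rng.normal(10.5, 2.0, size=30)              # hypothetical sample
    t_stat, p_value = ttest_1samp(x, popmean=10.0)  # H0: mu = 10
    print(t_stat, p_value)                          # reject H0 if p_value < alpha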

4A. HYPOTHESIS TESTING
 Type I Error:
    The null hypothesis is rejected when it is true.
    The probability of making a Type I error is the level of statistical significance, or the size of the test, and is typically represented by the symbol α.
    [Figure: acceptance region between Z_{α/2} and Z_{1−α/2}, rejection regions of area α/2 in each tail]
 Type II Error:
    The null hypothesis is accepted when it is false.
    The probability of making a Type II error is represented by the symbol β.
    1 − β is the probability of correctly rejecting the null hypothesis when it is false. It is called the power of the test.
 p-value:
    A p-value (probability value) describes the exact significance level associated with a test statistic.
    Typically we compare the p-value with the level of statistical significance α.
    If p-value > α, we accept the null hypothesis; otherwise we reject it.
    [Figure: standard normal p.d.f. f(Z) = (1/√(2π)) exp(−Z²/2), with p-value/2 in each tail beyond the test statistic (based on N samples) and α/2 = 0.005 beyond ±Z_{α/2}]

4B. CONFIDENCE INTERVALS
 A Confidence Interval (C.I.) is a type of interval estimate of a population parameter, which may include the unknown population parameter with a specified level of probability.
    For example, "a 95% confidence interval for parameter μ is [a, b]" means that there is 95% probability that the interval [a, b] constructed from a set of sample data will include the unknown parameter μ.
 Confidence interval for the mean of a normal distribution with known standard deviation σ:
    P(−Z_{α/2} ≤ (X̄ − μ)/(σ/√n) ≤ Z_{α/2}) = 1 − α
    P(X̄ − Z_{α/2} σ/√n ≤ μ ≤ X̄ + Z_{α/2} σ/√n) = 1 − α
    The (1 − α)100% confidence interval for μ is
    [X̄ − Z_{α/2} σ/√n, X̄ + Z_{α/2} σ/√n]   (what if σ is unknown?)
 Confidence interval for the variance σ² of a normal distribution with unknown mean:
    We know that the sample variance s² is an unbiased estimator of σ². It can be shown that (N − 1)s²/σ² = Σ(X_i − X̄)²/σ² follows a chi-square (χ²) distribution with N − 1 degrees of freedom.
    A (1 − α)100% confidence interval for σ² can be constructed as
    [(N − 1)s²/χ²_{α/2, N−1}, (N − 1)s²/χ²_{1−α/2, N−1}]
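A closing sketch (not on the slides) computing both intervals from data; the t quantile answers the "what if σ is unknown?" question for the mean, and the chi-square quantiles give the variance interval. The sample itself is hypothetical.

    import numpy as np
    from scipy.stats import t, chi2

    rng = np.random.default_rng(10)
    x = rng.normal(10.0, 2.0, size=30)      # hypothetical sample
    N, alpha = len(x), 0.05
    xbar, s2 = x.mean(), x.var(ddof=1)

    # 95% C.I. for the mean (sigma unknown: t with N-1 degrees of freedom)
    half = t.ppf(1 - alpha / 2, df=N - 1) * np.sqrt(s2 / N)
    print(xbar - half, xbar + half)

    # 95% C.I. for the variance via (N-1)s^2/sigma^2 ~ chi-square(N-1)
    print((N - 1) * s2 / chi2.ppf(1 - alpha / 2, df=N - 1),
          (N - 1) * s2 / chi2.ppf(alpha / 2, df=N - 1))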
