STA248
Jenci Wei
Winter 2022
Contents
1 Statistics and Sampling Distributions
2 Point Estimation
3 Statistical Intervals Based On a Single Sample
4 Tests of Hypotheses Based on a Single Sample
5 Inferences Based on Two Samples
6 Regression and Correlation
7 Analysis of Variance
8 Logistic Regression
9 Chi-Squared Tests (Extra)
10 Bayesian Estimation (Extra)
1 Statistics and Sampling Distributions
Statistic: any quantity whose value can be calculated from sample data
• E.g. ȳ, s²
Population parameter: an unknown numerical characteristic of the population
• We want to conduct statistical inference about the population parameter
• E.g. µ, σ²
Population information
• Size of population: N
• Population mean: µ
• Population variance: σ²
• Population distribution: Y
Sample information
• Sample size: n
• Samples: y1 , y2 , . . . , yn
• Sample mean: ȳ
• Sample variance: s²
• Mean of the sampling distribution of ȳ: µ_ȳ = E(ȳ) = µ
• Standard deviation of the sampling distribution of ȳ: σ_ȳ = σ/√n
– Called the standard error of the mean
Central Limit Theorem
• Refinement of the law of large numbers
• For a large number (n ≥ 30) of iid RVs y1, . . . , yn with finite variance, the average ȳ is approximately normally distributed, no matter what the distribution of the yi is
• Let y1, . . . , yn be iid RVs with E(yi) = µ and V(yi) = σ² < ∞. Define
Zn = (ȳ − µ)/(σ/√n)
Then Zn approximately follows the standard normal distribution for a large sample size n ≥ 30, i.e. Zn ∼ N(0, 1) for n ≥ 30
– If σ is unknown, then
Zn = (ȳ − µ)/(s/√n) ∼ N(0, 1),
where s is the sample SD
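To make the CLT concrete, here is a minimal simulation sketch in Python (NumPy assumed; the Exponential(1) population, n = 30, and 10,000 replications are illustrative choices, not from the notes): standardized sample means from a skewed population should behave like N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 10_000            # sample size and number of replications (illustrative)
mu, sigma = 1.0, 1.0            # mean and SD of the Exponential(1) population

# Draw `reps` samples of size n and compute Z_n = (ybar - mu) / (sigma / sqrt(n))
samples = rng.exponential(scale=1.0, size=(reps, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# The empirical mean and SD of Z_n should be close to 0 and 1,
# and a histogram of z should look like the standard normal density
print(z.mean(), z.std())
```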
The Sampling Distribution of the Sample Proportion
• Consider an event A in the sample space of some experiment with p = P (A). Let y be the number of
times A occurs when the experiment is repeated n independent times, and define the sample proportion
p̂ = y/n. Then
1. E(p̂) = p
2. V(p̂) = p(1 − p)/n and σp̂ = √(p(1 − p)/n)
3. As n increases, the distribution of p̂ approaches a normal distribution
– p̂ is approximately normal, provided that np ≥ 10 and n(1 − p) ≥ 10
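A short simulation sketch (NumPy assumed; p = 0.3 and n = 200 are illustrative) checking the properties of p̂ listed above:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 200, 10_000      # true proportion, sample size, replications (illustrative)

# Each replication counts successes in n independent trials, then forms p-hat = y / n
p_hat = rng.binomial(n, p, size=reps) / n

print(p_hat.mean(), p)                          # E(p-hat) = p
print(p_hat.std(), np.sqrt(p * (1 - p) / n))    # sigma_p-hat = sqrt(p(1-p)/n)
```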
Gosset’s Theorem
• Let Z ∼ N(0, 1) and let W be a χ²-distributed RV with ν df, with Z and W independent. Then
T = Z/√(W/ν)
has a t distribution with ν degrees of freedom
• Let W1 and W2 be independent χ²-distributed RVs with ν1 and ν2 df, respectively. Then
F = (W1/ν1)/(W2/ν2)
has an F distribution with ν1 numerator degrees of freedom and ν2 denominator degrees of freedom
2 Point Estimation
An estimator is a rule, often expressed as a formula, that tells how to calculate the value of an estimate
based on the measurements contained in a sample
• E.g. the sample mean ȳ = (1/n) Σ_{i=1}^n yi is one possible point estimator of the population mean µ
• A point estimator θ̂ is unbiased if E(θ̂) = θ
– Otherwise θ̂ is biased
• The bias of a point estimator θ̂ is B(θ̂) = E(θ̂) − θ
• The mean squared error of θ̂ is MSE(θ̂) = E[(θ̂ − θ)²] = V(θ̂) + B(θ̂)²
Confidence Intervals
• An interval estimator is a rule specifying the method for using the sample measurements to calculate two numbers that form the endpoints of the interval
1. We want the interval to contain the target parameter θ
2. We want the interval to be narrow
Relative Efficiency
• Given two unbiased estimators θ̂1 and θ̂2 of a parameter θ, the efficiency of θ̂1 relative to θ̂2, denoted eff(θ̂1, θ̂2), is the ratio
eff(θ̂1, θ̂2) = V(θ̂2)/V(θ̂1)
Consistency
• An unbiased estimator θ̂n for θ is a consistent estimator of θ if
lim_{n→∞} V(θ̂n) = 0
Likelihood Function
• Let y1 , . . . , yn be sample observations taken on corresponding RVs Y1 , . . . , Yn whose distributions de-
pend on a parameter θ. If Y1 , . . . , Yn are discrete RVs, the likelihood of the sample, L(y1 , . . . , yn |θ),
is defined to be the joint probability of y1 , . . . , yn .
– If Y1, . . . , Yn are continuous RVs, the likelihood L(y1, . . . , yn |θ) is the joint density evaluated at y1, . . . , yn
The Method of Moments
• The kth population moment is µ′k = E(Y^k); the corresponding kth sample moment is mk = (1/n) Σ_{i=1}^n yi^k
• The method of moments is based on the idea that sample moments should provide good estimates of the corresponding population moments: equate mk to µ′k for as many moments as there are parameters and solve for the parameters
The Method of Maximum Likelihood
• Suppose that the likelihood function depends on k parameters θ1 , . . . , θk . Choose the estimates of those
parameters that maximize the likelihood L(y1 , . . . , yn |θ1 , . . . , θk )
• The likelihood function is a function of the parameters θ1 , . . . , θk
– We sometimes write the likelihood function as L(θ1, . . . , θk)
• Maximum likelihood estimators are referred to as MLEs
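As an illustration of maximum likelihood, here is a minimal numerical sketch (SciPy assumed; the Exponential(rate θ) model and the data are invented for the example). It maximizes the likelihood by minimizing the negative log-likelihood, then compares against the closed-form MLE 1/ȳ for this model.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative sample assumed to come from an Exponential distribution with rate theta
y = np.array([0.8, 1.3, 0.2, 2.1, 0.9, 1.7, 0.4, 1.1])

def neg_log_likelihood(theta):
    # L(theta) = prod_i theta * exp(-theta * y_i), so -log L = -n log(theta) + theta * sum(y_i)
    return -(len(y) * np.log(theta) - theta * y.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50), method="bounded")
print(res.x)          # numerical MLE of theta
print(1 / y.mean())   # closed-form MLE for this model, for comparison
```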
3 Statistical Intervals Based On a Single Sample
Confidence Interval for Proportion
• Whenever we estimate the SD of a sampling distribution, we call it a standard error
• For a sample proportion p̂, the standard error is
SE(p̂) = √(p̂q̂/n), where q̂ = 1 − p̂
• The 100(1 − α)% confidence interval for the population proportion p is
p̂ ± Zα/2 SE(p̂)
– 100(1−α)% of samples this size will produce confidence intervals that capture the true proportion
– We are 100(1 − α)% confident that the true proportion lies in our interval
• The extent of the interval on either side of p̂ is called the margin of error (ME):
ME = Zα/2 SE(p̂)
– Zα/2 is called the critical value and α is called the level of significance
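A small sketch (SciPy assumed; the counts y = 130, n = 400, and α = 0.05 are illustrative) computing this interval directly from the formulas above:

```python
import numpy as np
from scipy.stats import norm

y, n, alpha = 130, 400, 0.05               # successes, trials, significance level (illustrative)

p_hat = y / n
se = np.sqrt(p_hat * (1 - p_hat) / n)      # SE(p-hat) = sqrt(p-hat * q-hat / n)
z = norm.ppf(1 - alpha / 2)                # critical value Z_{alpha/2}
me = z * se                                # margin of error

print(p_hat - me, p_hat + me)              # 100(1 - alpha)% CI for p
```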
A Confidence Interval for the Mean
• 100(1 − α)% confidence interval for the population mean µ:
ȳ ± tn−1, α/2 SE(ȳ),
where the standard error of the mean SE(ȳ) = s/√n
• If n ≥ 30, then 100(1 − α)% confidence interval for the population mean µ:
ȳ ± Zα/2 SE(ȳ),
where the standard error of the mean SE(ȳ) = s/√n
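A matching sketch for the t-based interval (SciPy assumed; the sample values are illustrative):

```python
import numpy as np
from scipy.stats import t

y = np.array([23.1, 25.4, 22.8, 26.0, 24.3, 23.9, 25.1, 24.7])   # illustrative sample
alpha = 0.05

n, ybar, s = len(y), y.mean(), y.std(ddof=1)
se = s / np.sqrt(n)                          # SE(ybar) = s / sqrt(n)
crit = t.ppf(1 - alpha / 2, df=n - 1)        # critical value t_{n-1, alpha/2}

print(ybar - crit * se, ybar + crit * se)    # 100(1 - alpha)% CI for mu
```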
Confidence Interval for σ²
• Assuming a normal population, the 100(1 − α)% confidence interval for σ² is
((n − 1)s²/χ²_{α/2, n−1}, (n − 1)s²/χ²_{1−α/2, n−1})
4 Tests of Hypotheses Based on a Single Sample
Test of Hypothesis
• Statistical hypothesis: a statement about the numerical value of a population parameter
– E.g. population mean, population SD
• Null hypothesis (H0 ): some claim about the population parameter that the researcher wants to test
– Either reject or not reject
• Alternative hypothesis (Ha ): the values of a population parameter for which the researcher wants
to gather evidence to support
– E.g.
H0 : µ ≤ 24
Ha : µ > 24
• Test statistic: a sample statistic, computed from information provided in the sample
– Used to decide between the null and alternative hypotheses
• Type I error: the researcher rejects the null hypothesis when H0 is true
• Observed significance level (p-value): the probability, assuming that H0 is true, of observing a
value of the test statistic that is at least as contradictory to the null hypothesis, and supportive of the
alternative hypothesis, as the actual one computed from the sample data
Large-Sample α-Level Hypothesis Tests
• H0 : θ = θ0
• Ha: θ > θ0 (upper-tail alternative), θ < θ0 (lower-tail alternative), or θ ≠ θ0 (two-tailed alternative)
• H0: µ = µ0
• Ha: µ > µ0 (upper-tail alternative), µ < µ0 (lower-tail alternative), or µ ≠ µ0 (two-tailed alternative)
• Test statistic: t = (ȳ − µ0)/(S/√n),
where ȳ is the sample mean and S is the sample SD
• Rejection region: {t > tα,n−1} (upper-tail RR), {t < −tα,n−1} (lower-tail RR), or {|t| > tα/2,n−1} (two-tailed RR)
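A minimal sketch of this t test (SciPy assumed; the data and µ0 = 24 are illustrative), computed both from the formula and with SciPy's built-in test as a cross-check:

```python
import numpy as np
from scipy import stats

y = np.array([25.2, 26.1, 24.8, 27.3, 25.9, 26.5, 24.4, 26.8])   # illustrative sample
mu0 = 24                                                          # hypothesized mean under H0

n = len(y)
t_stat = (y.mean() - mu0) / (y.std(ddof=1) / np.sqrt(n))   # t = (ybar - mu0) / (s / sqrt(n))
p_value = 1 - stats.t.cdf(t_stat, df=n - 1)                 # upper-tail alternative Ha: mu > mu0

print(t_stat, p_value)
print(stats.ttest_1samp(y, mu0, alternative="greater"))     # built-in test for comparison
```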
5 Inferences Based on Two Samples
Comparing two population means: independent sampling – large-sample case
• Properties of the sampling distribution of y 1 − y 2
1. The mean of the sampling distribution of y 1 − y 2 is µ1 − µ2
– µ1 and µ2 are the means of the two populations
2. If the two samples are independent, then the SD of the sampling distribution is
σ_{ȳ1−ȳ2} = √(σ1²/n1 + σ2²/n2)
– σ1² and σ2² are the variances of the two populations being sampled
– n1 and n2 are the respective sample sizes
– σ_{ȳ1−ȳ2} is also referred to as the standard error of the statistic ȳ1 − ȳ2
3. By the CLT, the sampling distribution of y 1 − y 2 is approximately normal for large samples
• When σ1² and σ2² are known, the 100(1 − α)% CI for µ1 − µ2 is
(ȳ1 − ȳ2) ± Zα/2 √(σ1²/n1 + σ2²/n2)
• When σ1² and σ2² are unknown, the 100(1 − α)% CI for µ1 − µ2 is
(ȳ1 − ȳ2) ± Zα/2 √(s1²/n1 + s2²/n2)
• Tests of hypotheses for µ1 − µ2
– H0: µ1 − µ2 = D0
– Ha: µ1 − µ2 > D0 (upper-tail alternative), µ1 − µ2 < D0 (lower-tail alternative), or µ1 − µ2 ≠ D0 (two-tailed alternative)
• Small-sample case
– Assumptions
1. Independent samples
2. Samples are from normal distribution
3. σ12 = σ22
– Test statistic
T = (ȳ1 − ȳ2 − D0) / (Sp √(1/n1 + 1/n2)) ∼ t_{n1+n2−2},
where Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2) is the pooled sample variance
• Large-sample case
– Test statistic when σ1² and σ2² are known:
Zc = (ȳ1 − ȳ2 − D0) / √(σ1²/n1 + σ2²/n2) ∼ N(0, 1)
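A sketch of the small-sample pooled t test with D0 = 0 (SciPy assumed; both samples are illustrative):

```python
import numpy as np
from scipy import stats

y1 = np.array([12.1, 13.4, 11.8, 14.0, 12.7, 13.2])   # illustrative sample from population 1
y2 = np.array([10.9, 12.0, 11.5, 10.4, 12.3, 11.1])   # illustrative sample from population 2

n1, n2 = len(y1), len(y2)
sp2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2)  # pooled variance
t_stat = (y1.mean() - y2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n1 + n2 - 2))   # two-tailed alternative

print(t_stat, p_value)
print(stats.ttest_ind(y1, y2, equal_var=True))   # built-in pooled t test for comparison
```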
• Properties of the sampling distribution of p̂1 − p̂2
1. E(p̂1 − p̂2) = p1 − p2
2. If the two samples are independent, then V(p̂1 − p̂2) = p1(1 − p1)/n1 + p2(1 − p2)/n2
3. If the sample sizes n1 and n2 are large, the sampling distribution of p̂1 − p̂2 is approximately normal
• Assumptions and conditions when comparing proportions
1. Randomization condition: the data in each group is drawn independently and at random from
the target population
2. The (least important) 10% condition: the sample is less than 10% of the population
3. Independent group assumption: the two groups we are comparing are independent of each
other
4. Success/failure conditions: both groups are big enough so that at least 10 successes and at
least 10 failures have been observed in each group
• In the large-sample case, the 100(1 − α)% CI for p1 − p2 is
(p̂1 − p̂2) ± Zα/2 √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)
– Test statistic:
Zc = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2)),
where the pooled sample proportion is
p̂ = (n1p̂1 + n2p̂2)/(n1 + n2)
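A sketch of the two-proportion z test and interval (SciPy assumed; the counts are illustrative): the test uses the pooled p̂, while the confidence interval uses the unpooled standard error from above.

```python
import numpy as np
from scipy.stats import norm

y1, n1 = 72, 200      # successes and trials in group 1 (illustrative)
y2, n2 = 54, 210      # successes and trials in group 2 (illustrative)
alpha = 0.05

p1, p2 = y1 / n1, y2 / n2
p_pool = (y1 + y2) / (n1 + n2)                              # pooled p-hat under H0: p1 = p2

z_c = (p1 - p2) / np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
p_value = 2 * (1 - norm.cdf(abs(z_c)))                      # two-tailed alternative

se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)       # unpooled SE for the CI
me = norm.ppf(1 - alpha / 2) * se
print(z_c, p_value)
print((p1 - p2) - me, (p1 - p2) + me)                       # 100(1 - alpha)% CI for p1 - p2
```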
Paired Samples and Blocks: Paired t-Test
• Paired data
– The two results in each pair are dependent on each other
– Since we care about the difference, we can work with the differences alone and ignore the original columns
– Use simple one-sample t-test
– Sample size is the number of pairs
• Hypotheses: H0: µd = d0 versus Ha: µd > d0, µd < d0, or µd ≠ d0, where µd is the population mean difference
• Test statistic
t = (x̄d − d0)/(sd/√nd) ∼ t_{nd−1}
– x̄d is the sample mean difference
– sd is the sample SD of differences
– nd is the number of differences (i.e. number of pairs)
– Assumptions: the population of differences in test scores is approximately normally distributed.
The sample differences are randomly selected from the population differences
• Confidence interval: large sample
x̄d ± Zα/2 · sd/√nd
– Conditions required: a random sample of differences is selected from the target population of
differences, and that the sample size nd is large (i.e. nd ≥ 30)
• Confidence interval: small sample
x̄d ± tα/2 · sd/√nd
– tα/2 is based on nd − 1 degrees of freedom
– Conditions required: a random sample of differences is selected from the target population of
differences, and that the population of differences has a distribution that is approximately normal
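A sketch of the paired t test (SciPy assumed; the before/after values are illustrative), reducing the pairs to a single column of differences and reusing the one-sample machinery:

```python
import numpy as np
from scipy import stats

before = np.array([68, 72, 75, 70, 66, 74, 71, 69])    # illustrative paired measurements
after = np.array([71, 75, 74, 76, 70, 78, 72, 73])

d = after - before                                      # work with the differences only
n_d = len(d)
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n_d))      # H0: mu_d = 0
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n_d - 1))

print(t_stat, p_value)
print(stats.ttest_rel(after, before))                   # built-in paired t test for comparison
```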
6 Regression and Correlation
Deterministic Model
• Hypothesizes an exact relationship between variables
• E.g. y = f (x)
• Implies that y can always be determined exactly when the value of x is known
• The sum of squared vertical deviations from the points (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) to the line is
g(b0, b1) = Σ_{i=1}^n [yi − (b0 + b1xi)]²
• The point estimates of β0 and β1 , denoted by β̂0 and β̂1 , respectively, are called the least squares
estimates whose values minimize g(b0 , b1 )
• The estimated regression line or least squares regression line (LSRL) is the line whose equation
is
ŷ = β̂0 + β̂1 x
• The least squares estimate of the slope coefficient β1 of the true regression line is
b1 = β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
• The least squares estimate of the intercept β0 of the true regression line is
b0 = β̂0 = ȳ − β̂1 x̄
• Under the normality assumption of the simple linear regression model, β̂0 and β̂1 are the maximum
likelihood estimates
• Notation for sums
Sxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ)
Sxx = Σ_{i=1}^n (xi − x̄)²
Syy = Σ_{i=1}^n (yi − ȳ)²
• The residuals (estimated errors) e1, e2, . . . , en are the vertical deviations from the LSRL, i.e. the ith residual is
ei = yi − ŷi = yi − (β̂0 + β̂1xi) = (yi − ȳ) − β̂1(xi − x̄)
• The error sum of squares (or residual sum of squares), denoted by SSE, is
SSE = Σ(ei − ē)² = Σ ei² = Σ(yi − ŷi)²
• R2 is interpreted as the proportion of observed y variation that can be explained by the simple linear
regression model
• The closer R2 is to 1, the more successful the simple linear regression model is in explaining y variation
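A compact sketch (NumPy assumed; the six (x, y) pairs are illustrative) computing the least squares estimates, SSE, and R² directly from the formulas above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # illustrative data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean()) ** 2)

b1 = sxy / sxx                       # slope estimate beta1-hat
b0 = y.mean() - b1 * x.mean()        # intercept estimate beta0-hat

y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)       # error (residual) sum of squares
r2 = 1 - sse / syy                   # coefficient of determination

print(b0, b1, sse, r2)
```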
Decomposition of Total Sum of Squares
• The total sum of squares SST = Syy = Σ(yi − ȳ)² splits into the regression sum of squares SSR = Σ(ŷi − ȳ)² and the error sum of squares SSE; therefore
SST = SSR + SSE
• Properties of the estimator β̂1 under the model assumptions:
1. β̂1 is an unbiased estimator of β1, i.e. E(β̂1) = β1
2. The variance of β̂1 is
V(β̂1) = σ²_{β̂1} = σ²/Sxx, so σ_{β̂1} = σ/√Sxx
– σ can be replaced by its estimate σ̂
3. The estimator β̂1 has a normal distribution
– Because it is a linear function of independent normal RVs
• As a result, the assumptions of the simple linear regression model imply that
T = (β̂1 − β1)/S_{β̂1} ∼ tn−2
• A 100(1 − α)% confidence interval for the slope β1 of the true regression line is
β̂1 ± tn−2,α/2 · σ̂/√Sxx
– Test statistic:
T = (β̂1 − β1)/S_{β̂1} ∼ tn−2
• To estimate the mean response at x = x∗, use
ŷ = β̂0 + β̂1 x∗
1. ŷ is an unbiased estimator of E(y) = β0 + β1x∗
2. The variance of ŷ is
V(ŷ) = σ²_ŷ = σ²[1/n + (x∗ − x̄)²/Sxx]
The estimated variance of ŷ is
S²_ŷ = σ̂²[1/n + (x∗ − x̄)²/Sxx]
3. ŷ has a normal distribution, because it is a linear function of the yi s, which are normally distributed and independent
• Consequently, the variable
t = (ŷ − E[y])/Sŷ ∼ tn−2
• For predicting a new observation y at x = x∗:
– The variance of ŷ − y is
V[ŷ − y] = σ²[1 + 1/n + (x∗ − x̄)²/Sxx]
– The estimated variance of ŷ − y is
S²_{ŷ−y} = σ̂²[1 + 1/n + (x∗ − x̄)²/Sxx]
• The sample correlation coefficient for the n pairs (x1, y1), (x2, y2), . . . , (xn, yn) is
r = (1/(n − 1)) Σ_{i=1}^n ((xi − x̄)/sx)((yi − ȳ)/sy) = Sxy/√(Sxx·Syy)
• Properties of r
1. The value of r does not depend on which of the two variables is labelled x and which is labelled y
2. The value of r is independent of the units in which x and y are measured, i.e. r is unitless
3. The square of the sample correlation gives the value of the coefficient of determination that would
result from fitting the simple linear regression model, i.e. r2 = R2
4. −1 ≤ r ≤ 1
5. r = ±1 iff all (xi , yi ) pairs lie on a straight line
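Continuing the same illustrative data, a short sketch (NumPy/SciPy assumed) computing r and the t statistic for testing H0: β1 = 0 with S_{β̂1} = σ̂/√Sxx:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # illustrative data (as above)
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean()) ** 2)

r = sxy / np.sqrt(sxx * syy)                     # sample correlation coefficient
b1 = sxy / sxx
sse = syy - b1 * sxy                             # identity: SSE = Syy - beta1-hat * Sxy
sigma_hat = np.sqrt(sse / (n - 2))               # estimate of sigma
t_stat = b1 / (sigma_hat / np.sqrt(sxx))         # T = beta1-hat / S_{beta1-hat}
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 2))

print(r, t_stat, p_value)
```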
7 Analysis of Variance
The analysis of variance (ANOVA) is a collection of statistical procedures for the analysis of quantitative
responses
• The simplest ANOVA problem is referred to variously as a single-factor, single-classification, or one-way ANOVA and involves the analysis of data sampled from two or more numerical populations (i.e. distributions)
Factors are those variables whose effect on the response is of interest to the experimenter
• Also called independent variables
• Quantitative factors are measured on a numerical scale
Terminology
• The mathematical model for the data from a completely randomized design (CRD) with an
unequal number of replicates for each factor level is
yij = µ + τi + εij
where
– yij is the response for the jth experimental unit subject to the ith level of the treatment factor, i ∈ [1, t], j ∈ [1, ni]
– ni is the number of experimental units or replications in ith level of the treatment factor
– The experimental errors εij are mutually independent due to the randomization and are assumed to be normally distributed
– τi represents the treatment effect
– µ is the overall mean
• We could write the null hypothesis in terms of the treatment effects, where H0: τ1 = τ2 = · · · = τt
• Assumptions
– The t population or treatment distributions are all normal with the same variance σ 2 , i.e. the
yij s are independent and normally distributed with
E(yij ) = µi = µ + τi
V (yij ) = σ 2
• A measure of within-samples variation is the error sum of squares (SSE), given by
SSE = Σ_{i=1}^t Σ_{j=1}^{ni} (yij − ȳi.)² = SSTotal − SSTr
• The treatment and error mean squares are
MSTr = SSTr/(t − 1), MSE = SSE/(n − t)
• E(MSE) = σ²
• The ANOVA test statistic is F = MSTr/MSE, which has an F distribution with t − 1 and n − t df when H0 is true
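A sketch of the one-way ANOVA computations (SciPy assumed; the three treatment groups are illustrative), with SciPy's built-in test as a cross-check:

```python
import numpy as np
from scipy import stats

# Illustrative responses for t = 3 treatment levels with unequal replication
groups = [np.array([22.0, 24.1, 23.5, 25.0]),
          np.array([27.2, 26.8, 28.1, 27.5, 26.3]),
          np.array([21.0, 20.4, 22.2, 21.7])]

t_levels = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

sstr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between-treatment SS
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)              # within-treatment SS

mstr, mse = sstr / (t_levels - 1), sse / (n - t_levels)
f_stat = mstr / mse
p_value = 1 - stats.f.cdf(f_stat, t_levels - 1, n - t_levels)

print(f_stat, p_value)
print(stats.f_oneway(*groups))        # built-in one-way ANOVA for comparison
```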
• We consider the equal number of replications n0 = n1 = · · · = nt. For each i < j, form the interval
ȳi. − ȳj. ± Qα,t,n−t √(MSE/n0)
• Assumption: the t sample sizes n1 , n2 , . . . , nt are reasonably close to each other (i.e. mild imbalance)
• Let
dij = Qα,t,n−t √((MSE/2)(1/ni + 1/nj))
Then
ȳi. − ȳj. − dij ≤ µi − µj ≤ ȳi. − ȳj. + dij
8 Logistic Regression
Logit Function
p(x) = e^{β0+β1x} / (1 + e^{β0+β1x}) = [1 + exp(−β0 − β1x)]⁻¹
Odds
• Logistic regression means assuming that p(x) is related to x by the logit function
p(x)/(1 − p(x)) = exp(β0 + β1x)
Likelihood Function
• There are no analytical solutions for the MLEs β̂0 and β̂1
• The maximization process must be carried out using iterative numerical methods
• For large n, the MLE β̂1 has approximately a normal distribution, and the standardized variable (β̂1 − β1)/S_{β̂1} has approximately a standard normal distribution
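Since the MLEs have no closed form, here is a sketch of the iterative fit (NumPy/SciPy assumed; the binary data are illustrative), minimizing the negative log-likelihood numerically:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative binary responses y observed at predictor values x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

def neg_log_likelihood(beta):
    b0, b1 = beta
    eta = b0 + b1 * x
    # -log L = sum[ log(1 + exp(eta)) - y * eta ], a numerically stable form
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

res = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
b0_hat, b1_hat = res.x
print(b0_hat, b1_hat)     # numerical MLEs of beta0 and beta1
```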
9 Chi-Squared Tests (Extra)
A multinomial experiment satisfies the following conditions:
1. The experiment consists of a sequence of n trials for some fixed n
2. Each trial can result in one of the same k possible outcomes (aka categories)
3. The trials are independent
4. The probability that a trial results in category i is pi, which remains constant across trials
The parameters p1, . . . , pk must satisfy pi ≥ 0 and Σ pi = 1
• A generalization of a binomial experiment: each trial is allowed to result in more than two possible outcomes
Null hypothesis: the pi s are assigned some fixed values; alternative hypothesis: at least one of the pi s has a value different from that asserted by H0
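A sketch of the goodness-of-fit test for a fully specified H0 (SciPy assumed; the observed counts and null probabilities are illustrative):

```python
import numpy as np
from scipy import stats

observed = np.array([48, 35, 17])    # observed category counts N_i (illustrative)
p0 = np.array([0.5, 0.3, 0.2])       # category probabilities specified by H0
n = observed.sum()

expected = n * p0
chi2 = np.sum((observed - expected) ** 2 / expected)     # sum of (N_i - n p_i)^2 / (n p_i)
p_value = 1 - stats.chi2.cdf(chi2, df=len(p0) - 1)       # df = k - 1 when no parameters are estimated

print(chi2, p_value)
print(stats.chisquare(observed, f_exp=expected))         # built-in test for comparison
```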
Fisher’s Chi-Squared Theorem
• Under general regularity conditions on θ1, . . . , θm and the πi(θ)s, if θ1, . . . , θm are estimated by maximizing the multinomial expression, then the RV
χ² = Σ_{i=1}^k (Ni − np̂i)²/(np̂i) = Σ_{i=1}^k [Ni − nπi(θ̂)]²/(nπi(θ̂))
has approximately a chi-squared distribution with k − 1 − m df when H0 is true
10 Bayesian Estimation (Extra)
Prior Distribution
• A prior distribution for a parameter θ, denoted π(θ), is a probability distribution on the set of
possible values for θ
• If the possible values of θ form an interval I, then π(θ) is a pdf that must satisfy
∫_I π(θ) dθ = 1
• If θ is potentially any value in a discrete set D, then π(θ) is a pmf that must satisfy
Σ_{θ∈D} π(θ) = 1
Posterior Distribution
• Suppose X1, . . . , Xn have joint pdf f(x1, . . . , xn; θ) and the unknown parameter θ has been assigned a continuous prior distribution π(θ). Then the posterior distribution of θ, given the observations X1 = x1, . . . , Xn = xn, is
π(θ|x1, . . . , xn) = π(θ)f(x1, . . . , xn; θ) / ∫_{−∞}^{∞} π(θ)f(x1, . . . , xn; θ) dθ
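A minimal sketch of the posterior formula on a grid (NumPy/SciPy assumed; the Beta(2, 2) prior and the data of 7 successes in 10 Bernoulli trials are illustrative). For this conjugate setup the exact posterior is Beta(9, 5), which the grid approximation should match closely.

```python
import numpy as np
from scipy import stats

theta = np.linspace(0.001, 0.999, 999)             # grid over the interval I = (0, 1)
prior = stats.beta.pdf(theta, 2, 2)                # pi(theta)
likelihood = stats.binom.pmf(7, 10, theta)         # f(x1, ..., xn; theta) for 7 successes in 10 trials

unnormalized = prior * likelihood
# Divide by a Riemann-sum approximation of the integral over I
posterior = unnormalized / (unnormalized.sum() * (theta[1] - theta[0]))

print(theta[np.argmax(posterior)])                 # posterior mode, close to the Beta(9, 5) mode 8/12
```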