
STA248 Notes

Jenci Wei
Winter 2022

Contents
1 Statistics and Sampling Distributions

2 Point Estimation

3 Statistical Intervals Based on a Single Sample

4 Tests of Hypotheses Based on a Single Sample

5 Inferences Based on Two Samples

6 Regression and Correlation

7 Analysis of Variance

8 Logistic Regression

9 Chi-Squared Tests (Extra)

10 Bayesian Estimation (Extra)

1 Statistics and Sampling Distributions
Statistic: any quantity whose value can be calculated from sample data
• E.g. y, s²
Population parameter: an unknown numerical value
• We are interested to conduct a statistical inference about the population parameter
• E.g. µ, σ²
Population information
• Size of population: N
• Population mean: µ
• Population variance: σ²
• Population distribution: Y
Sample information
• Sample size: n
• Samples: y1 , y2 , . . . , yn
• Sample mean: y
• Sample variance: s²
• Mean of the sampling distribution of y: µy = E(y) = µ
• Standard deviation of the sampling distribution of y: σy = σ/√n
– Called the standard error of the mean
Central Limit Theorem
• Refinement of the law of large numbers
• For a large number (n ≥ 30) of iid RVs y1 , . . . , yn with finite variance, the average y approximately
has a normal distribution, no matter what the distribution of the yi is
• Let y1 , . . . , yn be iid RVs with E(yi ) = µ and V (yi ) = σ² < ∞. Define
Zn = (y − µ)/(σ/√n)
Then Zn follows the standard normal distribution for a large sample size n ≥ 30, i.e. Zn ∼ N (0, 1) for n ≥ 30
– If σ is unknown, then
Zn = (y − µ)/(s/√n) ∼ N (0, 1)
where s is the sample SD
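
A quick way to see the CLT in action is by simulation. The sketch below (Python with numpy; the exponential parent distribution and all numbers are illustrative choices, not from the notes) standardizes the means of many skewed samples and checks that they behave like N (0, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 10_000            # sample size and number of simulated samples

# Exponential(mean = 2) is skewed, yet the standardized sample mean is ~ N(0, 1)
mu, sigma = 2.0, 2.0            # for the exponential, mean and SD are both equal to the scale
samples = rng.exponential(scale=mu, size=(reps, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

print("mean of Z:", z.mean())                 # close to 0
print("sd of Z:  ", z.std())                  # close to 1
print("P(Z <= 1.96) ~", np.mean(z <= 1.96))   # close to 0.975
```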
The Sampling Distribution of the Sample Proportion
• Consider an event A in the sample space of some experiment with p = P (A). Let y be the number of
times A occurs when the experiment is repeated n independent times, and define the sample proportion
p̂ = y/n. Then

1. E(p̂) = p
2. V (p̂) = p(1 − p)/n and σp̂ = √(p(1 − p)/n)
3. As n increases, the distribution of p̂ approaches a normal distribution
– p̂ is approximately normal, provided that np ≥ 10 and np(1 − p) ≥ 10
Gosset’s Theorem

• If y1 , . . . , yn is a random sample from a N (µ, σ) distribution, then the RV
t = (y − µ)/(s/√n)
has the t distribution with n − 1 degrees of freedom, i.e. tn−1


Chi-Squared Distribution
• Let y1 , . . . , yn be a random sample from a normal distribution with mean µ and variance σ². Then
(n − 1)s²/σ² = (1/σ²) Σ_{i=1}^n (yi − y)²
has a χ² distribution with n − 1 degrees of freedom (df)


F Distribution

• Let W1 and W2 be independent χ²-distributed RVs with ν1 and ν2 df, respectively. Then
F = (W1 /ν1 )/(W2 /ν2 )
has an F distribution with ν1 numerator degrees of freedom and ν2 denominator degrees of freedom

2 Point Estimation
An estimator is a rule, often expressed as a formula, that tells how to calculate the value of an estimate
based on the measurements contained in a sample
• E.g. the sample mean y = (1/n) Σ_{i=1}^n yi is one possible point estimator of the population mean µ

The Bias and Mean Square Error of Point Estimators


• Let θ̂ be a point estimator for a parameter θ. Then θ̂ is an unbiased estimator if E(θ̂) = θ

– Otherwise θ̂ is biased
• The bias of a point estimator θ̂ is B(θ̂) = E(θ̂) − θ

• The mean square error of a point estimator θ̂ is
MSE(θ̂) = E[(θ̂ − θ)²] = V (θ̂) + B(θ̂)²

Evaluating the Goodness of a Point Estimator


• The error of estimation ε is the distance between an estimator and its target parameter, i.e. ε = |θ̂ − θ|

Confidence Intervals
• An interval estimator is a rule specifying the method for using the sample measurements to calculate
two numbers that form the endpoints of the interval
1. We want the interval to contain the target parameter θ
2. We want the interval to be narrow

• Interval estimators are also called confidence intervals


– The upper and lower endpoints of a confidence interval are called the upper and lower confi-
dence limits, respectively
– Suppose that θ̂L and θ̂U are the (random) lower and upper confidence limits, respectively, for a
parameter θ. Then if
P (θ̂L ≤ θ ≤ θ̂U ) = 1 − α
the probability 1 − α is the confidence coefficient
Large-Sample Confidence Intervals
• The endpoints for a 100(1 − α)% confidence interval for θ are given by

θ̂L = θ̂ − zα/2 σθ̂


θ̂U = θ̂ + zα/2 σθ̂

Relative Efficiency
• Given two unbiased estimators θ̂1 and θ̂2 of a parameter θ, the efficiency of θ̂1 relative to θ̂2 , denoted
eff(θ̂1 , θ̂2 ), is the ratio
eff(θ̂1 , θ̂2 ) = V (θ̂2 )/V (θ̂1 )

Consistency
• An unbiased estimator θ̂n for θ is a consistent estimator of θ if

lim_{n→∞} V (θ̂n ) = 0

Likelihood Function
• Let y1 , . . . , yn be sample observations taken on corresponding RVs Y1 , . . . , Yn whose distributions de-
pend on a parameter θ. If Y1 , . . . , Yn are discrete RVs, the likelihood of the sample, L(y1 , . . . , yn |θ),
is defined to be the joint probability of y1 , . . . , yn .
– If Y1 , . . . , Yn are continuous RVs, the likelihood L(y1 , . . . , yn |θ) is the joint density evaluated at y1 , . . . , yn
The Method of Moments

• The kth moment of a RV, taken about the origin, is
µ′k = E(Y^k )
The corresponding kth sample moment is the average
m′k = (1/n) Σ_{i=1}^n Yi^k

• The method of moments is based on the idea that sample moments should provide good estimates of
the corresponding population moments
The Method of Maximum Likelihood
• Suppose that the likelihood function depends on k parameters θ1 , . . . , θk . Choose the estimates of those
parameters that maximize the likelihood L(y1 , . . . , yn |θ1 , . . . , θk )
• The likelihood function is a function of the parameters θ1 , . . . , θk
– We sometimes write the likelihood function as L(θ1 , . . . , θk )
• Maximum likelihood estimators are referred to as MLEs
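
As a small illustration of maximum likelihood in practice (a sketch on simulated exponential data; the closed-form answer λ̂ = 1/ȳ is used only as a cross-check), the log-likelihood can be maximized numerically:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=50)    # simulated data; true rate lambda = 0.5

def neg_log_likelihood(lam):
    # L(lambda) = prod lam * exp(-lam * y_i); minimize the negative log-likelihood
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
print("numerical MLE:      ", res.x)
print("closed form 1/ybar: ", 1 / y.mean())   # should agree closely
```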

3 Statistical Intervals Based on a Single Sample
Confidence Interval for Proportion
• Whenever we estimate the SD of a sampling distribution, we call it a standard error
• For a sample proportion p̂, the standard error is
SE(p̂) = √(p̂q̂/n)

• 100(1 − α)% confidence interval for the population proportion p is

p̂ ± Zα/2 SE(p̂)

– 100(1−α)% of samples this size will produce confidence intervals that capture the true proportion
– We are 100(1 − α)% confident that the true proportion lies in our interval
• The extent of the interval on either side of p̂ is called the margin of error (ME):

ME = Zα/2 SE(p̂)

– Zα/2 is called the critical value and α is called the level of significance
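
A minimal sketch of this interval in Python (the counts are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

y, n = 130, 400                 # hypothetical: 130 successes in 400 trials
alpha = 0.05

p_hat = y / n
se = np.sqrt(p_hat * (1 - p_hat) / n)   # SE(p_hat)
z = norm.ppf(1 - alpha / 2)             # critical value Z_{alpha/2}
me = z * se                             # margin of error

print(f"p_hat = {p_hat:.3f}, 95% CI = ({p_hat - me:.3f}, {p_hat + me:.3f})")
```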
A Confidence Interval for the Mean
• 100(1 − α)% confidence interval for the population mean µ:

y ± tn−1,α/2 SE(y)

where the standard error of the mean SE(y) = s/√n
• If n ≥ 30, then 100(1 − α)% confidence interval for the population mean µ:

y ± Zα/2 SE(y)

where the standard error of the mean SE(y) = s/√n
Confidence Interval for σ 2

• 100(1 − α)% confidence interval for the population variance σ²:
((n − 1)S²/χ²α/2,n−1 , (n − 1)S²/χ²1−α/2,n−1 )
where the χ² critical values have n − 1 degrees of freedom
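
A short sketch of the χ²-based interval on hypothetical data; note that χ²α/2 denotes the upper-tail critical value, so it appears in the lower endpoint:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
y = rng.normal(loc=10, scale=3, size=25)    # hypothetical normal sample
n, alpha = len(y), 0.05
s2 = y.var(ddof=1)                          # sample variance S^2

lower = (n - 1) * s2 / chi2.ppf(1 - alpha / 2, df=n - 1)   # divide by chi2_{alpha/2, n-1}
upper = (n - 1) * s2 / chi2.ppf(alpha / 2, df=n - 1)       # divide by chi2_{1-alpha/2, n-1}
print(f"S^2 = {s2:.2f}, 95% CI for sigma^2: ({lower:.2f}, {upper:.2f})")
```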

4 Tests of Hypotheses Based on a Single Sample
Test of Hypothesis
• Statistical hypothesis: a statement about the numerical value of a population parameter
– E.g. population mean, population SD
• Null hypothesis (H0 ): some claim about the population parameter that the researcher wants to test
– Either reject or not reject
• Alternative hypothesis (Ha ): the values of a population parameter for which the researcher wants
to gather evidence to support
– E.g.

H0 : µ ≤ 24
Ha : µ > 24

• Test statistic: a sample statistic, computed from information provided in the sample
– Used to decide between the null and alternative hypotheses
• Type I error: the researcher rejects the null hypothesis when H0 is true

α = P (Type I error) = P (Reject H0 |H0 )

Value of α is the level of the test


• Rejection region: the set of possible values of the test statistic for which we could reject H0
• Type II error: the researcher accepts the null hypothesis when H0 is false

β = P (Type II error) = P (Do not reject H0 |¬H0 )

• Observed significance level (p-value): the probability, assuming that H0 is true, of observing a
value of the test statistic that is at least as contradictory to the null hypothesis, and supportive of the
alternative hypothesis, as the actual one computed from the sample data
Large-Sample α-Level Hypothesis Tests
• H0 : θ = θ0

• Ha : θ > θ0 (upper-tail alternative), θ < θ0 (lower-tail alternative), or θ ≠ θ0 (two-tailed alternative)
• Test statistic: Z = (θ̂ − θ0 )/σθ̂
• Rejection region: {z > zα } (upper-tail RR), {z < −zα } (lower-tail RR), or {|z| > zα/2 } (two-tailed RR)
Small-Sample Test for µ
• Assumptions: Y1 , . . . , Yn constitute a random sample from a normal distribution with E(Yi ) = µ
• H0 : µ = µ0


• Ha : µ > µ0 (upper-tail alternative), µ < µ0 (lower-tail alternative), or µ ≠ µ0 (two-tailed alternative)
• Test statistic: t = (Y − µ0 )/(S/√n), where Y is the sample mean and S is the sample SD
• Rejection region: {t > tα,n−1 } (upper-tail RR), {t < −tα,n−1 } (lower-tail RR), or {|t| > tα/2,n−1 } (two-tailed RR)
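
A sketch of the small-sample t test on made-up data, computed both from the formula and with scipy's built-in test (the alternative argument needs a reasonably recent scipy):

```python
import numpy as np
from scipy import stats

y = np.array([25.1, 23.8, 26.4, 24.9, 25.7, 26.1, 24.2, 25.5])  # hypothetical sample
mu0, alpha = 24, 0.05
n = len(y)

t_stat = (y.mean() - mu0) / (y.std(ddof=1) / np.sqrt(n))
t_crit = stats.t.ppf(1 - alpha, df=n - 1)       # upper-tail rejection region {t > t_crit}
p_value = stats.t.sf(t_stat, df=n - 1)          # upper-tail p-value

print(f"t = {t_stat:.3f}, critical value = {t_crit:.3f}, p-value = {p_value:.4f}")

# scipy's built-in version (alternative='greater' gives the upper-tail test)
print(stats.ttest_1samp(y, popmean=mu0, alternative="greater"))
```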

Test of Hypothesis Concerning a Population Variance


• Assumptions: Y1 , . . . , Yn constitute a random sample from a normal distribution with E(Yi ) = µ and
V (Yi ) = σ²
• H0 : σ² = σ0²
• Ha : σ² > σ0² (upper-tail alternative), σ² < σ0² (lower-tail alternative), or σ² ≠ σ0² (two-tailed alternative)
• Test statistic: χ² = (n − 1)S²/σ0²
• Rejection region: {χ² > χ²α,n−1 } (upper-tail RR), {χ² < χ²1−α,n−1 } (lower-tail RR), or {χ² > χ²α/2,n−1 or χ² < χ²1−α/2,n−1 } (two-tailed RR)

Test of Hypothesis: σ1² = σ2²

• Assumptions: independent samples from normal populations
• H0 : σ1² = σ2²
• Ha : σ1² > σ2²
• Test statistic: F = S1²/S2²
• Rejection region: F > Fα , where Fα is chosen so that P (F > Fα ) = α when F has ν1 = n1 − 1
numerator df and ν2 = n2 − 1 denominator df

5 Inferences Based on Two Samples
Comparing two population means: independent sampling – large-sample case
• Properties of the sampling distribution of y 1 − y 2
1. The mean of the sampling distribution of y 1 − y 2 is µ1 − µ2
– µ1 and µ2 are the means of the two populations
2. If the two samples are independent, then the SD of the sampling distribution is
σy1 −y2 = √(σ1²/n1 + σ2²/n2 )
– σ1² and σ2² are the variances of the two populations being sampled
– n1 and n2 are the respective sample sizes
– σy1 −y2 is also referred to as the standard error of the statistic y 1 − y 2
3. By the CLT, the sampling distribution of y 1 − y 2 is approximately normal for large samples
• When σ1² and σ2² are known, the 100(1 − α)% CI for µ1 − µ2 is
(y 1 − y 2 ) ± Zα/2 √(σ1²/n1 + σ2²/n2 )

• When σ1² and σ2² are unknown, the 100(1 − α)% CI for µ1 − µ2 is
(y 1 − y 2 ) ± Zα/2 √(s1²/n1 + s2²/n2 )

Comparing two population means: independent sampling – small-sample case


• Assumptions
1. Both sampled populations are approximately normally distributed
2. The two sampled populations have equal variances (i.e. σ1² = σ2² = σ²)
3. Random samples are selected independently of each other
• 100(1 − α)% CI for µ1 − µ2 is
(y 1 − y 2 ) ± tα/2 √(Sp² (1/n1 + 1/n2 ))

– Sp² is the pooled sample variance, where
Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n1 + n2 − 2)
– tα/2 is based on n1 + n2 − 2 degrees of freedom


Comparing two population means: independent sampling – hypothesis testing
• Hypotheses
– H0 : µ1 − µ2 = D0


– Ha : µ1 − µ2 > D0 (upper-tail alternative), µ1 − µ2 < D0 (lower-tail alternative), or µ1 − µ2 ≠ D0 (two-tailed alternative)

• Small-sample case
– Assumptions
1. Independent samples
2. Samples are from normal distribution
3. σ1² = σ2²
– Test statistic
T = (y 1 − y 2 − D0 )/(Sp √(1/n1 + 1/n2 )) ∼ tn1 +n2 −2

• Large-sample case
– Test statistic when σ1² and σ2² are known:
Zc = (y 1 − y 2 − D0 )/√(σ1²/n1 + σ2²/n2 ) ∼ N (0, 1)
– Test statistic when σ1² and σ2² are unknown:
Zc = (y 1 − y 2 − D0 )/√(s1²/n1 + s2²/n2 ) ∼ N (0, 1)
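
A sketch of the pooled small-sample test on hypothetical data; scipy's ttest_ind with equal_var=True uses the same pooled statistic:

```python
import numpy as np
from scipy import stats

y1 = np.array([12.1, 11.8, 13.0, 12.4, 12.9, 11.5])   # hypothetical sample 1
y2 = np.array([10.9, 11.2, 10.5, 11.8, 10.7, 11.1])   # hypothetical sample 2
n1, n2, d0 = len(y1), len(y2), 0.0

sp2 = ((n1 - 1) * y1.var(ddof=1) + (n2 - 1) * y2.var(ddof=1)) / (n1 + n2 - 2)
t_stat = (y1.mean() - y2.mean() - d0) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=n1 + n2 - 2)
print(f"pooled t = {t_stat:.3f}, two-sided p = {p_two_sided:.4f}")

# Cross-check with scipy (pooled variance corresponds to equal_var=True)
print(stats.ttest_ind(y1, y2, equal_var=True))
```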

Comparing two population proportions: independent sampling


• Properties of the sampling distribution of p̂1 − p̂2
1. The mean of the sampling distribution of p̂1 − p̂2 is p1 − p2 , i.e.

E(p̂1 − p̂2 ) = p1 − p2

– p1 and p2 are the proportions of the two populations


– p̂1 − p̂2 is an unbiased estimator of p1 − p2
2. The SD of the sampling distribution of p̂1 − p̂2 is
σp̂1 −p̂2 = √(p1 (1 − p1 )/n1 + p2 (1 − p2 )/n2 )

3. If the sample sizes n1 and n2 are large, the sampling distribution of p̂1 − p̂2 is approximately
normal
• Assumptions and conditions when comparing proportions
1. Randomization condition: the data in each group is drawn independently and at random from
the target population
2. The (least important) 10% condition: the sample is less than 10% of the population
3. Independent group assumption: the two groups we are comparing are independent of each
other
4. Success/failure conditions: both groups are big enough so that at least 10 successes and at
least 10 failures have been observed in each group

• In the large-sample case, the 100(1 − α)% CI for p1 − p2 is
(p̂1 − p̂2 ) ± Zα/2 √(p̂1 (1 − p̂1 )/n1 + p̂2 (1 − p̂2 )/n2 )

Comparing two population proportions: independent sampling – hypothesis testing

• Large-sample test of hypothesis about p1 − p2 : normal statistic:


– H0 : p1 − p2 = 0

– Ha : p1 − p2 > 0 (upper-tail alternative), p1 − p2 < 0 (lower-tail alternative), or p1 − p2 ≠ 0 (two-tailed alternative)

– Test statistic:
Zc = (p̂1 − p̂2 )/√(p̂(1 − p̂)(1/n1 + 1/n2 ))
where
p̂ = (n1 p̂1 + n2 p̂2 )/(n1 + n2 )
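
A sketch of the two-proportion z test with made-up counts:

```python
import numpy as np
from scipy.stats import norm

y1, n1 = 48, 200      # hypothetical successes / trials in group 1
y2, n2 = 30, 180      # hypothetical successes / trials in group 2

p1, p2 = y1 / n1, y2 / n2
p_pool = (y1 + y2) / (n1 + n2)          # pooled proportion under H0: p1 = p2
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_two_sided = 2 * norm.sf(abs(z))

print(f"z = {z:.3f}, two-sided p-value = {p_two_sided:.4f}")
```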
Paired Samples and Blocks: Paired t-Test
• Paired data
– The two results are dependent on each other
– Since we care about the difference, we can look only at the differences and ignore the original
columns
– Use simple one-sample t-test
– Sample size is the number of pairs
• Hypotheses

– We make inferences about the mean of the population of differences, µd = µ1 − µ2


– H0 : µd = d0

– Ha : µd > d0 (upper-tail alternative), µd < d0 (lower-tail alternative), or µd ≠ d0 (two-tailed alternative)

• Test statistic
t = (xd − d0 )/(sd /√nd ) ∼ tnd −1
– xd is the sample mean difference
– sd is the sample SD of differences
– nd is the number of differences (i.e. number of pairs)
– Assumptions: the population of differences in test scores is approximately normally distributed.
The sample differences are randomly selected from the population differences
• Confidence interval: large sample
xd ± Zα/2 sd /√nd

– Conditions required: a random sample of differences is selected from the target population of
differences, and that the sample size nd is large (i.e. nd ≥ 30)
• Confidence interval: small sample
xd ± tα/2 sd /√nd
– tα/2 is based on nd − 1 degrees of freedom
– Conditions required: a random sample of differences is selected from the target population of
differences, and that the population of differences has a distribution that is approximately normal
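
A sketch of the paired t test on hypothetical before/after scores, computed from the differences and cross-checked with scipy's ttest_rel:

```python
import numpy as np
from scipy import stats

before = np.array([72, 68, 75, 80, 66, 71, 77, 69])   # hypothetical paired scores
after  = np.array([75, 70, 74, 84, 70, 73, 80, 72])

d = after - before                     # work with the differences only
nd = len(d)
t_stat = (d.mean() - 0) / (d.std(ddof=1) / np.sqrt(nd))
p_two_sided = 2 * stats.t.sf(abs(t_stat), df=nd - 1)
print(f"t = {t_stat:.3f}, p = {p_two_sided:.4f}")

# Equivalent built-in test on the pairs themselves
print(stats.ttest_rel(after, before))
```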

6 Regression and Correlation
Deterministic Model
• Hypothesizes an exact relationship between variables
• E.g. y = f (x)
• Implies that y can always be determined exactly when the value of x is known

• No allowance for error


Probabilistic Model
• Includes both a deterministic component and a random error component

• E.g. y = f (x) + random error


Simple Linear Regression
y = β0 + β1 x + ε
• The deterministic portion of the model graphs as a straight line

• y is the dependent or response variable


• x is the independent or predictor variable
• β0 + β1 x is the deterministic component
• ε is the random error component, which is assumed to follow a N (0, σ) distribution

• β0 is the y-intercept of the line


• β1 is the slope of the line
Estimating Model Parameters

• Let (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) be the observed n-pairs


• The vertical deviation of the point (xi , yi ) from a line y = b0 + b1 x is

height of point − height of line = yi − (b0 + b1 xi )

• The sum of squared vertical deviations from the points (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) to the line is
g(b0 , b1 ) = Σ_{i=1}^n [yi − (b0 + b1 xi )]²

• The point estimates of β0 and β1 , denoted by β̂0 and β̂1 , respectively, are called the least squares
estimates whose values minimize g(b0 , b1 )
• The estimated regression line or least squares regression line (LSRL) is the line whose equation
is
y = β̂0 + β̂1 x

• The least squares estimate of the slope coefficient β1 of the true regression line is
b1 = β̂1 = Σ(xi − x)(yi − y) / Σ(xi − x)²

• The least squares estimate of the intercept β0 of the true regression line is

b0 = β̂0 = y − β̂1 x

• Under the normality assumption of the simple linear regression model, β̂0 and β̂1 are the maximum
likelihood estimates
• Notations for sums
Sxy = Σ_{i=1}^n (xi − x)(yi − y)
Sxx = Σ_{i=1}^n (xi − x)²
Syy = Σ_{i=1}^n (yi − y)²
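
A sketch computing the least squares estimates directly from Sxy and Sxx on made-up (x, y) pairs, with scipy.stats.linregress as a cross-check:

```python
import numpy as np
from scipy import stats

# Hypothetical (x, y) pairs
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)

b1 = Sxy / Sxx                 # least squares slope estimate
b0 = y.mean() - b1 * x.mean()  # least squares intercept estimate
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")

# Cross-check with scipy's built-in simple linear regression
res = stats.linregress(x, y)
print(res.intercept, res.slope)
```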

Residuals and Estimating σ


• The fitted (or predicted) values ŷ1 , ŷ2 , . . . , ŷn are obtained by successively substituting the x values
x1 , x2 , . . . , xn into the equation of the LSRL, i.e. the ith fitted value is
ŷi = β̂0 + β̂1 xi = y + β̂1 (xi − x)

• The residuals (estimated error) e1 , e2 , . . . , en are the vertical deviations from the LSRL, i.e. the ith
residual is
ei = yi − ŷi = yi − (β̂0 + β̂1 xi ) = (yi − y) − β̂1 (xi − x)

• The error sum of squares (or residual sum of squares), denoted by SSE, is
SSE = Σ(ei − e)² = Σ ei² = Σ(yi − ŷi )²

• The least squares estimate of σ² is
σ̂² = SSE/(n − 2)
• The residual standard deviation is an estimate of σ given by
σ̂ = √(SSE/(n − 2))

• SSE can be computed by
SSE = Syy − Sxy²/Sxx
Coefficient of Determination
• Total sum of squares: a quantitative measure of the total amount of variation in the observed y values
SST = Σ(yi − y)² = Syy

• The coefficient of determination, denoted by R², is given by
R² = 1 − SSE/SST

• R2 is interpreted as the proportion of observed y variation that can be explained by the simple linear
regression model
• The closer R2 is to 1, the more successful the simple linear regression model is in explaining y variation
Decomposition of Total Sum of Squares

• The total sum of squares can be decomposed by


SST = Σ(yi − y)²
    = Σ[(yi − ŷi ) + (ŷi − y)]²
    = Σ(yi − ŷi )² + Σ(ŷi − y)²

• The regression sum of squares is
SSR = Σ(ŷi − y)²

• Therefore
SST = SSR + SSE

• The coefficient of determination can be rewritten as
R² = SSR/SST

Inferences About the Regression Coefficient β1

• Assumptions and conditions


1. Linearity assumption: the straight enough condition is satisfied if a scatterplot looks straight
2. Independence assumption: the errors in the true underlying regression model (i.e. the εs) must
be mutually independent
– No way of checking whether this holds
3. Equal variance assumption: the variability of y should be about the same for all values of x
4. Normal population assumption: the errors around the idealized regression line at each value
of x follows a normal model
– The response y is normally distributed at any x value

• Properties of the estimated slope


1. The mean value of β̂1 is E(β̂1 ) = β1
– β̂1 is an unbiased estimator of β1
2. The variance and SD of β̂1 are
V (β̂1 ) = σβ̂1 ² = σ²/Sxx
σβ̂1 = σ/√Sxx
– σ can be replaced by its estimate σ̂
3. The estimator β̂1 has a normal distribution
– Because it is a linear function of independent normal RVs

• As a result, the assumptions of the simple linear regression model imply that

T = (β̂1 − β1 )/Sβ̂1 ∼ tn−2

• A 100(1 − α)% confidence interval for the slope β1 of the true regression line is
β̂1 ± tn−2,α/2 σ̂/√Sxx

• Hypothesis testing procedures


– H0 : β1 = β10

– Ha : β1 > β10 (upper-tail alternative), β1 < β10 (lower-tail alternative), or β1 ≠ β10 (two-tailed alternative)
– Test statistic:
T = (β̂1 − β10 )/Sβ̂1 ∼ tn−2
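
A sketch of slope inference on the same kind of made-up data: it computes σ̂, the standard error of β̂1, the t statistic for H0 : β1 = 0, and the confidence interval:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # hypothetical data
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n, alpha = len(x), 0.05

Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))   # residual standard deviation
se_b1 = sigma_hat / np.sqrt(Sxx)                    # estimated SD of beta1-hat

t_stat = (b1 - 0) / se_b1                           # test of H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}, CI = {ci}")
```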

Inferences for the (Mean) Response


• We want to choose an estimator of the mean y value using the least squares prediction equation

ŷ = β̂0 + β̂1 x∗

where x∗ is some fixed value of x


• Substituting β̂0 and β̂1 :
ŷ = Σ_{i=1}^n [1/n + (x∗ − x)(xi − x)/Sxx ] yi = Σ_{i=1}^n di yi
where di = 1/n + (x∗ − x)(xi − x)/Sxx

• The coefficients d1 , . . . , dn involve the xi s and x∗ , all of which are fixed


• Sampling distribution of ŷ
1. The mean value of ŷ is

E[ŷ] = E[β̂0 + β̂1 x∗ ] = E[β0 + β1 x∗ ] = E[y]

2. The variance of ŷ is
V (ŷ) = σŷ² = σ² [1/n + (x∗ − x)²/Sxx ]
The estimated variance of ŷ is
Sŷ² = σ̂² [1/n + (x∗ − x)²/Sxx ]
3. ŷ has a normal distribution, because it is a linear function of the yi s, which are normally distributed
and independent

• Consequently, the variable
t = (ŷ − E[y])/Sŷ ∼ tn−2

Prediction Interval for a Future Value of y


• Prediction error
– The prediction error is
ŷ − y = ŷ − (β0 + β1 x∗ + ε)

– The variance of ŷ − y is
V [ŷ − y] = σ² [1 + 1/n + (x∗ − x)²/Sxx ]
– The estimated variance of ŷ − y is
S²ŷ−y = σ̂² [1 + 1/n + (x∗ − x)²/Sxx ]

– Consequently, the variable


t = (ŷ − y)/Sŷ−y ∼ tn−2
Correlation

• The sample correlation coefficient for the n pairs (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) is
r = (1/(n − 1)) Σ_{i=1}^n [(xi − x)/Sx ][(yi − y)/Sy ] = Sxy /√(Sxx Syy )

• Properties of r
1. The value of r does not depend on which of the two variables is labelled x and which is labelled y
2. The value of r is independent of the units in which x and y are measured, i.e. r is unitless
3. The square of the sample correlation gives the value of the coefficient of determination that would
result from fitting the simple linear regression model, i.e. r² = R²
4. −1 ≤ r ≤ 1
5. r = ±1 iff all (xi , yi ) pairs lie on a straight line

7 Analysis of Variance
The analysis of variance (ANOVA) is a collection of statistical procedures for the analysis of quantitative
responses
• The simplest ANOVA problem is referred to variously as a single-factor, single-classification, or one-
way ANOVA and involves the analysis of data sampled from two or more numerical populations (i.e.
distributions)

The response variable is the variable of interest to be measured in the experiment


• Also called dependent variable
• Typically quantitative

Factors are those variables whose effect on the response is of interest to the experimenter
• Also called independent variables
• Quantitative factors are measured on a numerical scale
Terminology

• Factor level: values of the factor utilized in the experiment


• Treatment: factor level combinations utilized in the experiment
• Experimental unit: object on which the response and factors are observed or measured
• Designed study: an experiment in which the analyst controls the specification of the treatments and
the method of assigning the experimental units to each treatment
• Observational study: an experiment in which the analyst simply observes the treatments and the
response on a sample of experimental units
Single-Factor ANOVA

• Focuses on comparison of 2 or more populations


• t is the number of populations/treatments being compared
• µi is the mean of population i (or the true average response when treatment i is applied)

• The hypotheses of interests are


– H0 : µ1 = µ2 = · · · = µt
– Ha : at least 2 of the µi s are different
Single-Factor ANOVA Model

• The mathematical model for the data from a completely randomized design (CRD) with an
unequal number of replicates for each factor level is

yij = µ + τi + ij

where

– yij is the response for the jth experimental unit subject to the ith level of the treatment factor,
i ∈ [1, t], j ∈ [1, ni ]
– ni is the number of experimental units or replications in ith level of the treatment factor

– The experimental errors, εij , are mutually independent due to the randomization and are
assumed to be normally distributed
– τi represents the treatment effect
– µ is the overall mean

• We could write the null hypothesis in terms of the treatment effects, where H0 : τ1 = τ2 = · · · = τt
• Assumptions
– The t population or treatment distributions are all normal with the same variance σ², i.e. the
yij s are independent and normally distributed with
E(yij ) = µi = µ + τi
V (yij ) = σ²

Single-Factor ANOVA Notations


• The sample mean of the data in the ith level of the treatment factor is represented by
y i. = yi. /ni

• The grand mean is
y .. = y.. /n
where
– n = Σ_{i=1}^t ni
– yi. = Σ_{j=1}^{ni} yij
– y.. = Σ_{i=1}^t Σ_{j=1}^{ni} yij

• A measure of between-samples variation is the treatment sum of squares (SSTr), given by
SSTr = Σ_{i=1}^t Σ_{j=1}^{ni} (y i. − y .. )²
     = Σ_{i=1}^t ni (y i. − y .. )²
     = Σ_{i=1}^t yi.²/ni − y..²/n

• The total sum of squares is
SSTotal = Σ_{i=1}^t Σ_{j=1}^{ni} (yij − y .. )²
        = Σ_{i=1}^t Σ_{j=1}^{ni} yij² − y..²/n

• A measure of within-samples variation is the error sum of squares (SSE), given by
SSE = Σ_{i=1}^t Σ_{j=1}^{ni} (yij − y i. )²
    = SSTotal − SSTr

Single-Factor ANOVA Result


• When the ANOVA assumptions are satisfied:
1. SSE and SSTr are independent RVs
2. SSE/σ² ∼ χ² with df = n − t
3. When H0 is true, SSTr/σ² ∼ χ² with df = t − 1
• The mean square for treatments (MSTr) and the mean square for error (MSE) are
MSTr = SSTr/(t − 1)     MSE = SSE/(n − t)

• When the ANOVA assumptions are satisfied,
E(MSE) = σ²
that is, MSE is an unbiased estimator for σ²

• Moreover, when H0 is true,
E(MSTr) = σ²
in this case, MSTr is an unbiased estimator for σ²
• When the ANOVA assumptions are satisfied and H0 is true, the test statistic f = MSTr/MSE has an F
distribution with t − 1 numerator df and n − t denominator df
• Rejection region for level α test: f > Fα,t−1,n−t
• p-value: area under Ft−1,n−t curve to the right of f
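
A sketch of the single-factor ANOVA computations on three hypothetical treatment groups, with scipy.stats.f_oneway as a cross-check:

```python
import numpy as np
from scipy import stats

# Hypothetical responses for t = 3 treatments
groups = [np.array([5.1, 4.8, 5.6, 5.3]),
          np.array([6.0, 6.4, 5.9, 6.2]),
          np.array([4.2, 4.6, 4.4, 4.9])]

t = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ss_tr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between-samples
ss_e = sum(((g - g.mean()) ** 2).sum() for g in groups)              # within-samples

ms_tr, ms_e = ss_tr / (t - 1), ss_e / (n - t)
f_stat = ms_tr / ms_e
p_value = stats.f.sf(f_stat, t - 1, n - t)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# scipy's one-way ANOVA gives the same F and p-value
print(stats.f_oneway(*groups))
```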
Multiple Comparisons in ANOVA
• When H0 is rejected, we want to know which of the µi s differ from each other
• Let Z1 , Z2 , . . . , Zm be m independent standard normal RVs, and let W be a χ² RV with ν df, independent of the
Zi s. Then the distribution of
Q = max |Zi − Zj | / √(W/ν) = (max Zi − min Zi ) / √(W/ν)
is the studentized range distribution


• This distribution has 2 parameters
1. m is the number of Zi s
2. ν is the denominator df
• We denote the critical value that captures the upper-tail area α under the density curve of Q by Qα,m,ν
Multiple Comparisons in ANOVA Result

• We consider the case of an equal number of replications, n0 = n1 = · · · = nt . For each i < j, form the interval
y i. − y j. ± Qα,t,n−t √(MSE/n0 )

• There are t(t − 1)/2 such intervals, e.g. µ1 − µ2 , µ1 − µ3 , etc.


• The simultaneous confidence level that every interval captures its corresponding value of µi − µj is 100(1 − α)%
Multiple Comparisons when Sample Sizes are Unequal – Tukey-Kramer Procedure

• Assumption: the t sample sizes n1 , n2 , . . . , nt are reasonably close to each other (i.e. mild imbalance)
• Let
dij = Qα,t,n−t √((MSE/2)(1/ni + 1/nj ))

• Then the probability is approximately 1 − α that

y i. − y j. − dij ≤ µi − µj ≤ y i. − y j. + dij

for every i and j with i 6= j


• The simultaneous confidence level of 100(1 − α)% is only approximate
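
A sketch of the equal-replication Tukey intervals on hypothetical groups; it assumes a scipy version recent enough to provide scipy.stats.studentized_range for the critical value Qα,t,n−t:

```python
import numpy as np
from itertools import combinations
from scipy import stats

groups = [np.array([5.1, 4.8, 5.6, 5.3]),     # hypothetical treatment data
          np.array([6.0, 6.4, 5.9, 6.2]),
          np.array([4.2, 4.6, 4.4, 4.9])]
t, n0 = len(groups), len(groups[0])            # equal replication n0 in each group
n = t * n0
alpha = 0.05

ms_e = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - t)      # MSE
q_crit = stats.studentized_range.ppf(1 - alpha, t, n - t)              # Q_{alpha, t, n-t}
half_width = q_crit * np.sqrt(ms_e / n0)

for i, j in combinations(range(t), 2):
    diff = groups[i].mean() - groups[j].mean()
    print(f"mu{i+1} - mu{j+1}: {diff - half_width:.2f} to {diff + half_width:.2f}")
```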

8 Logistic Regression
Logit Function
p(x) = exp(β0 + β1 x)/[1 + exp(β0 + β1 x)] = 1/[1 + exp(−β0 − β1 x)]
Odds
• Logistic regression means assuming that p(x) is related to x by the logit function

p(x)/[1 − p(x)] = exp(β0 + β1 x)

– The expression on the left side is called the odds


Log-Odds

• Taking natural logs on both sides,
log[p(x)/(1 − p(x))] = β0 + β1 x
the logarithm of the odds is a linear function of the predictor


• The slope parameter β1 is the change in the log-odds associated with a one-unit increase in x
• The quantity eβ1 is the odds ratio, because it represents the ratio of the odds of success when the
predictor variable equals x + 1 to the odds of success when the predictor variable equals x

Likelihood Function
• There are no analytical solutions for the MLEs β̂0 and β̂1
• The maximization process must be carried out using iterative numerical methods
• For large n, the MLE β̂1 has approximately a normal distribution, and the standardized variable (β̂1 − β1 )/Sβ̂1
has approximately a standard normal distribution
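
A sketch of fitting the logistic model by numerically maximizing the log-likelihood (simulated data; the Bernoulli log-likelihood used here follows from the logit model above):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)                       # hypothetical predictor values
true_b0, true_b1 = -3.0, 0.8
p = 1 / (1 + np.exp(-(true_b0 + true_b1 * x)))
y = rng.binomial(1, p)                                 # simulated 0/1 responses

def neg_log_likelihood(beta):
    b0, b1 = beta
    eta = b0 + b1 * x
    # log L = sum[ y*eta - log(1 + e^eta) ] for independent Bernoulli responses
    return -np.sum(y * eta - np.log1p(np.exp(eta)))

fit = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")   # iterative numerical MLE
b0_hat, b1_hat = fit.x
print(f"beta0_hat = {b0_hat:.2f}, beta1_hat = {b1_hat:.2f}, odds ratio = {np.exp(b1_hat):.2f}")
```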

9 Chi-Squared Tests (Extra)
A multinomial experiment satisfies the following conditions:
1. The experiment consists of a sequence of n trials for some fixed n
2. Each trial can result in one of the same k possible outcomes (aka categories)
3. The trials are independent
4. The probability that a trial results in category i is pi , which is a constant
The parameters p1 , . . . , pk must satisfy pi ≥ 0 and Σ pi = 1
• Generalization of a binomial experiment, allows each trial to result in one of > 2 possible outcomes
Null hypothesis: the pi s are assigned some fixed values; alternative hypothesis: at least one of the pi s has a value
different from that asserted by H0

E.g. an experiment with n = 50 and k = 3 might yield N1 = 22, N2 = 13, N3 = 15


• The Ni s are the observed counts
E(Ni ) = (total number of trials)(hypothesized probability of category i) = npi0
• These are the expected counts under H0
Pearson’s Chi-Squared Theorem
• When H0 : p1 = p10 , . . . , pk = pk0 is true, the statistic
χ² = Σ_{i=1}^k (Ni − npi0 )²/(npi0 ) = Σ_{all categories} (observed count − expected count)²/(expected count)
has approximately a chi-squared distribution with k − 1 df


• This approximation is reasonable provided that npi0 ≥ 5 for every i
Chi-Squared Goodness-of-Fit Test
• H0 : p1 = p10 , . . . , pk = pk0
• Ha : at least one pi does not equal pi0
• Test statistic value:
χ² = Σ_{i=1}^k (Ni − npi0 )²/(npi0 )
• Rejection region for level α test: {χ² ≥ χ²α,k−1 }
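
A sketch of the goodness-of-fit test using the example counts above (the hypothesized probabilities pi0 are made up for illustration):

```python
import numpy as np
from scipy import stats

observed = np.array([22, 13, 15])          # example counts N1, N2, N3 with n = 50
p0 = np.array([0.4, 0.3, 0.3])             # hypothesized probabilities p_{i0}
n = observed.sum()
expected = n * p0                          # expected counts n * p_{i0} (all >= 5 here)

chi2_stat = np.sum((observed - expected) ** 2 / expected)
p_value = stats.chi2.sf(chi2_stat, df=len(p0) - 1)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}")

# scipy equivalent
print(stats.chisquare(observed, f_exp=expected))
```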

Goodness-of-Fit Tests for Composite Hypotheses


• H0 : p1 = π1 (θ), . . . , pk = πk (θ) for some θ = (θ1 , . . . , θm )
• Ha : the hypothesis H0 is not true
Method of Multinomial Estimation
• Let n1 , . . . , nk denote the observed values of N1 , . . . , Nk . Then θ̂1 , . . . , θ̂m are those values of the θj s
that maximize the expression

P (N1 = n1 , . . . , Nk = nk ) ∝ [π1 (θ)]^{n1} × · · · × [πk (θ)]^{nk}

Fisher’s Chi-Squared Theorem
• Under general regularity conditions on θ1 , . . . , θm and the πi (θ)s, if θ1 , . . . , θm are estimated by max-
imizing the multinomial expression, then the RV
χ² = Σ_{i=1}^k (Ni − nP̂i )²/(nP̂i ) = Σ_{i=1}^k [Ni − nπi (θ̂)]²/(nπi (θ̂))
has approximately a chi-squared distribution with k − 1 − m df when H0 is true


• An approximately level α test of H0 vs. Ha is then to reject H0 if χ2 ≥ χ2α,k−1−m

• This test can be used if nπi (θ̂) ≥ 5 for every i

10 Bayesian Estimation (Extra)
Prior Distribution
• A prior distribution for a parameter θ, denoted π(θ), is a probability distribution on the set of
possible values for θ
• If the possible values of θ form an interval I, then π(θ) is a pdf that must satisfy
∫_I π(θ) dθ = 1

• If θ is potentially any value in a discrete set D, then π(θ) is a pmf that must satisfy
Σ_{θ∈D} π(θ) = 1

Posterior Distribution
• Suppose X1 , . . . , Xn have joint pdf f (x1 , . . . , xn ; θ) and the unknown parameter θ has been assigned
a continuous prior distribution π(θ). Then the posterior distribution of θ, given the observations
X1 = x1 , . . . , Xn = xn , is
π(θ|x1 , . . . , xn ) = π(θ)f (x1 , . . . , xn ; θ) / ∫_{−∞}^{∞} π(θ)f (x1 , . . . , xn ; θ) dθ

• If X1 , . . . , Xn are discrete, the joint pdf is replaced by their joint pmf


• Constructing the posterior distribution of a parameter requires a specific probability model f (x1 , . . . , xn ; θ)
for the observed data
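
A sketch of computing a posterior numerically on a grid (hypothetical binomial data with an assumed Beta(2, 2) prior):

```python
import numpy as np
from scipy import stats

# Hypothetical binomial data: y successes in n trials, theta = P(success)
y, n = 7, 20

theta = np.linspace(0.001, 0.999, 999)       # grid over the interval I = (0, 1)
dtheta = theta[1] - theta[0]
prior = stats.beta.pdf(theta, 2, 2)          # assumed Beta(2, 2) prior pi(theta)
likelihood = stats.binom.pmf(y, n, theta)    # f(y; theta) viewed as a function of theta

unnorm = prior * likelihood
posterior = unnorm / (unnorm.sum() * dtheta)  # normalize so the density integrates to 1

post_mean = (theta * posterior).sum() * dtheta
print(f"posterior mean of theta: {post_mean:.3f}")   # exact Beta(2+y, 2+n-y) mean is 9/24 = 0.375
```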
