iMS HW4: Ignoring Dependence
David Vuk
1-12-2024
School data
We import the data and split it according to whether a parent attended secondary school.
# Importing and splitting the data
load("C:/Users/David/Desktop/iMS/HW4/school.Rdata")
split_data <- split(school, school$Parents)
X <- split_data$"TRUE"
Y <- split_data$"FALSE"
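As a quick sanity check (an optional exploratory step, using the Parents, School and Grade columns that the later parts rely on), we can look at the group sizes and grade summaries:

# Group sizes per parental-education status and school
table(school$Parents, school$School)
# Grade summaries for the two parental-education groups
summary(X$Grade)
summary(Y$Grade)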
a)
From our model it follows that:
$$X_i \sim N(\mu_1, \sigma^2) \implies \frac{1}{n_1}\sum_{i=1}^{n_1} X_i \sim N\left(\mu_1, \frac{\sigma^2}{n_1}\right) \implies \frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] \sim N\left(0, \frac{\sigma^2}{n_1}\right)$$
Similarly:
$$Y_i \sim N(\mu_2, \sigma^2) \implies \frac{1}{n_2}\sum_{i=1}^{n_2} Y_i \sim N\left(\mu_2, \frac{\sigma^2}{n_2}\right) \implies \frac{1}{n_2}\sum_{i=1}^{n_2} [Y_i - \mu_2] \sim N\left(0, \frac{\sigma^2}{n_2}\right)$$
And thus, since both centered averages have mean zero:
$$\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_2}\sum_{i=1}^{n_2} [Y_i - \mu_2] \sim N\left(0, \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}\right) = N\left(0, \sigma^2 \frac{n_1 + n_2}{n_1 n_2}\right)$$
$$\implies \frac{\sqrt{n_1 n_2}}{\sqrt{n_1 + n_2}\,\sqrt{\sigma^2}}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_2}\sum_{i=1}^{n_2} [Y_i - \mu_2]\right) \sim N(0, 1)$$
Since
$$\frac{1}{n_1}\sum_{i=1}^{n_1} (X_i - \bar{X})^2 \xrightarrow{p} \sigma^2 \text{ for } n_1 \to \infty \qquad \text{and} \qquad \frac{1}{n_2}\sum_{i=1}^{n_2} (Y_i - \bar{Y})^2 \xrightarrow{p} \sigma^2 \text{ for } n_2 \to \infty,$$
we have that $\sigma^2$ can be estimated with $\frac{1}{2}\frac{1}{n_1}\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \frac{1}{2}\frac{1}{n_2}\sum_{i=1}^{n_2}(Y_i - \bar{Y})^2$, so by Slutsky's lemma
$$\frac{\frac{\sqrt{n_1 n_2}}{\sqrt{n_1 + n_2}}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_2}\sum_{i=1}^{n_2} [Y_i - \mu_2]\right)}{\sqrt{\frac{1}{2}\frac{1}{n_1}\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \frac{1}{2}\frac{1}{n_2}\sum_{i=1}^{n_2}(Y_i - \bar{Y})^2}} \xrightarrow{D} N(0, 1)$$
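As an informal Monte Carlo check of this convergence (not part of the derivation; the sample sizes, means, and variance below are arbitrary choices), we can simulate the studentized statistic and verify that roughly 5% of draws exceed 1.96 in absolute value:

# Monte Carlo sketch: the studentized difference is approximately N(0, 1)
set.seed(1)
m1 <- 50; m2 <- 80; mu1 <- 1; mu2 <- 2; sigma <- 3
stat <- replicate(10000, {
    x <- rnorm(m1, mu1, sigma)
    y <- rnorm(m2, mu2, sigma)
    E_fac <- sqrt(m1 * m2)/sqrt(m1 + m2)
    F_hat <- sqrt(0.5 * mean((x - mean(x))^2) + 0.5 * mean((y - mean(y))^2))
    (E_fac/F_hat) * ((mean(x) - mu1) - (mean(y) - mu2))
})
mean(abs(stat) > 1.96)  # should be close to 0.05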
b)
Denote
$$F := \sqrt{\frac{1}{2}\frac{1}{n_1}\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \frac{1}{2}\frac{1}{n_2}\sum_{i=1}^{n_2}(Y_i - \bar{Y})^2} \qquad \text{and} \qquad E := \frac{\sqrt{n_1 n_2}}{\sqrt{n_1 + n_2}}.$$
Then we have (asymptotically) that:
$$P\left(\left|\frac{E}{F}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_2}\sum_{i=1}^{n_2} [Y_i - \mu_2]\right)\right| > \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right) = \alpha$$
Which allows us to calculate the (asymptotic) CI:
$$\Phi^{-1}\left(\frac{\alpha}{2}\right) \le \frac{E}{F}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_2}\sum_{i=1}^{n_2} [Y_i - \mu_2]\right) \le \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$$
$$\Phi^{-1}\left(\frac{\alpha}{2}\right) \le \frac{E}{F}(\bar{X} - \mu_1 - \bar{Y} + \mu_2) \le \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$$
$$-\bar{X} + \bar{Y} + \frac{F}{E}\Phi^{-1}\left(\frac{\alpha}{2}\right) \le -\mu_1 + \mu_2 \le -\bar{X} + \bar{Y} + \frac{F}{E}\Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$$
$$\bar{X} - \bar{Y} - \frac{F}{E}\Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \le \mu_1 - \mu_2 \le \bar{X} - \bar{Y} - \frac{F}{E}\Phi^{-1}\left(\frac{\alpha}{2}\right)$$
$$\bar{X} - \bar{Y} - \frac{F}{E}\Phi^{-1}\left(1 - \frac{\alpha}{2}\right) \le \mu_1 - \mu_2 \le \bar{X} - \bar{Y} + \frac{F}{E}\Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$$
where the last step uses $\Phi^{-1}(\alpha/2) = -\Phi^{-1}(1 - \alpha/2)$. This means that the CI is $\left(\bar{X} - \bar{Y} - \frac{F}{E}\Phi^{-1}(1 - \frac{\alpha}{2}),\ \bar{X} - \bar{Y} + \frac{F}{E}\Phi^{-1}(1 - \frac{\alpha}{2})\right)$. Now we calculate this CI using the data:
# Calculate the 95% CI
X_mean <- mean(X$Grade)
Y_mean <- mean(Y$Grade)
n1 <- length(X$Grade)
n2 <- length(Y$Grade)
E_data <- sqrt(n1 * n2)/sqrt(n1 + n2)
# var() uses the n - 1 denominator; asymptotically equivalent to the 1/n
# empirical variance used in the derivation above
F_data <- sqrt(0.5 * var(X$Grade) + 0.5 * var(Y$Grade))
# 1.96 is (rounded) qnorm(0.975), i.e. Phi^{-1}(1 - alpha/2) for alpha = 0.05
lower_bound <- X_mean - Y_mean - (F_data/E_data) * 1.96
upper_bound <- X_mean - Y_mean + (F_data/E_data) * 1.96
print(lower_bound)
## [1] 0.7297661
print(upper_bound)
## [1] 1.555468
Since the entire interval lies above zero, it is very unlikely that there is no difference between the true mean grades of the two groups. Moreover, this is statistical evidence that pupils with at least one parent with secondary school education get better grades.
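As an informal cross-check (not required by the exercise), the equal-variance two-sample t interval from R's built-in t.test should essentially agree with the asymptotic interval above at these sample sizes:

# Built-in equal-variance t interval as a plausibility check
t.test(X$Grade, Y$Grade, var.equal = TRUE)$conf.int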
c)
From the model we have:
$$\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] = \frac{1}{n_1}\sum_{i=1}^{n_1} V_i - \mu_1 + Z_1 \qquad \text{and} \qquad \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2] = \frac{1}{n_3}\sum_{i=1}^{n_3} W_i - \mu_2 + Z_1,$$
so it follows that the shared term $Z_1$ cancels:
$$\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2] = \frac{1}{n_1}\sum_{i=1}^{n_1} V_i - \mu_1 - \left(\frac{1}{n_3}\sum_{i=1}^{n_3} W_i - \mu_2\right)$$
And since
$$\frac{1}{n_1}\sum_{i=1}^{n_1} V_i - \mu_1 \sim N\left(0, \frac{\sigma_1^2}{n_1}\right) \qquad \text{and} \qquad \frac{1}{n_3}\sum_{i=1}^{n_3} W_i - \mu_2 \sim N\left(0, \frac{\sigma_1^2}{n_3}\right)$$
holds, we have the following:
$$\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2] \sim N\left(0, \frac{\sigma_1^2}{n_1} + \frac{\sigma_1^2}{n_3}\right)$$
$$\sqrt{n_1 n_3}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2]\right) \sim N\left(0, \sigma_1^2 (n_1 + n_3)\right)$$
In a completely similar manner, we can obtain:
$$\sqrt{n_2 n_4}\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} [X_i - \mu_1] - \frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} [Y_i - \mu_2]\right) \sim N\left(0, \sigma_1^2 (n_2 + n_4)\right)$$
and so, since the two statistics are independent:
$$\sqrt{n_1 n_3}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2]\right) + \sqrt{n_2 n_4}\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} [X_i - \mu_1] - \frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} [Y_i - \mu_2]\right) \sim N\left(0, \sigma_1^2 (n_1 + n_2 + n_3 + n_4)\right)$$
$$\implies \frac{\sqrt{n_1 n_3}}{\sqrt{n_1 + n_2 + n_3 + n_4}}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2]\right) + \frac{\sqrt{n_2 n_4}}{\sqrt{n_1 + n_2 + n_3 + n_4}}\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} [X_i - \mu_1] - \frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} [Y_i - \mu_2]\right) \sim N(0, \sigma_1^2)$$
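To make this concrete, here is a small simulation sketch of the model (all parameter values are arbitrary choices; sZ denotes the standard deviation of the shared terms Z1 and Z2, which the derivation shows cancel out): the combined statistic should have variance close to $\sigma_1^2$.

# Simulation sketch of the dependent model with shared noise Z1, Z2
set.seed(1)
m1 <- 40; m2 <- 60; m3 <- 50; m4 <- 30  # arbitrary group sizes
mu1 <- 1; mu2 <- 2; s1 <- 1.5; sZ <- 2  # s1 plays the role of sigma_1
mtot <- m1 + m2 + m3 + m4
stat <- replicate(10000, {
    Z1 <- rnorm(1, 0, sZ); Z2 <- rnorm(1, 0, sZ)
    xA <- rnorm(m1, mu1, s1) + Z1; xB <- rnorm(m2, mu1, s1) + Z2
    yC <- rnorm(m3, mu2, s1) + Z1; yD <- rnorm(m4, mu2, s1) + Z2
    sqrt(m1 * m3)/sqrt(mtot) * ((mean(xA) - mu1) - (mean(yC) - mu2)) +
        sqrt(m2 * m4)/sqrt(mtot) * ((mean(xB) - mu1) - (mean(yD) - mu2))
})
var(stat)  # should be close to s1^2 = 2.25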
d)
Denote (for ease of reference)
$$A_i := V_i + Z_1 \text{ for } i = 1, \dots, n_1, \qquad B_i := V_{n_1+i} + Z_2 \text{ for } i = 1, \dots, n_2,$$
$$C_i := W_i + Z_1 \text{ for } i = 1, \dots, n_3, \qquad D_i := W_{n_3+i} + Z_2 \text{ for } i = 1, \dots, n_4.$$
Then $\mathrm{Var}(A_i) = \mathrm{Var}(B_i) = \mathrm{Var}(C_i) = \mathrm{Var}(D_i) = \sigma_1^2$ holds, since within each of these independent groups the value of $Z_1$ or $Z_2$ is a constant shift and does not contribute to the within-group variance. So we can use the empirical variances of these groups as estimators for $\sigma_1^2$. More specifically, we take a weighted sum of the empirical variances, where each weight is the proportion of that group's sample size in the total sample size. This yields a more accurate estimator, while remaining a consistent estimator:
$$\hat{\sigma}_1^2 := \frac{n_1}{n_{total}}\left[\frac{1}{n_1}\sum_{i=1}^{n_1}(A_i - \bar{A})^2\right] + \frac{n_2}{n_{total}}\left[\frac{1}{n_2}\sum_{i=1}^{n_2}(B_i - \bar{B})^2\right] + \frac{n_3}{n_{total}}\left[\frac{1}{n_3}\sum_{i=1}^{n_3}(C_i - \bar{C})^2\right] + \frac{n_4}{n_{total}}\left[\frac{1}{n_4}\sum_{i=1}^{n_4}(D_i - \bar{D})^2\right]$$
where $n_{total} := n_1 + n_2 + n_3 + n_4$. The empirical variance of a normal sample ($\frac{1}{m}\sum_{i=1}^{m}(G_i - \bar{G})^2$, where the $G_i$ are iid with $G_1 \sim N(0, \sigma_G^2)$ for some $\sigma_G^2 \in \mathbb{R}$ and sample size $m \in \mathbb{N}$) is a consistent estimator (Corollary 3.9), so $\hat{\sigma}_1^2$ as defined above is consistent as well, since if $\alpha_i \xrightarrow{p} \sigma$ for $i = 1, 2, 3, 4$, then
$$\frac{a_1}{a_{tot}}\alpha_1 + \frac{a_2}{a_{tot}}\alpha_2 + \frac{a_3}{a_{tot}}\alpha_3 + \frac{a_4}{a_{tot}}\alpha_4 \xrightarrow{p} \frac{a_1}{a_{tot}}\sigma + \frac{a_2}{a_{tot}}\sigma + \frac{a_3}{a_{tot}}\sigma + \frac{a_4}{a_{tot}}\sigma = \sigma \qquad \text{for } a_{tot} := a_1 + a_2 + a_3 + a_4.$$
# Split data further
split_X <- split(X, X$School)
split_Y <- split(Y, Y$School)
A <- split_X$GP
B <- split_X$MS
C <- split_Y$GP
D <- split_Y$MS
n1 <- length(A$Grade)
n2 <- length(B$Grade)
n3 <- length(C$Grade)
n4 <- length(D$Grade)
ntot <- n1 + n2 + n3 + n4
# Calculate sigma_hat: weighted sum of the group variances
# (var() uses the n - 1 denominator; asymptotically equivalent to 1/n)
sigma_hat <- (n1/ntot) * var(A$Grade) + (n2/ntot) * var(B$Grade) +
    (n3/ntot) * var(C$Grade) + (n4/ntot) * var(D$Grade)
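As a minimal simulation sketch of this consistency (arbitrary parameter values, large groups): each group's sample variance ignores its constant Z shift, so the weighted sum approaches $\sigma_1^2$.

# Consistency sketch: weighted group variances estimate sigma_1^2
set.seed(2)
s1 <- 1.5; sZ <- 2
m <- c(4000, 6000, 5000, 3000)  # large group sizes (arbitrary)
groups <- lapply(m, function(mi) rnorm(mi, 0, s1) + rnorm(1, 0, sZ))
sum((m/sum(m)) * sapply(groups, var))  # should be close to s1^2 = 2.25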
e)
From c) we know that:
$$\frac{\sqrt{n_1 n_3}}{\sqrt{n_{tot}}}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2]\right) + \frac{\sqrt{n_2 n_4}}{\sqrt{n_{tot}}}\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} [X_i - \mu_1] - \frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} [Y_i - \mu_2]\right) \sim N(0, \sigma_1^2)$$
$$\implies \frac{\sqrt{n_1 n_3}}{\sqrt{n_{tot}\sigma_1^2}}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2]\right) + \frac{\sqrt{n_2 n_4}}{\sqrt{n_{tot}\sigma_1^2}}\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} [X_i - \mu_1] - \frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} [Y_i - \mu_2]\right) \sim N(0, 1)$$
Denote $H := \frac{\sqrt{n_1 n_3}}{\sqrt{n_{tot}\sigma_1^2}}$ and $K := \frac{\sqrt{n_2 n_4}}{\sqrt{n_{tot}\sigma_1^2}}$. Then, using the notation from d), it follows that:
$$H\left(\frac{1}{n_1}\sum_{i=1}^{n_1} [X_i - \mu_1] - \frac{1}{n_3}\sum_{i=1}^{n_3} [Y_i - \mu_2]\right) + K\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} [X_i - \mu_1] - \frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} [Y_i - \mu_2]\right)$$
$$= H\left(\frac{1}{n_1}\sum_{i=1}^{n_1} X_i - \mu_1 - \left(\frac{1}{n_3}\sum_{i=1}^{n_3} Y_i - \mu_2\right)\right) + K\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} X_i - \mu_1 - \left(\frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} Y_i - \mu_2\right)\right)$$
$$= H(\bar{A} - \mu_1 - (\bar{C} - \mu_2)) + K(\bar{B} - \mu_1 - (\bar{D} - \mu_2))$$
$$= H(\bar{A} - \mu_1 - \bar{C} + \mu_2) + K(\bar{B} - \mu_1 - \bar{D} + \mu_2)$$
$$= H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) - \mu_1(H + K) + \mu_2(H + K)$$
$$= H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) - (\mu_1 - \mu_2)(H + K) \sim N(0, 1)$$
From this we can construct the asymptotic confidence interval, similar to what we did in b):
$$\Phi^{-1}\left(\frac{\alpha}{2}\right) \le H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) - (\mu_1 - \mu_2)(H + K) \le \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$$
$$-H(\bar{A} - \bar{C}) - K(\bar{B} - \bar{D}) + \Phi^{-1}\left(\frac{\alpha}{2}\right) \le -(\mu_1 - \mu_2)(H + K) \le -H(\bar{A} - \bar{C}) - K(\bar{B} - \bar{D}) + \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$$
$$\frac{1}{H + K}\left[-H(\bar{A} - \bar{C}) - K(\bar{B} - \bar{D}) + \Phi^{-1}\left(\frac{\alpha}{2}\right)\right] \le -(\mu_1 - \mu_2) \le \frac{1}{H + K}\left[-H(\bar{A} - \bar{C}) - K(\bar{B} - \bar{D}) + \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right]$$
$$\frac{1}{H + K}\left[H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) - \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right] \le \mu_1 - \mu_2 \le \frac{1}{H + K}\left[H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) - \Phi^{-1}\left(\frac{\alpha}{2}\right)\right]$$
$$\frac{1}{H + K}\left[H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) - \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right] \le \mu_1 - \mu_2 \le \frac{1}{H + K}\left[H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) + \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right]$$
This gives us the confidence interval:
$$\left(\frac{1}{H + K}\left[H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) - \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right],\ \frac{1}{H + K}\left[H(\bar{A} - \bar{C}) + K(\bar{B} - \bar{D}) + \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right]\right)$$
which, in the original notation, is equal to:
$$\Bigg(\frac{\sqrt{n_{tot}\sigma_1^2}}{\sqrt{n_1 n_3} + \sqrt{n_2 n_4}}\left[\frac{\sqrt{n_1 n_3}}{\sqrt{n_{tot}\sigma_1^2}}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} X_i - \frac{1}{n_3}\sum_{i=1}^{n_3} Y_i\right) + \frac{\sqrt{n_2 n_4}}{\sqrt{n_{tot}\sigma_1^2}}\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} X_i - \frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} Y_i\right) - \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right],$$
$$\ \frac{\sqrt{n_{tot}\sigma_1^2}}{\sqrt{n_1 n_3} + \sqrt{n_2 n_4}}\left[\frac{\sqrt{n_1 n_3}}{\sqrt{n_{tot}\sigma_1^2}}\left(\frac{1}{n_1}\sum_{i=1}^{n_1} X_i - \frac{1}{n_3}\sum_{i=1}^{n_3} Y_i\right) + \frac{\sqrt{n_2 n_4}}{\sqrt{n_{tot}\sigma_1^2}}\left(\frac{1}{n_2}\sum_{i=n_1+1}^{n_1+n_2} X_i - \frac{1}{n_4}\sum_{i=n_3+1}^{n_3+n_4} Y_i\right) + \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)\right]\Bigg)$$
Now we can compute these numbers using $\hat{\sigma}_1^2$ in place of $\sigma_1^2$:
H <- sqrt(n1 * n3)/sqrt(ntot * sigma_hat)
K <- sqrt(n2 * n4)/sqrt(ntot * sigma_hat)
# 95% CI: 1.96 is (rounded) qnorm(0.975)
lower_bound <- (1/(H + K)) * (H * (mean(A$Grade) - mean(C$Grade)) +
    K * (mean(B$Grade) - mean(D$Grade)) - 1.96)
upper_bound <- (1/(H + K)) * (H * (mean(A$Grade) - mean(C$Grade)) +
    K * (mean(B$Grade) - mean(D$Grade)) + 1.96)
print(lower_bound)
## [1] 0.4209079
print(upper_bound)
## [1] 1.244709
Again, the interval lies entirely above zero, so there is statistical evidence that pupils with at least one parent with secondary school education get better grades.
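As a final informal check (again with arbitrary parameter values), a simulation suggests the interval from e) attains roughly its nominal 95% coverage under the dependence model:

# Coverage sketch for the CI from e) under the dependent model
set.seed(3)
m1 <- 40; m2 <- 60; m3 <- 50; m4 <- 30; mtot <- m1 + m2 + m3 + m4
mu1 <- 1; mu2 <- 2; s1 <- 1.5; sZ <- 2
covered <- replicate(5000, {
    Z1 <- rnorm(1, 0, sZ); Z2 <- rnorm(1, 0, sZ)
    xA <- rnorm(m1, mu1, s1) + Z1; xB <- rnorm(m2, mu1, s1) + Z2
    yC <- rnorm(m3, mu2, s1) + Z1; yD <- rnorm(m4, mu2, s1) + Z2
    s_hat <- (m1 * var(xA) + m2 * var(xB) + m3 * var(yC) + m4 * var(yD))/mtot
    H <- sqrt(m1 * m3)/sqrt(mtot * s_hat)
    K <- sqrt(m2 * m4)/sqrt(mtot * s_hat)
    centre <- (H * (mean(xA) - mean(yC)) + K * (mean(xB) - mean(yD)))/(H + K)
    half <- 1.96/(H + K)
    centre - half <= mu1 - mu2 && mu1 - mu2 <= centre + half
})
mean(covered)  # should be close to 0.95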