
STOR 455/002 - Homework I

January 17, 2025

This homework is due at 11:59 pm (Eastern Time) on Jan 30, 2025. It is to be submitted on Gradescope.

Exercise 1 (13 points)


Consider a bivariate dataset consisting of $n$ observations on 2 numerical variables $x^{(1)}$ and $x^{(2)}$. Let $x_j^{(i)}$ denote the value of the variable $x^{(i)}$ for the $j$-th observation, where $i = 1, 2$ and $j = 1, \dots, n$.

(a) (2 + 3 + 3 points)
Fix two real numbers $a$ and $b$. Consider the new variable $z := a x^{(1)} + b$. In other words, the value of the variable $z$ for the $j$-th observation is given by $z_j = a x_j^{(1)} + b$. Show that
$$\bar{z}_n = a\bar{x}_n^{(1)} + b, \qquad \mathrm{var}(z) = a^2\,\mathrm{var}\big(x^{(1)}\big), \qquad \mathrm{cov}\big(z, x^{(2)}\big) = a\,\mathrm{cov}\big(x^{(1)}, x^{(2)}\big).$$
Use the definitions of sample mean, variance and covariance.

By the definition of the sample mean,
$$\bar{z}_n = \frac{1}{n}\sum_{j=1}^{n} z_j = \frac{1}{n}\sum_{j=1}^{n}\big(a x_j^{(1)} + b\big) = a \cdot \frac{1}{n}\sum_{j=1}^{n} x_j^{(1)} + \frac{1}{n}\sum_{j=1}^{n} b.$$

Simplify using $\bar{x}_n^{(1)} = \frac{1}{n}\sum_{j=1}^{n} x_j^{(1)}$:
$$\bar{z}_n = a\bar{x}_n^{(1)} + b.$$

For the variance, start from the definition
$$\mathrm{var}(z) = \frac{1}{n-1}\sum_{j=1}^{n}\big(z_j - \bar{z}_n\big)^2.$$

Substitute $z_j = a x_j^{(1)} + b$ and $\bar{z}_n = a\bar{x}_n^{(1)} + b$:
$$\mathrm{var}(z) = \frac{1}{n-1}\sum_{j=1}^{n}\Big(a x_j^{(1)} + b - \big(a\bar{x}_n^{(1)} + b\big)\Big)^2 = \frac{1}{n-1}\sum_{j=1}^{n} a^2\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2,$$

so
$$\mathrm{var}(z) = a^2 \cdot \frac{1}{n-1}\sum_{j=1}^{n}\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2 = a^2\,\mathrm{var}\big(x^{(1)}\big).$$

For the covariance, start from the definition
$$\mathrm{cov}\big(z, x^{(2)}\big) = \frac{1}{n-1}\sum_{j=1}^{n}\big(z_j - \bar{z}_n\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big).$$

Substitute $z_j = a x_j^{(1)} + b$ and $\bar{z}_n = a\bar{x}_n^{(1)} + b$:
$$\mathrm{cov}\big(z, x^{(2)}\big) = \frac{1}{n-1}\sum_{j=1}^{n} a\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big) = a \cdot \frac{1}{n-1}\sum_{j=1}^{n}\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big) = a\,\mathrm{cov}\big(x^{(1)}, x^{(2)}\big).$$
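
These identities can be double-checked numerically in R. The sketch below uses arbitrary simulated values for $x^{(1)}$, $x^{(2)}$, $a$, and $b$ (illustrative choices, not part of the exercise); note that R's var() and cov() use the same $1/(n-1)$ sample definitions as above.

set.seed(1)  # arbitrary seed for reproducibility
x1 <- rnorm(50); x2 <- rnorm(50)   # play the roles of x^(1) and x^(2)
a <- 3; b <- -2
z <- a * x1 + b
all.equal(mean(z), a * mean(x1) + b)     # TRUE
all.equal(var(z), a^2 * var(x1))         # TRUE
all.equal(cov(z, x2), a * cov(x1, x2))   # TRUE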
(b) (5 points)
Show that
$$\mathrm{var}\big(x^{(1)} + x^{(2)}\big) = \mathrm{var}\big(x^{(1)}\big) + \mathrm{var}\big(x^{(2)}\big) + 2\,\mathrm{cov}\big(x^{(1)}, x^{(2)}\big).$$

Write down $\mathrm{var}\big(x^{(1)} + x^{(2)}\big)$ as
$$\mathrm{var}\big(x^{(1)} + x^{(2)}\big) = \frac{1}{n-1}\sum_{j=1}^{n}\big(x_j^{(1)} + x_j^{(2)} - \bar{x}_n^{(1)} - \bar{x}_n^{(2)}\big)^2.$$

Expand each summand in the following way:
$$\big(x_j^{(1)} + x_j^{(2)} - \bar{x}_n^{(1)} - \bar{x}_n^{(2)}\big)^2 = \big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2 + \big(x_j^{(2)} - \bar{x}_n^{(2)}\big)^2 + 2\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big).$$

Summing over $j$ and dividing by $n-1$,
$$\mathrm{var}\big(x^{(1)} + x^{(2)}\big) = \frac{1}{n-1}\left[\sum_{j=1}^{n}\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2 + \sum_{j=1}^{n}\big(x_j^{(2)} - \bar{x}_n^{(2)}\big)^2 + 2\sum_{j=1}^{n}\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big)\right],$$

which is exactly
$$\mathrm{var}\big(x^{(1)} + x^{(2)}\big) = \mathrm{var}\big(x^{(1)}\big) + \mathrm{var}\big(x^{(2)}\big) + 2\,\mathrm{cov}\big(x^{(1)}, x^{(2)}\big).$$
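
As a quick sanity check of this identity, one can simulate arbitrary data in R (values are illustrative only):

set.seed(2)
x1 <- rnorm(50); x2 <- rnorm(50)
all.equal(var(x1 + x2), var(x1) + var(x2) + 2 * cov(x1, x2))  # TRUE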
Exercise 2 (7 points)
Consider a dataset of $n$ observations on a variable $Y$. The value of the numerical variable $Y$ for the $i$-th observation is given by $Y_i$, for $i = 1, \dots, n$. Establish the following sum-of-squares decomposition: for any real number $c$,
$$\sum_{i=1}^{n}(Y_i - c)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2 + n(\bar{Y}_n - c)^2,$$
where $\bar{Y}_n$ is the sample mean of the variable $Y$. Observe that the above identity shows the LHS to be minimized uniquely at $c = \bar{Y}_n$.
Write $(Y_i - c)$ on the left-hand side as
$$(Y_i - c) = (Y_i - \bar{Y}_n) + (\bar{Y}_n - c),$$
and use the expansion $(a + b)^2 = a^2 + b^2 + 2ab$ on $(Y_i - c)^2$.
Following the hint,
$$(Y_i - c)^2 = \big[(Y_i - \bar{Y}_n) + (\bar{Y}_n - c)\big]^2 = (Y_i - \bar{Y}_n)^2 + (\bar{Y}_n - c)^2 + 2(Y_i - \bar{Y}_n)(\bar{Y}_n - c).$$

Summing over $i$,
$$\sum_{i=1}^{n}(Y_i - c)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2 + \sum_{i=1}^{n}(\bar{Y}_n - c)^2 + 2(\bar{Y}_n - c)\sum_{i=1}^{n}(Y_i - \bar{Y}_n).$$

Since $\sum_{i=1}^{n}(\bar{Y}_n - c)^2 = n(\bar{Y}_n - c)^2$ and $\sum_{i=1}^{n}(Y_i - \bar{Y}_n) = 0$,
$$\sum_{i=1}^{n}(Y_i - c)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2 + n(\bar{Y}_n - c)^2.$$
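
A small numerical check of the decomposition in R (the variable Y and the constant c0 below are arbitrary illustrative choices; c0 is used instead of c to avoid masking R's c() function):

set.seed(3)
Y <- rnorm(40); n <- length(Y)
c0 <- 1.7
lhs <- sum((Y - c0)^2)
rhs <- sum((Y - mean(Y))^2) + n * (mean(Y) - c0)^2
all.equal(lhs, rhs)  # TRUE
# The second term of rhs is >= 0 and vanishes only at c0 = mean(Y),
# so the LHS is minimized uniquely at c0 = mean(Y).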

Exercise 3 (13 points)

Consider the bivariate dataset of $n$ observations
$$\{(Y_1, x_1), \dots, (Y_n, x_n)\},$$
where $Y$ and $x$ are respectively the response and explanatory variables. We want to fit a simple linear regression model of $Y$ on $x$, given by
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n.$$
In this exercise we shall prove (without using calculus) the following algebraic expressions for the least-squares estimates:
$$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{x}_n, \qquad \hat{\beta}_1 = \frac{r_{xY}\, s_Y}{s_x}.$$
In other words, the above quantities solve the following minimization problem:
$$\big(\hat{\beta}_0, \hat{\beta}_1\big) = \operatorname*{arg\,min}_{(\beta_0, \beta_1)} \sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)^2. \qquad (*)$$

(a) (2 points)
Use Exercise 2 to show that for any $(\beta_0, \beta_1)$,
$$\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n}\big(Y_i - \bar{Y}_n - \beta_1 x_i + \beta_1 \bar{x}_n\big)^2 + n\big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big)^2.$$

(b) (2 points)
Deduce from Part (a) that any solution $\big(\hat{\beta}_0, \hat{\beta}_1\big)$ to the optimization problem in $(*)$ satisfies the relation
$$\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1 \bar{x}_n.$$

(c) (2 points)
Deduce from Part (b) that the solution $\hat{\beta}_1$ to $(*)$ solves the following optimization problem:
$$\hat{\beta}_1 = \operatorname*{arg\,min}_{\beta_1} \sum_{i=1}^{n}\big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2.$$

(d) (4 points)
Show that
$$\frac{1}{n-1}\sum_{i=1}^{n}\big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2 = \beta_1^2\,\mathrm{Var}(x) - 2\beta_1\,\mathrm{Cov}(x, Y) + \mathrm{Var}(Y).$$

(e) (3 points)
Use Part (d) to deduce that $(*)$ is solved at
$$\hat{\beta}_1 = \frac{r_{xY}\, s_Y}{s_x}.$$
Use the fact that for any positive real number $a$ and real numbers $b, c$, the quadratic expression $at^2 + bt + c$ is minimized at $t = -b/(2a)$.
a. Apply Exercise 2 to the transformed variable $W_i = Y_i - \beta_1 x_i$ with $c = \beta_0$. By Exercise 1, $\bar{W}_n = \bar{Y}_n - \beta_1 \bar{x}_n$, so the sum-of-squares decomposition gives
$$\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n}\big(Y_i - \bar{Y}_n - \beta_1(x_i - \bar{x}_n)\big)^2 + n\big(\bar{Y}_n - \beta_1\bar{x}_n - \beta_0\big)^2.$$
This matches the required decomposition. Equivalently, one can write each residual as
$$Y_i - \beta_0 - \beta_1 x_i = \big(Y_i - \bar{Y}_n - \beta_1(x_i - \bar{x}_n)\big) + \big(\bar{Y}_n - \beta_1\bar{x}_n - \beta_0\big)$$
and expand the square directly; the cross-term vanishes because $\sum_{i=1}^{n}(Y_i - \bar{Y}_n) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x}_n) = 0$.

b. The first term of the decomposition does not involve $\beta_0$, and the second term $n(\bar{Y}_n - \beta_1\bar{x}_n - \beta_0)^2$ is nonnegative, so the total sum of squares is minimized in $\beta_0$ exactly when the second term is zero. Thus
$$\bar{Y}_n - \hat{\beta}_1\bar{x}_n - \hat{\beta}_0 = 0, \qquad \text{i.e.,} \qquad \hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{x}_n.$$

c. Substituting $\hat{\beta}_0 = \bar{Y}_n - \hat{\beta}_1\bar{x}_n$ into the original minimization problem gives
$$\sum_{i=1}^{n}\big(Y_i - \hat{\beta}_0 - \beta_1 x_i\big)^2 = \sum_{i=1}^{n}\big((Y_i - \bar{Y}_n) - \beta_1(x_i - \bar{x}_n)\big)^2,$$
so $\hat{\beta}_1$ minimizes this reduced sum of squares.

d. Expand each summand:
$$\big((Y_i - \bar{Y}_n) - \beta_1(x_i - \bar{x}_n)\big)^2 = (Y_i - \bar{Y}_n)^2 - 2\beta_1(Y_i - \bar{Y}_n)(x_i - \bar{x}_n) + \beta_1^2(x_i - \bar{x}_n)^2.$$
Summing over $i$ and dividing by $n-1$,
$$\frac{1}{n-1}\sum_{i=1}^{n}[\cdots] = \beta_1^2\,\mathrm{Var}(x) - 2\beta_1\,\mathrm{Cov}(x, Y) + \mathrm{Var}(Y).$$

e. $\beta_1^2\,\mathrm{Var}(x) - 2\beta_1\,\mathrm{Cov}(x, Y) + \mathrm{Var}(Y)$ is a quadratic in $\beta_1$ with positive leading coefficient $\mathrm{Var}(x)$. Minimizing this quadratic at $t = -b/(2a)$ with $a = \mathrm{Var}(x)$ and $b = -2\,\mathrm{Cov}(x, Y)$ gives
$$\hat{\beta}_1 = \frac{\mathrm{Cov}(x, Y)}{\mathrm{Var}(x)} = \frac{r_{xY}\, s_Y}{s_x},$$
where $r_{xY} = \frac{\mathrm{Cov}(x, Y)}{s_x s_Y}$, $s_x = \sqrt{\mathrm{Var}(x)}$, and $s_Y = \sqrt{\mathrm{Var}(Y)}$.
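
The closed-form estimates can be checked against R's lm() on simulated data (the data-generating values below are arbitrary illustrative choices):

set.seed(4)
x <- rnorm(60)
Y <- 2 + 0.5 * x + rnorm(60)
beta1_hat <- cor(x, Y) * sd(Y) / sd(x)      # r_xY * s_Y / s_x, from part (e)
beta0_hat <- mean(Y) - beta1_hat * mean(x)  # relation from part (b)
coef(lm(Y ~ x))  # agrees with c(beta0_hat, beta1_hat)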

Exercise 4 (12 points)

The file nepal.csv contains data from a public health study on Nepalese children. The dataset has 877 observations on 3 variables: height, weight, and sex (coded 1 for male and 2 for female).

(a) (5 + 5 points)
Fit separate simple linear regression models, with weight as the response variable and height as the predictor variable, on the sub-datasets of male and female children. Report the estimated coefficients and a scatter plot (along with the fitted line) for each sub-population.
nepal <- read.csv("/Users/macbook/Desktop/STOR455/nepal.csv")
males <- subset(nepal, sex == 1)
females <- subset(nepal, sex == 2)
male_model <- lm(weight ~ height, data = males)
female_model <- lm(weight ~ height, data = females)
male_model$coefficients

## (Intercept)      height
##  -9.0869252   0.2393433

female_model$coefficients

## (Intercept)      height
##  -8.3712108   0.2281936

plot(males$height, males$weight, main = "Weight vs. Height (Males)",
     xlab = "Height (cm)", ylab = "Weight (kg)", pch = 19)
abline(male_model, col = "red", lwd = 2)
plot(females$height, females$weight, main = "Weight vs. Height (Females)",
     xlab = "Height (cm)", ylab = "Weight (kg)", pch = 19)
abline(female_model, col = "red", lwd = 2)
(b) (2 points)
Comment on the goodness-of-fit of the simple linear regression models for the two sub-populations. For which sub-population does the model fit the data better?
summary(male_model)

##
## Call:
## lm(formula = weight ~ height, data = males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7192 -0.5064 -0.0510 0.4496 3.2427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.086925 0.288998 -31.44 <2e-16 ***
## height 0.239343 0.003341 71.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8373 on 453 degrees of freedom
## Multiple R-squared: 0.9189, Adjusted R-squared: 0.9187
## F-statistic: 5131 on 1 and 453 DF, p-value: < 2.2e-16

summary(female_model)

##
## Call:
## lm(formula = weight ~ height, data = females)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.82127 -0.57982 -0.02652 0.50813 3.15115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.371211 0.303580 -27.57 <2e-16 ***
## height 0.228194 0.003551 64.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8916 on 420 degrees of freedom
## Multiple R-squared: 0.9077, Adjusted R-squared: 0.9075
## F-statistic: 4129 on 1 and 420 DF, p-value: < 2.2e-16

Examining the regression outputs for both sub-populations, the male model has a higher R-squared (0.9189 vs. 0.9077) and a lower residual standard error (0.8373 vs. 0.8916) than the female model. These results suggest that the simple linear regression model fits the male sub-population slightly better: it explains a slightly greater proportion of the variance in weight and produces slightly more precise predictions.
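
The quoted fit statistics can also be pulled directly from the model summaries rather than read off the printed output; r.squared and sigma are standard components of a summary.lm object:

summary(male_model)$r.squared    # 0.9189
summary(female_model)$r.squared  # 0.9077
summary(male_model)$sigma        # 0.8373 (residual standard error)
summary(female_model)$sigma      # 0.8916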

Exercise 5 (3 + 2 points)
Consider a bivariate dataset consisting of $n$ observations on 2 numerical variables $x^{(1)}$ and $x^{(2)}$. We first fit a simple linear regression model with $x^{(1)}$ as the response variable and $x^{(2)}$ as the explanatory variable. Let $b_{12}$ denote the (least-squares) estimated slope coefficient. Next we fit a simple linear regression model with $x^{(2)}$ as the response variable and $x^{(1)}$ as the explanatory variable. Let $b_{21}$ denote the (least-squares) estimated slope of this new fitted line. Show that
$$b_{12}\, b_{21} = \big(r_{x^{(1)} x^{(2)}}\big)^2,$$
where $r_{x^{(1)} x^{(2)}}$ is the sample correlation coefficient between the variables $x^{(1)}$ and $x^{(2)}$. Argue that both the estimated slopes have the same sign and at most one of them can have absolute value greater than 1.
The least-squares slope estimates are
$$b_{12} = \frac{\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)}{\mathrm{Var}\big(x^{(2)}\big)}, \qquad b_{21} = \frac{\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)}{\mathrm{Var}\big(x^{(1)}\big)}.$$

Multiplying,
$$b_{12} \cdot b_{21} = \frac{\big[\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)\big]^2}{\mathrm{Var}\big(x^{(1)}\big)\,\mathrm{Var}\big(x^{(2)}\big)}.$$

The correlation coefficient is
$$r_{x^{(1)} x^{(2)}} = \frac{\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)}{s_{x^{(1)}}\, s_{x^{(2)}}},$$
and since $s_{x^{(1)}} = \sqrt{\mathrm{Var}\big(x^{(1)}\big)}$ and $s_{x^{(2)}} = \sqrt{\mathrm{Var}\big(x^{(2)}\big)}$,
$$\big(r_{x^{(1)} x^{(2)}}\big)^2 = \frac{\big[\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)\big]^2}{\mathrm{Var}\big(x^{(1)}\big)\,\mathrm{Var}\big(x^{(2)}\big)}.$$

Thus $b_{12} \cdot b_{21} = \big(r_{x^{(1)} x^{(2)}}\big)^2$.

Both slopes $b_{12}$ and $b_{21}$ have positive denominators (variances), so each shares the sign of $\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)$, which matches the sign of $r_{x^{(1)} x^{(2)}}$.

In terms of the standard deviations, the slopes are
$$b_{12} = r_{x^{(1)} x^{(2)}}\,\frac{s_{x^{(1)}}}{s_{x^{(2)}}}, \qquad b_{21} = r_{x^{(1)} x^{(2)}}\,\frac{s_{x^{(2)}}}{s_{x^{(1)}}}.$$

If $s_{x^{(1)}}/s_{x^{(2)}} > 1$, then $s_{x^{(2)}}/s_{x^{(1)}} < 1$. Since $\big|r_{x^{(1)} x^{(2)}}\big| \le 1$, we have $|b_{12}|\,|b_{21}| = \big(r_{x^{(1)} x^{(2)}}\big)^2 \le 1$, so at most one of $|b_{12}|$ or $|b_{21}|$ can exceed 1.
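
The identity can also be verified empirically in R by fitting both regressions (the simulated variables below are arbitrary illustrative choices):

set.seed(5)
x1 <- rnorm(80)
x2 <- 0.6 * x1 + rnorm(80)
b12 <- coef(lm(x1 ~ x2))[2]   # slope of x^(1) on x^(2)
b21 <- coef(lm(x2 ~ x1))[2]   # slope of x^(2) on x^(1)
all.equal(unname(b12 * b21), cor(x1, x2)^2)  # TRUE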
