Homework 1
This homework is due at 11:59 pm (Eastern Time) on Jan 30, 2025. The homework is to be
submitted on Gradescope.
Exercise 1
(a) (2 + 3 + 3 points)
Fix two real numbers $a$ and $b$. Consider the new variable $z := a x^{(1)} + b$. In other words, the value of the variable $z$ for the $j$-th observation is given by $z_j = a x_j^{(1)} + b$. Show that
$$\bar{z}_n = a\bar{x}_n^{(1)} + b, \qquad \mathrm{var}(z) = a^2\,\mathrm{var}\big(x^{(1)}\big), \qquad \mathrm{cov}\big(z, x^{(2)}\big) = a\,\mathrm{cov}\big(x^{(1)}, x^{(2)}\big).$$
Use the definitions of the sample mean, variance, and covariance.
$$\bar{z}_n = \frac{1}{n}\sum_{j=1}^{n} z_j = \frac{1}{n}\sum_{j=1}^{n}\big(a x_j^{(1)} + b\big)$$
$$\bar{z}_n = a \cdot \frac{1}{n}\sum_{j=1}^{n} x_j^{(1)} + \frac{1}{n}\sum_{j=1}^{n} b$$
$$\bar{z}_n = a\bar{x}_n^{(1)} + b$$
$$\mathrm{var}(z) = \frac{1}{n-1}\sum_{j=1}^{n}\big(z_j - \bar{z}_n\big)^2$$
Substitute $z_j = a x_j^{(1)} + b$ and $\bar{z}_n = a\bar{x}_n^{(1)} + b$:
$$\mathrm{var}(z) = \frac{1}{n-1}\sum_{j=1}^{n}\Big[a x_j^{(1)} + b - \big(a\bar{x}_n^{(1)} + b\big)\Big]^2$$
$$\mathrm{var}(z) = \frac{1}{n-1}\sum_{j=1}^{n} a^2\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2$$
$$\mathrm{var}(z) = a^2 \cdot \frac{1}{n-1}\sum_{j=1}^{n}\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2 = a^2\,\mathrm{var}\big(x^{(1)}\big)$$
$$\mathrm{cov}\big(z, x^{(2)}\big) = \frac{1}{n-1}\sum_{j=1}^{n}\big(z_j - \bar{z}_n\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big)$$
Substitute $z_j = a x_j^{(1)} + b$ and $\bar{z}_n = a\bar{x}_n^{(1)} + b$:
$$\mathrm{cov}\big(z, x^{(2)}\big) = \frac{1}{n-1}\sum_{j=1}^{n} a\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big)$$
$$\mathrm{cov}\big(z, x^{(2)}\big) = a \cdot \frac{1}{n-1}\sum_{j=1}^{n}\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big) = a\,\mathrm{cov}\big(x^{(1)}, x^{(2)}\big)$$
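As a quick numerical sanity check (not part of the proof), these identities can be verified in R on simulated data; the variable names below are invented for the illustration:
set.seed(1)
x1 <- rnorm(50); x2 <- rnorm(50)   # stand-ins for x^(1) and x^(2)
a <- 2.5; b <- -1
z <- a * x1 + b
mean(z) - (a * mean(x1) + b)       # ~0: zbar = a*xbar + b
var(z) - a^2 * var(x1)             # ~0: var(z) = a^2 var(x1)
cov(z, x2) - a * cov(x1, x2)       # ~0: cov(z, x2) = a cov(x1, x2)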
(b) (5 points)
Show that
$$\mathrm{var}\big(x^{(1)} + x^{(2)}\big) = \mathrm{var}\big(x^{(1)}\big) + \mathrm{var}\big(x^{(2)}\big) + 2\,\mathrm{cov}\big(x^{(1)}, x^{(2)}\big).$$
$$\mathrm{var}\big(x^{(1)} + x^{(2)}\big) = \frac{1}{n-1}\sum_{j=1}^{n}\big(x_j^{(1)} + x_j^{(2)} - \bar{x}_n^{(1)} - \bar{x}_n^{(2)}\big)^2$$
Expanding the square,
$$\big(x_j^{(1)} + x_j^{(2)} - \bar{x}_n^{(1)} - \bar{x}_n^{(2)}\big)^2 = \big(x_j^{(1)} - \bar{x}_n^{(1)}\big)^2 + \big(x_j^{(2)} - \bar{x}_n^{(2)}\big)^2 + 2\big(x_j^{(1)} - \bar{x}_n^{(1)}\big)\big(x_j^{(2)} - \bar{x}_n^{(2)}\big)$$
Summing over $j$ and dividing by $n-1$, the three terms become the two sample variances and the sample covariance:
$$\mathrm{var}\big(x^{(1)} + x^{(2)}\big) = \mathrm{var}\big(x^{(1)}\big) + \mathrm{var}\big(x^{(2)}\big) + 2\,\mathrm{cov}\big(x^{(1)}, x^{(2)}\big)$$
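The identity in (b) can be checked numerically the same way; a minimal illustrative sketch in R, reusing x1 and x2 from the sketch above:
# ~0: var of the sum equals the sum of variances plus twice the covariance
var(x1 + x2) - (var(x1) + var(x2) + 2 * cov(x1, x2))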
Exercise 2 (7 points)
Consider a dataset of $n$ observations collecting data on a variable $Y$. The value of the numerical variable $Y$ for the $i$-th observation is given by $Y_i$, for $i = 1, \dots, n$. Establish the following sum-of-squares decomposition. For any real number $c$,
$$\sum_{i=1}^{n}(Y_i - c)^2 = \sum_{i=1}^{n}\big(Y_i - \bar{Y}_n\big)^2 + n\big(\bar{Y}_n - c\big)^2,$$
where $\bar{Y}_n$ is the sample mean of the variable $Y$. Observe that the above identity shows the LHS to be minimized uniquely at $c = \bar{Y}_n$.
Write $(Y_i - c)$ on the left-hand side as
$$(Y_i - c) = (Y_i - \bar{Y}_n) + (\bar{Y}_n - c),$$
and use the expansion $(a+b)^2 = a^2 + b^2 + 2ab$ to expand $(Y_i - c)^2$.
$$(Y_i - c)^2 = \big[(Y_i - \bar{Y}_n) + (\bar{Y}_n - c)\big]^2 = (Y_i - \bar{Y}_n)^2 + (\bar{Y}_n - c)^2 + 2(Y_i - \bar{Y}_n)(\bar{Y}_n - c).$$
Summing over $i$,
$$\sum_{i=1}^{n}(Y_i - c)^2 = \sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2 + n(\bar{Y}_n - c)^2 + 2(\bar{Y}_n - c)\sum_{i=1}^{n}(Y_i - \bar{Y}_n).$$
The cross term vanishes because $\sum_{i=1}^{n}(Y_i - \bar{Y}_n) = \sum_{i=1}^{n} Y_i - n\bar{Y}_n = 0$, which gives the identity.
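An illustrative R check of the decomposition (the names Y and c0 are made up for the example):
set.seed(1)
Y <- rnorm(40); c0 <- 0.7
# ~0: sum of squares about c0 equals spread about the mean plus n*(mean - c0)^2
sum((Y - c0)^2) - (sum((Y - mean(Y))^2) + length(Y) * (mean(Y) - c0)^2)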
Exercise 3
(a) (2 points)
Use Exercise 2 to show that for any $(\beta_0, \beta_1)$,
$$\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n}\big(Y_i - \bar{Y}_n - \beta_1 x_i + \beta_1 \bar{x}_n\big)^2 + n\big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big)^2.$$
(b) (2 points)
Deduce from Part (a) that any solution $(\hat\beta_0, \hat\beta_1)$ to the least-squares optimization problem $\min_{\beta_0, \beta_1} \sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)^2$ satisfies the relation
$$\hat\beta_0 = \bar{Y}_n - \hat\beta_1 \bar{x}_n.$$
(c) (2 points)
Deduce from Part (b) that the solution $\hat\beta_1$ solves the following optimization problem:
$$\min_{\beta_1} \sum_{i=1}^{n}\big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2.$$
(d) (4 points)
Show that
$$\frac{1}{n-1}\sum_{i=1}^{n}\big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2 = \beta_1^2\,\mathrm{Var}(x) - 2\beta_1\,\mathrm{Cov}(x, Y) + \mathrm{Var}(Y).$$
(e) (3 points)
Use Part (d) to deduce that the problem in Part (c) is solved at
$$\hat\beta_1 = \frac{r_{xY}\, s_Y}{s_x}.$$
Use the fact that for any positive real number $a$ and real numbers $b, c$, the quadratic expression $a t^2 + b t + c$ is minimized at $t = -b/(2a)$.
a. Apply Exercise 2 to the variable $W_i := Y_i - \beta_1 x_i$, whose sample mean is $\bar{W}_n = \bar{Y}_n - \beta_1 \bar{x}_n$, with the constant $c = \beta_0$. The residual splits as
$$Y_i - \beta_0 - \beta_1 x_i = \big(Y_i - \bar{Y}_n - \beta_1 (x_i - \bar{x}_n)\big) + \big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big),$$
and the sum-of-squares decomposition gives
$$\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 x_i)^2 = \sum_{i=1}^{n}\big(Y_i - \bar{Y}_n - \beta_1 (x_i - \bar{x}_n)\big)^2 + n\big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big)^2.$$
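A quick illustrative check of this identity in R (all names below are invented for the example):
set.seed(2)
x <- rnorm(30); Y <- 1 + 2 * x + rnorm(30)
b0 <- 0.5; b1 <- 1.8                 # arbitrary (beta_0, beta_1)
lhs <- sum((Y - b0 - b1 * x)^2)
rhs <- sum((Y - mean(Y) - b1 * (x - mean(x)))^2) +
  length(Y) * (mean(Y) - b1 * mean(x) - b0)^2
lhs - rhs                            # ~0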
b. The second term $n\big(\bar{Y}_n - \beta_1 \bar{x}_n - \beta_0\big)^2$ is nonnegative and equals zero exactly when $\beta_0 = \bar{Y}_n - \beta_1 \bar{x}_n$, so any minimizer satisfies $\hat\beta_0 = \bar{Y}_n - \hat\beta_1 \bar{x}_n$.
c. Substituting this relation makes the second term vanish:
$$\sum_{i=1}^{n}\big(Y_i - \hat\beta_0 - \beta_1 x_i\big)^2 = \sum_{i=1}^{n}\big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2.$$
Hence $\hat\beta_1$ minimizes the right-hand side over $\beta_1$.
d. Expanding the square,
$$\big((Y_i - \bar{Y}_n) - \beta_1 (x_i - \bar{x}_n)\big)^2 = (Y_i - \bar{Y}_n)^2 - 2\beta_1 (Y_i - \bar{Y}_n)(x_i - \bar{x}_n) + \beta_1^2 (x_i - \bar{x}_n)^2.$$
Summing over $i$ and dividing by $n-1$,
$$\frac{1}{n-1}\sum_{i=1}^{n}\big[\cdots\big] = \beta_1^2\,\mathrm{Var}(x) - 2\beta_1\,\mathrm{Cov}(x, Y) + \mathrm{Var}(Y).$$
e. $\beta_1^2\,\mathrm{Var}(x) - 2\beta_1\,\mathrm{Cov}(x, Y) + \mathrm{Var}(Y)$ is a quadratic in $\beta_1$ with positive leading coefficient $\mathrm{Var}(x)$. Applying the fact above with $a = \mathrm{Var}(x)$ and $b = -2\,\mathrm{Cov}(x, Y)$, it is minimized at
$$\hat\beta_1 = \frac{\mathrm{Cov}(x, Y)}{\mathrm{Var}(x)} = \frac{r_{xY}\, s_Y}{s_x},$$
where $r_{xY} = \frac{\mathrm{Cov}(x, Y)}{s_x s_Y}$, $s_x = \sqrt{\mathrm{Var}(x)}$, and $s_Y = \sqrt{\mathrm{Var}(Y)}$.
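As an illustrative check that the closed form matches lm(), reusing the simulated x and Y from the sketch above (names are for the example only):
fit <- lm(Y ~ x)
unname(coef(fit)["x"])       # least-squares slope from lm()
cov(x, Y) / var(x)           # Cov/Var formula
cor(x, Y) * sd(Y) / sd(x)    # r * s_Y / s_x -- all three agree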
Exercise 4
(a) (5 + 5 points)
Fit separate simple linear regression models with weight being the response variable and height being the predictor variable on the sub-datasets of male and female children. Report the estimated coefficients and a scatter plot (along with the fitted line) for each sub-population.
nepal <- read.csv("/Users/macbook/Desktop/STOR455/nepal.csv")
males <- subset(nepal, sex == 1)
females <- subset(nepal, sex == 2)
male_model <- lm(weight ~ height, data = males)
female_model <- lm(weight ~ height, data = females)
male_model$coefficients
## (Intercept) height
## -9.0869252 0.2393433
female_model$coefficients
## (Intercept) height
## -8.3712108 0.2281936
summary(male_model)
##
## Call:
## lm(formula = weight ~ height, data = males)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7192 -0.5064 -0.0510 0.4496 3.2427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.086925 0.288998 -31.44 <2e-16 ***
## height 0.239343 0.003341 71.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8373 on 453 degrees of freedom
## Multiple R-squared: 0.9189, Adjusted R-squared: 0.9187
## F-statistic: 5131 on 1 and 453 DF, p-value: < 2.2e-16
summary(female_model)
##
## Call:
## lm(formula = weight ~ height, data = females)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.82127 -0.57982 -0.02652 0.50813 3.15115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.371211 0.303580 -27.57 <2e-16 ***
## height 0.228194 0.003551 64.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8916 on 420 degrees of freedom
## Multiple R-squared: 0.9077, Adjusted R-squared: 0.9075
## F-statistic: 4129 on 1 and 420 DF, p-value: < 2.2e-16
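The requested scatter plots with fitted lines can be produced in base R; a minimal sketch (plot titles, axis labels, and colors are illustrative choices):
plot(weight ~ height, data = males, main = "Male children",
     xlab = "height", ylab = "weight")
abline(male_model, col = "red")        # fitted line for the male sub-dataset
plot(weight ~ height, data = females, main = "Female children",
     xlab = "height", ylab = "weight")
abline(female_model, col = "red")      # fitted line for the female sub-dataset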
Examining the regression outputs for both subpopulations, the male model has a lower residual standard error (0.8373 vs. 0.8916) and a higher R-squared (0.9189 vs. 0.9077) than the female model. These results suggest that the simple linear regression model provides a slightly better fit for the male subpopulation: it explains a greater proportion of the variance in weight and yields slightly more precise predictions.
Exercise 5 (3 + 2 points)
Consider a bivariate dataset consisting of $n$ observations on 2 numerical variables $x^{(1)}$ and $x^{(2)}$. We first fit a simple linear regression model with $x^{(1)}$ as the response variable and $x^{(2)}$ as the explanatory variable. Let $b_{12}$ denote the (least-squares) estimated slope coefficient. Next we fit a simple linear regression model with $x^{(2)}$ as the response variable and $x^{(1)}$ as the explanatory variable. Let $b_{21}$ denote the (least-squares) estimated slope of this new fitted line. Show that
$$b_{12}\, b_{21} = \big(r_{x^{(1)} x^{(2)}}\big)^2,$$
where $r_{x^{(1)} x^{(2)}}$ is the sample correlation coefficient between the variables $x^{(1)}$ and $x^{(2)}$. Argue that both estimated slopes have the same sign and that at most one of them can have absolute value greater than 1.
By Exercise 3(e), the least-squares slope estimates are
$$b_{12} = \frac{\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)}{\mathrm{Var}\big(x^{(2)}\big)}, \qquad b_{21} = \frac{\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)}{\mathrm{Var}\big(x^{(1)}\big)},$$
so their product is
$$b_{12} \cdot b_{21} = \frac{\big[\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)\big]^2}{\mathrm{Var}\big(x^{(1)}\big)\,\mathrm{Var}\big(x^{(2)}\big)}.$$
The squared correlation coefficient is
$$\big(r_{x^{(1)} x^{(2)}}\big)^2 = \frac{\big[\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)\big]^2}{\mathrm{Var}\big(x^{(1)}\big)\,\mathrm{Var}\big(x^{(2)}\big)}.$$
Thus $b_{12} \cdot b_{21} = \big(r_{x^{(1)} x^{(2)}}\big)^2$.
Both slopes $b_{12}$ and $b_{21}$ share the sign of $\mathrm{Cov}\big(x^{(1)}, x^{(2)}\big)$, which matches the sign of $r_{x^{(1)} x^{(2)}}$. Since $\big|r_{x^{(1)} x^{(2)}}\big| \le 1$, we have $|b_{12}|\,|b_{21}| = \big(r_{x^{(1)} x^{(2)}}\big)^2 \le 1$; if both slopes had absolute value greater than 1, their product would exceed 1, a contradiction. Hence at most one of $|b_{12}|$ and $|b_{21}|$ can exceed 1.
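An illustrative R verification of this product identity on simulated data (the names u and v are invented for the example):
set.seed(3)
u <- rnorm(100); v <- 0.6 * u + rnorm(100)
b12 <- coef(lm(u ~ v))[2]            # slope with u as response
b21 <- coef(lm(v ~ u))[2]            # slope with v as response
unname(b12 * b21 - cor(u, v)^2)      # ~0: product equals squared correlation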