
Assignment1 Roll 182-001

Assignment 1

Fahim Muntasir_182-001

2024-10-31

# Loading required packages

library(tidyverse)
library(alr4)
library(ggplot2)

Answer to the question no. 2.3.1

UBSprices %>%
  mutate(log_rice2003 = log(rice2003), log_rice2009 = log(rice2009)) %>%
  ggplot(mapping = aes(x = log_rice2003, y = log_rice2009)) +
  geom_point(shape = 1) +
  coord_cartesian(xlim = c(1.5, 4.5), ylim = c(1.5, 4.5)) +
  scale_x_continuous(breaks = seq(1.5, 4.5, by = 0.5)) +
  scale_y_continuous(breaks = seq(1.5, 4.5, by = 0.5)) +
  geom_abline(slope = 1, intercept = 0, lwd = 0.9, linetype = "solid") +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "bl
  annotate("text", x = 4.2, y = 1.5, label = "Dashed line : OLS,
Solid line : y = x",
           color = "red", size = 3)

[Figure: scatter plot of log_rice2009 versus log_rice2003, both axes from 1.5 to 4.5. Dashed line: OLS fit; solid line: y = x.]

[Continuing from problem 2.2] In the scatter plot of y = rice2009 versus
x = rice2003, the points are densely clustered around the point (10,10) and
widely dispersed elsewhere. There are also some potential leverage points and
outliers, and a linear relationship is not evident.
After the log transformation, the points are spread much more evenly and the
unusual values no longer stand out. On log scales, the relationship between
the transformed variables is visually much clearer.
Simple linear regression fits a straight line between the independent and
dependent variables. Because the log scale makes the scatter plot more linear
and more evenly dispersed, log scales are preferable in this case.

Answer to the question no. 2.3.2

Here the original model is $E(y \mid x) = \gamma_0 x^{\beta_1}$. According to the
question, taking logs on both sides and assuming
$\log(E(y \mid x)) \approx E(\log(y) \mid x)$, we can rewrite the model as
$$E(\log(y) \mid x) \approx \log(\gamma_0) + \beta_1 \log(x) = \beta_0 + \beta_1 \log(x),$$
where $\beta_0 = \log(\gamma_0)$.

1. Interpretation of $\beta_0$: As $\beta_0 = \log(\gamma_0)$, if $\gamma_0 = 1$, then
$\beta_0 = 0$. In the original model, $\gamma_0$ is the baseline or scaling factor:
the expected value of $y$ when $x = 1$. Thus $\beta_0$ represents the expected
value of $\log(y)$, the transformed dependent variable, when $x = 1$, i.e. when
the transformed independent variable $\log(x)$ is equal to 0.
2. Interpretation of $\beta_1$: From the original model:
$$E(y \mid x) = \gamma_0 x^{\beta_1} \quad \text{(i)}$$
$$E(y \mid nx) = \gamma_0 (nx)^{\beta_1} \quad \text{(ii)}$$
Dividing (ii) by (i), we get
$$\frac{E(y \mid nx)}{E(y \mid x)} = n^{\beta_1}.$$

So, in the original model, $\beta_1$ gives the proportional change in $y$ (on
average) relative to $x$: multiplying $x$ by $n$ multiplies the expected value
of $y$ by $n^{\beta_1}$. Specifically, $\beta_1$ is approximately the percentage
change in the expected value of $y$ associated with a 1% change in $x$. If
$\beta_1 = 1$, then changing $x$ to $2x$, i.e. a 100% increase in $x$, doubles
the expected value of $y$, i.e. a 100% increase.

From the transformed model:
$$E(\log(y) \mid x) = \log(\gamma_0) + \beta_1 \log(x) \quad \text{(iii)}$$
$$E(\log(y) \mid nx) = \log(\gamma_0) + \beta_1 \log(nx) \quad \text{(iv)}$$
Subtracting (iii) from (iv), we get
$$E(\log(y) \mid nx) - E(\log(y) \mid x) = \beta_1 \log(n).$$

In the transformed model, $\beta_1$ gives the linear change in $\log(y)$ (on
average) with respect to $\log(x)$. Specifically, $\beta_1$ is the change in the
expected value of $\log(y)$ associated with a one-unit increase in $\log(x)$.
If $\beta_1 = 1$, then changing $\log(x)$ to $\log(2x)$, i.e. a 100% increase in
$x$, increases the expected value of $\log(y)$ by $\log(2)$.

• If $\beta_1 > 1$, the response variable increases at a faster rate than the
predictor variable.
• If $0 < \beta_1 < 1$, the response variable increases at a slower rate than
the predictor variable.
• If $\beta_1 = 1$, the response variable increases at the same rate as the
predictor variable.
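
The scaling result can be checked numerically. The following is a minimal sketch on the noiseless mean function only, so the approximation between the log of the mean and the mean of the log does not enter. The values of `gamma0`, `beta1` and `n` are hypothetical and chosen purely for illustration.

```r
# Hypothetical parameter values, chosen only for illustration
gamma0 <- 2.5
beta1  <- 0.7

# Mean function of the original model: E(y | x) = gamma0 * x^beta1
mean_y <- function(x) gamma0 * x^beta1

x <- c(1, 5, 20)
n <- 2  # scale every x by a factor of n

# Ratio E(y | nx) / E(y | x): should equal n^beta1 for every x
ratio <- mean_y(n * x) / mean_y(x)

# Difference on the log scale: should equal beta1 * log(n) for every x
log_diff <- log(mean_y(n * x)) - log(mean_y(x))

print(ratio)
print(log_diff)
```

Both vectors are constant in `x`, which is what makes $\beta_1$ interpretable without reference to the starting value of the predictor.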

Answer to the question no. 2.16.1

UN11_log <- UN11 %>%
  mutate(log_fertility = log(fertility), log_ppgdp = log(ppgdp))

slr <- function(x, y){
  n <- length(x)
  mean_y <- mean(y)
  mean_x <- mean(x)
  # Corrected sums of squares and cross-products
  sxx <- sum((x - mean_x)^2)
  sxy <- sum((x - mean_x) * (y - mean_y))
  syy <- sum((y - mean_y)^2)
  # OLS estimates of slope and intercept
  beta_1 <- sxy / sxx
  beta_0 <- mean_y - (beta_1 * mean_x)
  # Residual sum of squares and estimated error variance (n - 2 df)
  rss <- syy - (sxy^2 / sxx)
  est_error_var <- rss / (n - 2)
  # Estimated variances of the coefficient estimates
  est_beta1_var <- est_error_var / sxx
  est_beta0_var <- est_error_var * ((1 / n) + ((mean_x^2) / sxx))
  R_squared <- 1 - (rss / syy)
  residual <- (y - beta_0 - beta_1 * x)
  return(list("n" = n,
              "Mean_x" = mean_x, "Mean_y" = mean_y,
              "SXX" = sxx, "SYY" = syy, "SXY" = sxy,
              "Intercept" = beta_0, "Slope" = beta_1,
              "Error_Variance" = est_error_var,
              "Beta_0_Variance" = est_beta0_var,
              "Beta_1_Variance" = est_beta1_var,
              "Coefficient_of_determination" = R_squared,
              "Residuals" = residual))
}

result <- slr(y = UN11_log$log_fertility, x = UN11_log$log_ppgdp)

So, using the user-defined function with log_fertility as the dependent
variable and log_ppgdp as the independent variable, simple linear regression
gives

Intercept = 2.6655073
Slope = −0.2071498
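
A hand-rolled estimator like this is easy to cross-check against base R's `lm()`. The sketch below uses simulated data rather than UN11 (the coefficients 2.5 and -0.2 and the sample size are assumptions made only for the example), and recomputes the closed-form slope and intercept from the same SXX and SXY quantities.

```r
set.seed(1)  # simulated data, not the UN11 data

x <- runif(50, min = 5, max = 11)
y <- 2.5 - 0.2 * x + rnorm(50, sd = 0.3)

# Closed-form OLS estimates from the corrected sums of squares
sxx <- sum((x - mean(x))^2)
sxy <- sum((x - mean(x)) * (y - mean(y)))
slope_hand <- sxy / sxx
intercept_hand <- mean(y) - slope_hand * mean(x)

# Reference fit from base R
fit <- lm(y ~ x)

# TRUE when both approaches agree up to numerical tolerance
agree <- isTRUE(all.equal(unname(coef(fit)),
                          c(intercept_hand, slope_hand)))
print(agree)
```

The same comparison could be run on the real data with `lm(log_fertility ~ log_ppgdp, data = UN11_log)`, assuming the alr4 package is available.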

Answer to the question no. 2.16.2

UN11_log %>%
  ggplot(mapping = aes(x = log_ppgdp, y = log_fertility)) +
  geom_point() +
  geom_abline(intercept = result$Intercept, slope = result$Slope,
              lwd = 0.9, color = "orange")

[Figure: scatter plot of log_fertility versus log_ppgdp with the fitted OLS line in orange.]
Answer to the question no. 2.16.3

Here, we wish to test the hypotheses,

H0 : β1 = 0
H1 : β1 < 0
Let α = 0.05

beta_1_null <- 0
alpha <- 0.05
t_cal <- (result$Slope - beta_1_null) / (sqrt(result$Beta_1_Variance))
t_tab <- qt(p = 1 - alpha, df = result$n - 2, lower.tail = TRUE)

# Lower-tailed test (H1: beta_1 < 0): reject only when t_cal falls below -t_tab
if_else(t_cal < -t_tab,
        "Reject the Null Hypothesis",
        "Do not Reject the Null Hypothesis")

## [1] "Reject the Null Hypothesis"

So, at the 5% level of significance, we reject the null hypothesis and conclude that the slope is negative.
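
For a lower-tailed alternative such as H1 : β1 < 0, the critical-value rule and the p-value rule should lead to the same decision. The sketch below illustrates this on simulated data (not UN11; the true slope of -0.5 and the sample size are assumptions for the example), using `pt()` for the lower-tail probability.

```r
set.seed(2)  # simulated data, not the UN11 data

n <- 40
x <- rnorm(n)
y <- 1 - 0.5 * x + rnorm(n)

fit <- lm(y ~ x)
t_stat <- summary(fit)$coefficients["x", "t value"]
df <- fit$df.residual  # n - 2 for simple linear regression

alpha <- 0.05

# p-value rule: lower-tail probability of the observed t statistic
p_lower <- pt(t_stat, df = df)
reject_p <- p_lower < alpha

# Critical-value rule: reject when t falls below the lower alpha quantile
reject_crit <- t_stat < qt(alpha, df = df)

print(c(t = t_stat, p_lower = p_lower))
print(c(reject_p = reject_p, reject_crit = reject_crit))
```

Note that `qt(alpha, df)` is the negative of `qt(1 - alpha, df)`, so the rejection region lies entirely in the lower tail.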

Answer to the question no. 2.16.4

result$Coefficient_of_determination

## [1] 0.525985

Interpretation: As the coefficient of determination is $R^2 = 0.5260$, about
52.60% of the total variation in the response variable, log_fertility, is
explained by the predictor variable, log_ppgdp.
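
In simple linear regression, $R^2 = 1 - RSS/SYY$ also equals the squared sample correlation between the predictor and the response. A minimal sketch on simulated data (not UN11; the coefficients are assumptions for the example) compares the hand computation with `summary(lm())` and `cor()`.

```r
set.seed(3)  # simulated data, not the UN11 data

x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)

# R^2 from the sums of squares
sxx <- sum((x - mean(x))^2)
sxy <- sum((x - mean(x)) * (y - mean(y)))
syy <- sum((y - mean(y))^2)
rss <- syy - sxy^2 / sxx
r2_hand <- 1 - rss / syy

# Reference values from base R
r2_lm  <- summary(lm(y ~ x))$r.squared
r2_cor <- cor(x, y)^2

print(c(hand = r2_hand, lm = r2_lm, cor_squared = r2_cor))
```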

Answer to the question no. 2.16.5

x_pred <- log(1000)

pred_y <- function(x_pred, beta0, beta1) {
  y_pred <- beta0 + (beta1*x_pred)
  return (y_pred)
}

y_predicted <- pred_y(x_pred = x_pred,
                      beta0 = result$Intercept,
                      beta1 = result$Slope)

var_pred <- result$Error_Variance +


(result$Error_Variance) * ((1 / result$n) + (x_pred - result$M

se_pred <- sqrt(var_pred)

# Prediction interval on the log scale: point prediction +/- t quantile * se_pred
t_crit <- qt(p = 1 - 0.05/2, df = result$n - 2)
a <- y_predicted - t_crit * se_pred
b <- y_predicted + t_crit * se_pred
pred_intrvl_transform <- c(a, b)
# Back-transform the endpoints to the fertility scale
pred_intrvl <- c(exp(a), exp(b))

If ppgdp = 1000,
$\widetilde{\mathrm{log\_fertility}}_* = 1.2345673$
$\mathrm{sepred}(\widetilde{\mathrm{log\_fertility}}_* \mid \mathrm{ppgdp} = 1000) = 0.308653$
95% prediction interval for log_fertility $\approx (0.6259,\ 1.8433)$
95% prediction interval for fertility $\approx (1.870,\ 6.317)$
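
The standard 95% prediction interval in simple linear regression is the point prediction plus or minus the t quantile times the standard error of prediction. The sketch below checks a hand computation of that formula against `predict()` with `interval = "prediction"`. It uses simulated data rather than UN11, and the names `x_star`, `pi_hand` and `pi_ref` are introduced only for this example.

```r
set.seed(4)  # simulated data, not the UN11 data

n <- 60
x <- runif(n, min = 5, max = 11)
y <- 2.5 - 0.2 * x + rnorm(n, sd = 0.3)
fit <- lm(y ~ x)

x_star <- log(1000)  # new predictor value on the log scale

# Hand computation: y_star +/- t * sepred
sigma2  <- sum(residuals(fit)^2) / (n - 2)
sxx     <- sum((x - mean(x))^2)
y_star  <- unname(coef(fit)[1] + coef(fit)[2] * x_star)
se_pred <- sqrt(sigma2 * (1 + 1 / n + (x_star - mean(x))^2 / sxx))
t_crit  <- qt(0.975, df = n - 2)
pi_hand <- c(y_star - t_crit * se_pred, y_star + t_crit * se_pred)

# Reference interval from base R
pi_ref <- predict(fit, newdata = data.frame(x = x_star),
                  interval = "prediction", level = 0.95)

print(pi_hand)
print(pi_ref)

# Back-transforming the endpoints gives an interval on the original scale
print(exp(pi_hand))
```

Because `exp()` is monotone, exponentiating the endpoints preserves the coverage of the interval on the original scale.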

Answer to the question no. 2.16.6

UN11$region <- as.character(UN11$region)

#1
max_fertility_index <- which.max(UN11$fertility)
max_fertility <- c("Locality" = UN11$region[max_fertility_index],
"Fertility" = UN11$fertility[max_fertility_index])

#2
min_fertility_index <- which.min(UN11$fertility)
min_fertility <- c("Locality" = UN11$region[min_fertility_index],
"Fertility" = UN11$fertility[min_fertility_index])

#3
UN11_log <- UN11_log %>%
mutate("Residuals" = result$Residuals)

ordered_residual <- sort(UN11_log$Residuals, decreasing = TRUE)

#Two localities with largest positive residual :


high_resid <- rbind.data.frame(UN11_log[which(UN11_log$Residuals == orde
UN11_log[which(UN11_log$Residuals == orde

#Two localities with largest negative residual :
low_resid <- rbind.data.frame(UN11_log[which(UN11_log$Residuals == ordere
UN11_log[which(UN11_log$Residuals == ordere

(1): Region of the locality with the highest value of fertility:

max_fertility[["Locality"]]

## [1] "Africa"

(2): Region of the locality with the lowest value of fertility:

min_fertility[["Locality"]]

## [1] "Europe"

(3): The two localities with the largest positive residuals (on the log scale) are:

high_resid["region"]

## region
## Equatorial Guinea Africa
## Angola Africa

The two localities with the largest negative residuals (on the log scale) are:

low_resid["region"]

## region
## Bosnia and Herzegovina Europe
## Moldova Europe
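
Rows with extreme residuals can also be selected by position with `order()` instead of matching sorted values back to rows. This is only a sketch on a hypothetical toy data frame (`toy`, with made-up residuals and row names standing in for localities), not the UN11 results.

```r
# Hypothetical toy data: row names play the role of localities
toy <- data.frame(Residuals = c(0.4, -1.2, 0.9, -0.3, 0.1),
                  row.names = c("A", "B", "C", "D", "E"))

# Row positions sorted from largest to smallest residual
idx <- order(toy$Residuals, decreasing = TRUE)

# Two largest positive residuals, largest first
top2 <- rownames(toy)[head(idx, 2)]

# Two largest negative residuals, most negative first
bottom2 <- rownames(toy)[rev(tail(idx, 2))]

print(top2)
print(bottom2)
```

Selecting by position avoids comparing floating-point values with `==` and still works when two residuals happen to be tied.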
