Assignment1 Roll 182-001
Fahim Muntasir_182-001
2024-10-31
library(tidyverse)
library(alr4)
library(ggplot2)
UBSprices %>%
  mutate(log_rice2003 = log(rice2003), log_rice2009 = log(rice2009)) %>%
  ggplot(mapping = aes(x = log_rice2003, y = log_rice2009)) +
  geom_point(shape = 1) +
  coord_cartesian(xlim = c(1.5, 4.5), ylim = c(1.5, 4.5)) +
  scale_x_continuous(breaks = seq(1.5, 4.5, by = 0.5)) +
  scale_y_continuous(breaks = seq(1.5, 4.5, by = 0.5)) +
  geom_abline(slope = 1, intercept = 0, lwd = 0.9, linetype = "solid") +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "bl
annotate("text", x = 4.2, y = 1.5, label = "Dashed line : OLS,
Solid line : y = x",
color = "red", size = 3)
[Figure: scatter plot of log_rice2009 versus log_rice2003, with the OLS line (dashed) and the line y = x (solid).]
[Continuation of problem 2.2] In the scatter plot of y = rice2009 versus
x = rice2003, the points are densely clustered around the point (10, 10) and
widely dispersed elsewhere. There are also some potential leverage points and
outliers, and a linear relationship is not evident.
After a log transformation, the points are spread much more evenly and the
unusual values no longer stand out. On the log scale, the relationship
between the transformed variables is also much easier to see.
Simple linear regression fits a straight line relating the dependent variable
to the independent variable. Because the log scale makes the scatter plot more
linear and more evenly dispersed, using log scales is preferable in this case.
Answer to the question no. 2.3.2
Here the original model is $E(y \mid x) = \gamma_0 x^{\beta_1}$. As the question suggests, taking
logs on both sides and assuming $\log(E(y \mid x)) \approx E(\log(y) \mid x)$, we can rewrite the
model as
$$E(\log(y) \mid x) \approx \log(\gamma_0) + \beta_1 \log(x) = \beta_0 + \beta_1 \log(x),$$
where
$$\beta_0 = \log(\gamma_0).$$
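From the transformed model:
$$E(\log(y) \mid x) = \log(\gamma_0) + \beta_1 \log(x) \quad \text{(iii)}$$
$$E(\log(y) \mid nx) = \log(\gamma_0) + \beta_1 \log(nx) \quad \text{(iv)}$$
Subtracting (iii) from (iv) and using $\log(nx) = \log(n) + \log(x)$, we get
$$E(\log(y) \mid nx) - E(\log(y) \mid x) = \beta_1 \log(n)$$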
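As an aside, the same result can be read on the original scale. Starting from the original model stated in the answer to 2.3.2, the ratio of the mean responses at nx and at x works out as follows. This is a sketch added for interpretation, using only the symbols already defined above:

```latex
\begin{align*}
\frac{E(y \mid nx)}{E(y \mid x)}
  &= \frac{\gamma_0 (nx)^{\beta_1}}{\gamma_0 x^{\beta_1}}
   = n^{\beta_1}, \\
\log\!\left(n^{\beta_1}\right) &= \beta_1 \log(n).
\end{align*}
```

So multiplying x by n multiplies the mean of y by $n^{\beta_1}$, and the log of that factor is the difference $\beta_1 \log(n)$ obtained on the log scale. For example, with n = 2, doubling x multiplies the mean of y by $2^{\beta_1}$.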
Answer to the question no. 2.16.1
Using the user-defined function to fit a simple linear regression with
log_fertility as the dependent variable and log_ppgdp as the independent
variable, we get,
Intercept = 2.6655073
Slope = −0.2071498
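The user-defined function itself is not reproduced in this part of the report. A minimal sketch of what such a function could look like is given below. The name simple_ols is hypothetical, and the returned element names are assumed from how result is accessed elsewhere in this report (result$Slope, result$Beta_1_Variance, and so on). It uses only the standard closed-form formulas for simple linear regression.

```r
# Hypothetical helper: closed-form simple linear regression of y on x.
# The function name is an assumption; the element names mirror those
# accessed through `result$...` elsewhere in this report.
simple_ols <- function(x, y) {
  n <- length(x)
  x_bar <- mean(x)
  y_bar <- mean(y)

  # Corrected sums of squares and cross-products
  sxx <- sum((x - x_bar)^2)
  sxy <- sum((x - x_bar) * (y - y_bar))
  syy <- sum((y - y_bar)^2)

  # OLS estimates
  slope <- sxy / sxx
  intercept <- y_bar - slope * x_bar

  # Residuals and the unbiased estimate of the error variance
  residuals <- y - (intercept + slope * x)
  rss <- sum(residuals^2)
  sigma2_hat <- rss / (n - 2)

  list(
    Intercept = intercept,
    Slope = slope,
    Beta_1_Variance = sigma2_hat / sxx,            # estimated Var(slope)
    Coefficient_of_determination = 1 - rss / syy,  # R^2
    Residuals = residuals,
    n = n
  )
}
```

Under these assumptions, a call such as result <- simple_ols(UN11_log$log_ppgdp, UN11_log$log_fertility) would be expected to agree with the coefficients from lm(log_fertility ~ log_ppgdp, data = UN11_log).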
Answer to the question no. 2.16.2
UN11_log %>%
  ggplot(mapping = aes(x = log_ppgdp, y = log_fertility)) +
  geom_point() +
  geom_abline(intercept = result$Intercept, slope = result$Slope,
              lwd = 0.9, color = "orange")
[Figure: scatter plot of log_fertility versus log_ppgdp with the fitted regression line.]
Answer to the question no. 2.16.3
H0 : β1 = 0
H1 : β1 < 0
Let α = 0.05
beta_1_null <- 0
alpha <- 0.05
t_cal <- (result$Slope - beta_1_null) / (sqrt(result$Beta_1_Variance))
t_tab <- qt(p = 1 - alpha, df = result$n - 2, lower.tail = TRUE)
Since this is a left-tailed test, H0 is rejected when t_cal < -t_tab.
So, at the 5% level of significance, we reject H0 and conclude that the slope is negative.
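The decision rule for this left-tailed test can be sketched as a small function. The helper name slope_left_tail_test and the numbers in the example call are hypothetical and are not the values from this assignment. The sketch uses the lower-tail critical value qt(alpha, df), which equals the negative of the t_tab computed above.

```r
# Left-tailed t test of H0: beta_1 = beta_1_null against H1: beta_1 < beta_1_null.
# `slope_left_tail_test` is a hypothetical helper, not part of the original code.
slope_left_tail_test <- function(slope, slope_var, n,
                                 beta_1_null = 0, alpha = 0.05) {
  df <- n - 2
  t_cal <- (slope - beta_1_null) / sqrt(slope_var)
  t_crit <- qt(p = alpha, df = df)    # lower-tail critical value (negative)
  p_value <- pt(q = t_cal, df = df)   # P(T <= t_cal) under H0
  list(
    t_cal = t_cal,
    t_crit = t_crit,
    p_value = p_value,
    reject_H0 = t_cal < t_crit
  )
}

# Illustrative inputs only (NOT the values from this assignment)
example_test <- slope_left_tail_test(slope = -0.5, slope_var = 0.01, n = 30)
print(example_test$reject_H0)
```

With the quantities used in this report, the corresponding call would be slope_left_tail_test(result$Slope, result$Beta_1_Variance, result$n).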
Answer to the question no. 2.16.4
result$Coefficient_of_determination
## [1] 0.525985
Answer to the question no. 2.16.5
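If ppgdp = 1000, the predicted value on the log scale is
$$\widetilde{\text{log\_fertility}}_* = 2.6655073 - 0.2071498 \log(1000) = 1.2345673.$$
The standard error of prediction is
$$\text{sepred}(\widetilde{\text{log\_fertility}}_* \mid x_* = 1000) = 0.308653.$$
With $t(0.975, n - 2) \approx 1.972$, the 95% prediction interval on the log scale is
$$1.2345673 \pm 1.972 \times 0.308653 = (0.6259, 1.8433).$$
Exponentiating the end points gives the prediction interval for fertility:
$$(e^{0.6259}, e^{1.8433}) = (1.870, 6.317).$$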
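The arithmetic behind a prediction interval of this kind can be sketched in a few lines, using only the intercept, slope, and sepred reported in this document. The residual degrees of freedom are an assumption here (197, taking UN11 to have 199 rows), as is the 95% level.

```r
# Values reported in this report
intercept <- 2.6655073
slope     <- -0.2071498
se_pred   <- 0.308653

# Assumptions: residual df = n - 2 = 197 (UN11 taken to have 199 rows),
# and a 95% prediction level
df_resid <- 197
level    <- 0.95

# Point prediction on the log scale at ppgdp = 1000
x_star   <- 1000
log_pred <- intercept + slope * log(x_star)

# Prediction interval on the log scale: point +/- t multiplier * sepred
t_mult <- qt(1 - (1 - level) / 2, df = df_resid)
pi_log <- log_pred + c(-1, 1) * t_mult * se_pred

# Back-transform the end points to the fertility scale
pi_fertility <- exp(pi_log)

print(log_pred)
print(pi_log)
print(pi_fertility)
```

Note that the half-width on the log scale is the t multiplier times sepred, not the t multiplier alone, and that the back-transformed interval is not symmetric around exp(log_pred).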
Answer to the question no. 2.16.6
# 1: region of the locality with the highest fertility
max_fertility_index <- which.max(UN11$fertility)
max_fertility <- c("Locality" = UN11$region[max_fertility_index],
                   "Fertility" = UN11$fertility[max_fertility_index])
# 2: region of the locality with the lowest fertility
min_fertility_index <- which.min(UN11$fertility)
min_fertility <- c("Locality" = UN11$region[min_fertility_index],
                   "Fertility" = UN11$fertility[min_fertility_index])
# 3: attach the residuals from the fitted log-log model
UN11_log <- UN11_log %>%
  mutate("Residuals" = result$Residuals)
(1): Region of the locality with the highest value of fertility:
max_fertility[["Locality"]]
## [1] "Africa"
(2): Region of the locality with the lowest value of fertility:
min_fertility[["Locality"]]
## [1] "Europe"
(3): The two localities with the largest positive residuals (on the log scale) are:
high_resid["region"]
## region
## Equatorial Guinea Africa
## Angola Africa
The two localities with the largest negative residuals (on the log scale) are:
low_resid["region"]
## region
## Bosnia and Herzegovina Europe
## Moldova Europe