Solutions To Coding

The document discusses statistical modeling using linear regression and polynomial regression techniques, focusing on the relationship between concentration and strength in a dataset. It includes R code for data analysis, model fitting, and evaluation metrics such as adjusted R-squared and p-values. The analysis concludes that a fifth-order polynomial model provides a better fit for the data compared to a third-order model.

Uploaded by

Niyati Shah

Name-Niyati Shah

VIN-633007415

STAT-654

email-myatibhavik Shah atamu edu

Assignment-1
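#1.

For the simple linear regression model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

and the data set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the residuals are $e_i = y_i - \hat{y}_i$ and the sum of squared residuals is

$$\mathrm{SSR} = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2 .$$

To determine $\beta_0$ and $\beta_1$, we have to minimize SSR. Setting the partial derivative with respect to $\beta_0$ to zero:

$$\frac{\partial\, \mathrm{SSR}}{\partial \beta_0} = 2 \sum \left[ y_i - (\beta_0 + \beta_1 x_i) \right](-1) = 0$$

$$-\sum y_i + \sum \beta_0 + \beta_1 \sum x_i = 0$$

With $\bar{y} = \sum y_i / n$ and $\bar{x} = \sum x_i / n$:

$$-n\bar{y} + n\beta_0 + \beta_1 n \bar{x} = 0$$

$$\therefore\ \beta_0 = \bar{y} - \beta_1 \bar{x}$$

Now, setting the partial derivative with respect to $\beta_1$ to zero:

$$\frac{\partial\, \mathrm{SSR}}{\partial \beta_1} = 2 \sum \left[ y_i - (\beta_0 + \beta_1 x_i) \right](-x_i) = 0$$

Substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$:

$$\sum \left[ \bar{y} - \beta_1 \bar{x} + \beta_1 x_i - y_i \right] x_i = 0$$

$$\bar{y} \sum x_i - \beta_1 \bar{x} \sum x_i + \beta_1 \sum x_i^2 - \sum x_i y_i = 0$$

$$\beta_1 \left[ n\bar{x}^2 - \sum x_i^2 \right] = n\bar{x}\bar{y} - \sum x_i y_i$$

$$\therefore\ \beta_1 = \frac{n\bar{x}\bar{y} - \sum x_i y_i}{n\bar{x}^2 - \sum x_i^2} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{\sum x_i y_i - \bar{x} \sum y_i}{\sum (x_i - \bar{x})^2}$$

$R$ = sample correlation coefficient; $R^2$ = coefficient of determination.

(1) Residuals: $e_i = y_i - \hat{y}_i$. The error sum of squares is the sum of squared residuals evaluated at the least-squares fit:

$$\mathrm{SSE} = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$

(2) Total sum of squares:

$$\mathrm{SST} = \sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum \left[ (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \right]^2$$

$$= \sum (y_i - \hat{y}_i)^2 + 2 \sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum (\hat{y}_i - \bar{y})^2$$

Now, with $\beta_0$ and $\beta_1$ denoting the least-squares values found above, $\hat{y}_i = \beta_0 + \beta_1 x_i$ and $\bar{y} = \beta_0 + \beta_1 \bar{x}$, so

$$\hat{y}_i - \bar{y} = \beta_1 (x_i - \bar{x}).$$

Also, since $\beta_0 = \bar{y} - \beta_1 \bar{x}$,

$$y_i - \hat{y}_i = y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i = (y_i - \bar{y}) + \beta_1 (\bar{x} - x_i).$$

Therefore,

$$\sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum \beta_1 (x_i - \bar{x}) \left[ (y_i - \bar{y}) + \beta_1 (\bar{x} - x_i) \right] = \beta_1 \sum (x_i - \bar{x})(y_i - \bar{y}) - \beta_1^2 \sum (x_i - \bar{x})^2 .$$

From the expression for $\beta_1$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \bar{x} \sum y_i = \beta_1 \sum (x_i - \bar{x})^2$, so

$$\sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \beta_1^2 \sum (x_i - \bar{x})^2 - \beta_1^2 \sum (x_i - \bar{x})^2 = 0 .$$

Hence,

$$\mathrm{SST} = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$

$$\therefore\ \mathrm{SST} = \mathrm{SSE} + \sum (\hat{y}_i - \bar{y})^2 = \mathrm{SSE} + \mathrm{SSR},$$

where SSR $= \sum (\hat{y}_i - \bar{y})^2$ now denotes the regression sum of squares, not the sum of squared residuals minimized at the start.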
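The estimators and the decomposition above can be checked numerically. The sketch below is only an illustration on simulated data (the data, seed, and variable names are assumptions, not part of the assignment): it computes the slope as (sum of x*y minus x-bar times sum of y) divided by the sum of squared deviations of x, the intercept as y-bar minus slope times x-bar, compares both with lm(), and then checks that SST equals SSE plus the regression sum of squares.

```r
# Simulated data, for illustration only (not the assignment data)
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30)

# Closed-form least-squares estimates from the normal equations
b1 <- (sum(x * y) - mean(x) * sum(y)) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() should return the same intercept and slope
fit <- lm(y ~ x)
same_coef <- isTRUE(all.equal(unname(coef(fit)), c(b0, b1)))

# Sum-of-squares decomposition: SST = SSE + SSR
y_hat <- b0 + b1 * x
SSE <- sum((y - y_hat)^2)        # error sum of squares
SSR <- sum((y_hat - mean(y))^2)  # regression sum of squares
SST <- sum((y - mean(y))^2)      # total sum of squares
same_ss <- isTRUE(all.equal(SST, SSE + SSR))

print(c(coefficients_match = same_coef, decomposition_holds = same_ss))
```

Both checks should hold up to floating-point tolerance for any data set with more than one distinct x value, since they follow from the normal equations rather than from the particular numbers used here.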
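#2.

Given that $\mu_X = 2$, the centered model is

$$\mu_{Y|X}(x) = \beta_0 + \beta_1 (x - \mu_X) + \beta_2 (x - \mu_X)^2 = \beta_0 + \beta_1 (x - 2) + \beta_2 (x - 2)^2$$

$$= \beta_0 + \beta_1 x - 2\beta_1 + \beta_2 (x^2 - 4x + 4)$$

$$= \beta_0 + \beta_1 x - 2\beta_1 + \beta_2 x^2 - 4\beta_2 x + 4\beta_2$$

$$= (\beta_0 - 2\beta_1 + 4\beta_2) + (\beta_1 - 4\beta_2)\, x + \beta_2 x^2 .$$

Comparing this equation with

$$\mu_{Y|X}(x) = -8.5 - 3.2x + 0.7x^2$$

gives $\beta_2 = 0.7$ and

$$\beta_1 - 4\beta_2 = -3.2$$

$$\therefore\ \beta_1 - 4(0.7) = -3.2$$

$$\beta_1 = -3.2 + 2.8 = -0.4$$

And

$$\beta_0 - 2\beta_1 + 4\beta_2 = -8.5$$

$$\therefore\ \beta_0 - 2(-0.4) + 4(0.7) = -8.5$$

$$\therefore\ \beta_0 = -12.1$$

Thus, the centered model is

$$\mu_{Y|X}(x) = -12.1 - 0.4(x - 2) + 0.7(x - 2)^2 .$$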
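A quick way to check the centered coefficients is to evaluate both forms of the mean function on a grid and confirm they agree. This is a minimal sketch, taking the uncentered equation to be -8.5 - 3.2x + 0.7x^2 as read from the worked solution; the function names f_raw and f_centered and the grid are illustrative choices.

```r
# Uncentered mean function (coefficients as read from the worked solution)
f_raw <- function(x) -8.5 - 3.2 * x + 0.7 * x^2

# Centered mean function with beta0 = -12.1, beta1 = -0.4, beta2 = 0.7
f_centered <- function(x) -12.1 - 0.4 * (x - 2) + 0.7 * (x - 2)^2

# The two forms should give the same value at every x
x_grid <- seq(-5, 5, by = 0.5)
forms_agree <- isTRUE(all.equal(f_raw(x_grid), f_centered(x_grid)))
print(forms_agree)
```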
*Help of ChatGPT was used in learning some part of the code.

Question D
R Code with Outputs
#Question D1

setwd("C:/Users/n-shah/OneDrive - Texas A&M University/Semester 3/STAT 654 Stats")


# Read the data from the text file
hc <- read.table("HardwoodTensileStr-1.txt", header=TRUE, sep="")
# View the data
head(hc)

Concentration Strength
1 1.0 6.3
2 1.5 11.1
3 2.0 20.0
4 3.0 24.0
5 4.0 26.1
6 4.5 30.0

# Center the predictor


hc$Concentration_centered <- hc$Concentration - mean(hc$Concentration)
y <- hc$Strength
# Fit a third order polynomial model
hc3 <- lm(Strength ~ poly(Concentration_centered, 3, raw=TRUE), data=hc)
#Summary of the model to get the adjusted R-squared and p-value
summary(hc3)

Call:
lm(formula = Strength ~ poly(Concentration_centered, 3, raw = TRUE),
data = hc)

Residuals:
Min 1Q Median 3Q Max
-4.6250 -1.6109 0.0413 1.5892 5.0216

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.975562 0.869032 51.754 < 2e-16 ***
poly(Concentration_centered, 3, raw = TRUE)1 4.339394 0.350978 12.364 2.87e-09 ***
poly(Concentration_centered, 3, raw = TRUE)2 -0.548873 0.039199 -14.002 5.11e-10 ***
poly(Concentration_centered, 3, raw = TRUE)3 -0.055188 0.009789 -5.638 4.72e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.585 on 15 degrees of freedom


Multiple R-squared: 0.9707, Adjusted R-squared: 0.9648
F-statistic: 165.4 on 3 and 15 DF, p-value: 1.025e-11

The adjusted R-squared is 0.9648 and the p-value of the overall F-test is 1.025e-11, which is far below the 0.01
significance level. Thus, the model is statistically significant and can be used to predict the tensile
strength.
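The two numbers quoted above can also be pulled out of the summary object instead of being read off the printout. The sketch below uses simulated data because the assignment file is not reproduced here; the data frame d, the seed, and the coefficients used to simulate are assumptions for illustration only.

```r
# Simulated cubic data, for illustration only (not the hardwood data)
set.seed(2)
d <- data.frame(x = seq(-5, 5, length.out = 25))
d$y <- 40 + 4 * d$x - 0.5 * d$x^2 - 0.05 * d$x^3 + rnorm(25, sd = 2)

fit3 <- lm(y ~ poly(x, 3, raw = TRUE), data = d)
s <- summary(fit3)

# Adjusted R-squared is stored directly in the summary object
adj_r2 <- s$adj.r.squared

# The overall p-value is not stored; recompute it from the F statistic,
# which is a named vector with elements value, numdf and dendf
f <- s$fstatistic
p_overall <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(c(adj_r2 = adj_r2, p_overall = p_overall))
```

Applied to hc3, the same two expressions should reproduce the values printed by summary(hc3).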

#Question D2
# Plot the data and the fitted model
plot(hc$Concentration_centered, y, main="Scatter Plot with Third Order Polynomial Fit", xlab =
"Concentration(centered)", ylab = "Strength", pch=19)
points(hc$Concentration_centered,fitted(hc3), col="red",pch=19)
# Adding the fitted curve
curve(predict(hc3,newdata = data.frame(Concentration_centered=x)), add = TRUE, col="blue",lwd=2)

As seen in the above graph, it appears that the third order polynomial fits the dataset well.

#Question D3
# Residuals vs Fitted values plot
plot(fitted(hc3),resid(hc3), xlab = "Fitted Values", ylab="Residuals", main = "Residuals vs Fitted Values")
abline(h=0, col="red", lty=2)
# Fit a fifth order polynomial model to this data
hc5 <- lm(Strength ~ poly(Concentration_centered, 5, raw=TRUE), data=hc)
#Summary of the model to get the adjusted R-squared and p-value
summary(hc5)
Call:
lm(formula = Strength ~ poly(Concentration_centered, 5, raw = TRUE),
data = hc)

Residuals:
Min 1Q Median 3Q Max
-2.65167 -0.91159 -0.03811 0.96396 2.56865

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.6187788 0.7309210 59.676 < 2e-16 ***
poly(Concentration_centered, 5, raw = TRUE)1 5.3479308 0.3896655 13.724 4.11e-09 ***
poly(Concentration_centered, 5, raw = TRUE)2 -0.1378567 0.1059263 -1.301 0.215700
poly(Concentration_centered, 5, raw = TRUE)3 -0.1630817 0.0289147 -5.640 8.06e-05 ***
poly(Concentration_centered, 5, raw = TRUE)4 -0.0114448 0.0026525 -4.315 0.000840 ***
poly(Concentration_centered, 5, raw = TRUE)5 0.0021978 0.0005163 4.257 0.000935 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.703 on 13 degrees of freedom


Multiple R-squared: 0.989, Adjusted R-squared: 0.9847
F-statistic: 233.1 on 5 and 13 DF, p-value: 3.022e-12

The adjusted R-squared is 0.9847, which is greater than that of the third-order polynomial (0.9648). The p-value
(3.022e-12) is below 0.01 and also below that of the third-order polynomial (1.025e-11).
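Because the third-order model is nested in the fifth-order model, the two fits can also be compared with a partial F-test through anova(), in addition to comparing adjusted R-squared. This is a hedged sketch on simulated data (the data frame d and the simulating coefficients are assumptions); it is one possible check, not the method used in the assignment.

```r
# Simulated data with fourth/fifth-order structure, for illustration only
set.seed(3)
d <- data.frame(x = seq(-5, 5, length.out = 30))
d$y <- 40 + 5 * d$x - 0.15 * d$x^3 + 0.002 * d$x^5 + rnorm(30, sd = 1.5)

fit3 <- lm(y ~ poly(x, 3, raw = TRUE), data = d)
fit5 <- lm(y ~ poly(x, 5, raw = TRUE), data = d)

# Partial F-test: do the 4th- and 5th-order terms jointly improve the fit?
cmp <- anova(fit3, fit5)
print(cmp)

# p-value for the two extra terms (second row of the table)
p_extra <- cmp[["Pr(>F)"]][2]
```

A small p_extra would indicate that the higher-order terms explain variation beyond the cubic fit; with the assignment data the same call would be anova(hc3, hc5).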

Question D4
# Summary of the fifth-order model (needed for the p-values below)
summary_hc5 <- summary(hc5)
# p-values for the polynomial terms (excluding the intercept)
p_values <- summary_hc5$coefficients[-1, "Pr(>|t|)"]

# Index of the term with the largest p-value. p_values already excludes the
# intercept, so this index lines up with poly_terms and needs no offset.
largest_p_value_index <- which.max(p_values)

# Generating the polynomial terms
poly_terms <- paste0("I(Concentration_centered^", 1:5, ")", collapse=" + ")
# Split into individual terms. fixed = TRUE splits on the literal " + ";
# read as a regular expression, " + " matches a run of spaces and leaves
# the string unsplit, so no term would be removed.
poly_terms <- strsplit(poly_terms, " + ", fixed = TRUE)[[1]]
# Removing the term with the largest p-value
poly_terms <- poly_terms[-largest_p_value_index]
newf <- as.formula(paste("Strength ~ ", paste(poly_terms, collapse=" + ")))

# Fitting the model with the updated formula
hc5_reduced <- lm(newf, data=hc)
summary_hc5_reduced <- summary(hc5_reduced)
summary_hc5_reduced

Call:
lm(formula = newf, data = hc)

Residuals:
Min 1Q Median 3Q Max
-2.65167 -0.91159 -0.03811 0.96396 2.56865

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.6187788 0.7309210 59.676 < 2e-16 ***
I(Concentration_centered^1) 5.3479308 0.3896655 13.724 4.11e-09 ***
I(Concentration_centered^2) -0.1378567 0.1059263 -1.301 0.215700
I(Concentration_centered^3) -0.1630817 0.0289147 -5.640 8.06e-05 ***
I(Concentration_centered^4) -0.0114448 0.0026525 -4.315 0.000840 ***
I(Concentration_centered^5) 0.0021978 0.0005163 4.257 0.000935 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.703 on 13 degrees of freedom


Multiple R-squared: 0.989, Adjusted R-squared: 0.9847
F-statistic: 233.1 on 5 and 13 DF, p-value: 3.022e-12

# Scatterplot of the data with the fitted curve from the final model
plot(hc$Concentration_centered, hc$Strength, main="Scatter Plot with Reduced Model Fit",
xlab="Concentration (Centered)", ylab="Strength", pch=19)
lines(sort(hc$Concentration_centered), predict(hc5_reduced)[order(hc$Concentration_centered)],
col="blue", lwd=2)

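The summary of the reduced fit shown above still lists all five polynomial terms, including the squared term with
the largest p-value (0.2157), and its estimates, residual standard error, and adjusted R-squared (0.9847) are
identical to those of the full fifth-order model. The term was therefore not actually removed in the run that
produced this output, so the unchanged adjusted R-squared does not by itself show that omitting it leaves the fit unaffected.
Compared with the third-order polynomial, the fifth-order polynomial has a higher adjusted R-squared (0.9847
vs. 0.9648) and a smaller residual standard error (1.703 vs. 2.585), so it fits the dataset more closely.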

Question E

Question E1

setwd("C:/Users/n-shah/OneDrive - Texas A&M University/Semester 3/STAT 654 Stats")


data <- read.table("TreeAgeDiamSugarMaple-1.txt", header = TRUE, sep = "")
x=data$Diamet
y=data$Age
lm1=lm(y~poly(x,1,raw=TRUE), data=data)
lm2=lm(y~poly(x,2,raw=TRUE), data=data)
lm3=lm(y~poly(x,3,raw=TRUE), data=data)
lm4=lm(y~poly(x,4,raw=TRUE), data=data)
lm5=lm(y~poly(x,5,raw=TRUE), data=data)
lm6=lm(y~poly(x,6,raw=TRUE), data=data)
lm7=lm(y~poly(x,7,raw=TRUE), data=data)
lm8=lm(y~poly(x,8,raw=TRUE), data=data)
AIC <- c(AIC(lm1), AIC(lm2), AIC(lm3), AIC(lm4), AIC(lm5), AIC(lm6), AIC(lm7), AIC(lm8))
BIC <- c(BIC(lm1), BIC(lm2), BIC(lm3), BIC(lm4), BIC(lm5), BIC(lm6), BIC(lm7), BIC(lm8))
AIC
BIC
AIC
[1] 239.5899 230.0744 231.8443 233.4604 235.2434 237.2078
[7] 234.6909 234.6980
BIC
[1] 243.4774 235.2577 238.3235 241.2355 244.3142 247.5745
[7] 246.3534 247.6563

which.min(AIC)
[1] 2

which.min(BIC)
[1] 2

Both AIC (230.0744) and BIC (235.2577) are smallest for the second-order polynomial, so that model is used
for further analysis.
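The same comparison can be written more compactly by fitting the eight models in a loop. The sketch below is an illustration on simulated data (the data frame d and the simulating coefficients are assumptions). It uses orthogonal polynomials, poly(x, k) without raw = TRUE, which span the same column space as the raw powers and so should give the same AIC and BIC while being numerically more stable at high degree.

```r
# Simulated quadratic data, for illustration only (not the sugar maple data)
set.seed(4)
d <- data.frame(x = runif(40, 0, 10))
d$y <- 5 + 3 * d$x - 0.2 * d$x^2 + rnorm(40, sd = 2)

# Fit polynomial models of degree 1 to 8
degrees <- 1:8
fits <- lapply(degrees, function(k) lm(y ~ poly(x, k), data = d))

# Collect the information criteria in one table
aic <- sapply(fits, AIC)
bic <- sapply(fits, BIC)
ic_table <- data.frame(degree = degrees, AIC = aic, BIC = bic)
print(ic_table)

# Degree preferred by each criterion
best_aic <- degrees[which.min(aic)]
best_bic <- degrees[which.min(bic)]
```

Storing the criteria under lowercase names such as aic and bic also avoids reusing the names of the AIC() and BIC() functions for data objects.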

Question E2
summary(lm2)
Call:
lm(formula = y ~ poly(x, 2, raw = TRUE), data = data)

Residuals:
Min 1Q Median 3Q Max
-25.451 -10.027 -1.046 8.201 31.469

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.4693224 5.7627144 1.123 0.27271
poly(x, 2, raw = TRUE)1 0.4545286 0.0658670 6.901 3.89e-07 ***
poly(x, 2, raw = TRUE)2 -0.0004106 0.0001149 -3.573 0.00154 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.68 on 24 degrees of freedom


Multiple R-squared: 0.9072, Adjusted R-squared: 0.8995
F-statistic: 117.4 on 2 and 24 DF, p-value: 4.062e-13

plot(lm2$fitted,lm2$res,main="",xlab="fitted",ylab="residuals",pch=19)

abline(h=mean(lm2$res),col="red")

The p-value of the overall F-test is 4.062e-13.
Question E4

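# lm2 was fit on the vector x (diameter), so predict() looks for a column
# named x in newdata. `new` has no such column, so predict() falls back to
# the 27 observed x values and returns one prediction interval per tree;
# data.frame(x = d) would give a single prediction at diameter d.
new <- data.frame(Age=110)

prediction <- predict(lm2, newdata=new, interval="predict", level=0.95)

prediction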

          fit        lwr       upr
1    7.765916 -26.614207  42.14604
2    8.411693 -25.920323  42.74371
3   10.334476 -23.860730  44.52968
4   12.880690 -21.148054  46.90943
5   13.508629 -20.481620  47.49888
6   14.139317 -19.813281  48.09191
7   15.395654 -18.484949  49.27626
8   29.935475  -3.391954  63.26290
9   32.262878  -1.021172  65.54693
10  43.505908  10.277105  76.73471
11  47.816181  14.548155  81.08421
12  48.877911  15.595732  82.16009
13  49.929246  16.631410  83.22708
14  50.977566  17.662541  84.29259
15  53.054053  19.700554  86.40755
16  74.100469  40.126265 108.07467
17  74.541587  40.551983 108.53119
18  95.148398  60.531815 129.76498
19  95.500596  60.876930 130.12426
20 108.199612  73.486420 142.91280
21 122.220881  87.938708 156.50305
22 130.676961  96.543286 164.81064
23 132.241131  96.559324 167.92294
24 132.058646  97.289721 166.82757
25 132.083716  97.279233 166.88820
26 132.233644  97.063349 167.40394
27 132.198768  97.159801 167.23773
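For a prediction interval at a single new value, the column in newdata has to carry the same name as the predictor used in the model formula. The sketch below shows this on simulated data; the data, the seed, and the choice of 110 as a diameter value are assumptions made only for illustration.

```r
# Simulated diameter/age data, for illustration only
set.seed(5)
x <- runif(27, 0, 550)
y <- 6 + 0.45 * x - 0.0004 * x^2 + rnorm(27, sd = 15)

fit2 <- lm(y ~ poly(x, 2, raw = TRUE))

# newdata must contain a column named x, matching the formula
new <- data.frame(x = 110)
pred <- predict(fit2, newdata = new, interval = "prediction", level = 0.95)
print(pred)  # one row with columns fit, lwr, upr
```

With the column named correctly, predict() returns exactly one row, the point prediction and its 95% prediction limits at the requested value.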
