Lab 5
Thulasi-2348152
2023-12-09
OBJECTIVE:
To fit a suitable linear regression model; to construct a normal probability plot of the
residuals and check whether there is any problem with the normality and constant-variance
assumptions; to construct and interpret a plot of the residuals and determine whether the
residuals are correlated; and to find outliers in the data and a way to handle them.
INTRODUCTION:
Multicollinearity exists whenever an independent variable is highly correlated with one or
more of the other independent variables in a multiple regression equation. An outlier is an
observation that lies an abnormal distance from other values in a random sample from a
population. We will analyze the data and see how we can handle the outliers.
library(readxl)
inventer <- read_excel("C:/Users/Admin/Downloads/Inverter data.xlsx")
View(inventer)
model <- lm(inventer$y ~ inventer$x1 + inventer$x2 + inventer$x3 + inventer$x4 + inventer$x5)
summary(model)
##
## Call:
## lm(formula = inventer$y ~ inventer$x1 + inventer$x2 + inventer$x3 +
## inventer$x4 + inventer$x5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2915 -1.0794 -0.5519 1.2685 3.5009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.85473 1.86922 1.527 0.1432
## inventer$x1 -0.29047 0.11742 -2.474 0.0230 *
## inventer$x2 0.20572 0.07506 2.741 0.0130 *
## inventer$x3 0.45444 0.18768 2.421 0.0256 *
## inventer$x4 -0.59419 0.21253 -2.796 0.0115 *
## inventer$x5 0.00464 0.01817 0.255 0.8012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.196 on 19 degrees of freedom
## Multiple R-squared: 0.5584, Adjusted R-squared: 0.4422
## F-statistic: 4.805 on 5 and 19 DF, p-value: 0.005239
Here the R^2 value is 0.5584, suggesting that the model explains a moderate amount of
the variance in the data. The overall p-value is 0.005239, and all predictors except
inventer$x5 are individually significant in predicting the response variable.
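As a cross-check, R^2 can be recomputed by hand as 1 - SSE/SST from the model residuals; a minimal sketch:
#R^2 = 1 - SSE/SST, computed directly from the residuals
sse <- sum(resid(model)^2)
sst <- sum((inventer$y - mean(inventer$y))^2)
1 - sse / sst   #should reproduce the Multiple R-squared of 0.5584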
#Normal probability plot of the residuals
qqnorm(resid(model))
qqline(resid(model))
res <- resid(model)
#To test normality using the Shapiro-Wilk test
shapiro.test(rstandard(model))
##
## Shapiro-Wilk normality test
##
## data: rstandard(model)
## W = 0.91451, p-value = 0.03847
The normal probability plot shows deviations from the reference line, and from the
Shapiro-Wilk test the p-value is 0.038, which rejects the null hypothesis of normality;
hence the normality assumption does not hold.
#To test constant variance (homoscedasticity)
library(lmtest)
##
## Attaching package: 'zoo'
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 14.072, df = 5, p-value = 0.01516
The p-value of the Breusch-Pagan test is 0.015, which is less than 0.05. Hence we reject the
null hypothesis at the 0.05 significance level; there is evidence of heteroscedasticity.
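The studentized (Koenker) form of the Breusch-Pagan statistic can be reproduced by hand: regress the squared residuals on the predictors and take n times the R^2 of that auxiliary regression; a minimal sketch:
#Auxiliary regression of squared residuals on the predictors
aux <- lm(res^2 ~ inventer$x1 + inventer$x2 + inventer$x3 + inventer$x4 + inventer$x5)
length(res) * summary(aux)$r.squared   #should match BP = 14.072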
#To find the fitted values
fit <- fitted.values(model)
fit
##          1          2          3          4          5          6          7
##  2.1812243  5.5844677  2.3791193 -2.7968027  1.7266932  3.3656279  1.4620291
##          8          9         10         11         12         13         14
##  5.6060986  6.9425257  2.0506003  3.0384708  0.8448812  3.0303516  5.7139819
##         15         16         17         18         19         20         21
##  0.6348032  1.7663056  2.0435146  2.8485294 -0.9096868 -0.7936971  0.7016434
##         22         23         24         25
##  3.6975241  3.0440700  3.3092863  1.8304384
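The conclusion refers to a residuals-versus-fitted-values plot; a minimal sketch of that plot using the values above:
#Plot residuals against fitted values to inspect the constant-variance assumption
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0)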
#To check whether the residuals are correlated using the ACF (autocorrelation function)
acf(resid(model))
There is no significant correlation between the residuals, since all the lags lie within the
threshold lines.
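As a complementary check, the Durbin-Watson test from the already-loaded lmtest package tests for first-order autocorrelation of the residuals; a minimal sketch:
#Durbin-Watson test; the null hypothesis is that the residuals are not autocorrelated
dwtest(model)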
#To find the variance inflation factor
library(car)
vif(model)
Since all the predictors have a low VIF of approximately 2 (less than 5), there is no
multicollinearity among them.
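For reference, the VIF of a single predictor can be reproduced by hand: regress that predictor on the others and take 1/(1 - R^2); a minimal sketch for x1 (the remaining predictors follow the same pattern):
#VIF of x1 via an auxiliary regression on the remaining predictors
aux_x1 <- lm(x1 ~ x2 + x3 + x4 + x5, data = inventer)
1 / (1 - summary(aux_x1)$r.squared)   #should match the first entry of vif(model)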
#Checking for outliers
res1 <- rstandard(model)
res1
## 1 2 3 4 5 6
## -0.75521781 -3.29890047 -0.33472058 2.11852178 -0.46081838 0.68417952
## 7 8 9 10 11 12
## -0.41108251 1.96135788 1.36586687 -0.39905377 0.74517729 -0.27006283
## 13 14 15 16 17 18
## -0.37810817 1.74654110 0.03007476 -0.68093441 -0.77999639 0.24478579
## 19 20 21 22 23 24
## 0.58836599 0.54315134 -0.18405269 0.66681923 -0.82171407 -0.93401389
## 25
## -0.58855319
#The 2nd standardized residual exceeds 3 in absolute value, so it is an outlier. We can
#remove that observation from the data.
outliers<-which(abs(res1)>3)
outliers
## 2
## 2
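The flagged row is dropped and the model refitted on the reduced data; a minimal sketch consistent with the output below (Inverter1 appears in the Call; the name model1 is assumed):
#Remove the outlier row and refit the model on the reduced data
Inverter1 <- inventer[-outliers, ]
Inverter1
model1 <- lm(y ~ ., data = Inverter1)
summary(model1)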
## # A tibble: 24 × 6
## x1 x2 x3 x4 x5 y
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 3 3 3 0 0.787
## 2 3 6 6 6 0 1.71
## 3 4 4 4 12 0 0.203
## 4 8 7 6 5 0 0.806
## 5 10 20 5 5 0 4.71
## 6 8 6 3 3 25 0.607
## 7 6 24 4 4 25 9.11
## 8 4 10 12 4 25 9.21
## 9 16 12 8 4 25 1.36
## 10 3 10 8 8 25 4.55
## # ℹ 14 more rows
##
## Call:
## lm(formula = y ~ ., data = Inverter1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6019 -0.8676 -0.5544 0.6388 3.3297
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.125843 1.303651 0.864 0.39917
## x1 -0.306737 0.078924 -3.886 0.00108 **
## x2 0.383627 0.062069 6.181 7.8e-06 ***
## x3 0.448299 0.126041 3.557 0.00225 **
## x4 -0.445042 0.145916 -3.050 0.00689 **
## x5 0.006228 0.012209 0.510 0.61619
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.474 on 18 degrees of freedom
## Multiple R-squared: 0.8072, Adjusted R-squared: 0.7536
## F-statistic: 15.07 on 5 and 18 DF, p-value: 6.837e-06
Here the R^2 value is 0.8072, suggesting that the refitted model explains a substantial
amount of the variance in the data. The overall p-value is 6.837e-06, and all predictors
except x5 are individually significant in predicting the response variable. Since removing
this observation changes the fitted coefficients noticeably, the outlier is an influential
point that affects the model's predictions.
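The influence claim can be checked with standard diagnostics on the original model; a minimal sketch, using common rule-of-thumb cutoffs (conventions, not fixed rules):
#Cook's distance and leverage values for the original model
cd <- cooks.distance(model)
lev <- hatvalues(model)
which(cd > 4 / length(cd))      #observations with large influence (rule of thumb)
which(lev > 2 * mean(lev))      #observations with high leverage (rule of thumb)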
CONCLUSION:
For the given data, a suitable regression model is fitted. The residual vs fitted-value plot
is constructed. Autocorrelation among the residuals is checked using the ACF plot.
Multicollinearity is checked using vif(model). The presence of outliers is detected using
standardized residuals. The observation whose standardized residual exceeds 3 in absolute
value (an outlier) is removed, and further analysis is done.