Deliverytime 3
Goal: predict the amount of time required by the route driver to service the vending machines in an outlet.
Response: delivery time (y).
Predictors: number of cases of product stocked (x1) and distance walked by the route driver (x2).
library(readxl)
DeliveryTime <- read_excel("DeliveryTime.xlsx")
colnames(DeliveryTime)<-c("Time","NumberofCases", "Distance")
summary(DeliveryTime)
pairs(DeliveryTime)
[Figure: scatterplot matrix (pairs plot) of Time, NumberofCases, and Distance]
The scatter plots of y vs x1 and y vs x2 both show linear relationships. In addition, the x1 vs x2 plot also shows a
linear relationship, which indicates multicollinearity between the two predictors.
If there is only one (or a few) dominant regressor, or if the regressors operate nearly independently, the matrix
of scatterplots is most useful. However, when several important regressors are themselves interrelated, then
these scatter diagrams can be very misleading.
cor(DeliveryTime)
cor.test(as.numeric(DeliveryTime$Distance),DeliveryTime$Time)
##
## Pearson’s product-moment correlation
##
## data: as.numeric(DeliveryTime$Distance) and DeliveryTime$Time
## t = 9.4465, df = 23, p-value = 2.214e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7666503 0.9515461
## sample estimates:
## cor
## 0.8916701
cor.test(DeliveryTime$NumberofCases,DeliveryTime$Time)
##
## Pearson’s product-moment correlation
##
## data: DeliveryTime$NumberofCases and DeliveryTime$Time
## t = 17.546, df = 23, p-value = 8.22e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9202275 0.9845031
## sample estimates:
## cor
## 0.9646146
Both predictor variables have significant and strong positive linear correlations with the response.
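The t statistics reported by cor.test can be checked by hand, since under the null hypothesis of zero correlation t = r sqrt(n - 2) / sqrt(1 - r^2) follows a t distribution with n - 2 degrees of freedom. The sketch below is illustrative only: it hard-codes the correlations printed above, takes n = 25 from the reported df = 23, and uses a helper name (cor_t) that is not part of the original analysis.

```r
# Sample size implied by the reported df = n - 2 = 23
n <- 25

# Sample correlations copied from the cor.test output above
r_cases    <- 0.9646146
r_distance <- 0.8916701

# t statistic for H0: true correlation equals 0 (hypothetical helper)
cor_t <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

t_cases    <- cor_t(r_cases, n)
t_distance <- cor_t(r_distance, n)

# Two-sided p-values from the t distribution with n - 2 df
p_cases    <- 2 * pt(abs(t_cases), df = n - 2, lower.tail = FALSE)
p_distance <- 2 * pt(abs(t_distance), df = n - 2, lower.tail = FALSE)

print(round(c(cases = t_cases, distance = t_distance), 3))
print(signif(c(cases = p_cases, distance = p_distance), 3))
```

Both statistics should agree with the cor.test output up to rounding of the printed correlations.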
Obtain the 3D visual as below.
Del_lm2<-lm(Time~NumberofCases+Distance,data=DeliveryTime)
[Figure: 3D scatter plot of the delivery time data; labeled axes: No. of Cases and Distance]
Modeling
Form of the linear model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$
colnames(DeliveryTime)
First, fit a simple linear regression model using NumberofCases only.
Del_lm1<-lm(Time~NumberofCases,data=DeliveryTime)
Now add the second variable, Distance, to obtain the multiple linear regression model.
Del_lm2<-lm(Time~NumberofCases+Distance,data=DeliveryTime)
anova(Del_lm2)
The ANOVA table in the R output shows a separate (sequential) sum of squares for each predictor, but the
overall test needs the total regression sum of squares. It can be obtained as follows.
Null hypothesis: all the regression coefficients are zero, $\beta_1 = \beta_2 = 0$.
Alternative hypothesis: at least one regression coefficient ($\beta_1$ or $\beta_2$) is nonzero.
SSR = 5382.4 + 168.4 = 5550.8 with df = 2
SSRes = 233.7 with df = 25 - 2 - 1 = 22
F statistic = (5550.8/2)/(233.7/22) = 261.27
Critical value of F: F(2, 22) at alpha = 0.05, which is 3.44
qf(0.95,2,22)
## [1] 3.443357
The F statistic is greater than the critical value; therefore, we reject the null hypothesis and conclude that at
least one regression coefficient is nonzero, so the regression is significant.
$H_0: \beta_1 = \beta_2 = 0$ is rejected based on the F-test since the p-value is approximately 0.
Since the P value of the F statistic is very small, we conclude that delivery time is related to delivery volume
and/or distance. However, this does not necessarily imply that the relationship found is an appropriate
one for predicting delivery time as a function of volume and distance. Further tests of model adequacy are
required.
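The overall F-test can be reproduced from the sums of squares alone. The following is a minimal sketch, assuming the rounded values 5382.4, 168.4, and 233.7 from the ANOVA table and n = 25 observations; the variable names are chosen here for illustration.

```r
# Sequential sums of squares and residual sum of squares (rounded, from the ANOVA table)
ss_cases <- 5382.4
ss_dist  <- 168.4
ss_res   <- 233.7

n <- 25  # number of observations
k <- 2   # number of regressors

# Regression sum of squares for the full model
ss_reg <- ss_cases + ss_dist

# Overall F statistic: regression mean square over residual mean square
f_stat <- (ss_reg / k) / (ss_res / (n - k - 1))

# Critical value at alpha = 0.05 and the upper-tail p-value
f_crit  <- qf(0.95, df1 = k, df2 = n - k - 1)
p_value <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

print(c(SSR = ss_reg, F = f_stat, critical = f_crit))
print(p_value)
```

Because the inputs are rounded, the statistic may differ slightly in the last digits from the value R prints for the fitted model.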
summary(Del_lm2)
##
## Call:
## lm(formula = Time ~ NumberofCases + Distance, data = DeliveryTime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
The overall F-statistic also appears at the bottom of the model summary.
Further, $R^2$ for the multiple regression model fitted to the delivery time data is $R^2 = 0.96$, or 96.0%.
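As a check on the reported value, $R^2$ can be recovered from the sums of squares in the ANOVA table for the two-predictor model (5382.4 and 168.4 for the regressors, 233.7 for the residuals). The symbols $SS_R$, $SS_{Res}$, and $SS_T$ are notation introduced here for illustration:

```latex
\begin{aligned}
SS_T &= SS_R + SS_{Res} = (5382.4 + 168.4) + 233.7 = 5784.5, \\
R^2  &= \frac{SS_R}{SS_T} = \frac{5550.8}{5784.5} \approx 0.9596 .
\end{aligned}
```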
Compare the R squared for single predictor and multiple predictor models.
summary(Del_lm1)
##
## Call:
## lm(formula = Time ~ NumberofCases, data = DeliveryTime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5811 -1.8739 -0.3493 2.1807 10.6342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.321 1.371 2.422 0.0237 *
## NumberofCases 2.176 0.124 17.546 8.22e-15 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 4.181 on 23 degrees of freedom
## Multiple R-squared: 0.9305, Adjusted R-squared: 0.9275
## F-statistic: 307.8 on 1 and 23 DF, p-value: 8.22e-15
$R^2_{Adj} = 0.956$ (95.6%) for the two-variable model, while for the simple linear regression model with only
x1 (cases), $R^2_{Adj} = 0.928$ (92.8%). Therefore, we would conclude that adding x2 (distance) to the model
did result in a meaningful reduction in the residual (unexplained) variability. This implies that having both
distance and number of cases in the model is better than having only number of cases.
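The adjusted values can be recomputed from the sums of squares, which makes the penalty for the extra regressor explicit. This is a rough sketch under the assumption that the rounded ANOVA values (5382.4, 168.4, 233.7) and n = 25 are used; adj_r2 is a hypothetical helper, not a function from the original analysis.

```r
n <- 25

# Rounded sums of squares from the sequential ANOVA table
ss_cases    <- 5382.4
ss_dist     <- 168.4
ss_res_full <- 233.7
ss_total    <- ss_cases + ss_dist + ss_res_full

# Adjusted R-squared: 1 - (SSRes / (n - k - 1)) / (SST / (n - 1))
adj_r2 <- function(ss_res, ss_total, n, k) {
  1 - (ss_res / (n - k - 1)) / (ss_total / (n - 1))
}

# Two-predictor model (cases and distance)
adj_full <- adj_r2(ss_res_full, ss_total, n, k = 2)

# Cases-only model: the distance sum of squares returns to the residual
adj_cases <- adj_r2(ss_res_full + ss_dist, ss_total, n, k = 1)

print(round(c(two_predictors = adj_full, cases_only = adj_cases), 4))
```

The two results should match the Adjusted R-squared lines in summary(Del_lm2) and summary(Del_lm1) up to rounding.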
Also comment on the significance of single predictors using the t-test given in the summary.
$H_0: \beta_2 = 0$ is rejected based on the t-test since the p-value is 0.000631. Hence, we conclude that the regressor x2
(distance) contributes significantly to the model given that x1 (cases) is also in the model.
Here, t statistic = 0.014385/0.003613 = 3.98, and the critical value is t(n-k-1) at 0.05/2, which is 2.073873.
qt(0.975,22)
## [1] 2.073873
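The same test can be carried out numerically from the coefficient table. A minimal sketch, assuming the rounded estimate and standard error for Distance printed in summary(Del_lm2) and 22 residual degrees of freedom; the variable names are illustrative.

```r
# Estimate and standard error for Distance (rounded, from the summary table)
estimate  <- 0.014385
std_error <- 0.003613
df_res    <- 22

# t statistic for H0: beta_2 = 0
t_stat <- estimate / std_error

# Two-sided critical value at alpha = 0.05 and the two-sided p-value
t_crit  <- qt(0.975, df = df_res)
p_value <- 2 * pt(abs(t_stat), df = df_res, lower.tail = FALSE)

print(c(t = t_stat, critical = t_crit, p = p_value))
```

Since the t statistic exceeds the critical value, the decision agrees with the p-value reported in the summary.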
#plot(Del_lm2)
confint(Del_lm2)
## 2.5 % 97.5 %
## (Intercept) 0.066751987 4.61571030
## NumberofCases 1.261824662 1.96998976
## Distance 0.006891745 0.02187791
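Each interval above has the form estimate ± t(0.975, 22) × standard error. The sketch below rebuilds the table from the rounded values in summary(Del_lm2), so it needs no access to the data; the object names est, se, and ci are assumptions made for this illustration.

```r
# Estimates and standard errors copied from summary(Del_lm2)
est <- c("(Intercept)" = 2.341231, NumberofCases = 1.615907, Distance = 0.014385)
se  <- c("(Intercept)" = 1.096730, NumberofCases = 0.170735, Distance = 0.003613)

# Two-sided 95% critical value with 22 residual degrees of freedom
t_crit <- qt(0.975, df = 22)

# Lower and upper confidence limits for each coefficient
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci)
```

None of the intervals contains zero, which matches the significance of the individual t-tests.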
The partial F-test (extra sum of squares) measures the contribution of xj as if it were the last variable added to the model.
anova(Del_lm1)
anova(Del_lm2)
## Analysis of Variance Table
##
## Response: Time
## Df Sum Sq Mean Sq F value Pr(>F)
## NumberofCases 1 5382.4 5382.4 506.619 < 2.2e-16 ***
## Distance 1 168.4 168.4 15.851 0.0006312 ***
## Residuals 22 233.7 10.6
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
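For a single added regressor, the partial F statistic is the sequential sum of squares for that regressor divided by the residual mean square of the larger model, and it equals the square of the regressor's t statistic. The sketch below checks this with the rounded values printed in the table above and in summary(Del_lm2); the variable names are illustrative, and small discrepancies come from rounding.

```r
# Extra sum of squares for Distance given NumberofCases (from the ANOVA table)
ss_dist_given_cases <- 168.4

# Residual mean square of the two-predictor model
ms_res <- 233.7 / 22

# Partial F statistic with 1 and 22 degrees of freedom
f_partial <- (ss_dist_given_cases / 1) / ms_res
p_partial <- pf(f_partial, df1 = 1, df2 = 22, lower.tail = FALSE)

# t statistic for Distance from the coefficient table
t_dist <- 0.014385 / 0.003613

print(c(F_partial = f_partial, t_squared = t_dist^2, p = p_partial))
```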
anova(Del_lm1,Del_lm2)
## 2.5 % 97.5 %
## (Intercept) 0.066751987 4.61571030
## NumberofCases 1.261824662 1.96998976
## Distance 0.006891745 0.02187791
## 0.833 % 99.167 %
## (Intercept) -0.500628740 5.1830910
## NumberofCases 1.173496917 2.0583175
## Distance 0.005022556 0.0237471
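The column labels 0.833 % and 99.167 % in the second table are consistent with intervals at level 1 - 0.05/3, that is, a Bonferroni adjustment giving the three coefficients joint coverage of at least 95%. That reading is an assumption, since the call that produced the table is not shown. The sketch below rebuilds such intervals from the rounded estimates and standard errors in summary(Del_lm2); with the fitted model available, confint(Del_lm2, level = 1 - 0.05/3) would be the more direct route.

```r
# Estimates and standard errors copied from summary(Del_lm2)
est <- c("(Intercept)" = 2.341231, NumberofCases = 1.615907, Distance = 0.014385)
se  <- c("(Intercept)" = 1.096730, NumberofCases = 0.170735, Distance = 0.003613)

alpha <- 0.05
m     <- length(est)  # number of intervals covered jointly

# Bonferroni: split alpha across m two-sided intervals
t_bonf <- qt(1 - alpha / (2 * m), df = 22)

ci_bonf <- cbind(lower = est - t_bonf * se, upper = est + t_bonf * se)
print(ci_bonf)
```

The adjusted intervals are wider than the individual 95% intervals, which is the price paid for the joint coverage guarantee.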
Prediction interval for the new observation