R Class 21
> fit<-lm(d1$Failure.Time~d1$Material.Type+d1$Rainfall)
> fit
Call:
lm(formula = d1$Failure.Time ~ d1$Material.Type + d1$Rainfall)
Coefficients:
(Intercept) d1$Material.Type d1$Rainfall
0.081086 -0.014499 0.005591
> summary(fit)
Call:
lm(formula = d1$Failure.Time ~ d1$Material.Type + d1$Rainfall)
Residuals:
Min 1Q Median 3Q Max
-0.051313 -0.006694 0.002246 0.008590 0.025763
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.081086 0.012659 6.406 6.52e-06 *** pvalue not greater than 0.05
d1$Material.Type -0.014499 0.003575 -4.055 0.000822 *** pvalue not greater than 0.05
d1$Rainfall 0.005591 0.046745 0.120 0.906200 pvalue greater than 0.05
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
a) From analysing the R output, we see that the fitted linear model is:
Failure time = 0.081086 - 0.014499 × (material type) + 0.005591 × (rainfall)
FUTURE TRACK Edutech Pvt Ltd | 52, First floor Mall Road, Kingsway camp, Delhi-09|+9910024949, 011 -45024949.
b) The R output shows that the ‘material type’ parameter is significantly different from zero at the 0.1% level [1]
(its p-value, 0.000822, is less than 0.05, so H0 is rejected),
but the ‘rainfall’ parameter is not significantly different from zero (its p-value, 0.9062, is greater than 0.05, so H0 is not
rejected).
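The same decision can be read off the coefficient table programmatically. The following is only a minimal sketch on simulated data: the data frame `d_sim`, its values and the object names are invented for illustration and are not the course data set.

```r
# Simulated stand-in for the failure-time data (hypothetical values).
set.seed(1)
d_sim <- data.frame(
  Material.Type = rep(1:4, each = 5),
  Rainfall      = runif(20, 0.1, 0.4)
)
d_sim$Failure.Time <- 0.08 - 0.015 * d_sim$Material.Type + rnorm(20, sd = 0.01)

fit_sim <- lm(Failure.Time ~ Material.Type + Rainfall, data = d_sim)

# The "Pr(>|t|)" column of the coefficient table holds the p-values.
pvals <- summary(fit_sim)$coefficients[, "Pr(>|t|)"]

# TRUE means H0: beta = 0 is rejected at the 5% level.
reject_h0 <- pvals < 0.05
print(round(pvals, 4))
print(reject_h0)
```

Writing the formula with `data = d_sim` rather than `d_sim$...` keeps the coefficient names short and lets `predict()` be used with new data later.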
(iii) (a)
# plot the residuals
> plot(residuals(fit))
(b) The residuals exhibit a fairly random scatter around zero (independent) apart from the 6th point.
(b) The residuals plot indicates that data point 6 is an outlier.
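The construction of `d1new` is not shown in this transcript. One plausible way to build such a data set is to drop the outlying row by negative indexing and refit. The sketch below uses simulated data with an outlier planted at row 6; all names and values are hypothetical.

```r
# Simulated data with an artificial outlier at observation 6 (hypothetical values).
set.seed(2)
d_sim <- data.frame(
  Material.Type = rep(1:4, each = 5),
  Rainfall      = runif(20, 0.1, 0.4)
)
d_sim$Failure.Time <- 0.08 - 0.015 * d_sim$Material.Type + rnorm(20, sd = 0.01)
d_sim$Failure.Time[6] <- d_sim$Failure.Time[6] - 0.1   # plant the outlier

fit_sim <- lm(Failure.Time ~ Material.Type + Rainfall, data = d_sim)

# Index of the observation with the largest absolute residual.
worst <- which.max(abs(residuals(fit_sim)))

# Negative indexing drops that row; the model is then refitted without it.
d_sim_new <- d_sim[-worst, ]
refit_sim <- lm(Failure.Time ~ Material.Type + Rainfall, data = d_sim_new)
print(worst)
print(coef(refit_sim))
```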
refit<-lm(d1new$Failure.Time~d1new$Material.Type+d1new$Rainfall);refit
Call:
lm(formula = d1new$Failure.Time ~ d1new$Material.Type + d1new$Rainfall)
Coefficients:
(Intercept) d1new$Material.Type d1new$Rainfall
0.086629 -0.015576 0.005152
summary(refit)
Call:
lm(formula = d1new$Failure.Time ~ d1new$Material.Type + d1new$Rainfall)
Residuals:
Min 1Q Median 3Q Max
-0.0144060 -0.0082529 -0.0003768 0.0071557 0.0213878
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.086629 0.008094 10.703 1.06e-08 ***
d1new$Material.Type -0.015576 0.002275 -6.846 3.93e-06 ***
d1new$Rainfall 0.005152 0.029621 0.174 0.864
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Coefficients:
(Intercept) d1new$Material.Type d1new$Rainfall
6.1921 6.7286 0.4814
coef(gfit)
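The fitted model is 1/μ = 6.1921 + 6.7286 x1 + 0.4814 x2 (R's default link for the Gamma family is the inverse link),
where μ is the mean failure time, x1 is the ‘material type’ variable, and x2 is the ‘rainfall’ variable.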
> summary(gfit)
Call:
glm(formula = d1new$Failure.Time ~ d1new$Material.Type + d1new$Rainfall,
family = Gamma)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.26231 -0.14156 -0.03338 0.12850 0.25185
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.1921 2.6184 2.365 0.031 *
d1new$Material.Type 6.7286 0.8359 8.050 5.12e-07 ***
d1new$Rainfall 0.4814 10.6244 0.045 0.964
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Reviewing the model fit output from R, the ‘rainfall’ parameter is not significantly different to zero, whereas the
‘material type’ parameter is significant at the 0.1% level.
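Because `family = Gamma` uses the inverse link by default, the coefficients act on 1/mean rather than on the mean itself. The sketch below illustrates this on simulated data; `x1`, `y` and `gfit_sim` are hypothetical names and the numbers are invented.

```r
# Simulated Gamma responses whose reciprocal mean is linear in x1 (hypothetical values).
set.seed(3)
x1 <- rep(1:4, each = 10)
mu <- 1 / (6 + 7 * x1)                       # inverse link: 1/mu = 6 + 7 * x1
y  <- rgamma(40, shape = 20, rate = 20 / mu)  # Gamma with mean mu

gfit_sim <- glm(y ~ x1, family = Gamma)      # default link is "inverse"
b <- coef(gfit_sim)

# Fitted mean at x1 = 2, computed by hand and via predict(); the two should agree.
by_hand <- 1 / (b[["(Intercept)"]] + b[["x1"]] * 2)
by_pred <- predict(gfit_sim, newdata = data.frame(x1 = 2), type = "response")
print(c(by_hand = by_hand, by_predict = unname(by_pred)))
```

Under the inverse link a positive coefficient lowers the fitted mean, which is consistent with the negative ‘material type’ coefficient in the linear model.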
SOLUTION 2
> # Solution 2
> ac<-read.csv(file.choose(),header=T);ac
STATE CLASS GENDER AGE PAID
1 STATE 01 C6 M 43 2364.696
2 STATE 01 F6 M 43 18787.967
3 STATE 01 F6 M 43 27115.745
4 STATE 02 C1 M 43 15288.492
5 STATE 02 C11 M 43 2265.707
(There is no need to copy the complete data set if it is large.)
> reg<-lm(ac$PAID~ac$STATE+ac$CLASS+ac$GENDER+ac$AGE)
> reg
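(Alternative: reg<-lm(PAID~.,data=ac) # the dot means all other columns of ac are used as covariates; the response must be written as PAID, not ac$PAID, otherwise PAID itself is also picked up as a covariate)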
Call:
lm(formula = ac$PAID ~ ac$STATE + ac$CLASS + ac$GENDER + ac$AGE)
Coefficients:
(Intercept) ac$STATESTATE 02 ac$STATESTATE 03 ac$STATESTATE 04 ac$STATESTATE 06
19818.12 -2306.41 -580.09 -689.08 440.79
ac$STATESTATE 07 ac$STATESTATE 10 ac$STATESTATE 12 ac$STATESTATE 14 ac$STATESTATE 15
-1254.29 2275.25 -752.99 -404.69 -4791.86
ac$STATESTATE 17 ac$CLASSC11 ac$CLASSC6 ac$CLASSF6 ac$GENDERM
-883.67 -11743.95 -14833.37 -225.16 -1193.01
ac$AGE
15.76
> summary(reg)
Call:
lm(formula = ac$PAID ~ ac$STATE + ac$CLASS + ac$GENDER + ac$AGE)
Residuals:
Min 1Q Median 3Q Max
-10462 -2276 119 1611 36377
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19818.12 1391.58 14.242 < 2e-16 ***
ac$STATESTATE 02 -2306.41 658.69 -3.502 0.000477 ***
ac$STATESTATE 03 -580.09 761.36 -0.762 0.446242
ac$STATESTATE 04 -689.08 702.04 -0.982 0.326495
ac$STATESTATE 06 440.79 752.27 0.586 0.558010
ac$STATESTATE 07 -1254.29 837.22 -1.498 0.134318
ac$STATESTATE 10 2275.25 885.44 2.570 0.010284 *
ac$STATESTATE 12 -752.99 850.10 -0.886 0.375897
ac$STATESTATE 14 -404.69 842.90 -0.480 0.631216
ac$STATESTATE 15 -4791.86 623.56 -7.685 2.87e-14 ***
ac$STATESTATE 17 -883.67 704.58 -1.254 0.209982
ac$CLASSC11 -11743.95 430.60 -27.274 < 2e-16 ***
ac$CLASSC6 -14833.37 410.84 -36.105 < 2e-16 ***
ac$CLASSF6 -225.16 517.68 -0.435 0.663670
ac$GENDERM -1193.01 215.50 -5.536 3.69e-08 ***
ac$AGE 15.76 24.10 0.654 0.513418
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Explanation
R-squared: 68.86% of the variation in the claims paid is explained by state, rating class, gender and age.
Adjusted R-squared: 68.52%. It adjusts for the number of terms in the model, so we use it to compare the
goodness of fit of regression models that contain differing numbers of independent variables.
p-value of the model: this is < 2.2e-16, which is less than 0.05, so the null hypothesis “There is no significant
linear relationship between the independent variables X and the dependent variable Y” is rejected at the 5% level of
significance. Using this model to predict the DV is better than simply using the expected value of the DV as a
predictor.
p-values of the coefficients: While the model is significant overall, some of the variables may be insignificant.
State 1, rating class C1 and gender female are taken as the base levels, and their effects are absorbed into the
intercept. We observe that the coefficients of State 2 and State 15 (negative) and State 10 (positive) are
significantly different from State 1 at the 5% significance level. Similarly, rating classes C11 and C6 have
significantly negative coefficients compared to C1, indicating that the claims paid for those two rating classes are
significantly lower than for C1. Males have significantly lower claims paid than females at the 5%
significance level.
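The quantities discussed above can be pulled directly out of the `summary()` object. This is a minimal sketch on simulated claims data: `ac_sim` and every value in it are invented for illustration.

```r
# Simulated claims data with two factors and one numeric covariate (hypothetical values).
set.seed(4)
n <- 200
ac_sim <- data.frame(
  CLASS  = factor(sample(c("C1", "C11", "C6", "F6"), n, replace = TRUE)),
  GENDER = factor(sample(c("F", "M"), n, replace = TRUE)),
  AGE    = sample(50:90, n, replace = TRUE)
)
ac_sim$PAID <- 20000 - 12000 * (ac_sim$CLASS == "C11") -
  15000 * (ac_sim$CLASS == "C6") - 1000 * (ac_sim$GENDER == "M") +
  rnorm(n, sd = 3000)

reg_sim <- lm(PAID ~ ., data = ac_sim)
s <- summary(reg_sim)

r2     <- s$r.squared       # multiple R-squared
adj_r2 <- s$adj.r.squared   # adjusted for the number of terms

# Overall F-test p-value: upper tail of F(numdf, dendf) at the observed statistic.
f <- s$fstatistic
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# The base level of each factor is its first level; its effect sits in the intercept.
base_levels <- sapply(ac_sim[c("CLASS", "GENDER")], function(v) levels(v)[1])
print(c(r2 = r2, adj_r2 = adj_r2, p_model = p_model))
print(base_levels)
```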
> anova(reg)
Analysis of Variance Table
Response: ac$PAID
Df Sum Sq Mean Sq F value Pr(>F)
ac$STATE 10 9.8027e+09 9.8027e+08 63.0629 < 2.2e-16 ***
ac$CLASS 3 3.7869e+10 1.2623e+10 812.0763 < 2.2e-16 ***
ac$GENDER 1 4.7271e+08 4.7271e+08 30.4106 4.158e-08 ***
ac$AGE 1 6.6423e+06 6.6423e+06 0.4273 0.5134
Residuals 1401 2.1778e+10 1.5544e+07
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
From the ANOVA table, we can infer that except Age, all other variables are significant in prediction of claims paid
(iii)
> summary(glmmodel)
Call:
glm(formula = indices$Sensex_direction ~ indices$BM + indices$CD +
indices$EN + indices$FM + indices$FI + indices$HC + indices$IN +
indices$IT + indices$TE + indices$UT, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.27544 -0.00117 0.00000 0.01354 1.75651
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0086 0.7315 -1.379 0.16796
indices$BM 7.7977 16.5255 0.472 0.63703
indices$CD -87.5335 42.6785 -2.051 0.04027 *
indices$EN 93.9675 38.3193 2.452 0.01420 *
indices$FM 41.1745 20.1436 2.044 0.04095 *
indices$FI 172.8807 60.8192 2.843 0.00448 **
indices$HC -6.4294 13.9394 -0.461 0.64463
indices$IN 4.1735 18.2152 0.229 0.81877
indices$IT 78.3494 30.9307 2.533 0.01131 *
indices$TE 29.9111 13.4184 2.229 0.02581 *
indices$UT -14.4767 23.0602 -0.628 0.53015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Sectors that significantly affect the direction of Sensex returns at the 5% significance level are CD, EN, FI,
FM, IT and TE. Only FI is also significant at the 1% level.
> # Plot of residuals vs. fitted values
> plot(reg$fitted.values,reg$residuals,col=c("blue","red"))
The plot is used to detect non-linearity, unequal error variances, and outliers.
The residuals do not "bounce randomly" around the 0 line. This suggests that the
assumption of a linear relationship is not reasonable.
The residuals do not form a "horizontal band" around the 0 line. This suggests that the
variances of the error terms are not equal, i.e. the errors exhibit heteroscedasticity.
A few residuals "stand out" from the basic random pattern of residuals. This suggests
that there are outliers.
# QQ Plot
> qqnorm(reg$residuals)
A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one
another. If both sets of quantiles come from the same distribution, the points should
form a roughly straight line. Here they do not, indicating that the residuals deviate
from normality. Thus linear regression may not be a good fit to the data.
#Reason for better model
(iv) Kurtosis is not in the course, but it has been asked in the IAI exam.
#Checking for the normality of Auto Claims Paid vs. Logarithm of Auto Claims Paid
#Writing Functions for Skewness and Kurtosis
skew<-function(x)mean((x-mean(x))^3)/sd(x)^3 [2]
kurt<-function(x)(mean((x-mean(x))^4)/sd(x)^4)-3 [2]
skew(AutoClaims$PAID) [0.5]
## [1] 2.619422
kurt(AutoClaims$PAID) [0.5]
## [1] 9.20876
skew(log(AutoClaims$PAID)) [0.5]
## [1] 0.4528057
kurt(log(AutoClaims$PAID)) [0.5]
## [1] -0.787689
Skewness and kurtosis of log(claims) are closer to zero than those of the actual
claims paid, thus indicating the possibility of using linear regression with this
dependent variable [1] [7]
anova(model2)
Analysis of Variance Table
Response: log(PAID)
Df Sum Sq Mean Sq F value Pr(>F)
STATE 10 246.26 24.626 107.6969 < 2.2e-16 ***
CLASS 3 690.09 230.031 1006.0090 < 2.2e-16 ***
GENDER 1 10.88 10.882 47.5908 7.927e-12 ***
AGE 1 0.12 0.123 0.5384 0.4632
Residuals 1401 320.35 0.229
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Call:
## lm(formula = log(PAID) ~ ., data = AutoClaims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.96098 -0.34264 -0.05047 0.36828 1.08237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.896308 0.168777 58.635 < 2e-16 ***
## STATESTATE 02 -0.154804 0.079889 -1.938 0.0529 .
## STATESTATE 03 0.110585 0.092342 1.198 0.2313
## STATESTATE 04 0.049554 0.085147 0.582 0.5607
## STATESTATE 06 0.116190 0.091239 1.273 0.2031
## STATESTATE 07 0.142721 0.101543 1.406 0.1601
## STATESTATE 10 0.098014 0.107391 0.913 0.3616
## STATESTATE 12 0.027982 0.103105 0.271 0.7861
## STATESTATE 14 0.090316 0.102231 0.883 0.3771
## STATESTATE 15 -0.645918 0.075628 -8.541 < 2e-16 ***
## STATESTATE 17 0.004611 0.085455 0.054 0.9570
## CLASSC11 -1.203098 0.052225 -23.037 < 2e-16 ***
## CLASSC6 -1.988743 0.049829 -39.911 < 2e-16 ***
## CLASSF6 -0.034909 0.062787 -0.556 0.5783
## GENDERM -0.180923 0.026136 -6.922 6.75e-12 ***
## AGE 0.002145 0.002923 0.734 0.4632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4782 on 1401 degrees of freedom
## Multiple R-squared: 0.7473, Adjusted R-squared: 0.7446
## F-statistic: 276.2 on 15 and 1401 DF, p-value: < 2.2e-16
Key Differences
1. R-Squared and Adjusted R-Squared improved and hence the model is a better
fit compared to the initial model [1.5]
2. While the significance level of a few factor coefficients relative to the
base categories changed, the overall significant variables did not change, which can
be inferred from the ANOVA table [1.5]
[6]
v)
# Using Interaction effects in the model
model3<-lm(PAID~.+STATE:CLASS+STATE:GENDER+CLASS:GENDER,data = AutoClaims)
summary(model3) [5]
##
## Call:
## lm(formula = PAID ~ . + STATE:CLASS + STATE:GENDER + CLASS:GENDER,
## data = AutoClaims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13110.3 -1475.9 -377.5 1250.5 20442.8
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24373.09 3236.06 7.532 9.08e-14 ***
## STATESTATE 02 -7868.94 3269.51 -2.407 0.016227 *
## STATESTATE 03 -3695.87 3422.76 -1.080 0.280427
## STATESTATE 04 6883.00 3545.51 1.941 0.052425 .
## STATESTATE 06 5428.58 3262.57 1.664 0.096363 .
## STATESTATE 07 -979.80 1184.59 -0.827 0.408314
## STATESTATE 10 7340.08 3546.35 2.070 0.038664 *
## STATESTATE 12 1048.26 3382.21 0.310 0.756659
## STATESTATE 14 -2796.37 3538.40 -0.790 0.429494
## STATESTATE 15 -14038.72 3164.04 -4.437 9.86e-06 ***
## STATESTATE 17 -4266.43 3640.98 -1.172 0.241490
## CLASSC11 -15321.38 3534.31 -4.335 1.56e-05 ***
## CLASSC6 -20741.82 3262.53 -6.358 2.79e-10 ***
## CLASSF6 4075.85 3386.90 1.203 0.229026
## GENDERM -5508.80 1305.95 -4.218 2.62e-05 ***
## AGE 23.06 19.35 1.192 0.233581
## STATESTATE 02:CLASSC11 4114.13 3666.38 1.122 0.262008
## STATESTATE 03:CLASSC11 2022.51 3802.06 0.532 0.594847
## STATESTATE 04:CLASSC11 -7719.04 3904.17 -1.977 0.048229 *
## STATESTATE 06:CLASSC11 -6209.18 3743.46 -1.659 0.097412 .
## STATESTATE 07:CLASSC11 -1049.15 1827.01 -0.574 0.565899
## STATESTATE 10:CLASSC11 -6795.03 3956.71 -1.717 0.086144 .
## STATESTATE 12:CLASSC11 -3756.14 3939.41 -0.953 0.340517
## STATESTATE 14:CLASSC11 1668.66 3927.62 0.425 0.671012
## STATESTATE 15:CLASSC11 7776.13 3581.01 2.171 0.030066 *
## STATESTATE 17:CLASSC11 2227.20 4011.07 0.555 0.578806
## STATESTATE 02:CLASSC6 5677.35 3416.32 1.662 0.096777 .
## STATESTATE 03:CLASSC6 2122.55 3605.64 0.589 0.556177
## STATESTATE 04:CLASSC6 -8231.17 3692.27 -2.229 0.025957 *
## STATESTATE 06:CLASSC6 -6151.60 3417.36 -1.800 0.072066 .
## STATESTATE 07:CLASSC6 NA NA NA NA
## STATESTATE 10:CLASSC6 -8117.07 3687.43 -2.201 0.027884 *
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3111 on 1361 degrees of freedom
## Multiple R-squared: 0.8116, Adjusted R-squared: 0.804
## F-statistic: 106.6 on 55 and 1361 DF, p-value: < 2.2e-16
Interpretation [5]
anova(model3)
Analysis of Variance Table
Response: PAID
Df Sum Sq Mean Sq F value Pr(>F)
STATE 10 9.8027e+09 9.8027e+08 101.2664 < 2.2e-16 ***
CLASS 3 3.7869e+10 1.2623e+10 1304.0318 < 2.2e-16 ***
GENDER 1 4.7271e+08 4.7271e+08 48.8334 4.350e-12 ***
AGE 1 6.6423e+06 6.6423e+06 0.6862 0.407612
STATE:CLASS 27 7.8659e+09 2.9133e+08 30.0960 < 2.2e-16 ***
STATE:GENDER 10 2.7570e+08 2.7570e+07 2.8482 0.001634 **
CLASS:GENDER 3 4.6128e+08 1.5376e+08 15.8841 3.733e-10 ***
Residuals 1361 1.3175e+10 9.6801e+06
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1. R-Squared and Adjusted R-Squared increased to above 80% and hence the model is a
better fit compared to the earlier models [2]
2. The interaction effects between a few classes and states turned out to be very
significant (State 6 with Class F6 came out significantly negative). Although State
15 was significantly negative when the main effects alone were considered,
the interaction effects offset that negative effect significantly for
Class C6 and Class C11, whereas the interaction coefficient between
State 15 and Class F6 is not significant, indicating that the claims paid are significantly lower when the
state is 15 and the class is F6 compared to other rating classes. The interaction effects make it possible
to dig deeper into the relationships. Similarly, the main effect of gender (male) is
significantly negative compared to females, but that is offset to some extent for some
states (2, 3, 7, 15) and for some rating classes (C6), whereas it is further negative in
the case of F6. So the differences can be magnified by considering the interaction
effects, improving the predictability of the model [1]
3. The ANOVA table for the model suggests that, except for age, all the main effects and
their interaction effects are significant at the 5% significance level, indicating their
contribution to the predictability of the model [2]
Solution 3:
# Load the data file
indices<-read.csv(file.choose(),header=T);indices
Compute the Pearson correlation coefficients and find the most correlated and least
correlated pairs
correlation [1]
## BM CD EN FM FI HC IN IT TE UT
## BM 1.000 0.882 0.823 0.534 0.819 0.605 0.898 0.447 0.646 0.861
## CD 0.882 1.000 0.783 0.581 0.878 0.637 0.915 0.434 0.696 0.847
## EN 0.823 0.783 1.000 0.446 0.745 0.506 0.799 0.359 0.622 0.793
## FM 0.534 0.581 0.446 1.000 0.485 0.510 0.511 0.303 0.410 0.520
## FI 0.819 0.878 0.745 0.485 1.000 0.502 0.902 0.349 0.623 0.838
## HC 0.605 0.637 0.506 0.510 0.502 1.000 0.588 0.525 0.489 0.530
## IN 0.898 0.915 0.799 0.511 0.902 0.588 1.000 0.370 0.676 0.882
## IT 0.447 0.434 0.359 0.303 0.349 0.525 0.370 1.000 0.291 0.317
## TE 0.646 0.696 0.622 0.410 0.623 0.489 0.676 0.291 1.000 0.669
## UT 0.861 0.847 0.793 0.520 0.838 0.530 0.882 0.317 0.669 1.000
(ii) Do manually
min_cor_pair "IT TE"
max_cor_pair "CD IN"
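The pairs can also be found programmatically rather than by eye. The sketch below re-enters the correlation matrix printed above (values transcribed from that output); the helper `pair_name` is an illustrative name, not part of the original solution.

```r
# Correlation matrix transcribed from the printed output above.
sectors <- c("BM", "CD", "EN", "FM", "FI", "HC", "IN", "IT", "TE", "UT")
cm <- matrix(c(
  1.000, 0.882, 0.823, 0.534, 0.819, 0.605, 0.898, 0.447, 0.646, 0.861,
  0.882, 1.000, 0.783, 0.581, 0.878, 0.637, 0.915, 0.434, 0.696, 0.847,
  0.823, 0.783, 1.000, 0.446, 0.745, 0.506, 0.799, 0.359, 0.622, 0.793,
  0.534, 0.581, 0.446, 1.000, 0.485, 0.510, 0.511, 0.303, 0.410, 0.520,
  0.819, 0.878, 0.745, 0.485, 1.000, 0.502, 0.902, 0.349, 0.623, 0.838,
  0.605, 0.637, 0.506, 0.510, 0.502, 1.000, 0.588, 0.525, 0.489, 0.530,
  0.898, 0.915, 0.799, 0.511, 0.902, 0.588, 1.000, 0.370, 0.676, 0.882,
  0.447, 0.434, 0.359, 0.303, 0.349, 0.525, 0.370, 1.000, 0.291, 0.317,
  0.646, 0.696, 0.622, 0.410, 0.623, 0.489, 0.676, 0.291, 1.000, 0.669,
  0.861, 0.847, 0.793, 0.520, 0.838, 0.530, 0.882, 0.317, 0.669, 1.000
), nrow = 10, byrow = TRUE, dimnames = list(sectors, sectors))

# Keep each pair once: blank out the diagonal and the lower triangle.
cm[lower.tri(cm, diag = TRUE)] <- NA

# Turn a (row, col) index into a label such as "CD IN".
pair_name <- function(idx) paste(rownames(cm)[idx[1]], colnames(cm)[idx[2]])

max_cor_pair <- pair_name(which(cm == max(cm, na.rm = TRUE), arr.ind = TRUE)[1, ])
min_cor_pair <- pair_name(which(cm == min(cm, na.rm = TRUE), arr.ind = TRUE)[1, ])
print(max_cor_pair)   # "CD IN"
print(min_cor_pair)   # "IT TE"
```

With the real data, `cm <- cor(indices[,3:12])` would replace the typed-in matrix, assuming the sector returns sit in columns 3 to 12 as in the PCA call.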
(iii)
Perform a Principal component analysis of the sectoral return values
PCA_corr<-princomp(indices[,3:12])
summary(PCA_corr) [4]
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 0.2106142 0.06763728 0.05825406 0.04631607
## Proportion of Variance 0.7294045 0.07522554 0.05580143 0.03527414
## Cumulative Proportion 0.7294045 0.80463008 0.86043151 0.89570565
## Comp.5 Comp.6 Comp.7 Comp.8
## Standard deviation 0.04255566 0.03735608 0.03326497 0.03093065
## Proportion of Variance 0.02977884 0.02294646 0.01819564 0.01573154
## Cumulative Proportion 0.92548448 0.94843094 0.96662658 0.98235812
## Comp.9 Comp.10
## Standard deviation 0.023803102 0.022500987
## Proportion of Variance 0.009316657 0.008325228
## Cumulative Proportion 0.991674772 1.000000000
Alternatively instead of using princomp, the student can use prcomp as well.
PCA_corr_1<-prcomp(indices[,3:12])
summary(PCA_corr_1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 0.2113 0.06784 0.05843 0.04646 0.04269 0.0377
## Proportion of Variance 0.7294 0.07523 0.05580 0.03527 0.02978 0.0225
## Cumulative Proportion 0.7294 0.80463 0.86043 0.89571 0.92548 0.9483
## PC7 PC8 PC9 PC10
## Standard deviation 0.03337 0.03103 0.02388 0.02257
## Proportion of Variance 0.01820 0.01573 0.00932 0.00833
## Cumulative Proportion 0.96663 0.98236 0.99167 1.00000
sum(PCA_corr$sdev[1:2]^2)/sum(PCA_corr$sdev^2)
## [1] 0.8046301 [1]
OR Alternatively
sum(PCA_corr_1$sdev[1:2]^2)/sum(PCA_corr_1$sdev^2)
## [1] 0.8046301
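The standard deviations reported by `princomp` and `prcomp` differ slightly because `princomp` uses divisor n and `prcomp` uses divisor n - 1, but the constant cancels in the proportion of variance, which is why both give 0.8046301. A minimal sketch on simulated data (the matrix `x_sim` and its values are hypothetical):

```r
# Simulated data: two strongly correlated columns and one independent column.
set.seed(6)
n <- 50
z <- rnorm(n)
x_sim <- cbind(a = z + rnorm(n, sd = 0.2),
               b = z + rnorm(n, sd = 0.2),
               c = rnorm(n))

p1 <- princomp(x_sim)   # component variances use divisor n
p2 <- prcomp(x_sim)     # component variances use divisor n - 1

# Proportion of total variance explained by the first two components.
prop1 <- sum(p1$sdev[1:2]^2) / sum(p1$sdev^2)
prop2 <- sum(p2$sdev[1:2]^2) / sum(p2$sdev^2)

# Each princomp sdev equals the prcomp sdev times sqrt((n - 1) / n).
sd_ratio <- unname(p1$sdev / p2$sdev)

# Scores on different components are uncorrelated by construction.
score_cor <- round(cor(p2$x), 3)
print(c(prop1 = prop1, prop2 = prop2))
print(score_cor)
```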
vi)
Pairwise correlations of the transformed components
round(cor(PCA_corr$scores),3) [3]
OR Alternatively
round(cor(PCA_corr_1$x),3)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## PC1 1 0 0 0 0 0 0 0 0 0
## PC2 0 1 0 0 0 0 0 0 0 0
## PC3 0 0 1 0 0 0 0 0 0 0
## PC4 0 0 0 1 0 0 0 0 0 0
## PC5 0 0 0 0 1 0 0 0 0 0
## PC6 0 0 0 0 0 1 0 0 0 0
## PC7 0 0 0 0 0 0 1 0 0 0
## PC8 0 0 0 0 0 0 0 1 0 0
## PC9 0 0 0 0 0 0 0 0 1 0
## PC10 0 0 0 0 0 0 0 0 0 1
Interpretation
The pairwise correlations between the components after PCA should be zero,
as PCA is a way to deal with highly correlated variables. If N variables are highly
correlated, then they will all load on the same principal component (eigenvector), and
that component will be uncorrelated with the other components (all the components are orthogonal).
Hence the correlations between the components are zero [2]
(vi)
Scree Plot
screeplot(PCA_corr,type = "l")
Interpretation: The number of significant components is 1, as the scree plot has almost flattened out from the second
component onwards [1]
Solution 4
The above scatter plot shows a positive linear relationship between marketing Spend and Sales data.
ii)
> cor = cor(budget$Sales,budget$Spend)
> cor
[1] 0.9701669
iii)
> cor.test(budget$Spend,budget$Sales,method="pearson",alternative = "greater")
Pearson's product-moment correlation
data: budget$Spend and budget$Sales
t = 30.476, df = 58, p-value < 2.2e-16
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
0.9542479 1.0000000
sample estimates:
cor
0.9701669
The p-value is less than 2.2 × 10^-16, showing very strong evidence against the null hypothesis. Thus, we reject the hypothesis that
the Pearson correlation coefficient is equal to 0 and conclude that it is positive.
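The test statistic reported by `cor.test` can be reproduced by hand from t = r * sqrt(n - 2) / sqrt(1 - r^2), which follows a t distribution with n - 2 degrees of freedom under the null hypothesis. The sketch below uses only the numbers printed in the output above.

```r
r <- 0.9701669   # sample correlation from the output above
n <- 60          # sample size, so df = n - 2 = 58

# t statistic for H0: rho = 0.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# One-sided p-value for H1: rho > 0 (upper tail).
p_one_sided <- pt(t_stat, df = n - 2, lower.tail = FALSE)
print(round(t_stat, 3))   # 30.476, matching the cor.test output
print(p_one_sided)
```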
iv)
> reg = lm(Sales ~ Spend, data = budget)
> summary(reg)
Call:
lm(formula = Sales ~ Spend, data = budget)
Residuals:
Min 1Q Median 3Q Max
-25331.9 -6783.1 -844.5 7965.9 25320.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3431.5592 3245.9169 1.057 0.295
Spend 10.5310 0.3455 30.476 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 10650 on 58 degrees of freedom
Multiple R-squared: 0.9412, Adjusted R-squared: 0.9402
F-statistic: 928.8 on 1 and 58 DF, p-value: < 2.2e-16
From the output, the estimate of parameter sigma is 10,650.
v)
> abline(reg)
vi)
From the R output, the proportion of total variability of the responses explained by the model is 94.12%. [1]
viii)
es = resid(reg)
> t.test(es,conf.level = 0.99)$conf.int
[1] -3630.146 3630.146
attr(,"conf.level")
[1] 0.99
From the above, the 99% confidence interval for the mean of the error terms is (-3630.15, 3630.15).
ix) Based on the results in both part (vii) and part (viii), the errors seem to be close to zero and the
confidence interval for the mean of the residuals contains 0. Hence the model seems to be a good fit.
x)
Let Ho: Beta = 10 and H1: Beta not equal to 10
> b1 = (coef(reg))[['Spend']]… [1]
> n = 60
> s = sqrt(sum(es^2)/(n-2))
> SE = s/sqrt(sum((budget$Spend-mean(budget$Spend))^2)).. [2]
> t = (b1-10)/SE … [1]
> pt(t,58,lower.tail = FALSE)… [1]
[1] 0.06489565
> pvalue = 2*pt(t,58,lower.tail = FALSE).. [1]
> pvalue
[1] 0.1297913.. [1]
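As a cross-check, the same test can be run from the estimate and standard error already printed by `summary(reg)`, since the manually computed SE is the "Std. Error" of the Spend coefficient. The sketch below uses the rounded values from that output, so the last digits may differ slightly from the result above.

```r
b1 <- 10.5310   # slope estimate for Spend, from summary(reg)
se <- 0.3455    # its standard error, from summary(reg)
df <- 58        # residual degrees of freedom (n - 2)

# Two-sided test of H0: beta = 10.
t_stat  <- (b1 - 10) / se
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

# Equivalent view: the 95% confidence interval for the slope.
ci_95 <- b1 + c(-1, 1) * qt(0.975, df) * se
print(round(p_value, 3))
print(round(ci_95, 3))
```

The hypothesised value 10 lies inside the 95% interval, which agrees with a two-sided p-value above 0.05.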
xi)
There is insufficient evidence to reject the null hypothesis at the 5% level of significance (p-value 0.1298 > 0.05). The data are
consistent with a slope of 10.
xii)
> y = 3431.5592 + (b1*4500)
> y
[1] 50821.17
With a marketing spend of INR 4,500, the Sales would be INR 50,821.
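The same point prediction can be obtained with `predict()`, which avoids retyping the intercept and can also return an interval. This is a sketch on simulated data: `budget_sim` and its values are invented, so the numbers will not match the answer above.

```r
# Simulated stand-in for the budget data (hypothetical values).
set.seed(7)
budget_sim <- data.frame(Spend = runif(60, 1000, 15000))
budget_sim$Sales <- 3000 + 10.5 * budget_sim$Spend + rnorm(60, sd = 10000)

# The formula must use bare column names with data = ... for newdata to work.
reg_sim <- lm(Sales ~ Spend, data = budget_sim)
new_spend <- data.frame(Spend = 4500)

# Point prediction by hand and via predict(); the two should agree.
by_hand    <- coef(reg_sim)[["(Intercept)"]] + coef(reg_sim)[["Spend"]] * 4500
by_predict <- predict(reg_sim, newdata = new_spend)

# 95% prediction interval for an individual Sales value at Spend = 4500.
pred_int <- predict(reg_sim, newdata = new_spend,
                    interval = "prediction", level = 0.95)
print(by_predict)
print(pred_int)
```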