Assignment 2.1
Hospital
Assignment 2
Below are the results of the linear regression between Total Cost to Hospital and AGE.
The graphs above show that the assumptions of normality and homoscedasticity are not
satisfied. In the residuals vs fitted graph we can see a pattern: the residuals are clustered at
lower fitted values and spread far apart at higher fitted values. This shows that the error
variance is not constant but depends on the fitted values. Similarly, the normal Q-Q plot
shows that the residuals deviate from the normal line. Hence the underlying
assumptions of the linear regression model are not satisfied.
So we try the log-linear model.
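The comparison between the two sets of diagnostic plots can be reproduced in outline as below. This is only a sketch on simulated data: the data frame sim, its coefficients and the noise level are made up for illustration and are not the Mission Hospital data. The Shapiro-Wilk test is added here as a rough numeric companion to the Q-Q plot; it is not part of the original analysis.

```r
# Synthetic data only (NOT the Mission Hospital data set).
set.seed(42)
n    <- 248
age  <- runif(n, min = 1, max = 80)
# Multiplicative (log-normal) noise: the spread of cost grows with its mean.
cost <- exp(11.8 + 0.0086 * age + rnorm(n, sd = 0.45))
sim  <- data.frame(age = age, cost = cost)

fit_raw <- lm(cost ~ age, data = sim)        # cost on the original scale
fit_log <- lm(log(cost) ~ age, data = sim)   # log-linear model

# Residuals vs fitted (which = 1) and normal Q-Q (which = 2) for both models.
op <- par(mfrow = c(2, 2))
plot(fit_raw, which = 1:2)
plot(fit_log, which = 1:2)
par(op)

# Rough numeric check of residual normality for each model.
shapiro_raw <- shapiro.test(residuals(fit_raw))
shapiro_log <- shapiro.test(residuals(fit_log))
c(raw = shapiro_raw$p.value, log = shapiro_log$p.value)
```

With data of this kind the raw-scale residuals fan out and are right-skewed, while the log-scale residuals look close to a random scatter.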
Following are the results of the linear regression between log(Total Cost to Hospital) and AGE.
In the residuals vs fitted graph the residuals now show a random scatter, and the pattern that
was visible in the previous graph is gone. The normal Q-Q plot also shows a better fit
to normality than the previous plot. The coefficient beta 1 shows that a one-unit (one-year)
increase in age multiplies the total cost to hospital by a factor of 1.0086 (e^0.008565), i.e. an
increase of about 0.86%.
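The back-transformation of the AGE coefficient can be checked with a few lines of R. The only input is the estimate 0.008565 from the model output; the variable names are chosen here for illustration.

```r
beta_age <- 0.008565                          # AGE coefficient of the log-linear model

factor_per_year   <- exp(beta_age)            # multiplicative change in cost per extra year
pct_per_year      <- 100 * (factor_per_year - 1)
factor_per_decade <- exp(10 * beta_age)       # effects multiply, so 10 years = exp(10 * beta)

round(c(factor_per_year   = factor_per_year,
        pct_per_year      = pct_per_year,
        factor_per_decade = factor_per_decade), 4)
```

Because the model is linear in log cost, the effect of several years compounds multiplicatively rather than adding up in rupees.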
The exponential of lwr is 84434.41 (e^11.34373). So the minimum cost of treatment for a patient
aged 50, taken as the lower limit of the 95% prediction interval, is Rs. 84434.41.
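The step from the log scale back to rupees can be sketched as follows, using the fit, lwr and upr values printed by predict() in the appendix. The residual degrees of freedom (246 = 248 - 2) are an assumption based on the sample size used elsewhere in this report.

```r
# Prediction for AGE = 50 on the log scale (values copied from the appendix).
pred_log <- c(fit = 12.24298, lwr = 11.34373, upr = 13.14223)

# Back-transform every limit to rupees.
pred_rs <- exp(pred_log)
round(pred_rs, 2)

# The interval is symmetric on the log scale: half-width = t quantile * prediction SE.
half_width <- (pred_log[["upr"]] - pred_log[["lwr"]]) / 2
se_pred    <- half_width / qt(0.975, df = 246)   # df assumed to be n - 2
se_pred
```

The implied prediction standard error is only slightly larger than the residual standard error, which is what one expects for a prediction interval at an age well inside the data range.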
Calculate the probability that the treatment cost will exceed the package price. We assume that,
for a given age, the log of the treatment cost follows a normal distribution. Since the mean
(the fitted value at age 50) and the standard deviation (the residual standard error) are known
from the model in question 1, we can calculate the probability as
1-pnorm((log(250000)-mean)/Residual standard error)
The probability that the treatment cost exceeds the package price is 34%. The hospital need not
revise the package price, as log(250000) = 12.43 is greater than the predicted mean log cost of 12.24.
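The calculation above can be written out step by step. The numbers are those in the appendix (fitted value 12.24298, standard deviation 0.455, package price Rs. 250000); the variable names are only illustrative. Note that this sketch, like the original calculation, ignores the uncertainty in the estimated coefficients.

```r
package_price <- 250000
mean_log      <- 12.24298   # fitted log cost at AGE = 50
sd_log        <- 0.455      # residual standard error used in the appendix

# Standardise the log package price, then take the upper-tail probability.
z        <- (log(package_price) - mean_log) / sd_log
p_exceed <- pnorm(z, lower.tail = FALSE)   # same as 1 - pnorm(z)

round(c(z = z, p_exceed = p_exceed), 4)
```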
Gender is a qualitative variable, so a quantitative dummy variable is created, and, as
already discussed in question 1, the log of total cost is used for the regression model. The dummy
variable formed is GEN. The contrasts command shows that it is coded as 1 for male and 0 for
female.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.93436    0.05503 216.865  < 2e-16 ***
GEN          0.19082    0.06726   2.837  0.00493 **
The model shows that for males the total cost to hospital is higher by a factor of
1.210242 (e^0.19082), and the p-value (0.00493) is significant.
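To see why the GEN coefficient can be read this way, a small made-up example helps. The toy data frame below is invented purely for illustration; only the final line uses a number from the report.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  gender   = c("F", "F", "F", "M", "M", "M"),
  log_cost = c(1, 2, 3, 3, 4, 5)
)
toy$GEN <- ifelse(toy$gender == "M", 1, 0)   # 1 = male, 0 = female

fit <- lm(log_cost ~ GEN, data = toy)
coef(fit)   # intercept = mean log cost of females; GEN = male mean minus female mean

# The slope on a 0/1 dummy equals the difference in group means.
gap <- mean(toy$log_cost[toy$GEN == 1]) - mean(toy$log_cost[toy$GEN == 0])

# Applying the same reading to the coefficient reported above:
male_factor <- exp(0.19082)   # ratio of typical male cost to typical female cost
```

Since the response is on the log scale, exponentiating the difference in means gives a ratio of costs rather than a difference in rupees.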
5. Build a simple linear regression model between Total Cost to
Hospital and MARITAL STATUS. Interpret the results.
Following is the plot of Total Cost to Hospital and MARITAL STATUS. MARITAL STATUS here
is a qualitative variable with values MARRIED and UNMARRIED.
Marital status is a qualitative variable, so a quantitative dummy variable is created,
and, as already discussed in question 1, the log of total cost is used for the regression model. The
contrasts command shows that it is coded as 1 for UNMARRIED and 0 for MARRIED.
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.88486    0.03923 302.987   <2e-16 ***
MARITAL_STAT  0.40697    0.05944   6.847    6e-11 ***
---
F-statistic: 46.88 on 1 and 246 DF, p-value: 5.998e-11
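The model shows that for the group coded 1 (UNMARRIED, according to the contrasts output) the
total cost to hospital changes by a factor of 1.5023 (e^0.40697) relative to the group coded 0.
The value 0.6656642 (e^-0.40697) is the reciprocal, i.e. the same comparison made in the opposite
direction. The p-value is also significant.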
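Which group is the reference level decides the sign of the coefficient, so it is worth checking before interpreting the factor. The sketch below uses an invented toy data frame (not the hospital data) and base R's relevel() to show the sign flip; the last two lines use the coefficient reported above.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  status   = factor(c("MARRIED", "MARRIED", "MARRIED",
                      "UNMARRIED", "UNMARRIED", "UNMARRIED")),
  log_cost = c(4, 5, 6, 1, 2, 3)
)
contrasts(toy$status)          # MARRIED = 0 (reference), UNMARRIED = 1

fit_a  <- lm(log_cost ~ status, data = toy)
coef_a <- coef(fit_a)[["statusUNMARRIED"]]    # UNMARRIED relative to MARRIED

# Same model with UNMARRIED as the reference level instead.
toy$status_b <- relevel(toy$status, ref = "UNMARRIED")
fit_b  <- lm(log_cost ~ status_b, data = toy)
coef_b <- coef(fit_b)[["status_bMARRIED"]]    # MARRIED relative to UNMARRIED

# The two codings give coefficients of equal size and opposite sign,
# so the corresponding cost factors are reciprocals of each other.
factor_pos <- exp(0.40697)
factor_neg <- exp(-0.40697)
```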
6. Build a multiple linear regression model with Total Cost to Hospital
as dependent variable, and AGE, GENDER and MARITAL STATUS
as predictors. Compare the results with that of (4) and (5).
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.757557   0.056499 208.102  < 2e-16 ***
AGE           0.007637   0.002555   2.989  0.00308 **
GEN           0.104211   0.062490   1.668  0.09667 .
MARITAL_STAT  0.032630   0.132570   0.246  0.80578
Only AGE is significant at the 5% level. Gender and marital status are insignificant, as seen
from their p-values (0.09667 and 0.80578), although in questions 4 and 5 these variables came
out as significant. This shows that, considered individually, gender and marital status appear
to have a significant effect on the total cost to hospital, but in the combined model, once AGE
is controlled for, their effects are no longer significant. A likely explanation is that these
variables are correlated with AGE.
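One way to check whether predictors overlap in this way is the variance inflation factor (VIF), computed here with base R only. This is a sketch on simulated data: the assumption that younger patients are the unmarried ones is made up for the illustration and has not been verified on the hospital data, and vif_manual is a helper written for this example.

```r
# Synthetic predictors only (NOT the Mission Hospital data set).
set.seed(1)
n         <- 248
age       <- runif(n, 0, 80)
unmarried <- as.numeric(age < 25)   # assumption: marital status is driven by age
male      <- rbinom(n, 1, 0.5)      # gender independent of age in this sketch
X <- data.frame(age, unmarried, male)

# VIF of one predictor = 1 / (1 - R^2) from regressing it on the other predictors.
vif_manual <- function(var, data) {
  others <- setdiff(names(data), var)
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(X), vif_manual, data = X)
round(vifs, 2)
```

A VIF close to 1 means the predictor carries information of its own; larger values mean its standard error in the combined model is inflated by overlap with the other predictors.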
7. Build a multiple linear regression model with appropriate set of predictors.
Identify the statistically significant predictors that the Mission Hospital can
use in predicting Total Cost to Hospital. Comment on the performance of
the fitted model. How does the fitted model help Mission Hospital to take
managerial decisions?
Linear regression by taking all variables as independent variables
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        10.4195765  0.4989676  20.882  < 2e-16 ***
AGE                 0.0085850  0.0030825   2.785 0.006015 **
MALE               -0.0410937  0.0716926  -0.573 0.567339
UNMARRIED           0.0964430  0.1444585   0.668 0.505364
ACHD                0.0606913  0.1454933   0.417 0.677148
CAD.DVD             0.4675391  0.1300201   3.596 0.000433 ***
CAD.SVD             0.3492459  0.3141862   1.112 0.268025
CAD.TVD             0.3441462  0.1408546   2.443 0.015670 *
CAD.VSD             0.3220618  0.4186867   0.769 0.442926
OS.ASD              0.2303903  0.1517427   1.518 0.130964
other..heart        0.2947377  0.1152326   2.558 0.011488 *
other..respiratory  0.0736222  0.2061631   0.357 0.721494
other.general      -1.6289222  0.4634972  -3.514 0.000577 ***
other.nervous       0.6509382  0.4193210   1.552 0.122602
other.tertalogy     0.3684828  0.1693884   2.175 0.031108 *
PM.VSD              0.2809374  0.2406915   1.167 0.244907
RHD                 0.5645466  0.1333216   4.234  3.9e-05 ***
BODY.WEIGHT         0.0022855  0.0037020   0.617 0.537890
BODY.HEIGHT         0.0005591  0.0016910   0.331 0.741381
HR.PULSE            0.0050994  0.0019315   2.640 0.009129 **
BP..HIGH           -0.0021987  0.0023049  -0.954 0.341603
BP.LOW             -0.0005388  0.0032198  -0.167 0.867311
RR                  0.0173013  0.0090719   1.907 0.058343 .
Diabetes1          -0.0931856  0.1643344  -0.567 0.571496
Diabetes2           0.2090071  0.1756235   1.190 0.235820
hypertension1      -0.0623585  0.1217057  -0.512 0.609116
hypertension2      -0.2203463  0.1496889  -1.472 0.143028
hypertension3       0.1137384  0.1999772   0.569 0.570339
other              -0.0703775  0.1239298  -0.568 0.570932
HB                  0.0027892  0.0118002   0.236 0.813456
UREA                0.0008210  0.0026521   0.310 0.757307
CREATININE          0.2667857  0.1271125   2.099 0.037444 *
AMBULANCE           0.1048268  0.3199244   0.328 0.743607
TRANSFERRED        -0.2662347  0.2261663  -1.177 0.240923
ELECTIVE            0.0878894  0.3115261   0.282 0.778221
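The significant predictors (p < 0.05) in the table above are AGE, CAD.DVD, CAD.TVD,
other..heart, other.general, other.tertalogy, RHD, HR.PULSE and CREATININE. Refitting the
model with only these predictors gives: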
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     10.974447   0.176893  62.040  < 2e-16 ***
AGE              0.006630   0.001672   3.965 0.000101 ***
CAD.DVD          0.401122   0.105391   3.806 0.000186 ***
CAD.TVD          0.388842   0.109755   3.543 0.000490 ***
other..heart     0.221803   0.074259   2.987 0.003162 **
other.general   -1.544496   0.419724  -3.680 0.000298 ***
other.tertalogy  0.288918   0.114124   2.532 0.012103 *
RHD              0.490360   0.100450   4.882 2.11e-06 ***
HR.PULSE         0.005739   0.001594   3.600 0.000399 ***
CREATININE       0.223745   0.064466   3.471 0.000633 ***
Residual standard error: 0.4061 on 205 degrees of freedom
(33 observations deleted due to missingness)
Multiple R-squared: 0.4232, Adjusted R-squared: 0.3979
F-statistic: 16.71 on 9 and 205 DF, p-value: < 2.2e-16
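The summary statistics reported above are consistent with each other, which can be checked from the multiple R-squared, the residual degrees of freedom and the number of predictors (9, as given in the F-statistic line). The variable names below are only illustrative.

```r
r2       <- 0.4232   # multiple R-squared
df_resid <- 205      # residual degrees of freedom
k        <- 9        # number of predictors (numerator df of the F-statistic)

# Observations actually used: 248 rows minus 33 dropped for missing values.
n_used <- df_resid + k + 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n_used - 1) / df_resid

# Overall F-statistic expressed through R-squared.
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

round(c(n_used = n_used, adj_r2 = adj_r2, f_stat = f_stat), 4)
```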
The fitted model with only the significant predictors has a multiple R-squared of 0.4232 and
an adjusted R-squared of 0.3979, which shows that the model explains 42.32% of the variation
in log(Total Cost to Hospital).
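As a rough illustration of how such a model could support a pricing decision, the fitted coefficients (taken from the mod7 output in the appendix) can be combined with a patient profile to give an expected cost. The patient below is hypothetical, and exponentiating the predicted log cost gives only an approximate, median-type figure in rupees.

```r
# Coefficients of the reduced model, copied from the mod7 summary.
coefs <- c("(Intercept)" = 10.974447, AGE = 0.006630, CAD.DVD = 0.401122,
           CAD.TVD = 0.388842, other..heart = 0.221803,
           other.general = -1.544496, other.tertalogy = 0.288918,
           RHD = 0.490360, HR.PULSE = 0.005739, CREATININE = 0.223745)

# Hypothetical patient: 50 years old, RHD, pulse 80, creatinine 1.0.
patient <- c("(Intercept)" = 1, AGE = 50, CAD.DVD = 0, CAD.TVD = 0,
             other..heart = 0, other.general = 0, other.tertalogy = 0,
             RHD = 1, HR.PULSE = 80, CREATININE = 1.0)

# Linear predictor on the log scale, then back to rupees.
log_cost <- sum(coefs * patient[names(coefs)])
cost_rs  <- exp(log_cost)

c(log_cost = log_cost, cost_rs = cost_rs)
```

Comparing such a figure with a proposed package price for each diagnosis group is one way the hospital might use the model, keeping in mind that it explains less than half of the variation in log cost.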
Appendix R syntax
Question 1:
> ggplot(mission, aes(x=AGE, y=TOTAL.COST.TO.HOSPITAL)) + geom_point(col="green") +
    geom_smooth(method="lm", col="red") + labs(x="Age", y="Total Cost")
> mod1 <- lm(TOTAL.COST.TO.HOSPITAL ~ AGE, data = mission)
> summary(mod1)
Call:
lm(formula = TOTAL.COST.TO.HOSPITAL ~ AGE, data = mission)
Residuals:
Min 1Q Median 3Q Max
-232683 -61888 -19440 28238 600773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 141216.6 10610.7 13.309 < 2e-16 ***
AGE 1991.2 273.8 7.273 4.67e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE, data = mission)
Residuals:
Min 1Q Median 3Q Max
-1.51748 -0.24402 -0.00536 0.25388 1.39912
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.814724 0.043326 272.693 < 2e-16 ***
AGE 0.008565 0.001118 7.662 4.21e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question 2:
> newage = data.frame(AGE=50)
> p<- predict(mod2, newage, interval = "prediction")
> p
fit lwr upr
1 12.24298 11.34373 13.14223
> exp(p[2])
[1] 84434.41
Question 3:
> 1-pnorm((log(250000)-p[1])/.455)
[1] 0.3411574
Question 4:
# To develop a simple linear model to predict Total cost with Gender as predictor
# plotting Gender vs Total cost
plot(GENDER, TOTAL.COST.TO.HOSPITAL, col="red")
# plotting points & linear model
ggplot(mission, aes(x=GENDER, y=TOTAL.COST.TO.HOSPITAL)) + geom_point(col="green") +
    geom_smooth(method="lm", col="red") + labs(x="Gender", y="Total Cost")
# creating dummy variable for Gender
# vectorised equivalent of looping over all 248 rows: 1 for "M", 0 for "F"
GEN <- ifelse(GENDER == "M", 1, ifelse(GENDER == "F", 0, NA))
> contrasts(GENDER)
M
F 0
M 1
# developing a linear model with Total cost and Gender
> mod3 <- lm(log(TOTAL.COST.TO.HOSPITAL) ~ GEN, data = mission)
> summary(mod3)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ GEN, data = mission)
Residuals:
Min 1Q Median 3Q Max
-1.31142 -0.28273 -0.08258 0.26109 1.57082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.93436 0.05503 216.865 < 2e-16 ***
GEN 0.19082 0.06726 2.837 0.00493 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question 5:
> contrasts(MARITAL.STATUS)
UNMARRIED
MARRIED 0
UNMARRIED 1
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ MARITAL_STAT, data = mission)
Residuals:
Min 1Q Median 3Q Max
-1.3608 -0.2360 -0.0334 0.2396 1.4042
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.88486 0.03923 302.987 <2e-16 ***
MARITAL_STAT 0.40697 0.05944 6.847 6e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question 6:
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + GEN + MARITAL_STAT,
data = mission)
Residuals:
Min 1Q Median 3Q Max
-1.5285 -0.2603 -0.0104 0.2470 1.3529
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.757557 0.056499 208.102 < 2e-16 ***
AGE 0.007637 0.002555 2.989 0.00308 **
GEN 0.104211 0.062490 1.668 0.09667 .
MARITAL_STAT 0.032630 0.132570 0.246 0.80578
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4543 on 244 degrees of freedom
Multiple R-squared: 0.2019, Adjusted R-squared: 0.1921
F-statistic: 20.58 on 3 and 244 DF, p-value: 6.394e-12
Question 7:
> mod6<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE+MALE+UNMARRIED+ACHD+CAD.DVD+
    CAD.SVD+CAD.TVD+CAD.VSD+OS.ASD+other..heart+other..respiratory+
    other.general+other.nervous+other.tertalogy+PM.VSD+RHD+BODY.WEIGHT+
    BODY.HEIGHT+HR.PULSE+BP..HIGH+BP.LOW+RR+Diabetes1+Diabetes2+
    hypertension1+hypertension2+hypertension3+other+HB+UREA+CREATININE+
    AMBULANCE+TRANSFERRED+ELECTIVE,data=mission)
> summary(mod6)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + MALE + UNMARRIED +
ACHD + CAD.DVD + CAD.SVD + CAD.TVD + CAD.VSD + OS.ASD + other..heart +
other..respiratory + other.general + other.nervous + other.tertalogy +
PM.VSD + RHD + BODY.WEIGHT + BODY.HEIGHT + HR.PULSE + BP..HIGH +
BP.LOW + RR + Diabetes1 + Diabetes2 + hypertension1 + hypertension2 +
hypertension3 + other + HB + UREA + CREATININE + AMBULANCE +
TRANSFERRED + ELECTIVE, data = mission)
Residuals:
Min 1Q Median 3Q Max
-0.96533 -0.18093 -0.01659 0.19462 1.19165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.4195765 0.4989676 20.882 < 2e-16 ***
AGE 0.0085850 0.0030825 2.785 0.006015 **
MALE -0.0410937 0.0716926 -0.573 0.567339
UNMARRIED 0.0964430 0.1444585 0.668 0.505364
ACHD 0.0606913 0.1454933 0.417 0.677148
CAD.DVD 0.4675391 0.1300201 3.596 0.000433 ***
CAD.SVD 0.3492459 0.3141862 1.112 0.268025
CAD.TVD 0.3441462 0.1408546 2.443 0.015670 *
CAD.VSD 0.3220618 0.4186867 0.769 0.442926
OS.ASD 0.2303903 0.1517427 1.518 0.130964
other..heart 0.2947377 0.1152326 2.558 0.011488 *
other..respiratory 0.0736222 0.2061631 0.357 0.721494
other.general -1.6289222 0.4634972 -3.514 0.000577 ***
other.nervous 0.6509382 0.4193210 1.552 0.122602
other.tertalogy 0.3684828 0.1693884 2.175 0.031108 *
PM.VSD 0.2809374 0.2406915 1.167 0.244907
RHD 0.5645466 0.1333216 4.234 3.9e-05 ***
BODY.WEIGHT 0.0022855 0.0037020 0.617 0.537890
BODY.HEIGHT 0.0005591 0.0016910 0.331 0.741381
HR.PULSE 0.0050994 0.0019315 2.640 0.009129 **
BP..HIGH -0.0021987 0.0023049 -0.954 0.341603
BP.LOW -0.0005388 0.0032198 -0.167 0.867311
RR 0.0173013 0.0090719 1.907 0.058343 .
Diabetes1 -0.0931856 0.1643344 -0.567 0.571496
Diabetes2 0.2090071 0.1756235 1.190 0.235820
hypertension1 -0.0623585 0.1217057 -0.512 0.609116
hypertension2 -0.2203463 0.1496889 -1.472 0.143028
hypertension3 0.1137384 0.1999772 0.569 0.570339
other -0.0703775 0.1239298 -0.568 0.570932
HB 0.0027892 0.0118002 0.236 0.813456
UREA 0.0008210 0.0026521 0.310 0.757307
CREATININE 0.2667857 0.1271125 2.099 0.037444 *
AMBULANCE 0.1048268 0.3199244 0.328 0.743607
TRANSFERRED -0.2662347 0.2261663 -1.177 0.240923
ELECTIVE 0.0878894 0.3115261 0.282 0.778221
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod7<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE+CAD.DVD+CAD.TVD+other..heart+
    other.general+other.tertalogy+RHD+HR.PULSE+CREATININE,data=f)
>
> summary(mod7)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + CAD.DVD + CAD.TVD +
other..heart + other.general + other.tertalogy + RHD + HR.PULSE +
CREATININE, data = f)
Residuals:
Min 1Q Median 3Q Max
-1.06605 -0.20151 -0.02119 0.19485 1.26342
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.974447 0.176893 62.040 < 2e-16 ***
AGE 0.006630 0.001672 3.965 0.000101 ***
CAD.DVD 0.401122 0.105391 3.806 0.000186 ***
CAD.TVD 0.388842 0.109755 3.543 0.000490 ***
other..heart 0.221803 0.074259 2.987 0.003162 **
other.general -1.544496 0.419724 -3.680 0.000298 ***
other.tertalogy 0.288918 0.114124 2.532 0.012103 *
RHD 0.490360 0.100450 4.882 2.11e-06 ***
HR.PULSE 0.005739 0.001594 3.600 0.000399 ***
CREATININE 0.223745 0.064466 3.471 0.000633 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1