222ECO01__Anand__advanced_econometrics_activity1.R
andand
2024-02-02
Question 1:
a) Source of data: This dataset is originally from the National Institute
of Diabetes and Digestive and Kidney Diseases.
b) Time period of data: The data were collected between 1965 and 1988.
c) Dependent variable: The dependent variable is “Outcome”, a binary
(dummy) variable that equals 1 if a patient is diabetic and 0 if the
patient is not diabetic.
d) Explanatory variables: The explanatory variables are Pregnancies,
Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes
Pedigree Function, and Age of the patients.
The objective of the dataset is to predict whether or not a patient has
diabetes, based on the explanatory variables included in the dataset.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(purrr)
## Warning: package 'purrr' was built under R version 4.3.2
library(Ecdat)
## Warning: package 'Ecdat' was built under R version 4.3.2
## Loading required package: Ecfun
## Warning: package 'Ecfun' was built under R version 4.3.2
##
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
##
## sign
##
## Attaching package: 'Ecdat'
## The following object is masked from 'package:datasets':
##
## Orange
library(broom)
## Warning: package 'broom' was built under R version 4.3.2
library(aod)
## Warning: package 'aod' was built under R version 4.3.2
library(margins)
## Warning: package 'margins' was built under R version 4.3.2
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(sandwich)
## Warning: package 'sandwich' was built under R version 4.3.2
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.3.2
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:Ecfun':
##
## BoxCox
library(mfx)
## Warning: package 'mfx' was built under R version 4.3.2
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:Ecdat':
##
## SP500
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: betareg
## Warning: package 'betareg' was built under R version 4.3.2
library(brant)
## Warning: package 'brant' was built under R version 4.3.2
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.3.2
library(janitor)
## Warning: package 'janitor' was built under R version 4.3.2
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(nnet)
## Warning: package 'nnet' was built under R version 4.3.2
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.2
# Question 2) Run a Linear Probability Model and interpret the coefficients
diabetes <- read_excel("C:\\Users\\andand\\Desktop\\diabetes.xlsx")
# Recode the text outcome as a 0/1 dummy: 0 = "NotDiabetic", 1 = otherwise
diabetes <- diabetes %>%
  mutate(Outcome_num = ifelse(Outcome == "NotDiabetic", 0, 1))
# Linear probability model: OLS regression of the 0/1 outcome on all regressors
model_lpm <- lm(Outcome_num ~ Pregnancies + Glucose + BloodPressure +
                  SkinThickness + Insulin + BMI + DiabetesPedigreeFunction +
                  Age, data = diabetes)
summary(model_lpm)
##
## Call:
## lm(formula = Outcome_num ~ Pregnancies + Glucose + BloodPressure +
## SkinThickness + Insulin + BMI + DiabetesPedigreeFunction +
## Age, data = diabetes)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.01348 -0.29513 -0.09541  0.32112  1.24160
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)              -0.8538943  0.0854850  -9.989  < 2e-16 ***
## Pregnancies               0.0205919  0.0051300   4.014 6.56e-05 ***
## Glucose                   0.0059203  0.0005151  11.493  < 2e-16 ***
## BloodPressure            -0.0023319  0.0008116  -2.873  0.00418 **
## SkinThickness             0.0001545  0.0011122   0.139  0.88954
## Insulin                  -0.0001805  0.0001498  -1.205  0.22857
## BMI                       0.0132440  0.0020878   6.344 3.85e-10 ***
## DiabetesPedigreeFunction  0.1472374  0.0450539   3.268  0.00113 **
## Age                       0.0026214  0.0015486   1.693  0.09092 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4002 on 759 degrees of freedom
## Multiple R-squared: 0.3033, Adjusted R-squared: 0.2959
## F-statistic: 41.29 on 8 and 759 DF, p-value: < 2.2e-16
Interpretation of coefficients
i) One additional pregnancy increases the probability of the patient being
diabetic by 0.0205919 (about 2.06 percentage points), holding other
variables constant.
ii) A one-unit increase in Glucose increases the probability of the patient
being diabetic by 0.0059203 (about 0.59 percentage points), holding other
variables constant.
iii) A one-unit increase in Blood Pressure decreases the probability of the
patient being diabetic by 0.0023319 (about 0.23 percentage points), holding
other variables constant.
iv) A one-unit increase in Skin Thickness increases the probability of the
patient being diabetic by 0.0001545, holding other variables constant. This
effect is not statistically significant (p = 0.890).
v) A one-unit increase in Insulin decreases the probability of the patient
being diabetic by 0.0001805, holding other variables constant. This effect
is not statistically significant (p = 0.229).
vi) A one-unit increase in BMI increases the probability of the patient
being diabetic by 0.0132440 (about 1.32 percentage points), holding other
variables constant.
vii) A one-unit increase in Diabetes Pedigree Function increases the
probability of the patient being diabetic by 0.1472374 (about 14.72
percentage points), holding other variables constant.
viii) A one-year increase in Age increases the probability of the patient
being diabetic by 0.0026214 (about 0.26 percentage points), holding other
variables constant. This effect is significant only at the 10% level
(p = 0.091).
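As a rough illustration of how these coefficients combine, the sketch below plugs the estimates reported in the summary(model_lpm) output into the fitted linear equation. The patient profile and the helper name `lpm_predict` are assumptions made up for this example; they are not part of the original analysis. The sketch also hints at a known limitation of the linear probability model: its fitted values are not constrained to lie between 0 and 1.

```r
# Coefficients copied from the summary(model_lpm) output above.
lpm_coef <- c(
  "(Intercept)"            = -0.8538943,
  Pregnancies              =  0.0205919,
  Glucose                  =  0.0059203,
  BloodPressure            = -0.0023319,
  SkinThickness            =  0.0001545,
  Insulin                  = -0.0001805,
  BMI                      =  0.0132440,
  DiabetesPedigreeFunction =  0.1472374,
  Age                      =  0.0026214
)

# Hypothetical helper: fitted LPM value for one named patient profile.
lpm_predict <- function(profile, coef = lpm_coef) {
  x <- c("(Intercept)" = 1, profile)   # prepend the constant term
  sum(coef[names(x)] * x)              # match coefficients by name
}

# Hypothetical patient (illustrative values only, not taken from the data).
patient <- c(
  Pregnancies = 1, Glucose = 85, BloodPressure = 66, SkinThickness = 29,
  Insulin = 0, BMI = 26.6, DiabetesPedigreeFunction = 0.351, Age = 31
)

p_hat <- lpm_predict(patient)
print(p_hat)

# Raising BMI by one unit shifts the fitted value by exactly the BMI coefficient.
patient_bmi <- patient
patient_bmi["BMI"] <- patient["BMI"] + 1
bmi_effect <- lpm_predict(patient_bmi) - p_hat
print(bmi_effect)

# With every regressor at zero the fitted "probability" equals the intercept,
# which is negative, so it falls outside the [0, 1] interval.
zero_patient <- patient * 0
print(lpm_predict(zero_patient))
```

Because the model is linear, the effect of a one-unit change is the same for every patient, whatever the starting values of the regressors. That constant effect is what the interpretations above rely on.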
# Question 3) Run a Logit/Probit model and interpret the coefficients
model_logit <- glm(Outcome_num ~ Pregnancies + Glucose + BloodPressure +
                     SkinThickness + Insulin + BMI + DiabetesPedigreeFunction +
                     Age, data = diabetes, family = binomial(link = "logit"))
summary(model_logit)
##
## Call:
## glm(formula = Outcome_num ~ Pregnancies + Glucose + BloodPressure +
## SkinThickness + Insulin + BMI + DiabetesPedigreeFunction +
## Age, family = binomial(link = "logit"), data = diabetes)
##
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)
## (Intercept)              -8.4046964  0.7166359 -11.728  < 2e-16 ***
## Pregnancies               0.1231823  0.0320776   3.840 0.000123 ***
## Glucose                   0.0351637  0.0037087   9.481  < 2e-16 ***
## BloodPressure            -0.0132955  0.0052336  -2.540 0.011072 *
## SkinThickness             0.0006190  0.0068994   0.090 0.928515
## Insulin                  -0.0011917  0.0009012  -1.322 0.186065
## BMI                       0.0897010  0.0150876   5.945 2.76e-09 ***
## DiabetesPedigreeFunction  0.9451797  0.2991475   3.160 0.001580 **
## Age                       0.0148690  0.0093348   1.593 0.111192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 723.45 on 759 degrees of freedom
## AIC: 741.45
##
## Number of Fisher Scoring iterations: 5
Interpretation of the coefficients:
i) One additional pregnancy increases the log odds of a patient being
diabetic by 0.1231823, holding other variables constant.
ii) A one-unit increase in Glucose increases the log odds of a patient being
diabetic by 0.0351637, holding other variables constant.
iii) A one-unit increase in Blood Pressure decreases the log odds of a
patient being diabetic by 0.0132955, holding other variables constant.
iv) A one-unit increase in Skin Thickness increases the log odds of a
patient being diabetic by 0.0006190, holding other variables constant. This
effect is not statistically significant (p = 0.929).
v) A one-unit increase in Insulin decreases the log odds of a patient being
diabetic by 0.0011917, holding other variables constant. This effect is not
statistically significant (p = 0.186).
vi) A one-unit increase in BMI increases the log odds of a patient being
diabetic by 0.0897010, holding other variables constant.
vii) A one-unit increase in Diabetes Pedigree Function increases the log
odds of a patient being diabetic by 0.9451797, holding other variables
constant.
viii) A one-year increase in Age increases the log odds of a patient being
diabetic by 0.0148690, holding other variables constant. This effect is not
statistically significant at the 10% level (p = 0.111).
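Log odds are hard to read directly, so a common follow-up is to exponentiate the coefficients into odds ratios, or to convert them into a change in probability at a chosen point. The sketch below does both using the estimates reported in the summary(model_logit) output. The patient profile is a made-up assumption for illustration only, and the marginal effect shown is the standard logit formula p(1 - p) * beta evaluated at that single profile, not an average marginal effect over the sample.

```r
# Coefficients copied from the summary(model_logit) output above.
logit_coef <- c(
  "(Intercept)"            = -8.4046964,
  Pregnancies              =  0.1231823,
  Glucose                  =  0.0351637,
  BloodPressure            = -0.0132955,
  SkinThickness            =  0.0006190,
  Insulin                  = -0.0011917,
  BMI                      =  0.0897010,
  DiabetesPedigreeFunction =  0.9451797,
  Age                      =  0.0148690
)

# Odds ratios: exp(beta) is the multiplicative change in the odds of being
# diabetic for a one-unit increase in the regressor (intercept dropped).
slopes <- logit_coef[-1]
odds_ratios <- exp(slopes)
print(round(odds_ratios, 4))

# Hypothetical patient (illustrative values only, not taken from the data).
patient <- c(
  Pregnancies = 1, Glucose = 85, BloodPressure = 66, SkinThickness = 29,
  Insulin = 0, BMI = 26.6, DiabetesPedigreeFunction = 0.351, Age = 31
)

# Linear index (log odds) and implied probability for this profile.
eta <- logit_coef[["(Intercept)"]] + sum(slopes * patient[names(slopes)])
p_hat <- plogis(eta)   # logistic CDF: 1 / (1 + exp(-eta))
print(p_hat)

# Marginal effect of each regressor on the probability at this profile.
marginal_effects <- p_hat * (1 - p_hat) * slopes
print(round(marginal_effects, 5))
```

For example, the odds ratio for Pregnancies is exp(0.1231823), about 1.13, so each additional pregnancy is associated with roughly 13% higher odds of being diabetic, holding other variables constant. Unlike the linear probability model, the probability effect here depends on where the patient starts, because p(1 - p) changes with the profile.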