BIA B350F Assignment 1 Regression Analysis Sample
1. Import the data and prepare the dataset by removing height/weight and removing rows
with missing data
R code
dat <- read.csv("bmi_data.csv", header = TRUE)
dat <- dat[, c(1:5)]   # keep the first five columns (drops the height and weight columns)
dat <- na.omit(dat)    # remove rows with missing data
The skew and kurtosis suggest that BMI, p_eat_time, s_eat_time and exercise_freq are all
positively skewed and heavy-tailed. Fast_food is a binary variable and does not need to be
checked for normality.
The histograms and QQ plots also suggest that all four variables are positively skewed and
heavy-tailed.
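The code for these descriptive statistics and plots is not reproduced in this sample; a minimal sketch of how they could be generated is given below, assuming the psych package (whose describe() function reports skew and kurtosis, matching its use later in this sample):
library(psych)   # describe() reports skew and kurtosis
## numeric summary for the retained variables
describe(dat)
## histograms and QQ plots for the four non-binary variables
par(mfrow = c(2, 4))
for (v in c("BMI", "p_eat_time", "s_eat_time", "exercise_freq")) {
  hist(dat[[v]], breaks = 20, main = paste("Histogram of", v), xlab = v)
  qqnorm(dat[[v]], main = paste("QQ plot for", v))
  qqline(dat[[v]])
}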
Correlation analysis
R code
library(ltm)   # rcor.test() reports pairwise correlations with p-values
rcor.test(dat)
R output (partial) and analysis
              BMI     p_eat_time s_eat_time fast_food exercise_freq
BMI           *****   -0.059      0.003      0.050    -0.128
p_eat_time    <0.001  *****      -0.087      0.004     0.052
s_eat_time     0.739  <0.001     *****       0.027    -0.001
fast_food     <0.001   0.705      0.005     *****     -0.027
exercise_freq <0.001  <0.001      0.930      0.006    *****

upper diagonal part contains correlation coefficient estimates
lower diagonal part contains corresponding p-values
The correlations between 'BMI' and all predicting variables except 's_eat_time' are
statistically significant, but they are relatively weak, with the largest correlation in absolute
value being only -0.128. The correlations between the following pairs of predicting variables
are also statistically significant but weak: 'p_eat_time' and 's_eat_time'; 'p_eat_time' and
'exercise_freq'; 'fast_food' and 's_eat_time'; and 'fast_food' and 'exercise_freq'.
Scatter plot
R code
library(car)
# scatterplot matrix with histograms on the diagonal
scatterplotMatrix(x = dat, diagonal = "histogram")
R output (partial) and analysis
The scatterplots do not reveal any strong linear relationship between the dependent and
independent variables, or between pairs of independent variables.
Checking for missing data
After the initial cleaning, the data set contains no missing data that needs further processing.
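A quick check along the following lines (a sketch) confirms this:
## count missing values per column and incomplete rows; all zero after na.omit()
colSums(is.na(dat))
sum(!complete.cases(dat))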
The boxplots show that the dependent variable and the independent variables all have a
substantial number of far outliers.
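The boxplot code is not shown in this sample; a minimal sketch that would produce one boxplot per variable:
## boxplots of the dependent and independent variables to inspect outliers
par(mfrow = c(1, 5))
for (v in names(dat)) {
  boxplot(dat[[v]], main = paste("Boxplot of", v))
}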
Building the regression model
R code
fit<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat)
summary(fit)
The full regression model is statistically significant but accounts for only 2.1% of the total
variation in BMI. All predictors except s_eat_time are significant at the 0.01 level of
significance.
Residual diagnostics
R code
#fit full model
reg<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat)
summary(reg)
par(mfrow=c(2,2))
plot(reg)
## plot cook's distance
par(mfrow=c(1,1))
cutoff<-4/(nrow(dat)-length(reg$coefficients)-2)
plot(reg,which=4,cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")
## Influence Plot
influencePlot(reg, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's
Distance" )
## plot residuals against each predictor (2 x 2 panel)
par(mfrow=c(2,2))
plot(dat$p_eat_time, resid(reg),
main = "residual by regressors for p_eat_time")
plot(dat$s_eat_time, resid(reg),
main = "residual by regressors for s_eat_time")
plot(dat$fast_food, resid(reg),
main = "residual by regressors for fast_food")
plot(dat$exercise_freq, resid(reg),
main = "residual by regressors for exercise_freq")
R output (partial) and analysis
The residual analysis results of the full model are given below.
The results suggest significant problems with normality, linearity and homoscedasticity.
The Cook's distance plot suggests that two observations, 6180 and 3968, are highly
influential points. The influence plot suggests there are substantial leverage points and
outliers in the data set. Since the three basic assumptions cannot be fulfilled, further
investigation must be conducted to see whether transforming the variables can improve the
situation.
[Figure: residuals of the full model plotted against each predictor (p_eat_time, s_eat_time, fast_food, exercise_freq)]
The plots of residuals against the independent variables also show non-random patterns,
suggesting that a non-linear model may fit better.
Transforming variables
A. BMI
R code
library(psych)   # describe() reports skew and kurtosis
log_BMI<-log10(dat$BMI)
inv_BMI<-1/dat$BMI
par(mfrow=c(2,2))
hist(log_BMI, breaks = 20, main="Histogram of log_BMI")
qqnorm(log_BMI, main = "QQ plot for log_BMI")
qqline(log_BMI)
hist(inv_BMI, breaks = 20, main="Histogram of inv_BMI")
qqnorm(inv_BMI, main = "QQ plot for inv_BMI")
qqline(inv_BMI)
describe(cbind(log_BMI,inv_BMI))
R output (partial) and analysis
vars n mean sd median trimmed mad min max range skew kurtosis se
log_BMI 1 10614 1.43 0.09 1.42 1.43 0.09 1.11 1.87 0.75 0.49 0.49 0
inv_BMI 2 10614 0.04 0.01 0.04 0.04 0.01 0.01 0.08 0.06 0.18 0.21 0*
The above measures suggest that the inverse transformation works slightly better than the log10
transformation. However, log-transforming the dependent variable is more common and makes
the model easier to interpret, so we use the log10 transformation for BMI.
B. p_eat_time
R code
log_p_eat_time<-log10(1+dat$p_eat_time)
sqrt_p_eat_time<-sqrt(1+dat$p_eat_time)
par(mfrow=c(2,2))
hist(log_p_eat_time, breaks = 20, main="Histogram of log_p_eat_time")
qqnorm(log_p_eat_time, main = "QQ plot for log_p_eat_time")
qqline(log_p_eat_time)
hist(sqrt_p_eat_time, breaks = 20, main="Histogram of sqrt_p_eat_time")
qqnorm(sqrt_p_eat_time, main = "QQ plot for sqrt_p_eat_time")
qqline(sqrt_p_eat_time)
describe(cbind(log_p_eat_time,sqrt_p_eat_time))
vars n mean sd median trimmed mad min max range skew kurtosis se
log_p_eat_time 1 10614 1.67 0.47 1.79 1.73 0.34 0 2.71 2.71 -1.90 4.66 0.00
sqrt_p_eat_time 2 10614 7.62 2.96 7.81 7.61 2.95 1 22.56 21.56 0.11 0.39 0.03*
The above measures suggest that the square root transformation works better for p_eat_time.
C. s_eat_time
R code
log_s_eat_time<-log10(3+dat$s_eat_time)
inv_s_eat_time<-1/(3+dat$s_eat_time)
par(mfrow=c(2,2))
hist(log_s_eat_time, breaks = 20, main="Histogram of log_s_eat_time")
qqnorm(log_s_eat_time, main = "QQ plot for log_s_eat_time")
qqline(log_s_eat_time)
hist(inv_s_eat_time, breaks = 20, main="Histogram of inv_s_eat_time")
qqnorm(inv_s_eat_time, main = "QQ plot for inv_s_eat_time")
qqline(inv_s_eat_time)
describe(cbind(log_s_eat_time,inv_s_eat_time))
vars n mean sd median trimmed mad min max range skew kurtosis se
log_s_eat_time 1 10614 0.92 0.50 0.85 0.85 0.55 0 3 3 0.89 0.28 0*
inv_s_eat_time 2 10614 0.19 0.15 0.14 0.19 0.19 0 1 1 0.74 2.33 0
The above measures suggest that the log10 transformation works better for s_eat_time.
D. exercise_freq
R code
log_exercise_freq<-log10(3+dat$exercise_freq)
inv_exercise_freq<-1/(3+dat$exercise_freq)
par(mfrow=c(2,2))
hist(log_exercise_freq, breaks = 20, main="Histogram of log_exercise_freq")
qqnorm(log_exercise_freq, main = "QQ plot for log_exercise_freq")
qqline(log_exercise_freq)
hist(inv_exercise_freq, breaks = 20, main="Histogram of inv_exercise_freq")
qqnorm(inv_exercise_freq, main = "QQ plot for inv_exercise_freq")
qqline(inv_exercise_freq)
describe(cbind(log_exercise_freq,inv_exercise_freq))
vars n mean sd median trimmed mad min max range skew kurtosis se
log_exercise_freq 1 10614 0.70 0.21 0.7 0.69 0.33 0.00 1.61 1.61 0.27 -0.63 0*
inv_exercise_freq 2 10614 0.22 0.10 0.2 0.22 0.13 0.02 1.00 0.98 0.90 4.53 0
The above measures also suggest that the log10 transformation performs better for exercise_freq.
Transformed model and diagnostics
A. Build transformed data and regression model
R code
log_BMI<-log10(dat$BMI)   # log10 transformation of BMI, as chosen above
sqrt_p_eat_time<-sqrt(1+dat$p_eat_time)
log_s_eat_time<-log10(3+dat$s_eat_time)
fast_food<-dat$fast_food
log_exercise_freq<-log10(3+dat$exercise_freq)
dat_t<-data.frame(cbind(BMI=log_BMI, p_eat_time=sqrt_p_eat_time,
s_eat_time=log_s_eat_time, fast_food,
exercise_freq=log_exercise_freq))
reg_t<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat_t)
summary(reg_t)
B. Outlier test
The outlier test flags only 1 outlier in the transformed data (instead of 10 before), so
the transformation alleviates the influence of outliers on the model.
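The test itself is not reproduced in this sample; presumably it is the Bonferroni outlier test from the car package, along the following lines (a sketch):
library(car)
## Bonferroni test for outliers based on studentized residuals
outlierTest(reg)     # original model
outlierTest(reg_t)   # transformed model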
C. Residual diagnostic
R code
par(mfrow=c(2,2))
plot(reg_t)
par(mfrow=c(1,1))
cutoff<-4/(nrow(dat)-length(reg_t$coefficients)-2)
plot(reg_t,which=4,cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")
influencePlot(reg_t, id.method="identify", main="Influence Plot",
sub="Circle size is proportional to Cook's Distance" )
## plot residuals against each predictor (2 x 2 panel)
par(mfrow=c(2,2))
plot(dat_t$p_eat_time, resid(reg_t),
main = "residual by regressors for p_eat_time")
plot(dat_t$s_eat_time, resid(reg_t),
main = "residual by regressors for s_eat_time")
plot(dat_t$fast_food, resid(reg_t),
main = "residual by regressors for fast_food")
plot(dat_t$exercise_freq, resid(reg_t),
main = "residual by regressors for exercise_freq")
R output (partial) and analysis
The residual analysis results of the transformed model are given below.
[Figures: the four standard lm diagnostic plots, the Cook's distance plot and the influence plot for the transformed model; observations flagged in these plots include 1734, 3966, 9947, 5861 and 5269]
The results show that transforming the dependent and independent variables reduces the
non-normality, non-linearity and heteroscedasticity of the residuals. It also reduces the
Cook's distance, leverage and studentized residuals of the extreme observations.
[Figure: residuals of the transformed model plotted against each predictor (p_eat_time, s_eat_time, fast_food, exercise_freq)]
The plots of residuals against the independent variables in the transformed model also show
patterns that are closer to random.
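One simple numeric way to support this comparison (a sketch, reusing describe() from the psych package) is to compare the skew and kurtosis of the residuals before and after the transformation:
library(psych)
## skew and kurtosis of the residuals from the original and transformed models
describe(cbind(resid_original = resid(reg), resid_transformed = resid(reg_t)))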
D. Multicollinearity checking
R code
library(mctest)
omcdiag(reg_t)   # overall multicollinearity diagnostics
imcdiag(reg_t)   # individual diagnostics, including VIF and Klein's rule
R output (partial) and analysis

                        MC Results  detection
Determinant |X'X|:          0.9446          0
Farrar Chi-Square:        604.4303          1
Red Indicator:              0.0949          0
Sum of Lambda Inverse:      4.1186          0
Theil's Method:             0.0412          0
Condition Number:          11.9177          0
Although the Farrar chi-square statistic indicates an overall multicollinearity issue and Klein's
rule further suggests that p_eat_time and s_eat_time may suffer from it, the VIFs of all
variables are much smaller than 10. So multicollinearity is unlikely to seriously affect
the model estimation.
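The VIF values referred to above come from the imcdiag() output; they can also be cross-checked with car::vif() (a sketch):
library(car)
## variance inflation factors for the transformed model; values well below 10
vif(reg_t)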
Stepwise regression
R code
library(MASS)
fit1 <- lm(BMI ~ 1, data = dat_t)
fit2 <- lm(BMI~p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat_t)
step <- stepAIC(fit1, direction = "forward", scope= formula(fit2))
Step: AIC=-51265.24
BMI ~ exercise_freq
Step: AIC=-51293.29
BMI ~ exercise_freq + fast_food
Step: AIC=-51314.79
BMI ~ exercise_freq + fast_food + p_eat_time
Df Sum of Sq RSS AIC
+ s_eat_time 1 0.0602 84.253 -51320
<none> 84.314 -51315
Step: AIC=-51320.37
BMI ~ exercise_freq + fast_food + p_eat_time + s_eat_time
> step$anova
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
BMI ~ 1
Final Model:
BMI ~ exercise_freq + fast_food + p_eat_time + s_eat_time
Stepwise regression suggests that all variables should be included in the model based on the
AIC criterion.
R output
Call:
lm(formula = BMI ~ p_eat_time + s_eat_time + fast_food + exercise_freq,
data = dat_t)
Residuals:
Min 1Q Median 3Q Max
-0.31986 -0.06086 -0.00614 0.05410 0.43514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4866728 0.0042372 350.858 < 2e-16 ***
p_eat_time -0.0015862 0.0002989 -5.307 1.13e-07 ***
s_eat_time -0.0048860 0.0017746 -2.753 0.00591 **
fast_food 0.0101628 0.0017632 5.764 8.44e-09 ***
exercise_freq -0.0602290 0.0042292 -14.241 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08912 on 10609 degrees of freedom
Multiple R-squared: 0.02583, Adjusted R-squared: 0.02546
F-statistic: 70.32 on 4 and 10609 DF, p-value: < 2.2e-16

Hypothesis testing
H0: β1 = β2 = β3 = β4 = 0
H1: At least one βi is not equal to zero
F-statistic = 70.32
With p-value < 0.01, we reject H0 and conclude that the model is significant at the 0.01 level of
significance.
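The p-value of the overall F test can be reproduced directly from the F distribution (a sketch):
## P(F(4, 10609) > 70.32), the p-value of the overall F test
pf(70.32, df1 = 4, df2 = 10609, lower.tail = FALSE)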
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4866728 0.0042372 350.858 < 2e-16 ***
p_eat_time -0.0015862 0.0002989 -5.307 1.13e-07 ***
s_eat_time -0.0048860 0.0017746 -2.753 0.00591 **
fast_food 0.0101628 0.0017632 5.764 8.44e-09 ***
exercise_freq -0.0602290 0.0042292 -14.241 < 2e-16 ***
H0: βi = 0, i = 1, 2, 3, 4
H1: βi ≠ 0
With p-values < 0.01, we reject H0 and conclude that the regression coefficients of
'p_eat_time' (square root), 's_eat_time' (log10), 'fast_food', and 'exercise_freq' (log10) are all
significant at the 0.01 level of significance.
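The same conclusion can be read from 99% confidence intervals for the coefficients, none of which should contain zero (a sketch):
## 99% confidence intervals for the regression coefficients
confint(reg_t, level = 0.99)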
Interpretation of coefficients
The interpretation of the coefficients for the transformed model is not straightforward. The
following are the standard interpretations for models involving logarithmic transformation
(Source: Wooldridge J.M. (2013) Introductory Econometrics – A Modern Approach, 5th edn.):

Model         Dependent variable   Independent variable   Interpretation of β1
level-level   y                    x                      Δy = β1 Δx
level-log     y                    log(x)                 Δy = (β1/100) %Δx
log-level     log(y)               x                      %Δy = (100 β1) Δx
log-log       log(y)               log(x)                 %Δy = β1 %Δx

So, the interpretation of the coefficients can be approximated as follows:
For p_eat_time, the interpretation of the coefficient can be obtained by differentiating the fitted
equation with respect to p_eat_time. Since the coefficient of √(1 + p_eat_time) is -0.0015862
(about -0.1586 when read in percentage terms), when p_eat_time increases by 1, BMI falls by
approximately 0.5 × 0.1586/√p_eat_time % = 0.0793/√p_eat_time %. Letting p_eat_time be at its
mean (27.8), when p_eat_time increases by 1% (about 0.278), BMI falls by about
0.278 × 0.0793/√27.8 % ≈ 0.0042%.
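A numeric check of this approximation (a sketch; the coefficient is taken from the model output, the mean value 27.8 from the text above, and the simple "100 × coefficient" percentage reading also follows the text):
b_p   <- -0.0015862                             # coefficient on sqrt(1 + p_eat_time)
p_bar <- 27.8                                   # mean of p_eat_time quoted above
pct_per_unit <- 100 * b_p * 0.5 / sqrt(p_bar)   # approx. % change in BMI per unit of p_eat_time
pct_per_1pct <- 0.01 * p_bar * pct_per_unit     # approx. % change for a 1% increase in p_eat_time
c(pct_per_unit, pct_per_1pct)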
Overall, the impacts of p_eat_time and s_eat_time on BMI are similar, and increasing either time
can reduce BMI, but increasing exercise_freq is the more effective way to reduce BMI. On the
other hand, taking fast food slightly increases BMI, by only around 1%.