BIA B350F Assignment 1 Regression Analysis Sample

The document describes analyzing a dataset on BMI and related variables. It includes importing the data and removing missing values. Descriptive analyses show the variables are positively skewed. Correlation between variables is weak. Scatter plots don't reveal strong linear relationships. Univariate and bivariate outliers are identified in the data. A linear regression model is built with BMI as the outcome and other variables as predictors. The model accounts for 2.1% of BMI variation and identifies outliers. Residual diagnostics show issues with normality, linearity and homoscedasticity.


Import data

1. Import the data and prepare the dataset by removing the height/weight columns and the rows with missing data.
R code
dat <- read.csv("bmi_data.csv", header = TRUE)
dat <- dat[, 1:5]    # keep the first five columns
dat <- na.omit(dat)  # remove rows with missing data

Descriptive analysis and normality check


R code
library(psych)
describe(dat)
par(mfrow=c(1,2))
hist(dat$BMI, breaks = 20, main="Histogram of BMI")
qqnorm(dat$BMI, main = "QQ plot for BMI")
qqline(dat$BMI)
par(mfrow=c(2,2))
hist(dat$p_eat_time, breaks = 20, main="Histogram of p_eat_time")
qqnorm(dat$p_eat_time, main = "QQ plot for p_eat_time")
qqline(dat$p_eat_time)
hist(dat$s_eat_time, breaks = 20, main="Histogram of s_eat_time")
qqnorm(dat$s_eat_time, main = "QQ plot for s_eat_time")
qqline(dat$s_eat_time)
hist(dat$fast_food, breaks = 20, main="Histogram of fast_food")
qqnorm(dat$fast_food, main = "QQ plot for fast_food")
qqline(dat$fast_food)
hist(dat$exercise_freq, breaks = 20, main="Histogram of exercise_freq")
qqnorm(dat$exercise_freq, main = "QQ plot for exercise_freq")
qqline(dat$exercise_freq)
R output (partial) and analysis
vars n mean sd median trimmed mad min max range skew kurtosis se
BMI 1 10614 27.77 6.16 26.6 27.13 5.19 13 73.6 60.6 1.27 2.88 0.06
p_eat_time 2 10614 65.86 48.07 60.0 60.34 44.48 0 508.0 508.0 1.50 4.43 0.47
s_eat_time 3 10614 16.92 51.01 4.0 7.41 5.93 -2 990.0 992.0 8.28 90.26 0.50
fast_food 4 10614 0.58 0.49 1.0 0.60 0.00 0 1.0 1.0 -0.34 -1.88 0.00
exercise_freq 5 10614 2.64 2.94 2.0 2.27 2.97 -2 38.0 40.0 2.18 12.69 0.03

The skewness and kurtosis values suggest that BMI, p_eat_time, s_eat_time and exercise_freq are all
positively skewed and heavy-tailed. fast_food is a binary variable, so it does not need a
normality check.

The histograms and QQ plots also suggest that all four variables are positively skewed and
heavy-tailed.
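The skewness statistic reported by describe() can be reproduced in base R (a minimal sketch using the simple sample-moment formula; psych offers several variants):

```r
# Sample skewness: third central moment scaled by the cube of the
# (population-style) standard deviation.
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

skew(c(1, 2, 3, 4, 5))    # symmetric data -> 0
skew(c(1, 1, 1, 2, 10))   # long right tail -> positive
```

A positive value indicates a long right tail, matching the histogram shapes above.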

Correlation analysis
R code
library(ltm)
rcor.test(dat)
R output (partial) and analysis
BMI p_eat_time s_eat_time fast_food exercise_freq
BMI ***** -0.059 0.003 0.050 -0.128
p_eat_time <0.001 ***** -0.087 0.004 0.052
s_eat_time 0.739 <0.001 ***** 0.027 -0.001
fast_food <0.001 0.705 0.005 ***** -0.027
exercise_freq <0.001 <0.001 0.930 0.006 *****
upper diagonal part contains correlation coefficient estimates
lower diagonal part contains corresponding p-values

The correlations between ‘BMI’ and all predicting variables except ‘s_eat_time’ are
statistically significant, but weak: the largest in magnitude is only -0.128 (with
‘exercise_freq’). The correlations between the following predictor pairs are also statistically
significant but weak: ‘p_eat_time’ and ‘s_eat_time’; ‘p_eat_time’ and ‘exercise_freq’;
‘fast_food’ and ‘s_eat_time’; and ‘fast_food’ and ‘exercise_freq’.

Scatter plot
R code
library(car)
scatterplotMatrix(x = dat, diagonal = "histogram");
R output (partial) and analysis

The scatterplots don’t reveal any strong linear relationship between the dependent variable
and the independent variables, or between pairs of independent variables.

Checking for missing data
The data set has no remaining missing values to be further processed.
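This can be confirmed directly with column-wise NA counts (a base-R sketch on a hypothetical toy frame, not the assignment data):

```r
# Count missing values per column; after na.omit() every count is 0.
toy <- data.frame(BMI = c(27, NA, 31), fast_food = c(1, 0, NA))
colSums(is.na(toy))      # per-column NA counts before cleaning
toy_clean <- na.omit(toy)
sum(is.na(toy_clean))    # no missing values remain
```

Only the first row of the toy frame is complete, so na.omit() keeps a single row.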

Checking for outliers in the data set


R code
#univariate outliers
Boxplot(dat$BMI, main = "Boxplot of BMI", ylab = "BMI", id=list(location="avoid"))
Boxplot(dat$p_eat_time, main = "Boxplot of p_eat_time", ylab = "p_eat_time", id=list(location="avoid"))
Boxplot(dat$s_eat_time, main = "Boxplot of s_eat_time", ylab = "s_eat_time", id=list(location="avoid"))
Boxplot(dat$exercise_freq, main = "Boxplot of exercise_freq", ylab = "exercise_freq",
id=list(location="avoid"))
#bivariate outliers
dataEllipse(x = dat$p_eat_time, y = dat$BMI, levels=0.95,
xlab = "p_eat_time",
ylab = "BMI",
main = "bivariate outliers", id = list(n=5))
dataEllipse(x = dat$s_eat_time, y = dat$BMI, levels=0.95,
xlab = "s_eat_time",
ylab = "BMI",
main = "bivariate outliers", id = list(n=5))
dataEllipse(x = dat$fast_food, y = dat$BMI, levels=0.95,
xlab = "fast_food",
ylab = "BMI",
main = "bivariate outliers", id = list(n=5))
dataEllipse(x = dat$exercise_freq, y = dat$BMI, levels=0.95,
xlab = "exercise_freq",
ylab = "BMI",
main = "bivariate outliers", id = list(n=5))
#multivariate outliers
reg <- lm(BMI ~ p_eat_time + s_eat_time + fast_food + exercise_freq, data = dat)
outlierTest(reg)
The code checks for univariate, bivariate, and multivariate outliers in the
data set.
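Bivariate outliers can also be screened numerically with Mahalanobis distances in base R (a sketch on hypothetical toy points, not the assignment data):

```r
# Screen bivariate outliers with squared Mahalanobis distances.
# Ten unremarkable points plus one deliberately extreme point.
x  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30)
y  <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 30)
xy <- cbind(x, y)

md <- mahalanobis(xy, center = colMeans(xy), cov = cov(xy))

# Squared distances are roughly chi-square(df = 2) under bivariate
# normality, so the 95% quantile gives a conventional cutoff.
which(md > qchisq(0.95, df = 2))   # flags the extreme eleventh point
```

This is the same idea the 95% data ellipse visualizes: points outside the ellipse have large Mahalanobis distances from the centroid.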

R output (partial) and analysis

The boxplots show that the dependent variable and the independent variables all have a
substantial number of far outliers.

The bivariate plots also suggest a number of significant outliers.

rstudent unadjusted p-value Bonferonni p


1838 7.539660 5.0970e-14 5.4099e-10
10506 6.584210 4.7893e-11 5.0834e-07
4186 6.522435 7.2324e-11 7.6765e-07
6180 5.884605 4.1108e-09 4.3632e-05
10054 5.448844 5.1832e-08 5.5014e-04
8592 5.402694 6.7067e-08 7.1185e-04
4118 5.339358 9.5203e-08 1.0105e-03
7292 5.299713 1.1831e-07 1.2558e-03
6270 5.278625 1.3273e-07 1.4088e-03
6824 5.249013 1.5587e-07 1.6544e-03
The outlier test identifies 10 outliers in the linear model.
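The Bonferroni logic behind outlierTest() can be sketched in base R without car (hypothetical toy data; rstudent() gives the externally studentized residuals):

```r
# Bonferroni outlier test by hand: studentized residuals, two-sided
# t p-values, then multiply by n (capped at 1) for the correction.
y <- c(1.1, 1.9, 3.2, 4.0, 4.9, 6.1, 7.0, 20.0)  # last point is an outlier
x <- 1:8
fit <- lm(y ~ x)

rs   <- rstudent(fit)                 # externally studentized residuals
n    <- length(y)
p    <- 2 * pt(abs(rs), df = fit$df.residual - 1, lower.tail = FALSE)
bonf <- pmin(n * p, 1)               # Bonferroni-adjusted p-values
which(bonf < 0.05)                   # flags the eighth observation
```

The studentized residual for each point is computed from a fit that excludes that point, which is why a single gross outlier stands out so sharply.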

Building the regression model
R code
fit<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat)
summary(fit)

R output (partial) and analysis


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.5642467 0.1347230 212.022 < 2e-16 ***
p_eat_time -0.0068468 0.0012380 -5.531 3.27e-08 ***
s_eat_time -0.0003419 0.0011655 -0.293 0.769
fast_food 0.5910893 0.1201912 4.918 8.88e-07 ***
exercise_freq -0.2597586 0.0201490 -12.892 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.099 on 10609 degrees of freedom


Multiple R-squared: 0.02143, Adjusted R-squared: 0.02106
F-statistic: 58.09 on 4 and 10609 DF, p-value: < 2.2e-16

The full regression model is statistically significant and accounts for 2.1% of the total
variation in BMI. All predictors except s_eat_time are significant at the 0.01 level of
significance.

Residual diagnostics
R code
#fit full model
reg<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat)
summary(reg)
par(mfrow=c(2,2))
plot(reg)
## plot cook's distance
par(mfrow=c(1,1))
cutoff<-4/(nrow(dat)-length(reg$coefficients)-2)
plot(reg,which=4,cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")
## Influence Plot
influencePlot(reg, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's
Distance" )
## plot residual vs predictors
plot(dat$p_eat_time, resid(reg),
main = "residual by regressors for p_eat_time")
plot(dat$s_eat_time, resid(reg),
main = "residual by regressors for s_eat_time")
plot(dat$fast_food, resid(reg),
main = "residual by regressors for fast_food")
plot(dat$exercise_freq, resid(reg),
main = "residual by regressors for exercise_freq")

R output (partial) and analysis

The residual analysis results of the final model are given below.

The results suggest significant problems with normality, linearity and homoscedasticity.
The Cook’s distance plot suggests that two observations, 6180 and 3968, are highly
influential points. The influence plot suggests there are substantial leverage points and
outliers in the data set. Since the three basic assumptions are not fulfilled, further
investigation must be conducted to see whether transforming the variables can improve the
situation.

[Figure: plots of residuals against each regressor (p_eat_time, s_eat_time, fast_food, exercise_freq)]

The plots of residuals against the independent variables also show non-random patterns,
suggesting that a non-linear model may fit better.

Transforming variables
A. BMI
R code
log_BMI<-log10(dat$BMI)
inv_BMI<-1/dat$BMI
par(mfrow=c(2,2))
hist(log_BMI, breaks = 20, main="Histogram of log_BMI")
qqnorm(log_BMI, main = "QQ plot for log_BMI")
qqline(log_BMI)
hist(inv_BMI, breaks = 20, main="Histogram of inv_BMI")
qqnorm(inv_BMI, main = "QQ plot for inv_BMI")
qqline(inv_BMI)
describe(cbind(log_BMI,inv_BMI))

R output (partial) and analysis

vars n mean sd median trimmed mad min max range skew kurtosis se
log_BMI 1 10614 1.43 0.09 1.42 1.43 0.09 1.11 1.87 0.75 0.49 0.49 0
inv_BMI 2 10614 0.04 0.01 0.04 0.04 0.01 0.01 0.08 0.06 0.18 0.21 0

The above measures suggest that the inverse transformation works slightly better than the log10
transformation. However, the log transformation is more common and makes the model easier to
interpret, so we use log10 to transform BMI.
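The effect of each candidate transformation on skewness can be checked numerically (a base-R sketch on a hypothetical right-skewed vector spanning roughly the BMI range; on this toy vector log10 happens to win outright, whereas the assignment data favoured the inverse slightly):

```r
# Compare how log10 and inverse transforms change sample skewness.
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

bmi_like <- exp(seq(2.6, 4.3, length.out = 200))  # right-skewed, ~13 to ~74

c(raw     = skew(bmi_like),
  log10   = skew(log10(bmi_like)),   # symmetric after log10, skewness ~ 0
  inverse = skew(1 / bmi_like))
```

A skewness closer to zero after transformation indicates a more symmetric, more nearly normal distribution.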

B. p_eat_time
R code
log_p_eat_time<-log10(1+dat$p_eat_time)
sqrt_p_eat_time<-sqrt(1+dat$p_eat_time)
par(mfrow=c(2,2))
hist(log_p_eat_time, breaks = 20, main="Histogram of log_p_eat_time")
qqnorm(log_p_eat_time, main = "QQ plot for log_p_eat_time")
qqline(log_p_eat_time)
hist(sqrt_p_eat_time, breaks = 20, main="Histogram of sqrt_p_eat_time")
qqnorm(sqrt_p_eat_time, main = "QQ plot for sqrt_p_eat_time")
qqline(sqrt_p_eat_time)
describe(cbind(log_p_eat_time,sqrt_p_eat_time))

R output (partial) and analysis

vars n mean sd median trimmed mad min max range skew kurtosis se
log_p_eat_time 1 10614 1.67 0.47 1.79 1.73 0.34 0 2.71 2.71 -1.90 4.66 0.00
sqrt_p_eat_time 2 10614 7.62 2.96 7.81 7.61 2.95 1 22.56 21.56 0.11 0.39 0.03

The above measures suggest that square root transformation works better for p_eat_time.

C. s_eat_time
R code
log_s_eat_time<-log10(3+dat$s_eat_time)
inv_s_eat_time<-1/(3+dat$s_eat_time)
par(mfrow=c(2,2))
hist(log_s_eat_time, breaks = 20, main="Histogram of log_s_eat_time")
qqnorm(log_s_eat_time, main = "QQ plot for log_s_eat_time")
qqline(log_s_eat_time)
hist(inv_s_eat_time, breaks = 20, main="Histogram of inv_s_eat_time")
qqnorm(inv_s_eat_time, main = "QQ plot for inv_s_eat_time")
qqline(inv_s_eat_time)
describe(cbind(log_s_eat_time,inv_s_eat_time))

R output (partial) and analysis

vars n mean sd median trimmed mad min max range skew kurtosis se
log_s_eat_time 1 10614 0.92 0.50 0.85 0.85 0.55 0 3 3 0.89 0.28 0
inv_s_eat_time 2 10614 0.19 0.15 0.14 0.19 0.19 0 1 1 0.74 2.33 0

The above measures suggest that log10 transformation works better for s_eat_time.

D. exercise_freq
R code
log_exercise_freq<-log10(3+dat$exercise_freq)
inv_exercise_freq<-1/(3+dat$exercise_freq)
par(mfrow=c(2,2))
hist(log_exercise_freq, breaks = 20, main="Histogram of log_exercise_freq")
qqnorm(log_exercise_freq, main = "QQ plot for log_exercise_freq")
qqline(log_exercise_freq)
hist(inv_exercise_freq, breaks = 20, main="Histogram of inv_exercise_freq")
qqnorm(inv_exercise_freq, main = "QQ plot for inv_exercise_freq")
qqline(inv_exercise_freq)
describe(cbind(log_exercise_freq,inv_exercise_freq))

R output (partial) and analysis

vars n mean sd median trimmed mad min max range skew kurtosis se
log_exercise_freq 1 10614 0.70 0.21 0.7 0.69 0.33 0.00 1.61 1.61 0.27 -0.63 0
inv_exercise_freq 2 10614 0.22 0.10 0.2 0.22 0.13 0.02 1.00 0.98 0.90 4.53 0

The above measures also suggest that log10 transformation performs better for exercise_freq.

Transformed model and diagnostics
A. Build transformed data and regression model
R code
log_BMI<-log10(dat$BMI)
sqrt_p_eat_time<-sqrt(1+dat$p_eat_time)
log_s_eat_time<-log10(3+dat$s_eat_time)
fast_food<-dat$fast_food
log_exercise_freq<-log10(3+dat$exercise_freq)
dat_t<-data.frame(cbind(BMI=log_BMI, p_eat_time=sqrt_p_eat_time,
s_eat_time=log_s_eat_time, fast_food,
exercise_freq=log_exercise_freq))
reg_t<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat_t)
summary(reg_t)

R output (partial) and analysis


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4866728 0.0042372 350.858 < 2e-16 ***
p_eat_time -0.0015862 0.0002989 -5.307 1.13e-07 ***
s_eat_time -0.0048860 0.0017746 -2.753 0.00591 **
fast_food 0.0101628 0.0017632 5.764 8.44e-09 ***
exercise_freq -0.0602290 0.0042292 -14.241 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08912 on 10609 degrees of freedom


Multiple R-squared: 0.02583, Adjusted R-squared: 0.02546
F-statistic: 70.32 on 4 and 10609 DF, p-value: < 2.2e-16
The transformed regression model is statistically significant and accounts for 2.5% of the
total variation in BMI. All predictors become significant at the 0.01 level of
significance.
B. Examine the outliers in the transformed variables
Transformation pulls in extreme values and yields distributions that are closer to
normal, so it can reduce the influence of outliers. To verify the effect of the
transformation on the outliers, the outlier test is repeated on the transformed variables
as follows:
R code
#multivariate outliers
reg <- lm(BMI ~ p_eat_time + s_eat_time + fast_food + exercise_freq, data = dat_t)
outlierTest(reg)

R output (partial) and analysis


rstudent unadjusted p-value Bonferonni p
1734 4.888898 1.0289e-06 0.010921

The outlier test now flags only one outlier for the transformed data (instead of 10 before), so
the transformation alleviates the influence of outliers on the model.

C. Residual diagnostic
R code
par(mfrow=c(2,2))
plot(reg_t)
par(mfrow=c(1,1))
cutoff<-4/(nrow(dat)-length(reg_t$coefficients)-2)
plot(reg_t,which=4,cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")
influencePlot(reg_t, id.method="identify", main="Influence Plot",
sub="Circle size is proportional to Cook's Distance" )
plot(dat_t$p_eat_time, resid(reg_t),
main = "residual by regressors for p_eat_time")
plot(dat_t$s_eat_time, resid(reg_t),
main = "residual by regressors for s_eat_time")
plot(dat_t$fast_food, resid(reg_t),
main = "residual by regressors for fast_food")
plot(dat_t$exercise_freq, resid(reg_t),
main = "residual by regressors for exercise_freq")
R output (partial) and analysis
The residual analysis results of the transformed model are given below.

[Figure: diagnostic plots for the transformed model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage; observations 1734, 3966 and 9947 are flagged]

[Figure: Cook's distance plot and influence plot (circle size is proportional to Cook's distance); observations 1734, 3966, 9947, 5861 and 5269 stand out]

The results show that transforming the dependent and independent variables reduces the
non-normality, non-linearity and heteroscedasticity of the residuals. It also reduces the
Cook’s distance, leverage and studentized residuals of the extreme observations.

[Figure: plots of residuals against each regressor for the transformed model (p_eat_time, s_eat_time, fast_food, exercise_freq)]

The plots of residuals against the independent variables in the transformed model show
patterns closer to random.

D. Multicollinearity checking
R code
library(mctest)
omcdiag(reg_t)
imcdiag(reg_t)

R output (partial) and analysis


Overall Multicollinearity Diagnostics

MC Results detection
Determinant |X'X|: 0.9446 0
Farrar Chi-Square: 604.4303 1
Red Indicator: 0.0949 0
Sum of Lambda Inverse: 4.1186 0
Theil's Method: 0.0412 0
Condition Number: 11.9177 0

1 --> COLLINEARITY is detected by the test


0 --> COLLINEARITY is not detected by the test

All Individual Multicollinearity Diagnostics Result


VIF TOL Wi Fi Leamer CVIF Klein
p_eat_time 1.0477 0.9545 168.6540 253.0049 0.9770 1.0484 1
s_eat_time 1.0534 0.9493 188.7192 283.1056 0.9743 1.0541 1
fast_food 1.0095 0.9905 33.7456 50.6232 0.9953 1.0102 0
exercise_freq 1.0081 0.9920 28.4920 42.7421 0.9960 1.0088 0
1 --> COLLINEARITY is detected by the test
0 --> COLLINEARITY is not detected by the test
* all coefficients have significant t-ratios

R-square of y on all x: 0.0258

* use method argument to check which regressors may be the reason of collinearity

Although the Farrar chi-square test indicates an overall multicollinearity issue and Klein’s
rule further suggests that p_eat_time and s_eat_time may suffer from it, the VIFs of all
variables are much smaller than 10. So, multicollinearity is unlikely to seriously affect
the model estimation.
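The VIF column can be reproduced without mctest, since VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors (a base-R sketch on hypothetical data, not the assignment dataset):

```r
# VIF by hand: regress each predictor on the others.
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6, 7, 8),
  x2 = c(2, 1, 4, 3, 6, 5, 8, 7),      # strongly correlated with x1
  x3 = c(1, -1, 1, -1, 1, -1, 1, -1)   # nearly unrelated to x1 and x2
)

vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(data), v), v),
                     data = data))$r.squared
    1 / (1 - r2)   # VIF_j = 1 / (1 - R^2_j)
  })
}

vif_by_hand(d)   # x1 and x2 have elevated VIFs, x3 stays near 1
```

Because R²_j is never negative, VIF is always at least 1; values above 10 are the usual cause for concern.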

Stepwise regression
R code
library(MASS)
fit1 <- lm(BMI ~ 1, data = dat_t)
fit2 <- lm(BMI~p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat_t)
step1 <- stepAIC(fit1, direction = "forward", scope = formula(fit2))

R output (partial) and analysis


Start: AIC=-51050.61
BMI ~ 1

Df Sum of Sq RSS AIC


+ exercise_freq 1 1.74729 84.740 -51265
+ p_eat_time 1 0.27078 86.217 -51082
+ fast_food 1 0.24948 86.238 -51079
+ s_eat_time 1 0.02515 86.462 -51052
<none> 86.487 -51051

Step: AIC=-51265.24
BMI ~ exercise_freq

Df Sum of Sq RSS AIC


+ fast_food 1 0.239568 84.500 -51293
+ p_eat_time 1 0.184170 84.556 -51286
<none> 84.740 -51265
+ s_eat_time 1 0.011356 84.729 -51265

Step: AIC=-51293.29
BMI ~ exercise_freq + fast_food

Df Sum of Sq RSS AIC


+ p_eat_time 1 0.18691 84.314 -51315
+ s_eat_time 1 0.02340 84.477 -51294
<none> 84.500 -51293

Step: AIC=-51314.79
BMI ~ exercise_freq + fast_food + p_eat_time
Df Sum of Sq RSS AIC
+ s_eat_time 1 0.0602 84.253 -51320
<none> 84.314 -51315

Step: AIC=-51320.37
BMI ~ exercise_freq + fast_food + p_eat_time + s_eat_time

> step1$anova
Stepwise Model Path
Analysis of Deviance Table

Initial Model:
BMI ~ 1

Final Model:
BMI ~ exercise_freq + fast_food + p_eat_time + s_eat_time

Step Df Deviance Resid. Df Resid. Dev AIC


1 10613 86.48733 -51050.61
2 + exercise_freq 1 1.74729120 10612 84.74004 -51265.24
3 + fast_food 1 0.23956840 10611 84.50047 -51293.29
4 + p_eat_time 1 0.18690593 10610 84.31357 -51314.79
5 + s_eat_time 1 0.06020019 10609 84.25337 -51320.37

Stepwise regression suggests that all variables can be added to the model under the
AIC criterion.
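The same forward search can be sketched with base R's step(), which behaves like MASS::stepAIC here (hypothetical toy data):

```r
# Forward selection by AIC: start from the intercept-only model and
# let step() add terms from the scope while AIC keeps improving.
x1 <- 1:20
x2 <- rep(c(0, 1), 10)
y  <- 2 * x1 + sin(1:20)   # depends on x1 only; sin() is deterministic "noise"
d  <- data.frame(y, x1, x2)

fit0 <- lm(y ~ 1, data = d)
sel  <- step(fit0, scope = ~ x1 + x2, direction = "forward", trace = 0)
attr(terms(sel), "term.labels")   # x1 is selected
```

At each step the candidate term with the largest AIC reduction is added, exactly as in the stepwise table above.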

Final regression model


R code
final<-lm(BMI~p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat_t)
summary(final)

R output
Call:
lm(formula = BMI ~ p_eat_time + s_eat_time + fast_food + exercise_freq,

data = dat_t)

Residuals:

Min 1Q Median 3Q Max
-0.31986 -0.06086 -0.00614 0.05410 0.43514

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4866728 0.0042372 350.858 < 2e-16 ***
p_eat_time -0.0015862 0.0002989 -5.307 1.13e-07 ***
s_eat_time -0.0048860 0.0017746 -2.753 0.00591 **
fast_food 0.0101628 0.0017632 5.764 8.44e-09 ***
exercise_freq -0.0602290 0.0042292 -14.241 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08912 on 10609 degrees of freedom


Multiple R-squared: 0.02583, Adjusted R-squared: 0.02546
F-statistic: 70.32 on 4 and 10609 DF, p-value: < 2.2e-16

The final regression model is:


log10(BMI) = 1.4866728 − 0.0015862 × sqrt(1 + p_eat_time) − 0.0048860 × log10(3 + s_eat_time)
+ 0.0101628 × fast_food − 0.0602290 × log10(3 + exercise_freq)

Hypothesis testing
Residual standard error: 0.08912 on 10609 degrees of freedom
Multiple R-squared: 0.02583, Adjusted R-squared: 0.02546
F-statistic: 70.32 on 4 and 10609 DF, p-value: < 2.2e-16

H0: β1 = β2 = β3 = β4 = 0
H1: At least one βi is not equal to zero
F-value = 70.32
With p-value < 0.01, we reject H0 and conclude that the model is valid at the 0.01 level of
significance.
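The reported p-value follows directly from the F distribution; the printed statistic can be checked in base R:

```r
# Upper-tail probability of F = 70.32 with (4, 10609) degrees of freedom.
pf(70.32, df1 = 4, df2 = 10609, lower.tail = FALSE)   # far below 0.01
```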
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4866728 0.0042372 350.858 < 2e-16 ***
p_eat_time -0.0015862 0.0002989 -5.307 1.13e-07 ***
s_eat_time -0.0048860 0.0017746 -2.753 0.00591 **
fast_food 0.0101628 0.0017632 5.764 8.44e-09 ***
exercise_freq -0.0602290 0.0042292 -14.241 < 2e-16 ***

H0: βi = 0, i = 1, 2, 3, 4
H1: βi ≠ 0

With p-values < 0.01, we reject H0 and conclude that the regression coefficients of
‘p_eat_time’ (square root), ‘s_eat_time’ (log10), ‘fast_food’, and ‘exercise_freq’ (log10) are
significant at the 0.01 level of significance.

Interpretation of coefficients
The interpretation of the coefficients for the transformed model is not straightforward. The
following are the interpretations for models with logarithmic transformation:

(Source: Wooldridge J.M. (2013) Introductory Econometrics – A Modern Approach, 5th edn.)
So, the interpretation of the coefficients can be approximated as follows:

s_eat_time: When s_eat_time increases by 1%, BMI falls by 0.489%.

fast_food: When fast_food = 1, BMI increases by approximately 1.016%

exercise_freq: When exercise_freq increases by 1%, BMI falls by 6.02%

For p_eat_time, the interpretation of the coefficient can be obtained by taking derivatives.
The result is that when p_eat_time increases by 1, BMI falls by
0.5 × 0.1586 / sqrt(p_eat_time) % = 0.0793 / sqrt(p_eat_time) %. Letting p_eat_time sit at the
mean (27.8), when p_eat_time increases by 1% (about 0.278), BMI falls by
0.278 × 0.0793 / sqrt(27.8) % = 0.418%.

Overall, the impacts of p_eat_time and s_eat_time on BMI are similar: increasing either eating
time can reduce BMI, but exercise_freq is the more effective way to reduce BMI. On the other
hand, taking fast food increases BMI only slightly, by around 1%.
