BIA B350F Assignment 1 Regression Analysis Sample
1. Import the data and prepare the dataset by removing height/weight and removing rows
with missing data
R code
dat <- read.csv("bmi_data.csv", header = TRUE)
dat <- dat[, c(1:5)]   # keep the first five columns (drops the height and weight columns)
dat <- na.omit(dat)    # remove rows with missing data
The skew and kurtosis suggest that BMI, p_eat_time, s_eat_time and exercise_freq are all
positively skewed and heavy-tailed. Fast_food is a binary variable and does not need to be
checked for normality.
The histograms and QQ plots also suggest that all four variables are positively skewed and
heavy-tailed.
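The code for these descriptive statistics and plots is not reproduced in this sample; a minimal sketch of how they could be generated is given below, assuming the psych package (whose describe() function reports skew and kurtosis, matching its use later in this sample):
library(psych)   # describe() reports skew and kurtosis
## numeric summary for the retained variables
describe(dat)
## histograms and QQ plots for the four non-binary variables
par(mfrow = c(2, 4))
for (v in c("BMI", "p_eat_time", "s_eat_time", "exercise_freq")) {
  hist(dat[[v]], breaks = 20, main = paste("Histogram of", v), xlab = v)
  qqnorm(dat[[v]], main = paste("QQ plot for", v))
  qqline(dat[[v]])
}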
Correlation analysis
R code
library(ltm)   # rcor.test() reports pairwise correlations with p-values
rcor.test(dat)
R output (partial) and analysis
              BMI     p_eat_time s_eat_time fast_food exercise_freq
BMI           *****   -0.059      0.003      0.050    -0.128
p_eat_time    <0.001  *****      -0.087      0.004     0.052
s_eat_time     0.739  <0.001     *****       0.027    -0.001
fast_food     <0.001   0.705      0.005     *****     -0.027
exercise_freq <0.001  <0.001      0.930      0.006    *****

upper diagonal part contains correlation coefficient estimates
lower diagonal part contains corresponding p-values
The correlations between 'BMI' and all predicting variables except 's_eat_time' are
statistically significant, but they are relatively weak, with the largest correlation in absolute
value being only -0.128. The correlations between the following pairs of predicting variables
are also statistically significant but weak: 'p_eat_time' and 's_eat_time'; 'p_eat_time' and
'exercise_freq'; 'fast_food' and 's_eat_time'; and 'fast_food' and 'exercise_freq'.
Scatter plot
R code
library(car)
# scatterplot matrix with histograms on the diagonal
scatterplotMatrix(x = dat, diagonal = "histogram")
R output (partial) and analysis
The scatterplots do not reveal any strong linear relationship between the dependent and
independent variables, or between pairs of independent variables.
Checking for missing data
After the initial cleaning, the data set contains no missing data that needs further processing.
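A quick check along the following lines (a sketch) confirms this:
## count missing values per column and incomplete rows; all zero after na.omit()
colSums(is.na(dat))
sum(!complete.cases(dat))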
The boxplots show that the dependent variable and the independent variables all have a
substantial number of far outliers.
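The boxplot code is not shown in this sample; a minimal sketch that would produce one boxplot per variable:
## boxplots of the dependent and independent variables to inspect outliers
par(mfrow = c(1, 5))
for (v in names(dat)) {
  boxplot(dat[[v]], main = paste("Boxplot of", v))
}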
Building the regression model
R code
fit<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat)
summary(fit)
The full regression model is statistically significant but accounts for only 2.1% of the total
variation in BMI. All predictors except s_eat_time are significant at the 0.01 level of
significance.
Residual diagnostics
R code
#fit full model
reg<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat)
summary(reg)
par(mfrow=c(2,2))
plot(reg)
## plot cook's distance
par(mfrow=c(1,1))
cutoff<-4/(nrow(dat)-length(reg$coefficients)-2)
plot(reg,which=4,cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")
## Influence Plot
influencePlot(reg, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's
Distance" )
## plot residuals against each predictor (2 x 2 panel)
par(mfrow=c(2,2))
plot(dat$p_eat_time, resid(reg),
main = "residual by regressors for p_eat_time")
plot(dat$s_eat_time, resid(reg),
main = "residual by regressors for s_eat_time")
plot(dat$fast_food, resid(reg),
main = "residual by regressors for fast_food")
plot(dat$exercise_freq, resid(reg),
main = "residual by regressors for exercise_freq")
R output (partial) and analysis
The residual analysis results of the full model are given below.
The results suggest significant problems with normality, linearity and homoscedasticity.
The Cook's distance plot suggests that two observations, 6180 and 3968, are highly
influential points. The influence plot suggests there are substantial leverage points and
outliers in the data set. Since the three basic assumptions cannot be fulfilled, further
investigation must be conducted to see whether transforming the variables can improve the
situation.
[Figure: residuals of the full model plotted against each predictor (p_eat_time, s_eat_time, fast_food, exercise_freq)]
The plots of residuals against the independent variables also show non-random patterns,
suggesting that a non-linear model may fit better.
Transforming variables
A. BMI
R code
library(psych)   # describe() reports skew and kurtosis
log_BMI<-log10(dat$BMI)
inv_BMI<-1/dat$BMI
par(mfrow=c(2,2))
hist(log_BMI, breaks = 20, main="Histogram of log_BMI")
qqnorm(log_BMI, main = "QQ plot for log_BMI")
qqline(log_BMI)
hist(inv_BMI, breaks = 20, main="Histogram of inv_BMI")
qqnorm(inv_BMI, main = "QQ plot for inv_BMI")
qqline(inv_BMI)
describe(cbind(log_BMI,inv_BMI))
R output (partial) and analysis
vars n mean sd median trimmed mad min max range skew kurtosis se
log_BMI 1 10614 1.43 0.09 1.42 1.43 0.09 1.11 1.87 0.75 0.49 0.49 0
inv_BMI 2 10614 0.04 0.01 0.04 0.04 0.01 0.01 0.08 0.06 0.18 0.21 0*
The above measures suggest that the inverse transformation works slightly better than the log10
transformation. However, log-transforming the dependent variable is more common and makes
the model easier to interpret, so we use the log10 transformation for BMI.
B. p_eat_time
R code
log_p_eat_time<-log10(1+dat$p_eat_time)
sqrt_p_eat_time<-sqrt(1+dat$p_eat_time)
par(mfrow=c(2,2))
hist(log_p_eat_time, breaks = 20, main="Histogram of log_p_eat_time")
qqnorm(log_p_eat_time, main = "QQ plot for log_p_eat_time")
qqline(log_p_eat_time)
hist(sqrt_p_eat_time, breaks = 20, main="Histogram of sqrt_p_eat_time")
qqnorm(sqrt_p_eat_time, main = "QQ plot for sqrt_p_eat_time")
qqline(sqrt_p_eat_time)
describe(cbind(log_p_eat_time,sqrt_p_eat_time))
vars n mean sd median trimmed mad min max range skew kurtosis se
log_p_eat_time 1 10614 1.67 0.47 1.79 1.73 0.34 0 2.71 2.71 -1.90 4.66 0.00
sqrt_p_eat_time 2 10614 7.62 2.96 7.81 7.61 2.95 1 22.56 21.56 0.11 0.39 0.03*
The above measures suggest that the square root transformation works better for p_eat_time.
C. s_eat_time
R code
log_s_eat_time<-log10(3+dat$s_eat_time)
inv_s_eat_time<-1/(3+dat$s_eat_time)
par(mfrow=c(2,2))
hist(log_s_eat_time, breaks = 20, main="Histogram of log_s_eat_time")
qqnorm(log_s_eat_time, main = "QQ plot for log_s_eat_time")
qqline(log_s_eat_time)
hist(inv_s_eat_time, breaks = 20, main="Histogram of inv_s_eat_time")
qqnorm(inv_s_eat_time, main = "QQ plot for inv_s_eat_time")
qqline(inv_s_eat_time)
describe(cbind(log_s_eat_time,inv_s_eat_time))
vars n mean sd median trimmed mad min max range skew kurtosis se
log_s_eat_time 1 10614 0.92 0.50 0.85 0.85 0.55 0 3 3 0.89 0.28 0*
inv_s_eat_time 2 10614 0.19 0.15 0.14 0.19 0.19 0 1 1 0.74 2.33 0
The above measures suggest that the log10 transformation works better for s_eat_time.
D. exercise_freq
R code
log_exercise_freq<-log10(3+dat$exercise_freq)
inv_exercise_freq<-1/(3+dat$exercise_freq)
par(mfrow=c(2,2))
hist(log_exercise_freq, breaks = 20, main="Histogram of log_exercise_freq")
qqnorm(log_exercise_freq, main = "QQ plot for log_exercise_freq")
qqline(log_exercise_freq)
hist(inv_exercise_freq, breaks = 20, main="Histogram of inv_exercise_freq")
qqnorm(inv_exercise_freq, main = "QQ plot for inv_exercise_freq")
qqline(inv_exercise_freq)
describe(cbind(log_exercise_freq,inv_exercise_freq))
vars n mean sd median trimmed mad min max range skew kurtosis se
log_exercise_freq 1 10614 0.70 0.21 0.7 0.69 0.33 0.00 1.61 1.61 0.27 -0.63 0*
inv_exercise_freq 2 10614 0.22 0.10 0.2 0.22 0.13 0.02 1.00 0.98 0.90 4.53 0
The above measures also suggest that the log10 transformation performs better for exercise_freq.
Transformed model and diagnostics
A. Build transformed data and regression model
R code
log_BMI<-log10(dat$BMI)   # log10 transformation of BMI, as chosen above
sqrt_p_eat_time<-sqrt(1+dat$p_eat_time)
log_s_eat_time<-log10(3+dat$s_eat_time)
fast_food<-dat$fast_food
log_exercise_freq<-log10(3+dat$exercise_freq)
dat_t<-data.frame(cbind(BMI=log_BMI, p_eat_time=sqrt_p_eat_time,
s_eat_time=log_s_eat_time, fast_food,
exercise_freq=log_exercise_freq))
reg_t<-lm(BMI ~ p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat_t)
summary(reg_t)
B. Outlier test
The outlier test flags only 1 outlier in the transformed data (instead of 10 before), so
the transformation alleviates the influence of outliers on the model.
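The test itself is not reproduced in this sample; presumably it is the Bonferroni outlier test from the car package, along the following lines (a sketch):
library(car)
## Bonferroni test for outliers based on studentized residuals
outlierTest(reg)     # original model
outlierTest(reg_t)   # transformed model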
C. Residual diagnostic
R code
par(mfrow=c(2,2))
plot(reg_t)
par(mfrow=c(1,1))
cutoff<-4/(nrow(dat)-length(reg_t$coefficients)-2)
plot(reg_t,which=4,cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")
influencePlot(reg_t, id.method="identify", main="Influence Plot",
sub="Circle size is proportional to Cook's Distance" )
## plot residuals against each predictor (2 x 2 panel)
par(mfrow=c(2,2))
plot(dat_t$p_eat_time, resid(reg_t),
main = "residual by regressors for p_eat_time")
plot(dat_t$s_eat_time, resid(reg_t),
main = "residual by regressors for s_eat_time")
plot(dat_t$fast_food, resid(reg_t),
main = "residual by regressors for fast_food")
plot(dat_t$exercise_freq, resid(reg_t),
main = "residual by regressors for exercise_freq")
R output (partial) and analysis
The residual analysis results of the transformed model are given below.
[Figures: the four standard lm diagnostic plots, the Cook's distance plot and the influence plot for the transformed model; observations flagged in these plots include 1734, 3966, 9947, 5861 and 5269]
The results show that transforming the dependent and independent variables reduces the
non-normality, non-linearity and heteroscedasticity of the residuals. It also reduces the
Cook's distance, leverage and studentized residuals of the extreme observations.
[Figure: residuals of the transformed model plotted against each predictor (p_eat_time, s_eat_time, fast_food, exercise_freq)]
The plots of residuals against the independent variables in the transformed model also show
patterns that are closer to random.
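One simple numeric way to support this comparison (a sketch, reusing describe() from the psych package) is to compare the skew and kurtosis of the residuals before and after the transformation:
library(psych)
## skew and kurtosis of the residuals from the original and transformed models
describe(cbind(resid_original = resid(reg), resid_transformed = resid(reg_t)))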
D. Multicollinearity checking
R code
library(mctest)
omcdiag(reg_t)   # overall multicollinearity diagnostics
imcdiag(reg_t)   # individual diagnostics, including VIF and Klein's rule
R output (partial) and analysis

                        MC Results  detection
Determinant |X'X|:          0.9446          0
Farrar Chi-Square:        604.4303          1
Red Indicator:              0.0949          0
Sum of Lambda Inverse:      4.1186          0
Theil's Method:             0.0412          0
Condition Number:          11.9177          0
Although the Farrar chi-square statistic indicates an overall multicollinearity issue and Klein's
rule further suggests that p_eat_time and s_eat_time may suffer from it, the VIFs of all
variables are much smaller than 10. So multicollinearity is unlikely to seriously affect
the model estimation.
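The VIF values referred to above come from the imcdiag() output; they can also be cross-checked with car::vif() (a sketch):
library(car)
## variance inflation factors for the transformed model; values well below 10
vif(reg_t)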
Stepwise regression
R code
library(MASS)
fit1 <- lm(BMI ~ 1, data = dat_t)
fit2 <- lm(BMI~p_eat_time+s_eat_time+fast_food+exercise_freq, data = dat_t)
step <- stepAIC(fit1, direction = "forward", scope= formula(fit2))
Step: AIC=-51265.24
BMI ~ exercise_freq
Step: AIC=-51293.29
BMI ~ exercise_freq + fast_food
Step: AIC=-51314.79
BMI ~ exercise_freq + fast_food + p_eat_time
Df Sum of Sq RSS AIC
+ s_eat_time 1 0.0602 84.253 -51320
<none> 84.314 -51315
Step: AIC=-51320.37
BMI ~ exercise_freq + fast_food + p_eat_time + s_eat_time
> step$anova
Stepwise Model Path
Analysis of Deviance Table
Initial Model:
BMI ~ 1
Final Model:
BMI ~ exercise_freq + fast_food + p_eat_time + s_eat_time
Stepwise regression suggests that all variables should be included in the model based on the
AIC criterion.
R output
Call:
lm(formula = BMI ~ p_eat_time + s_eat_time + fast_food + exercise_freq,
data = dat_t)
Residuals:
Min 1Q Median 3Q Max
-0.31986 -0.06086 -0.00614 0.05410 0.43514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4866728 0.0042372 350.858 < 2e-16 ***
p_eat_time -0.0015862 0.0002989 -5.307 1.13e-07 ***
s_eat_time -0.0048860 0.0017746 -2.753 0.00591 **
fast_food 0.0101628 0.0017632 5.764 8.44e-09 ***
exercise_freq -0.0602290 0.0042292 -14.241 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08912 on 10609 degrees of freedom
Multiple R-squared: 0.02583, Adjusted R-squared: 0.02546
F-statistic: 70.32 on 4 and 10609 DF, p-value: < 2.2e-16

Hypothesis testing
H0: β1 = β2 = β3 = β4 = 0
H1: At least one βi is not equal to zero
F-statistic = 70.32
With p-value < 0.01, we reject H0 and conclude that the model is significant at the 0.01 level of
significance.
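The p-value of the overall F test can be reproduced directly from the F distribution (a sketch):
## P(F(4, 10609) > 70.32), the p-value of the overall F test
pf(70.32, df1 = 4, df2 = 10609, lower.tail = FALSE)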
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4866728 0.0042372 350.858 < 2e-16 ***
p_eat_time -0.0015862 0.0002989 -5.307 1.13e-07 ***
s_eat_time -0.0048860 0.0017746 -2.753 0.00591 **
fast_food 0.0101628 0.0017632 5.764 8.44e-09 ***
exercise_freq -0.0602290 0.0042292 -14.241 < 2e-16 ***
H0: βi = 0, i = 1, 2, 3, 4
H1: βi ≠ 0
With p-values < 0.01, we reject H0 and conclude that the regression coefficients of
'p_eat_time' (square root), 's_eat_time' (log10), 'fast_food', and 'exercise_freq' (log10) are all
significant at the 0.01 level of significance.
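The same conclusion can be read from 99% confidence intervals for the coefficients, none of which should contain zero (a sketch):
## 99% confidence intervals for the regression coefficients
confint(reg_t, level = 0.99)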
Interpretation of coefficients
The interpretation of the coefficients for the transformed model is not straightforward. The
following are the standard interpretations for models involving logarithmic transformation
(Source: Wooldridge J.M. (2013) Introductory Econometrics – A Modern Approach, 5th edn.):

Model         Dependent variable   Independent variable   Interpretation of β1
level-level   y                    x                      Δy = β1 Δx
level-log     y                    log(x)                 Δy = (β1/100) %Δx
log-level     log(y)               x                      %Δy = (100 β1) Δx
log-log       log(y)               log(x)                 %Δy = β1 %Δx

So, the interpretation of the coefficients can be approximated as follows:
For p_eat_time, the interpretation of the coefficient can be obtained by differentiating the fitted
equation with respect to p_eat_time. Since the coefficient of √(1 + p_eat_time) is -0.0015862
(about -0.1586 when read in percentage terms), when p_eat_time increases by 1, BMI falls by
approximately 0.5 × 0.1586/√p_eat_time % = 0.0793/√p_eat_time %. Letting p_eat_time be at its
mean (27.8), when p_eat_time increases by 1% (about 0.278), BMI falls by about
0.278 × 0.0793/√27.8 % ≈ 0.0042%.
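A numeric check of this approximation (a sketch; the coefficient is taken from the model output, the mean value 27.8 from the text above, and the simple "100 × coefficient" percentage reading also follows the text):
b_p   <- -0.0015862                             # coefficient on sqrt(1 + p_eat_time)
p_bar <- 27.8                                   # mean of p_eat_time quoted above
pct_per_unit <- 100 * b_p * 0.5 / sqrt(p_bar)   # approx. % change in BMI per unit of p_eat_time
pct_per_1pct <- 0.01 * p_bar * pct_per_unit     # approx. % change for a 1% increase in p_eat_time
c(pct_per_unit, pct_per_1pct)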
Overall, the impacts of p_eat_time and s_eat_time on BMI are similar, and increasing either time
can reduce BMI, but increasing exercise_freq is the more effective way to reduce BMI. On the
other hand, taking fast food slightly increases BMI, by only around 1%.