Assignment 3
Question 1
(a) The best model contains the predictors PAPER and MACHINE, since this pair of
predictors gives the lowest AIC.
(b) Step 1:
-COST ~ 1
-COST ~ MACHINE
-COST ~ PAPER
-COST ~ OVERHEAD
-COST ~ LABOR
Comparing the models above, we find that the best model is COST ~ MACHINE,
because its AIC is the smallest and is lower than that of the current model
(COST ~ 1).
Step 2:
-COST ~ MACHINE
-COST ~ MACHINE+PAPER
-COST ~ MACHINE+OVERHEAD
-COST ~ MACHINE+LABOR
We further add PAPER to the model. The best model is COST ~ MACHINE+PAPER,
because its AIC is the smallest and is also lower than that of the current model.
Step 3:
-COST ~ MACHINE+PAPER
-COST ~ MACHINE+PAPER+OVERHEAD
-COST ~ MACHINE+PAPER+LABOR
The final model consists of MACHINE and PAPER. We terminate the procedure because
the current model has the lowest AIC, so adding OVERHEAD or LABOR does not improve it.
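The forward selection steps above can be sketched in R with `step()`. This is only an illustration: the data frame `toy` is simulated (it is not the assignment's data set), and its coefficients are loosely based on the fitted line in (c), so the AIC values will differ from those in the assignment.

```r
# Simulated toy data (hypothetical; NOT the assignment's data set).
set.seed(42)
n <- 40
toy <- data.frame(
  PAPER    = runif(n, 100, 500),
  MACHINE  = runif(n, 50, 300),
  OVERHEAD = runif(n, 10, 100),
  LABOR    = runif(n, 10, 100)
)
# By construction, COST depends only on PAPER and MACHINE.
toy$COST <- 60 + 0.95 * toy$PAPER + 2.4 * toy$MACHINE + rnorm(n, sd = 10)

# Start from the intercept-only model (Step 1's current model).
null_model <- lm(COST ~ 1, data = toy)

# At each step, add the single term that lowers AIC the most;
# stop when no addition lowers the AIC.
forward_fit <- step(
  null_model,
  scope = list(lower = ~ 1, upper = ~ PAPER + MACHINE + OVERHEAD + LABOR),
  direction = "forward",
  trace = 0
)

# Names of the predictors kept in the final model.
selected <- attr(terms(forward_fit), "term.labels")
print(selected)
```

Setting `trace = 1` instead prints the AIC table for every step, which corresponds to the candidate lists written out in Steps 1 to 3.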
(c) The estimated regression line of the final model is:
COST = 59.4318 + 0.9489(PAPER) + 2.3864(MACHINE)
(d)
(e) Yes, the same variables are included in the final regression model: the all
possible regressions and backward elimination procedures both return the same model.
Question 2
(a) Because of the large number of variables, this approach may not be practically
feasible: there are 2^8 = 256 candidate models to consider, so the running time
will be long. Therefore, it is better to use forward selection.
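The count of candidate models can be checked as follows, assuming 8 candidate predictors (as the figure 256 implies), each of which is either in or out of a model. The total includes the intercept-only model.

```latex
\[
\sum_{k=0}^{8} \binom{8}{k} = 2^{8} = 256
\]
```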
(b)
The three variables PROD, FOV and HOUSE are suggested for the most suitable
model, since this model has R^2 = 0.7613, which indicates a fair fit.
(c)
With AIC as the criterion, the best set of variables in the final regression is PROD,
FOV and HOUSE. Yes, this agrees with my method, because it returns the same model.
Question 3
(a)
##
## Call:
## lm(formula = SALARY ~ YEARS + as.factor(POSITION) + as.factor(EDUCAT) +
##     as.factor(GENDER), data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1410.3 -204.5 -103.4 230.3 752.1
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 495 on 38 degrees of freedom
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.6979
## F-statistic: 14.28 on 8 and 38 DF, p-value: 2.407e-09
The estimated GENDER coefficient is 231.36, which means that the average monthly
salary of male employees is 231.36 higher than that of female employees, holding the
other variables unchanged.
(b) Counted by hand, the residual degrees of freedom of the model are
d.f. = 47 − 1 − 1 − 1 − 4 − 3 = 37, which does not match the R output, which reports 38.
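The two counts can be set side by side. This reading assumes the hand count subtracts one degree of freedom each for the intercept, YEARS and GENDER, four for the POSITION dummies and three for the EDUCAT dummies. The R output reports an F-statistic on 8 and 38 DF, so R estimated only 8 slope coefficients rather than 9, in line with its note "(1 not defined because of singularities)".

```latex
\begin{align*}
\text{hand count:}\quad & 47 - 1 - (1 + 1 + 4 + 3) = 47 - 1 - 9 = 37 \\
\text{R output:}\quad   & 47 - 1 - 8 = 38
\end{align*}
```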
(c)
which(dataset$POSITION == 4 | dataset$POSITION == 5)
## [1] 4 7 8 10 15 16 20 21 24 26 30 33 34 35 41 42 43 45 46 47
which(dataset$EDUCAT == 3 | dataset$EDUCAT == 4)
## [1] 4 7 8 10 15 16 20 21 24 26 30 33 34 35 41 42 43 45 46 47
These two outputs are the same, which indicates that the 20 employees who are chemists
or in management are exactly the 20 employees with a bachelor's or master's degree.
Therefore, the model has perfect multicollinearity.
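A small sketch of why this overlap leaves one coefficient undefined. The data frame `toy` is hypothetical (not the assignment's data set): it is built so that POSITION is 4 or 5 exactly when EDUCAT is 3 or 4, which makes the sum of those two POSITION dummies equal to the sum of those two EDUCAT dummies.

```r
# Hypothetical toy data (NOT the assignment's data set): POSITION is 4 or 5
# exactly when EDUCAT is 3 or 4, mirroring the pattern found above.
set.seed(1)
toy <- data.frame(
  POSITION = rep(1:5, each = 4),
  EDUCAT   = c(1, 2, 1, 2,  1, 1, 2, 2,  2, 1, 2, 1,  3, 4, 3, 4,  3, 3, 4, 4)
)
toy$SALARY <- rnorm(nrow(toy), mean = 3000, sd = 500)

# The two index sets coincide, as in the which() comparison above.
same_rows <- identical(which(toy$POSITION %in% 4:5),
                       which(toy$EDUCAT %in% 3:4))

# Because the dummy columns are linearly dependent, lm() cannot estimate
# all of them and reports NA for the aliased coefficient.
fit <- lm(SALARY ~ as.factor(POSITION) + as.factor(EDUCAT), data = toy)
n_aliased <- sum(is.na(coef(fit)))

print(same_rows)
print(coef(fit))
print(n_aliased)
```

The `NA` entry in `coef(fit)` is what `summary()` describes as "not defined because of singularities".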
(d)
full_model <- lm(SALARY ~ YEARS + as.factor(POSITION) +
as.factor(EDUCAT) + as.factor(GENDER), data = dataset)
reduced_model <- lm(SALARY ~ YEARS + as.factor(POSITION) +
as.factor(EDUCAT), data = dataset)
anova(reduced_model, full_model)
F = 0.4672, p-value = 0.4984,
α = 0.05, d.f. = (1, 38), C.V. = 4.10
The p-value is well above the 0.05 significance level (equivalently, F is below the
critical value). Therefore, we do not reject H0. GENDER is not significant at the 5%
level, which means that there is no significant evidence of gender discrimination
against employees in the provided dataset.
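The critical value and p-value quoted above can be cross-checked from the F distribution. This is a minimal sketch that uses only the reported statistic and degrees of freedom, not the data set itself; the variable names are illustrative.

```r
f_stat <- 0.4672  # partial F statistic reported above
df1 <- 1          # numerator d.f.: one GENDER coefficient is tested
df2 <- 38         # denominator d.f.: residual d.f. of the full model
alpha <- 0.05

# Upper-tail critical value and p-value of the F(df1, df2) distribution.
critical_value <- qf(1 - alpha, df1, df2)
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# Reject H0 only if the statistic exceeds the critical value.
reject_h0 <- f_stat > critical_value

print(round(c(critical_value = critical_value, p_value = p_value), 4))
print(reject_h0)
```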
Question 4
(a)
=1.27758 + 0.49853Age
(b) For non-smokers, each one-unit increase in age is associated with an average
increase in lung capacity of 0.55823.
(c) Per unit increase in age, the average change in lung capacity for smokers is
0.0597 lower than for non-smokers.
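As a consistency check, subtracting this difference from the non-smokers' slope gives the Age slope of the line in (a), assuming that line is the fitted line for smokers:

```latex
\[
0.55823 - 0.0597 = 0.49853
\]
```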
(d) No, they are not individually significant at the 5% level: the p-value of smoke is
0.823 and the p-value of age is 0.377, and both are greater than the 5% significance level.
(e)