Lecture 3. Part 1 - Regression Analysis

Using stepwise regression on data from an aptitude test measuring job proficiency, the analysis found:
• Forward and backward stepwise selection identified the same best-fitting model, using Test3, Test1, and Test4 as predictors.
• The final regression equation explained 96.15% of the variation in job proficiency.
• Assumption checks found that the residuals were independent and normally distributed with constant variance, and that the predictors were not multicollinear.


STT153A

Lecture 3. Regression Analysis


Variable Selection and Model Building
• Forward stepwise
Starting with the single best predictor, we add predictors one at a time until the added "explanatory power" is negligible.

• Backward stepwise
Starting with all possible predictors (the full model), we delete "insignificant" predictors one at a time.
Forward stepwise
• Data File. This example is based on the example data file Job_prof.sta (from Neter, Wasserman, and Kutner, 1989, page 473). Open this data file by selecting Open Examples from the File menu (classic menus) or from the Open menu on the Home tab (ribbon bar); it is in the Datasets folder. The first four variables (Test1-Test4) represent four different aptitude tests that were administered to each of the 25 applicants for entry-level clerical positions in a company. Regardless of their test scores, all 25 applicants were hired. Once their probationary period had expired, each of these employees was evaluated and given a job proficiency rating (variable Job_prof).

• Research problem. Using stepwise regression, we will identify the variables (or subset of variables) that best predict job proficiency. Thus, the dependent variable will be Job_prof, and variables Test1-Test4 will be the independent (predictor) variables.

When Test2 was evaluated, its F value was less than the F-to-enter value of 1.0; therefore, it was not entered into the model.
Forward stepwise
Now, according to the Forward stepwise regression procedure, the
subset of aptitude tests (independent variables) that best predicts the
job proficiency score (dependent variable) contains Test3, Test1, and
Test4. Therefore, the regression equation appears as follows:
y = b₀ + b₁x₃ + b₂x₁ + b₃x₄

The final regression equation is:

y = −124.200 + 1.357x₃ + 0.296x₁ + 0.517x₄

The reported p-value is 0.000 (i.e., p < 0.001), which is less than 0.05, meaning the model is significant.

R-squared is 0.9615

Interpretation: 96.15% of the variation in the job proficiency rating can be explained by the model with variables Test3, Test1, and Test4.
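
For readers working outside Statistica, here is a minimal sketch of the same forward stepwise procedure in Python with statsmodels, assuming Job_prof.sta has been exported to a CSV file (the file name job_prof.csv is hypothetical). It uses the fact that the partial F statistic for a single added predictor equals the square of its t statistic.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("job_prof.csv")          # hypothetical export of Job_prof.sta
y = df["Job_prof"]
candidates = ["Test1", "Test2", "Test3", "Test4"]
selected = []
F_TO_ENTER = 1.0                          # threshold used in the lecture's run

while candidates:
    best_var, best_f = None, F_TO_ENTER
    for var in candidates:
        X = sm.add_constant(df[selected + [var]])
        fit = sm.OLS(y, X).fit()
        f_value = fit.tvalues[var] ** 2   # partial F = t^2 for one predictor
        if f_value > best_f:
            best_var, best_f = var, f_value
    if best_var is None:
        break                             # no candidate exceeds F-to-enter
    selected.append(best_var)
    candidates.remove(best_var)

final = sm.OLS(y, sm.add_constant(df[selected])).fit()
print("Selected predictors:", selected)   # per the lecture: Test3, Test1, Test4
print("R-squared:", final.rsquared)       # per the lecture: about 0.9615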
Backward stepwise
Using the same data as in the forward stepwise example, apply backward stepwise and observe the result.
The final regression equation is:
y = −124.382 + 1.306x₃ + 0.296x₁ + 0.520x₄
The reported p-value is 0.000 (i.e., p < 0.001), which is less than 0.05, meaning the model is significant.
R-squared is 0.9555
Interpretation: 95.55% of the variation in the job proficiency rating can be explained by the model with variables Test3, Test1, and Test4.
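
The backward pass can be sketched the same way, again under the hypothetical job_prof.csv export: start from the full model and repeatedly drop the predictor with the smallest partial F until every remaining predictor clears the F-to-remove threshold (taken here to be 1.0, matching the F-to-enter above; the slides do not state it explicitly).

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("job_prof.csv")          # hypothetical export of Job_prof.sta
y = df["Job_prof"]
remaining = ["Test1", "Test2", "Test3", "Test4"]
F_TO_REMOVE = 1.0                         # assumed; not stated in the slides

while remaining:
    fit = sm.OLS(y, sm.add_constant(df[remaining])).fit()
    partial_f = fit.tvalues[remaining] ** 2   # partial F = t^2 per predictor
    worst = partial_f.idxmin()                # weakest remaining predictor
    if partial_f[worst] >= F_TO_REMOVE:
        break                                 # everything left is worth keeping
    remaining.remove(worst)

print("Retained predictors:", remaining)  # per the lecture: Test3, Test1, Test4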
Example 2.
Compare the forward stepwise and backward stepwise results for the FIES data.
[Screenshots: forward and backward stepwise outputs]
Model Diagnostics / Assumption Checking
Assumption Checking
The following are assumptions about the error terms (residuals) in a regression model:
• Independence (through the Durbin-Watson test; Satisfied)
• Normality (through a histogram (visual) or a chi-square test (formal); Satisfied)
• Chi-square test:
H0: the residuals follow a normal distribution
Ha: the residuals do not follow a normal distribution
• Homoscedasticity = constant variance (checked through plots or Levene's test; see the sketch after this list)

Additionally, we also have these assumptions:

• Linearity (through plots)
• No multicollinearity (Satisfied)
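
The slides check constant variance by eye or with Levene's test; a common regression-specific alternative in code is the Breusch-Pagan test, sketched below as a substitute for, not a reproduction of, the lecture's procedure (the job_prof.csv export is hypothetical).

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("job_prof.csv")          # hypothetical export of Job_prof.sta
X = sm.add_constant(df[["Test1", "Test3", "Test4"]])
fit = sm.OLS(df["Job_prof"], X).fit()

# H0: the residuals have constant variance (homoscedasticity).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # p > 0.05 -> H0 not rejected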
Independence of observation

A rule of thumb is that test statistic values in the range of 1.5 to 2.5 are relatively normal. Values outside of this range could be cause for concern. Field (2009) suggests that values under 1 or more than 3 are a definite cause for concern.
Independence of observation using Durbin-Watson
Statistics -> Multiple Regression
-> Variables -> Dependent variable (Job_prof), independent variables (1 2 3 4) -> OK
-> Advanced tab, click the Advanced Options box -> OK -> OK
Click the Residuals/assumptions/prediction tab -> click Perform residual analysis
Click Advanced -> Durbin-Watson statistic
Independence of observation

Interpretation: The Durbin-Watson statistic is 1.148347 for the job proficiency rating, which is below the rule-of-thumb range (1.5-2.5). The respondents in the data are different individuals, with the assumption that they do not affect one another's answers. The value 1.14 can be tolerated.

Note: The assumption of independence of observations should be satisfied at data collection; for example, each respondent should answer the questionnaire only once, since a respondent answering more than once can lead to dependence of observations.
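
The same statistic can be computed outside Statistica; a minimal sketch with statsmodels, assuming the hypothetical job_prof.csv export and the full four-test model used in the menu path above:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("job_prof.csv")          # hypothetical export of Job_prof.sta
X = sm.add_constant(df[["Test1", "Test2", "Test3", "Test4"]])
fit = sm.OLS(df["Job_prof"], X).fit()

dw = durbin_watson(fit.resid)             # values near 2 suggest independence
print(f"Durbin-Watson statistic: {dw:.6f}")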
Normality of residuals by histogram
Statistics -> Multiple Regression
-> Variables -> Dependent variable (Job_prof), independent variables (1 2 3 4) -> OK
-> Advanced tab, click the Advanced Options box -> OK -> OK
Click the Residuals/assumptions/prediction tab -> click Perform residual analysis
Click Normal plot of residuals
Normality of residuals by scatter plot of residuals

H0: the residuals follow a normal distribution

[Figure: Normal Probability Plot of Residuals (Expected Normal Value vs. Residuals)]

Therefore, the residuals follow a normal distribution.

https://online.stat.psu.edu/stat501/lesson/4/4.6
Normality of residuals by histogram

[Figure: Distribution of raw residuals with expected normal curve (No. of obs vs. Residuals)]

https://online.stat.psu.edu/stat501/lesson/4/4.6/4.6.1
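
Both visual checks above (the normal probability plot and the residual histogram) can be reproduced with scipy and matplotlib; a minimal sketch under the same hypothetical CSV-export assumption:

import pandas as pd
import statsmodels.api as sm
import scipy.stats as stats
import matplotlib.pyplot as plt

df = pd.read_csv("job_prof.csv")          # hypothetical export of Job_prof.sta
X = sm.add_constant(df[["Test1", "Test2", "Test3", "Test4"]])
resid = sm.OLS(df["Job_prof"], X).fit().resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(resid, dist="norm", plot=ax1)   # points near the line -> normal
ax1.set_title("Normal Probability Plot of Residuals")
ax2.hist(resid, bins=8, edgecolor="black")     # histogram of raw residuals
ax2.set_title("Distribution of Raw Residuals")
ax2.set_xlabel("Residuals")
ax2.set_ylabel("No of obs")
plt.tight_layout()
plt.show()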
Normality using chi-square
Statistics -> Basic Statistics -> Tables and banners
Specify tables (select variables) -> Job_prof -> Test1 Test2 Test3 Test4 -> OK -> OK
Normality using chi-square
Options -> Expected frequencies
-> Pearson & M-L chi-square
-> Advanced -> Detailed two-way tables -> OK
Normality using chi-square
Chi-square test:
H0: the residuals follow a normal distribution
Ha: the residuals do not follow a normal distribution

Interpretation: None of the p-values is less than 0.05, hence none is statistically significant; therefore, we fail to reject the null hypothesis. The residuals follow a normal distribution.

https://www.youtube.com/watch?v=vn5a5lAL54I
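
As a code analogue of this formal check (the slides use Statistica's Tables and Banners output; the sketch below instead bins the residuals and compares observed counts with those expected under a fitted normal, an assumed equivalent rather than the same menu path):

import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm

df = pd.read_csv("job_prof.csv")          # hypothetical export of Job_prof.sta
X = sm.add_constant(df[["Test1", "Test2", "Test3", "Test4"]])
resid = sm.OLS(df["Job_prof"], X).fit().resid

# Bin the residuals and compare observed counts with counts expected
# under a normal distribution fitted to the residuals.
mu, sd = resid.mean(), resid.std(ddof=1)
edges = np.linspace(resid.min(), resid.max(), 6)      # 5 bins
observed, _ = np.histogram(resid, bins=edges)
expected = len(resid) * np.diff(stats.norm.cdf(edges, loc=mu, scale=sd))
expected *= observed.sum() / expected.sum()           # match total counts

# ddof=2 because the normal's mean and sd were estimated from the data.
chi2, p = stats.chisquare(observed, expected, ddof=2)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")  # p > 0.05 -> fail to reject H0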
Multicollinearity
This is when predictors (independent variables) are correlated with each other; that is, they are redundant.

Depending on the situation, it may not be a problem for your model if only a slight or moderate collinearity issue occurs. However, it is strongly advised to solve the issue if a severe collinearity issue exists (e.g., correlation > 0.8 between two variables, or variance inflation factor (VIF) > 20).
Multicollinearity

• Looking among the independent variables, Test1 to Test4, none of the independent variables has a correlation of more than 0.80. Although Test3 and Test4 have a correlation of 0.7820, this is acceptable.
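
A minimal sketch of these checks in code (pairwise correlations plus the VIFs that the earlier slide names as the formal criterion), again assuming the hypothetical job_prof.csv export:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("job_prof.csv")          # hypothetical export of Job_prof.sta
tests = df[["Test1", "Test2", "Test3", "Test4"]]

print(tests.corr().round(4))              # flag any pairwise correlation > 0.8

X = sm.add_constant(tests)                # VIFs computed with an intercept
for i, name in enumerate(X.columns):
    if name != "const":
        vif = variance_inflation_factor(X.values, i)
        print(f"VIF({name}) = {vif:.2f}")  # flag VIF > 20 per the slide's rule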

You might also like