Lecture 4. Part 1 - Regression Analysis
[Figure: Job_prof (Lower 15.17, Upper 27.02). Panels: Median, Inter-quartile Range & Non-outlier Range; Mean & 95% Confidence Interval (95% confidence for mean: lower 84.18, upper 100); Mean & 95% Prediction Interval (95% prediction for observation: lower 51.32, upper 133). Axis range 50 to 140.]
Homoscedasticity or Homogeneity of Variances (Levene's Test)
Homogeneous means the same in structure or composition. This test gets its name from the null
hypothesis, which claims that the variance of the responses is the same (homogeneous) across
groups.
Ho: The variances of the responses are the same across groups.
Ha: At least one of the variances is different.
Decision: The p-value is 0.886, which is not significant, hence we fail to reject the null
hypothesis. The data give no evidence against homogeneity of variances.
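As a rough illustration of what the test computes, the sketch below implements the mean-centered Levene statistic W in plain Python. The function name `levene_w` and the two small groups are hypothetical and are not the lecture's data. W is referred to an F distribution with (k - 1, N - k) degrees of freedom to obtain a p-value, which this sketch does not compute; a statistics package would normally do that step.

```python
from statistics import mean


def levene_w(*groups):
    """Return Levene's W statistic (mean-centered version) for two or more groups."""
    k = len(groups)                         # number of groups
    n_total = sum(len(g) for g in groups)   # total number of observations

    # Z_ij = |Y_ij - mean of group i|: absolute deviations within each group
    z = []
    for g in groups:
        g_mean = mean(g)
        z.append([abs(y - g_mean) for y in g])

    z_group_means = [mean(zi) for zi in z]
    z_grand_mean = sum(sum(zi) for zi in z) / n_total

    # Between-group and within-group variation of the absolute deviations
    between = sum(len(zi) * (zm - z_grand_mean) ** 2
                  for zi, zm in zip(z, z_group_means))
    within = sum((v - zm) ** 2
                 for zi, zm in zip(z, z_group_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups, NOT the lecture's data
group_a = [1, 2, 3]
group_b = [0, 2, 4]
print(round(levene_w(group_a, group_b), 3))
```

A small W (and hence a large p-value, such as the 0.886 reported above) is consistent with equal variances. The sketch does not guard against the degenerate case where every absolute deviation in every group is identical.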
[Figure: scatterplots of TEST1 vs. JOB_PROF and TEST2 vs. JOB_PROF, each with 0.95 Conf. Int.]
Linearity
There needs to be a linear relationship between (a) the dependent variable and each of your
independent variables, and (b) the dependent variable and the independent variables collectively.
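A numeric companion to the scatterplots is the Pearson correlation between the dependent variable and each independent variable. The sketch below is a minimal plain-Python version; the function name `pearson_r` and the toy data are hypothetical. It also shows why the plot still matters: a strong but curved relationship can give a correlation of zero.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: an exactly linear relationship
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))

# Hypothetical data: y = x^2, strongly related but not linearly
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))
```

The first call returns a correlation of 1 and the second returns 0, even though y is completely determined by x in both cases.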
[Figure: scatterplots of TEST3 vs. JOB_PROF and TEST4 vs. JOB_PROF, each with 0.95 Conf. Int.]
Comparison of Models
Consider the following variables from the Jobprof data
Dependent Variable:
Job Proficiency Rating
Independent Variables:
Test 1
Test 2
Test 3
Test 4
Model Comparison
MODEL 1: simple linear regression model, SLRM (Test 1)
• R-Sq = 26.46%
• Adj R-Sq = 23.26%
MODEL 2: (Test 1 and Test 2)
• R-Sq = 46.41%
• Adj R-Sq = 41.54%
MODEL 3: full model (Test 1, Test 2, Test 3, and Test 4)
• R-Sq = 96.29%
• Adj R-Sq = 95.54%
MODEL 4: significant predictors (Test 1, Test 3, and Test 4) / forward stepwise
• R-Sq = 96.15%
• Adj R-Sq = 95.60%
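The relationship between R-Sq and Adj R-Sq in the list above can be sketched with the usual adjustment formula, Adj R-Sq = 1 - (1 - R-Sq)(n - 1)/(n - p - 1), where p is the number of predictors. The sample size n = 25 is an assumption: the slides do not state it, but it reproduces the reported values to within rounding. The names `adjusted_r2`, `N_OBS`, and `models` are illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


N_OBS = 25  # ASSUMPTION: inferred from the reported values, not stated in the slides

# (R-Sq, number of predictors) as reported for the job proficiency models
models = {
    "Model 1 (Test 1)": (0.2646, 1),
    "Model 2 (Test 1, 2)": (0.4641, 2),
    "Model 3 (full)": (0.9629, 4),
    "Model 4 (Test 1, 3, 4)": (0.9615, 3),
}

for name, (r2, p) in models.items():
    print(f"{name}: Adj R-Sq = {adjusted_r2(r2, N_OBS, p):.2%}")
```

Because the adjustment penalizes each extra predictor, Model 4 ends up with a slightly higher Adj R-Sq than the full model even though its R-Sq is slightly lower.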
Consider the following variables from the GPA data
Dependent Variable:
First-Year GPA ($Y$)
Independent Variables:
SATMath ($X_1$)
SATVerbal ($X_2$)
HSMath ($X_3$)
HSEnglish ($X_4$)
Model Comparison
MODEL 1: SLRM (SATMath ($X_1$))
• R-Sq = 72.17%
• Adj R-Sq = 70.62%
• RMSE =
MODEL 2: (SATMath ($X_1$) and SATVerbal ($X_2$))
• R-Sq = 81.10%
• Adj R-Sq = 78.88%
• RMSE =
MODEL 3: full model
• R-Sq = 85.28%
• Adj R-Sq = 81.35%
• RMSE =
MODEL 4: forward stepwise
• R-Sq = 85.04%
• Adj R-Sq = 82.23% (highest Adj R-Sq of the four models, with fewer predictors than the full model)
• RMSE = 0.23
The lower the Root Mean Squared Error (RMSE), the better the model. A perfect model (a
hypothetical model that always predicts the exact observed value) would have an RMSE of 0.
The RMSE has the advantage of expressing the error in the same unit as the predicted variable,
which makes it easy to interpret. If you are trying to predict an amount in dollars, the RMSE
can be interpreted as the amount of error in dollars.
RMSE value interpretation
The closer the RMSE is to 0, the more accurate the model. However, the RMSE is on the same
scale as the target being predicted, so there is no general rule for interpreting ranges of
values. An RMSE value can only be judged relative to the dataset it came from.
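The scale dependence can be seen in a small sketch. The observed and predicted values below are hypothetical, as are the names `rmse`, `obs_dollars`, and `pred_dollars`. The same predictions are expressed once in dollars and once in cents: the RMSE changes by a factor of 100, while the RMSE divided by the mean of the observations does not change.

```python
from math import sqrt


def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


# Hypothetical observed and predicted amounts, in dollars
obs_dollars = [200.0, 400.0, 600.0]
pred_dollars = [210.0, 390.0, 620.0]

# The same values expressed in cents
obs_cents = [100 * v for v in obs_dollars]
pred_cents = [100 * v for v in pred_dollars]

rmse_dollars = rmse(obs_dollars, pred_dollars)
rmse_cents = rmse(obs_cents, pred_cents)

print(rmse_dollars, rmse_cents)

# Relative to the mean of the observations, the two agree
print(rmse_dollars / (sum(obs_dollars) / len(obs_dollars)))
print(rmse_cents / (sum(obs_cents) / len(obs_cents)))
```

Comparing the RMSE with the sample mean of the target, as in the last two lines, is one way to judge it within a given dataset.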
x       Ice cream sales (dollars)
14.2    215
16.4    325
11.9    185
15.2    332
18.5    406
22.1    522
19.4    412
25.1    614
23.4    544
18.1    421
22.6    445
17.2    408

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\mathrm{Obs}_i - \mathrm{Pred}_i)^2}{n}}$$
Interpretation: An RMSE of 34 dollars is good enough when the sample mean of ice
cream sales is 402 dollars, since the typical error is under 10% of the mean.
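The calculation can be sketched in plain Python. The slides do not say which model produced the predictions, so this sketch assumes a least-squares line of the second column (sales) on the first column; the names `fit_line`, `rmse`, `x`, and `sales` are illustrative.

```python
from math import sqrt

# The twelve data pairs listed above: first column, then ice cream sales in dollars
x = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]


def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(obs, pred):
    """Root mean squared error, dividing by n as in the formula above."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


# ASSUMPTION: predictions come from a simple linear regression of sales on x
intercept, slope = fit_line(x, sales)
predicted = [intercept + slope * xi for xi in x]

model_rmse = rmse(sales, predicted)
mean_sales = sum(sales) / len(sales)

print(f"RMSE = {model_rmse:.1f} dollars, mean sales = {mean_sales:.1f} dollars")
print(f"RMSE as a share of the mean = {model_rmse / mean_sales:.1%}")
```

Under that assumption the RMSE comes out just under 35 dollars, in line with the figure quoted in the interpretation, against a mean of about 402 dollars.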
Example 2
Use the GPA data and the best model identified earlier, Model 4, with
predictors SATMath ($x_1$) and SATVerbal ($x_2$) for the dependent variable
First-Year GPA ($y$).
The model is $\hat{y} = 0.002185x_1 + 0.001312x_2$.
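To make the fitted equation concrete, the sketch below evaluates it for one student. The coefficients are the ones given above; the function name `predict_gpa` and the SAT scores 600 and 550 are hypothetical inputs chosen only for illustration, not values from the GPA data.

```python
def predict_gpa(sat_math, sat_verbal):
    """Predicted first-year GPA from the fitted model stated above."""
    return 0.002185 * sat_math + 0.001312 * sat_verbal


# Hypothetical student: SATMath = 600, SATVerbal = 550 (not from the GPA data)
print(round(predict_gpa(600, 550), 4))
```

Comparing predictions like this with the observed GPA values is what the RMSE formula summarizes.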