
STT153A

Lecture 4. Regression Analysis


Assumption Checking
The following are assumptions about the error terms (residuals) in a regression model:
• Independence (checked through the Durbin-Watson test; satisfied)
• Normality (checked through a histogram (visual) or a chi-square test (formal); satisfied)
  Chi-square test:
  H0: the residuals follow a normal distribution
  Ha: the residuals do not follow a normal distribution
• Homoscedasticity, i.e., constant variance (checked through plots or Levene's test)

Additionally, we also have these assumptions (a code sketch of all the checks follows the list):

• Linearity (checked through plots)
• No multicollinearity (satisfied)
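As a rough illustration outside Statistica, here is a minimal Python sketch of these checks using statsmodels and scipy. The DataFrame `df`, the file name `jobprof.csv`, and the use of Shapiro-Wilk in place of the slides' chi-square test are all assumptions for the sketch:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy import stats

# Hypothetical data file; column names follow the slides.
df = pd.read_csv("jobprof.csv")

X = sm.add_constant(df[["TEST1", "TEST2", "TEST3", "TEST4"]])
model = sm.OLS(df["JOB_PROF"], X).fit()
resid = model.resid

# Independence: Durbin-Watson statistic (values near 2 suggest no autocorrelation).
print("Durbin-Watson:", durbin_watson(resid))

# Normality: Shapiro-Wilk as the formal test (the lecture uses a chi-square test).
print("Shapiro-Wilk:", stats.shapiro(resid))

# No multicollinearity: variance inflation factors (values near 1 are ideal).
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```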
Homoscedasticity
Your data needs to show homoscedasticity: the variance along the line of best fit should remain similar as you move along the line.

[Figure: Graphical summary for JOB_PROF — histogram with median, inter-quartile range and non-outlier range, plus mean with 95% confidence and prediction intervals. Shapiro-Wilk p = 0.642; Mean = 92.20; Std. Dev. = 19.42; Variance = 377; Std. Err. of Mean = 3.885; Skewness = 0.107; Valid N = 25; Minimum = 58; Lower quartile = 78; Median = 94; Upper quartile = 109; Maximum = 127; 95% confidence for Std. Dev.: 15.17 to 27.02; 95% confidence for Mean: 84.18 to 100; 95% prediction interval for an observation: 51.32 to 133.]

Statistica path: Statistics -> Basic Statistics -> Descriptive -> Graph 2 -> Variable
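A common graphical check is a residuals-versus-fitted plot: under homoscedasticity the points scatter evenly around zero with no funnel shape. A minimal matplotlib sketch, reusing the fitted `model` from the sketch above:

```python
import matplotlib.pyplot as plt

# Residuals should spread evenly around zero across the fitted values;
# a widening or narrowing band suggests heteroscedasticity.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```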
Homoscedasticity or Homogeneity of Variances (Levene's Test)
Homogeneous means the same in structure or composition. This test gets its name from the null hypothesis, where we claim that the distributions of the responses are the same (homogeneous) across groups.
Ho: The distributions of responses are the same across groups.
Ha: At least one of the distributions is different.
Decision: The p-value is 0.886, which is not significant; hence we fail to reject the null hypothesis.

Conclusion: The distribution of responses for the job proficiency rating is the same across groups. In other words, the variance is the same throughout the population.
Statistica path: Statistics -> General Linear Models (GLM) -> Advanced Linear/Nonlinear Models -> General Linear Models -> OK
Variables -> select the dependent and independent variables -> OK -> OK
More results -> Assumptions tab -> Levene's test (ANOVA)
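For reference, the same test is available in scipy. A minimal sketch, assuming the ratings have been split into groups beforehand (the slide does not name the grouping variable, so `group_a` and `group_b` below are hypothetical):

```python
from scipy import stats

# Hypothetical groups of job proficiency ratings; in the lecture the
# grouping happens inside Statistica's GLM module.
group_a = [94, 78, 109, 58, 127, 92]
group_b = [88, 101, 76, 95, 110, 84]

stat, p = stats.levene(group_a, group_b)
print(f"Levene W = {stat:.3f}, p = {p:.3f}")
# A p-value like the slide's 0.886 (> 0.05) means we fail to reject H0
# of equal variances across groups.
```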
Linearity
There needs to be a linear relationship between (a) the dependent variable and each of your
independent variables, and (b) the dependent variable and the independent variables collectively.

[Figure: Scatterplot of TEST1 vs. JOB_PROF with fitted line TEST1 = 53.796 + 0.53757 * JOB_PROF and 0.95 confidence band; correlation r = 0.51441.]
[Figure: Scatterplot of TEST2 vs. JOB_PROF with fitted line TEST2 = 65.921 + 0.44250 * JOB_PROF and 0.95 confidence band; correlation r = 0.49701.]
[Figure: Scatterplot of TEST3 vs. JOB_PROF with fitted line TEST3 = 63.091 + 0.40899 * JOB_PROF and 0.95 confidence band; correlation r = 0.89706.]
[Figure: Scatterplot of TEST4 vs. JOB_PROF with fitted line TEST4 = 50.621 + 0.47787 * JOB_PROF and 0.95 confidence band; correlation r = 0.86939.]
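These plots come from Statistica; an equivalent sketch with matplotlib and numpy, reusing the `df` DataFrame assumed earlier:

```python
import matplotlib.pyplot as plt
import numpy as np

# One scatterplot per test score, with a least-squares line and the
# correlation coefficient in the title, mirroring the slides.
for col in ["TEST1", "TEST2", "TEST3", "TEST4"]:
    x, y = df["JOB_PROF"], df[col]
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    plt.figure()
    plt.scatter(x, y)
    plt.plot(x, intercept + slope * x)
    plt.title(f"{col} = {intercept:.3f} + {slope:.5f} * JOB_PROF (r = {r:.5f})")
    plt.xlabel("JOB_PROF")
    plt.ylabel(col)
plt.show()
```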
Comparison of Models
Consider the following variables from the Jobprof data:

Dependent Variable:
Job Proficiency Rating
Independent Variables:
Test 1
Test 2
Test 3
Test 4
Model Comparison
MODEL 1: SLRM (Test 1)
• R-Sq = 26.46%
• Adj R-Sq = 23.26%
MODEL 2: (Test 1 and Test 2)
• R-Sq = 46.41%
• Adj R-Sq = 41.54%
MODEL 3: full model
• R-Sq = 96.29%
• Adj R-Sq = 95.54%
Model Comparison
MODEL 4: significant predictors (Tests 1, 3, and 4) / FORWARD STEPWISE
• R-Sq = 96.15%
• Adj R-Sq = 95.60%
MODEL 5: BACKWARD STEPWISE (Tests 1, 3, and 4)
• R-Sq = 96.15%
• Adj R-Sq = 95.60%
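These fit statistics come from Statistica; a sketch of how the same comparison could be reproduced with statsmodels (reusing the `df` DataFrame and column names assumed earlier):

```python
import statsmodels.formula.api as smf

# Fit the models from the slides and report their fit statistics.
formulas = {
    "Model 1 (Test 1)":          "JOB_PROF ~ TEST1",
    "Model 2 (Tests 1-2)":       "JOB_PROF ~ TEST1 + TEST2",
    "Model 3 (full)":            "JOB_PROF ~ TEST1 + TEST2 + TEST3 + TEST4",
    "Model 4/5 (Tests 1, 3, 4)": "JOB_PROF ~ TEST1 + TEST3 + TEST4",
}
for name, formula in formulas.items():
    fit = smf.ols(formula, data=df).fit()
    print(f"{name}: R-Sq = {fit.rsquared:.2%}, Adj R-Sq = {fit.rsquared_adj:.2%}")
```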
Comparison of Models
Consider the following variables from the GPA data:

Dependent Variable:
First-Year GPA (Y)
Independent Variables:
SATMath (X1)
SATVerbal (X2)
HSMath (X3)
HSEnglish (X4)
Model Comparison
MODEL 1: SLRM (SATMath (X1))
• R-Sq = 72.17%
• Adj R-Sq = 70.62%
• RMSE =
MODEL 2: SATMath (X1) and SATVerbal (X2)
• R-Sq = 81.10%
• Adj R-Sq = 78.88%
• RMSE =
MODEL 3: full model
• R-Sq = 85.28%
• Adj R-Sq = 81.35%
• RMSE =
Model Comparison
MODEL 4: FORWARD STEPWISE
• R-Sq = 85.04%
• Adj R-Sq = 82.23% (optimal Adj R-Sq with the fewest predictors)
• RMSE = 0.23
MODEL 5: BACKWARD STEPWISE
• R-Sq = 71.17%
• Adj R-Sq = 70.62%
• RMSE =
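Statistica runs the stepwise search internally; for comparison, a sketch of forward and backward selection with scikit-learn. The `gpa` DataFrame, its file name, its column names, and the choice of two selected features are assumptions, and SequentialFeatureSelector scores subsets by cross-validated fit rather than the p-value rules of classical stepwise regression, so the selected sets may differ:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Hypothetical file and column names for the GPA data.
gpa = pd.read_csv("gpa.csv")
X = gpa[["SATMath", "SATVerbal", "HSMath", "HSEnglish"]]
y = gpa["FirstYearGPA"]

for direction in ("forward", "backward"):
    sfs = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=2, direction=direction
    )
    sfs.fit(X, y)
    print(direction, "selected:", list(X.columns[sfs.get_support()]))
```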
Residual Analysis and Model Diagnostics
Root Mean Squared Error
The Root Mean Squared Error (RMSE) is one of the two main performance indicators for a regression model; the other is the Mean Squared Error (MSE). RMSE measures the average difference between the values predicted by a model and the actual values, providing an estimate of how well the model predicts the target value (accuracy).

The lower the value of the Root Mean Squared Error, the better the model. A perfect model (a hypothetical model that always predicts the exact expected value) would have a Root Mean Squared Error of 0.

The Root Mean Squared Error has the advantage of being expressed in the same units as the predicted column, which makes it easy to interpret. If you are trying to predict an amount in dollars, then the Root Mean Squared Error can be interpreted as the amount of error in dollars.
RMSE value interpretation
The closer RMSE is to 0, the more accurate the model. But RMSE is returned on the same scale as the target you are predicting, so there is no general rule for how to interpret ranges of values; your value can only be evaluated within your dataset.

Let's unpack this by looking at an example.

An RMSE of 1,000 for a house price prediction model is most likely good, because house prices tend to be over $100,000. However, the same RMSE of 1,000 for a height prediction model is terrible, as the average height is around 175 cm.
Root Mean Squared Error

$$\mathrm{RMSE} = \sqrt{\sum_{i=1}^{n} \frac{(\mathrm{Obs}_i - \mathrm{Pred}_i)^2}{n}}$$
Example 1

Temperature °C (x)   Ice Cream Sales (y)
14.2                 215
16.4                 325
11.9                 185
15.2                 332
18.5                 406
22.1                 522
19.4                 412
25.1                 614
23.4                 544
18.1                 421
22.6                 445
17.2                 408
Interpretation: An RMSE of 34 dollars is good enough, given that the sample mean of ice cream sales is 402 dollars.
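As a worked sketch, the line of best fit and the RMSE for the table above can be computed with numpy. The slide does not show the fitted model, so the least-squares fit below is an assumption and the result may differ slightly from the quoted 34:

```python
import numpy as np

temp  = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4,
                  25.1, 23.4, 18.1, 22.6, 17.2])
sales = np.array([215, 325, 185, 332, 406, 522, 412,
                  614, 544, 421, 445, 408])

# Least-squares line and its predictions.
slope, intercept = np.polyfit(temp, sales, 1)
pred = intercept + slope * temp

# RMSE = square root of the mean squared observed-minus-predicted difference.
rmse = np.sqrt(np.mean((sales - pred) ** 2))
print(f"RMSE = {rmse:.1f}, sample mean of sales = {sales.mean():.0f}")
```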
Example 2
Use the GPA data and the best model from earlier, Model 4, with predictors SATMath (x1) and SATVerbal (x2) for the dependent variable First-Year GPA (y).
The model is ŷ = 0.002185 x1 + 0.001312 x2.

Interpretation: An RMSE of 0.23 is good enough, given that the sample mean of First-Year GPA is 2.50 and the scale for GPA is 1.0 to 4.0.
Obtaining the difference between observed and predicted values (the residuals) using Statistica:
Statistics -> Multiple Regression -> Variables: dependent = First-Year GPA; independent = SATMath, SATVerbal, HSMath, HSEnglish -> OK -> Advanced -> check the Advanced options box -> OK
Stepwise tab -> Method: Forward stepwise (Model 4) -> OK
Residuals/assumptions/prediction tab -> click Perform residual analysis
Save tab -> Save residuals and predicted -> First-Year GPA -> OK
Copy columns 1 to 3 and paste them into Excel to solve for the RMSE.
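The final Excel step can equally be done in a few lines of Python. A sketch assuming the saved columns were exported to a file named `gpa_residuals.csv` with headers `Observed` and `Predicted` (both names hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical export of Statistica's observed and predicted columns.
cols = pd.read_csv("gpa_residuals.csv")
rmse = np.sqrt(np.mean((cols["Observed"] - cols["Predicted"]) ** 2))
print(f"RMSE = {rmse:.2f}")
```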
Cook's distance
A measure for identifying outliers/influential points.

Threshold (rule of thumb) = 4/n

Example: GPA data
Compute Cook's distance for Model 4 of the GPA data; 4/n = 4/20 = 0.2.
Remove observations with Cook's D greater than 4/n = 4/20 = 0.2.
Fit Model 4 again to the new dataset.
Calculate R-Sq, Adj R-Sq, and RMSE.

MODEL 4: FORWARD STEPWISE (GPA after removal of outliers)
R-Sq = 90.72%
Adj R-Sq = 87.86%
RMSE = 0.168
Interpretation: An RMSE of 0.168 is good, given that the sample mean of First-Year GPA is 2.51 and the scale for GPA is 1.0 to 4.0.
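A sketch of the same procedure with statsmodels, reusing the `gpa` DataFrame assumed earlier and the Model 4 predictors named in Example 2 (statsmodels adds an intercept by default, which is an assumption here):

```python
import numpy as np
import statsmodels.formula.api as smf

# Fit Model 4 (predictors from Example 2).
fit = smf.ols("FirstYearGPA ~ SATMath + SATVerbal", data=gpa).fit()

# Cook's distance for each observation; cooks_distance returns
# (distances, p-values), and we only need the distances.
cooks_d = fit.get_influence().cooks_distance[0]

# Rule-of-thumb threshold 4/n, then refit on the remaining rows.
kept = gpa[cooks_d <= 4 / len(gpa)]
refit = smf.ols("FirstYearGPA ~ SATMath + SATVerbal", data=kept).fit()

rmse = np.sqrt(np.mean(refit.resid ** 2))
print(f"R-Sq = {refit.rsquared:.2%}, Adj R-Sq = {refit.rsquared_adj:.2%}, "
      f"RMSE = {rmse:.3f}")
```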
Homework Number 2 (due May 26)
1. Compute the RMSE for Models 1, 2, 3, and 5 of the GPA data.
   • Groups 1-3: Model 1 (simple linear model using SATMath (X1))
   • Groups 4-5: Model 2 (SATMath and SATVerbal)
   • Groups 6-7: Model 3 (full model)
   • Groups 8-10: Model 5 (backward selection)
2. Remove observations with Cook's D greater than 4/n.
3. Fit Model 4 again to the new dataset.
4. Calculate R-Sq, Adj R-Sq, and RMSE.
