Linear Regression
Correlation vs. Regression
A scatter plot can be used to show the
relationship between two variables
Correlation analysis is used to measure the
strength of the association (linear relationship)
between two variables
Correlation is only concerned with strength of the
relationship
No causal effect is implied with correlation
Introduction to
Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable based on
the value of at least one independent variable
Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to
predict or explain
Independent variable: the variable used to predict
or explain the
dependent variable
Simple Linear Regression
Model
Only one independent variable, X
Relationship between X and Y is
described by a linear function
Changes in Y are assumed to be related
to changes in X
Types of Relationships
[Scatter plots of Y versus X illustrating different forms of relationship]
Types of Relationships
(continued)
Strong relationships vs. weak relationships
[Scatter plots of Y versus X illustrating strong and weak relationships]
Types of Relationships
(continued)
No relationship
[Scatter plot of Y versus X showing no relationship]
Simple Linear Regression
Model
Y_i = β0 + β1 X_i + ε_i

where
  Y_i = dependent variable
  X_i = independent variable
  β0 = population Y intercept
  β1 = population slope coefficient
  ε_i = random error term

β0 + β1 X_i is the linear component; ε_i is the random error component.
Simple Linear Regression
Model
(continued)
Y_i = β0 + β1 X_i + ε_i

[Graph of the population regression line against X, showing the intercept β0, the slope β1, the predicted value of Y for X_i on the line, the observed value of Y for X_i, and the random error ε_i as the vertical distance between the observed and predicted values]
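As an illustration of the model Y_i = β0 + β1 X_i + ε_i (not part of the original slides), the following Python sketch simulates data from it; the parameter values, sample size, and error standard deviation are all hypothetical.

```python
# Minimal sketch: simulate data from Y_i = beta0 + beta1*X_i + eps_i.
# All numbers below are illustrative assumptions, not values from the slides.
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 100.0, 0.11            # hypothetical population intercept and slope
X = rng.uniform(1000, 2500, size=50)  # hypothetical independent variable values
eps = rng.normal(0, 40, size=50)      # random error term (mean 0, constant variance)
Y = beta0 + beta1 * X + eps           # linear component + random error component
```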
Simple Linear Regression
Equation (Prediction Line)
The simple linear regression equation provides an
estimate of the population regression line
Ŷ_i = b0 + b1 X_i

where
  Ŷ_i = estimated (or predicted) Y value for observation i
  X_i = value of X for observation i
  b0 = estimate of the regression intercept
  b1 = estimate of the regression slope
The Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between the observed values Y and the predicted values Ŷ.

[Scatter plot of house price ($1000s) versus square feet with the fitted least-squares line]
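The slides do not show the least-squares computation itself; as a sketch, the b0 and b1 that minimize the sum of squared residuals can be computed with the standard closed-form formulas (the function name here is illustrative).

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (b0, b1) that minimize sum((y - (b0 + b1*x))**2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1*xbar
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```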
Simple Linear Regression Example:
Using Excel Data Analysis Function
ANOVA
            df   SS           MS           F         Significance F
Regression   1   18934.9348   18934.9348   11.0848   0.01039
Residual     8   13665.5652    1708.1957
Total        9   32600.5000
[Scatter plot of house price ($1000s) versus square feet with the fitted regression line: intercept = 98.248, slope = 0.10977]

Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (2000) = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
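The prediction above can be reproduced with a few lines of Python (a sketch using the rounded coefficients from the slide):

```python
b0, b1 = 98.25, 0.1098           # rounded intercept and slope from the fitted model
sqft = 2000
predicted_price = b0 + b1 * sqft  # price in $1,000s
print(predicted_price)            # 317.85, i.e. $317,850
```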
Simple Linear Regression
Example: Making Predictions
When using a regression model for prediction,
only predict within the relevant range of data
The relevant range for interpolation is the range of the observed X values. Do not try to extrapolate beyond the range of observed X's.

[Scatter plot of house price ($1000s) versus square feet, with the relevant range of the observed data indicated]
Measures of Variation
Total variation is made up of two parts: SST = SSR + SSE

SST = Σ(Y_i − Ȳ)²    (total sum of squares)
SSR = Σ(Ŷ_i − Ȳ)²    (regression sum of squares)
SSE = Σ(Y_i − Ŷ_i)²   (error sum of squares)

[Graph showing, for a single observation (X_i, Y_i), how the deviations from Ȳ and from the regression line make up SST, SSR, and SSE]
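A small sketch (not from the slides) computing the three sums of squares from observed values y and fitted values y_hat; the function name is illustrative.

```python
import numpy as np

def variation_measures(y, y_hat):
    """Return (SST, SSR, SSE) for observed y and predicted y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
    sse = np.sum((y - y_hat) ** 2)         # error sum of squares
    return sst, ssr, sse                   # SST = SSR + SSE
```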
Coefficient of Determination, r2
The coefficient of determination is the portion
of the total variation in the dependent variable
that is explained by variation in the
independent variable
The coefficient of determination is also called
r-squared and is denoted as r2
r² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ r² ≤ 1
Examples of Approximate
r2 Values
r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X.

[Scatter plots in which every point falls exactly on the regression line]
Examples of Approximate
r2 Values
0 < r² < 1: weaker linear relationship between X and Y; some, but not all, of the variation in Y is explained by variation in X.

[Scatter plots in which the points are scattered around the regression line]
Examples of Approximate
r2 Values
r² = 0: no linear relationship between X and Y; none of the variation in Y is explained by variation in X.

[Scatter plot with a horizontal regression line]
Simple Linear Regression Example:
Coefficient of Determination in Excel

r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error      41.33032
Observations        10

58.08% of the variation in house prices is explained by variation in square feet.
ANOVA
            df   SS           MS           F         Significance F
Regression   1   18934.9348   18934.9348   11.0848   0.01039
Residual     8   13665.5652    1708.1957
Total        9   32600.5000
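The r² value reported by Excel can be checked directly from the ANOVA sums of squares (a one-line sketch):

```python
ssr, sst = 18934.9348, 32600.5000  # from the ANOVA table above
r_squared = ssr / sst
print(round(r_squared, 5))         # 0.58082
```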
Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

S_YX = sqrt( SSE / (n − 2) ) = sqrt( Σ(Y_i − Ŷ_i)² / (n − 2) )

where
  SSE = error sum of squares
  n = sample size
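Using the SSE and sample size from the house-price example, the standard error of the estimate can be verified with a quick sketch:

```python
import math

sse, n = 13665.5652, 10    # SSE and sample size from the example
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 5))      # 41.33032, matching the Excel output
```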
Simple Linear Regression Example:
Standard Error of Estimate in Excel
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error      41.33032
Observations        10

S_YX = 41.33032
ANOVA
df SS MS F Significance F
Regression 1 18934.9348 18934.9348 11.0848 0.01039
Assumptions of Regression
Linearity
The relationship between X and Y is linear
Independence of Errors
Error values are statistically independent
Normality of Error
Error values are normally distributed for any given
value of X
Equal Variance (also called homoscedasticity)
The probability distribution of the errors has constant
variance
Residual Analysis
e_i = Y_i − Ŷ_i
The residual for observation i, ei, is the difference
between its observed and predicted value
Check the assumptions of regression by examining the
residuals
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X
(homoscedasticity)
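As a sketch of the graphical check described above (assuming matplotlib is available; the function name is illustrative), the residuals e_i = Y_i − Ŷ_i can be plotted against X:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_residuals_vs_x(x, y, b0, b1):
    """Plot residuals e_i = y_i - (b0 + b1*x_i) against x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (b0 + b1 * x)
    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")  # residuals should scatter randomly around 0
    plt.xlabel("X")
    plt.ylabel("Residual")
    plt.show()
```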
Graphical Analysis of Residuals
Can plot residuals vs. X
Residual Analysis for Linearity
[Two pairs of plots: a curved Y-versus-X pattern whose residuals show a systematic curved pattern (Not Linear), and a straight-line Y-versus-X pattern whose residuals scatter randomly around zero (Linear)]
Residual Analysis for
Independence
[Residual-versus-X plots: residuals that follow a systematic pattern indicate errors that are Not Independent; residuals that scatter randomly indicate Independent errors]
Checking for Normality
[Normal probability plot of the residuals: percent versus residual]
Checking Homoscedasticity:
Residual Analysis for
Equal (Constant) Variance
[Residual-versus-X plots: residual spread that changes with X indicates non-constant variance; uniform spread indicates constant variance]
Simple Linear Regression
Example: Excel Residual Output
[Plot of the residuals versus time (t)]

Here, the residuals show a cyclic pattern, not random. Cyclical patterns are a sign of positive autocorrelation.
The Durbin-Watson Statistic

D = Σ_{i=2}^{n} (e_i − e_{i−1})² / Σ_{i=1}^{n} e_i²

D should be close to 2 if no autocorrelation is present; a value far from this could be a cause for concern:
D less than 2 may signal positive autocorrelation,
D greater than 2 may signal negative autocorrelation.
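The slides compute D with Excel/PHStat or SPSS; a minimal Python sketch of the same formula (function name illustrative) would be:

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```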
Testing for Positive
Autocorrelation
H0: positive autocorrelation does not exist
H1: positive autocorrelation is present
Calculate the Durbin-Watson test statistic = D
(The Durbin-Watson Statistic can be found using Excel or SPSS)
Decision rule, using the lower and upper critical values d_L and d_U:
  If D < d_L, reject H0 (positive autocorrelation is present).
  If d_L ≤ D ≤ d_U, the test is inconclusive.
  If D > d_U, do not reject H0.
Testing for Positive
Autocorrelation (continued)
[Scatter plot of Sales versus Time with the fitted line y = 30.65 + 4.7038x, R² = 0.8976]

Is there autocorrelation?
Testing for Positive
Autocorrelation (continued)

Example with n = 25. Excel/PHStat output:

Durbin-Watson Calculations
Sum of Squared Difference of Residuals   3296.18
Sum of Squared Residuals                 3279.98
Durbin-Watson Statistic                  1.00494

[Scatter plot of Sales versus Time with the fitted line y = 30.65 + 4.7038x, R² = 0.8976]

D = Σ_{i=2}^{n} (e_i − e_{i−1})² / Σ_{i=1}^{n} e_i² = 3296.18 / 3279.98 = 1.00494
Inferences About the Slope:
Standard Error of the Slope

S_b1 = S_YX / sqrt(SSX) = S_YX / sqrt( Σ(X_i − X̄)² )

where:
  S_b1 = estimate of the standard error of the slope
  S_YX = sqrt( SSE / (n − 2) ) = standard error of the estimate
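A short sketch of the same formula (function name illustrative), taking the X values and the standard error of the estimate S_YX:

```python
import numpy as np

def slope_standard_error(x, s_yx):
    """S_b1 = S_YX / sqrt(sum((x_i - xbar)^2))."""
    x = np.asarray(x, dtype=float)
    ssx = np.sum((x - x.mean()) ** 2)
    return s_yx / np.sqrt(ssx)
```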
Inferences About the Slope:
t Test
Estimated Regression Equation:

house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098.

Is there a relationship between the square footage of the house and its sales price?

House Price in $1000s (Y)    Square Feet (X)
245                          1400
312                          1600
279                          1700
308                          1875
199                          1100
219                          1550
405                          2350
324                          2450
319                          1425
255                          1700
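As a sketch (assuming SciPy is installed), fitting the ten observations above should reproduce the slide's coefficients and the p-value used later in the t test:

```python
from scipy import stats

price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]       # $1000s
sqft  = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

result = stats.linregress(sqft, price)
print(result.intercept, result.slope)  # approximately 98.248 and 0.10977
print(result.pvalue)                   # approximately 0.01039
```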
Inferences About the Slope:
t Test Example
H0: β1 = 0
H1: β1 ≠ 0

From Excel output:

              Coefficients   Standard Error   t Stat    P-value
Intercept     98.24833       58.03348         1.69296   0.12892
Square Feet    0.10977        0.03297         3.32938   0.01039

t_STAT = (b1 − β1) / S_b1 = (0.10977 − 0) / 0.03297 = 3.32938
Inferences About the Slope:
t Test Example
H0: β1 = 0
H1: β1 ≠ 0

Test statistic: t_STAT = 3.329
d.f. = 10 − 2 = 8
α/2 = 0.025 in each tail
p-value = 0.01039

Decision: Reject H0, since p-value < α.

There is sufficient evidence that square footage affects house price.
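The p-value quoted above can be reproduced from the t statistic (a sketch assuming SciPy is available):

```python
from scipy import stats

b1, s_b1, n = 0.10977, 0.03297, 10
t_stat = (b1 - 0) / s_b1                         # 3.329
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-tailed, d.f. = 8
print(round(t_stat, 3), round(p_value, 5))       # approximately 3.329 and 0.01039
```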
F Test for Significance
F_STAT = MSR / MSE

where
  MSR = SSR / k
  MSE = SSE / (n − k − 1)
  k = number of independent variables in the regression model

F_STAT follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom.
Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error      41.33032
Observations        10

ANOVA
            df   SS           MS           F         Significance F
Regression   1   18934.9348   18934.9348   11.0848   0.01039
Residual     8   13665.5652    1708.1957
Total        9   32600.5000

F_STAT = MSR / MSE = 18934.9348 / 1708.1957 = 11.0848

With 1 and 8 degrees of freedom, the p-value for the F test is 0.01039 (Significance F).
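The F statistic and its p-value (Significance F) can be checked from MSR and MSE (a sketch assuming SciPy is available):

```python
from scipy import stats

msr, mse = 18934.9348, 1708.1957
f_stat = msr / mse                          # approximately 11.0848
p_value = stats.f.sf(f_stat, 1, 8)          # 1 and 8 degrees of freedom
print(round(f_stat, 4), round(p_value, 5))  # approximately 11.0848 and 0.01039
```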
Variable Selection Methods in SPSS
1. Enter (Regression):
Enters all variables in a single step
2. Stepwise:
At each step, the independent variable not in the equation that has the
smallest probability of F is entered, if that probability is sufficiently small.
Variables already in the regression equation are removed if their
probability of F becomes sufficiently large. The method terminates when
no more variables are eligible for inclusion or removal
(continued)
3. Remove
A procedure for variable selection in which all variables in a block are
removed in a single step
4. Backward Elimination
o All variables are entered into the equation and then sequentially
removed.
o The procedure stops when there are no variables that meet the removal
criterion
5. Forward Selection
o A stepwise variable selection procedure in which variables are
   sequentially entered into the model
o The procedure stops when there are no variables that meet the entry
   criterion
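As an illustration of forward selection, here is a simplified p-value-based sketch; it is not the exact SPSS procedure, it assumes statsmodels and pandas are installed, and the 0.05 entry threshold is an illustrative default.

```python
import pandas as pd
import statsmodels.api as sm

def forward_selection(X: pd.DataFrame, y, alpha_in=0.05):
    """Sequentially enter the candidate variable with the smallest p-value,
    stopping when no remaining variable meets the entry criterion."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for var in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            pvals[var] = model.pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_in:   # no variable meets the entry criterion
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```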