Sessions 18 & 19
Simple Linear Regression & Multiple Linear Regression
Learning Objectives
In this session, we will learn:
• How to use regression analysis to predict the value of a
dependent variable based on an independent variable
• The meaning of the regression coefficients b0 and b1
• How to evaluate the assumptions of regression analysis and
know what to do if the assumptions are violated
• To make inferences about the slope and correlation
coefficient
• To estimate mean values and predict individual values
Correlation vs. Regression
• A scatter plot can be used to show the relationship
between two variables
• Correlation analysis is used to measure the strength
of the association (linear relationship) between two
variables [refer to Session 3 material]
• Correlation is only concerned with strength of the
relationship
• No causal effect is implied with correlation
Introduction to
Regression Analysis
• Regression analysis is used to:
• Predict the value of a dependent variable based on the
value of at least one independent variable
• Explain the impact of changes in an independent variable
on the dependent variable
Dependent variable: the variable we wish to predict or explain
Independent variable: the variable used to predict or explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, X
• Relationship between X and Y is described
by a linear function
• Changes in Y are assumed to be related to
changes in X
Types of Relationships
Linear relationships vs. curvilinear relationships
[Figure: scatter plots illustrating linear and curvilinear X-Y relationships]
Types of Relationships: linear relationships
(continued)
Strong relationships vs. weak relationships
[Figure: scatter plots illustrating strong and weak linear relationships]
Types of Relationships
(continued)
No relationship
[Figure: scatter plot showing no X-Y relationship]
Simple Linear Regression Model
Yi = β0 + β1Xi + εi
where:
Yi = dependent variable
β0 = population Y intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term
β0 + β1Xi is the linear component; εi is the random error component.
Simple Linear Regression Model
(continued)
Yi = β0 + β1Xi + εi
[Figure: regression line in the (X, Y) plane showing, at Xi, the observed value of Y, the predicted value of Y, the random error εi for this Xi value, slope β1, and intercept β0]
Simple Linear Regression Equation
(Prediction Line)
The simple linear regression equation provides an
estimate of the population regression line
Ŷi = b0 + b1Xi
where:
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i
The Least Squares Method
b0 and b1 are obtained by finding the values that
minimize the sum of the squared differences between
Y and Ŷ:
min Σ(Yi − Ŷi)² = min Σ(Yi − (b0 + b1Xi))²
Least Squares Line…
This line minimizes the sum of the squared differences between the points and the line; these differences are called residuals.
The Mathematics Behind…
Least Squares Method
• Slope for the estimated regression equation:
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
where:
xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value of the independent variable
ȳ = mean value of the dependent variable
Least Squares Method
y-Intercept for the Estimated Regression Equation
b0 = ȳ − b1x̄
Interpretation of the
Slope and the Intercept
• b0 is the estimated average value of Y
when the value of X is zero
• b1 is the estimated change in the average
value of Y as a result of a one-unit increase
in X
Simple Linear Regression Example
• A real estate agent wishes to examine the relationship
between the selling price of a home and its size
(measured in square feet)
• A random sample of 10 houses is selected
• Dependent variable (Y) = house price in $1000s
• Independent variable (X) = square feet
Simple Linear Regression
Example: Data
House Price in $1000s Square Feet
(Y) (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700
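The least-squares formulas from the previous slides can be checked against this data with a short Python sketch (variable names are my own; the expected values match the Excel output shown later):

```python
# House data from the slides: price in $1000s (y) vs. square feet (x)
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / ss_x           # slope
b0 = y_bar - b1 * x_bar    # intercept: b0 = y_bar - b1 * x_bar

print(round(b1, 5), round(b0, 5))  # 0.10977 98.24833, matching the Excel output
```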
Simple Linear Regression Example: Scatter Plot
House price model: Scatter Plot
[Figure: scatter plot of house price ($1000s, 0 to 450) vs. square feet (0 to 3000) for the 10 sampled houses]
Simple Linear Regression Example: Excel Output

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error       41.33032
Observations         10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df    SS          MS          F        Significance F
Regression    1    18934.9348  18934.9348  11.0848  0.01039
Residual      8    13665.5652  1708.1957
Total         9    32600.5000

             Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept    98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
Square Feet  0.10977       0.03297         3.32938  0.01039  0.03374    0.18580
Simple Linear Regression Example:
Coefficient of Determination, r2 in Excel
r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082
(the "R Square" value in the Excel Regression Statistics output)

58.08% of the variation in house prices is explained by
variation in square feet.
Simple Linear Regression Example:
Standard Error of Estimate in Excel
SYX = 41.33032
(the "Standard Error" value in the Excel Regression Statistics output)
Simple Linear Regression Example:
Graphical Representation
House price model: Scatter Plot and Prediction Line
[Figure: scatter plot with fitted prediction line; slope = 0.10977, intercept = 98.248]

house price = 98.24833 + 0.10977 (square feet)
Simple Linear Regression Example:
Interpretation of b0
house price = 98.24833 + 0.10977 (square feet)
• b0 is the estimated average value of Y when the value
of X is zero (if X = 0 is in the range of observed X
values)
• Because a house cannot have a square footage of 0,
b0 has no practical application
Simple Linear Regression Example:
Interpreting b1
house price = 98.24833 + 0.10977 (square feet)
• b1 estimates the change in the average value of Y as
a result of a one-unit increase in X
• Here, b1 = 0.10977 tells us that the mean value of a house
increases by .10977($1000) = $109.77, on average, for
each additional one square foot of size
Simple Linear Regression
Example: Making Predictions
Predict the price for a house
with 2000 square feet:
house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85
The predicted price for a house with 2000
square feet is 317.85($1,000s) = $317,850
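The same prediction, as a minimal Python sketch using the rounded coefficients from the slide:

```python
# Predict the price of a 2000-square-foot house with the fitted line
b0, b1 = 98.25, 0.1098          # rounded coefficients from the slide
price = b0 + b1 * 2000
print(round(price, 2))  # 317.85, i.e. $317,850
```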
Simple Linear Regression
Example: Making Predictions
• When using a regression model for prediction, only
predict within the relevant range of data
[Figure: scatter plot with prediction line, marking the relevant range for interpolation]

Do not try to extrapolate beyond the range of observed X's.
Comparing Standard Errors
SYX is a measure of the variation of observed
Y values from the regression line
[Figure: two regression plots, one with small SYX and one with large SYX]
The magnitude of SYX should always be judged relative to the
size of the Y values in the sample data
i.e., SYX = $41.33K is moderately small relative to house prices in
the $200K - $400K range
Performing Different Hypothesis Tests
• t-test for the correlation coefficient
• F-test for overall model fit
• t-test for the slope coefficient
Hypothesis Test for Correlation Coefficient
• Hypotheses
H0 : ρ = 0 (no correlation between X and Y)
H1 : ρ ≠ 0 (correlation exists)
• Test statistic (with n − 2 degrees of freedom):
tSTAT = (r − ρ) / √((1 − r²) / (n − 2))
where
r = +√r² if b1 > 0
r = −√r² if b1 < 0
Hypothesis Test for Correlation Coefficient
(continued)
Is there evidence of a linear relationship
between square feet and house price at
the .05 level of significance?
H0: ρ = 0 (no correlation)
H1: ρ ≠ 0 (correlation exists)
α = .05, df = 10 − 2 = 8
tSTAT = (r − ρ) / √((1 − r²)/(n − 2)) = (.762 − 0) / √((1 − .762²)/(10 − 2)) = 3.329
Hypothesis Test for Correlation Coefficient
(continued)
tSTAT = (.762 − 0) / √((1 − .762²)/(10 − 2)) = 3.329
d.f. = 10 − 2 = 8, α/2 = .025, critical values −tα/2 = −2.3060 and tα/2 = 2.3060

Decision: Since tSTAT = 3.329 > 2.3060, reject H0.
Conclusion: There is evidence of a linear association at the 5% level of significance.
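The calculation above can be reproduced in Python; this sketch uses the unrounded Multiple R from the Excel output (with r rounded to .762 the statistic comes out marginally smaller):

```python
import math

# t statistic for H0: rho = 0, with n - 2 = 8 degrees of freedom
r, n = 0.76211, 10               # Multiple R from the Excel output
t_stat = (r - 0) / math.sqrt((1 - r**2) / (n - 2))
print(round(t_stat, 3))  # 3.329, which exceeds the critical value 2.3060
```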
F Test for Significance- Overall Model Fit
• F test statistic:
FSTAT = MSR / MSE
where
MSR = SSR / k
MSE = SSE / (n − k − 1)
FSTAT follows an F distribution with k numerator and (n − k − 1) denominator degrees of freedom
(k = the number of independent variables in the regression model)
Measures of Variation
• Total variation is made up of two parts:
SST = SSR + SSE
(Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)
SST = Σ(Yi − Ȳ)²    SSR = Σ(Ŷi − Ȳ)²    SSE = Σ(Yi − Ŷi)²
where:
Ȳ = mean value of the dependent variable
Yi = observed value of the dependent variable
Ŷi = predicted value of Y for the given Xi value
Measures of Variation
• SST = total sum of squares (Total Variation)
• Measures the variation of the Yi values around their mean Ȳ
• SSR = regression sum of squares (Explained Variation)
• Variation attributable to the relationship between X and Y
• SSE = error sum of squares (Unexplained Variation)
• Variation in Y attributable to factors other than X
Measures of Variation
[Figure: for an observation (Xi, Yi), the deviation from the mean decomposes as SST = Σ(Yi − Ȳ)² into SSR = Σ(Ŷi − Ȳ)² plus SSE = Σ(Yi − Ŷi)²]
Coefficient of Determination, r2
• The coefficient of determination is the portion of
the total variation in the dependent variable that
is explained by variation in the independent
variable
• The coefficient of determination is also called r-squared and is denoted as r²
r² = SSR / SST = regression sum of squares / total sum of squares
note: 0 ≤ r² ≤ 1
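For the house-price model, r² follows directly from the ANOVA sums of squares; a quick sketch using the values from the Excel output:

```python
# Coefficient of determination from the ANOVA table
ssr = 18934.9348    # regression sum of squares
sst = 32600.5000    # total sum of squares
r2 = ssr / sst
print(round(r2, 5))  # 0.58082 -- 58.08% of price variation explained
```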
F Test for Significance
(continued)
H0: β1 = 0    H1: β1 ≠ 0
α = .05, df1 = 1, df2 = 8
Critical value: F.05 = 5.32
Test statistic: FSTAT = MSR / MSE = 11.08
Decision: Since FSTAT = 11.08 > 5.32, reject H0 at α = .05.
Conclusion: There is sufficient evidence that house size affects selling price.
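The F statistic above comes straight from the ANOVA table; a sketch using the slide values:

```python
# F test for overall model fit: k = 1 predictor, n = 10 observations
ssr, sse = 18934.9348, 13665.5652
k, n = 1, 10
msr = ssr / k             # mean square regression
mse = sse / (n - k - 1)   # mean square error
f_stat = msr / mse
print(round(f_stat, 2))   # 11.08, which exceeds F.05 = 5.32
```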
Inferences About the Slope:
t Test Example
Estimated regression equation:
house price = 98.25 + 0.1098 (sq. ft.)

Data: house price in $1000s (Y) and square feet (X) for the same 10-house sample shown earlier.

The slope of this model is 0.1098.
Is there a relationship between the square footage of the house and its sales price?
Inferences About the Slope: t Test
• t test for a population slope
• Is there a linear relationship between X and Y?
• Null and alternative hypotheses
• H0: β1 = 0 (no linear relationship)
• H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic:
tSTAT = (b1 − β1) / Sb1,    d.f. = n − k − 1 (= n − 2 for simple regression)
where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope
k = number of independent variables
Inferences About the Slope
• The standard error of the regression slope coefficient (b1) is estimated by
Sb1 = SYX / √SSX = SYX / √(Σ(Xi − X̄)²)
where:
Sb1 = estimate of the standard error of the slope
SYX = √(SSE / (n − 2)) = standard error of the estimate
Standard Error of Estimate
• The standard deviation of the variation of observations around the regression line is estimated by
SYX = √(SSE / (n − 2)) = √( Σ(Yi − Ŷi)² / (n − 2) )
where:
SSE = error sum of squares
n = sample size
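For the house-price model, SSE = 13665.5652 and n = 10, so SYX can be verified directly (a sketch):

```python
import math

# Standard error of the estimate: S_YX = sqrt(SSE / (n - 2))
sse, n = 13665.5652, 10
s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 5))  # 41.33032, the "Standard Error" in the Excel output
```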
Inferences About the Slope: t Test Example
H0: β1 = 0    H1: β1 ≠ 0

From the Excel output:
             Coefficients  Standard Error  t Stat   P-value
Intercept    98.24833      58.03348        1.69296  0.12892
Square Feet  0.10977       0.03297         3.32938  0.01039

tSTAT = (b1 − β1) / Sb1 = (0.10977 − 0) / 0.03297 = 3.32938
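The same statistic from the rounded table values (Excel's 3.32938 uses unrounded inputs, so the last digit can differ slightly):

```python
# t statistic for the slope: t = (b1 - hypothesized slope) / Sb1
b1, sb1 = 0.10977, 0.03297       # from the Excel coefficient table
t_stat = (b1 - 0) / sb1
print(round(t_stat, 3))  # 3.329
```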
Inferences About the Slope: t Test Example
H0: β1 = 0    H1: β1 ≠ 0
Test statistic: tSTAT = 3.329
d.f. = 10 − 2 = 8, α/2 = .025, critical values −tα/2 = −2.3060 and tα/2 = 2.3060
Decision: Since tSTAT = 3.329 > 2.3060, reject H0.
There is sufficient evidence that square footage affects house price.
Inferences About the Slope: t Test Example
H0: β1 = 0
H1: β1 ≠ 0
From the Excel output:
             Coefficients  Standard Error  t Stat   P-value
Intercept    98.24833      58.03348        1.69296  0.12892
Square Feet  0.10977       0.03297         3.32938  0.01039

The p-value for Square Feet is 0.01039.
Decision: Reject H0, since p-value < α.
There is sufficient evidence that square footage affects house price.
Confidence Interval Estimate for the
Slope
Confidence interval estimate of the slope:
b1 ± tα/2 · Sb1,    d.f. = n − 2
Excel Printout for House Prices:
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
At the 95% confidence level, the confidence interval for
the slope is (0.0337, 0.1858)
Confidence Interval Estimate for the
Slope
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 98.24833 58.03348 1.69296 0.12892 -35.57720 232.07386
Square Feet 0.10977 0.03297 3.32938 0.01039 0.03374 0.18580
Since the units of the house price variable are $1000s,
we are 95% confident that the mean impact on sales
price is between $33.74 and $185.80 per additional
square foot of house size
This 95% confidence interval does not include 0.
Conclusion: There is a significant relationship between
house price and square feet at the .05 level of significance
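The interval endpoints can be reproduced from the coefficient table; this sketch uses t(0.025, 8) = 2.3060 as given in the slides:

```python
# 95% confidence interval for the slope: b1 +/- t(alpha/2) * Sb1
b1, sb1 = 0.10977, 0.03297
t_crit = 2.3060                  # t(0.025, 8)
low, high = b1 - t_crit * sb1, b1 + t_crit * sb1
print(f"{low:.5f} {high:.5f}")   # 0.03374 0.18580, matching the Excel output
```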
Assumptions of Regression
L.I.N.E
• Linearity
• The relationship between X and Y is linear
• Independence of Errors
• Error values are statistically independent
• Normality of Error
• Error values are normally distributed for any given value of
X
• Equal Variance (also called homoscedasticity)
• The probability distribution of the errors has constant
variance
Residual Analysis
ei Yi Ŷi
• The residual for observation i, ei, is the difference between
its observed and predicted value
• Check the assumptions of regression by examining the
residuals
• Examine for linearity assumption
• Evaluate independence assumption
• Evaluate normal distribution assumption
• Examine for constant variance for all levels of X (homoscedasticity)
• Graphical Analysis of Residuals
• Can plot residuals vs. X
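Residuals are simple to compute once the line is fitted; one quick sanity check is that least-squares residuals always sum to zero. A sketch on the house data:

```python
# Fit the line, then compute residuals e_i = y_i - (b0 + b1 * x_i)
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(e, 2) for e in residuals])   # first residual is about -6.92
print(abs(sum(residuals)) < 1e-6)         # True: residuals sum to zero
```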
Residual Analysis for Linearity
[Figure: scatter plots and residual plots; a curved residual pattern indicates a non-linear relationship, while a patternless residual plot is consistent with linearity]
Residual Analysis for Independence
[Figure: residual plots; a systematic pattern over X indicates errors that are not independent, while a random scatter is consistent with independence]
Checking for Normality
• Examine the Stem-and-Leaf Display of the Residuals
• Examine the Boxplot of the Residuals
• Examine the Histogram of the Residuals
• Construct a Normal Probability Plot of the Residuals
Residual Analysis for Normality
When using a normal probability plot, normal errors will plot approximately in a straight line.
[Figure: normal probability plot of the residuals (percent vs. residual)]
Residual Analysis for Equal Variance
[Figure: residual plots; a fanning pattern indicates non-constant variance, while a uniform band indicates constant variance]
Simple Linear Regression Example: Excel Residual Output
RESIDUAL OUTPUT
Observation  Predicted House Price  Residuals
 1           251.92316               -6.92316
 2           273.87671               38.12329
 3           284.85348               -5.85348
 4           304.06284                3.93716
 5           218.99284              -19.99284
 6           268.38832              -49.38832
 7           356.20251               48.79749
 8           367.17929              -43.17929
 9           254.66736               64.33264
10           284.85348              -29.85348

[Figure: House Price Model residual plot, residuals vs. square feet]

Does not appear to violate any regression assumptions
Estimating Mean Values and
Predicting Individual Values
Goal: Form intervals around Ŷ to express uncertainty about the value of Y for a given Xi
[Figure: prediction line Ŷ = b0 + b1Xi with a narrower confidence interval for the mean of Y, given Xi, and a wider prediction interval for an individual Y, given Xi]
Confidence Interval for the Average Y, Given X
Confidence interval estimate for the mean value of Y given a particular Xi:
Ŷ ± tα/2 · SYX · √hi
where
hi = 1/n + (Xi − X̄)²/SSX = 1/n + (Xi − X̄)²/Σ(Xi − X̄)²
The size of the interval varies according to the distance of Xi from the mean X̄.
Prediction Interval for an Individual Y, Given X
Confidence interval estimate for an individual value of Y given a particular Xi:
Ŷ ± tα/2 · SYX · √(1 + hi)
The extra term adds to the interval width to reflect the added uncertainty for an individual case.
Estimation of Mean Values:
Example
Confidence interval estimate for μY|X=Xi

Find the 95% confidence interval for the mean price of 2,000 square-foot houses.
Predicted price: Ŷ = 317.85 ($1000s)
Ŷ ± t0.025 · SYX · √(1/n + (Xi − X̄)²/Σ(Xi − X̄)²) = 317.85 ± 37.12
The confidence interval endpoints (from Excel) are 280.66 and 354.90, or from $280,660 to $354,900.
Estimation of Individual Values:
Example
Prediction interval estimate for Y|X=Xi

Find the 95% prediction interval for an individual house with 2,000 square feet.
Predicted price: Ŷ = 317.85 ($1000s)
Ŷ ± t0.025 · SYX · √(1 + 1/n + (Xi − X̄)²/Σ(Xi − X̄)²) = 317.85 ± 102.28
The prediction interval endpoints (from Excel) are 215.50 and 420.07, or from $215,500 to $420,070.
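Both intervals can be reproduced from the raw data; this sketch uses t(0.025, 8) = 2.3060 as in the slides, so the endpoints can differ from Excel's in the last digit:

```python
import math

# 95% CI for the mean response and PI for an individual Y at Xi = 2000 sq. ft.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = math.sqrt(sse / (n - 2))   # standard error of the estimate
t_crit = 2.3060                   # t(0.025, 8)

xi = 2000
y_hat = b0 + b1 * xi
hi = 1 / n + (xi - x_bar) ** 2 / ss_x          # distance-from-mean factor

ci_half = t_crit * s_yx * math.sqrt(hi)        # about 37.12
pi_half = t_crit * s_yx * math.sqrt(1 + hi)    # about 102.28
print(round(y_hat - ci_half, 2), round(y_hat + ci_half, 2))  # near 280.66, 354.90
print(round(y_hat - pi_half, 2), round(y_hat + pi_half, 2))  # near 215.50, 420.07
```

Note that the prediction interval is wider than the confidence interval by exactly the extra 1 under the square root.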
What’s the Difference?
• Prediction interval: used to estimate one value of y (at a given x)
• Confidence interval: used to estimate the mean value of y (at a given x)
The confidence interval estimate of the expected value of y will be narrower than
the prediction interval for the same given value of x and confidence level. This is
because there is less error in estimating a mean value as opposed to predicting an
individual value.
Multiple Linear Regression
• Concept and analogy to Simple Linear Regression
• Equation type and interpretation
• Model fit [DoF=n-k-1] and explanatory power of each independent
variable
• Comparing different models: Adjusted R squared
Adjusted r2
• r2 never decreases when a new X variable is added
to the model
• This can be a disadvantage when comparing models
• What is the net effect of adding a new variable?
• We lose a degree of freedom when a new X variable is
added
• Did the new X variable add enough explanatory power to
offset the loss of one degree of freedom?
Adjusted r2
(continued)
• Shows the proportion of variation in Y explained by all
X variables adjusted for the number of X variables used
r²adj = 1 − (1 − r²) · (n − 1) / (n − k − 1)
(where n = sample size, k = number of independent variables)
• Penalizes excessive use of unimportant independent
variables
• Smaller than r2
• Useful in comparing among models
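For the house-price model (n = 10, k = 1), the adjusted value in the Excel output can be reproduced with a short sketch:

```python
# Adjusted r^2 = 1 - (1 - r^2)(n - 1)/(n - k - 1)
r2, n, k = 0.58082, 10, 1
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 5))  # 0.52842, matching "Adjusted R Square" in Excel
```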
Pitfalls of Regression Analysis
• Lacking an awareness of the assumptions underlying
least-squares regression
• Not knowing how to evaluate the assumptions
• Not knowing the alternatives to least-squares
regression if a particular assumption is violated
• Using a regression model without knowledge of the
subject matter
• Extrapolating outside the relevant range
Strategies for Avoiding the Pitfalls of Regression
• Start with a scatter plot of X vs. Y to observe possible
relationship
• Perform residual analysis to check the assumptions
• Plot the residuals vs. X to check for violations of assumptions
such as homoscedasticity
• Use a histogram, stem-and-leaf display, boxplot, or normal
probability plot of the residuals to uncover possible non-
normality
• If there is violation of any assumption, use alternative
methods or models
• If there is no evidence of assumption violation, then test
for the significance of the regression coefficients and
construct confidence intervals and prediction intervals
• Avoid making predictions or forecasts outside the relevant
range
Chapter Summary
• Making inferences about the slope
• Correlation -- measuring the strength of the association
• The estimation of mean values and prediction of individual values
• Possible pitfalls in regression and recommended strategies to avoid
them
APPENDIX - I
APPENDIX - II