
Linear Regression

Chapter 7

© 2016 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part, except for use as permitted in a license distributed with a certain product or service or otherwise on a password-protected
website for classroom use.
Introduction
• Managerial decisions are often based on the relationship between
two or more variables
• Example: After considering the relationship between advertising
expenditures and sales, a marketing manager might attempt to predict sales
for a given level of advertising expenditures
• Sometimes a manager will rely on intuition to judge how two
variables are related
• If data can be obtained, a statistical procedure called regression
analysis can be used to develop an equation showing how the
variables are related
Introduction
• Dependent variable or response: Variable being predicted
• Independent variables or predictor variables: Variables being used to
predict the value of the dependent variable
• Linear regression: A regression analysis involving one independent
variable and one dependent variable
• In statistical notation:
y = dependent variable
x = independent variable

Introduction
• Simple linear regression: A regression analysis for which any one unit
change in the independent variable, x, is assumed to result in the
same change in the dependent variable, y
• Multiple linear regression: Regression analysis involving two or more
independent variables

The Simple Linear
Regression Model
Regression Model
Estimated Regression Equation

The Simple Linear Regression Model
Regression Model
• The equation that describes how y is related to x and an error term
• Simple Linear Regression Model:
y = β0 + β1x + ε
• Parameters: The characteristics of the population, β0 and β1
• Random variable: Error term, ε
• The error term accounts for the variability in y that cannot be explained by
the linear relationship between x and y

The Simple Linear Regression Model
• The parameter values are usually not known and must be estimated
using sample data
• Sample statistics (denoted b0 and b1) are computed as estimates of
the population parameters β0 and β1
Estimated Regression Equation
• The equation obtained by substituting the values of the sample
statistics b0 and b1 for β0 and β1 in the regression equation

The Simple Linear Regression Model
• Estimated simple linear regression equation:
ŷ = b0 + b1x
• ŷ = Point estimator of E(y|x)
• b0 = Estimated y-intercept
• b1 = Estimated slope
• The graph of the estimated simple linear regression equation is
called the estimated regression line

Figure 7.1: The Estimation Process
in Simple Linear Regression

Figure 7.2: Possible Regression Lines
in Simple Linear Regression

Least Squares Method
Least Squares Estimates of the Regression Parameters
Using Excel’s Chart Tools to Compute the Estimated Regression Equation

Least Squares Method
• Least squares method: A procedure for using sample data to find the
estimated regression equation
• Determine the values of b0 and b1
• Interpretation of b0 and b1:
• The slope b1 is the estimated change in the mean of the dependent variable y
that is associated with a one unit increase in the independent variable x
• The y-intercept b0 is the estimated value of the dependent variable y when
the independent variable x is equal to 0

Table 7.1: Miles Traveled and Travel Time for
10 Butler Trucking Company Driving
Assignments

Figure 7.3: Scatter Chart of Miles Traveled and
Travel Time for Sample of 10 Butler Trucking
Company Driving Assignments

Least Squares Method
• Least squares equation:
min ∑(yi − ŷi)² = min ∑(yi − b0 − b1xi)²   (sums over i = 1, …, n)
• yi = observed value of the dependent variable for the ith observation
• ŷi = predicted value of the dependent variable for the ith observation
• n = total number of observations

Least Squares Method
• ith residual: The error made using the regression model to estimate
the mean value of the dependent variable for the ith observation
• Denoted as ei = yi − ŷi
• Hence, min ∑(yi − ŷi)² = min ∑ei²   (sums over i = 1, …, n)
• We are finding the regression that minimizes the sum of squared
errors

Least Squares Method
Least Squares Estimates of the Regression Parameters
• For the Butler Trucking Company data in Table 7.1:
• Estimated slope of b1 = 0.0678
• y-intercept of b0 = 1.2739
• The estimated simple linear regression model:
ŷ = 1.2739 + 0.0678x

Least Squares Method
• Interpretation of b1: If the length of a driving assignment were 1 unit
(1 mile) longer, the mean travel time for that driving assignment
would be 0.0678 units (0.0678 hours, or approximately 4 minutes)
longer
• Interpretation of b0: If the driving distance for a driving assignment
was 0 units (0 miles), the mean travel time would be 1.2739 units
(1.2739 hours, or approximately 76 minutes)

Least Squares Method
• Experimental region: The range of values of the independent
variables in the data used to estimate the model
• The regression model is valid only over this region
• Extrapolation: Prediction of the value of the dependent variable
outside the experimental region
• It is risky

Least Squares Method
• Butler Trucking Company example: Use the estimated model and the
known values for miles traveled for a driving assignment (x) to
estimate mean travel time in hours
• For example, the first driving assignment in Table 7.1 has a value for miles
traveled of x = 100
• The mean travel time in hours for this driving assignment is estimated to be:
ŷ = 1.2739 + 0.0678(100) = 8.0539
• The resulting residual for this estimate is:
e1 = y1 − ŷ1 = 9.3 − 8.0539 = 1.2461
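The arithmetic above can be checked with a couple of lines of code. A minimal sketch (the coefficients, the 100-mile distance, and the 9.3-hour observed time come from the slides; variable names are illustrative):

```python
# Predicted travel time and residual for the first Butler Trucking assignment
b0, b1 = 1.2739, 0.0678        # estimated intercept and slope from the slides
x1, y1 = 100, 9.3              # miles traveled and observed travel time (hours)

y_hat = b0 + b1 * x1           # estimated mean travel time: 8.0539 hours
residual = y1 - y_hat          # e1 = y1 - y_hat = 1.2461 hours
print(y_hat, residual)
```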

Table 7.2: Predicted Travel Time and
Residuals for 10 Butler Trucking Company
Driving Assignments

Figure 7.4: Scatter Chart of Miles Traveled and Travel Time for
Butler Trucking Company Driving Assignments with
Regression Line Superimposed

Figure 7.5: A Geometric Interpretation of the
Least Squares Method

Least Squares Method
Using Excel’s Chart Tools to Compute the Estimated Regression
Equation
• After constructing a scatter chart with Excel’s chart tools:
1. Right-click on any data point and select Add Trendline…
2. In the Format Trendline task pane, in the Trendline Options area:
• Select Linear
• Select Display Equation on chart

Figure 7.6: Scatter Chart and Estimated
Regression Line for Butler Trucking Company

Least Squares Method
Slope Equation:
b1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)²   (sums over i = 1, …, n)
y-Intercept Equation:
b0 = ȳ − b1x̄
• xi = value of the independent variable for the ith observation
• yi = value of the dependent variable for the ith observation
• x̄ = mean value of the independent variable
• ȳ = mean value of the dependent variable
• n = total number of observations
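These formulas translate directly into code. A minimal sketch using NumPy; the x and y arrays are illustrative placeholders (Table 7.1 itself is not reproduced here):

```python
import numpy as np

# Illustrative sample: miles traveled (x) and travel time in hours (y)
x = np.array([100, 50, 100, 100, 50, 80, 75, 65, 90, 90], dtype=float)
y = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1])

x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates of the slope and the y-intercept
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
print(b0, b1)
```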

Assessing the Fit of the
Simple Linear Regression
Model
The Sums of Squares
The Coefficient of Determination
Using Excel’s Chart Tools to Compute the Coefficient of Determination

Assessing the Fit of the Simple
Linear Regression Model
The Sums of Squares
• Sum of squares due to error: The value of SSE is a measure of the
error in using the estimated regression equation to predict the values
of the dependent variable in the sample
SSE = ∑(yi − ŷi)²   (sum over i = 1, …, n)
From Table 7.2, SSE = ∑ei² = 8.0288

Figure 7.7: The Sample Mean as a Predictor
of Travel Time for Butler Trucking Company

Assessing the Fit of the Simple
Linear Regression Model
Total Sum of Squares, SST
• Butler Trucking example: For the ith driving assignment in the sample, the
difference yi − ȳ provides a measure of the error involved in using ȳ to predict
travel time for the ith driving assignment
• The corresponding sum of squares is called the total sum of squares (SST)
SST = ∑(yi − ȳ)²   (sum over i = 1, …, n)

Table 7.3: Calculations for the Sum of
Squares Total for the Butler Trucking Simple
Linear Regression

Figure 7.8: Deviations About the Estimated Regression
Line and the Line y = ȳ for the Third Butler Trucking
Company Driving Assignment

Assessing the Fit of the Simple
Linear Regression Model
• Sum of squares due to regression, SSR:
SSR = ∑(ŷi − ȳ)²   (sum over i = 1, …, n)
• Measures how much the ŷ values on the estimated regression line deviate from ȳ
• Relation between SST, SSR, and SSE:
SST = SSR + SSE

Assessing the Fit of the Simple
Linear Regression Model
The Coefficient of Determination
• The ratio SSR/SST is used to evaluate the goodness of fit of the estimated
regression equation
• r2 = SSR/SST
• Takes values between zero and one
• Interpreted as the percentage of the total sum of squares that can be
explained by using the estimated regression equation
• Equal to the square of the correlation between the observed values yi and the predicted values ŷi
• Referred to as the simple coefficient of determination in simple regression
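Given the observed values and the fitted values, the three sums of squares and r2 follow directly from the definitions above. A minimal sketch (the function name and arguments are illustrative):

```python
import numpy as np

def goodness_of_fit(y, y_hat):
    """Return SSE, SST, SSR, and the coefficient of determination r^2."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)         # sum of squares due to error
    sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ssr = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to regression
    return sse, sst, ssr, ssr / sst        # SST = SSR + SSE and r^2 = SSR/SST
```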

Figure 7.9: Scatter Chart and Estimated
Regression Line with Coefficient of Determination
r2 for Butler Trucking Company

The Multiple Regression
Model
Regression Model
Estimated Multiple Regression Equation
Least Squares Method and Multiple Regression
Butler Trucking Company and Multiple Regression
Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation

The Multiple Regression Model
Regression Model
• Multiple regression model
y = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq + ε
• y = dependent variable
• x1, x2, . . . , xq = independent variables
• β0, β1, β2, . . . , βq = parameters
• ε = error term (accounts for the variability in y that cannot be explained by
the linear effect of the q independent variables)

The Multiple Regression Model
• Interpretation of slope coefficient βj: Represents the change in the
mean value of the dependent variable y that corresponds to a one
unit increase in the independent variable xj, holding the values of all
other independent variables in the model constant
• The multiple regression equation that describes how the mean value
of y is related to x1, x2, . . . , xq:
E( y | x1, x2, . . . , xq) = β0 + β1x1 + β2x2 + ∙ ∙ ∙ + βqxq

The Multiple Regression Model
Estimated Multiple Regression Equation
ŷ = b0 + b1x1 + b2x2 + ∙ ∙ ∙ + bqxq
• b0, b1, b2, . . . , bq = the point estimates of β0, β1, β2, . . . , βq
• ŷ = estimated value of the dependent variable
• The least squares method is used to develop the estimated multiple
regression equation:
• Finding b0, b1, b2, . . . , bq that satisfy min ∑(yi − ŷi)² = min ∑ei²
• Uses sample data to provide the values of b0, b1, b2, . . . , bq that minimize
the sum of squared residuals
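Outside of Excel, the least squares estimates for a multiple regression can be obtained with a linear algebra solver. A minimal sketch with NumPy (the miles, deliveries, and travel time arrays are illustrative placeholders):

```python
import numpy as np

# Illustrative data: miles traveled (x1), number of deliveries (x2), travel time (y)
x1 = np.array([100, 50, 100, 100, 50, 80, 75, 65, 90, 90], dtype=float)
x2 = np.array([4, 3, 4, 2, 2, 2, 3, 4, 3, 2], dtype=float)
y  = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares solution: minimizes the sum of squared residuals
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(b0, b1, b2)
```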

The Multiple Regression Model
Least Squares Method and Multiple Regression
Figure 7.10: The Estimation Process for Multiple Regression

The Multiple Regression Model
Butler Trucking Company and Multiple Regression
• The estimated simple linear regression equation: ŷ = 1.2739 + 0.0678x
• The linear effect of the number of miles traveled explains 66.41 percent (r2 =
0.6641) of the variability in travel time in the sample data
• This implies that 33.59 percent of the variability in sample travel times remains
unexplained
• The managers might want to consider adding one or more independent
variables to the model to explain some of the remaining variability in the
dependent variable, for example, the number of deliveries
The Multiple Regression Model
Butler Trucking Company and multiple regression (contd.)
• Estimated multiple linear regression with two independent variables:
ŷ = b0 + b1x1 + b2x2
• ŷ = Estimated mean travel time
• x1 = Distance traveled
• x2 = Number of deliveries
• The SST, SSR, SSE and R2 are computed using the formulas discussed
earlier

Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation
Figure 7.11: Data Analysis Tools Box

Figure 7.12: Regression Dialog Box

Figure 7.13: Excel Regression Output for the
Butler Trucking Company with Miles and Deliveries
as Independent Variables

Figure 7.14: Graph of the Regression Equation
for Multiple Regression Analysis with Two
Independent Variables

Inference and Regression
Conditions Necessary for Valid Inference in the Least Squares Regression Model
Testing Individual Regression Parameters
Addressing Nonsignificant Independent Variables
Multicollinearity
Inference and Very Large Samples

Inference and Regression
• Statistical inference: Process of making estimates and drawing
conclusions about one or more characteristics of a population (the value
of one or more parameters) through the analysis of sample data drawn
from the population
• In regression, inference is commonly used to estimate and draw
conclusions about:
• The regression parameters β0, β1, β2, . . . , βq
• The mean value and/or the predicted value of the dependent variable y for
specific values of the independent variables x1, x2, . . . , xq
• Consider both hypothesis testing and interval estimation
Inference and Regression
Conditions Necessary for Valid Inference in the Least Squares
Regression Model
• For any given combination of values of the independent variables x1,
x2, . . . , xq, the population of potential error terms ε is normally distributed
with a mean of 0 and a constant variance
• The values of ε are statistically independent

Figure 7.15: Illustration of the Conditions for
Valid Inference in Regression

Figure 7.16: Example of a Random Error Pattern in a
Scatter Chart of Residuals and Predicted Values of the
Dependent Variable

Figure 7.17: Examples of Diagnostic Scatter
Charts of Residuals from Four Regressions

Figure 7.18: Excel Residual Plots for the
Butler Trucking Company Multiple
Regression

Inference and Regression
Testing Individual Regression Parameters:
• To determine whether statistically significant relationships exist between
the dependent variable y and each of the independent variables x1,
x2, . . . , xq individually
• If βj = 0, there is no linear relationship between the dependent variable y
and the independent variable xj
• If βj ≠ 0, there is a linear relationship between y and xj

Inference and Regression
• Use a t test to test the hypothesis that a regression parameter βj is zero
• The test statistic for this t test is t = bj / sbj
• sbj = Estimated standard deviation of bj
• As the magnitude of t increases (as t deviates from zero in either
direction), we are more likely to reject the hypothesis that the regression
parameter βj is zero
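The t statistics and p-values that Excel reports can also be obtained from a standard statistics library. A minimal sketch using statsmodels (the data arrays are illustrative placeholders):

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: two independent variables and the dependent variable
x1 = np.array([100, 50, 100, 100, 50, 80, 75, 65, 90, 90], dtype=float)
x2 = np.array([4, 3, 4, 2, 2, 2, 3, 4, 3, 2], dtype=float)
y  = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1])

X = sm.add_constant(np.column_stack([x1, x2]))  # intercept column plus x1 and x2
model = sm.OLS(y, X).fit()

print(model.params)   # b0, b1, b2
print(model.bse)      # estimated standard deviation of each bj
print(model.tvalues)  # t = bj / s_bj
print(model.pvalues)  # p-values for the test of H0: beta_j = 0
```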

Inference and Regression
• Confidence interval can be used to test whether each of the
regression parameters β0, β1, β2, . . . , βq is equal to zero
• Confidence interval: An estimate of a population parameter that
provides an interval believed to contain the value of the parameter
at some level of confidence
• Confidence level: Indicates how frequently interval estimates based
on samples of the same size taken from the same population using
identical sampling techniques will contain the true value of the
parameter we are estimating

Inference and Regression
Addressing Nonsignificant Independent Variables
• If practical experience dictates that the nonsignificant independent
variable has a relationship with the dependent variable, the independent
variable should be left in the model
• If the model sufficiently explains the dependent variable without the
nonsignificant independent variable, then consider rerunning the
regression without the nonsignificant independent variable
• The appropriate treatment of the inclusion or exclusion of the y-intercept
when b0 is not statistically significant may require special consideration

Inference and Regression
Multicollinearity
• Multicollinearity refers to the correlation among the independent
variables in multiple regression analysis
• In t tests for the significance of individual parameters, the difficulty caused
by multicollinearity is that it is possible to conclude that a parameter
associated with one of the multicollinear independent variables is not
significantly different from zero when the independent variable actually
has a strong relationship with the dependent variable
• This problem is avoided when there is little correlation among the
independent variables

Inference and Regression
Inference and Very Large Samples
• If the sample size is sufficiently large, virtually all relationships between the
independent variables and the dependent variable will be statistically significant
• As a result, inference can no longer be used to discriminate between meaningful
and specious relationships
• This is because the variability in potential values of an estimator bj of a regression
parameter βj depends on two factors:
(1) How closely the members of the population adhere to the relationship between xj and
y that is implied by βj
(2) The size of the sample on which the value of the estimator bj is based

Inference and Regression
• Testing for an overall regression relationship:
• Use an F test based on the F probability distribution
• If the F test leads us to reject the hypothesis that the values of β1, β2, . .
. , βq are all zero:
• Conclude that there is an overall regression relationship
• Otherwise, conclude that there is no overall regression relationship

Inference and Regression
• Testing for an overall regression relationship (contd.):
• The test statistic generated by the sample data for this test is:
F = (SSR/q) / (SSE/(n − q − 1))
• SSR = Sum of squares due to regression
• SSE = Sum of squares due to error
• q = the number of independent variables in the regression model
• n = the number of observations in the sample
• Larger values of F provide stronger evidence of an overall regression
relationship
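The F statistic can be computed from the quantities defined on this slide. A minimal sketch (the function and its arguments are illustrative; scipy supplies the F distribution for the p-value):

```python
from scipy import stats

def overall_f_test(ssr, sse, q, n):
    """F test of H0: beta_1 = beta_2 = ... = beta_q = 0."""
    msr = ssr / q               # mean square due to regression
    mse = sse / (n - q - 1)     # mean square due to error
    f_stat = msr / mse
    p_value = stats.f.sf(f_stat, q, n - q - 1)  # upper-tail probability
    return f_stat, p_value

# Usage: f, p = overall_f_test(ssr, sse, q=2, n=10) after computing SSR and SSE
```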

Categorical Independent
Variables
Butler Trucking Company and Rush Hour
Interpreting the Parameters
More Complex Categorical Variables

Categorical Independent Variables
Butler Trucking Company and Rush Hour
• Dependent variable, y: Travel time
• Independent variables: miles traveled (x1) and number of deliveries (x2)
• Categorical variable: rush hour (x3) = 1 if an assignment included travel on the
congested segment of highway during afternoon rush hour; 0 if an assignment did
not include travel on the congested segment of highway during afternoon rush hour

Figure 7.25: Histograms of the Residuals for Driving
Assignments That Included Travel on a Congested Segment
of a Highway During the Afternoon Rush Hour and
Residuals for Driving Assignments That Did Not

Figure 7.26: Excel Data and Output for Butler Trucking
with Miles Traveled (x1), Number of Deliveries (x2), and the
Highway Rush Hour Dummy Variable (x3) as the
Independent Variables

Categorical Independent Variables
Interpreting the Parameters
• The model estimates that travel time increases by:
• 0.0672 hours for every increase of 1 mile traveled, holding constant the
number of deliveries and whether the driving assignment route requires
the driver to travel on the congested segment of a highway during the
afternoon rush hour period
• 0.6735 hours for every delivery, holding constant the number of miles
traveled and whether the driving assignment route requires the driver to
travel on the congested segment of a highway during the afternoon rush
hour period

Categorical Independent Variables
• The model estimates that travel time increases by:
• 0.9980 hours if the driving assignment route requires the driver to travel on
the congested segment of a highway during the afternoon rush hour period,
holding constant the number of miles traveled and the number of deliveries
• R2 = 0.8838 indicates that the regression model explains approximately 88.4
percent of the variability in travel time for the driving assignments in the
sample

Categorical Independent Variables
• The mean or expected value of travel time for driving assignments
given no rush hour driving:
E(y|x3 = 0) = β0 + β1x1 + β2 x2 + β3(0) = β0 + β1x1 + β2 x2
• The mean or expected value of travel time for driving assignments
given rush hour driving:
E(y|x3 = 1) = β0 + β1x1 + β2 x2 + β3(1)
= β0 + β1x1 + β2x2 + β3
= (β0 + β3) + β1x1 + β2x2

Categorical Independent Variables
• Using the estimated multiple regression equation:
ŷ = –0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3
• When x3 = 0:
ŷ = –0.3302 + 0.0672x1 + 0.6735x2 + 0.9980(0)
= –0.3302 + 0.0672x1 + 0.6735x2
• When x3 = 1:
ŷ = –0.3302 + 0.0672x1 + 0.6735x2 + 0.9980(1)
= 0.6678 + 0.0672x1 + 0.6735x2
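The two cases above differ only in the intercept, which makes the dummy variable easy to evaluate in code. A small sketch of the fitted equation from this slide (the 75-mile, 3-delivery assignment is an illustrative input):

```python
def travel_time(miles, deliveries, rush_hour):
    """Estimated mean travel time from the fitted equation on this slide."""
    return -0.3302 + 0.0672 * miles + 0.6735 * deliveries + 0.9980 * rush_hour

# Same assignment with and without rush-hour travel; the difference is b3 = 0.9980 hours
print(travel_time(75, 3, rush_hour=0))
print(travel_time(75, 3, rush_hour=1))
```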
Categorical Independent Variables
More Complex Categorical Variables
• If a categorical variable has k levels, k – 1 dummy variables are required,
with each dummy variable corresponding to one of the levels of the
categorical variable and coded as 0 or 1
• Example:
• Suppose a manufacturer of vending machines organized the sales territories
for a particular state into three regions: A, B, and C
• The managers want to use regression analysis to help predict the number of
vending machines sold per week
• Suppose the managers believe sales region is one of the important factors
in predicting the number of units sold
Categorical Independent Variables
• Example (contd.):
• Sales region: categorical variable with three levels (A, B, and C)
• Number of dummy variables = 3 – 1 = 2
• Each dummy variable can be coded 0 or 1 as:
x1 = 1 if Sales Region B, 0 otherwise
x2 = 1 if Sales Region C, 0 otherwise

Categorical Independent Variables
• Example (contd.):
• The regression equation relating the expected value of the number of units
sold, E( y|x1, x2), to the dummy variables:
E( y|x1, x2) = β0 + β1x1 + β2 x2

• Observations corresponding to Sales Region A are coded x1 = 0, x2 = 0


• Regression equation:
E( y|x1 = 0, x2 = 0) = E( y|Sales Region A) = β0 + β1(0) + β2(0) = β0

• Observations corresponding to Sales Region C are coded x1 = 0, x2 = 1


• Regression equation:
E( y|x1 = 0, x2 = 1) = E( y|Sales Region C) = β0 + β1(0) + β2(1) = β0 + β2
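With a data-handling library, the k - 1 dummy columns can be generated automatically, with Region A as the baseline level. A minimal sketch using pandas (the region labels and unit counts are made-up illustrative values):

```python
import pandas as pd

# Hypothetical weekly observations: sales region and number of units sold
df = pd.DataFrame({
    "region": ["A", "B", "C", "A", "C", "B"],
    "units_sold": [210, 180, 250, 205, 240, 175],
})

# k = 3 levels -> k - 1 = 2 dummy variables (x1 = Region B, x2 = Region C)
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True).astype(int)
data = pd.concat([df["units_sold"], dummies], axis=1)
print(data)
```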
Modeling Nonlinear
Relationships
Quadratic Regression Models
Piecewise Linear Regression Models
Interaction Between Independent Variables

Figure 7.27: Scatter Chart for the Reynolds
Example

Figure 7.28: Excel Regression Output for the
Reynolds Example

Figure 7.29: Scatter Chart of the Residuals and
Predicted Values
of the Dependent Variable for the Reynolds Simple
Linear Regression

Modeling Nonlinear Relationships
Quadratic Regression Model
• An estimated regression equation given by:
y = b0 + b1x1 + b2x1² + e
• In the Reynolds example, to account for the curvilinear relationship
between months employed and scales sold we could include the
square of the number of months the salesperson has been employed
in the model as a second independent variable
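In code, a quadratic regression is just a linear regression on x and x squared. A minimal sketch with NumPy (the months and sales arrays are illustrative placeholders, not the actual Reynolds data):

```python
import numpy as np

months = np.array([5, 15, 30, 45, 60, 75, 90, 100, 110, 120], dtype=float)          # placeholder
sales  = np.array([100, 160, 230, 280, 310, 330, 335, 330, 320, 305], dtype=float)  # placeholder

# Design matrix: intercept, months, and months squared
X = np.column_stack([np.ones_like(months), months, months ** 2])
b0, b1, b2 = np.linalg.lstsq(X, sales, rcond=None)[0]

# Fitted values from the estimated quadratic regression equation
sales_hat = b0 + b1 * months + b2 * months ** 2
```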

Figure 7.30: Relationships That Can Be Fit
with a Quadratic Regression Model

Figure 7.31: Excel Data for the Reynolds
Quadratic Regression Model

Figure 7.32: Excel Output for the Reynolds
Quadratic Regression Model

Figure 7.33: Scatter Chart of the Residuals and
Predicted Values of the Dependent Variable for the
Reynolds Quadratic Regression Model

Modeling Nonlinear Relationships
Piecewise Linear Regression Models
• For the Reynolds data, as an alternative to a quadratic regression model
• Recognize that below some value of Months Employed, the relationship
between Months Employed and Sales appears to be positive and linear
• Whereas the relationship between Months Employed and Sales appears to
be negative and linear for the remaining observations
• Piecewise linear regression model: This model will allow us to fit these
relationships as two linear regressions that are joined at the value of
Months at which the relationship between Months Employed and Sales
changes
Modeling Nonlinear Relationships
• Knot: The value of the independent variable at which the
relationship between dependent variable and independent variable
changes
• For the Reynolds data, knot is the value of the independent variable
Months Employed at which the relationship between Months
Employed and Sales changes

Figure 7.34: Possible Position of Knot x(k)

Modeling Nonlinear Relationships
• Define a dummy variable:
xk = 0 if x1 ≤ x(k), 1 if x1 > x(k)
• x1 = Months
• x(k) = value of the knot (90 months for the Reynolds example)
• xk = the knot dummy variable
• Fit the following regression model:
y = b0 + b1x1 + b2(x1 – x(k))xk + e
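The knot term can be built as an extra column and the model fit by ordinary least squares. A minimal sketch (the months and sales arrays are illustrative placeholders; the 90-month knot is the value given on the slide):

```python
import numpy as np

months = np.array([10, 25, 40, 60, 80, 95, 100, 110, 120, 130], dtype=float)        # placeholder
sales  = np.array([120, 160, 210, 260, 300, 310, 300, 285, 270, 255], dtype=float)  # placeholder

knot = 90.0                             # x(k): value of Months at which the relationship changes
xk = (months > knot).astype(float)      # knot dummy variable: 1 if months > x(k), else 0
hinge = (months - knot) * xk            # the (x1 - x(k)) * xk term from the model above

X = np.column_stack([np.ones_like(months), months, hinge])
b0, b1, b2 = np.linalg.lstsq(X, sales, rcond=None)[0]

# Slope is b1 for months <= 90 and b1 + b2 for months > 90; the segments join at the knot
```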

Modeling Nonlinear Relationships
Interaction Between Independent Variables
• Interaction: This occurs when the relationship between the dependent
variable and one independent variable is different at various values of a
second independent variable
• The model is given by:
y = β0 + β1x1 + β2x2 + β3x1x2 + ε
• The estimated model is given by:
ŷ = b0 + b1x1 + b2x2 + b3x1x2
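The interaction term is simply the product of the two independent variables included as an additional column. A minimal sketch (the arrays are illustrative placeholders):

```python
import numpy as np

x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)        # placeholder
x2 = np.array([10, 15, 10, 20, 15, 20], dtype=float)  # placeholder
y  = np.array([12, 20, 19, 40, 33, 52], dtype=float)  # placeholder

# Design matrix: intercept, x1, x2, and the interaction x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# The estimated change in y for a one-unit increase in x1 is b1 + b3*x2, so it depends on x2
```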

Model Fitting
Variable Selection Procedures
Overfitting

Model Fitting
Variable Selection Procedures
• Special procedures are sometimes employed to select the independent
variables to include in the regression model
• Iterative procedures: stepwise regression, the forward selection procedure, and
the sequential replacement procedure
• At each step of the procedure, a single independent variable is added or
removed and the new model is evaluated
• Best-subsets procedure
• Evaluates regression models involving different subsets of the independent variables

Model Fitting
• Variable Selection Procedures
• Backward elimination
• Forward selection
• Stepwise selection
• Best subsets
• Forward selection procedure:
• The analyst establishes a criterion for allowing independent variables to enter the
model
• Example: The independent variable j with the smallest p-value associated with the
test of the hypothesis βj = 0, subject to some predetermined maximum p-value
for which a potential independent variable will be allowed to enter the model

Model Fitting
• Forward selection procedure (contd.):
• First step: The independent variable that best satisfies the criterion is added
to the model
• Each subsequent step: The remaining independent variables not in the
current model are evaluated, and the one that best satisfies the criterion is
added to the model
• Procedure stops: When there are no independent variables not currently in
the model that meet the criterion for being added to the regression model

Model Fitting
• Backward selection procedure:
• The analyst establishes a criterion for allowing independent variables to
remain in the model.
• Example: Remove the independent variable with the largest p-value associated with the
test of the hypothesis βj = 0, subject to some predetermined maximum p-value above
which an independent variable will not be allowed to remain in the model.

Model Fitting
• Backward selection procedure (contd.):
• First step: The independent variable that violates this criterion to the greatest degree
is removed from the model
• Each subsequent step: The independent variables in the current model are evaluated,
and the one that violates this criterion to the greatest degree is removed from the
model
• Procedure stops: When there are no independent variables currently in the model that
violate the criterion for remaining in the regression model

Model Fitting
• Stepwise procedure:
• The analyst establishes both a criterion for allowing independent variables
to enter the model and a criterion for allowing independent variables to
remain in the model
• In the first step of the procedure, the independent variable that best
satisfies the criterion for entering the model is added
• In each subsequent step, the remaining independent variables not in the current model are
first evaluated, and the one that best satisfies the criterion for entering is added
to the model

Model Fitting
• Stepwise procedure (contd.):
• Then the independent variables in the current model are evaluated, and the
one that violates the criterion for remaining in the model to the greatest
degree is removed
• The procedure stops when no independent variables not currently in the
model meet the criterion for being added to the regression model, and no
independent variables currently in the model violate the criterion for
remaining in the regression model

Model Fitting
• Best-subsets procedure:
• Simple linear regressions for each of the independent variables under
consideration are generated, and then the multiple regressions with all
combinations of two independent variables under consideration are
generated, and so on
• Once a regression has been generated for every possible subset of the
independent variables under consideration, an output that provides some
criteria for selecting regression models is produced for all models generated

Model Fitting
Overfitting
• Results from creating an overly complex model to explain idiosyncrasies in the
sample data
• Results from the use of complex functional forms or independent variables that
do not have meaningful relationships with the dependent variable
• If a model is overfit to the sample data, it will perform better on the sample data
used to fit the model than it will on other data from the population
• Thus, an overfit model can be misleading about its predictive capability and its
interpretation

Model Fitting
• How does one avoid overfitting a model?
• Use only independent variables that you expect to have real and meaningful
relationships with the dependent variable
• Use complex models, such as quadratic models and piecewise linear
regression models, only when you have a reasonable expectation that such
complexity provides a more accurate depiction of what you are modeling
• Do not let software dictate your model; use iterative modeling procedures,
such as the stepwise and best-subsets procedures, only for guidance and not
to generate your final model

Model Fitting
• How does one avoid overfitting a model? (contd.)
• If you have access to a sufficient quantity of data, assess your model on data
other than the sample data that were used to generate the model (this is
referred to as cross-validation)
• It is recommended to divide the original sample data into training and
validation sets
• Training set: The data set used to build the candidate models that appear to
make practical sense
• Validation set: The set of data used to compare model performances and
ultimately pick a model for predicting values of the dependent variable

Model Fitting
• Holdout method: The sample data are randomly divided into
mutually exclusive and collectively exhaustive training and validation
sets
• k-fold cross-validation: The sample data are randomly divided into k
equal-sized, mutually exclusive, and collectively exhaustive subsets
called folds, and k iterations are executed
• Leave-one-out cross-validation: For a sample of n observations, an
iteration consists of estimating the model on n – 1 observations and
evaluating the model on the single observation that was omitted from
the training data
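All three validation schemes are available in standard machine-learning libraries. A minimal sketch using scikit-learn (the simulated data and the choice of linear regression are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                                        # placeholder predictors
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.3, 40)    # placeholder response

model = LinearRegression()

# Holdout method: one random split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_r2 = model.fit(X_train, y_train).score(X_val, y_val)

# k-fold cross-validation: k = 5 folds, 5 iterations
kfold_r2 = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Leave-one-out cross-validation: n iterations, each omitting a single observation
loo_mse = -cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error")

print(holdout_r2, kfold_r2.mean(), loo_mse.mean())
```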