CHAPTER 7

Regression Analysis and Modeling

From Business Analytics, Volume II: A Data-Driven Decision-Making Approach for Business

By Amar Sahay
(A Business Expert Press Book)

Copyright © Business Expert Press, LLC, 2020. All rights reserved.

Harvard Business Publishing distributes in digital form the individual chapters from a wide selection of books on business from publishers including Harvard Business Press and numerous other companies. To order copies or request permission to reproduce materials, call 1-800-545-7685 or go to http://www.hbsp.harvard.edu. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise) without the permission of Harvard Business Publishing, which is an affiliate of Harvard Business School.

Chapter Highlights

• Introduction to Regression and Correlation
• Linear Regression
  ◦ Regression Model
    • The Estimated Equation of Regression Line
    • The Method of Least Squares
    • Illustration of Least Squares Regression Method
    • Analysis of a Simple Regression Problem
• Regression Analysis Using Computer
  ◦ Simple Regression Using EXCEL
  ◦ Simple Regression Using MINITAB
  ◦ Analysis of Regression Output
  ◦ Model Adequacy Test
  ◦ Assumptions of Regression Model and Checking the Assumptions Using MINITAB Residual Plots
  ◦ Checking the Assumptions of Regression Using Residual Plots
• Multiple Regression: Computer Analysis and Results
  ◦ Introduction to Multiple Regression
  ◦ Multiple Regression Model
    • The Least Squares Multiple Regression Model
    • Models with Two Quantitative Independent Variables x1 and x2
    • Assumptions of Multiple Regression Model
    • Computer Analysis of Multiple Regression
  ◦ The Coefficient of Multiple Determination (r²)
  ◦ Hypothesis Tests in Multiple Regression
  ◦ Testing the Overall Significance of Regression
  ◦ Hypothesis Tests on Individual Regression Coefficients
• Multicollinearity and Autocorrelation in Multiple Regression
• Summary of the Key Features of Multiple Regression Model
• Model Building and Computer Analysis
  ◦ Model with a Single Quantitative Independent Variable
  ◦ First-Order Model / Second-Order Model / Third-Order Model
    • A Quadratic (Second-Order) Model: Second-Order Model Using MINITAB
  ◦ Analysis of Computer Results
• Models with Qualitative Independent (Dummy) Variables
  ◦ One Qualitative Independent Variable at Two Levels
    • Model with One Qualitative Independent Variable at Three Levels
    • Example: Dummy Variables
• Overview of Regression Models
• Implementation Steps and Strategy for Regression Models

Introduction to Regression and Correlation


This chapter provides an introduction to regression and correlation analysis. The techniques of regression enable us to explore the relationship between variables. We will discuss how to develop regression models that can be used to predict one variable using another variable, or even multiple variables. In addition, the following topics related to regression analysis are covered in this chapter:

I. Concepts of the dependent or response variable and the independent variables or predictors;
II. The basics of the least squares method in regression analysis and its purpose in estimating the regression line;
III. Determining the best-fitting line through the data points;
IV. Calculating the slope and y-intercept of the best-fitting regression line and interpreting the meaning of the regression line; and
V. Measures of association between two quantitative variables: the covariance and the coefficient of correlation.
Linear Regression

Regression analysis is used to investigate the relationship between two or more variables. Often we are interested in predicting a variable y using one or more independent variables x1, x2, ..., xk. For example, we might be interested in the relationship between two variables: sales and profit for a chain of stores, the number of hours required to produce a certain number of products, the number of accidents vs. blood alcohol level, advertising expenditures and sales, or the height of parents compared to that of their children. In all these cases, regression analysis can be applied to investigate the relationship between the variables.

In general, we have one dependent or response variable, y, and one or more independent variables, x1, x2, ..., xk. The independent variables are also called predictors. If there is only one independent variable x that we are trying to relate to the dependent variable y, then this is a case of simple regression. On the other hand, if we have two or more independent variables that are related to a single response or dependent variable, then we have a case of multiple regression. In this section, we will discuss simple regression, or to be more specific, simple linear regression. This means that the relationship we obtain between the dependent or response variable y and the independent variable x will be linear. In this case, there is only one predictor or independent variable (x) of interest that will be used to predict the dependent variable (y).

In regression analysis, the dependent or response variable y is a random variable, whereas the independent variables x1, x2, ..., xk are measured with negligible error and are controlled by the analyst. The relationship between the dependent and independent variable or variables is described by a mathematical model known as a regression model.

The Regression Model

In the simple linear regression method, we study the linear relationship between two variables: the dependent or response variable (y) and the independent variable or predictor (x).

Suppose that the Mountain Power Utility company is interested in developing a model that will enable it to predict home heating cost based on the size of homes in two of the western states that it serves. This model involves two variables: the heating cost and the size of the homes. We will denote them by y and x, respectively. The manager in charge of developing the model believes that there is a positive relationship between x and y, meaning that larger homes (homes with larger square footage) tend to have a higher heating cost. The regression model relating the two variables (home heating cost y as the dependent variable and the size of the homes x as the independent variable) can be denoted using equation (7.1), which shows the relationship between the values of x and y and an error term in a simple regression model:

y = β0 + β1x + ε    (7.1)

where
y = dependent variable,  x = independent variable
β0 = y-intercept (population),  β1 = slope of the population regression line
ε = random error term (ε is the Greek letter "epsilon")

The model represented by equation (7.1) can be viewed as a population model in which β0 and β1 are the parameters of the model. The error term ε represents the variability in y that cannot be explained by the relationship between x and y.
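To make the population model in equation (7.1) concrete, the short Python sketch below simulates heating costs from it. The parameter values (β0 = 20, β1 = 0.05, and the error standard deviation) are illustrative assumptions of ours, not estimates from the utility example.

```python
import random

# Assumed (illustrative) population parameters for equation (7.1):
# heating cost y = beta0 + beta1 * size + epsilon
BETA0, BETA1, SIGMA = 20.0, 0.05, 15.0   # all three values are assumptions

random.seed(1)
for size in (1500, 2100, 2800):          # a few home sizes x (square feet)
    epsilon = random.gauss(0.0, SIGMA)   # random error term
    y = BETA0 + BETA1 * size + epsilon   # one simulated heating cost
    mean_y = BETA0 + BETA1 * size        # expected value E(y) for this x
    print(f"x = {size}: E(y) = {mean_y:.1f}, simulated y = {y:.1f}")
```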
In our example, the population consists of all the homes in the region. This population consists of subpopulations, one for each home size x. Thus, one subpopulation may be viewed as all homes with 1,500 square feet, another subpopulation as all homes with 2,100 square feet, and so on. Each of these subpopulations of size x will have a corresponding distribution of y values with a mean or expected value E(y). The relationship between the expected value of y, or E(y), and x is the regression equation given by:

E(y) = β0 + β1x    (7.2)

where
E(y) = the mean or expected value of y for a given value of x
β0 = y-intercept of the regression line,  β1 = slope of the regression line

The regression equation represented by equation (7.2) is the equation of a straight line describing the relationship between E(y) and x. This relationship, shown in Figure 7.1 (a)-(c), can be described as positive, negative, or no relationship.
Figure 7.1 Possible linear relationship between E(y) and x in simple linear regression
The positive linear relationship is identified by a positive slope: an increase in the value of x is associated with an increase in the mean value of y, or E(y). A negative linear relationship is identified by a negative slope and indicates that an increase in the value of x is associated with a decrease in the mean value of y.

No relationship between x and y means that the mean value of y, or E(y), is the same for every value of x. In this case, the regression equation cannot be used to make a prediction because of a weak or nonexistent relationship between x and y.

The Estimated Equation of Regression Line

In equation (7.2), β0 and β1 are the unknown population parameters that must be estimated using sample data. The estimates of β0 and β1 are denoted by b0 and b1, which provide the estimated regression equation:

ŷ = b0 + b1x    (7.3)

where
ŷ = point estimator of E(y), the mean value of y for a given value of x
b0 = y-intercept of the regression line,  b1 = slope of the regression line

The regression equation above represents the estimated line of regression in slope-intercept form. The y-intercept b0 and the slope b1 in equation (7.3) are determined using the least squares method. Before we discuss the least squares method in detail, we will describe the process of estimating the regression equation. Figure 7.2 explains this process.

The Method of Least Squares

The regression model is described in the form of a regression equation that is obtained using the least squares method. In simple linear regression, the form of the regression equation is ŷ = b0 + b1x. This is the equation of a straight line in slope-intercept form.

Figure 7.2 Estimating the regression equation
Figure 7.3 shows a scatter plot of the data in Table 7.1. Scatter plots are often used to investigate the relationship between two variables. An investigation of the plot shows a positive relationship between sales and advertising expenditures; therefore, the manager would like to predict sales from advertising expenditures using a simple regression model.

Figure 7.3 Scatterplot of sales and advertisement expenditures

Table 7.1 Sales and advertisement data


Sales ($1,000s)   Advertising ($1,000s)
458               34
390               30
378               29
426               30
330               26
400               31
458               33
410               30
628               41
553               38
728               44
498               40
708               48
719               47
658               45

As outlined above, a simple regression model involves two variables, where one variable is used to predict the other. The variable to be predicted is the dependent or response variable, and the other variable is the independent variable. The dependent variable is usually denoted by y, while the independent variable is denoted by x.

In a scatter plot, the dependent variable (y) is plotted on the vertical axis and the independent variable (x) is plotted on the horizontal axis. The scatter plot in Figure 7.3 suggests a positive linear relationship between sales (y) and advertising expenditures (x). From the figure, it can be seen that the plotted points can be well approximated by a straight line of the form ŷ = b0 + b1x, where b0 and b1 are the y-intercept and the slope of the line. The process of estimating this regression equation uses a widely used mathematical tool known as the least squares method.

The least squares method requires fitting a line through the data points so that the sum of the squares of the errors, or residuals, is minimized. These errors or residuals are the vertical distances of the points from the fitted line. Thus, the least squares method determines the best-fitting line through the data points: the line for which the sum of the squared vertical distances, or deviations, of the given points from the fitted line is a minimum.
Figure 7.4 shows the concept of the least squares method. The figure shows a line fitted to the scatter plot of Figure 7.3 using the least squares method. This line is the estimated line, denoted ŷ (y-hat). The method of estimating this line will be illustrated later. The equation of this line is:

ŷ = −150.9 + 18.33x

The vertical distance of each point from the line is known as the error or residual. Note that the residual or error of a point can be positive, negative, or zero, depending on whether the point is above, below, or on the fitted line. If the point is above the line, the error is positive, whereas if the point is below the fitted line, the error is negative.

Figure 7.4 shows graphically the errors for a few points. To demonstrate how the error or residual for a point is calculated, refer to the data in Table 7.1.

Figure 7.4 Fitting the regression line to the sales and advertising data of Table 7.1

The table shows that for an advertising expenditure of 40 (x = 40), sales are 498 (y = 498). This is shown graphically in Figure 7.4. The estimated or predicted sales for x = 40 equal the vertical distance all the way up to the fitted regression line from y = 498. This predicted value can be determined using the equation of the fitted line as

ŷ = −150.9 + 18.33x = −150.9 + 18.33(40) = 582.3

This is shown in Figure 7.4 as ŷ = 582.3. The difference between the observed sales, y = 498, and the predicted value of y is the error or residual, which equals

(y − ŷ) = (498 − 582.3) = −84.3

Figure 7.4 shows this error value. This error is negative because the point y = 498 lies below the fitted regression line.

Now consider the advertising expenditure x = 44. The observed sales for this value are 728, or y = 728 (from Table 7.1). The predicted sales for x = 44 equal the vertical distance from y = 728 to the fitted regression line. This value is calculated as:

ŷ = −150.9 + 18.33x = −150.9 + 18.33(44) = 655.62

This value is shown in Figure 7.4. The error for this point is the difference between the observed value and the predicted (estimated) value, which is

(y − ŷ) = (728 − 655.62) = 72.38

This value of the error is positive because the point y = 728 lies above the fitted line.
The errors for the other observed values can be calculated in a similar way. The vertical deviation of a point from the fitted regression line represents the amount of error associated with that point. The least squares method determines the values b0 and b1 in the fitted regression line ŷ = b0 + b1x that minimize the sum of the squares of the errors. Minimizing the sum of the squares of the errors provides a unique line through the data points such that the total squared distance of the points from the fitted line is a minimum.

Since the least squares criterion requires that the sum of the squares of the errors be minimized, we have the following relationship:

Σ(y − ŷ)² = Σ(y − b0 − b1x)²    (7.4)

where y is the observed value and ŷ is the estimated value of the dependent variable given by ŷ = b0 + b1x.

Equation (7.4) involves two unknowns, b0 and b1. Using differential calculus, the following two equations can be obtained:

Σy = nb0 + b1Σx
Σxy = b0Σx + b1Σx²    (7.5)

These equations are known as the normal equations and can be solved algebraically to obtain the values of the y-intercept b0 and the slope b1. Solving these equations yields the results shown below.

b1 = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]    (7.6)

b0 = ȳ − b1x̄    (7.7)

where ȳ = Σy / n and x̄ = Σx / n.

The values b0 and b1, when calculated using equations (7.6) and (7.7), minimize the sum of the squares of the vertical deviations or errors. These values can be calculated easily using the data points (xi, yi), which are the observed values of the independent and dependent variables (the collected data in Table 7.1).
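Equations (7.6) and (7.7) translate directly into code. The sketch below is a minimal Python version (the book itself works by hand and with EXCEL/MINITAB); the helper name fit_line is our own.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x,
    using equations (7.6) and (7.7)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)

    # Equation (7.6): slope
    b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Equation (7.7): intercept, using the means x-bar and y-bar
    b0 = sum_y / n - b1 * (sum_x / n)
    return b0, b1
```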
Illustration of Least Squares Regression Method

In this section we demonstrate the least squares method, which is the basis of the regression model. We also discuss the process of finding the regression equation using the sales and advertising expenditures data in Table 7.1. Since the sales manager found a positive linear relationship between the sales and advertising expenditures through an investigation of the scatter plot in Figure 7.3, he would now use the data to find the best-fitting line through the points on the scatter plot. The line of best fit can be obtained by first calculating b0 and b1 using equations (7.6) and (7.7) above. These values will provide the line of the form ŷ = b0 + b1x that can be used to predict the sales (y) using the advertising expenditures (x).

In order to evaluate b0 and b1, we need to perform some intermediate calculations, shown in Table 7.2. We must first calculate Σx, Σy, Σxy, Σx², x̄, and ȳ. These values can be calculated using the data points x and y. For later calculations, we will also need the value of Σy²; therefore, an extra column for y², the squares of the dependent variable (y), is added to this table.
Table 7.2 Intermediate calculations for determining the estimated regression line

      Sales ($1,000s)   Advertising ($1,000s)
      y                 x            xy          x²        y²
 1    458               34           15,572      1,156     209,764
 2    390               30           11,700        900     152,100
 3    378               29           10,962        841     142,884
 4    426               30           12,780        900     181,476
 5    330               26            8,580        676     108,900
 6    400               31           12,400        961     160,000
 7    458               33           15,114      1,089     209,764
 8    410               30           12,300        900     168,100
 9    628               41           25,748      1,681     394,384
10    553               38           21,014      1,444     305,809
11    728               44           32,032      1,936     529,984
12    498               40           19,920      1,600     248,004
13    708               48           33,984      2,304     501,264
14    719               47           33,793      2,209     516,961
15    658               45           29,610      2,025     432,964

      Σy = 7,742        Σx = 546     Σxy = 295,509     Σx² = 20,622     Σy² = 4,262,358

Note: n = the number of observations = 15

x̄ = Σx / n = 546 / 15 = 36.4        ȳ = Σy / n = 7,742 / 15 = 516.133
Using the values in Table 7.2 and equations (7.6) and (7.7), we first calculate the value of b1:

b1 = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] = [15(295,509) − (546)(7,742)] / [15(20,622) − (546)²] = 18.326

Using the value of b1, we obtain the value of b0:

b0 = ȳ − b1x̄ = 516.133 − 18.326(36.4) = −150.9
This gives us the following equation for the estimated regression line:

ŷ = −150.9 + 18.33x

This equation is plotted in Figure 7.5.

The slope (b1) of the estimated regression line has a positive value of 18.33. This means that as the advertising expenditures (x) increase, the sales increase. Since the advertising expenditures (x) and the sales (y) are both measured in $1,000s, the estimated regression equation ŷ = −150.9 + 18.33x means that each unit increase in the value of x (every $1,000 increase in the advertising expenditures) will lead to an increase of $18,330 (18.33 × 1,000 = 18,330) in expected sales. We can also use the regression equation to predict the sales for a given value of x, the advertising expenditure. For instance, the predicted sales for x = 40 can be calculated as:

ŷ = −150.9 + 18.33(40) = 582.3

Thus, for an advertising expenditure of $40,000 the predicted sales would be $582,300.
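As a check on the hand computation, the fit_line helper from the earlier sketch can be applied to the Table 7.1 data; it reproduces b1 ≈ 18.33 and b0 ≈ −150.9.

```python
# Table 7.1 data: advertising (x) and sales (y), both in $1,000s
x = [34, 30, 29, 30, 26, 31, 33, 30, 41, 38, 44, 40, 48, 47, 45]
y = [458, 390, 378, 426, 330, 400, 458, 410, 628, 553, 728, 498, 708, 719, 658]

b0, b1 = fit_line(x, y)                   # fit_line from the earlier sketch
print(f"b0 = {b0:.1f}, b1 = {b1:.3f}")    # b0 = -150.9, b1 = 18.326

# Predicted sales for an advertising expenditure of $40,000 (x = 40)
print(f"y-hat(40) = {b0 + b1 * 40:.1f}")  # about 582 (the text's 582.3 uses the rounded slope 18.33)
```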
Figure 7.5 Graph of the estimated regression equation

It is important to check the adequacy of the estimated regression equation before using the equation to make predictions. In the sections that follow, we will discuss several tests to check the adequacy of the regression model.
Analysis of a Simple Regression Problem
The example below demonstrates the necessary computations, their interpretation, and the application of a simple regression problem using computer packages. Suppose the operations manager of a manufacturing company wants to predict the number of hours required to produce a certain number of products. The data for the number of units produced and the time in hours to produce those units are shown in Table 7.3 (data file: Hours_Units). This is a simple linear regression problem, so we have one dependent or response variable that we are trying to relate to one independent variable or predictor. Since we are trying to predict the number of hours using the number of units produced, hours is the dependent or response variable (y) and the number of units is the independent variable or predictor (x). For the data in Table 7.3, we first calculate the intermediate values shown in Table 7.4. All these values are calculated using the observed values of x and y in Table 7.3. These intermediate values will be used in most of the computations related to simple regression analysis.

We will also use computer packages such as MINITAB and EXCEL to analyze the simple regression problem and provide a detailed analysis of the computer output. First, we will explain the manual calculations and interpret the results. You will find that all the formulas are written in terms of the values calculated in Table 7.4.

Table 7.3 Data for regression example

Obs. No.   1      2      3      4      5      6      7      8      9      10
Units (x)  932    951    531    766    814    914    899    535    554    445
Hours (y)  16.20  16.05  11.84  14.21  14.42  15.08  14.45  11.73  12.24  11.12

Obs. No.   11     12     13     14     15     16     17     18     19     20
Units (x)  704    897    949    632    477    754    819    869    1,035  646
Hours (y)  12.63  14.43  15.46  12.64  11.92  13.95  14.33  15.23  16.77  12.41

Obs. No.   21     22     23     24     25     26     27     28     29     30
Units (x)  1,055  875    969    1,075  655    1,125  960    815    555    925
Hours (y)  17.00  15.50  16.20  17.50  12.92  18.20  15.10  14.00  12.20  15.50
Table 7.4 Intermediate calculations for data in Table 7.3

n = 30 (number of observations)
Σx = 24,132     Σy = 431.23     Σxy = 357,055
Σx² = 20,467,220     Σy² = 6,302.3
x̄ = Σx / n = 804.40     ȳ = Σy / n = 14.374
Constructing a Scatterplot of the Data

We can use EXCEL or MINITAB to do a scatter plot of the data. From the data in Table 7.3, enter the units (x) in the first column and hours (y) in the second column of EXCEL or MINITAB and construct a scatter plot. Figure 7.6 shows the scatter plot for these data.

Figure 7.6 Scatter plot of Hours (y) and Units (x)

The plot clearly shows an increasing trend. It shows a linear relationship between x and y; therefore, the data can be approximated using a straight line with a positive slope.
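For readers working outside EXCEL or MINITAB, a scatter plot of the same kind can be drawn with Python's matplotlib; this is a minimal sketch, assuming the Table 7.3 data are held in the lists units and hours (only the first ten observations are typed in here).

```python
import matplotlib.pyplot as plt

# Table 7.3 data (first 10 observations for brevity;
# extend with the remaining rows for the full plot)
units = [932, 951, 531, 766, 814, 914, 899, 535, 554, 445]
hours = [16.20, 16.05, 11.84, 14.21, 14.42, 15.08, 14.45, 11.73, 12.24, 11.12]

plt.scatter(units, hours)
plt.xlabel("Units (x)")
plt.ylabel("Hours (y)")
plt.title("Scatterplot of Hours (y) vs. Units (x)")
plt.show()
```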

Finding the Equation of the Best-Fitting Line (Estimated Line)

The equation of the estimated regression line is given by:

ŷ = b0 + b1x

where b0 = y-intercept and b1 = slope. These are determined using the least squares method; the y-intercept b0 and the slope b1 are calculated using equations (7.6) and (7.7) discussed earlier.

Using the values in Table 7.4, first calculate the values of b1 (the slope) and b0 (the y-intercept) as shown below:

b1 = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] = [30(357,055) − (24,132)(431.23)] / [30(20,467,220) − (24,132)²] = 0.00964

and

b0 = ȳ − b1x̄ = 14.374 − (0.00964)(804.40) = 6.62

Therefore, the equation of the estimated line is

ŷ = b0 + b1x = 6.62 + 0.00964x

The regression equation, or the equation of the "best" fitting line, can also be written as:

Hours(y) = 6.62 + 0.00964 Units(x)

or simply, ŷ = 6.62 + 0.00964x

where y is the hours and x is the number of units produced. The hat (^) over y means that the line is estimated. Thus, the equation is an estimated equation of the best-fitting line. The line is also known as the least squares line, which minimizes the sum of the squares of the errors. This means that when the line is placed over the scatter plot, the sum of the squared vertical distances from the points to the line is minimized. The error is the vertical distance of each point from the estimated line.
The error is also known as the residual. Figure 7.7 shows the least squares line and the residuals for each of the points as the vertical distance from the point to the estimated regression line.

[Note: The estimated line is denoted by ŷ and the residual for a point yi is given by (yi − ŷi).]

Recall that the error or residual for a point is given by (y − ŷ), the vertical distance of the point from the estimated line. Figure 7.8 shows the fitted regression line over the scatter plot.

Figure 7.7 The least squares line and residuals

Figure 7.8 Fitted line regression plot
Interpretation of the Fitted Regression Line

The estimated least squares line is of the form ŷ = b0 + b1x, where b1 is the slope and b0 is the y-intercept. The equation of the fitted line is

ŷ = 6.62 + 0.00964x

In this equation of the fitted line, 6.62 is the y-intercept and 0.00964 is the slope. This line provides the relationship between the hours and the number of units produced. The equation means that for each unit increase in x (the number of units produced), y (the number of hours) will increase by 0.00964. The value 6.62 represents the portion of the hours that is not affected by the number of units.

Making Predictions Using the Regression Line

The regression equation can be used to predict the number of hours to produce a certain number of units. For example, suppose we want to predict the number of hours (y) required to produce 900 units (x). This can be determined using the equation of the fitted line as:

Hours(y) = 6.62 + 0.00964 Units(x)
Hours(y) = 6.62 + 0.00964 × (900) = 15.296 hours

Thus, it will take approximately 15.3 hours to produce 900 units of the product. Note that making a prediction outside of the range of the data will introduce error in the predicted value. For example, suppose we want to predict the time for producing 2,000 units; this prediction would be outside of the data range (see the data in Table 7.3: the range of x values is from 445 to 1,125). The value x = 2,000 is far greater than all the other x values in the data. From the scatter plot, a straight-line fit with an increasing trend is evident for the data, but we should be cautious about assuming that this straight-line trend will continue to hold for values as large as x = 2,000. Therefore, it may not be reasonable to make predictions for values that are far beyond the range of the data values.
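The extrapolation caution above is easy to automate. Below is a minimal sketch (our own convention, not the book's) that predicts hours from units with the fitted line and warns when the requested x falls outside the observed range of Table 7.3.

```python
def predict_hours(units, x_min=445, x_max=1125):
    """Predict hours from units with the fitted line y-hat = 6.62 + 0.00964x.
    Warn when extrapolating beyond the observed range of Table 7.3."""
    y_hat = 6.62 + 0.00964 * units
    if not (x_min <= units <= x_max):
        print(f"Warning: x = {units} is outside the data range "
              f"[{x_min}, {x_max}]; the prediction may be unreliable.")
    return y_hat

print(predict_hours(900))    # about 15.3 hours
print(predict_hours(2000))   # triggers the extrapolation warning
```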

The Standard Error of the Estimate (s)

The standard error of the estimate measures the variation or scatter of the points around the fitted line of regression. It is measured in units of the response or dependent variable (y). The standard error of the estimate is analogous to the standard deviation: the standard deviation measures the variability around the mean, whereas the standard error of the estimate (s) measures the variability around the fitted line of regression. A large value of s indicates larger variation of the points around the fitted line of regression. The standard error of the estimate is calculated using the following formula:

s = √[Σ(y − ŷ)² / (n − 2)]    (7.7A)

The equation can also be written in terms of b0, b1, and the values in Table 7.4. In this form, the standard error of the estimate can be calculated as:

s = √[(Σy² − b0Σy − b1Σxy) / (n − 2)]
  = √[(6,302.3 − 6.62(431.23) − 0.00964(357,055)) / 28] = 0.4481    (7.8)

Equation (7.7A) measures the average deviation of the points from the fitted line of regression. Equation (7.8) is mathematically equivalent to equation (7.7A) and is computationally more efficient. Thus,

s = 0.4481

A small value of s indicates less scatter of the data points around the fitted line of regression (see Figure 7.8). The value s = 0.4481 indicates that the average deviation is 0.4481 hours (measured in units of the dependent variable y).
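Equation (7.8) can be verified with a few lines of Python. This sketch uses the Table 7.4 sums with the rounded b0 and b1 from the text, so it matches the book's s = 0.4481 only up to rounding.

```python
import math

# Table 7.4 sums and the fitted coefficients from the text (rounded)
n, sum_y2, sum_y, sum_xy = 30, 6302.3, 431.23, 357055
b0, b1 = 6.62, 0.00964

# Equation (7.8): standard error of the estimate
sse = sum_y2 - b0 * sum_y - b1 * sum_xy   # error sum of squares
s = math.sqrt(sse / (n - 2))
print(round(s, 4))  # close to the text's 0.4481 (rounding in b0, b1)
```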

Assessing the Fit of the Simple Regression Model: The Coefficient of Determination (r²) and Its Meaning

The coefficient of determination, r², is an indication of how well the independent variable predicts the dependent variable. In other words, it is used to judge the adequacy of the regression model. The value of r² lies between 0 and 1 (0 ≤ r² ≤ 1), or 0 to 100 percent. The closer the value of r² is to 1 or 100 percent, the better the model, because the r² value indicates the amount of variation in the data explained by the regression model. Figure 7.9 shows the relationship between the explained, unexplained, and total variation.

In regression, the total sum of squares is partitioned into two components, the regression sum of squares and the error sum of squares, giving the following relationship:

SST = SSR + SSE

SST = total sum of squares for y
SSR = regression sum of squares (measures the variability in y accounted for by the regression line, also known as explained variation)
SSE = error sum of squares (measures the variation due to the residuals or errors, also known as unexplained variation)

yi = any point i; ȳ = average of the y values

Figure 7.9 SST = SSR + SSE
From Figure 7.9, the SST and SSE are calculated as

SST = Σ(y − ȳ)² = Σy² − (Σy)² / n    (7.9)

and

SSE = Σ(y − ŷ)² = Σy² − b0Σy − b1Σxy    (7.10)

Note that we can calculate SSR by calculating SST and SSE, since

SST = SSR + SSE, or SSR = SST − SSE

Using the SSR and SST values, the coefficient of determination, r², is calculated using

r² = SSR / SST    (7.11)

The coefficient of determination, r², is used to measure the goodness of fit for the regression equation. It measures the variation in y explained by the variation in the independent variable x; that is, r² is the ratio of the explained variation to the total variation.

The calculation of r² is explained below. First, calculate SST and SSE using equations (7.9) and (7.10) and the values in Table 7.4:

SST = Σ(y − ȳ)² = Σy² − (Σy)² / n = 6,302.3 − (431.23)² / 30 = 103.680

SSE = Σ(y − ŷ)² = Σy² − b0Σy − b1Σxy = 6,302.3 − 6.62(431.23) − 0.00964(357,055) = 5.623

Since

SST = SSR + SSE

Therefore,

SSR = SST − SSE = 103.680 − 5.623 = 98.057    (7.12)

and

r² = SSR / SST = 98.057 / 103.680 = 0.946

or r² = 94.6%.

This means that 94.6 percent of the variation in the dependent variable y is explained by the variation in x, and 5.4 percent of the variation is due to unexplained reasons or error.
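The same partition can be scripted directly from the Table 7.4 sums; this is a minimal sketch of equations (7.9) through (7.11), again using the rounded b0 and b1 from the text, so the printed values agree with the book's only up to rounding.

```python
# Table 7.4 sums and fitted coefficients (rounded, as in the text)
n, sum_y, sum_y2, sum_xy = 30, 431.23, 6302.3, 357055
b0, b1 = 6.62, 0.00964

sst = sum_y2 - sum_y ** 2 / n                 # equation (7.9)
sse = sum_y2 - b0 * sum_y - b1 * sum_xy       # equation (7.10)
ssr = sst - sse                               # from SST = SSR + SSE
r2 = ssr / sst                                # equation (7.11)
print(f"SST={sst:.3f}, SSE={sse:.3f}, SSR={ssr:.3f}, r2={r2:.3f}")
```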

The Coefficient of Correlation (r) and Its Meaning

The coefficient of correlation, r, can be calculated by taking the square root of r²:

r = √r²    (7.13)

Therefore,

r = √r² = √0.946 = 0.973

In this case, r = 97.3% indicates a strong positive correlation between x and y. Note that r is positive if the slope b1 is positive, indicating a positive correlation between x and y. The value of r lies between −1 and +1:

−1 ≤ r ≤ 1    (7.14)

The value of r determines the correlation between the x and y variables. The closer the value of r is to −1 or +1, the stronger the correlation between x and y.

The value of the coefficient of correlation r can be positive or negative. The value of r is positive if the slope b1 is positive; it is negative if b1 is negative. A positive r indicates a positive correlation, whereas a negative r indicates a negative correlation. The coefficient of correlation r can also be calculated directly using the following formula:

r = [Σxy − (Σx)(Σy) / n] / (√[Σx² − (Σx)² / n] · √[Σy² − (Σy)² / n])    (7.15)

Using the values in Table 7.4, we can calculate r from equation (7.15).
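Equation (7.15) can be evaluated in the same style as the earlier sketches; here the sums come straight from Table 7.4.

```python
import math

# Table 7.4 sums for the hours/units data
n = 30
sum_x, sum_y = 24132, 431.23
sum_xy, sum_x2, sum_y2 = 357055, 20467220, 6302.3

# Equation (7.15): coefficient of correlation
num = sum_xy - sum_x * sum_y / n
den = math.sqrt(sum_x2 - sum_x ** 2 / n) * math.sqrt(sum_y2 - sum_y ** 2 / n)
r = num / den
print(round(r, 3))  # about 0.973
```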

Summary of the Main Features of the Simple Regression Model Discussed Above

The sections above illustrated the least squares method, which is the basis of the regression model. The process of finding the regression equation using the least squares method was demonstrated using the sales and advertising expenditures data. The problem involved predicting the sales, the response or dependent variable (y), using the predictor or independent variable (x), the advertising expenditures. Another example involved the number of hours (y) required to produce a number of products (x). The analysis of this simple regression problem was presented by calculating and interpreting several measures. In particular, the following analyses were performed: (a) constructing a scatterplot of the data, (b) finding the equation of the best-fitting line, (c) interpreting the fitted regression line, and (d) making predictions using the fitted regression equation. Other important measures critical to assessing the quality of the regression model were calculated and explained. These measures include: (a) the standard error of the estimate (s), which measures the variation or scatter of the points around the fitted line of regression; (b) the coefficient of determination (r²), which measures how well the independent variable predicts the dependent variable, or the percent of variation in the dependent variable y explained by the variation in the independent variable x; and (c) the coefficient of correlation (r), which measures the strength of the relationship between x and y.
Regression Analysis Using Computer

This section provides a step-by-step computer analysis of the regression model. In the real world, computer software is almost always used to analyze regression problems. A number of software packages are in use today, among them MINITAB, EXCEL, SAS, and SPSS. Here we have used EXCEL and MINITAB to analyze the regression models. The applications of simple, multiple, and higher-order regressions using EXCEL and MINITAB software are demonstrated in this and subsequent sections. If you perform regression analysis with a substantial amount of data and need more detailed analyses, the use of a statistical package such as MINITAB, SAS, or SPSS is recommended. Besides these, a number of packages including R, Stata, and others are readily available and are widely used in research and data analysis.
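Python is another common choice, although it is not covered in the book. As a hedged aside for Python users, the scipy library's linregress routine reports the same core quantities as the EXCEL and MINITAB output discussed below: the intercept, the slope, r, the p-value for the slope, and its standard error. A minimal sketch, using a subset of the Table 7.3 data:

```python
from scipy.stats import linregress

# A few observations from Table 7.3 (use all 30 rows in practice)
units = [932, 951, 531, 766, 814, 914, 899, 535, 554, 445]
hours = [16.20, 16.05, 11.84, 14.21, 14.42, 15.08, 14.45, 11.73, 12.24, 11.12]

result = linregress(units, hours)
print("intercept:", result.intercept)    # b0
print("slope:", result.slope)            # b1
print("r-squared:", result.rvalue ** 2)  # coefficient of determination
print("p-value:", result.pvalue)         # test of H0: slope = 0
```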
op
Simple Regression Using EXCEL

The instructions in Table 7.5 will produce the regression output shown in Table 7.6. If you check the boxes under Residuals and the Line Fit Plots, the residuals and fitted line plot will be displayed.

Table 7.5 EXCEL instructions for regression

1. Label columns A and B of the EXCEL worksheet with Units (x) and Hours (y) and enter the data of Table 7.3, or open the EXCEL data file Hours_Units.xlsx.
2. Click the Data tab on the main menu.
3. Click the Data Analysis tab (on the far right).
4. Select Regression.
5. Select Hours (y) for Input Y Range and Units (x) for Input X Range (including the labels).
6. Check the Labels box.
7. Click on the circle to the left of Output Range, click on the box next to Output Range, and specify where you want to store the output by clicking a blank cell (or select New Worksheet Ply).
8. Check the Line Fit Plot under Residuals. Click OK.

You may check the boxes under Residuals and Normal Probability Plot as desired.

Table 7.6 shows the output with regression statistics. We calculated all of these manually, except the adjusted R-squared, earlier in this chapter. The regression equation can be read from the Coefficients column. The regression coefficients are b0 and b1, the y-intercept and the slope.
Table 7.6 EXCEL regression output

In the Coefficients column, 6.620904991 is the y-intercept and 0.009638772 is the slope. The regression equation from this table is

ŷ = 6.62 + 0.00964x

This is the same equation we obtained earlier using manual calculations.

The Coefficient of Determination (r²) Using EXCEL

The values of SST and SSR were calculated manually earlier. Recall that in regression, the total sum of squares is partitioned into two components, the regression sum of squares (SSR) and the error sum of squares (SSE), giving the following relationship: SST = SSR + SSE. The coefficient of determination r², which is also the measure of goodness of fit for the regression equation, can be calculated using

r² = SSR / SST

The values of SSR, SSE, and SST can be obtained from the ANOVA table, which is part of the regression analysis output of EXCEL. Table 7.7 shows the EXCEL regression output with the SSR and SST values. Using these values, the coefficient of determination is r² = SSR / SST = 0.9458. This value is reported under Regression Statistics in Table 7.7.

The t-test and F-test for the significance of regression can be easily performed using the information under the ANOVA table in the EXCEL output. Table 7.8 shows the EXCEL regression output with the ANOVA table.

(1) Conducting the t-Test Using the Regression Output in Table 7.8

The test statistic for testing the significance of regression is given by the following equation:

tn−2 = b1 / sb1
Table 7.7 EXCEL regression output

Table 7.8 EXCEL regression output
The values of b1, sb1, and the test statistic tn−2 are labeled in Table 7.8.

Using the test statistic value, the hypothesis test for the significance of regression can be conducted. This test is explained here using the computer results. The appropriate hypotheses for the test are:

H0: β1 = 0
H1: β1 ≠ 0

The null hypothesis states that the slope of the regression line is zero. Thus, if the regression is significant, the null hypothesis must be rejected. A convenient way of testing the above hypotheses is to use the p-value approach. The test statistic value tn−2 and the corresponding p-value are reported in the regression output in Table 7.8. Note that the p-value is very close to zero (p = 2.92278E-19). If we test the hypothesis at a 5 percent level of significance (α = 0.05), then p = 0.000 is less than α = 0.05, and we reject the null hypothesis and conclude that the regression is significant overall.
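The slope t-test is straightforward to reproduce. Below is a sketch using the standard textbook identity sb1 = s / √Sxx, with Sxx = Σx² − (Σx)²/n (the chapter refers to sb1 but does not print this formula), together with scipy for the p-value.

```python
import math
from scipy.stats import t as t_dist

# Table 7.4 sums, the fitted slope, and the standard error of the estimate
n, sum_x, sum_x2 = 30, 24132, 20467220
b1, s = 0.00964, 0.4481

# Standard error of the slope: s_b1 = s / sqrt(Sxx), Sxx = sum_x2 - (sum_x)^2 / n
sxx = sum_x2 - sum_x ** 2 / n
s_b1 = s / math.sqrt(sxx)

t_stat = b1 / s_b1                              # test statistic t(n-2)
p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")   # p is near zero, so reject H0
```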
tC

Simple Regression Using MINITAB

The regression results using MINITAB is explained in this section. We


created a scatter plot, a fitted line plot (a plot with the best fitting line)
and the regression results for the data in Table 7.3. We already analyzed
No

the results from EXCEL above.


[Note: Readers can download a free 30 days trial copy of the MINITAB
version 17 or 18 software from www.minitab.com]
The scatter plot shown in Figure 7.10 shows an increasing or direct
relationship between the number of units produced (x) and the number
of hours (y). Therefore, the data may be approximated by a straight line of
Do

the form y = b0 + b1 x where, b0 is the y-intercept and b1 is the slope. The


fitted line plot with the regression equation from MINITAB is shown in
Figure 7.11. Also, the “Regression Analysis” and “Analysis of Variance” ta-
bles shown in Table 7.9 will be displayed. We will first analyze the regres-
sion and the analysis of variance tables and then provide further analysis.

This document is authorized for educator review use only by Jasashwi Mandal, NITIE - National Institute of Industrial Engineering until May 2025. Copying or posting
is an infringement of copyright. [email protected] or 617.783.7860
Figure 7.10 Scatterplot of Hours (y) and Units (x)

Figure 7.11 Fitted line and regression equation
Analysis of Regression Output in Table 7.9

Refer to the Regression Analysis part of the output. In this table, the regression equation is printed as Hours(y) = 6.62 + 0.00964 Units(x). This is the equation of the best-fitting line using the least squares method. Just below the regression equation, a table is printed that describes the model in more detail. The values under the Coef column are the coefficients. The values in this column refer to the regression coefficients b0 and b1, where b0 is the y-intercept or constant and b1 is the slope of the regression line. Under Predictor, the value for Units (x) is 0.0096388, which is b1 (the slope of the fitted line). The Constant is 6.6209. These values form the regression equation.

Table 7.9 The regression analysis and analysis of variance tables using MINITAB

Refer to Table 7.9 above.

1. The regression equation, or the equation of the "best" fitting line, is:

   Hours(y) = 6.62 + 0.00964 Units(x)

   or ŷ = 6.62 + 0.00964x, where y is the hours and x is the units produced.

   This line minimizes the sum of the squares of the errors. This means that when the line is placed over the scatter plot, the sum of the squared vertical distances from the points to the line is a minimum.
   The error, or residual, is the vertical distance of each point from the estimated line. Figure 7.12 shows the least squares line and the residuals. The residual for a point is given by (y − ŷ), the vertical distance of the point from the estimated line.

Figure 7.12 The least squares line and residuals

[Note: The estimated line is denoted by ŷ and the residual for a point yi is given by (yi − ŷi).]

   The estimated least squares line is of the form ŷ = b0 + b1x, where b1 is the slope and b0 is the y-intercept. In the regression equation Hours(y) = 6.62 + 0.00964 Units(x), 6.62 is the y-intercept and 0.00964 is the slope. This line provides the relationship between the hours and the number of units produced. The equation states that for each unit increase in x (the number of units produced), y (the number of hours) will increase by 0.00964.
2. The Standard Error of the Estimate (s)

   The standard error of the estimate measures the variation of the points around the fitted line of regression. It is measured in units of the response or dependent variable (y). In regression analysis, the standard error of the estimate is reported as s. Its value is reported in Table 7.9 under "Regression Analysis":

   s = 0.4481

   A small value of s indicates less scatter of the points around the fitted line of regression.
3. The Coefficient of Determination (r²)

   The coefficient of determination, r², is an indication of how well the independent variable predicts the dependent variable. In other words, it is used to judge the adequacy of the regression model. The value of r² lies between 0 and 1 (0 ≤ r² ≤ 1), or 0 to 100 percent. The closer the value of r² is to 1 or 100 percent, the better the model. The r² value indicates the amount of variability in the data explained by the regression model. In our example, the r² value is 94.6 percent (Table 7.9, Regression Analysis). The value of r² is reported as:

   R-Sq = 94.6%

   This means that 94.6 percent of the variation in the dependent variable y can be explained by the variation in x, and 5.4 percent of the variation is due to unexplained reasons or error.

   The R-Sq(adj) = 94.4 percent next to the value of r² in the regression output is the adjusted R² value. This is the r² value adjusted for the degrees of freedom. This value has more importance in multiple regression.
Model Adequacy Test

To check whether the fitted regression model is adequate, we first review the assumptions on which regression is based, followed by the residual plots that are used to check the model assumptions.

Residuals: A residual or error for any point is the difference between the actual y value and the corresponding estimated value (denoted by y-hat, ŷ). Thus, for a given value of x, the residual is given by e = (y − ŷ).

Assumptions of Regression Model and Checking the Assumptions Using MINITAB Residual Plots

The regression analysis is based on the following assumptions:

(1) independence of errors; (2) normality of errors; (3) the expected values of y fall on the straight line described by the model E(y) = β0 + β1x; (4) equal variance; and (5) linearity.
The assumption regarding the independence of errors can be evaluated by plotting the errors or residuals in the order or sequence in which the data were collected. If the errors are not independent, a relationship exists between consecutive residuals, which is a violation of the assumption of independence of errors. When the errors are not independent, the plot of residuals versus the time (or the order) in which the data were collected will show a cyclical pattern. Meeting this assumption is particularly important when data are collected over a period of time. If the data are collected at different time periods, the errors for a specific time period may be correlated with the errors of the previous time periods.

The normality assumption requires that the errors have a normal or approximately normal distribution. Note that this assumption means that the errors do not deviate too much from normality. The assumption can be verified by plotting the histogram or the normal probability plot of the errors.

The assumption that the variances of the errors are equal (equality of variance) is also known as homoscedasticity. It requires that the errors have constant variance for all values of x, or that the variability of the y values is the same for both low and high values of x. The equality of variance assumption is of particular importance for making inferences about b0 and b1.

The linearity assumption means that the relationship between the variables is linear. This assumption can be verified using the residual plots discussed in the next section.

To check the validity of the above regression assumptions, a graphical approach known as residual analysis is used. Residual analysis is also used to determine whether the selected regression model is an appropriate model.

Checking the Assumptions of Regression Using MINITAB Residual Plots

Several residual plots can be created using EXCEL and MINITAB to check the adequacy of the regression model. The plots are shown in Figures 7.13a through 7.13d. The plots used to check the regression assumptions include the histogram of residuals, the normal plot of residuals, the plot of residuals vs. fits, and the plot of residuals vs. the order of the data.
138 BUSINESS ANALYTICS, VOLUME II

t
vs. order of data. The residuals can also be plotted with each of the in-

os
dependent variables.
Figures 7.13a and 7.13b are used to check the normality assumption.
The regression model assumes that the errors are normally distributed
with mean zero. Figure 7.13a shows the normal probability plot. This plot

rP
is used to check for the normality assumption of regression model. In this
plot, if the plotted points lie on a straight line or close to a straight line
then the residuals or errors are normally distributed. The pattern of points
appear to fall on a straight line indicating no violation of the normality
assumption.

Figure 7.13b shows the histogram of residuals. If the normality assumption holds, the histogram of residuals should look symmetrical or approximately symmetrical. Also, the histogram should be centered at zero because the sum of the residuals is always zero. The histogram of residuals here is approximately symmetrical, which indicates that the errors appear to be approximately normally distributed. Note that the histogram may not be exactly symmetrical; we would like to see a pattern that is symmetrical or approximately symmetrical.
In Figure 7.13c, the residuals are plotted against the fitted values. This plot is used to check the assumption of linearity. The points in this plot should be scattered randomly around the horizontal line drawn through the zero residual value for the linear model to be valid. As can be seen, the residuals are randomly scattered about the horizontal line, indicating that the relationship between x and y is linear.
The plot of residuals vs. the order of the data shown in Figure 7.13d is used to check the independence of errors. The independence of errors can be checked by plotting the errors or residuals in the order or sequence in which the data were collected. The plot of residuals vs. the order of the data should show no pattern or apparent relationship between consecutive residuals. This plot shows no apparent pattern, indicating that the assumption of independence of errors is not violated.
Note that checking the independence of errors is more important when the data are collected over time. Data collected over time may show an autocorrelation effect among successive data values. In these cases, there may be a relationship between consecutive residuals that violates the assumption of independence of errors.

Figure 7.13 Plots for residual analysis
The equality of variance assumption requires that the variance of the errors is constant for all values of x, or that the variability of y is the same for both low and high values of x. This can be checked by plotting the residuals against the order of the data points, as shown in Figure 7.13d. If the equality of variance assumption is violated, this plot will show an increasing trend, indicating increasing variability and a lack of homogeneity in the variances of the y values at each level of x. The plot shows no violation of the equality of variance assumption.
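For readers who want to reproduce these diagnostics outside of MINITAB or EXCEL, the following is a minimal sketch in Python (assuming the numpy, statsmodels, and matplotlib packages are available; the simulated x and y data are hypothetical stand-ins) that produces the four residual plots described above.

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data standing in for a simple regression problem
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 5 + 2 * x + rng.normal(0, 1, 30)

model = sm.OLS(y, sm.add_constant(x)).fit()   # fit y = b0 + b1*x
resid, fitted = model.resid, model.fittedvalues

fig, ax = plt.subplots(2, 2, figsize=(10, 8))
sm.qqplot(resid, line="s", ax=ax[0, 0])       # normal probability plot
ax[0, 0].set_title("Normal probability plot")
ax[0, 1].hist(resid, bins=8)                  # histogram of residuals
ax[0, 1].set_title("Histogram of residuals")
ax[1, 0].scatter(fitted, resid)               # residuals vs. fits
ax[1, 0].axhline(0, color="gray")
ax[1, 0].set_title("Residuals vs. fits")
ax[1, 1].plot(resid, marker="o")              # residuals vs. order of data
ax[1, 1].axhline(0, color="gray")
ax[1, 1].set_title("Residuals vs. order of data")
plt.tight_layout()
plt.show()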

Multiple Regression: Computer Analysis and Results

Introduction to Multiple Regression
In the previous sections we explored the relationship between two variables using simple regression and correlation analysis. We demonstrated how the estimated regression equation can be used to predict a dependent variable (y) using an independent variable (x). We also discussed the correlation between two variables, which explains the degree of association between them. Here we expand the concept of simple linear regression to multiple regression analysis. A multiple linear regression involves one dependent or response variable and two or more independent variables or predictors. The concepts of simple regression discussed earlier also apply to multiple regression.

Multiple Regression Model

The mathematical form of the multiple linear regression model relating the dependent variable y to two or more independent variables x1, x2, …, xk with the associated error term is given by:

y = β0 + β1x1 + β2x2 + β3x3 + … + βkxk + ε    (7.16)

where x1, x2, …, xk are the k independent or explanatory variables; β0, β1, β2, …, βk are the regression coefficients; and ε is the associated
error term. Equation (7.16) can be viewed as a population multiple regression model in which y is a linear function of the unknown parameters β0, β1, β2, …, βk and an error term. The error ε explains the variability in y that cannot be explained by the linear effects of the independent variables. The multiple regression model is similar to the simple regression model except that it involves more than one independent variable.

One of the basic assumptions of regression analysis is that the mean or expected value of the error is zero. This implies that the mean or expected value of y, E(y), in the multiple regression model is given by:

E(y) = β0 + β1x1 + β2x2 + β3x3 + … + βkxk    (7.17)

The above equation, relating the mean value of y to the k independent variables, is known as the multiple regression equation.
It is important to note that β0, β1, β2, …, βk are the unknown population parameters, or regression coefficients, and they must be estimated from the sample data to obtain the estimated equation of multiple regression. The estimated regression coefficients are denoted by b0, b1, b2, …, bk. These are the point estimates of the parameters β0, β1, β2, …, βk. The estimated multiple regression equation using the estimates of the unknown population regression coefficients can be written as:

ŷ = b0 + b1x1 + b2x2 + b3x3 + … + bkxk    (7.18)

where ŷ is the point estimator of E(y), or the estimated value of the response y, and b0, b1, b2, …, bk are the estimated regression coefficients, the estimates of β0, β1, β2, …, βk.

Equation (7.18) is the estimated multiple regression equation and can be viewed as the sample regression model. This equation defines the regression equation for k independent variables. In equation (7.16), β0, β1, β2, …, βk denote the regression coefficients for the population. The sample regression coefficients b0, b1, b2, …, bk are
the estimates of the population parameters and can be determined using the least squares method.

In a multiple linear regression, the variation in y (the response variable) may be explained using two or more independent variables or predictors. The objective is to predict the dependent variable. Compared to simple linear regression, a more precise prediction can often be made because using two or more independent variables allows the model to make use of more information. The simplest form of a multiple linear regression model involves two independent variables and can be written as:

y = β0 + β1x1 + β2x2 + ε    (7.19)

Equation (7.19) describes a plane. In this equation, β0 is the y-intercept of the regression plane. The parameter β1 indicates the average change in y for each unit change in x1 when x2 is held constant. Similarly, β2 indicates the average change in y for each unit change in x2 when x1 is held constant. When there are more than two independent variables, the regression equation of the form described by equation (7.18) is the equation of a hyperplane in an n-dimensional space.

The Least Squares Multiple Regression Model


The regression model is described in the form of a regression equation that is obtained using the least squares method. Recall that in a simple regression, the least squares method requires fitting a line through the data points so that the sum of the squares of the errors or residuals is minimized. These errors or residuals are the vertical distances of the points from the fitted line. The same concept is used to develop the multiple regression equation. In a multiple regression, the least squares method determines the best-fitting plane or hyperplane through the data points, ensuring that the sum of the squares of the vertical distances or deviations from the given points to the plane is a minimum.

Figure 7.14 shows a multiple regression model with two independent variables. The response y with two independent variables x1 and x2 forms a regression plane. The observed data points in the figure are shown as dots. The stars on the regression plane indicate the corresponding points that have identical values of x1 and x2. The vertical distances from the observed points to the points on the plane are shown using vertical lines; these vertical lines are the errors. The error for a particular point yi is denoted by (yi − ŷ), where the estimated value ŷ is calculated using the regression equation ŷ = b0 + b1x1 + b2x2 for given values of x1 and x2.

Figure 7.14 Scatter plot and regression plane with two independent variables


The least squares criterion requires that the sum of the squares of the errors be minimized, that is, minimize

Σ(y − ŷ)²

where y is the observed value and ŷ is the estimated value of the dependent variable given by ŷ = b0 + b1x1 + b2x2.
[Note: The terms independent variables, explanatory variables, and predictors have the same meaning and are used interchangeably in this chapter. The dependent variable is often referred to as the response variable in multiple regression.]

As in simple regression, the least squares method uses the sample data to estimate the regression coefficients b0, b1, b2, …, bk and hence the estimated equation of multiple regression. Figure 7.15 shows the process of estimating the regression coefficients and the multiple regression equation.

Figure 7.15 Process of estimating the multiple regression equation
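The estimation process in Figure 7.15 can also be sketched numerically. The following minimal Python example (the book's own analysis uses MINITAB and EXCEL; the five observations here are hypothetical) computes the least squares coefficients of equation (7.18) by minimizing Σ(y − ŷ)².

import numpy as np

# Hypothetical data: 5 observations, 2 independent variables.
# The first column of 1s corresponds to the intercept b0.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 5.0, 6.0],
              [1.0, 7.0, 2.0],
              [1.0, 8.0, 5.0]])
y = np.array([10.0, 12.0, 20.0, 18.0, 24.0])

# Least squares: choose b to minimize sum((y - X b)^2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)

y_hat = X @ b                                # fitted values (equation 7.18)
print("SSE =", np.sum((y - y_hat) ** 2))     # minimized sum of squared errors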
Models with Two Quantitative Independent Variables x1 and x2
The model with two quantitative independent variables is the simplest multiple regression model. It is a first-order model and is written as:

y = b0 + b1x1 + b2x2    (7.20)

where b0 = y-intercept, the value of y when x1 = x2 = 0
b1 = change in y for a 1-unit increase in x1 when x2 is constant
b2 = change in y for a 1-unit increase in x2 when x1 is constant

The graph of the first-order model is shown in Figure 7.16. This graph with two independent quantitative variables x1 and x2 plots a plane in three-dimensional space. The plane plots the value of y for every combination (x1, x2), corresponding to the points in the (x1, x2) plane.

The first-order model with two quantitative variables x1 and x2 is based on the assumption that there is no interaction between x1 and x2. This means that the effect on the response y of a change in x1 (for a fixed value of x2) is the same regardless of the value of x2, and the effect on y of a change in x2 (for a fixed value of x1) is the same regardless of the value of x1.
For simple regression analysis, we presented both the manual calculations and the computer analysis of the problem. Most of the concepts discussed for simple regression also apply to multiple regression; however, the computations for multiple regression are more involved and require the use of matrix algebra and other mathematical concepts that are beyond the scope of this text. Therefore, we provide computer analysis of multiple linear regression models using EXCEL and MINITAB. This section provides examples with computer instructions and analysis of the computer results. The assumptions and interpretation of multiple linear regression models are similar to those of simple linear regression. As we provide the analysis, we will point out the similarities and differences between the simple and multiple regression models.

Figure 7.16 A multiple regression model with two quantitative variables

Assumptions of Multiple Regression Model


As discussed earlier, the relationship of the response variable (y) to the independent variables x1, x2, …, xk in multiple regression is assumed to be a model of the form y = β0 + β1x1 + β2x2 + β3x3 + … + βkxk + ε, where β0, β1, β2, …, βk are the regression coefficients and ε is the associated error term. The multiple regression model is based on the following assumptions about the error term ε.

1. The independence of errors assumption. Independence of errors means that the errors are independent of each other; that is, the error for one set of values of the independent variables is not related to the error for any other set of values. This assumption is critical when the data are collected over different time periods, because the errors in one time period may be correlated with those of another time period.
2. The normality assumption. This means that the errors or residuals (εi), calculated as (yi − ŷ), are normally distributed. Regression is fairly robust against departures from normality; unless the distribution of errors is extremely different from normal, inferences about the regression parameters β0, β1, β2, …, βk are not seriously affected.

3. The error assumption. The error ε is a random variable with mean or expected value of zero, that is, E(ε) = 0. This implies that the mean value of the dependent variable y, for given values of the independent variables, is the expected or mean value of y, denoted by E(y), and the population regression model can be written as:

E(y) = β0 + β1x1 + β2x2 + β3x3 + … + βkxk

4. Equality of variance assumption. This assumption requires that the variance of the errors (εi), denoted by σ², is constant for all values of the independent variables x1, x2, …, xk. In case of serious departure from the equality of variance assumption, methods such as weighted least squares or data transformation may be used.

[Note: The terms error and residual have the same meaning and are used interchangeably in this chapter.]

Computer Analysis of Multiple Regression



In this section we provide a computer analysis of multiple regression. Due to the complexity involved in the computation, computer software is always used to model and solve regression problems. We discuss the steps using MINITAB and EXCEL.

Problem Description: The home heating cost is believed to be related to the average outside temperature, the size of the house, and the age of the heating furnace. A multiple regression model is to be fitted to investigate the relationship between the heating cost and the three predictors or independent variables. Table 7.10 shows the home heating cost (y) in dollars, average temperature (x1), house size (x2) in thousands of square feet, and age of the furnace (x3) in years. The home heating cost is the response variable and the other three variables are predictors. (The data files for this problem are HEAT_COST.MTW and HEAT_COST.xlsx.)

Table 7.10 Data for home heating cost

Row   Avg Temp   House Size   Age of Furnace   Heating Cost
1     37         3.0          6                210
2     30         4.0          9                365
3     37         2.5          4                182
4     61         1.0          3                65
5     66         2.0          5                82
6     39         3.5          4                205
7     15         4.1          6                360
8     8          3.8          9                295
9     22         2.9          10               235
10    56         2.2          4                125
11    55         2.0          3                78
12    40         3.8          4                162
13    21         4.5          12               405
14    40         5.0          6                325
15    61         1.8          5                82
16    21         4.2          7                277
17    63         2.3          2                99
18    41         3.0          10               195
19    28         4.2          7                240
20    31         3.0          4                144
21    33         3.2          4                265
22    31         4.2          11               355
23    36         2.8          3                175
24    56         1.2          4                57
25    35         2.3          8                196
26    36         3.6          6                215
27    9          4.3          8                380
28    10         4.0          11               300
29    21         3.0          9                240
30    51         2.5          7                130

Constructing Scatter Plots and Matrix Plots


We begin our analysis by constructing scatter plots and matrix plots of the data. These plots provide useful information about the model. We first construct scatterplots of the response (y) versus each of the independent or predictor variables (Figure 7.17). If the scatterplots of y on the independent variables appear linear enough, a multiple regression model can be fitted. Based on the analysis of the scatterplots of y and each of the independent variables, an appropriate model (for example, a first-order model) can be recommended to predict the home heating cost.

A first-order multiple regression model does not include any higher-order terms (e.g., x²). An example of a first-order model with five independent variables can be written as:

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5    (7.21)

The multiple linear regression model is based on the assumption that the relationship between the response and the independent variables is linear. This relationship can be checked using a matrix plot, which investigates the relationships between pairs of variables by creating an array of scatterplots. MINITAB provides two options for constructing the matrix plot: Matrix of Plots and Each Y versus Each X. The first is used to investigate the relationships among pairs of variables when several independent variables are involved. The second produces separate plots of the response y versus each of the explanatory or independent variables.


Recall that in a simple regression, a scatter plot was constructed to investigate the relationship between the response y and the predictor. A matrix plot should be constructed when two or more independent variables are investigated. Before fitting a multiple regression model, a matrix plot is very useful for investigating the relationships between the response and each of the independent or explanatory variables. The plot allows the possible relationships between the response and the independent variables to be visualized graphically. It is also helpful in verifying the linearity assumption of multiple regression and in determining which explanatory variables are good predictors of y. For this example, we have constructed matrix plots using MINITAB.


Figure 7.17 shows such a matrix plot (each y versus each x). In this plot, the response variable y is plotted against each of the independent variables. The plot shows scatterplots for heating cost (y) versus each of the independent variables: average temperature, house size, and age of the furnace.

Figure 7.17 Matrix plot of each y vs. each x

An investigation of the plot shows an inverse relationship between the heating cost and the average temperature (the heating cost decreases as the temperature rises) and a positive relationship between the heating cost and each of the other two variables: house size and age of the furnace. The heating cost increases with increasing house size and also with an older furnace. None of these plots shows a bending (nonlinear or curvilinear) pattern between the response and the explanatory variables; the presence of bending patterns would suggest transformation of variables. The scatterplots in Figure 7.17 (also known as side-by-side scatterplots) show a linear relationship between the response and each of the explanatory variables, indicating that all three explanatory variables could be good predictors of the home heating cost. In this case, a multiple linear regression would be an adequate model for predicting the heating cost.
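As a rough equivalent of MINITAB's Each Y versus Each X option, the following is a minimal Python sketch (the column names are assumptions about how the Table 7.10 data might be stored in the HEAT_COST.xlsx file).

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("HEAT_COST.xlsx")   # data file named in the text
predictors = ["Avg Temp", "House Size", "Age of Furnace"]   # assumed column names

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, name in zip(axes, predictors):
    ax.scatter(df[name], df["Heating Cost"])   # y vs. each predictor
    ax.set_xlabel(name)
axes[0].set_ylabel("Heating Cost")
plt.tight_layout()
plt.show()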
Matrix of Plots: Simple

Another variation of the matrix plot is known as "matrix of plots" in MINITAB and is shown in Figure 7.18. This plot provides scatterplots that are helpful in visualizing not only the relationship of the response variable with each of the independent variables but also the interaction effects between the variables. It can be used when a more detailed model beyond a first-order model is of interest. Note that the first-order model is the one that contains only first-order terms, with no square or interaction terms, and is written as y = b0 + b1x1 + b2x2 + … + bkxk.
The matrix plot in Figure 7.18 is a table of scatterplots, with each cell showing a scatterplot of the variable labeled for the column versus the variable labeled for the row. The cell in the first row and first column displays the scatterplot of heating cost (y) versus average temperature (x1). The plot in the second row and first column is the scatterplot of heating cost (y) and house size (x2), and the plot in the third row and first column shows the scatterplot of heating cost (y) and age of the furnace (x3).

Figure 7.18 Matrix plot

The second column and second row of the matrix plot show a scatterplot displaying the relationship between average temperature (x1) and house size (x2). The scatterplots showing the relationships between the pairs of independent variables are obtained from columns 2 and 3 of the matrix plot. The matrix plot is helpful in visualizing the interaction relationships. For fitting a first-order model, a plot of y versus each x is adequate.

The matrix plots in Figures 7.17 and 7.18 show a negative association between the heating cost (y) and the average temperature (x1), and a positive association between the heating cost (y) and the other two explanatory variables: house size (x2) and age of the furnace (x3). All these relationships are linear, indicating that all three explanatory variables can be used to build a multiple regression model. Constructing the matrix plot and investigating the relationships between the variables can be very helpful in building a correct regression model.

Multiple Linear Regression Model


Since a first-order model can adequately predict the home heating cost, we will fit a multiple linear regression model of the form

y = b0 + b1x1 + b2x2 + b3x3

where

y = home heating cost (in dollars), x1 = average temperature (in °F)
x2 = size of the house (in thousands of square feet), x3 = age of the furnace (in years)

Table 7.10 and the data file HEAT_COST.MTW contain the data for this problem. We used MINITAB to run the regression model. Table 7.11 shows the results of running the multiple regression using MINITAB. In this table, we have marked some of the calculations (e.g., b0, b1, sb0, sb1) for clarity and explanation; these are not part of the computer output. The regression computer output has two parts: Regression Analysis and Analysis of Variance.

Table 7.11 MINITAB regression analysis results

The Regression Equation
Refer to the "Regression Analysis" part of Table 7.11 for the analysis. Since there are three independent or explanatory variables, the regression equation is of the form:

y = b0 + b1x1 + b2x2 + b3x3

The regression equation from the computer output is

Heating Cost = 44.4 − 1.65 Avg. Temp + 57.5 House Size + 7.91 Age of Furnace    (7.22)

or

ŷ = 44.4 − 1.65x1 + 57.5x2 + 7.91x3    (7.23)

where y is the response variable (heating cost) and x1, x2, x3 are the independent variables described above. The regression coefficients b0, b1, b2, b3 are reported under the column Coef; in the regression equation these coefficients appear in rounded form. The regression equation, stated in the form of equation (7.22) or (7.23), is the estimated regression equation relating the heating cost to the three independent variables.
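As a cross-check of the MINITAB output in Table 7.11, a minimal Python/statsmodels sketch of the same fit follows. The file layout and column names are assumptions; the fitted coefficients should agree with equation (7.22) up to rounding.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_excel("HEAT_COST.xlsx")   # assumed to hold the four Table 7.10 data columns
df.columns = ["AvgTemp", "HouseSize", "FurnaceAge", "HeatCost"]   # assumed order

model = smf.ols("HeatCost ~ AvgTemp + HouseSize + FurnaceAge", data=df).fit()
print(model.params)     # b0, b1, b2, b3: compare with 44.4, -1.65, 57.5, 7.91
print(model.summary())  # also reports s, R-sq, the F-test, and the t-tests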

Interpreting the Regression Equation

Equation (7.22) or (7.23) can be interpreted in the following way:

• b1 = −1.65 means that for each unit increase in the average temperature (x1), the heating cost y is predicted to go down by $1.65 when the house size (x2) and the age of the furnace (x3) are held constant.
• b2 = +57.5 means that for each unit increase in the house size (x2, in thousands of square feet), the heating cost y is predicted to go up by $57.50 when the average temperature (x1) and the age of the furnace (x3) are held constant.
• b3 = +7.91 means that for each unit increase in the age of the furnace (x3, in years), the heating cost y is predicted to go up by $7.91 when the average temperature (x1) and the house size (x2) are held constant.
op
Standard Error of the Estimate(s) and Its Meaning
The standard error of the estimate, or the standard deviation of the model, s, is a measure of the scatter or variation of the points around the regression hyperplane. A small value of s is desirable for a good regression model; the estimation of y is more accurate for smaller values of s. The value of the standard error of the estimate is reported in the regression analysis (see Table 7.11) and is measured in the units of the response variable (y). For our example, the standard error of the estimate is

s = 37.32 dollars

The standard error of the estimate is used to check the utility of the model and to provide a measure of the reliability of predictions made from the model. One interpretation of s is that the interval ±2s approximates the accuracy with which the regression model will predict the future value of the response y for given values of the independent variables. Thus, for our example, we can expect the model to provide predictions of heating cost (y) to within

±2s = ±2(37.32) = ±74.64 dollars.

The Coefficient of Multiple Determination (r2)

os
The coefficient of multiple determination is often used to check the ad-
equacy of the regression model. The value of r2 lies between 0 and 1, or
0 percent and 100 percent, that is, 0 ≤ r2 ≤ 1. It indicates the fraction
of total variation of the dependent variable y that is explained by the in-

rP
dependent variables or predictors. Usually, closer the value of r2 to 1 or
100 percent; stronger is the model. However, one should be careful in
drawing conclusions based solely on the value of r2. A large value of r2
does not necessarily mean that the model provides a good fit to the data.
In case of multiple regression, addition of a new variable to the model

yo
always increases the value of r2 even if the added variable is not statistically
significant. Thus, addition of a new variable will increase r2 indicating a
stronger model but may lead to poor predictions of new values. The value
of r2 can be calculated using the expression
r2 = 1 − SSE/SST = 1 − 36,207/301,985 = 0.88

r2 = SSR/SST = 265,777/301,985 = 0.88

In the above equations, SSE is the sum of squares of errors (unexplained variation), SST is the total sum of squares, and SSR is the sum of squares due to regression (explained variation). These values can be read from the "Analysis of Variance" part of Table 7.11. The value of r2 is calculated and reported in the "Regression Analysis" part of Table 7.11. For our example, the coefficient of multiple determination r2 (reported as R-sq) is

r2 = 88.0%

This means that 88.0 percent of the variability in y is explained by the three independent variables used in the model. Note that r2 = 0 implies a complete lack of fit of the model to the data, whereas r2 = 1 implies a perfect fit.

The value of r2 = 88.0% for our example implies that using the three independent variables (average temperature, size of the house, and age of the furnace) in the model, 88.0 percent of the total variation in heating cost (y) can be explained. The statistic r2 tells how well the model fits the data and thus provides a measure of the overall predictive usefulness of the model. The adjusted R2 is used when comparing two regression models that have the same response variable but a different number of independent variables or predictors.
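A minimal sketch of these calculations (using the SSE and SST values from the Analysis of Variance part of Table 7.11) shows how r2 and the adjusted R2 are related:

# Values from the Analysis of Variance part of Table 7.11
SSE, SST = 36_207, 301_985    # unexplained and total variation
n, k = 30, 3                  # observations and independent variables

r_sq = 1 - SSE / SST                                   # = SSR/SST = 0.88
adj_r_sq = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))   # penalizes extra predictors
print(round(r_sq, 3), round(adj_r_sq, 3))              # 0.88 0.866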

Hypothesis Tests in Multiple Regression

In multiple regression, two types of hypothesis tests are conducted to measure the model adequacy:

1. A hypothesis test for the overall usefulness, or significance, of the regression
2. Hypothesis tests on the individual regression coefficients

The test for the overall significance of the regression is conducted using the information in the "Analysis of Variance" part of Table 7.11. The information contained in the "Regression Analysis" part of this table is used to conduct the tests on the individual regression coefficients using the "T" or "p" columns. These tests are explained below.

Testing the Overall Significance of Regression

Recall that in simple regression analysis, we tested for significance using a t-test and an F-test. Both tests in simple regression provide the same conclusion: if the null hypothesis is rejected, we conclude that the slope is not zero, that is, β1 ≠ 0. In multiple regression, the t-test and the F-test have somewhat different interpretations. These tests have the following objectives:

1. The F-test in a multiple regression is used to test the overall significance of the regression. This test determines whether a significant relationship exists between the response variable y and the set of independent variables, or predictors, x1, x2, …, xk.

2. If the conclusion of the F-test indicates that the regression is significant overall, then a separate t-test is conducted for each of the independent variables to determine whether each is significant.

Both the F-test and the t-test are explained below.

F-Test
The null and alternate hypotheses for the multiple regression model y = b0 + b1x1 + b2x2 + … + bkxk are stated as:

H0: β1 = β2 = … = βk = 0 (regression is not significant)
H1: at least one of the coefficients is nonzero    (7.24)

If the null hypothesis H0 is rejected, we conclude that at least one of the independent variables x1, x2, …, xk contributes significantly to the prediction of the response variable y. If H0 is not rejected, then none of the independent variables contributes to the prediction of y. The test statistic for this hypothesis is an F-statistic given by

F = MSR/MSE    (7.25)

where MSR = mean square due to regression (explained variability) and MSE = mean square error (unexplained variability). In equation (7.25), the larger the explained portion of the total variability, the larger the F-statistic. The values of MSR, MSE, and the F-statistic are calculated in the "Analysis of Variance" table of the multiple regression computer output (see Table 7.12 below).
The critical value for the test is given by Fk, n−(k+1), α, where k is the number of independent variables, n is the number of observations in the model, and α is the level of significance. Note that k and (n − k − 1) are the degrees of freedom associated with MSR and MSE, respectively. The null hypothesis is rejected if F > Fk, n−(k+1), α, where F is the calculated F value, or test statistic value, in the Analysis of Variance table.

Table 7.12 Analysis of variance table

Test the Overall Significance of Regression for the Example Problem at a 5 Percent Level of Significance
Step 1: State the Null and Alternate Hypotheses

For the overall significance of regression, the null and alternate hypotheses are:

H0: β1 = β2 = … = βk = 0 (regression is not significant)
H1: at least one of the coefficients is nonzero    (7.26)

Step 2: Specify the Test Statistic to Test the Hypothesis

The test statistic is given by

F = MSR/MSE    (7.27)

The value of the F statistic is obtained from the "Analysis of Variance" (ANOVA) table of the computer output. We have reproduced the Analysis of Variance part of the output in Table 7.12. In this table the labels k, [n − (k + 1)], SSR, SSE, etc. are added for explanation purposes; they are not part of the computer results.

In the ANOVA table, the first column refers to the sources of variation, DF = degrees of freedom, SS = sum of squares, MS = mean square, F = the F statistic, and p is the probability or p-value associated with the calculated F statistic.
The degrees of freedom (DF) for Regression and Error are k and n − (k + 1), respectively, where k is the number of independent variables (k = 3 for our example) and n is the number of observations (n = 30). Also, the total sum of squares (SST) is partitioned into two parts, the sum of squares due to regression (SSR) and the sum of squares due to error (SSE), with the following relationship:

SST = SSR + SSE

We have labeled the SST, SSR, and SSE values in Table 7.12. The mean square due to regression (MSR) and the mean square due to error (MSE) are calculated using the following relationships:

MSR = SSR/k and MSE = SSE/(n − k − 1)

The F-test statistic is calculated as F = MSR/MSE.


Step 3: Determine the Value of the Test Statistic

The test statistic value, or the F statistic, from the ANOVA table (see Table 7.12) is

F = 63.62

Step 4: Specify the Critical Value



The critical value is given by

Fk , n − (k −1), ˙ = F3, 26,0.05 = 2.74 (From the F-table)

Step 5: Specify the Decision Rule



Reject H0 if F > Fcritical

Step 6: Reach a Decision and State Your Conclusion

The calculated F statistic value is 63.62. Since F = 63.62 > Fcritical = 2.98, we reject the null hypothesis stated in equation (7.26) and conclude that the regression is significant overall. This indicates that a significant relationship exists between the dependent and independent variables.

Alternate Method of Testing the Above Hypothesis

The hypothesis stated in equation (7.26) can also be tested using the p-value approach. The decision rule is:

If p ≥ α, do not reject H0
If p < α, reject H0

From Table 7.12, the calculated p-value is 0.000 (see the P column). Since p = 0.000 < α = 0.05, we reject the null hypothesis H0 and conclude that the regression is significant overall.
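For reference, a minimal Python/SciPy sketch of this F-test, using the values reported above, follows.

from scipy import stats

n, k = 30, 3          # observations, independent variables
F = 63.62             # F = MSR/MSE from the ANOVA table (Table 7.12)
alpha = 0.05

F_crit = stats.f.ppf(1 - alpha, k, n - k - 1)   # critical value F(3, 26, 0.05)
p_value = stats.f.sf(F, k, n - k - 1)           # right-tail p-value
print(F > F_crit, round(F_crit, 2), p_value)    # True -> reject H0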

Hypothesis Tests on Individual Regression Coefficients


t-tests

If the F-test shows that the regression is significant, a t-test on the individual regression coefficients is conducted to determine whether a particular independent variable is significant. We are often interested in determining which of the independent variables contribute to the prediction of y. To determine this, the following hypothesis test can be conducted:

H0: βj = 0
H1: βj ≠ 0    (7.28)

This hypothesis tests an individual regression coefficient. If the null hypothesis H0 is rejected, the independent variable xj is significant and contributes to the prediction of y. On the other hand, if H0 is not rejected, then xj is not a significant variable and can be deleted from the model or investigated further. The test is repeated for each of the independent variables in the model.

Table 7.13 MINITAB regression analysis results

This hypothesis test also helps determine whether the model can be made more effective by deleting certain independent variables or by adding extra variables. The information needed to conduct the hypothesis test for each of the independent variables is contained in the "Regression Analysis" part of the computer output, reproduced in Table 7.13. The columns labeled T and p are used to test the hypotheses. Since there are three independent variables, we will test whether each of the three variables is significant; that is, whether each contributes to the prediction of y. The hypothesis to be tested and the test procedure are explained below. We will use a significance level of α = 0.05 for each of the independent variables.

Test the Hypothesis That Each of the Three Independent Variables Is Significant at a 5 Percent Level of Significance
Test for the significance of x1, or Average Temperature
Step 1: State the null and alternate hypotheses. The null and alternate hypotheses are:

H0: β1 = 0 (x1 is not significant; x1 does not contribute to the prediction of y)
H1: β1 ≠ 0 (x1 is significant; x1 does contribute to the prediction of y)    (7.29)

Step 2: Specify the test statistic to test the hypothesis. The test statistic is given by

t = b1/sb1    (7.30)

where b1 is the estimate of the slope β1 and sb1 is the estimated standard deviation of b1.
Step 3: Determine the value of the test statistic. The values b1, sb1, and t are all reported in the Regression Analysis part of Table 7.13. From this table, the values for the variable x1, the average temperature (Avg. Temp.), are

b1 = −1.6457, sb1 = 0.6967

and the test statistic value is

t = b1/sb1 = −1.6457/0.6967 = −2.36

This value is reported under the T column.



Step 4: Specify the critical value


The critical values for the test are given by

±tα/2, [n − (k + 1)]

which is the t-value from the t-table for [n − (k + 1)] degrees of freedom and α/2, where n is the number of observations (n = 30), k is the number of independent variables (k = 3), and α is the level of significance (0.05 in this case). Thus,

tα/2, [n − (k + 1)] = t0.025, [30 − (3 + 1)] = t0.025, 26 = 2.056 (from the t-table)

Step 5: Specify the decision rule. The decision rule for the test is:

Reject H0 if t > +2.056 or if t < −2.056

Step 6: Reach a decision and state your conclusion

The test statistic value (T value) for the variable average temperature (x1) from Table 7.13 is −2.36. Since t = −2.36 < tcritical = −2.056, we reject the null hypothesis H0 (stated in equation 7.29) and conclude that the variable average temperature (x1) is significant and does contribute to the prediction of y.
The significance of the other independent variables can be tested in the same way. The test statistic or t values for all the independent variables are reported in Table 7.13 under the T column. The critical values for testing each independent variable are the same as in the test for the first independent variable above:

tα/2, [n − (k + 1)] = t0.025, [30 − (3 + 1)] = t0.025, 26 = ±2.056



Alternate Way of Testing the Above Hypothesis


The hypothesis stated in equation (7.29) can also be tested using the p-value approach. The decision rule is:

If p ≥ α, do not reject H0
If p < α, reject H0    (7.31)

From Table 7.13, the p-value for the variable average temperature (Avg. Temp., x1) is 0.026. Since p = 0.026 < α = 0.05, we reject H0 and conclude that the variable average temperature (x1) is a significant variable.
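A minimal Python/SciPy sketch of this t-test for x1, using b1 and sb1 from Table 7.13, follows.

from scipy import stats

n, k = 30, 3
b1, sb1 = -1.6457, 0.6967            # estimate and standard error (Table 7.13)
t = b1 / sb1                         # test statistic, equation (7.30)

df = n - (k + 1)                               # 26 degrees of freedom
t_crit = stats.t.ppf(1 - 0.05 / 2, df)         # 2.056
p_value = 2 * stats.t.sf(abs(t), df)           # two-tailed p-value, about 0.026
print(round(t, 2), round(t_crit, 3), round(p_value, 3))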

Test for the other independent variables

The other two independent variables are
x2 = size of the house (House Size)
x3 = age of the furnace
Table 7.14 Summary table

Independent Variable    p-value from Table 7.13   Compare p to α   Decision    Significant? (Yes/No)
Av. Temp. (x1)          0.026                     p < α            Reject H0   Yes
House Size (x2)         0.000                     p < α            Reject H0   Yes
Age of Furnace (x3)     0.024                     p < α            Reject H0   Yes

It is usually more convenient to test the hypotheses using the p-value approach. Table 7.14 provides a summary of the p-value tests for all three independent variables. The significance level α is 0.05 for all the tests. The hypotheses can be stated as:

H0: βj = 0 (xj is not a significant variable)
H1: βj ≠ 0 (xj is a significant variable)


where j = 1, 2, 3 for our example. From Table 7.14 it can be seen that all the independent variables are significant. This means that all three independent variables contribute to predicting the response variable y, the heating cost.


Note: The above method of conducting t-tests on each β parameter in a model is not the best way to determine whether the overall model provides information for the prediction of y. In this method, we need to conduct a t-test for each independent variable to determine whether the variable is significant. Conducting a series of t-tests increases the likelihood of making an error in deciding which variables to retain in the model and which to exclude. For example, suppose we fit a first-order model like the one in this example with 10 independent variables and decide to conduct t-tests on all 10 of the β's, each at α = 0.05. Each test has a 5 percent chance of making a wrong decision (a Type I error, the probability of rejecting a true null hypothesis) and a 95 percent chance of making a right decision. If 10 independent tests are conducted, the probability that all decisions are correct drops to approximately 60 percent [(0.95)^10 ≈ 0.599]. This means that even if all the β parameters (except β0) are equal to 0, approximately 40 percent of the time the null hypothesis will be rejected incorrectly at least once, leading to the conclusion that some β differs from 0. Thus, in multiple regression models where a large number of independent variables are involved and a series of t-tests are conducted, there is a chance of including a large number of insignificant variables and excluding some useful ones from the model. To assess the utility of a multiple regression model, we need a test that includes all the β parameters simultaneously: the test of the overall significance of the multiple regression model. Another useful measure of the utility of the model is a statistical quantity such as R2 that measures how well the model fits the data.
A Note on Checking the Utility of a Multiple Regression Model (Checking the Model Adequacy)

Step 1. To test the overall adequacy of a regression model, first test the following null and alternate hypotheses:

H0: β1 = β2 = … = βk = 0 (no relationship)
H1: at least one of the coefficients is nonzero

A) If the null hypothesis is rejected, there is evidence that not all the β parameters are zero and the model is adequate. Go to step 2.
B) If the null hypothesis is not rejected, the overall regression model is not adequate. In this case, fit another model with more independent variables or consider higher-order terms.


Step 2. If the overall model is adequate, conduct t-tests on the β parameters of interest, or the parameters considered most important in the model. Avoid conducting a series of t-tests on the β parameters, as this increases the probability of a Type I error, α.

Multicollinearity and Autocorrelation in Multiple Regression
Multicollinearity is a measure of correlation among the predictors in a regression model. Multicollinearity exists when two or more independent variables in the regression model are correlated with each other. In practice, it is not unusual to see correlations among the independent variables. However, if serious multicollinearity is present, it may cause problems by increasing the variance of the regression coefficients, making them unstable and difficult to interpret. Also, highly correlated independent variables increase the likelihood of rounding errors in the calculation of the β estimates and standard errors. In the presence of multicollinearity, the regression results may be misleading.

Effects of Multicollinearity

A) Consider a regression model where the production cost (y) is related to three independent variables: machine hours (x1), material cost (x2), and labor hours (x3):

y = β0 + β1x1 + β2x2 + β3x3

The MINITAB computer output for this model is shown in Table 7.15. If we perform t-tests for β1, β2, and β3, we find that all three independent variables are nonsignificant at α = 0.05, while the F-test for H0: β1 = β2 = β3 = 0 is significant (see the p-value in the Analysis of Variance results shown in Table 7.15).


The results appear contradictory, but in fact they are not. The tests on the individual βi parameters indicate that the contribution of one variable, say x1 = machine hours, is not significant after the effects of x2 = material cost and x3 = labor hours have been accounted for. However, the result of the F-test indicates that at least one of the three variables is significant, or is making a contribution to the prediction of the response y. It is also possible that two or all three of the variables contribute to the prediction of y. Here, the contribution of one variable overlaps with that of the other variable or variables. This is the multicollinearity effect.

B) Multicollinearity may also have an effect on the signs of the parameter estimates. For example, refer to the regression equation in Table 7.15, in which the production cost (y) is related to the three explanatory variables machine hours (x1), material cost (x2), and labor hours (x3). The regression model indicates that for each unit increase in machine hours, the production cost (y) decreases when the other two factors are held constant. However, we would expect the production cost (y) to increase as more machine hours are used. This may be due to the presence of multicollinearity: in its presence, the value of a β parameter may have the opposite sign from what is expected.

Table 7.15 Regression Analysis: PROD COST vs. MACHINE HOURS, MATERIAL COST, and LABOR HOURS

One way of avoiding multicollinearity in regression is to conduct designed experiments and select the levels of the factors so that they are uncorrelated. This may not be possible in many situations. It is not unusual to have correlated independent variables; therefore, it is important to detect the presence of multicollinearity and make the necessary modifications in the regression analysis.

Detecting Multicollinearity

Several methods are used to detect the presence of multicollinearity in regression. We will discuss two of them.

1. Detecting multicollinearity using the Variance Inflation Factor (VIF): MINITAB provides an option to calculate the variance inflation factor (VIF) for each predictor variable, which measures how much the variance of the estimated regression coefficients is inflated compared to when the predictor variables are not linearly related. Use the guidelines in Table 7.16 to interpret the VIF.

Table 7.16 Detecting correlation using VIF values

Values of VIF               Predictors are…
VIF = 1                     Not correlated
1 < VIF < 5                 Moderately correlated
VIF = 5 to 10 or greater    Highly correlated

VIF values greater than 10 may indicate that multicollinearity is unduly influencing your regression results. In this case, you may want to reduce multicollinearity by removing unimportant independent variables from your model. Refer to Table 7.15 for the VIF values for the production cost example. The VIF value for each predictor is greater than 10, indicating the presence of multicollinearity; the predictors are highly correlated. The VIF for each of the independent variables is calculated automatically when a multiple regression model is run using MINITAB.
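A minimal Python/statsmodels sketch of the VIF calculation follows; the data file and column names are hypothetical stand-ins for the production cost data.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("prod_cost.csv")    # hypothetical data file
X = add_constant(df[["MachineHours", "MaterialCost", "LaborHours"]])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on the remaining predictors; index 0 is the constant, so skip it.
for j, name in enumerate(X.columns[1:], start=1):
    print(name, round(variance_inflation_factor(X.values, j), 2))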

Detecting Multicollinearity by Calculating the Coefficient of Correlation, r

A simple way of determining multicollinearity is to calculate the coefficient of correlation, r, between each pair of predictor or independent variables in the model. The degree of multicollinearity depends on the magnitude of r. Use Table 7.17 as a guide to determine the degree of multicollinearity.
Table 7.18 shows the correlation coefficient, r, between each pair of predictors for the production cost example.

Table 7.17 Determining multicollinearity using the correlation coefficient, r

Correlation Coefficient, r
r ≥ 0.8           Extreme multicollinearity
0.2 ≤ r < 0.8     Moderate multicollinearity
r < 0.2           Low multicollinearity

Table 7.18 Correlation coefficient between pairs of variables

Correlations: Machine Hours, Material Cost, Labor Hours

                 Machine Hours    Material Cost
Material Cost    0.964
Labor Hours      0.953            0.917

Cell Contents: Pearson correlation

These values of r show that the variables are highly correlated; the correlation coefficient matrix was calculated using MINITAB.
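The same correlation matrix can be sketched in Python with pandas, using the same hypothetical file and column names as in the VIF sketch above:

import pandas as pd

df = pd.read_csv("prod_cost.csv")    # hypothetical data file
print(df[["MachineHours", "MaterialCost", "LaborHours"]].corr(method="pearson"))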
Summary of the Key Features of the Multiple Regression Model
The sections above extended the concept of simple linear regression and provided an in-depth analysis of the multiple regression model, one of the most widely used prediction techniques in data analysis and decision making. The multiple regression model explores the relationship between a response variable and two or more independent variables or predictors. The sections provided computer analysis and interpretation of multiple regression models. Several examples of matrix plots, which are helpful in the initial stages of model building, were presented. Using the computer results, the following key features of the multiple regression model were explained: (a) the multiple regression equation and its interpretation; (b) the standard error of the estimate, a measure used to check the utility of the model and to provide a measure of the reliability of predictions made from the model; and (c) the coefficient of multiple determination, r2, which explains the variability in the response y explained by the independent variables used in the model. Besides these, we discussed the hypothesis tests using the computer results. Step-by-step instructions were provided to conduct the F-test and t-tests. The overall significance of the regression model is tested using the F-test. The t-test is conducted
This document is authorized for educator review use only by Jasashwi Mandal, NITIE - National Institute of Industrial Engineering until May 2025. Copying or posting
is an infringement of copyright. [email protected] or 617.783.7860
REgRESSION ANALYSIS ANd MOdELINg 171

t
on individual predictor or the independent variable to determine the sig-

os
nificance of that variable. The effect of multicollinearity and detection of
multicollinearity using computer were discussed with examples.

Model Building and Computer Analysis
Introduction to Model Building

In the previous chapters, we discussed simple and multiple regression


where we provided detailed analysis of these techniques including the

yo
analysis and interpretation of computer results. In both the simple and
multiple regression models, the relationship among the variables is linear.
In this chapter we will provide an introduction to model building and
nonlinear regression models. By model building, we mean selecting the
model that will provide a good fit to a set of data, and the one that will
op
provide a good estimate of the response or the dependent variable, y that
is related to independent variables or factors x1, x2, …xn. It is important
to choose the right model for the data.
In regression analysis, the dependent or response variable is usually quantitative. The independent variables may be either quantitative or qualitative. A quantitative variable is one that assumes numerical values or can be expressed as numbers. A qualitative variable may not assume numerical values.
In experimental situations we often encounter both quantitative and qualitative variables. In the model building examples, we will show later how to deal with qualitative independent variables.

Model with a Single Quantitative Independent Variable

The models relating the dependent variable y to a single quantitative independent variable x are derived from the polynomial of the form:

y = b0 + b1x + b2x^2 + b3x^3 + … + bnx^n   (7.32)

In the above equation, n is an integer and b0, b1, …, bn are unknown parameters that must be estimated.

A) First-order Model

The first-order model is given by:

y = b0 + b1x
or, with several independent variables, y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn   (7.33)

where b0 = y-intercept and the bi are the regression coefficients.
B) Second-order Model

A second-order model can be written as

y = b0 + b1x + b2x^2   (7.34)

Equation (7.34) is a parabola in which:

b0 = y-intercept; b1 = shift parameter (a change in the value of b1 shifts the parabola to the left or right; increasing the value of b1 causes the parabola to shift to the left); b2 = rate of curvature.

The second-order model is a parabola. If b2 > 0 the parabola opens up; if b2 < 0, the parabola opens down. The two cases are shown in Figure 7.19.

Figure 7.19 The second-order model


C) Third-order Model

A third-order model can be written as:

y = b0 + b1x + b2x^2 + b3x^3   (7.35)

where b0 is the y-intercept and b3 controls the rate of reversal of the curvature of the curve.

A second-order model has no reversal in curvature. In a second-order model, the y value either continues to increase or decrease as x increases and produces either a trough or a peak. A third-order model produces one reversal in curvature, with one peak and one trough. Reversals in curvature are not very common but can be modeled using a third or higher order polynomial. The graph of an nth-order polynomial contains at most (n − 1) peaks and troughs. Figure 7.20 shows the graph of a third-order polynomial. In real-world situations, the second-order model is perhaps the most useful.


Figure 7.20 The third-order model
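A quick way to compare polynomial orders on a given data set is numpy's polyfit. The sketch below, with hypothetical data, fits first-, second-, and third-order models and prints the residual sum of squares for each; a large drop from one order to the next suggests the higher-order term is doing real work.

import numpy as np

# Hypothetical (x, y) data for illustration only
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
y = np.array([2.1, 3.9, 6.2, 9.8, 15.1, 21.9, 30.2, 40.1, 51.8])

for order in (1, 2, 3):
    # full=True also returns the residual sum of squares of the fit
    coeffs, rss, *_ = np.polyfit(x, y, order, full=True)
    print(order, np.round(coeffs, 3), rss)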

Example: A Quadratic (Second-Order) Model



The life of an electronic component is believed to be related to the temperature in the operating environment. Table 7.19 shows 25 observations (Data File: COMP_LIFE) of the life of the components (in hours) and the corresponding operating temperature (in °F). We would like to fit a model to predict the life of the component. In this case, the life of the component is the dependent variable (y) and the operating temperature is the independent variable (x).
Figure 7.21 shows the scatter plot of the data in Table 7.19. From the scatter plot, we can see that the data can be well approximated by a quadratic model.
We used MINITAB and EXCEL to fit a second-order model to the data. The analysis of the computer results is presented below.

Table 7.19 Life of electronic components

Obs.        1     2     3     4     5     6     7     8     9    10
X (Temp.)  99   101   100   113    72    93    94    89    95   111
Y (Life)  141.0 136.7 145.7 194.3 101.5 121.4 123.5 118.4 137.0 183.2

Obs.       11    12    13    14    15    16    17    18    19    20
X (Temp.)  72    76   105    84   102   103    92    81    73    97
Y (Life)  106.6  97.5 156.9 111.2 158.2 155.1 119.7 105.9 101.3 140.1

Obs.       21    22    23    24    25
X (Temp.) 105    90    94    79    91
Y (Life)  148.6 116.4 121.5 108.9 110.1


Figure 7.21 Scatter Plot of Life (y) vs. Operating Temp. (x)

Second-Order Model Using MINITAB

A second-order model was fitted using MINITAB. The regression output of the model is shown in Table 7.20.
A quadratic model in MINITAB can also be run using the fitted line plot option. The results of the quadratic model using this option provide a fitted line plot (shown in Figure 7.22).
While running the quadratic model, the data values and residuals can be stored and the plots of the residuals created.

Table 7.20 Computer results of second order model

Figure 7.22 Regression Plot with Equation

Residual Plots for the above Example Using MINITAB


Figure 7.23 shows the residual plots for this quadratic model. The residual plots are useful in checking the assumptions of the model and the model adequacy.
The analysis of the residual plots for this model is similar to that for the simple and multiple regression models. The investigation of the plots shows that the normality assumption is met. The plot of residuals versus the fitted values shows a random pattern, indicating that the quadratic model fitted to the data is adequate.

Figure 7.23 Residual plots for the quadratic model example

Running a Second-Order Model Using EXCEL

Unlike MINITAB, EXCEL does not provide a direct option to run a quadratic model of the form

y = b0 + b1x + b2x^2

However, we can run a quadratic regression model by calculating the x^2 column from the x column in the data file. The EXCEL computer results are shown in Table 7.21.
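The add-a-squared-column idea carries over directly to other tools. As a cross-check, here is a short Python sketch (assuming the statsmodels library) that fits the quadratic model to the data of Table 7.19; it should closely reproduce the coefficients, r², and t statistics reported in the MINITAB and EXCEL outputs.

import numpy as np
import statsmodels.api as sm

# Operating temperature (x) and component life (y) from Table 7.19
x = np.array([99, 101, 100, 113, 72, 93, 94, 89, 95, 111,
              72, 76, 105, 84, 102, 103, 92, 81, 73, 97,
              105, 90, 94, 79, 91], dtype=float)
y = np.array([141.0, 136.7, 145.7, 194.3, 101.5, 121.4, 123.5, 118.4, 137.0, 183.2,
              106.6, 97.5, 156.9, 111.2, 158.2, 155.1, 119.7, 105.9, 101.3, 140.1,
              148.6, 116.4, 121.5, 108.9, 110.1])

# Design matrix with an intercept, x, and the computed x^2 column
X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
print(fit.params)    # expect roughly b0 = 433, b1 = -8.89, b2 = 0.0598
print(fit.rsquared)  # expect roughly 0.959
print(fit.tvalues)   # the t statistic for the x^2 term should be about 7.93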

Analysis of Computer Results of Tables 7.20 and 7.21


Refer to the MINITAB output in Table 7.20 or the EXCEL computer output in Table 7.21. The prediction equation can be written from the coefficients column:

ŷ = 433 − 8.89x + 0.0598x^2

In the EXCEL output, the prediction equation can be read from the "coefficients" column.
The r² value is 95.9 percent, which is an indication of a strong model. It indicates that 95.9 percent of the variation in y can be explained by the variation in x, and 4.1 percent of the variation is unexplained or due to error. The equation can be used to predict the life of the components at a specified temperature.
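For example, at an operating temperature of x = 100 °F, the predicted life using the rounded coefficients of the prediction equation above is ŷ = 433 − 8.89(100) + 0.0598(100)^2 = 433 − 889 + 598 = 142 hours, approximately.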
We can also test a hypothesis to determine whether the second-order term in our model, in fact, contributes to the prediction of y. The null and alternate hypotheses to be tested for this can be expressed as

H0: β2 = 0
Ha: β2 ≠ 0   (7.36)

Table 7.21 EXCEL computer output for the quadratic model

Summary Output

Regression Statistics
Multiple R           0.97947
R Square             0.95936
Adjusted R Square    0.95567
Standard Error       5.37620
Observations         25

ANOVA
             df    SS            MS           F          Significance F
Regression    2    15,011.7720   7,505.8860   259.6872   0.0000
Residual     22       635.8784      28.9036
Total        24    15,647.6504

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept      433.0063        61.8367        7.0024    0.0000    304.7648    561.2478
Temp. (x)       −8.8908         1.3743       −6.4691    0.0000   −11.7410     −6.0405
x**2             0.0598         0.0075        7.9251    0.0000     0.0442      0.0755
The test statistic for this test is given by

t = b2 / s_b2

where s_b2 is the estimated standard error of b2. The test statistic value is calculated by the computer and is shown in Table 7.21. In this table, the t value is reported in the x**2 row under the t Stat column. This value is 7.93. Thus,

t = b2 / s_b2 = 7.93

The critical value for the test is

t(n − k − 1, α/2) = t(22, 0.025) = 2.074

[Note: t(n − k − 1) is the t-value from the t-table for (n − k − 1) degrees of freedom, where n is the number of observations and k is the number of independent variables.]
For our example, n = 25, k = 2, and the level of significance α = 0.05. Using these values, the critical value, or the t-value from the t-table for 22 degrees of freedom and α/2 = 0.025, is 2.074. Since the calculated value of t

t = 7.93 > t_critical = 2.074


we reject the null hypothesis and conclude that the second-order term in fact contributes to the prediction of the life of the components (y). Note that we could instead have tested the one-sided hypotheses:

H0: β2 = 0
Ha: β2 > 0

which would determine whether the value of b2 = 0.0598 in the prediction equation is large enough to conclude that the life of the components increases at an increasing rate with temperature. This hypothesis has the same test statistic and can be tested at α = 0.05.

Therefore, our conclusion is that the mean component life increases at an increasing rate with temperature, and the second-order term in our model, in fact, is significant and contributes to the prediction of y.
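For those checking the numbers with software, the critical value and the two-tailed p-value can be verified, for instance, with a short scipy sketch:

from scipy import stats

t_crit = stats.t.ppf(1 - 0.025, df=22)  # two-tailed critical value at alpha = 0.05
p_val = 2 * stats.t.sf(7.93, df=22)     # two-tailed p-value for t = 7.93
print(round(t_crit, 3), p_val)          # about 2.074 and a p-value near zero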

Another Example: Quadratic (Second-Order) Model

The fitted line plot of the temperature and yield in Figure 7.24 shows the yield of a chemical process at different temperatures. The plot clearly indicates a nonlinear relationship and suggests that the data can be well approximated by a quadratic model.
We used MINITAB and EXCEL to fit a quadratic model to the data. The prediction equation from the regression output is shown below.

Yield (y) = 1,459 + 277 Temperature (x) − 0.896x^2 or,

ŷ = 1,459 + 277x − 0.896x^2

The coefficient of determination, R², is 88.2 percent. This tells us that 88.2 percent of the variation in y is explained by the regression and 11.8 percent of the variation is unexplained or due to error. The model is appropriate, and the prediction equation can be used to predict the yield at different temperatures.

Figure 7.24 Fitted line plot showing the yield of a chemical process

Summary of Model Building

The sections above provided an introduction to model building. The first-order, second-order, and third-order models were discussed. Unlike the simple and multiple regression models, where the relationship among the variables is linear, there are situations where the relationship among the variables under study may not be linear. We discussed situations where higher-order and nonlinear models provide a better description of the relationship between the response and the independent variables, and provided examples of quadratic or second-order models. Scatter plots were created to select the model that would provide a good fit to a set of data and can be used to obtain a good estimate of the response or dependent variable, y, that is related to the independent variables or predictors. Since second-order or quadratic models are appropriate in many applications, we provided a detailed computer analysis of such models. The computer analysis and the interpretation of computer results were explained and examined, including the residual plots and their analysis.

Models with Qualitative Independent (Dummy) Variables
Dummy or Indicator Variables in Multiple Regression: In regression we often encounter qualitative or indicator variables that need to be included as one of the independent variables in the model. For example, suppose we are interested in building a regression model to predict the salary of male and female employees based on their education and years of experience; the variable male or female is a qualitative variable that must be included as a separate independent variable in the model. To include such qualitative variables in the model, we use a dummy or indicator variable. For example, to include the sex of employees in a regression model as an independent variable, we define this variable as

x1 = 1 if male
   = 0 if female

In the above formulation, a "1" indicates that the employee is a male and a "0" means the employee is a female. Which one of male or female is assigned the value of 1 is arbitrary.
In general, the number of dummy or indicator variables needed is one less than the number of levels of the qualitative variable to be included in the model.

One Qualitative Independent Variable at Two Levels


Suppose we want to build a model to predict the mean salary of male and female employees. This model can be written as

y = b0 + b1x

where x is the dummy variable coded as

x = 1 if male
  = 0 if female

This coding scheme allows us to compare the mean salary for male and female employees by substituting the appropriate code in the regression equation y = b0 + b1x.

Suppose µM = mean salary for the male employees
        µF = mean salary for the female employees

Then the mean salary for the male employees is µM = y = b0 + b1(1) = b0 + b1,
and the mean salary for the female employees is µF = y = b0 + b1(0) = b0.

Thus, the mean salary for the female employees is b0. In a 0-1 coding system, the mean response will always be b0 for the level of the qualitative variable that is assigned the value 0. This is also called the base level.
The difference in the mean salary for the male and female employees can be calculated by taking the difference (µM − µF):

µM − µF = (b0 + b1) − b0 = b1

This is the difference between the mean response for the level assigned the value 1 and the level assigned the value 0, the base level. The mean salary for the male and female employees is shown graphically in Figure 7.25. We can also see that

b0 = µF
b1 = µM − µF
Figure 7.25 Mean salary of female and male employees
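A short Python sketch (with hypothetical salary data) illustrates the point: under 0-1 coding, the fitted intercept estimates the base-level (female) mean and the dummy coefficient estimates the difference in means; statsmodels is assumed.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical salaries; 1 = male, 0 = female (the base level)
df = pd.DataFrame({
    "salary": [62, 58, 71, 55, 66, 60, 73, 57],
    "male":   [1, 0, 1, 0, 1, 0, 1, 0],
})
fit = smf.ols("salary ~ male", data=df).fit()
# Intercept estimates b0 (the female mean); the male coefficient estimates
# b1 = (male mean) - (female mean)
print(fit.params)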

Model with One Qualitative Independent Variable at Three Levels
We would like to write a model relating the mean profit of a grocery chain. It is believed that the profit depends to a large extent on the location of the stores. Suppose that the management is interested in three specific locations where the stores are located. We will call these locations A, B, and C. In this case, the store location is a single qualitative variable at three levels corresponding to the three locations A, B, and C. The prediction equation relating the mean profit (y) and the three locations can be written as:

y = b0 + b1x1 + b2x2

where

x1 = 1 if location B
   = 0 if not

x2 = 1 if location C
   = 0 if not
The variables x1 and x2 are known as dummy variables; they allow the qualitative variable, store location, to enter the model.

Explanation of the Model

Suppose µA = mean profit for location A
        µB = mean profit for location B
        µC = mean profit for location C

If we set x1 = 0 and x2 = 0, we get the mean profit for location A. Therefore, the mean value of profit y when the store location is A is

µA = y = b0 + b1(0) + b2(0)
or, µA = b0

Thus, the mean profit for location A is b0; that is, b0 = µA.


Similarly, the mean profit for location B can be calculated by setting x1 = 1 and x2 = 0. The resulting equation is

µB = y = b0 + b1x1 + b2x2 = b0 + b1(1) + b2(0)
or, µB = b0 + b1

Since b0 = µA, we can write

µB = µA + b1
or, b1 = µB − µA

Finally, the mean profit for location C can be calculated by setting x1 = 0 and x2 = 1. The resulting equation is

µC = y = b0 + b1x1 + b2x2 = b0 + b1(0) + b2(1)
or, µC = b0 + b2

Since b0 = µA, we can write

µC = µA + b2
or, b2 = µC − µA

Thus, in the above coding system with one qualitative independent variable at three levels,

µA = b0          b1 = µB − µA
µB = b0 + b1     b2 = µC − µA
µC = b0 + b2

where µA, µB, µC are the mean profits for locations A, B, and C.
Note that the three levels of the qualitative variable can be described with only two dummy variables. This is because the mean of the base level (in this case location A) is accounted for by the intercept b0. In general, for m levels of a qualitative variable, we need (m − 1) dummy variables.
The bar graph in Figure 7.26 shows the values of the mean profit (y) for the three locations.

Figure 7.26 Bar chart showing the mean profit for three locations A, B, C

In the above bar chart, the height of the bar corresponding to location A is y = b0. Similarly, the heights of the bars corresponding to locations B and C are y = b0 + b1 and y = b0 + b2, respectively. Note that either b1 or b2, or both, could be negative. In Figure 7.26, b1 and b2 are both positive.
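These relations can be checked numerically. The sketch below, with hypothetical profit data, fits the two-dummy model and compares the estimates of b0, b1, and b2 with the location means; statsmodels is assumed.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical profits for locations A, B, and C
df = pd.DataFrame({
    "profit": [10.2, 11.0, 9.8, 14.1, 13.5, 14.4, 8.1, 7.7, 8.4],
    "loc":    ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
})
df["x1"] = (df["loc"] == "B").astype(int)  # 1 if location B
df["x2"] = (df["loc"] == "C").astype(int)  # 1 if location C; A is the base level

fit = smf.ols("profit ~ x1 + x2", data=df).fit()
print(fit.params)                          # b0, b1, b2
print(df.groupby("loc")["profit"].mean())  # b0 = mean(A), b1 = mean(B) - mean(A), b2 = mean(C) - mean(A)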

Example: Dummy Variables


Consider the problem of the pharmaceutical company model where the relationship between the sales volume (y) and three quantitative independent variables, advertisement dollars spent (x1) in hundreds of dollars, commission paid to the salespersons (x2) in hundreds of dollars, and the number of salespersons (x3), was investigated. The company is now interested in including the different sales territories where they market the drug. The territory in which the company markets the drug is divided into three zones: zones A, B, and C. The management wants to predict the sales for the three zones separately. To do this, the variable "zone," which is a qualitative independent variable, must be included in the model. The company identified the sales volumes for the three zones along with the variables considered earlier. The data including the sales volume and the three zones are shown in the last column of Table 7.22 (Data File: DummyVar_File1). The zones are coded using two dummy variables:

x4 = 1 if zone A        x5 = 1 if zone B
   = 0 otherwise           = 0 otherwise

In the above coding system, the choice of 0 and 1 in the coding is arbitrary.
Note that we have defined only two dummy variables, x4 and x5, for a total of three zones. It is not necessary to define a third dummy variable for zone C.
From the above discussion, it follows that the regression model for the data in Table 7.22 including the variable "zone" can be written as:

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5

where
y: sales volume,
x1: advertisement dollars spent in hundreds of dollars,
x2: commission paid to the salespersons in hundreds of dollars,
x3: the number of salespersons, and
x4 and x5: the dummy variables defined above.

Table 7.22 Sales for different zones

Row   Sales Volume (y)   Advertisement (x1)   Commission (x2)   No. of Salespersons (x3)   Zone
1         973.62              580.17               235.48                  8               A
2         903.12              414.67               240.78                  7               A
3       1,067.37              420.48               276.07                 10               A
4       1,193.37              454.59               295.70                 14               B
5       1,429.62              524.05               286.67                 16               C
6       1,557.87              623.77               325.66                 18               A
7       1,590.12              641.89               298.82                 17               A
8       1,081.62              403.03               210.19                 12               C
9       1,088.37              415.76               202.91                 13               C
10      1,132.62              506.73               275.88                 11               B
11      1,314.87              490.35               337.14                 15               A
12      1,562.37              624.24               266.30                 19               C
13      1,050.12              459.56               240.13                 10               C
14      1,055.37              447.03               254.18                 12               B
15      1,112.37              493.96               237.49                 14               B
16      1,235.37              543.84               276.70                 16               B
17      1,518.12              618.38               271.14                 18               A
18      1,574.37              690.50               281.94                 15               C
19      1,644.87              591.27               316.75                 20               C
20      1,169.37              530.73               297.37                 10               C
21      1,212.87              541.34               272.77                 13               B
22      1,304.37              492.20               344.35                 11               B
23      1,477.62              546.34               295.53                 15               C
24      1,593.87              590.02               293.79                 19               C
25      1,134.87              505.32               277.05                 11               B

Table 7.23 shows the data file for this regression model with the dummy variables. The data can be analyzed using the MINITAB data file [Data File: DummyVar_File(2)] or the EXCEL data file [DummyVar_File(2).xlsx].
We used both MINITAB and EXCEL to run this model. The MINITAB and EXCEL regression output and results are shown in Tables 7.24 and 7.25. Refer to the computer results to answer the following questions.

Table 7.23 Data file for the model with dummy variables

Row   Volume (y)   Advertisement (x1)   Commission (x2)   No. of Salespersons (x3)   Zone A (x4)   Zone B (x5)
1       973.62          580.17               235.48                  8                   1             0
2       903.12          414.67               240.78                  7                   1             0
3     1,067.37          420.48               276.07                 10                   1             0
4     1,193.37          454.59               295.70                 14                   0             1
5     1,429.62          524.05               286.67                 16                   0             0
6     1,557.87          623.77               325.66                 18                   1             0
7     1,590.12          641.89               298.82                 17                   1             0
8     1,081.62          403.03               210.19                 12                   0             0
9     1,088.37          415.76               202.91                 13                   0             0
10    1,132.62          506.73               275.88                 11                   0             1
11    1,314.87          490.35               337.14                 15                   1             0
12    1,562.37          624.24               266.30                 19                   0             0
13    1,050.12          459.56               240.13                 10                   0             0
14    1,055.37          447.03               254.18                 12                   0             1
15    1,112.37          493.96               237.49                 14                   0             1
16    1,235.37          543.84               276.70                 16                   0             1
17    1,518.12          618.38               271.14                 18                   1             0
18    1,574.37          690.50               281.94                 15                   0             0
19    1,644.87          591.27               316.75                 20                   0             0
20    1,169.37          530.73               297.37                 10                   0             0
21    1,212.87          541.34               272.77                 13                   0             1
22    1,304.37          492.20               344.35                 11                   0             1
23    1,477.62          546.34               295.53                 15                   0             0
24    1,593.87          590.02               293.79                 19                   0             0
25    1,134.87          505.32               277.05                 11                   0             1

A) Using the EXCEL data file, run a regression model. Show your regression output.
B) Using the MINITAB or EXCEL regression output, write down the regression equation.
C) Using a 5 percent level of significance and the column "p" in the MINITAB regression output or the "p-value" column in the EXCEL regression output, conduct appropriate hypothesis tests to determine whether the independent variables advertisement, commission paid, and number of salespersons are significant, that is, whether they contribute to predicting the sales volume.
D) Write separate regression equations to predict the sales for each of the zones A, B, and C.
E) Refer to the given MINITAB residual plots and check that all the regression assumptions are met and the fitted regression model is adequate.

Solution:
A) The MINITAB regression output is shown in Table 7.24, and the EXCEL regression output is shown in Table 7.25.
B) From the MINITAB or the EXCEL regression output in Tables 7.24 and 7.25, the regression equation is:

Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3) − 67.2 Zone A (x4) − 105 Zone B (x5)

or

y = −98.2 + 0.884x1 + 1.81x2 + 33.8x3 − 67.2x4 − 105x5

The regression equation from the EXCEL output in Table 7.25 can be written using the coefficients column.

C) The hypotheses to check the significance of each of the independent variables can be written as:

H0: βj = 0 (xj is not a significant variable)
H1: βj ≠ 0 (xj is a significant variable)

The above hypotheses can be tested using the "p" column in the MINITAB results or the "p-value" column in the EXCEL computer results. The decision rule for the p-value approach is given by

If p ≥ α, do not reject H0
If p < α, reject H0


Table 7.26 shows the p-value for each of the predictor variables, taken from the MINITAB or EXCEL computer results in Table 7.24 or 7.25 (see the "p" or the "p-value" columns in these tables). From the table it can be seen that all three independent variables are significant.
D) As indicated, the overall regression equation is

Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3) − 67.2 Zone A (x4) − 105 Zone B (x5)

Separate equations for each zone can be written from this equation.

Table 7.26 Summary table

Independent Variable       p-value from Table 7.24 or 7.25   Compare p to α   Decision    Significant? Yes or No
Advertisement (x1)         0.000                             p < α            Reject H0   Yes
Commission (x2)            0.000                             p < α            Reject H0   Yes
No. of salespersons (x3)   0.000                             p < α            Reject H0   Yes

Zone A: x4 = 1.0, x5 = 0

Therefore, the equation for the sales volume of Zone A can be written as
Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3) − 67.2(1) − 105(0.0) or,
Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3) − 67.2 or,
Sales Volume (y) = −165.4 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3)
Commission(x2) + 33.8 No. of Salespersons(x3)
Similarly, the regression equations for the other two zones are shown below.

Zone B: x4 = 0, x5 = 1.0

Substituting these values in the overall regression equation of part (B),
Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3) − 105 or,
Sales Volume (y) = −203.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3)

Zone C: x4 = 0, x5 = 0

Substituting these values in the overall regression equation of part (B),
Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2) + 33.8 No. of Salespersons (x3)

Note that in all of the above equations, the slopes are the same but the intercepts are different.

E) The MINITAB residual plots are shown in Figure 7.27.

The residual plots in Figure 7.27 show, from the normal probability plot and the histogram, that the residuals are approximately normally distributed. The plot of residuals versus fits does not show any pattern and is quite random, indicating that the fitted linear regression model is adequate. The plot of residuals versus the order of the data points shows no apparent pattern, indicating that there is no violation of the independence-of-errors assumption.
Figure 7.27 Residual plots for the dummy variable example
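As a cross-check on the MINITAB and EXCEL results, the sketch below fits the same model in Python using the data of Table 7.22, recreating the dummy columns of Table 7.23; statsmodels is assumed, and the fitted coefficients should be close to those in the regression equation above.

import pandas as pd
import statsmodels.formula.api as smf

# Data of Table 7.22; the zone dummies recreate the x4 and x5 columns of Table 7.23
rows = [
    (973.62, 580.17, 235.48, 8, "A"), (903.12, 414.67, 240.78, 7, "A"),
    (1067.37, 420.48, 276.07, 10, "A"), (1193.37, 454.59, 295.70, 14, "B"),
    (1429.62, 524.05, 286.67, 16, "C"), (1557.87, 623.77, 325.66, 18, "A"),
    (1590.12, 641.89, 298.82, 17, "A"), (1081.62, 403.03, 210.19, 12, "C"),
    (1088.37, 415.76, 202.91, 13, "C"), (1132.62, 506.73, 275.88, 11, "B"),
    (1314.87, 490.35, 337.14, 15, "A"), (1562.37, 624.24, 266.30, 19, "C"),
    (1050.12, 459.56, 240.13, 10, "C"), (1055.37, 447.03, 254.18, 12, "B"),
    (1112.37, 493.96, 237.49, 14, "B"), (1235.37, 543.84, 276.70, 16, "B"),
    (1518.12, 618.38, 271.14, 18, "A"), (1574.37, 690.50, 281.94, 15, "C"),
    (1644.87, 591.27, 316.75, 20, "C"), (1169.37, 530.73, 297.37, 10, "C"),
    (1212.87, 541.34, 272.77, 13, "B"), (1304.37, 492.20, 344.35, 11, "B"),
    (1477.62, 546.34, 295.53, 15, "C"), (1593.87, 590.02, 293.79, 19, "C"),
    (1134.87, 505.32, 277.05, 11, "B"),
]
df = pd.DataFrame(rows, columns=["volume", "adv", "comm", "n_sales", "zone"])
df["zone_A"] = (df["zone"] == "A").astype(int)  # x4
df["zone_B"] = (df["zone"] == "B").astype(int)  # x5; zone C is the base level

fit = smf.ols("volume ~ adv + comm + n_sales + zone_A + zone_B", data=df).fit()
print(fit.params.round(3))  # expect values close to the equation above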



Overview of Regression Models


Regression is a powerful tool and is widely used in studying the relation-
ships among the variables. A number of regression models were discussed
in this book. These models are summarized here:

Simple Linear Regression: y = β0 + β1x + ε

Multiple Regression: y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Polynomial Regression (second-order models can be extended to higher-order models):
Second-order polynomial: y = β0 + β1x + β2x^2 + ε
Higher-order polynomial: y = β0 + β1x + β2x^2 + ... + βkx^k + ε

Interaction Models: An interaction model relating y and two quantitative independent variables can be written as
y = b0 + b1x1 + b2x2 + b3x1x2

Models with Dummy Variables: General form of the model with one qualitative (dummy) independent variable at m levels:
y = b0 + b1x1 + b2x2 + …… + b(m−1)x(m−1)
where xi is the dummy variable for level (i + 1) and
xi = 1 if y is observed at level (i + 1)
   = 0 otherwise

All Subset and Stepwise Regression: Finding the best set of predictor variables to be included in the model

Note: the Interaction Models and All Subset Regression are not discussed in this chapter.

There are other regression models that are not discussed but can be developed using the concepts presented for the other models. Some of these models are explained here.
Reciprocal Transformation of the x Variable: This transformation can produce a linear relationship and is of the form:
y = β0 + β1(1/x) + ε
This model is appropriate when x and y have an inverse relationship. Note that the inverse relationship is not linear.

Log Transformation of the x Variable: The logarithmic transformation is of the form:
y = β0 + β1 ln(x) + ε
This is a useful curvilinear form, where ln(x) is the natural logarithm of x and x > 0.

Log Transformation of the x and y Variables:
ln(y) = β0 + β1 ln(x) + ε
The purpose of this transformation is to achieve a linear relationship. The model is valid for positive values of x and y. This transformation is more involved, and it is difficult to compare this model to other models with y as the dependent variable.

Logistic Regression: This model is used when the response variable is categorical. In all the regression models we developed in this book, the response variable was a quantitative variable. In cases where the response is categorical or qualitative, the simple and multiple least-squares regression model violates the normality assumption. The correct model in this case is logistic regression, which is not discussed in this book.
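As a small illustration of the reciprocal and log-of-x forms above, the following Python sketch builds each transformed column and fits the model by ordinary least squares; the data are hypothetical and chosen only to have an inverse-looking shape.

import numpy as np
import statsmodels.api as sm

# Hypothetical positive x and y with an inverse-looking relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0])
y = np.array([9.8, 5.2, 3.9, 3.1, 2.8, 2.4, 2.1, 1.9])

recip_fit = sm.OLS(y, sm.add_constant(1.0 / x)).fit()  # y = b0 + b1(1/x) + e
log_fit = sm.OLS(y, sm.add_constant(np.log(x))).fit()  # y = b0 + b1 ln(x) + e
print(recip_fit.params, recip_fit.rsquared)
print(log_fit.params, log_fit.rsquared)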


Implementation Steps and Strategy for Regression Models
Successful implementation of regression models requires an understanding of the different types of models. A knowledge of the least-squares method, on which many of the regression models are based, as well as an awareness of the assumptions of least-squares regression, is critical in evaluating and implementing the correct regression models. The computer packages have made model building and analysis easy. As we have demonstrated, the scatter plots and matrix plots constructed using the computer are very helpful in the initial stages of selecting the right model for the given data. The residual plots for checking the assumptions of regression can also be easily constructed using the computer. While the computer packages have removed the computational hurdle, it is important to understand the fundamentals underlying regression to apply the regression models properly. A lack of understanding of the least-squares method and the assumptions underlying regression may lead to drawing wrong conclusions. For example, if the assumptions of regression are violated, it is important to recognize this and determine the alternate course or courses of action.