Regression Analysis and Modelling - Amar Sahay
BEP504
CHAPTER 7
By Amar Sahay
(A Business Expert Press Book)
Harvard Business Publishing distributes in digital form the individual chapters from a wide selection of books on business from publishers including Harvard Business Press and numerous other companies. To order copies or request permission to reproduce materials, call 1-800-545-7685 or go to http://www.hbsp.harvard.edu. No part of this publication may be reproduced, stored in a retrieval system, used in a spreadsheet, or transmitted in any form or by any means – electronic, mechanical, photocopying, recording, or otherwise – without the permission of Harvard Business Publishing, which is an affiliate of Harvard Business School.
This document is authorized for educator review use only by Jasashwi Mandal, NITIE - National Institute of Industrial Engineering until May 2025. Copying or posting is an infringement of
copyright. [email protected] or 617.783.7860
CHAPTER 7
Regression Analysis and Modeling
Chapter Highlights
• Introduction to Regression and Correlation
• Linear Regression
  ○ Regression Model
• The Estimated Equation of Regression Line
• The Method of Least Squares
• Illustration of Least Squares Regression Method
• Analysis of a Simple Regression Problem
104 BUSINESS ANALYTICS, VOLUME II
  ○ Testing the Overall Significance of Regression
  ○ Hypothesis Tests on Individual Regression Coefficients
• Multicollinearity and Autocorrelation in Multiple Regression
• Summary of the Key Features of Multiple Regression Model
• Model Building and Computer Analysis
  ○ Model with a Single Quantitative Independent Variable
  ○ First-order Model/Second-order Model/Third-order Model
• A Quadratic (Second-order) Model: Second-order Model Using MINITAB
  ○ Analysis of Computer Results
• Models with Qualitative Independent (Dummy) Variables
  ○ One Qualitative Independent Variable at Two Levels
• Model with One Qualitative Independent Variable at Three Levels
• Example: Dummy Variables
• Overview of Regression Models
• Implementation Steps and Strategy for Regression Models
can be used to predict one variable using another, or even multiple variables. The following features related to regression analysis are also topics of this chapter:
II. The basics of the least squares method in regression analysis and its purpose in estimating the regression line,
III. Determining the best-fitting line through the data points,
IV. Calculating the slope and y-intercept of the best-fitting regression line and interpreting the meaning of the regression line, and
V. Measures of association between two quantitative variables: the covariance and the coefficient of correlation.
Linear Regression
Regression analysis is used to investigate the relationship between two or more variables. Often we are interested in predicting a variable using one or more independent variables x1, x2, …, xk. For example, we might be interested in the relationship between two variables: sales and profit for
a chain of stores, the number of hours required to produce a certain number of products, the number of accidents vs. blood alcohol level, advertising expenditures and sales, or the height of parents compared to their children. In all these cases, regression analysis can be applied to investigate the relationship between the two variables.
In general, we have one dependent or response variable, y, and one or more independent variables, x1, x2, …, xk. The independent variables are
also called predictors. If there is only one independent variable x that we
are trying to relate to the dependent variable y, then this is a case of simple
regression. On the other hand, if we have two or more independent variables that are related to a single response or dependent variable, then we
have a case of multiple regression. In this section, we will discuss simple
regression, or to be more specific, simple linear regression. This means
that the relationship we obtain between the dependent or response variable y and the independent variable x will be linear. In this case, there is
only one predictor or independent variable (x) of interest that will be used
to predict the dependent variable (y).
In regression analysis, the dependent or response variable y is a random variable, whereas the independent variables x1, x2, …, xk are measured with negligible error and are controlled by the analyst. The relationship between the dependent and independent variable or variables is described by a mathematical model known as a regression model.
We will denote them by y and x, respectively. The manager in charge of developing the model believes that there is a positive relationship between x and y, meaning that larger homes (homes with larger square footage) tend to have higher heating costs. The regression model relating the two variables (home heating cost y as the dependent variable and the size of the home as the independent variable x) can be denoted using equation (7.1).
Equation (7.1) shows the relationship between the values of x and y, or the independent and dependent variables, and an error term in a simple regression model.
y = β0 + β1x + ε   (7.1)
where
y = dependent variable
x = independent variable
β0 = y-intercept (population)
β1 = slope of the population regression line
ε = random error term (ε is the Greek letter “epsilon”)
The model represented by equation (7.1) can be viewed as a population model in which β0 and β1 are the parameters of the model. The error term ε represents the variability in y that cannot be explained by the relationship between x and y.
In our example, the population consists of all the homes in the region. This population consists of subpopulations corresponding to each home size, x. Thus, one subpopulation may be viewed as all homes with 1,500 square feet, another subpopulation as all homes with 2,100 square feet, and so on. Each subpopulation of homes of a given size x has a mean or expected value of y given by equation (7.2).
E(y) = β0 + β1x   (7.2)
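Equation (7.2) says that the mean of each subpopulation lies on the population line. This can be illustrated with a short simulation sketch in Python (Python is not used in this text, and the parameter values below are assumed purely for illustration): simulate many homes of the same size x from the model y = β0 + β1x + ε and compare their average heating cost with E(y).

```python
import random

# Hypothetical population parameters, assumed only for illustration
beta0, beta1, sigma = 50.0, 0.08, 10.0
x = 1500  # one subpopulation: all homes of 1,500 square feet

random.seed(42)
# Simulate heating costs y = beta0 + beta1*x + eps for many such homes
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for _ in range(20000)]

mean_y = sum(ys) / len(ys)
expected_y = beta0 + beta1 * x  # E(y) from equation (7.2)
print(round(expected_y, 1), round(mean_y, 1))  # the two values are close
```

The sample mean of the subpopulation approaches E(y) = β0 + β1x as the number of simulated homes grows, which is exactly what equation (7.2) asserts.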
Figure 7.1 Possible linear relationship between E(y) and x in simple linear regression
negative, or no relationship. The positive linear relationship is identified by a positive slope. It shows that an increase in the value of x causes an increase in the mean value of y, or E(y), whereas a negative linear relationship is identified by a negative slope and indicates that an increase in the value of x causes a decrease in the mean value of y.
No relationship between x and y means that the mean value of y, or E(y), is the same for every value of x. In this case, the regression equation cannot be used to make a prediction because of a weak or no relationship between x and y.
The Estimated Equation of Regression Line
In equation (7.2), β0 and β1 are the unknown population parameters that must be estimated using the sample data. The estimates of β0 and β1 are denoted by b0 and b1, which provide the estimated regression equation given by the following equation.
ŷ = b0 + b1x   (7.3)
where
ŷ = point estimator of E(y), the mean value of y for a given value of x
b0 = y-intercept of the regression line
b1 = slope of the regression line
The values of b0 and b1 in equation (7.3) are determined using the least squares method. Before we discuss the least squares method in detail, we will describe the process of estimating the regression equation. Figure 7.2 explains this process.
Figure 7.3 shows a scatter plot of the data of Table 7.1. Scatter plots are often used to investigate the relationship between two variables. An investigation of the plot shows a positive relationship between sales and advertising expenditures; therefore, the manager would like to predict the sales using the advertising expenditure with a simple regression model.
Figure 7.3 Scatterplot of sales and advertising expenditures
Table 7.1 (partial) Sales and advertising data, both in $1,000s
Sales (y)  Advertising (x)
426        30
330        26
400        31
458        33
410        30
628        41
553        38
728        44
498        40
708        48
719        47
658        45
As outlined above, a simple regression model involves two variables where one variable is used to predict the other variable. The variable to be predicted is the dependent or response variable, and the other variable is the independent variable. The dependent variable is usually denoted by y while the independent variable is denoted by x.
In a scatter plot, the dependent variable (y) is plotted on the vertical axis and the independent variable (x) is plotted on the horizontal axis. The scatter plot in Figure 7.3 suggests a positive linear relationship between sales (y) and the advertising expenditures (x). From the figure, it can be seen that the plotted points can be well approximated by a straight line of the form y = b0 + b1x, where b0 and b1 are the y-intercept and the slope of the line. The process of estimating this regression equation uses a widely used mathematical tool known as the least squares method.
The least squares method requires fitting a line through the data points so that the sum of the squares of the errors or residuals is minimum. These errors or residuals are the vertical distances of the points from the fitted line. Thus, the least squares method determines the best-fitting line through the data points, the line for which the sum of the squared vertical distances or deviations between the given points and the fitted line is a minimum.
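The minimizing property can be checked numerically. The Python sketch below (not from the text) computes the sum of squared errors (SSE) of the fitted line for the advertising and sales data of Table 7.2 and confirms that nearby lines all have a larger SSE; the coefficients −150.9 and 18.33 are the fitted values derived later in the chapter.

```python
# Advertising (x) and sales (y) data from Table 7.2, both in $1,000s
x = [34, 30, 29, 30, 26, 31, 33, 30, 41, 38, 44, 40, 48, 47, 45]
y = [458, 390, 378, 426, 330, 400, 458, 410, 628, 553, 728, 498, 708, 719, 658]

def sse(b0, b1):
    """Sum of squared residuals for the line yhat = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

best = sse(-150.9, 18.33)  # SSE of the least squares line
# Any nearby perturbed line has a larger sum of squared errors
for db0, db1 in [(5, 0), (-5, 0), (0, 0.5), (0, -0.5), (10, -1)]:
    assert sse(-150.9 + db0, 18.33 + db1) > best
print(round(best, 1))
```

No perturbation of the intercept or slope lowers the SSE, which is the defining property of the least squares line.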
Figure 7.4 shows the concept of the least squares method. The figure shows a line fitted to the scatter plot of Figure 7.3 using the least squares method. This line is the estimated line denoted using y-hat (ŷ). The method of estimating this line will be illustrated later. The equation of this line is given below.
ŷ = −150.9 + 18.33x
The vertical distance of each point from the line is known as the error or residual. Note that the residual or error of a point can be positive, negative, or zero depending upon whether the point is above, below, or on the fitted line. If the point is above the line, the error is positive, whereas if the point is below the fitted line, the error is negative.
Figure 7.4 shows graphically the errors for a few points. To demonstrate how the error or residual for a point is calculated, refer to the data in Table 7.1.
Figure 7.4 Fitting the regression line to the sales and advertising data of Table 7.1
This table shows that for the advertising expenditure of 40 (or x = 40), the sales is 498 (or y = 498). This is shown graphically in Figure 7.4. The estimated or predicted sales for x = 40 equals the vertical distance all the way up to the fitted regression line from y = 498. This predicted value can be determined using the equation of the fitted line: ŷ = −150.9 + 18.33(40) = 582.3. The difference between the observed sales, y = 498, and the predicted value of y is the error or residual and is equal to (y − ŷ) = (498 − 582.3) = −84.3. Figure 7.4 shows this error value. This error is negative because the point lies below the fitted line. Similarly, for the advertising expenditure x = 44, the observed sales is y = 728, and the predicted sales is
ŷ = −150.9 + 18.33x = −150.9 + 18.33(44) = 655.62
The value is shown in Figure 7.4. The error for this point is the difference between the observed and the predicted, or estimated, value, which is
(y − ŷ) = (728 − 655.62) = 72.38
This value of the error is positive because the point y = 728 lies above the fitted line.
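The two residuals just discussed can be reproduced directly from the fitted equation. A small Python sketch (not from the text):

```python
def y_hat(x):
    """Predicted sales from the fitted line yhat = -150.9 + 18.33x."""
    return -150.9 + 18.33 * x

# Point (x = 40, y = 498): the point lies below the line, so the error is negative
e1 = 498 - y_hat(40)
# Point (x = 44, y = 728): the point lies above the line, so the error is positive
e2 = 728 - y_hat(44)
print(round(e1, 1), round(e2, 2))  # -84.3 and 72.38
```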
The errors for the other observed values can be calculated in a similar way. The vertical deviation of a point from the fitted regression line represents the amount of error associated with that point. The least squares method determines the values b0 and b1 in the fitted regression line ŷ = b0 + b1x that minimize the sum of the squares of the errors. Minimizing the sum of the squares of the errors provides a unique line through the data points such that the overall squared deviation of the points from the fitted line is a minimum.
Since the least squares criterion requires that the sum of the squares of the errors be minimized, the quantity to be minimized is:

Σ(y − ŷ)² = Σ(y − b0 − b1x)²   (7.4)
where y is the observed value and ŷ is the estimated value of the dependent variable given by ŷ = b0 + b1x.
Equation (7.4) involves two unknowns b0 and b1. Using differential
calculus, the following two equations can be obtained:
Σy = nb0 + b1Σx   (7.5)
Σxy = b0Σx + b1Σx²
These equations are known as the normal equations and can be solved algebraically to obtain the unknown values of the y-intercept b0 and the slope b1. Solving these equations yields the results shown below.
b1 = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²]   (7.6)

and

b0 = ȳ − b1x̄   (7.7)

where ȳ = Σy/n and x̄ = Σx/n.
The values b0 and b1 calculated using equations (7.6) and (7.7) minimize the sum of the squares of the vertical deviations or errors. These values can be calculated easily using the data points (xi, yi), which are the observed values of the independent and dependent variables (the collected data in Table 7.1).
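Equations (7.6) and (7.7) translate directly into code. A minimal Python sketch (the function and variable names are mine, not from the text), checked on points that lie exactly on a known line:

```python
def least_squares(x, y):
    """Return (b0, b1) from equations (7.6) and (7.7)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # equation (7.6)
    b0 = sy / n - b1 * sx / n                       # equation (7.7): b0 = ybar - b1*xbar
    return b0, b1

# Quick check with points lying exactly on y = 1 + 2x
b0, b1 = least_squares([1, 2, 3], [3, 5, 7])
print(b0, b1)  # 1.0 2.0
```

Because the three test points fall exactly on a line, the least squares fit recovers that line with zero error.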
Illustration of Least Squares Regression Method

In this section we will demonstrate the least squares method, which is the basis of the regression model. We will also discuss the process of finding the estimated regression equation.
Table 7.2 Intermediate calculations for determining the estimated regression line

Obs.  Sales y ($1,000s)  Advertising x ($1,000s)  xy      x²     y²
1     458                34                       15,572  1,156  209,764
2     390                30                       11,700  900    152,100
3     378                29                       10,962  841    142,884
4     426                30                       12,780  900    181,476
5     330                26                       8,580   676    108,900
6     400                31                       12,400  961    160,000
7     458                33                       15,114  1,089  209,764
8     410                30                       12,300  900    168,100
9     628                41                       25,748  1,681  394,384
10    553                38                       21,014  1,444  305,809
11    728                44                       32,032  1,936  529,984
12    498                40                       19,920  1,600  248,004
13    708                48                       33,984  2,304  501,264
14    719                47                       33,793  2,209  516,961
15    658                45                       29,610  2,025  432,964

Totals: Σy = 7,742  Σx = 546  Σxy = 295,509  Σx² = 20,622

x̄ = Σx/n = 546/15 = 36.4   ȳ = Σy/n = 7,742/15 = 516.133
Using the values in Table 7.2 and equations (7.6) and (7.7), we first calculate the value of b1:

b1 = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] = [15(295,509) − (546)(7,742)] / [15(20,622) − (546)²] = 18.326

Next, using equation (7.7),

b0 = ȳ − b1x̄ = 516.133 − (18.326)(36.4) = −150.93
This gives us the following equation for the estimated regression line:

ŷ = −150.9 + 18.33x
The slope (b1) of the estimated regression line has a positive value of 18.33. This means that as the advertising expenditures (x) increase, the sales increase. Since the advertising expenditures (x) and the sales are both measured in $1,000s, the estimated regression equation ŷ = −150.9 + 18.33x means that each unit increase in the value of x (or every $1,000 increase in the advertising expenditures) will lead to an increase of $18,330 (or 18.33 × 1,000 = 18,330) in expected sales. We can also use the regression equation to predict the sales for a given value of x, the advertisement expenditure. For instance, the predicted sales for x = 40 can be calculated as:
ŷ = −150.9 + 18.33(40) = 582.3
Thus, the predicted sales for an advertising expenditure of $40,000 would be $582,300.
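The whole calculation for the Table 7.2 data, including the prediction at x = 40, can be verified with a short Python script (not from the text). The prediction below differs slightly from the 582.3 above only because the text rounds b0 and b1 before predicting.

```python
# Data from Table 7.2, both variables in $1,000s
x = [34, 30, 29, 30, 26, 31, 33, 30, 41, 38, 44, 40, 48, 47, 45]   # advertising
y = [458, 390, 378, 426, 330, 400, 458, 410, 628, 553, 728, 498, 708, 719, 658]  # sales

n = len(x)
sx, sy = sum(x), sum(y)                          # 546 and 7,742
sxy = sum(a * b for a, b in zip(x, y))           # 295,509
sxx = sum(a * a for a in x)                      # 20,622

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # equation (7.6)
b0 = sy / n - b1 * sx / n                        # equation (7.7)
pred = b0 + b1 * 40                              # predicted sales at x = 40

print(round(b1, 3), round(b0, 1), round(pred, 1))  # 18.326 -150.9 582.1
```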
It is important to check the adequacy of the estimated regression equation before using the equation to make predictions. In the sections that follow, we will discuss several tests to check the adequacy of the regression model.
Analysis of a Simple Regression Problem
The example below demonstrates the necessary computations, their interpretation, and the application of a simple regression problem using computer packages. Suppose the operations manager of a manufacturing company wants to predict the number of hours required to produce a certain number of products. The data for the number of units produced and the time in hours to produce those units are shown in Table 7.3 (Data File: Hours_Units). This is a simple linear regression problem, so we have one dependent or response variable that we are trying to relate to one independent variable or predictor. Since we are trying to predict the number of hours using the number of units produced, hours is the dependent or response variable (y) and the number of units is the independent variable or predictor (x). For the data in Table 7.3, we first calculate the intermediate values shown in Table 7.4. All these values are calculated using the observed values of x and y in Table 7.3. These intermediate values will be used in most of the computations related to simple regression analysis.
We will also use computer packages such as MINITAB and EXCEL to analyze the simple regression problem and provide a detailed analysis of the computer output. First, we will explain the manual calculations
Obs. No. 11 12 13 14 15 16 17 18 19 20
Units (x) 704 897 949 632 477 754 819 869 1,035 646
Hours (y) 12.63 14.43 15.46 12.64 11.92 13.95 14.33 15.23 16.77 12.41
Obs. No. 21 22 23 24 25 26 27 28 29 30
Units (x) 1,055 875 969 1,075 655 1,125 960 815 555 925
Hours (y) 17.00 15.50 16.20 17.50 12.92 18.20 15.10 14.00 12.20 15.50
Table 7.4 Intermediate calculations for data in Table 7.3

n = 30 (number of observations)
Σx = 24,132   Σy = 431.23   Σxy = 357,055
Σx² = 20,467,220   Σy² = 6,302.3
x̄ = Σx/n = 804.40   ȳ = Σy/n = 14.374
and interpret the results. You will find that all the formulas are written in terms of the values calculated in Table 7.4.
Constructing a Scatterplot of the Data

We can use EXCEL or MINITAB to create a scatter plot of the data. From the data in Table 7.3, enter the units (x) in the first column and the hours (y) in the second column of EXCEL or MINITAB and construct a scatter plot. Figure 7.6 shows the scatter plot for this data.
The above plot clearly shows an increasing trend. It shows a linear relationship between x and y; therefore, the data can be approximated using a straight line with a positive slope.
Finding the Equation of the Best Fitting Line (Estimated Line)
The equation of the estimated regression line is given by:
ŷ = b0 + b1 x
where b0 = y-intercept and b1 = slope. These are determined using the least squares method, that is, using equations (7.6) and (7.7) discussed earlier.
Using the values in Table 7.4, first calculate the values of b1 (the slope) and b0 (the y-intercept) as shown below.
b1 = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] = [30(357,055) − (24,132)(431.23)] / [30(20,467,220) − (24,132)²] = 0.00964

and

b0 = ȳ − b1x̄ = 14.374 − (0.00964)(804.40) = 6.62

The equation of the estimated regression line is therefore

ŷ = b0 + b1x = 6.62 + 0.00964x
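The slope and intercept can be reproduced from the summary values of Table 7.4 alone. A Python sketch (not from the text):

```python
# Summary values from Table 7.4 (n = 30 observations)
n = 30
sx, sy = 24_132, 431.23
sxy, sxx = 357_055, 20_467_220

b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # equation (7.6)
b0 = sy / n - b1 * sx / n                       # equation (7.7)
print(round(b1, 5), round(b0, 2))  # 0.00964 6.62
```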
The regression equation, or the equation of the “best” fitting line, can be written as ŷ = 6.62 + 0.00964x. The “hat” over y means that the line is estimated. Thus, the equation of the line, in fact, is an estimated equation of the best fitting line. The line is also known as the least squares line, which minimizes the sum of the squares of the errors. This means that when the line is placed over the scatter plot, the sum of the squared vertical distances from the points to the line is minimized. The error is the vertical distance of each point from the estimated line.
The error is also known as the residual. Figure 7.7 shows the least squares line and the residuals for each of the points as the vertical distance from the point to the estimated regression line. [Note: The estimated line is denoted by ŷ, and the residual for a point yi is given by (yi − ŷi).]
Recall that the error or the residual for a point is given by (y − ŷ), which is the vertical distance of a point from the estimated line. Figure 7.8 shows the fitted regression line over the scatter plot.
Interpretation of the Fitted Regression Line
The estimated least squares line is of the form ŷ = b0 + b1x, where b1 is the slope and b0 is the y-intercept. The equation of the fitted line is

ŷ = 6.62 + 0.00964x
In this equation of the fitted line, 6.62 is the y-intercept and 0.00964 is the slope. This line provides the relationship between the hours and the number of units produced. The equation means that for each unit increase in x (the number of units produced), y (the number of hours) will increase by 0.00964. The value 6.62 represents the portion of the hours that is not affected by the number of units produced.
Predictions should be made with caution for values of x outside the range of the data. From the scatter plot, a straight-line fit with an increasing trend is evident for the data, but we should be cautious about assuming that this straight-line trend will continue to hold for values as large as x = 2,000. Therefore, it may not be reasonable to make a prediction for values that are far beyond the range of the data values.
The Standard Error of the Estimate (s)
The standard error of the estimate measures the variation or scatter of the points around the fitted line of regression. It is measured in the units of the response or dependent variable (y). The standard error of the estimate is analogous to the standard deviation: the standard deviation measures the variability around the mean, whereas the standard error of the estimate (s) measures the variability around the fitted line of regression. A large value of s indicates larger variation of the points around the fitted line of regression.
The standard error of the estimate is calculated using the following formula:

s = √[Σ(y − ŷ)² / (n − 2)]   (7.7A)
The equation can also be written in terms of b0, b1, and the values in Table 7.4, so the standard error of the estimate can be calculated as:

s = √[(Σy² − b0Σy − b1Σxy) / (n − 2)]
s = 0.4481
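This value can be checked in Python from the summary values of Table 7.4 (a sketch, not from the text; the small difference from 0.4481 comes from rounding in the tabulated sums and in the coefficients used below):

```python
import math

# Summary values from Table 7.4 and the fitted coefficients
n = 30
sy, sxy, syy = 431.23, 357_055, 6_302.3
b0, b1 = 6.62, 0.00964

sse = syy - b0 * sy - b1 * sxy   # sum of squared errors
s = math.sqrt(sse / (n - 2))     # equation (7.7A), shortcut form
print(round(s, 3))               # close to the 0.4481 in the text
```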
A small value of s indicates less scatter of the data points around the fitted line of regression (see Figure 7.8). The value s = 0.4481 indicates that the variation of the data points around the fitted regression line is small.
The coefficient of determination, r², is used to judge the adequacy of the regression model. The value of r² lies between 0 and 1 (0 ≤ r² ≤ 1), or 0 to 100 percent. The closer the value of r² is to 1, or 100 percent, the better the model, because the r² value indicates the amount of variation in the data explained by the regression model. Figure 7.9 shows the relationship between the explained, unexplained, and the total variation.
In regression, the total sum of squares is partitioned into two components, the regression sum of squares and the error sum of squares, giving the following relationship:

SST = SSR + SSE
From Figure 7.9, the SST and SSE are calculated as

SST = Σ(y − ȳ)² = Σy² − (Σy)²/n   (7.9)

and

SSE = Σ(y − ŷ)² = Σy² − b0Σy − b1Σxy   (7.10)
Note that we can calculate SSR from SST and SSE, since SST = SSR + SSE. The coefficient of determination is then

r² = SSR/SST   (7.11)
We now compute these using equations (7.9) and (7.10) and the values in Table 7.4:

SST = Σ(y − ȳ)² = Σy² − (Σy)²/n = 6,302.3 − (431.23)²/30 = 103.680

SSE = Σ(y − ŷ)² = Σy² − b0Σy − b1Σxy = 5.623
Therefore, since SST = SSR + SSE,

SSR = SST − SSE = 103.680 − 5.623 = 98.057   (7.12)
and

r² = SSR/SST = 98.057/103.680 = 0.946

or r² = 94.6 percent.
This means that 94.6 percent of the variation in the dependent variable y is explained by the variation in x, and 5.4 percent of the variation is due to unexplained reasons, or error.
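The r² computation can be confirmed from the same summary values. A Python sketch (not from the text; minor differences from the printed intermediate values come from rounding in the tabulated sums):

```python
# Summary values from Table 7.4 and the fitted coefficients
n = 30
sy, sxy, syy = 431.23, 357_055, 6_302.3
b0, b1 = 6.62, 0.00964

sst = syy - sy ** 2 / n          # equation (7.9)
sse = syy - b0 * sy - b1 * sxy   # equation (7.10)
ssr = sst - sse                  # since SST = SSR + SSE
r2 = ssr / sst                   # equation (7.11)
print(round(r2, 3))              # 0.946
```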
The coefficient of correlation, r, is the square root of the coefficient of determination:

r = √r²   (7.13)
Therefore,

r = √0.946 = 0.973
The coefficient of correlation always lies in the range

−1 ≤ r ≤ 1   (7.14)
If r is positive, it indicates a positive correlation, whereas a negative r indicates a negative correlation. The coefficient of correlation r can also be calculated using the following formula:
r = [Σxy − (Σx)(Σy)/n] / {√[Σx² − (Σx)²/n] × √[Σy² − (Σy)²/n]}   (7.15)
Using the values in Table 7.4, we can calculate r from equation (7.15). To summarize, we have now calculated (a) the estimated regression equation, (b) the coefficient of determination (r²) that measures how well the independent variable predicts the dependent variable, or the percent of variation in the dependent variable y explained by the variation in the independent variable x, and (c) the coefficient of correlation (r) that measures the strength of the relationship between x and y.
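Equation (7.15) can be evaluated with the values from Table 7.4 in a few lines of Python (a sketch, not from the text):

```python
import math

# Summary values from Table 7.4
n = 30
sx, sy = 24_132, 431.23
sxy, sxx, syy = 357_055, 20_467_220, 6_302.3

num = sxy - sx * sy / n
den = math.sqrt(sxx - sx ** 2 / n) * math.sqrt(syy - sy ** 2 / n)
r = num / den                    # equation (7.15)
print(round(r, 3))               # about 0.973
```

The result agrees with r = √r² from equation (7.13), as it must for a simple regression with a positive slope.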
Regression Analysis Using Computer

This section provides a step-wise computer analysis of the regression model. In the real world, computer software is almost always used to analyze regression problems. A number of software packages are in use today, among which MINITAB, EXCEL, SAS, and SPSS are a few. Here, we have used the EXCEL and MINITAB packages to analyze the regression models. The applications of simple, multiple, and higher-order regressions using EXCEL and MINITAB software are demonstrated in this and subsequent sections. If you perform regression analysis with a substantial amount of data and need more detailed analyses, the use of a statistical package such as MINITAB, SAS, or SPSS is recommended. Besides these, a number of packages including R, Stata, and others are readily available and widely used in research and data analysis.
Simple Regression Using EXCEL
The instructions in Table 7.5 will produce the regression output shown in
Table 7.6. If you checked the boxes under Residuals and the Line Fit Plots,
the residuals and fitted line plot will be displayed.
4. Select Regression
5. Select Hours (y) for Input Y Range and Units (x) for Input X Range (including the labels)
6. Check the Labels box
7. Click on the circle to the left of Output Range, click on the box next to Output Range, and specify where you want to store the output by clicking a blank cell (or select New Worksheet Ply)
8. Check the Line Fit Plot under Residuals. Click OK.
You may check the boxes under Residuals and Normal Probability Plot as desired.
Table 7.6 EXCEL regression output
In the coefficients column of Table 7.6, 6.620904991 is the y-intercept and 0.009638772 is the slope. The regression equation from this table is

ŷ = 6.62 + 0.00964x
This is the same equation we obtained earlier using manual calculations.
Recall that in regression, the total sum of squares is partitioned into two components, the regression sum of squares (SSR) and the error sum of squares (SSE), giving the following relationship: SST = SSR + SSE. The coefficient of determination r², which is also the measure of goodness of fit for the regression equation, can be calculated using

r² = SSR/SST
The values of SSR, SSE, and SST can be obtained from the ANOVA table in the regression output.
The t-test and F-test for the significance of regression can be easily
performed using the information in the EXCEL computer output under
the ANOVA table. Table 7.8 shows the EXCEL regression output with
the ANOVA table.
(1) Conducting the t-Test Using the Regression Output in Table 7.8. The test statistic for the significance of the slope is

t(n−2) = b1/sb1
Table 7.7 EXCEL regression output
Table 7.8 EXCEL regression output
The values of b1, sb1, and the test statistic value t(n−2) are labeled in Table 7.8.
Using the test statistic value, the hypothesis test for the significance of regression can be conducted. This test is explained here using the computer results. The appropriate hypotheses for the test are:
H0: β1 = 0
H1: β1 ≠ 0
The null hypothesis states that the slope of the regression line is zero. Thus, if the regression is significant, the null hypothesis must be rejected. A convenient way of testing the above hypotheses is the p-value approach. The test statistic value t(n−2) and the corresponding p-value are reported in the regression output in Table 7.8. Note that the p-value is very close to zero (p = 2.92278E-19). If we test the hypothesis at a 5 percent level of significance (α = 0.05), then p ≈ 0.000 is less than α = 0.05, so we reject the null hypothesis and conclude that the regression is significant overall.
Figure 7.10 Scatterplot of Hours (y) and Units (x)
Analysis of Regression Output in Table 7.9
Refer to the Regression Analysis part. In this table, the regression equation is printed as Hours(y) = 6.62 + 0.00964 Units(x). This is the equation of the best fitting line using the least squares method. Just below the regression equation, a table is printed that describes the model in more detail. The values under the Coef column are the regression coefficients b0 and b1, where b0 is the y-intercept or constant and b1 is the slope of the regression line. Under the Predictor column, the value for Units (x) is 0.0096388, which is b1 (the slope of the fitted line). The Constant is 6.6209. These values form the regression equation.
Table 7.9 The regression analysis and analysis of variance tables
using MINITAB
The least squares method fits the line so that the sum of the squared vertical distances from each of the points to the line is minimum. The error or the residual is the vertical distance of each point from the estimated line. Figure 7.12 shows the least squares line and the residuals. The residual for a point is given by (y − ŷ), which is the vertical distance of the point from the estimated line.
Figure 7.12 The least squares line and residuals
[Note: The estimated line is denoted by ŷ and the residual for a point yi is given by (yi − ŷ).]

The estimated least squares line is of the form ŷ = b0 + b1x, where b1 is the slope and b0 is the y-intercept. In the regression equation Hours(y) = 6.62 + 0.00964 Units(x), 6.62 is the y-intercept and 0.00964 is the slope. This line provides the relationship between the hours and the number of units produced. The equation states that for each unit increase in x (the number of units produced), y (the number of hours) will increase by 0.00964.
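As a small worked example of using the fitted equation for prediction (the function name is ours):

```python
# Using the fitted line Hours = 6.62 + 0.00964 * Units from the text
# to predict hours for a given production level.
def predict_hours(units):
    """Predicted hours from the estimated regression equation."""
    return 6.62 + 0.00964 * units

# For 1,000 units: 6.62 + 0.00964 * 1000 = 16.26 hours.
# Each additional unit adds 0.00964 hours to the prediction.
```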
3. The Coefficient of Determination (r2)
The coefficient of determination, r2, is an indication of how well the independent variable predicts the dependent variable. In other words, it is used to judge the adequacy of the regression model. The value of r2 lies between 0 and 1 (0 ≤ r2 ≤ 1), or 0 to 100 percent. The closer the value of r2 to 1 or 100 percent, the better the model. The r2 value indicates the amount of variability in the data explained by the regression model. In our example, the r2 value is 94.6 percent (Table 7.9, Regression Analysis). The value of r2 is reported as:

R-Sq = 94.6%
The assumption regarding the independence of errors can be evaluated by plotting the errors or residuals in the order or sequence in which the data were collected. If the errors are not independent, a relationship exists between consecutive residuals, which is a violation of the assumption of independence of errors. When the errors are not independent, the plot of residuals versus the time (or the order) in which the data were collected will show a cyclical pattern. Meeting this assumption is particularly important when data are collected over a period of time. If the data are collected at different time periods, the errors for a specific time period may be correlated with the errors of the previous time periods.
The assumption that the errors are normally distributed, or the normality assumption, requires that the errors have a normal or approximately normal distribution. Note that this assumption means that the errors do not deviate too much from normality. The assumption can be verified by plotting the histogram or the normal probability plot of the errors.
The assumption that the variances of the errors are equal (equality of variance) is also known as homoscedasticity. This requires that the variance of the errors is constant for all values of x, or that the variability of the y values is the same for both low and high values of x. The equality of variance assumption is of particular importance for making inferences about b0 and b1.
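Beyond the residual plots, the independence assumption can also be screened numerically. One such screen, not computed in the text, is the Durbin–Watson statistic: values near 2 suggest uncorrelated consecutive residuals, while values near 0 suggest positive autocorrelation. A sketch with hypothetical residuals:

```python
# Sketch: Durbin-Watson statistic as a numeric companion to the
# residuals-in-order plot. Hypothetical residual sequences.
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

independent_like = [0.3, -0.5, -0.1, 0.4, -0.2, 0.6, 0.2, -0.4]
trending = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # consecutive residuals related
dw_ok = durbin_watson(independent_like)   # near 2: no violation indicated
dw_bad = durbin_watson(trending)          # near 0: independence violated
```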
The residuals are plotted versus the fitted values and versus the order of the data. The residuals can also be plotted against each of the independent variables.
Figures 7.13a and 7.13b are used to check the normality assumption. The regression model assumes that the errors are normally distributed with mean zero. Figure 7.13a shows the normal probability plot. This plot is used to check the normality assumption of the regression model. In this plot, if the plotted points lie on or close to a straight line, then the residuals or errors are normally distributed. The pattern of points appears to fall on a straight line, indicating no violation of the normality assumption.
Figure 7.13b shows the histogram of residuals. If the normality assumption holds, the histogram of residuals should look symmetrical or approximately symmetrical. Also, the histogram should be centered at zero because the sum of the residuals is always zero. The histogram of residuals is approximately symmetrical, which indicates that the errors appear to be approximately normally distributed. Note that the histogram may not be exactly symmetrical; we would like to see a pattern that is symmetrical or approximately symmetrical.
In Figure 7.13c, the residuals are plotted against the fitted values. This plot is used to check the assumption that the relationship between x and y is linear. The plot of residuals vs. the order of the data shown in Figure 7.13d is used to check the independence of errors.
The independence of errors can be checked by plotting the errors or residuals in the order or sequence in which the data were collected. The plot of residuals vs. the order of the data should show no pattern or apparent relationship between consecutive residuals.
Figure 7.13 Residual plots: (a) normal probability plot, (b) histogram of residuals, (c) residuals versus fitted values, (d) residuals versus order of data
In some cases, the residuals plotted in order show a cyclical pattern or steadily increasing values. In these cases, there may be a relationship between consecutive residuals that violates the assumption of independence of errors.
The equality of variance assumption requires that the variance of the errors is constant for all values of x, or that the variability of y is the same for both low and high values of x. This can be checked by plotting the residuals against the order of the data points. This plot is shown in Figure 7.13d. If the equality of variance assumption is violated, this plot will show an increasing trend, indicating increasing variability and a lack of homogeneity in the variances of the y values at each level of x. The plot shows no violation of the equality of variance assumption.
Multiple Regression: Computer Analysis and Results
Introduction to Multiple Regression
In the previous sections we explored the relationship between two variables using simple regression and correlation analysis. We demonstrated how the estimated regression equation can be used to predict a dependent variable (y) using an independent variable (x). We also discussed the correlation between two variables, which explains the degree of linear association between them.
The mathematical form of the multiple linear regression model relating the dependent variable y and two or more independent variables x1, x2, …, xk with the associated error term is given by:

y = β0 + β1x1 + β2x2 + β3x3 + …. + βkxk + ε   (7.16)
where β0, β1, β2, …, βk are the unknown parameters and ε is the error term. Equation (7.16) can be viewed as a population multiple regression model in which y is a linear function of the unknown parameters β0, β1, β2, …, βk and an error term. The error ε explains the variability in y that cannot be explained by the linear effects of the independent variables. The multiple regression model is similar to the simple regression model except that multiple regression involves more than one independent variable.
One of the basic assumptions of regression analysis is that the mean or expected value of the error is zero. This implies that the mean or expected value of y, E(y), in the multiple regression model is given by:

E(y) = β0 + β1x1 + β2x2 + β3x3 + …. + βkxk   (7.17)

The above equation, relating the mean value of y to the k independent variables, is known as the multiple regression equation.
It is important to note that β0, β1, β2, …, βk are the unknown population parameters, or regression coefficients, and they must be estimated using the sample data to obtain the estimated equation of multiple regression:

ŷ = b0 + b1x1 + b2x2 + b3x3 + …. + bkxk   (7.18)
The coefficients b0, b1, b2, …, bk are the estimates of the population parameters and can be determined using the least squares method.
In a multiple linear regression, the variation in y (the response variable) may be explained using two or more independent variables or predictors. The objective is to predict the dependent variable. Compared to simple linear regression, a more precise prediction can often be made because we use two or more independent variables. By using two or more independent variables, we are often able to make use of more information in the model. The simplest form of a multiple linear regression model involves two independent variables and can be written as:
y = β0 + β1x1 + β2x2 + ε   (7.19)

The estimates b0, b1, and b2 of the model parameters are obtained using the least squares method. Recall that in a simple regression, the least squares method requires fitting a line through the data points so that the sum of the squares of the errors or residuals is minimized. These errors or residuals are the vertical distances of the points from the fitted line. The same concept is used to develop the multiple regression equation. In a multiple regression, the least squares method determines the best fitting plane or hyperplane through the data points, ensuring that the sum of the squares of the vertical distances or deviations from the given points to the plane is a minimum.
Figure 7.14 shows a multiple regression model with two independent variables. The response y with two independent variables x1 and x2 forms a regression plane. The observed data points in the figure are shown using dots. The stars on the regression plane indicate the corresponding points that have identical values of x1 and x2. The vertical distances from the observed points to the points on the plane are shown using vertical lines. These vertical lines are the errors. The error for a particular point yi is denoted by (yi − ŷ), where the estimated value ŷ is calculated using the regression equation. The least squares method chooses the coefficients that minimize the sum of the squared errors:

Σ(y − ŷ)²

where y is the observed value and ŷ is the estimated value of the dependent variable given by ŷ = b0 + b1x1 + b2x2.

Figure 7.14 Scatter plot and regression plane with two independent variables
[Note: The terms independent variables, explanatory variables, and predictors have the same meaning and are used interchangeably in this chapter. The dependent variable is often referred to as the response variable in multiple regression.]
Similar to simple regression, the least squares method uses the sample data to estimate the regression coefficients b0, b1, b2, …, bk and hence the estimated equation of multiple regression. Figure 7.15 shows the process of estimating the regression coefficients and the multiple regression equation.
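The estimation process can be sketched by solving the least squares normal equations directly. This is an illustrative sketch, not the book's procedure; the data are synthetic and constructed to lie exactly on a known plane, so the known coefficients are recovered.

```python
# Sketch: estimating b0, b1, b2 for y-hat = b0 + b1*x1 + b2*x2 by
# solving the normal equations X'X b = X'y (Gauss-Jordan elimination).
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [mr - f * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [2 + 3 * a - 1 * b for a, b in zip(x1, x2)]  # exact plane y = 2 + 3x1 - x2

n = len(y)
s12 = sum(a * b for a, b in zip(x1, x2))
xtx = [[n, sum(x1), sum(x2)],
       [sum(x1), sum(a * a for a in x1), s12],
       [sum(x2), s12, sum(b * b for b in x2)]]
xty = [sum(y),
       sum(a * c for a, c in zip(x1, y)),
       sum(b * c for b, c in zip(x2, y))]
b0, b1, b2 = solve(xtx, xty)
```

Because the synthetic points lie exactly on the plane, the fitted coefficients equal the true ones (2, 3, −1); with noisy data they would instead minimize Σ(y − ŷ)².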
Figure 7.15 Process of estimating the multiple regression equation
Models with Two Quantitative Independent Variables x1 and x2

The model with two quantitative independent variables is the simplest multiple regression model. It is a first-order model and is written as:
y = b0 + b1x1 + b2x2   (7.20)

where,
b1 = change in y for a 1-unit increase in x1 when x2 is constant
b2 = change in y for a 1-unit increase in x2 when x1 is constant
The graph of the first-order model is shown in Figure 7.16. This graph with two independent quantitative variables x1 and x2 plots a plane in three-dimensional space. The plane plots the value of y for every combination (x1, x2); these combinations correspond to points in the (x1, x2) plane.
The first-order model with two quantitative variables x1 and x2 is based on the assumption that there is no interaction between x1 and x2. This means that the effect on the response y of a change in x1 (for a fixed value of x2) is the same regardless of the value of x2, and the effect on y of a change in x2 (for a fixed value of x1) is the same regardless of the value of x1.
Figure 7.16 A multiple regression model with two quantitative variables
The multiple regression model is based on the following assumptions about the error term ε.

1. The independence assumption. The errors are independent of each other. That is, the error for a set of values of the independent variables is not related to the error for any other set of values of the independent variables. This assumption is critical when the data are collected over different time periods; when the data are collected over time, the errors in one time period may be correlated with those of another time period.
2. The normality assumption. This means that the errors or residuals are normally distributed.
3. The error assumption. The error ε is a random variable with mean or expected value of zero, that is, E(ε) = 0. This implies that the mean value of the dependent variable y, for given values of the independent variables, is the expected or mean value of y, denoted by E(y), and the population regression model can be written as:

E(y) = β0 + β1x1 + β2x2 + β3x3 + …. + βkxk

4. The equality of variance assumption. The variance of the errors (εi), denoted by σ², is constant for all values of the independent variables x1, x2, …, xk. In case of serious departure from the equality of variance assumption, methods such as weighted least squares or data transformation may be used.
[Note: The terms error and residual have the same meaning and are used interchangeably in this chapter.]
Example: the home heating cost (y) is to be modeled using the average outside temperature (x1), the size of the house (x2) in thousands of square feet, and the age of the furnace (x3) in years. The home heating cost is the response variable and the other three variables are predictors. The data for this problem (MINITAB file HEAT_COST.MTW, EXCEL data file HEAT_COST.xlsx) are listed in Table 7.10 below.
Table 7.10 Data for home heating cost
Row Avg Temp House Size Age of Furnace Heating Cost
1 37 3.0 6 210
2 30 4.0 9 365
3 37 2.5 4 182
4 61 1.0 3 65
5 66 2.0 5 82
6 39 3.5 4 205
7 15 4.1 6 360
8 8 3.8 9 295
9 22 2.9 10 235
10 56 2.2 4 125
11 55 2.0 3 78
12 40 3.8 4 162
13 21 4.5 12 405
14 40 5.0 6 325
15 61 1.8 5 82
16 21 4.2 7 277
17 63 2.3 2 99
18 41 3.0 10 195
19 28 4.2 7 240
20 31 3.0 4 144
21 33 3.2 4 265
22 31 4.2 11 355
23 36 2.8 3 175
24 56 1.2 4 57
25 35 2.3 8 196
26 36 3.6 6 215
27 9 4.3 8 380
28 10 4.0 11 300
29 21 3.0 9 240
30 51 2.5 7 130
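The direction of each relationship in Table 7.10 can also be checked numerically with sample correlation coefficients before any plots are made; a sketch (data transcribed from Table 7.10):

```python
import math

# Sketch: sample correlations between heating cost and each predictor
# in Table 7.10. Rows are (avg_temp, house_size, furnace_age, heating_cost).
rows = [
    (37, 3.0, 6, 210), (30, 4.0, 9, 365), (37, 2.5, 4, 182), (61, 1.0, 3, 65),
    (66, 2.0, 5, 82), (39, 3.5, 4, 205), (15, 4.1, 6, 360), (8, 3.8, 9, 295),
    (22, 2.9, 10, 235), (56, 2.2, 4, 125), (55, 2.0, 3, 78), (40, 3.8, 4, 162),
    (21, 4.5, 12, 405), (40, 5.0, 6, 325), (61, 1.8, 5, 82), (21, 4.2, 7, 277),
    (63, 2.3, 2, 99), (41, 3.0, 10, 195), (28, 4.2, 7, 240), (31, 3.0, 4, 144),
    (33, 3.2, 4, 265), (31, 4.2, 11, 355), (36, 2.8, 3, 175), (56, 1.2, 4, 57),
    (35, 2.3, 8, 196), (36, 3.6, 6, 215), (9, 4.3, 8, 380), (10, 4.0, 11, 300),
    (21, 3.0, 9, 240), (51, 2.5, 7, 130),
]

def corr(u, v):
    """Pearson sample correlation coefficient of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv / math.sqrt(sum((a - mu) ** 2 for a in u)
                           * sum((b - mv) ** 2 for b in v))

cost = [r[3] for r in rows]
r_temp = corr([r[0] for r in rows], cost)  # expected negative
r_size = corr([r[1] for r in rows], cost)  # expected positive
r_age = corr([r[2] for r in rows], cost)   # expected positive
```

The signs of these correlations anticipate what the scatterplots show: cost falls as temperature rises and rises with house size and furnace age.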
As a first step, scatterplots of y versus each of the independent or predictor variables are examined (Figure 7.17). If these scatterplots appear linear enough, a multiple regression model can be fitted. Based on the analysis of the scatterplots of y and each of the independent variables, an appropriate model (for example, a first-order model) can be recommended to predict the home heating cost.
A first-order multiple regression model does not include any higher-order terms (e.g., x²). An example of a first-order model with five independent variables can be written as:

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5   (7.21)
The multiple linear regression model is based on the assumption that the relationship between the response and the independent variables is linear. This relationship can be checked using a matrix plot. The matrix plot is used to investigate the relationships between pairs of variables by creating an array of scatterplots. MINITAB provides two options for constructing the matrix plot: Matrix of Plots and Each Y versus each X. The first of these plots is used to investigate the relationships among pairs of variables when several independent variables are involved. The other plot (each y versus each x) produces separate plots of the response y versus each of the independent variables.
Figure 7.17 Matrix plot of each y vs. each x
Figure 7.17 shows the scatterplots of the heating cost (y) versus the average temperature, the house size, and the age of the furnace. An investigation of the plot shows an inverse relationship between the heating cost and the average temperature (the heating cost decreases as the temperature rises) and a positive relationship between the heating cost and each of the other two variables: house size and age of the furnace. The heating cost increases with increasing house size and also with an older furnace. None of these plots shows a bending (nonlinear or curvilinear) pattern between the response and the explanatory variables; the presence of bending patterns in these plots would suggest a transformation of variables. The scatterplots in Figure 7.17 (also known as side-by-side scatterplots) show a linear relationship between the response and each of the explanatory variables, indicating that all three explanatory variables could be good predictors of the home heating cost. In this case, a multiple linear regression would be an adequate model for predicting the heating cost.
Matrix of Plots: Simple
Another variation of the matrix plot is known as "matrix of plots" in MINITAB and is shown in Figure 7.18. This plot provides scatterplots that are helpful in visualizing not only the relationship of the response variable with each of the independent variables but also the relationships between pairs of independent variables, which are useful in assessing interaction effects. This plot can be used when a more detailed model beyond a first-order model is of interest. Note that the first-order model is the one that contains only first-order terms, with no square or interaction terms, and is written as y = b0 + b1x1 + b2x2 + … + bkxk
The matrix plot in Figure 7.18 is a table of scatterplots, with each cell showing a scatterplot of the variable labeled for the column versus the variable labeled for the row. The cell in the first row and first column displays the scatterplot of heating cost (y) versus average temperature (x1). The plot in the second row and first column is the scatterplot of heating cost (y) and house size (x2), and the plot in the third row and first column shows the scatterplot of heating cost (y) and the age of the furnace (x3). The second column and second row of the matrix plot shows a scatterplot displaying the relationship between average temperature (x1)
and the house size (x2). The scatterplots showing the relationships between the pairs of independent variables are obtained from columns 2 and 3 of the matrix plot. The matrix plot is helpful in visualizing the interaction relationships. For fitting the first-order model, a plot of y versus each x is adequate.

The matrix plots in Figures 7.17 and 7.18 show a negative association or relationship between the heating cost (y) and the average temperature (x1), and a positive association or relationship between the heating cost (y) and the other two explanatory variables: house size (x2) and age of the furnace (x3). All these relationships are linear, indicating that all three explanatory variables can be used to build a multiple regression model. Constructing the matrix plot and investigating the relationships between the variables can be very helpful in building a correct regression model.
The first-order model for predicting the heating cost is of the form:

y = b0 + b1x1 + b2x2 + b3x3

where,
y = Heating cost, x1 = Average outside temperature, x2 = Size of the house (in thousands of square feet), x3 = Age of the furnace (in years)

Table 7.10 and the data file HEAT_COST.MTW show the data for this problem. We used MINITAB to run the regression model for this problem.
Table 7.11 shows the results of running the multiple regression problem using MINITAB. In this table, we have marked some of the calculations (e.g., b0, b1, sb0, sb1) for clarity and explanation. These are not part of the computer output. The regression computer output has two parts: Regression Analysis and Analysis of Variance.
Table 7.11 MINITAB regression analysis results
The Regression Equation
Refer to the "Regression Analysis" part of Table 7.11 for analysis. Since there are three independent or explanatory variables, the regression equation is of the form:

y = b0 + b1x1 + b2x2 + b3x3

where y is the response variable (heating cost); x1, x2, and x3 are the independent variables as described above; and the regression coefficients b0, b1, b2, b3 are stored under the column Coef. These values form the estimated regression equation.
Interpreting the Regression Equation
Equation (7.22) or (7.23) can be interpreted in the following way:
• b1 = −1.65 means that for each unit increase in the average temperature (x1), the heating cost y (in dollars) can be predicted to go down by 1.65 (or $1.65) when the house size (x2) and the age of the furnace (x3) are held constant.
• b2 = +57.5 means that for each unit increase in the house size (x2, in thousands of square feet), the heating cost y (in dollars) can be predicted to go up by 57.5 (or $57.50) when the average temperature (x1) and the age of the furnace (x3) are held constant.
• b3 = +7.91 means that for each unit increase in the age of the furnace (x3, in years), the heating cost y can be predicted to go up by $7.91 when the average temperature (x1) and the house size (x2) are held constant.
Standard Error of the Estimate (s) and Its Meaning

The standard error of the estimate, or the standard deviation of the model, s, is a measure of the scatter or variation of the points around the fitted regression line. For our example,

s = 37.32 dollars

The standard error of the estimate is used to check the utility of the model and to provide a measure of reliability of the prediction made from the model. One interpretation of s is that the interval ±2s provides an approximation to the accuracy with which the regression model will predict the future value of the response y for given values of the independent variables. Thus, for our example, we can expect the model to provide predictions of heating cost (y) to be within ±2s = ±74.64 dollars.
The Coefficient of Multiple Determination (r2)
The coefficient of multiple determination is often used to check the adequacy of the regression model. The value of r2 lies between 0 and 1, or 0 percent and 100 percent; that is, 0 ≤ r2 ≤ 1. It indicates the fraction of the total variation of the dependent variable y that is explained by the independent variables or predictors. Usually, the closer the value of r2 to 1 or 100 percent, the stronger the model. However, one should be careful in drawing conclusions based solely on the value of r2. A large value of r2 does not necessarily mean that the model provides a good fit to the data. In multiple regression, the addition of a new variable to the model always increases the value of r2, even if the added variable is not statistically significant. Thus, adding a new variable will increase r2, indicating a stronger model, but may lead to poor predictions of new values. The value of r2 can be calculated using the expression
r2 = 1 − SSE/SST = 1 − 36,207/301,985 = 0.88

r2 = 88.0%
The value of r2 = 88.0% for our example implies that 88.0 percent of the total variation in heating cost (y) can be explained using the three independent variables in the model: average temperature, size of the house, and age of the furnace. The statistic r2 tells how well the model fits the data and thus indicates the overall predictive usefulness of the model.

The value of adjusted R2 is also used in comparing two regression models that have the same response variable but a different number of independent variables or predictors.
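The adjustment can be sketched with the usual formula adjusted R² = 1 − (1 − r²)(n − 1)/(n − k − 1); here n = 30 and k = 3 as in the example (the specific adjusted value is our computation, not quoted from the text):

```python
# Sketch: adjusted R-squared penalizes each added predictor.
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj = adjusted_r2(0.88, 30, 3)
# A useless 4th predictor that leaves r2 unchanged lowers adjusted R^2:
adj_extra = adjusted_r2(0.88, 30, 4)
```

Unlike r2, the adjusted value can decrease when a variable that adds little explanatory power is included, which is why it is preferred for comparing models with different numbers of predictors.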
Hypothesis Tests in Multiple Regression
Recall that in simple regression analysis, we conducted the test for the significance of regression using a t-test and an F-test. Both of these tests in simple regression provided the same conclusion: if the null hypothesis is rejected, we conclude that the slope is not zero, or β1 ≠ 0. In a multiple regression, the t-test and the F-test have somewhat different purposes.
158 BUSINESS ANALYTICS, VOLUME II
t
1. If the conclusion of the F-test indicates that the regression is significant overall, then a separate t-test is conducted for each of the independent variables to determine whether each of them is significant.
Both the F-test and t-test are explained below.
F-Test
The null and alternate hypotheses for the multiple regression model y = b0 + b1x1 + b2x2 + … + bkxk are stated as

H0: β1 = β2 = … = βk = 0 (Regression is not significant)
H1: at least one of the coefficients is nonzero (7.24)

The test statistic is

F = MSR/MSE (7.25)
The larger the explained portion of the total variability, the larger the F statistic. The values of MSR, MSE, and the F statistic are calculated in the “Analysis of Variance” table of the multiple regression computer output (see Table 7.12 below).
The critical value for the test is given by F(k, n − (k + 1), α), where k is the number of independent variables, n is the number of observations in the model, and α is the level of significance. Note that k and (n − k − 1) are the degrees of freedom associated with MSR and MSE, respectively. The null hypothesis is rejected if F > F(k, n − (k + 1), α), where F is the calculated F value or the test statistic value in the Analysis of Variance table.
Table 7.12 Analysis of variance table
Test the Overall Significance of Regression for the Example Problem at a 5 Percent Level of Significance
Step 1: State the Null and Alternate Hypotheses

For the overall significance of regression, the null and alternate hypotheses are:
H0: β1 = β2 = … = βk = 0 (Regression is not significant)
H1: at least one of the coefficients is nonzero (7.26)

The test statistic is

F = MSR/MSE (7.27)
In the ANOVA table below, the first column refers to the sources
of variation, DF = the degrees of freedom, SS = the sum of squares,
MS = mean squares, F = the F statistic, and p is the probability or p-value
associated with the calculated F statistic.
The degrees of freedom (DF) for Regression and Error are k and n − (k + 1), respectively, where k is the number of independent variables (k = 3 for our example) and n is the number of observations (n = 30). Also, the total sum of squares (SST) is partitioned into two parts, the sum of squares due to regression (SSR) and the sum of squares due to error (SSE), which have the following relationship:

SST = SSR + SSE
We have labeled the SST, SSR, and SSE values in Table 7.12. The mean square due to regression (MSR) and the mean square due to error (MSE) are calculated using the following relationships:

MSR = SSR/k    MSE = SSE/[n − (k + 1)]
The test statistic value or the F statistic from the ANOVA table (see Table 7.12) is

F = 63.62
The calculated F statistic value is 63.62. Since F = 63.62 > Fcritical = 2.74, we reject the null hypothesis stated in equation (7.26) and conclude that the regression is significant overall. This indicates that there exists a significant relationship between the dependent and independent variables.
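The F computation can be reproduced from the sums of squares alone. The sketch below uses the SST and SSE values of the example and the ANOVA identity SSR = SST − SSE; it is a check of the arithmetic, not part of the MINITAB workflow.

```python
# Sketch: F statistic for overall significance from SST = 301,985 and
# SSE = 36,207 (heating-cost example; n = 30 observations, k = 3 predictors).

n, k = 30, 3
sst, sse = 301985, 36207
ssr = sst - sse            # sum of squares due to regression
msr = ssr / k              # mean square due to regression (df = k)
mse = sse / (n - k - 1)    # mean square due to error (df = n - k - 1)
F = msr / mse              # about 63.62, matching the ANOVA table
```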
The hypothesis stated using equation (7.26) can also be tested using the
p-value approach. The decision rule using the p-value approach is given by
If p ≥ α, do not reject H0
If p < α, reject H0
From Table 7.12, the calculated p value is 0.000 (see the P column). Since
p = 0.000 < α = 0.05, we reject the null hypothesis H0 and conclude that
the regression is significant overall.
Hypothesis Tests on Individual Regression Coefficients

To determine whether an individual independent variable xj is significant, the following hypothesis test can be conducted:

H0: βj = 0
H1: βj ≠ 0 (7.28)
Table 7.13 MINITAB regression analysis results
This hypothesis test also helps to determine whether the model can be made more effective by deleting certain independent variables or by adding extra variables. The information needed to conduct the hypothesis test for each of the independent variables is contained in the “Regression Analysis” part of the computer output, which is reproduced in Table 7.13. The columns labeled T and p are used to test the hypotheses. Since there are three independent variables, we will test whether each of them is significant. For the first variable, x1 (average temperature), the hypotheses are:
H0: β1 = 0 (x1 is not significant, or x1 does not contribute to the prediction of y)
H1: β1 ≠ 0 (x1 is significant, or x1 does contribute to the prediction of y) (7.29)
The test statistic is

t = b1/sb1 (7.30)

where b1 is the estimate of the slope β1 and sb1 is the estimated standard deviation of b1.
Step 3: Determine the value of the test statistic
The values b1, sb1 and t are all reported in the Regression Analysis part of
Table 7.13. From this table, these values for the variable x1 or the average
temperature (Avg. Temp.) are
b1 = −1.6457, sb1 = 0.6967

t = b1/sb1 = −1.6457/0.6967 = −2.36
Step 4: Determine the critical value

The critical value for the test is tα/2, [n − (k + 1)], which is the t-value from the t-table for [n − (k + 1)] degrees of freedom. For our example, with α = 0.05, n = 30, and k = 3, the critical value is t0.025, 26 = 2.056.
Step 5: Specify the decision rule

The decision rule for the test is:
Reject H0 if t > +2.056
or, if t < −2.056
The test statistic value (T value) for the variable average temperature (x1) from Table 7.13 is −2.36. Since t = −2.36 < tcritical = −2.056, we reject H0 and conclude that x1 is significant and does contribute to the prediction of y.
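The t-test decision above can be sketched in a few lines; b1 and sb1 are the values read from Table 7.13, and the critical value 2.056 is the table value quoted in the text.

```python
# Sketch: t-test for an individual coefficient (average temperature, x1).

b1, s_b1 = -1.6457, 0.6967       # estimate and its standard error (Table 7.13)
t = b1 / s_b1                    # test statistic, about -2.36
t_critical = 2.056               # t(0.025, 26) from the t-table
reject_h0 = abs(t) > t_critical  # True: x1 is significant
```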
The significance of the other independent variables can be tested in the same way. The test statistic or t values for all the independent variables are reported in Table 7.13 under the T column. The critical values for testing each independent variable are the same as in the test for the first independent variable above, that is, ±2.056. These tests can also be conducted using the p-value approach, with the decision rule

If p ≥ α, do not reject H0
If p < α, reject H0 (7.31)
From Table 7.14, the p-value for the variable average temperature (Avg. Temp., x1) is 0.026. Since p = 0.026 < α = 0.05, we reject H0 and conclude that the variable is significant.
Table 7.14 Summary table
Independent Variable    p-value from Table 7.13    Compare p to α    Decision     Significant? (Yes or No)
Av. Temp. (x1)          0.026                      p < α             Reject H0    Yes
House Size (x2)         0.000                      p < α             Reject H0    Yes
Age of Furnace (x3)     0.024                      p < α             Reject H0    Yes
The level of significance is 0.05 for all the tests. Note that when a series of t-tests is conducted, there is a chance that a null hypothesis will be rejected incorrectly at least once, leading to the conclusion that a β differs from 0. Thus, in multiple regression models where a large number of independent variables are involved and a series of t-tests is conducted, there is a chance of including a large number of insignificant variables and excluding some useful ones from the model. In order to assess the utility of a multiple regression model, we need to conduct a test that includes all the β parameters simultaneously. Such a test would test the overall significance of the multiple regression model.
The other useful measure of the utility of the model is a statistical quantity such as R2 that measures how well the model fits the data.
A Note on Checking the Utility of a Multiple Regression Model (Checking the Model Adequacy)

To check the overall utility of the model, we test

H0: β1 = β2 = … = βk = 0 (No relationship)
H1: at least one of the coefficients is nonzero
In practice, it is not unusual to see correlations among the independent variables. However, if serious multicollinearity is present, it may cause problems by increasing the variance of the regression coefficients, making them unstable and difficult to interpret. Highly correlated independent variables also increase the likelihood of rounding errors in the calculation of the β estimates and standard errors. In the presence of multicollinearity, the regression results may be misleading.
Effects of Multicollinearity
A) Consider a regression model where the production cost (y) is related to three independent variables: machine hours (x1), material cost (x2), and labor hours (x3):

y = β0 + β1x1 + β2x2 + β3x3
The MINITAB computer output for this model is shown in Table 7.15. If we perform t-tests for β1, β2, and β3, we find that all three independent variables are non-significant at α = 0.05, while the F-test for H0: β1 = β2 = β3 = 0 is significant (see Table 7.15). The result of the F-test indicates that at least one of the three variables is significant, or is making a contribution to the prediction of the response y. It is also possible that at least two or all three variables are contributing to the prediction of y. Here, the contribution of one variable overlaps with that of the other variable or variables. This is because of the multicollinearity effect.
Table 7.15 Regression Analysis: PROD COST vs. MACHINE HOURS, MATERIAL COST, and LABOR HOURS
B) The negative coefficient of machine hours (x1) in the regression model indicates that for each unit increase in machine hours, the production cost (y) decreases when the other two factors are held constant. However, we would expect the production cost (y) to increase as more machine hours are used. This may be due to the presence of multicollinearity: because of it, a β parameter may have the opposite sign from what is expected.
Detecting Multicollinearity
MINITAB calculates a variance inflation factor (VIF) for each predictor variable, which measures how much the variance of the estimated regression coefficient is inflated compared with when the predictor variables are not linearly related. Use the guidelines in Table 7.16 to interpret the VIF.
Table 7.16 Detecting correlation using VIF values
Values of VIF Predictors are…
VIF = 1 Not correlated
1 < VIF < 5 Moderately correlated
VIF = 5 to 10 or greater Highly correlated
VIF values greater than 10 may indicate that multicollinearity is unduly influencing your regression results. In this case, you may want to reduce multicollinearity by removing unimportant independent variables from your model.
Refer to Table 7.15 for the VIF values in the production cost example. The VIF value for each predictor is greater than 10, indicating the presence of multicollinearity: the predictors are highly correlated. The VIF for each of the independent variables is calculated automatically when a multiple regression model is run using MINITAB.
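Outside MINITAB, a VIF can be computed by regressing each predictor on the others and taking 1/(1 − R2). The numpy sketch below uses made-up data (two nearly collinear predictors and one unrelated one), not the production-cost data; the function name is our own.

```python
# Sketch: variance inflation factors via 1 / (1 - R^2) of each predictor
# regressed on the remaining predictors. Data below are illustrative.
import numpy as np

def vif(X):
    n, p = X.shape
    out = []
    for j in range(p):
        yj = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, yj, rcond=None)
        resid = yj - A @ beta
        r2 = 1 - resid @ resid / ((yj - yj.mean()) @ (yj - yj.mean()))
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)  # nearly collinear with x1
x3 = rng.normal(size=50)                  # unrelated predictor
vifs = vif(np.column_stack([x1, x2, x3]))
# vifs[0] and vifs[1] are large (well above 10); vifs[2] is near 1
```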
Correlation, r

The correlation coefficient, r, between each pair of predictors can also be used to detect multicollinearity; Table 7.17 provides guidelines for interpreting it. Table 7.18 shows the correlation coefficient r between each pair of predictors for the production cost example. The values of r show that the variables are highly correlated. The correlation coefficient matrix was calculated using MINITAB.
Table 7.17 Determining multicollinearity using the correlation coefficient, r

Correlation Coefficient, r
r ≥ 0.8    Extreme multicollinearity
r < 0.2    Low multicollinearity
Table 7.18 Correlation matrix for the production cost example

               Machine Hours    Material Cost
Material Cost  0.964
Labor Hours    0.953            0.917

Cell Contents: Pearson correlation
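A pairwise correlation matrix like Table 7.18 can also be produced with numpy.corrcoef; the values below are illustrative stand-ins for the three predictors, not the actual production-cost data.

```python
# Sketch: Pearson correlation matrix for three strongly related predictors.
import numpy as np

machine = np.array([10.0, 12, 15, 18, 20, 23, 26, 30])
material = 3.1 * machine + np.array([1.0, -2, 2, 0, -1, 2, -2, 1])
labor = 1.5 * machine + np.array([0.0, 1, -1, 2, 0, -2, 1, 0])

R = np.corrcoef([machine, material, labor])  # 3 x 3 matrix, rows = variables
# large off-diagonal entries (near 1) signal multicollinearity
```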
Summary of the Key Features
of Multiple Regression Model
The multiple regression model above extended the concept of simple linear regression. Among the measures discussed were the reliability of the prediction made from the model and the coefficient of multiple determination r2, which explains the variability in the response y accounted for by the independent variables used in the model. Besides these, we discussed the hypothesis tests using the computer results. Step-wise instructions were provided to conduct the F-test and t-tests. The overall significance of the regression model is tested using the F-test. The t-test is conducted on an individual predictor or independent variable to determine the significance of that variable. The effects of multicollinearity and its detection using the computer were discussed with examples.
Model Building and Computer Analysis
Introduction to Model Building
The previous sections presented the simple and multiple regression models along with the analysis and interpretation of computer results. In both the simple and multiple regression models, the relationship among the variables is linear. In this chapter we will provide an introduction to model building and nonlinear regression models. By model building, we mean selecting the model that will provide a good fit to a set of data, and the one that will provide a good estimate of the response or dependent variable y that is related to the independent variables or factors x1, x2, …, xn. It is important to choose the right model for the data.

In regression analysis, the dependent or response variable is usually quantitative. The independent variables may be either quantitative or qualitative. A general polynomial model relating y to a single quantitative independent variable x can be written as

y = b0 + b1x + b2x² + b3x³ + … + bnxⁿ (7.32)

In the above equation, n is an integer and b0, b1, …, bn are unknown parameters that must be estimated.
A) First-order Model
The first-order model is given by:

y = b0 + b1x

or, with several independent variables,

y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn (7.33)

where b0 = y-intercept and bi = regression coefficients.
B) Second-order Model
A second-order model can be written as

y = b0 + b1x + b2x² (7.34)

Equation (7.34) is a parabola; the shape of the curve is shown in Figure 7.19.
C) Third-order Model
A third-order model can be written as:

y = b0 + b1x + b2x² + b3x³ (7.35)

where b0 is the y-intercept and b3 controls the rate of reversal of the curvature of the curve.
A second-order model has no reversal in curvature. In a second-order model, the y value either continues to increase or continues to decrease as x increases and produces either a trough or a peak. A third-order model produces one reversal in curvature, with one peak and one trough. Reversals in curvature are not very common but can be modeled using third- or higher-order polynomials. The graph of an nth-order polynomial contains at most (n − 1) peaks and troughs. Figure 7.20 shows the graph of a third-order polynomial. In real-world situations, the second-order model is perhaps the most useful.
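First-, second-, and third-order fits of the kind described above can be sketched with numpy.polyfit; the data below are made up (an exact quadratic), so the second-order fit recovers the coefficients almost exactly.

```python
# Sketch: fitting polynomial models of increasing order with numpy.polyfit.
import numpy as np

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = 2 + 0.5 * x + 0.3 * x**2      # exact second-order relationship

c1 = np.polyfit(x, y, 1)  # first-order fit:  [b1, b0]
c2 = np.polyfit(x, y, 2)  # second-order fit: [b2, b1, b0], recovers 0.3, 0.5, 2
c3 = np.polyfit(x, y, 3)  # third-order fit:  leading coefficient near 0
```

Note that numpy.polyfit returns coefficients from the highest order down, the reverse of the b0, b1, … notation used in the text.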
Table 7.19 Life of electronic components
Obs.       1      2      3      4      5      6      7      8      9      10
X (Temp.)  99     101    100    113    72     93     94     89     95     111
Y (Life)   141.0  136.7  145.7  194.3  101.5  121.4  123.5  118.4  137.0  183.2

Obs.       11     12     13     14     15     16     17     18     19     20
X (Temp.)  72     76     105    84     102    103    92     81     73     97
Y (Life)   106.6  97.5   156.9  111.2  158.2  155.1  119.7  105.9  101.3  140.1

Obs.       21     22     23     24     25
X (Temp.)  105    90     94     79     91
Figure 7.21 Scatter Plot of Life (y) vs. Operating Temp. (x)
A second-order model was fitted using MINITAB. The regression output is shown in Table 7.20.
Table 7.20 Computer results of second order model
Figure 7.23 shows the residual plots for this quadratic model. The residual
plots are useful in checking the assumptions of the model and the model
adequacy.
The analysis of residual plots for this model is similar to that of the simple and multiple regression models. The investigation of the plots shows that the normality assumption is met. The plot of residuals versus the fitted values shows a random pattern, indicating that the quadratic model fitted to the data is adequate.

Figure 7.23 Residual plots for the quadratic model example
Unlike MINITAB, EXCEL does not provide a direct option to run a quadratic model of the form

y = b0 + b1x + b2x²

However, we can run a quadratic regression model by calculating an x² column from the x column in the data file. The EXCEL computer results are shown in Table 7.21.
In the EXCEL output, the prediction equation can be read from the “Coefficients” column. The r2 value is 95.9 percent, which is an indication of a strong model.
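The same “add an x² column” idea works in any least-squares tool. As a sketch, the code below fits the quadratic to the first 20 observations of Table 7.19 using numpy (the life values for observations 21 to 25 are not shown in the table); the resulting r² is close to, though not identical with, the 95.9 percent reported for the full data.

```python
# Sketch: quadratic fit by adding an x^2 column, first 20 rows of Table 7.19.
import numpy as np

x = np.array([99, 101, 100, 113, 72, 93, 94, 89, 95, 111,
              72, 76, 105, 84, 102, 103, 92, 81, 73, 97], dtype=float)
y = np.array([141.0, 136.7, 145.7, 194.3, 101.5, 121.4, 123.5, 118.4,
              137.0, 183.2, 106.6, 97.5, 156.9, 111.2, 158.2, 155.1,
              119.7, 105.9, 101.3, 140.1])

A = np.column_stack([np.ones_like(x), x, x**2])  # intercept, x, x^2 columns
b, *_ = np.linalg.lstsq(A, y, rcond=None)        # b = [b0, b1, b2]
resid = y - A @ b
r2 = 1 - resid @ resid / np.sum((y - y.mean())**2)  # high, indicating a strong fit
```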
To determine whether the second-order term is significant, we test

H0: β2 = 0
H1: β2 ≠ 0 (7.36)
Table 7.21 EXCEL computer output for the quadratic model
Summary Output

Regression Statistics
Multiple R           0.97947
R Square             0.95936
Adjusted R Square    0.95567
Standard Error       5.37620
Observations         25

ANOVA
            df    SS           MS          F         Significance F
Regression  2     15,011.7720  7,505.8860  259.6872  0.0000
Residual    22    635.8784     28.9036
Total       24    15,647.6504

            Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept   433.0063      61.8367         7.0024   0.0000   304.7648   561.2478
The test statistic for this test is given by

t = b2/sb2
The test statistic value is calculated by the computer and is shown in Table 7.21. In this table, the t value is reported in the x² row under the t Stat column. This value is 7.93. Thus,

t = b2/sb2 = 7.93
The critical value for the test is

t(n − k − 1, α/2) = t(22, 0.025) = 2.074
[Note: t(n − k − 1) is the t-value from the t-table for (n − k − 1) degrees of freedom, where n is the number of observations and k is the number of independent variables.]

For our example, n = 25, k = 2, and the level of significance α = 0.05. Using these values, the critical value or the t-value from the t-table for 22 degrees of freedom and a tail area of 0.025 is 2.074. Since the calculated value of t = 7.93 is greater than 2.074, we reject the null hypothesis and conclude that the second-order term in fact contributes to the prediction of the life of components (y). Note: we could also have tested the following hypotheses:
H0: β2 = 0
H1: β2 > 0
This test determines whether the value of b2 = 0.0598 in the prediction equation is large enough to conclude that the life of the components increases at an increasing rate with temperature. This hypothesis has the same test statistic and can be tested at α = 0.05.
Therefore, our conclusion is that the mean component life increases at an increasing rate with temperature, and the second-order term in our model is in fact significant and contributes to the prediction of y.
The fitted line plot of temperature and yield in Figure 7.24 shows the yield of a chemical process at different temperatures. The plot clearly indicates a nonlinear relationship, and there is an indication that the data can be well approximated by a quadratic model.

We used MINITAB and EXCEL to fit a quadratic model to the data. The prediction equation from the regression output indicated that the quadratic model is appropriate and can be used to predict the yield at different temperatures.
Figure 7.24 Fitted line plot showing the yield of a chemical process
Summary of Model Building
The sections above provided an introduction to model building. The first-order, second-order, and third-order models were discussed. Unlike the simple and multiple regression models, where the relationship among the variables is linear, there are situations where the relationship among the variables under study may not be linear. We discussed situations where higher-order and nonlinear models provide a better relationship between the response and independent variables, and provided examples of quadratic or second-order models. Scatter plots were created to select the model that would provide a good fit to a set of data and can be used to obtain a good estimate of the response or dependent variable y that is related to the independent variables or predictors. Since second-order or quadratic models are appropriate in many applications, we provided a detailed computer analysis of such models. The computer analysis and interpretation of computer results were explained and examined, including the residual plots and analysis.
Suppose we want to model the salaries (y) of male and female employees based on their education and years of experience; the variable male or female is a qualitative variable that must be included as a separate independent variable in the model. To include such qualitative variables in the model we use a dummy or indicator variable. The use of dummy or indicator variables in a regression model allows us to include qualitative variables in the model. For example, to include the gender of an employee, we can define

x1 = 1 if the employee is a male, 0 if the employee is a female
In the above formulation, a “1” indicates that the employee is a male and a “0” means the employee is a female. Which of male or female is assigned the value 1 is arbitrary.
In general, the number of dummy or indicator variables needed is one less than the number of levels of the qualitative variable to be included in the model.
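The (m − 1) rule can be sketched as a small encoding function; the function name and the base-level convention (first listed level coded as all zeros) are our own choices for the illustration.

```python
# Sketch: encode a qualitative variable with m levels as m - 1 dummy columns;
# levels[0] is the base level and is coded as all zeros.

def dummy_encode(values, levels):
    non_base = levels[1:]  # one dummy column per non-base level
    return [[1 if v == lvl else 0 for lvl in non_base] for v in values]

rows = dummy_encode(["A", "B", "C", "A"], levels=["A", "B", "C"])
# A -> [0, 0] (base level), B -> [1, 0], C -> [0, 1]
```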
Consider a model relating the salary y of male and female employees to gender alone. This model can be written as

y = b0 + b1x

This coding scheme allows us to compare the mean salary for male and female employees. Setting x = 0 in the model gives the mean salary for the female employees: thus, the mean salary for the female employees is b0. In a 0-1 coding system, the mean response will always be b0 for the level of the qualitative variable that is assigned the value 0. This is also called the base level.
The difference in the mean salary for the male and female employees is b1. This is the difference between the mean response for the level that is assigned the value 1 and the level that is assigned the value 0, the base level. The mean salary for the male and female employees is shown graphically in Figure 7.25. We can also see that
b0 = µF
b1 = µM − µF
Figure 7.25 Mean salary of female and male employees
Now consider a qualitative variable with three levels, for example, the profit y of stores at three locations A, B, and C. The model can be written as

y = b0 + b1x1 + b2x2

where

x1 = 1 if location B, 0 if not
x2 = 1 if location C, 0 if not
The variables x1 and x2 are the dummy variables that allow the three-level qualitative variable to be included in the model.
Suppose, µA = mean profit for location A
µB = mean profit for location B
µC = mean profit for location C
If we set x1 = 0 and x2 = 0, we get the mean profit for location A. Therefore, the mean value of profit y when the store location is A is

µA = y = b0 + b1(0) + b2(0)
or, µA = b0

Similarly, setting x1 = 1 and x2 = 0 gives the mean profit for location B:

µB = y = b0 + b1x1 + b2x2 = b0 + b1(1) + b2(0)
or, µB = b0 + b1

Thus,

µB = µA + b1
b1 = µB − µA
Setting x1 = 0 and x2 = 1 gives the mean profit for location C:

µC = y = b0 + b1x1 + b2x2 = b0 + b1(0) + b2(1)
µC = b0 + b2
or,
µC = µA + b2
b2 = µC − µA
Thus, in the above coding system, with one qualitative independent variable at three levels,

µA = b0
µB = b0 + b1, so that b1 = µB − µA
µC = b0 + b2, so that b2 = µC − µA

where µA, µB, and µC are the mean profits for locations A, B, and C.
Note that the three levels of the qualitative variable can be described with only two dummy variables. This is because the mean of the base level (in this case location A) is accounted for by the intercept b0. In general, for m levels of a qualitative variable, we need (m − 1) dummy variables.
The bar graph in Figure 7.26 shows the values of mean profit (y) for
the three locations.
Figure 7.26 Bar chart showing the mean profit for three locations A, B, and C
In the above bar chart, the height of the bar corresponding to location A is y = b0. Similarly, the heights of the bars corresponding to locations B and C are y = b0 + b1 and y = b0 + b2, respectively. Note that either b1 or b2, or both, could be negative. In Figure 7.26, b1 and b2 are both positive.
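With this 0-1 coding, least squares recovers exactly the quantities derived above: b0 = µA, b1 = µB − µA, b2 = µC − µA. The sketch below verifies this on made-up profit figures, not data from the book.

```python
# Sketch: dummy-variable regression recovers the group means.
import numpy as np

profit = np.array([10.0, 12, 11, 15, 16, 17, 20, 21, 19])  # made-up data
x1 = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])  # 1 if location B
x2 = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])  # 1 if location C

A = np.column_stack([np.ones(9), x1, x2])
b, *_ = np.linalg.lstsq(A, profit, rcond=None)
# b[0] = mean(A) = 11, b[1] = mean(B) - mean(A) = 5, b[2] = mean(C) - mean(A) = 9
```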
In an earlier example, the sales (y) of a company were modeled using three independent variables: advertisement dollars spent (x1) in hundreds of dollars, commission paid to the salespersons (x2) in hundreds of dollars, and the number of salespersons (x3). The company is now interested in including the different sales territories where it markets the drug. The territory in which the company markets the drug is divided into three zones: zones A, B, and C. The management wants to predict the sales for the three zones separately. To do this, the variable “zone,” which is a qualitative independent variable, must be included in the model. The company identified the sales volumes for the three zones along with the variables considered earlier. The data, including the sales volume and the three zones, are shown in Table 7.22 with the zone in the last column (Data File: DummyVar_File1).
To include the zone in the model, define the dummy variables

x4 = 1 if zone A, 0 otherwise
x5 = 1 if zone B, 0 otherwise

In the above coding system, the choice of 0 and 1 in the coding is arbitrary.
Note that we have defined only two dummy variables, x4 and x5, for the three zones. The complete model is

y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5

where x4 and x5 are as defined above.
Table 7.22 Sales for different zones
Row  Sales Volume (y)  Advertisement (x1)  Commission (x2)  No. of Salespersons (x3)  Zone
1    973.62            580.17              235.48            8                        A
2    903.12            414.67              240.78            7                        A
3    1,067.37          420.48              276.07           10                        A
4    1,193.37          454.59              295.70           14                        B
5    1,429.62          524.05              286.67           16                        C
6    1,557.87          623.77              325.66           18                        A
7    1,590.12          641.89              298.82           17                        A
8    1,081.62          403.03              210.19           12                        C
9    1,088.37          415.76              202.91           13                        C
10   1,132.62          506.73              275.88           11                        B
11   1,314.87          490.35              337.14           15                        A
12   1,562.37          624.24              266.30           19                        C
13   1,050.12          459.56              240.13           10                        C
14   1,055.37          447.03              254.18           12                        B
15   1,112.37          493.96              237.49           14                        B
16   1,235.37          543.84              276.70           16                        B
17   1,518.12          618.38              271.14           18                        A
Table 7.23 shows the data file for this regression model with the dummy variables. The data can be analyzed using the MINITAB data file [Data File: DummyVar_File(2)] or the EXCEL data file [DummyVar_File(2).xlsx].

We used both MINITAB and EXCEL to run this model. The MINITAB and EXCEL regression outputs and results are shown in Tables 7.24 and 7.25. Refer to the computer results to answer the following questions.
Table 7.23 Data file for the model with dummy variables
Row  Volume (y)  Advertisement (x1)  Commission (x2)  No. of Salespersons (x3)  Zone A (x4)  Zone B (x5)
1    973.62      580.17              235.48            8                        1            0
2    903.12      414.67              240.78            7                        1            0
3    1,067.37    420.48              276.07           10                        1            0
4    1,193.37    454.59              295.70           14                        0            1
5    1,429.62    524.05              286.67           16                        0            0
6    1,557.87    623.77              325.66           18                        1            0
7    1,590.12    641.89              298.82           17                        1            0
8    1,081.62    403.03              210.19           12                        0            0
9    1,088.37    415.76              202.91           13                        0            0
10   1,132.62    506.73              275.88           11                        0            1
11   1,314.87    490.35              337.14           15                        1            0
12   1,562.37    624.24              266.30           19                        0            0
13   1,050.12    459.56              240.13           10                        0            0
14   1,055.37    447.03              254.18           12                        0            1
15   1,112.37    493.96              237.49           14                        0            1
16   1,235.37    543.84              276.70           16                        0            1
17   1,518.12    618.38              271.14           18                        1            0
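As an alternative sketch to the MINITAB/EXCEL runs described next, the model y = b0 + b1x1 + … + b5x5 can be fitted to the 17 rows of Table 7.23 with numpy least squares; any coefficients obtained this way should differ from the book's output only by rounding.

```python
# Sketch: least-squares fit of the dummy-variable model on Table 7.23.
import numpy as np

y = np.array([973.62, 903.12, 1067.37, 1193.37, 1429.62, 1557.87, 1590.12,
              1081.62, 1088.37, 1132.62, 1314.87, 1562.37, 1050.12, 1055.37,
              1112.37, 1235.37, 1518.12])
x1 = np.array([580.17, 414.67, 420.48, 454.59, 524.05, 623.77, 641.89, 403.03,
               415.76, 506.73, 490.35, 624.24, 459.56, 447.03, 493.96, 543.84,
               618.38])
x2 = np.array([235.48, 240.78, 276.07, 295.70, 286.67, 325.66, 298.82, 210.19,
               202.91, 275.88, 337.14, 266.30, 240.13, 254.18, 237.49, 276.70,
               271.14])
x3 = np.array([8, 7, 10, 14, 16, 18, 17, 12, 13, 11, 15, 19, 10, 12, 14, 16, 18.0])
x4 = np.array([1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1.0])  # zone A
x5 = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0.0])  # zone B

A = np.column_stack([np.ones(17), x1, x2, x3, x4, x5])
b, *_ = np.linalg.lstsq(A, y, rcond=None)  # [b0, b1, b2, b3, b4, b5]
pred = A @ b
r2 = 1 - np.sum((y - pred)**2) / np.sum((y - y.mean())**2)
```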
A) Using the EXCEL data file, run a regression model. Show your regression output.
B) Using the MINITAB or EXCEL regression output, write down the regression equation.
C) Using a 5 percent level of significance and the “p” column in the MINITAB regression output or the “p-value” column in the EXCEL regression output, conduct appropriate hypothesis tests to determine whether the independent variables advertisement, commission paid, and number of salespersons are significant, that is, whether they contribute to predicting the sales volume.
D) Write separate regression equations to predict the sales for each of the zones A, B, and C.
E) Refer to the given MINITAB residual plots and check that all the regres-
rP
sion assumptions are met and the fitted regression model is adequate.
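The same model can also be fit outside MINITAB or EXCEL. The following is a sketch (an assumption, not part of the text) that fits the Table 7.23 data by ordinary least squares in Python:

```python
import numpy as np

# Sketch (not from the text): least-squares fit of the dummy-variable model
# using the data of Table 7.23.
# Column order per row: volume y, advertisement x1, commission x2,
# salespersons x3, Zone A dummy x4, Zone B dummy x5.
data = [
    (973.62, 580.17, 235.48, 8, 1, 0),  (903.12, 414.67, 240.78, 7, 1, 0),
    (1067.37, 420.48, 276.07, 10, 1, 0), (1193.37, 454.59, 295.70, 14, 0, 1),
    (1429.62, 524.05, 286.67, 16, 0, 0), (1557.87, 623.77, 325.66, 18, 1, 0),
    (1590.12, 641.89, 298.82, 17, 1, 0), (1081.62, 403.03, 210.19, 12, 0, 0),
    (1088.37, 415.76, 202.91, 13, 0, 0), (1132.62, 506.73, 275.88, 11, 0, 1),
    (1314.87, 490.35, 337.14, 15, 1, 0), (1562.37, 624.24, 266.30, 19, 0, 0),
    (1050.12, 459.56, 240.13, 10, 0, 0), (1055.37, 447.03, 254.18, 12, 0, 1),
    (1112.37, 493.96, 237.49, 14, 0, 1), (1235.37, 543.84, 276.70, 16, 0, 1),
    (1518.12, 618.38, 271.14, 18, 1, 0),
]

y = np.array([r[0] for r in data])
# Design matrix: leading column of ones for the intercept, then x1..x5.
X = np.array([[1.0, r[1], r[2], r[3], r[4], r[5]] for r in data])

coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # intercept, b1 (advertisement), b2 (commission), b3, b4, b5
```

If these are exactly the data used in the text, the printed coefficients should be comparable to the values reported in the solution (−98.2, 0.884, 1.81, 33.8, −67.2, −105).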
Solution:
A) The MINITAB regression output is shown in Table 7.24.
B) Table 7.25 shows the EXCEL regression output.
C) From the MINITAB or the EXCEL regression outputs in Tables 7.24
and 7.25, the regression equation is:

Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2)
+ 33.8 No. of Salespersons (x3) − 67.2 Zone A (x4) − 105 Zone B (x5)

The regression equation from the EXCEL output in Table 7.25 can be
written using the coefficients column.

For each independent variable we test the hypotheses H0: βi = 0 (the
variable is not significant) versus H1: βi ≠ 0 (the variable is significant).
These hypotheses can be tested using the "p" column in either output
with the decision rule:

If p ≥ α, do not reject H0
If p < α, reject H0
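The decision rule above can be expressed as a small helper function (a hypothetical illustration, not from the text):

```python
# Sketch (hypothetical helper, not from the text): decision rule for the test
# H0: beta_i = 0 vs H1: beta_i != 0 at significance level alpha.
def decide(p_value, alpha=0.05):
    # Reject H0 only when the p-value is strictly below alpha.
    return "reject H0" if p_value < alpha else "do not reject H0"

# All three predictors in this problem have p = 0.000 at alpha = 0.05.
print(decide(0.000))
print(decide(0.12))
```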
[Tables 7.24 and 7.25: MINITAB and EXCEL regression outputs (not reproduced here)]
Table 7.26 shows the p-value for each of the predictor variables, taken
from the MINITAB or EXCEL computer results in Table 7.24 or 7.25
(see the "p" or the "p-value" columns in these tables). From Table 7.26
it can be seen that all three independent variables are significant.
(D) As indicated, the overall regression equation is

Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2)
+ 33.8 No. of Salespersons (x3) − 67.2 Zone A (x4) − 105 Zone B (x5)
Table 7.26 Summary table

Independent Variable       p-value from Table 7.24 or 7.25   Compare p to α   Decision    Significant? Yes or No
Advertisement (x1)         0.000                             p < α            Reject H0   Yes
Commissions (x2)           0.000                             p < α            Reject H0   Yes
No. of salespersons (x3)   0.000                             p < α            Reject H0   Yes
Zone A: x4 = 1.0, x5 = 0

Therefore, the equation for the sales volume of Zone A can be written as

Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2)
+ 33.8 No. of Salespersons (x3) − 67.2(1) − 105(0.0) or,
Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2)
+ 33.8 No. of Salespersons (x3) − 67.2 or,
Sales Volume (y) = −165.4 + 0.884 Advertisement (x1) + 1.81 Commission (x2)
+ 33.8 No. of Salespersons (x3)
Similarly, the regression equations for the other two zones are shown
below.

Zone B: x4 = 0, x5 = 1.0

Substituting these values in the overall regression equation of part (C):

Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2)
+ 33.8 No. of Salespersons (x3) − 105 or,
Sales Volume (y) = −203.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2)
+ 33.8 No. of Salespersons (x3)

Zone C: x4 = 0, x5 = 0

Sales Volume (y) = −98.2 + 0.884 Advertisement (x1) + 1.81 Commission (x2)
+ 33.8 No. of Salespersons (x3)
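The zone-specific intercepts follow from adding the relevant dummy coefficient to the overall intercept. A quick sketch (not from the text) using the coefficient values reported in the solution:

```python
# Sketch (not from the text): each zone's intercept is the overall intercept
# plus that zone's dummy coefficient; the slopes for x1, x2, x3 are unchanged.
b0, b4, b5 = -98.2, -67.2, -105.0   # intercept, Zone A (x4) and Zone B (x5) coefficients

intercepts = {
    "A": b0 + b4,   # x4 = 1, x5 = 0
    "B": b0 + b5,   # x4 = 0, x5 = 1
    "C": b0,        # x4 = 0, x5 = 0 (baseline zone)
}
print(intercepts)
```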
(E) The MINITAB residual plots are shown in Figure 7.27.

The residual plots in Figure 7.27 show that the normal probability
plot and the histogram of residuals are approximately normally
distributed. The plot of residuals versus fits does not show any pattern
and is quite random, indicating that the fitted linear regression model
is adequate. The plot of residuals versus the order of the data points
shows no apparent pattern, indicating that there is no violation of the
independence-of-errors assumption.
[Figure 7.27: MINITAB residual plots (not reproduced here)]
Interaction Models — An interaction model relating y and two quantitative
independent variables can be written as

y = b0 + b1x1 + b2x2 + b3x1x2

Models with Dummy Variables — General form of a model with one
qualitative (dummy) independent variable at m levels:

y = b0 + b1x1 + b2x2 + ... + b(m−1)x(m−1)

where xi is the dummy variable for level (i + 1), with xi = 1 if the
observation is at level (i + 1) and 0 otherwise.

All Subset and Stepwise Regression — Finding the best set of predictor
variables to be included in the model.

Note: the Interaction Models and All Subset Regression are not discussed
in this chapter.

There are other regression models that are not discussed here but can be
developed using the concepts presented for the other models. Some of these
models are explained below.

Reciprocal Transformation of the x Variable — y = β0 + β1(1/x) + ε.
This model is appropriate when x and y have an inverse relationship.
Note that the inverse relationship is not linear.

Log Transformation of the x Variable — y = β0 + β1 ln(x) + ε. This is
a useful curvilinear form, where ln(x) is the natural logarithm of x
and x > 0.

Log Transformation of the x and y Variables — ln(y) = β0 + β1 ln(x) + ε.
The purpose of this transformation is to achieve a linear relationship.
The model is valid for positive values of x and y. This transformation
is more involved, and it is difficult to compare it to other models with
y as the dependent variable.

Logistic Regression — This model is used when the response variable
is categorical.
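As an illustration of the log transformation of the x variable, the following sketch (synthetic data, not from the text) fits y = β0 + β1 ln(x) by transforming x and then applying simple least squares:

```python
import math

# Sketch (synthetic data, not from the text): the log model y = b0 + b1*ln(x)
# becomes a straight line in the transformed predictor u = ln(x).
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 + 3.0 * math.log(x) for x in xs]   # exact curve y = 2 + 3 ln(x)

u = [math.log(x) for x in xs]                # transformed predictor
n = len(u)
ubar, ybar = sum(u) / n, sum(ys) / n

# Closed-form simple least-squares slope and intercept.
b1 = sum((ui - ubar) * (yi - ybar) for ui, yi in zip(u, ys)) / \
     sum((ui - ubar) ** 2 for ui in u)
b0 = ybar - b1 * ubar

print(b0, b1)  # recovers the coefficients 2 and 3 (up to floating point)
```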
Implementation Steps and Strategy for Regression Models

Successful implementation of regression models requires an understanding
of the different types of models. Knowledge of the least-squares method, on
which many regression models are based, as well as awareness of the
assumptions of least-squares regression, is critical in evaluating and
implementing the correct regression models. Computer packages have
made model building and analysis easy. As we have demonstrated, the
scatter plots and matrix plots constructed using the computer are very
helpful in the initial stages of selecting the right model for the given data.
The residual plots for checking the assumptions of regression can also be
easily constructed using the computer. While computer packages have
removed the computational hurdle, it is important to understand the
fundamentals underlying regression in order to apply regression models
properly. A lack of understanding of the least-squares method and the
assumptions underlying regression may lead to wrong conclusions and
inappropriate courses of action. For example, if the assumptions of
regression are violated, it is important to determine the alternate course
or courses of action.
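One common numeric check of the independence-of-errors assumption, complementing the residual plots discussed above, is the Durbin-Watson statistic. The following is a minimal sketch with hypothetical residuals (a supplement, not part of the text):

```python
# Sketch (hypothetical residuals, not from the text): the Durbin-Watson
# statistic DW = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2.
# Values near 2 suggest no first-order autocorrelation; values well below 2
# suggest positive autocorrelation, and values well above 2 negative.
def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # alternating residuals -> 3.0
```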