Regression Models
Learning Objectives:
After the completion of the chapter, the students will be able to:
identify variables, visualize them in a scatter diagram, and use them in a regression model;
develop simple linear regression equations from sample data and interpret the slope and intercept;
calculate the coefficient of determination and the coefficient of correlation and interpret their meanings;
list the assumptions used in regression and use residual plots to identify problems;
interpret the F test in a linear regression model;
develop a multiple regression model and use it for prediction purposes;
use dummy variables to model categorical data;
determine which variables should be included in a multiple regression model; and
identify the commonly made mistakes in using regression analysis.
Scatter Diagrams
Regression analysis is a very valuable tool for today’s manager. Regression has been
used to model things such as the relationship between level of education and income,
the price of a house and the square footage, and the sales volume for a company
relative to the dollars spent on advertising. When businesses are trying to decide which
location is best for a new store or branch office, regression models are often used.
Cost estimation models are often regression models. The applicability of regression
analysis is virtually limitless.
There are generally two purposes for regression analysis. The first is to understand
the relationship between variables such as advertising expenditures and sales. The
second purpose is to predict the value of one variable based on the value of the other.
GGP Construction Company specializes in renovating old homes. Over time, the company has found that its volume of renovation work is dependent on the Cavite area payroll. The figures for the company's revenues and the amount of money earned by wage earners in Cavite for the past 6 years are presented in Table 4.1.
Economists have predicted the local area payroll to be PhP600 million next year, and
the construction company wants to plan accordingly.
Figure 4.1 provides a scatter diagram for the GGP Construction data given in Table
4.1. This graph indicates that higher values for the local payroll seem to result in higher
sales for the company. There is not a perfect relationship because not all the points lie
in a straight line, but there is a relationship. A line has been drawn through the data to
help show the relationship that exists between the payroll and sales. The points do not
all lie on the line, so there would be some error involved if we tried to predict sales
based on payroll using this or any other line.
Figure 4.1 Scatter Diagram of GGP Construction Company Data (Sales in PhP 100,000s versus Cavite Area Payroll)

Simple Linear Regression
In any regression model, there is an implicit assumption (which can be tested) that a
relationship exists between the variables. There is also some random error that cannot
be predicted. The underlying simple linear regression model is
Y = β0 + β1X + ε (4-1)

where
Y = dependent variable (response variable)
X = independent variable (predictor or explanatory variable)
β0 = intercept
β1 = slope of the regression line
ε = random error
The true values for the intercept and slope are not known, and therefore they are
estimated using sample data. The regression equation based on sample data is given
as
Ŷ = b0 + b1X (4-2)

where
Ŷ = predicted value of Y
b0 = estimate of β0, based on sample results
b1 = estimate of β1, based on sample results
In the GGP Construction example, we are trying to predict the sales, so the dependent
variable (Y) would be sales. The variable we use to help predict sales is the Cavite
area payroll, so this is the independent variable (X). Although any number of lines can
be drawn through these points to show a relationship between X and Y in Figure 4.1,
the line that will be chosen is the one that in some way minimizes the errors. Error is
defined as
Error = (Actual value) − (Predicted value)

e = Y − Ŷ (4-3)
Since errors may be positive or negative, the average error could be zero even though
there are extremely large errors—both positive and negative. To eliminate the difficulty
of negative errors canceling positive errors, the errors can be squared. The best
regression line will be defined as the one with the minimum sum of the squared errors.
For this reason, regression analysis is sometimes called least squares regression.
Statisticians have developed formulas that we can use to find the equation of a straight line that would minimize the sum of the squared errors. The simple linear regression equation is

Ŷ = b0 + b1X
The following formulas can be used to compute the intercept and the slope:

X̄ = ΣX / n = average (mean) of X values
Ȳ = ΣY / n = average (mean) of Y values

b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² (4-4)

b0 = Ȳ − b1X̄ (4-5)
Table 4.2 Regression Calculations for GGP Construction

Sales, Y    Payroll, X    (X − X̄)²          (X − X̄)(Y − Ȳ)
6           3             (3 − 4)² = 1       (3 − 4)(6 − 7) = 1
8           4             (4 − 4)² = 0       (4 − 4)(8 − 7) = 0
9           6             (6 − 4)² = 4       (6 − 4)(9 − 7) = 4
5           4             (4 − 4)² = 0       (4 − 4)(5 − 7) = 0
4.5         2             (2 − 4)² = 4       (2 − 4)(4.5 − 7) = 5
9.5         5             (5 − 4)² = 1       (5 − 4)(9.5 − 7) = 2.5
ΣY = 42     ΣX = 24       Σ(X − X̄)² = 10    Σ(X − X̄)(Y − Ȳ) = 12.5
Computing the slope and the intercept of the regression equation for the GGP
Construction Company example, we have
X̄ = ΣX / n = 24 / 6 = 4

Ȳ = ΣY / n = 42 / 6 = 7

b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 12.5 / 10 = 1.25

b0 = Ȳ − b1X̄ = 7 − (1.25)(4) = 2

The estimated regression equation therefore is

Ŷ = 2 + 1.25X

or

sales = 2 + 1.25 (payroll)

If the payroll next year is PhP 600 million (X = 6), then the predicted value is

Ŷ = 2 + 1.25(6) = 9.5

or sales of PhP 950,000.
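The same least-squares calculation is easy to verify in code. Below is a minimal Python sketch of Equations 4-4 and 4-5, assuming the six (payroll, sales) pairs implied by the calculations above; variable names are illustrative only.

```python
import numpy as np

# Payroll (PhP 100 millions) and sales (PhP 100,000s), assumed from the
# calculations above; substitute the actual Table 4.1 values if they differ.
X = np.array([3, 4, 6, 4, 2, 5], dtype=float)
Y = np.array([6, 8, 9, 5, 4.5, 9.5])

x_bar, y_bar = X.mean(), Y.mean()
b1 = np.sum((X - x_bar) * (Y - y_bar)) / np.sum((X - x_bar) ** 2)  # slope, Eq. 4-4
b0 = y_bar - b1 * x_bar                                            # intercept, Eq. 4-5

print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")        # b1 = 1.25, b0 = 2.00
print("Prediction for X = 6:", b0 + b1 * 6)   # 9.5, i.e., sales of about PhP 950,000
```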
Measuring the Fit of the Regression Model

A regression equation can be developed for any variables X and Y, even random numbers. We certainly would not have any confidence in the ability of one random number to predict the value of another random number. How do we know that the model is actually helpful in predicting Y based on X? To answer this, we need some way to measure how well the model fits the data.
In the GGP Construction example, sales figures (Y) varied from a low of 4.5 to a high
of 9.5, and the mean was 7. If each sales value is compared with the mean, we see
how far they deviate from the mean, and we could compute a measure of the total
variability in sales. Because Y is sometimes higher and sometimes lower than the
mean, there may be both positive and negative deviations. Simply summing these
values would be misleading because the negatives would cancel out the positives,
making it appear that the numbers are closer to the mean than they actually are. To
prevent this problem, we will use the sum of squares total (SST) to measure the total
variability in Y:
SST = Σ(Y − Ȳ)² (4-6)
If we did not use X to predict Y, we would simply use the mean of Y as the prediction,
and the SST would measure the accuracy of our predictions. However, a regression
line may be used to predict the value of Y, and while there are still errors involved, the
sum of these squared errors will be less than the total sum of squares just computed.
The sum of squares error (SSE) is

SSE = Σe² = Σ(Y − Ŷ)² (4-7)

Table 4.3 provides the calculations for the GGP Construction example. The mean (Ȳ = 7) is compared with each sales value, and we get

SST = 22.5

The prediction (Ŷ) for each observation is computed and compared with the actual value. This results in

SSE = 6.875
The SSE is much lower than the SST. Using the regression line has reduced the
variability in the sum of squares by 22.5 - 6.875 = 15.625. This is called the sum of
squares regression (SSR) and indicates how much of the total variability in Y is
explained by the regression model. Mathematically, this can be calculated as

SSR = Σ(Ŷ − Ȳ)² (4-8)

For GGP Construction, SSR = 15.625.
There is a very important relationship among the sums of squares that we have computed:

(Sum of squares total) = (Sum of squares due to regression) + (Sum of squares error)

SST = SSR + SSE (4-9)
Figure 4.2 displays the data for GGP Construction. The regression line is shown, as is
a line representing the mean of the Y values. The errors used in computing the sums
of squares are shown on this graph. Notice how the sample points are closer to the
regression line than they are to the mean.
Table 4.3 Sum of Squares for GGP Construction

Y      X    (Y − Ȳ)²             Ŷ       (Y − Ŷ)²                 (Ŷ − Ȳ)²
6      3    (6 − 7)² = 1         5.75    (6 − 5.75)² = 0.0625     (5.75 − 7)² = 1.5625
8      4    (8 − 7)² = 1         7       (8 − 7)² = 1             (7 − 7)² = 0
9      6    (9 − 7)² = 4         9.5     (9 − 9.5)² = 0.25        (9.5 − 7)² = 6.25
5      4    (5 − 7)² = 4         7       (5 − 7)² = 4             (7 − 7)² = 0
4.5    2    (4.5 − 7)² = 6.25    4.5     (4.5 − 4.5)² = 0         (4.5 − 7)² = 6.25
9.5    5    (9.5 − 7)² = 6.25    8.25    (9.5 − 8.25)² = 1.5625   (8.25 − 7)² = 1.5625
            SST = 22.5                   SSE = 6.875              SSR = 15.625

Ȳ = 7;  Ŷ = 2 + 1.25X
Figure 4.2 Deviations from the Regression Line and from the Mean
Coefficient of Determination
The SSR is sometimes called the explained variability in Y, while the SSE is the
unexplained variability in Y. The proportion of the variability in Y that is explained by
the regression equation is called the coefficient of determination and is denoted by
r2. Thus,
r² = SSR / SST = 1 − SSE / SST (4-10)
Either the SSR or the SSE can be used to find r2. For GGP Construction, we have
r² = 15.625 / 22.5 = 0.6944
This means that about 69% of the variability in sales (Y) is explained by the regression
equation based on payroll (X).
If every point in the sample were on the regression line (meaning all errors are 0), then
100% of the variability in Y could be explained by the regression equation, so r2 = 1
and SSE = 0. The lowest possible value of r2 is 0, indicating that X explains 0% of the
variability in Y. Thus, r2 can range from a low of 0 to a high of 1. In developing
regression equations, a good model will have an r2 value close to 1.
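As a quick check of these quantities, the following Python sketch computes SST, SSE, SSR, and r² directly from their definitions, again assuming the six GGP data pairs used above.

```python
import numpy as np

X = np.array([3, 4, 6, 4, 2, 5], dtype=float)   # assumed payroll values (PhP 100 millions)
Y = np.array([6, 8, 9, 5, 4.5, 9.5])            # sales (PhP 100,000s)

Y_hat = 2 + 1.25 * X                 # predictions from the fitted line
SST = np.sum((Y - Y.mean()) ** 2)    # total variability in Y
SSE = np.sum((Y - Y_hat) ** 2)       # unexplained variability
SSR = SST - SSE                      # explained variability

r_squared = SSR / SST
print(SST, SSE, SSR, round(r_squared, 4))   # 22.5 6.875 15.625 0.6944
```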
Correlation Coefficient

Another measure related to the coefficient of determination is the coefficient of correlation, denoted by r. It expresses the strength and direction of the linear relationship between X and Y, and it is simply the square root of r², carrying the same sign as the slope:

r = ±√r² (4-11)

For GGP Construction, the slope is positive, so

r = √0.6944 = 0.833
Assumptions of the Regression Model

If we can make certain assumptions about the errors in a regression model, we can perform statistical tests to determine whether the model is useful. The following assumptions are made about the errors:

1. The errors are independent.
2. The errors are normally distributed.
3. The errors have a mean of zero.
4. The errors have a constant variance (regardless of the value of X).

It is possible to check the data to see whether these assumptions are met. Often a plot of the residuals will highlight any glaring violations of the assumptions. When the errors (residuals) are plotted against the independent variable, the pattern should appear random.
Figure 4.4 presents some typical error patterns, with Figure 4.4A displaying a pattern
that is expected when the assumptions are met and the model is appropriate. The
errors are random and no discernible pattern is present. Figure 4.4B demonstrates an
error pattern in which the errors increase as X increases, violating the constant
variance assumption. Figure 4.4C shows errors consistently increasing at first and then
consistently decreasing. A pattern such as this would indicate that the model is not
linear and some other form (perhaps quadratic) should be used. In general, patterns
in the plot of the errors indicate problems with the assumptions or the model
specification.
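A residual plot like those described in Figure 4.4 can be produced with a few lines of code. The sketch below (Python with matplotlib, using the assumed GGP data) plots the residuals e = Y − Ŷ against X; a random scatter around zero suggests the assumptions are reasonable.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([3, 4, 6, 4, 2, 5], dtype=float)   # assumed payroll values
Y = np.array([6, 8, 9, 5, 4.5, 9.5])            # sales
residuals = Y - (2 + 1.25 * X)                  # e = Y - Y_hat

plt.scatter(X, residuals)
plt.axhline(0, linestyle="--")       # residuals should scatter randomly around zero
plt.xlabel("Payroll, X")
plt.ylabel("Residual (Y - Y_hat)")
plt.title("Residual plot: look for fanning or curvature")
plt.show()
```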
Estimating the Variance

While the errors are assumed to have constant variance (σ²), this is usually not known. It can be estimated from the sample results. The estimate of σ² is the mean squared error (MSE), denoted by s². The MSE is the sum of squares due to error divided by its degrees of freedom:

s² = MSE = SSE / (n − k − 1) (4-12)

where
n = number of observations in the sample
k = number of independent variables

In the GGP Construction example, n = 6 and k = 1, so

s² = MSE = 6.875 / (6 − 1 − 1) = 6.875 / 4 = 1.7188

To estimate the standard deviation of the errors, we take the square root of the MSE:

s = √MSE (4-13)

This is called the standard error of the estimate or the standard deviation of the regression. In this example,

s = √1.7188 = 1.31
This is used in many of the statistical tests about the model. It is also used to find
interval estimates for both Y and regression coefficients.
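For the GGP Construction example, the MSE and the standard error of the estimate can be computed directly from Equations 4-12 and 4-13, as in this short Python sketch.

```python
# Continuing the GGP example: SSE = 6.875, n = 6 observations, k = 1 predictor.
SSE, n, k = 6.875, 6, 1
MSE = SSE / (n - k - 1)    # Eq. 4-12: estimate of the error variance
s = MSE ** 0.5             # Eq. 4-13: standard error of the estimate
print(round(MSE, 4), round(s, 2))   # 1.7188 1.31
```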
Testing the Model for Significance

Both the MSE and r² provide a measure of accuracy in a regression model. However, when the sample size is too small, it is possible to get good values for both of these even if there is no relationship between the variables in the regression model. To determine whether these values are meaningful, it is necessary to test the model for significance. Recall that the underlying linear model (Equation 4-1) is

Y = β0 + β1X + ε
If 𝛽1 = 0, then 𝑌 does not depend on 𝑋 in any way. The null hypothesis says there is
no linear relationship between the two variables (i.e. 𝛽1 = 0). The alternate hypothesis
is that there is a linear relationship (i.e. 𝛽1 ≠ 0). If the null hypothesis can be rejected,
then we have proven that a linear relationship does exist, so X is helpful in predicting
Y. The F distribution is used for testing this hypothesis.
The F statistic used in the hypothesis test is based on the MSE and the mean squared
regression (MSR). The MSR is calculated as
MSR = SSR / k (4-14)

where
k = number of independent variables in the model
The F statistic is
F = MSR / MSE (4-15)

Based on the assumptions regarding the errors in a regression model, this calculated F statistic is described by the F distribution with

Degrees of freedom for the numerator = df1 = k
Degrees of freedom for the denominator = df2 = n − k − 1

where
k = number of independent variables
n = number of observations in the sample
If there is very little error, the denominator (MSE) of the F statistic is very small relative
to the numerator (MSR), and the resulting F statistic will be large. This is an indication
that the model is useful. A significance level related to the value of the F statistic is
then found. Whenever the F value is large, the observed significance level (p-value)
will be low, indicating that it is extremely unlikely that this could have occurred by
chance. When the F value is large (with a resulting small significance level), we can
reject the null hypothesis that there is no linear relationship. This means that there is
a linear relationship and the values of MSE and r2 are meaningful.
To illustrate the process of testing the hypothesis about a significant relationship, consider the GGP Construction example.

Step 1: State the hypotheses.
H0: β1 = 0 (there is no linear relationship between X and Y)
H1: β1 ≠ 0 (there is a linear relationship between X and Y)

Step 2: Select the level of significance. We will use α = 0.05.

Step 3: Calculate the value of the test statistic. The MSE was already calculated to be 1.7188. The MSR is then calculated so that F can be found:

MSR = SSR / k = 15.6250 / 1 = 15.6250

F = MSR / MSE = 15.6250 / 1.7188 = 9.09
Step 4: Make the decision. Reject the null hypothesis if the test statistic is greater than the F value from the F table, using

df1 = k = 1
df2 = n − k − 1 = 6 − 1 − 1 = 4

The value of F associated with a 5% level of significance and with degrees of freedom 1 and 4 is

F0.05,1,4 = 7.71

Since

Fcalculated = 9.09 > 7.71

we reject the null hypothesis. There is a statistically significant linear relationship between the Cavite area payroll (X) and GGP Construction's sales (Y), so the values of r² and MSE for this model are meaningful.
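The same test can be reproduced in code. The sketch below uses scipy.stats to find both the critical F value and the observed significance level (p-value) for the GGP Construction model; the numbers in the comments are approximate.

```python
from scipy import stats

SSR, SSE, n, k = 15.625, 6.875, 6, 1
MSR = SSR / k                  # Eq. 4-14
MSE = SSE / (n - k - 1)        # Eq. 4-12
F = MSR / MSE                  # Eq. 4-15

critical = stats.f.ppf(0.95, dfn=k, dfd=n - k - 1)   # F(0.05, 1, 4) = 7.71
p_value = stats.f.sf(F, dfn=k, dfd=n - k - 1)        # observed significance level

print(round(F, 2), round(critical, 2), round(p_value, 3))   # 9.09 7.71 ~0.039
```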
Multiple Regression Analysis

The multiple regression model is a practical extension of the model we just observed. It allows us to build a model with several independent variables. The underlying model is

Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε (4-16)

where
Y = dependent variable (response variable)
Xi = ith independent variable
β0 = intercept
βi = coefficient of the ith independent variable
k = number of independent variables
ε = random error
To estimate the values of these coefficients, a sample is taken and the following
equation is developed:
Ŷ = b0 + b1X1 + b2X2 + . . . + bkXk (4-17)

where
Ŷ = predicted value of Y
b0 = sample intercept (an estimate of β0)
bi = sample coefficient of the ith variable (an estimate of βi)
Consider the case of Blithe Realty, a real estate company. Blithe Alcano, the owner
and broker for this company, wants to develop a model to determine a suggested listing
price for houses based on the size of the house and the age of the house. She selects
a sample of houses that have sold recently in a particular area, and she records the
selling price, the square footage of the house, the age of the house, and also the
condition (good, excellent, or mint) of each house, as shown in Table 4.4. Initially Blithe
plans to use only the square footage and age to develop a model, although she wants
to save the information on the condition of the house to use later. She wants to find the
coefficients for the following multiple regression model:
𝑌̂ = 𝑏0 + 𝑏1 𝑋1 + 𝑏2 𝑋2
where
Ŷ = predicted value of the dependent variable (selling price)
b0 = Y intercept
X1 and X2 = values of the two independent variables (square footage and age), respectively
b1 and b2 = slopes for X1 and X2, respectively
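Since Table 4.4 is not reproduced here, the following Python sketch fits a model of this form to hypothetical (selling price, square footage, age) data using ordinary least squares; the data values are made up purely for illustration and should be replaced with Blithe's actual sample.

```python
import numpy as np

# Hypothetical sample: selling price (PhP), square footage, and age.
# Replace with the actual Table 4.4 values.
price = np.array([3_500_000, 4_200_000, 2_900_000, 5_100_000, 3_800_000], dtype=float)
sqft  = np.array([1_800, 2_200, 1_500, 2_600, 2_000], dtype=float)
age   = np.array([20, 12, 35, 8, 15], dtype=float)

# Design matrix with a column of 1s for the intercept b0.
A = np.column_stack([np.ones_like(sqft), sqft, age])
coeffs, *_ = np.linalg.lstsq(A, price, rcond=None)
b0, b1, b2 = coeffs

print(f"price_hat = {b0:.0f} + {b1:.1f}*sqft + {b2:.1f}*age")
print("Suggested listing price for a 2,000 sq ft, 10-year-old house:",
      round(b0 + b1 * 2000 + b2 * 10))
```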
Binary or Dummy Variables

All of the variables we have used in regression examples have been quantitative
variables such as sales figures, payroll numbers, square footage, and age. These have
all been easily measurable and have had numbers associated with them. There are
many times when we believe a qualitative variable rather than a quantitative variable
would be helpful in predicting the dependent variable Y. For example, regression may
be used to find a relationship between annual income and certain characteristics of the
employees. Years of experience at a particular job would be a quantitative variable.
However, information regarding whether or not a person has a college degree might
also be important. This would not be a measurable value or quantity, so a special
variable called a dummy variable (or a binary variable or an indicator variable) would
be used. A dummy variable is assigned a value of 1 if a particular condition is met (e.g.,
a person has a college degree) and a value of 0 otherwise.
In the Blithe Realty example, Blithe believes that a better model can be
developed if the condition of the property is included. To incorporate the condition of
the house into the model, Blithe looks at the information available (see Table 4.5) and
sees that the three categories are good condition, excellent condition, and mint
condition. Since these are not quantitative variables, she must use dummy variables.
These are defined as

X3 = 1 if the house is in excellent condition
   = 0 otherwise

X4 = 1 if the house is in mint condition
   = 0 otherwise
Notice there is no separate variable for “good” condition. If X3 and X4 are both 0, then
the house cannot be in excellent or mint condition, so it must be in good condition.
When using dummy variables, the number of variables must be 1 less than the number
of categories. In this problem, there were three categories (good, excellent, and mint
condition), so we must have two dummy variables. If we had mistakenly used too many
variables and the number of dummy variables equaled the number of categories, then
the mathematical computations could not be performed or would not give reliable
values.
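One convenient way to create such dummy variables in practice is shown in the Python sketch below; the sample data frame is hypothetical, and pandas drops the first category ("good") so that exactly k − 1 dummy columns remain.

```python
import pandas as pd

# Hypothetical listings; 'condition' has three categories, so two dummies are created.
houses = pd.DataFrame({
    "sqft":      [1800, 2200, 1500, 2600],
    "age":       [20, 12, 35, 8],
    "condition": ["good", "excellent", "mint", "good"],
})

# Make "good" the first category so it becomes the baseline (both dummies = 0),
# matching the X3 and X4 definitions above.
houses["condition"] = pd.Categorical(houses["condition"],
                                     categories=["good", "excellent", "mint"])
encoded = pd.get_dummies(houses, columns=["condition"], drop_first=True)
print(encoded)   # columns: sqft, age, condition_excellent, condition_mint
```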
These dummy variables will be used with the two previous variables (X1 = square footage and X2 = age) to try to predict the selling prices of houses for Blithe. The significance level for the F test is 0.00017, so this model is statistically significant. The coefficient of determination (r²) is 0.898, so this is a much better model than the previous one. The coefficients of the dummy variables in the resulting regression equation indicate that a house in excellent condition (X3 = 1, X4 = 0) would sell for about PhP 33,162 more than a house in good condition (X3 = 0, X4 = 0), and a house in mint condition (X3 = 0, X4 = 1) would sell for about PhP 47,369 more than a house in good condition.
Model Building
As more variables are added to a regression model, r² will usually increase, and it
cannot decrease. It is tempting to keep adding variables to a model to try to increase
r2. However, if too many independent variables are included in the model, problems
can arise. For this reason, the adjusted r2 value is often used (rather than r2) to
determine if an additional independent variable is beneficial. The adjusted r2 takes into
account the number of independent variables in the model, and it is possible for the
adjusted r² to decrease. The formula for r² is

r² = SSR / SST = 1 − SSE / SST

The adjusted r² is

Adjusted r² = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)] (4-18)
Notice that as the number of variables (k) increases, n - k - 1 will decrease. This causes
SSE/ (n - k – 1) to increase, and consequently the adjusted r2 will decrease unless the
extra variable in the model causes a significant decrease in the SSE. Thus, the
reduction in error (and SSE) must be sufficient to offset the change in k.
As a general rule of thumb, if the adjusted r2 increases when a new variable is added
to the model, the variable should probably remain in the model. If the adjusted r2
decreases when a new variable is added, the variable should not remain in the model.
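The adjusted r² of Equation 4-18 is straightforward to compute, as the short Python sketch below illustrates using the SSE and SST values from the simple GGP model.

```python
def adjusted_r2(SSE, SST, n, k):
    """Eq. 4-18: penalizes r-squared for the number of predictors k."""
    return 1 - (SSE / (n - k - 1)) / (SST / (n - 1))

# Example: the simple GGP model (n = 6, k = 1, SSE = 6.875, SST = 22.5).
print(round(adjusted_r2(6.875, 22.5, n=6, k=1), 3))   # 0.618
```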
Stepwise Regression
While the process of model building may be tedious, there are many statistical software
packages that include stepwise regression procedures to do this. Stepwise
regression is an automated process to systematically add independent variables to
or delete them from a regression model. A forward stepwise procedure puts the most
significant variable in the model first and then adds the next variable that will improve
the model the most given that the first variable is already in the model. Variables
continue to be added in this fashion until all the variables are in the model or until any
remaining variables do not significantly improve the model. A backward stepwise
procedure begins with all independent variables in the model, and one by one the least
helpful variables are deleted. This continues until only significant variables remain.
Many variations of these stepwise models exist.
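As an illustration of the idea (not of any particular software package's procedure), the following Python sketch implements a simple forward selection loop that adds variables only while the adjusted r² keeps improving; the candidate variables are synthetic.

```python
import numpy as np

def adj_r2(cols, y):
    """Least-squares fit on the given predictor columns; return adjusted r-squared."""
    n, k = len(y), len(cols)
    A = np.column_stack([np.ones(n)] + cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    SSE = np.sum((y - A @ coef) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    return 1 - (SSE / (n - k - 1)) / (SST / (n - 1))

def forward_stepwise(candidates, y):
    """Greedy forward selection: add the variable that raises adjusted r-squared most."""
    chosen, best = [], -np.inf
    remaining = set(candidates)
    while remaining:
        scores = {name: adj_r2([candidates[c] for c in chosen] + [candidates[name]], y)
                  for name in remaining}
        name, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:          # no remaining variable improves the model
            break
        chosen.append(name)
        remaining.remove(name)
        best = score
    return chosen, best

# Hypothetical predictors for illustration.
rng = np.random.default_rng(0)
sqft = rng.uniform(1200, 3000, 30)
age = rng.uniform(1, 40, 30)
noise = rng.normal(0, 1, 30)                          # variable unrelated to y
y = 1500 * sqft - 20000 * age + rng.normal(0, 50000, 30)

print(forward_stepwise({"sqft": sqft, "age": age, "noise": noise}, y))
# typically selects sqft, then age, and stops before adding the noise variable
```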
Multicollinearity
In the Blithe Realty example, we saw an r² of about 0.90 and an adjusted r² of 0.85.
While other variables such as the size of the lot, the number of bedrooms, and the
number of bathrooms might be related to the selling price of a house, we may not want
to include these in the model. It is likely that these variables would be correlated with
the square footage of the house (e.g., more bedrooms usually means a larger house),
which is already included in the model. Thus, the information provided by these
additional variables might be duplication of information already in the model.
When an independent variable is correlated with just one other independent variable, the variables are said to be collinear. If an independent variable is correlated with a
combination of other independent variables, the condition of multicollinearity exists.
This can create problems in interpreting the coefficients of the variables, as several
variables are providing duplicate information. For example, if two independent
variables were monthly salary expenses for a company and annual salary expenses
for a company, the information provided in one is also provided in the other. Several
sets of regression coefficients for these two variables would yield exactly the same
results. Thus, individual interpretation of these variables would be questionable,
although the model itself is still good for prediction purposes. When multicollinearity
exists, the overall F test is still valid, but the hypothesis tests related to the individual
coefficients are not. A variable may appear to be significant when it is insignificant, or
a variable may appear to be insignificant when it is significant.
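A quick way to screen for collinearity is to examine the correlations among the independent variables, as in this Python sketch with hypothetical salary and square-footage data; a correlation near ±1 between two predictors signals duplicated information.

```python
import numpy as np

# Hypothetical predictors: monthly and annual salary expense are perfectly related,
# so they carry duplicate information.
rng = np.random.default_rng(1)
monthly = rng.uniform(50_000, 200_000, 20)
annual = 12 * monthly
sqft = rng.uniform(1000, 3000, 20)

X = np.column_stack([monthly, annual, sqft])
print(np.round(np.corrcoef(X, rowvar=False), 3))   # a correlation of 1.0 flags collinearity
```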
Nonlinear Regression
The regression models we have seen are linear models. However, at times there exist
nonlinear relationships between variables. Some simple variable transformations can
be used to create an apparently linear model from a nonlinear relationship.
On every new automobile sold in the Philippines, the fuel efficiency (as measured by miles
per gallon [MPG] of gasoline) of the automobile is prominently displayed on the window
sticker. The MPG is related to several factors, one of which is the weight of the
automobile. Engineers at Toyota Motors, in an attempt to improve fuel efficiency, have
been asked to study the impact of weight on MPG. They have decided that a
regression model should be used to do this.
A sample of 12 new automobiles was selected, and the weight and MPG rating were
recorded. Table 4.5 provides these data. A linear regression line is drawn through the
points. Excel was used to develop a simple linear regression equation to relate the
MPG (Y) to the weight in thousands of pounds (X1) in the form
Ŷ = b0 + b1X1

The resulting equation is

Ŷ = 47.6 − 8.2X1

or

MPG = 47.6 − 8.2 (weight in 1,000 lb.)
The model is useful since the significance level for the F test is small and r2 = 0.7446.
Perhaps a nonlinear relationship exists, and maybe the model should be modified to account for this. A quadratic model of the form

Ŷ = b0 + b1X1 + b2X2

where

X2 = (Weight)²

may capture this curvature. We can create another column for X2 in Excel and run the regression tool again to estimate this new equation.
This model is good for prediction purposes. However, we should not try to interpret the
coefficients of the variables due to the correlation between X1 (weight) and X2 (weight
squared). Normally, we would interpret the coefficient for X1 as the change in Y that
results from a 1-unit change in X1, while holding all other variables constant. Obviously,
holding one variable constant while changing the other is impossible in this example,
since X2 = X1². If X1 changes, then X2 must change also. This is an example of a
problem that exists when multicollinearity is present.
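To make the transformation concrete, the Python sketch below fits both the linear and the quadratic weight models to hypothetical weight/MPG data (not the Table 4.5 values) and compares their r² values.

```python
import numpy as np

# Hypothetical weight (1,000 lb) and MPG values for illustration only.
weight = np.array([2.0, 2.4, 2.8, 3.2, 3.6, 4.0, 4.4, 4.8])
mpg    = np.array([38, 34, 30, 27, 25, 23, 22, 21], dtype=float)

# Linear model: MPG = b0 + b1 * weight
A1 = np.column_stack([np.ones_like(weight), weight])
b_lin, *_ = np.linalg.lstsq(A1, mpg, rcond=None)

# Quadratic model: add X2 = weight**2 as an extra column, exactly like adding
# a new column in a spreadsheet before rerunning the regression tool.
A2 = np.column_stack([np.ones_like(weight), weight, weight ** 2])
b_quad, *_ = np.linalg.lstsq(A2, mpg, rcond=None)

def r2(A, b, y):
    SSE = np.sum((y - A @ b) ** 2)
    SST = np.sum((y - y.mean()) ** 2)
    return 1 - SSE / SST

print("linear r2:", round(r2(A1, b_lin, mpg), 3))
print("quadratic r2:", round(r2(A2, b_quad, mpg), 3))   # higher, reflecting the curvature
```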
Cautions and Pitfalls in Regression Analysis

This chapter has provided a brief introduction to regression analysis, one of the most widely used quantitative techniques in business. However, some common errors are made with regression models, so caution should be observed when using them:
If the assumptions are not met, the statistical tests may not be valid. Any
interval estimates are also invalid, although the model can still be used for
prediction purposes.
Correlation does not necessarily mean causation. Two variables (such as the
price of automobiles and your annual salary) may be highly correlated to one
another, but one is not causing the other to change. They may both be
changing due to other factors such as the economy in general or the inflation
rate.
If multicollinearity is present in a multiple regression model, the model is still
good for prediction, but interpretation of individual coefficients is questionable.
The individual tests on the regression coefficients are not valid.
Using a regression equation beyond the range of X is very questionable. A
linear relationship may exist within the range of values of X in the sample. What
happens beyond this range is unknown; the linear relationship may become
nonlinear at some point. For example, there is usually a linear relationship
between advertising and sales within a limited range. As more money is spent
on advertising, sales tend to increase even if everything else is held constant.
However, at some point, increasing advertising expenditures will have less
impact on sales unless the company does other things to help, such as opening
new markets or expanding the product offerings. If advertising is increased and
nothing else changes, the sales will probably level off at some point.
Related to the limitation regarding the range of X is the interpretation of the
intercept (𝑏0 ). Since the lowest value for X in a sample is often much greater
than 0, the intercept is a point on the regression line beyond the range of X.
Therefore, we should not be concerned if the t-test for this coefficient is not
significant, as we should not be using the regression equation to predict a value
of Y when X = 0. This intercept is merely used in defining the line that fits the
sample points the best.
Using the F test and concluding a linear regression model is helpful in
predicting Y does not mean that this is the best relationship. While this model
may explain much of the variability in Y, it is possible that a nonlinear
relationship might explain even more. Similarly, if it is concluded that no linear
relationship exists, another type of relationship could exist.
A statistically significant relationship does not mean it has any practical value.
With large enough samples, it is possible to have a statistically significant
relationship, but r2 might be 0.01. This would normally be of little use to a
manager. Similarly, a high r2 could be found due to random chance if the
sample is small. The F test must also show significance to place any value in
r2.