
Chapter 4

Regression Models

Learning Objectives:

After the completion of the chapter, the students will be able to:
 identify variables, visualize them in a scatter diagram and use them in
a regression model;
 develop simple linear regression equations from sample data and
interpret the slope and intercept;
 calculate the coefficient of determination and coefficient of correlation
and interpret their meanings;
 list the assumptions used in regression and use residual plots to identify
problems;
 interpret the F test in a linear regression model;
 develop a multiple regression model and use it for prediction purposes;
 use dummy variables to model categorical data;
 determine which variables should be included in a multiple regression
model; and
 identify the commonly made mistakes in using regression analysis.

Scatter Diagrams

Regression analysis is a very valuable tool for today’s manager. Regression has been
used to model things such as the relationship between level of education and income,
the price of a house and the square footage, and the sales volume for a company
relative to the dollars spent on advertising. When businesses are trying to decide which
location is best for a new store or branch office, regression models are often used.
Cost estimation models are often regression models. The applicability of regression
analysis is virtually limitless.

There are generally two purposes for regression analysis. The first is to understand
the relationship between variables such as advertising expenditures and sales. The
second purpose is to predict the value of one variable based on the value of the other.

To investigate the relationship between variables, it is helpful to look at a graph of the
data. Such a graph is often called a scatter diagram or a scatter plot. Normally, the
independent variable is plotted on the horizontal axis and the dependent variable is
plotted on the vertical axis.

GGP Construction Company specializes in renovating old homes. Over time, the
company has found that its revenue from renovation work depends on the Cavite area
payroll. The figures for the company's revenues and the amount of money earned by
wage earners in Cavite for the past 6 years are presented in Table 4.1. Economists
have predicted the local area payroll to be PhP 600 million next year, and the
construction company wants to plan accordingly.

Table 4.1 GGP Construction Company Sales and Local Payroll

GGP Construction Company's Sales    Local Payroll
(PhP 100,000s)                      (PhP 100,000,000s)

6                                   3
8                                   4
9                                   6
5                                   4
4.5                                 2
9.5                                 5

Figure 4.1 provides a scatter diagram for the GGP Construction data given in Table
4.1. This graph indicates that higher values for the local payroll seem to result in higher
sales for the company. There is not a perfect relationship because not all the points lie
in a straight line, but there is a relationship. A line has been drawn through the data to
help show the relationship that exists between the payroll and sales. The points do not
all lie on the line, so there would be some error involved if we tried to predict sales
based on payroll using this or any other line.
Figure 4.1 Scatter Diagram for GGP Construction (Sales in PhP 100,000s against Payroll in PhP 100,000,000s)
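A scatter diagram like Figure 4.1 is straightforward to produce in code. The following is a minimal sketch, assuming Python with matplotlib (the chapter itself does not prescribe any software), that plots the Table 4.1 data:

```python
import matplotlib.pyplot as plt

# Table 4.1 data: payroll is the independent variable (X),
# sales is the dependent variable (Y)
payroll = [3, 4, 6, 4, 2, 5]     # PhP 100,000,000s
sales = [6, 8, 9, 5, 4.5, 9.5]   # PhP 100,000s

plt.scatter(payroll, sales)
plt.xlabel("Payroll (PhP 100,000,000s)")
plt.ylabel("Sales (PhP 100,000s)")
plt.title("Scatter Diagram for GGP Construction")
plt.show()
```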


Simple Linear Regression

In any regression model, there is an implicit assumption (which can be tested) that a
relationship exists between the variables. There is also some random error that cannot
be predicted. The underlying simple linear regression model is

𝒀 = 𝜷𝟎 + 𝜷𝟏 𝑿 + 𝝐 (4-1)

𝑌 = dependent variable (response variable)

𝑋 = independent variable (predictor variable or explanatory variable)

𝛽0 = intercept (value of Y when X = 0)

𝛽1 = slope of regression line

𝜖 = random error

The true values for the intercept and slope are not known, and therefore they are
estimated using sample data. The regression equation based on sample data is given
as

Ŷ = 𝒃𝟎 + 𝒃𝟏 𝑿 (4-2)

Ŷ = predicted value of Y
𝑏0 = estimate of 𝛽0 , based on sample results
𝑏1 = estimate of 𝛽1 , based on sample results

In the GGP Construction example, we are trying to predict the sales, so the dependent
variable (Y) would be sales. The variable we use to help predict sales is the Cavite
area payroll, so this is the independent variable (X). Although any number of lines can
be drawn through these points to show a relationship between X and Y in Figure 4.1,
the line that will be chosen is the one that in some way minimizes the errors. Error is
defined as
Error = (Actual value) − (Predicted value)

e = Y − Ŷ    (4-3)

Since errors may be positive or negative, the average error could be zero even though
there are extremely large errors—both positive and negative. To eliminate the difficulty
of negative errors canceling positive errors, the errors can be squared. The best
regression line will be defined as the one with the minimum sum of the squared errors.
For this reason, regression analysis is sometimes called least squares regression.
Statisticians have developed formulas that we can use to find the equation of a straight
line that would minimize the sum of the squared errors. The simple linear regression
equation is

Ŷ = 𝒃𝟎 + 𝒃𝟏 𝑿

The following formulas can be used to compute the slope and the intercept:

X̄ = ΣX / n = average (mean) of X values

Ȳ = ΣY / n = average (mean) of Y values

b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)²    (4-4)

b0 = Ȳ − b1X̄    (4-5)

The preliminary calculations are shown in Table 4.2.

Table 4.2 Regression Calculations for GGP Construction

Y      X      (X − X̄)²        (X − X̄)(Y − Ȳ)

6      3      (3 − 4)² = 1    (3 − 4)(6 − 7) = 1

8      4      (4 − 4)² = 0    (4 − 4)(8 − 7) = 0

9      6      (6 − 4)² = 4    (6 − 4)(9 − 7) = 4

5      4      (4 − 4)² = 0    (4 − 4)(5 − 7) = 0

4.5    2      (2 − 4)² = 4    (2 − 4)(4.5 − 7) = 5

9.5    5      (5 − 4)² = 1    (5 − 4)(9.5 − 7) = 2.5

ΣY = 42    ΣX = 24    Σ(X − X̄)² = 10    Σ(X − X̄)(Y − Ȳ) = 12.5

Ȳ = 42/6 = 7    X̄ = 24/6 = 4

Computing the slope and the intercept of the regression equation for the GGP
Construction Company example, we have

X̄ = ΣX / n = 24/6 = 4

Ȳ = ΣY / n = 42/6 = 7

b1 = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² = 12.5/10 = 1.25

b0 = Ȳ − b1X̄ = 7 − (1.25)(4) = 2

The estimated regression equation therefore is

Ŷ = 2 + 1.25X

or

Sales = 2 + 1.25 (Payroll)

If the payroll next year is PhP 600 million (X = 6), then the predicted value would be

Ŷ = 2 + 1.25(6) = 9.5 or PhP 950,000.

One of the purposes of regression is to understand the relationship among variables.
This model tells us that each time the payroll increases by PhP 100 million
(represented by X), we would expect the sales to increase by PhP 125,000, since
b1 = 1.25 and each unit of Y is PhP 100,000. This model helps GGP Construction see
how the local economy and company sales are related.
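The slope and intercept calculations above translate directly into code. The following is a minimal Python sketch (an illustration, not part of the chapter) that reproduces Equations 4-4 and 4-5 and the PhP 600 million prediction:

```python
X = [3, 4, 6, 4, 2, 5]        # payroll (PhP 100,000,000s)
Y = [6, 8, 9, 5, 4.5, 9.5]    # sales (PhP 100,000s)
n = len(X)

x_bar = sum(X) / n            # 4
y_bar = sum(Y) / n            # 7

# Slope (Equation 4-4) and intercept (Equation 4-5)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
      / sum((x - x_bar) ** 2 for x in X))    # 12.5 / 10 = 1.25
b0 = y_bar - b1 * x_bar                      # 7 - (1.25)(4) = 2

y_hat = b0 + b1 * 6           # payroll of PhP 600 million means X = 6
print(b0, b1, y_hat)          # 2.0 1.25 9.5, i.e., sales of PhP 950,000
```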

Measuring the Fit of the Regression Model

A regression equation can be developed for any variables X and Y, even random
numbers. We certainly would not have any confidence in the ability of one random
number to predict the value of another random number.

In the GGP Construction example, sales figures (Y) varied from a low of 4.5 to a high
of 9.5, and the mean was 7. If each sales value is compared with the mean, we see
how far they deviate from the mean, and we could compute a measure of the total
variability in sales. Because Y is sometimes higher and sometimes lower than the
mean, there may be both positive and negative deviations. Simply summing these
values would be misleading because the negatives would cancel out the positives,
making it appear that the numbers are closer to the mean than they actually are. To
prevent this problem, we will use the sum of squares total (SST) to measure the total
variability in Y:

𝑆𝑆𝑇 = Σ (𝑌 − 𝑌̅ )2 (4-6)

If we did not use X to predict Y, we would simply use the mean of Y as the prediction,
and the SST would measure the accuracy of our predictions. However, a regression
line may be used to predict the value of Y, and while there are still errors involved, the
sum of these squared errors will be less than the total sum of squares just computed.
The sum of squares error (SSE) is

SSE = Σe² = Σ(Y − Ŷ)²    (4-7)

Table 4.3 provides the calculations for the GGP Construction example. The mean

(𝑌̅ = 7) is compared to each value, and we get

SST = 22.5

The prediction (𝑌̂) for each observation is computed and compared to the actual value.
This results in

SSE = 6.875

The SSE is much lower than the SST. Using the regression line has reduced the
variability in the sum of squares by 22.5 - 6.875 = 15.625. This is called the sum of
squares regression (SSR) and indicates how much of the total variability in Y is
explained by the regression model. Mathematically, this can be calculated as

𝑆𝑆𝑅 = Σ (𝑌̂ − 𝑌̅)2 (4-8)

Table 4.3 indicates

𝑆𝑆𝑅 = 15.625

There is a very important relationship among the sums of squares that we have
computed:

(Sum of squares total) = (Sum of squares due to regression) + (Sum of squares error)

SST = SSR + SSE (4-9)

Figure 4.2 displays the data for GGP Construction. The regression line is shown, as is
a line representing the mean of the Y values. The errors used in computing the sums
of squares are shown on this graph. Notice how the sample points are closer to the
regression line than they are to the mean.
Table 4.3 Sum of Squares for GGP Construction

Y      X      (Y − Ȳ)²            Ŷ                    (Y − Ŷ)²    (Ŷ − Ȳ)²

6      3      (6 − 7)² = 1        2 + 1.25(3) = 5.75   0.0625      1.5625

8      4      (8 − 7)² = 1        2 + 1.25(4) = 7.00   1           0

9      6      (9 − 7)² = 4        2 + 1.25(6) = 9.50   0.25        6.25

5      4      (5 − 7)² = 4        2 + 1.25(4) = 7.00   4           0

4.5    2      (4.5 − 7)² = 6.25   2 + 1.25(2) = 4.50   0           6.25

9.5    5      (9.5 − 7)² = 6.25   2 + 1.25(5) = 8.25   1.5625      1.5625

Ȳ = 7    Σ(Y − Ȳ)² = 22.5    Σ(Y − Ŷ)² = 6.875    Σ(Ŷ − Ȳ)² = 15.625

SST = 22.5    SSE = 6.875    SSR = 15.625
Figure 4.2 Deviations from the Regression Line and from the Mean (Sales in PhP 100,000s against Payroll in PhP 100,000,000s)

Coefficient of Determination

The SSR is sometimes called the explained variability in Y, while the SSE is the
unexplained variability in Y. The proportion of the variability in Y that is explained by
the regression equation is called the coefficient of determination and is denoted by
r2. Thus,

r² = SSR/SST = 1 − SSE/SST    (4-10)

Either the SSR or the SSE can be used to find r2. For GGP Construction, we have

r² = 15.625/22.5 = 0.6944
This means that about 69% of the variability in sales (Y) is explained by the regression
equation based on payroll (X).

If every point in the sample were on the regression line (meaning all errors are 0), then
100% of the variability in Y could be explained by the regression equation, so r2 = 1
and SSE = 0. The lowest possible value of r2 is 0, indicating that X explains 0% of the
variability in Y. Thus, r2 can range from a low of 0 to a high of 1. In developing
regression equations, a good model will have an r2 value close to 1.
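The sums of squares and r2 can be verified with a few lines of code. This sketch reuses the fitted line Ŷ = 2 + 1.25X from above (again an illustration in Python, which the chapter does not require):

```python
X = [3, 4, 6, 4, 2, 5]
Y = [6, 8, 9, 5, 4.5, 9.5]
y_bar = sum(Y) / len(Y)                             # 7

preds = [2 + 1.25 * x for x in X]                   # Y-hat for each observation
SST = sum((y - y_bar) ** 2 for y in Y)              # 22.5   (Equation 4-6)
SSE = sum((y - p) ** 2 for y, p in zip(Y, preds))   # 6.875  (Equation 4-7)
SSR = sum((p - y_bar) ** 2 for p in preds)          # 15.625 (Equation 4-8)

assert abs(SST - (SSR + SSE)) < 1e-9                # Equation 4-9: SST = SSR + SSE
r2 = SSR / SST                                      # 0.6944 (Equation 4-10)
r = r2 ** 0.5                                       # 0.833; positive since b1 = +1.25
```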

Correlation Coefficient

Another measure related to the coefficient of determination is the coefficient of
correlation. This measure also expresses the degree or strength of the linear
relationship. It is usually denoted r and can be any number between −1 and +1,
inclusive. Figure 4.3 illustrates possible scatter diagrams for different values of r. The
value of r is the square root of r2. It is negative if the slope is negative, and it is positive
if the slope is positive. Thus,

𝑟 = ±√𝑟 2 (4-11)

For the GGP Construction example, with r² = 0.6944,

𝑟 = √0.6944 = 0.833

We know it is positive because the slope is +1.25.

Figure 4.3 Four Values of the Correlation Coefficient


Assumptions of the Regression Model

If we can make certain assumptions about the errors in a regression model, we can
perform statistical tests to determine if the model is useful. The following assumptions
are made about the errors:

1. The errors are independent.
2. The errors are normally distributed.
3. The errors have a mean of zero.
4. The errors have a constant variance (regardless of the value of X).

It is possible to check the data to see if these assumptions are met. Often a plot of the
residuals will highlight any glaring violations of the assumptions. When the errors
(residuals) are plotted against the independent variable, the pattern should appear
random.

Figure 4.4 presents some typical error patterns, with Figure 4.4A displaying a pattern
that is expected when the assumptions are met and the model is appropriate. The
errors are random and no discernible pattern is present. Figure 4.4B demonstrates an
error pattern in which the errors increase as X increases, violating the constant
variance assumption. Figure 4.4C shows errors consistently increasing at first and then
consistently decreasing. A pattern such as this would indicate that the model is not
linear and some other form (perhaps quadratic) should be used. In general, patterns
in the plot of the errors indicate problems with the assumptions or the model
specification.

Figure 4.4A Pattern of Errors Indicating Randomness

Figure 4.4B Nonconstant Error Variance

Figure 4.4C Pattern of Errors Indicating Relationship Is Not Linear

Estimating the Variance

While the errors are assumed to have constant variance (𝜎 2 ), this is usually not known.
It can be estimated from the sample results. The estimate of 𝜎 2 is the mean squared
error (MSE) and is denoted by 𝑠2 . The MSE is the sum of squares due to error divided
by the degrees of freedom:

s² = MSE = SSE / (n − k − 1)    (4-12)

𝑛 = 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑜𝑏𝑠𝑒𝑟𝑣𝑎𝑡𝑖𝑜𝑛𝑠 𝑖𝑛 𝑡ℎ𝑒 𝑠𝑎𝑚𝑝𝑙𝑒

𝑘 = 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡 𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠

In the GGP Construction example, 𝑛 = 6 and 𝑘 = 1, so


s² = MSE = SSE / (n − k − 1) = 6.8750 / (6 − 1 − 1) = 6.8750 / 4 = 1.7188
From this, we can estimate the standard deviation as

𝑠 = √𝑀𝑆𝐸 (4-13)
This is called the standard error of the estimate or the standard deviation of the
regression. In this example,

𝑠 = √𝑀𝑆𝐸 = √1.7188 = 1.31

This is used in many of the statistical tests about the model. It is also used to find
interval estimates for both Y and regression coefficients.
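As a quick check of Equations 4-12 and 4-13, the same numbers can be computed directly (a short sketch using the GGP values already derived):

```python
SSE = 6.875               # from Table 4.3
n, k = 6, 1               # 6 observations, 1 independent variable

mse = SSE / (n - k - 1)   # s^2 = 6.875 / 4 = 1.7188 (Equation 4-12)
s = mse ** 0.5            # standard error of the estimate, about 1.31 (Equation 4-13)
print(mse, s)
```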

Testing the Model for Significance

Both the MSE and r2 provide a measure of accuracy in a regression model. However,
when the sample size is too small, it is possible to get good values for both of these
even if there is no relationship between the variables in the regression model. To
determine whether these values are meaningful, it is necessary to test the model for
significance.

To see if there is a linear relationship between X and Y, a statistical hypothesis test is
performed. The underlying linear model was given in Equation 4-1 as

𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝜖

If β1 = 0, then Y does not depend on X in any way. The null hypothesis says there is
no linear relationship between the two variables (i.e., β1 = 0). The alternate hypothesis
is that there is a linear relationship (i.e., β1 ≠ 0). If the null hypothesis can be rejected,
we conclude that a linear relationship does exist, so X is helpful in predicting Y. The F
distribution is used for testing this hypothesis.

The F statistic used in the hypothesis test is based on the MSE and the mean squared
regression (MSR). The MSR is calculated as

MSR = SSR / k    (4-14)

where

𝑘 = 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡 𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠 𝑖𝑛 𝑡ℎ𝑒 𝑚𝑜𝑑𝑒𝑙

The F statistic is

F = MSR / MSE    (4-15)

Based on the assumptions regarding the errors in a regression model, this calculated
F statistic is described by the F distribution with
𝐷𝑒𝑔𝑟𝑒𝑒𝑠 𝑜𝑓 𝑓𝑟𝑒𝑒𝑑𝑜𝑚 𝑓𝑜𝑟 𝑡ℎ𝑒 𝑛𝑢𝑚𝑒𝑟𝑎𝑡𝑜𝑟 = 𝑑𝑓1 = 𝑘

𝐷𝑒𝑔𝑟𝑒𝑒𝑠 𝑜𝑓 𝑓𝑟𝑒𝑒𝑑𝑜𝑚 𝑓𝑜𝑟 𝑡ℎ𝑒 𝑑𝑒𝑛𝑜𝑚𝑖𝑛𝑎𝑡𝑜𝑟 = 𝑑𝑓2 = 𝑛 − 𝑘 − 1

where

k = number of independent (X) variables

If there is very little error, the denominator (MSE) of the F statistic is very small relative
to the numerator (MSR), and the resulting F statistic will be large. This is an indication
that the model is useful. A significance level related to the value of the F statistic is
then found. Whenever the F value is large, the observed significance level (p-value)
will be low, indicating that it is extremely unlikely that this could have occurred by
chance. When the F value is large (with a resulting small significance level), we can
reject the null hypothesis that there is no linear relationship. This means that there is
a linear relationship and the values of MSE and r2 are meaningful.

Steps in Hypothesis Test for a Significant Regression Model

1. Specify the null and alternative hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

2. Select the level of significance (α). Common values are 0.01 and 0.05.
3. Calculate the value of the test statistic using the formula

F = MSR / MSE

4. Make a decision using one of the following methods:
a) Reject the null hypothesis if the test statistic is greater than the critical F value;
otherwise, do not reject the null hypothesis:
Reject if Fcalculated > Fα, df1, df2
df1 = k
df2 = n − k − 1
b) Reject the null hypothesis if the observed significance level, or p-value, is less
than the level of significance (α); otherwise, do not reject the null hypothesis:
p-value = P(F > calculated test statistic)
Reject if p-value < α
GGP Construction Example

To illustrate the process of testing the hypothesis about a significant relationship,
consider the GGP Construction example.

Step 1:

H0: β1 = 0 (no linear relationship between X and Y)

H1: β1 ≠ 0 (a linear relationship exists between X and Y)

Step 2:

Select α = 0.05

Step 3:

Calculate the value of the test statistic. The MSE was already calculated to be
1.7188. The MSR is then calculated so that F can be found:

MSR = SSR / k = 15.6250 / 1 = 15.6250

F = MSR / MSE = 15.6250 / 1.7188 = 9.09

Step 4:

(a) Reject the null hypothesis if the test statistic is greater than the critical F value:

df1 = k = 1

df2 = n − k − 1 = 6 − 1 − 1 = 4

The critical value of F at a 5% level of significance with 1 and 4 degrees of freedom is

F0.05,1,4 = 7.71

Fcalculated = 9.09

Reject H0 because 9.09 > 7.71

Thus, there is sufficient evidence to conclude that there is a statistically significant
relationship between X and Y, so the model is helpful. The strength of this relationship
is measured by r2 = 0.69. Thus, we can conclude that about 69% of the variability in
sales (Y) is explained by the regression model based on local payroll (X).
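The whole test can be reproduced in code. The sketch below assumes SciPy is available (any F table gives the same critical value) and follows the four steps for the GGP data:

```python
from scipy.stats import f

SSR, SSE = 15.625, 6.875
n, k = 6, 1
alpha = 0.05

MSR = SSR / k                         # 15.625 (Equation 4-14)
MSE = SSE / (n - k - 1)               # 1.7188
F = MSR / MSE                         # 9.09   (Equation 4-15)

df1, df2 = k, n - k - 1               # 1 and 4
F_crit = f.ppf(1 - alpha, df1, df2)   # 7.71
p_value = f.sf(F, df1, df2)           # P(F > 9.09), about 0.04

print(F > F_crit, p_value < alpha)    # True True, so reject H0
```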
Multiple Regression Analysis

The multiple regression model is a practical extension of the model we just observed.
It allows us to build a model with several independent variables. The underlying model
is

𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + . . . + 𝛽𝑘 𝑋𝑘 + 𝜖 (4-16)
where

𝑌 = dependent variable (response variable)

Xi = ith independent variable (predictor variable or explanatory variable)

𝛽0 = intercept (value of Y when all Xi = 0)

βi = coefficient of the ith independent variable

k = number of independent variables

𝜖 = random error

To estimate the values of these coefficients, a sample is taken and the following
equation is developed:

𝑌̂ = 𝑏0 + 𝑏1 𝑋1 + 𝑏2 𝑋2 + . . . + 𝑏𝑘 𝑋𝑘 (4-17)

where

𝑌̂= predicted value of Y

𝑏0 = sample intercept (and is an estimate of 𝛽0 )

bi = sample coefficient of the ith variable (and is an estimate of βi)

Consider the case of Blithe Realty, a real estate company. Blithe Alcano, the owner
and broker for this company, wants to develop a model to determine a suggested listing
price for houses based on the size of the house and the age of the house. She selects
a sample of houses that have sold recently in a particular area, and she records the
selling price, the square footage of the house, the age of the house, and also the
condition (good, excellent, or mint) of each house, as shown in Table 4.4. Initially Blithe
plans to use only the square footage and age to develop a model, although she wants
to save the information on the condition of the house to use later. She wants to find the
coefficients for the following multiple regression model:
𝑌̂ = 𝑏0 + 𝑏1 𝑋1 + 𝑏2 𝑋2
where

𝑌̂ = predicted value of dependent variable (selling price)

𝑏0 = Y intercept

X1 and X2 = values of the two independent variables (square footage and age),
respectively

b1 and b2 = slopes for X1 and X2, respectively

Table 4.4 Blithe’s Real Estate Data

Selling Price    Square Footage    Age    Condition

95,000           1,926             30     Good
119,000          2,069             40     Excellent
124,800          1,720             30     Excellent
135,000          1,396             15     Good
142,800          1,706             32     Mint
145,000          1,847             38     Mint
159,000          1,950             27     Mint
165,000          2,323             30     Excellent
182,000          2,285             26     Mint
183,000          3,752             35     Good
200,000          2,300             18     Good
211,000          2,525             17     Good
215,000          3,800             40     Excellent
219,000          1,740             12     Mint
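One way to estimate the coefficients in Blithe's two-variable model is ordinary least squares through a linear-algebra routine. The sketch below uses NumPy as an assumption (the chapter does not specify the software); the printed coefficients are whatever the fit produces for the Table 4.4 data:

```python
import numpy as np

# Table 4.4: selling price (Y), square footage (X1), age (X2)
price = np.array([95000, 119000, 124800, 135000, 142800, 145000, 159000,
                  165000, 182000, 183000, 200000, 211000, 215000, 219000])
sqft = np.array([1926, 2069, 1720, 1396, 1706, 1847, 1950,
                 2323, 2285, 3752, 2300, 2525, 3800, 1740])
age = np.array([30, 40, 30, 15, 32, 38, 27, 30, 26, 35, 18, 17, 40, 12])

# Design matrix with a leading column of 1s for the intercept b0
A = np.column_stack([np.ones(len(price)), sqft, age])
b0, b1, b2 = np.linalg.lstsq(A, price, rcond=None)[0]
print(b0, b1, b2)   # least-squares estimates of the intercept and two slopes
```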

Binary or Dummy Variables

All of the variables we have used in regression examples have been quantitative
variables such as sales figures, payroll numbers, square footage, and age. These have
all been easily measurable and have had numbers associated with them. There are
many times when we believe a qualitative variable rather than a quantitative variable
would be helpful in predicting the dependent variable Y. For example, regression may
be used to find a relationship between annual income and certain characteristics of the
employees. Years of experience at a particular job would be a quantitative variable.
However, information regarding whether or not a person has a college degree might
also be important. This would not be a measurable value or quantity, so a special
variable called a dummy variable (or a binary variable or an indicator variable) would
be used. A dummy variable is assigned a value of 1 if a particular condition is met (e.g.,
a person has a college degree) and a value of 0 otherwise.

In Blithe's realty example, Blithe believes that a better model can be developed if the
condition of the property is included. To incorporate the condition of the house into
the model, Blithe looks at the information available (see Table 4.4) and sees that the
three categories are good condition, excellent condition, and mint condition. Since
these are not quantitative variables, she must use dummy variables. These are
defined as

X3 = 1 if house is in excellent condition

   = 0 otherwise

X4 = 1 if house is in mint condition

   = 0 otherwise

Notice there is no separate variable for “good” condition. If X3 and X4 are both 0, then
the house cannot be in excellent or mint condition, so it must be in good condition.
When using dummy variables, the number of variables must be 1 less than the number
of categories. In this problem, there were three categories (good, excellent, and mint
condition), so we must have two dummy variables. If we had mistakenly used too many
variables and the number of dummy variables equaled the number of categories, then
the mathematical computations could not be performed or would not give reliable
values.

These dummy variables will be used with the two previous variables (X1 - square
footage, and X2 - age) to try to predict the selling prices of houses for Blithe. The
significance level for the F test is 0.00017, so this model is statistically significant. The
coefficient of determination (r2) is 0.898, so this is a much better model than the
previous one. The regression equation is

𝑌̂ = 121,658 + 56.43𝑋1 − 3,962𝑋2 + 33,162𝑋3 + 47,369𝑋4

This indicates that a house in excellent condition (X3 = 1, X4 = 0) would sell for about
PhP 33,162 more than a house in good condition (X3 = 0, X4 = 0). A house in mint
condition (X3 = 0, X4 = 1) would sell for about PhP47,369 more than a house in good
condition.
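Encoding the dummies and applying the fitted equation takes only a few lines. In the sketch below the coefficients are the ones reported above, while the house being priced is hypothetical:

```python
def predicted_price(sqft, age, condition):
    """Y-hat = 121,658 + 56.43 X1 - 3,962 X2 + 33,162 X3 + 47,369 X4."""
    x3 = 1 if condition == "excellent" else 0   # dummy variable X3
    x4 = 1 if condition == "mint" else 0        # dummy variable X4
    # "good" is the base category: X3 = X4 = 0
    return 121658 + 56.43 * sqft - 3962 * age + 33162 * x3 + 47369 * x4

# Hypothetical house: 1,900 square feet, 20 years old, mint condition
print(predicted_price(1900, 20, "mint"))
```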
Model Building

In developing a good regression model, possible independent variables are identified
and the best ones are selected to be used in the model. The best model is a statistically
significant model with a high r2 and few variables.

As more variables are added to a regression model, r2 will usually increase, and it
cannot decrease. It is tempting to keep adding variables to a model to try to increase
r2. However, if too many independent variables are included in the model, problems
can arise. For this reason, the adjusted r2 value is often used (rather than r2) to
determine if an additional independent variable is beneficial. The adjusted r2 takes into
account the number of independent variables in the model, and it is possible for the
adjusted r2 to decrease. The formula for r2 is

r² = SSR/SST = 1 − SSE/SST

The adjusted r2 is

Adjusted r² = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]    (4-18)

Notice that as the number of variables (k) increases, n - k - 1 will decrease. This causes
SSE/ (n - k – 1) to increase, and consequently the adjusted r2 will decrease unless the
extra variable in the model causes a significant decrease in the SSE. Thus, the
reduction in error (and SSE) must be sufficient to offset the change in k.

As a general rule of thumb, if the adjusted r2 increases when a new variable is added
to the model, the variable should probably remain in the model. If the adjusted r2
decreases when a new variable is added, the variable should not remain in the model.
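Since SSE/SST = 1 − r2, Equation 4-18 can also be computed from r2 alone. A small sketch, using the Blithe Realty figures quoted in the Multicollinearity section below as example inputs:

```python
def adjusted_r2(r2, n, k):
    """Equation 4-18, rewritten using SSE/SST = 1 - r^2."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Blithe Realty model: r^2 of about 0.898 with n = 14 houses, k = 4 variables
print(adjusted_r2(0.898, 14, 4))   # about 0.85
```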

Stepwise Regression

While the process of model building may be tedious, there are many statistical software
packages that include stepwise regression procedures to do this. Stepwise
regression is an automated process to systematically add independent variables to
or delete them from a regression model. A forward stepwise procedure puts the most
significant variable in the model first and then adds the next variable that will improve
the model the most given that the first variable is already in the model. Variables
continue to be added in this fashion until all the variables are in the model or until any
remaining variables do not significantly improve the model. A backward stepwise
procedure begins with all independent variables in the model, and one by one the least
helpful variables are deleted. This continues until only significant variables remain.
Many variations of these stepwise models exist.
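A bare-bones forward stepwise procedure can be sketched as follows (an illustration only; statistical packages use significance tests and many more safeguards). At each pass it adds whichever remaining variable most improves the adjusted r2 and stops when no addition helps:

```python
import numpy as np

def adj_r2(y, cols):
    """Fit OLS of y on the given columns and return the adjusted r-squared."""
    n, k = len(y), len(cols)
    A = np.column_stack([np.ones(n)] + cols)
    b = np.linalg.lstsq(A, y, rcond=None)[0]
    sse = np.sum((y - A @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

def forward_stepwise(y, candidates):
    """candidates maps a variable name to its data column.
    Returns the names of the variables chosen, in the order added."""
    chosen, best = [], -np.inf
    while len(chosen) < len(candidates):
        scores = {name: adj_r2(y, [candidates[c] for c in chosen] + [col])
                  for name, col in candidates.items() if name not in chosen}
        name = max(scores, key=scores.get)
        if scores[name] <= best:   # no remaining variable improves the model
            break
        chosen.append(name)
        best = scores[name]
    return chosen
```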

Multicollinearity

In the Blithe Realty example, we saw an r2 of about 0.90 and an adjusted r2 of 0.85.
While other variables such as the size of the lot, the number of bedrooms, and the
number of bathrooms might be related to the selling price of a house, we may not want
to include these in the model. It is likely that these variables would be correlated with
the square footage of the house (e.g., more bedrooms usually means a larger house),
which is already included in the model. Thus, the information provided by these
additional variables might be duplication of information already in the model.

When an independent variable is correlated with one other independent variable, the
variables are said to share collinearity. If an independent variable is correlated with a
combination of other independent variables, the condition of multicollinearity exists.
This can create problems in interpreting the coefficients of the variables, as several
variables are providing duplicate information. For example, if two independent
variables were monthly salary expenses for a company and annual salary expenses
for a company, the information provided in one is also provided in the other. Several
sets of regression coefficients for these two variables would yield exactly the same
results. Thus, individual interpretation of these variables would be questionable,
although the model itself is still good for prediction purposes. When multicollinearity
exists, the overall F test is still valid, but the hypothesis tests related to the individual
coefficients are not. A variable may appear to be significant when it is insignificant, or
a variable may appear to be insignificant when it is significant.

Nonlinear Regression

The regression models we have seen are linear models. However, at times there exist
nonlinear relationships between variables. Some simple variable transformations can
be used to create an apparently linear model from a nonlinear relationship.

On every new automobile sold in the Philippines, the fuel efficiency (as measured by miles
per gallon [MPG] of gasoline) of the automobile is prominently displayed on the window
sticker. The MPG is related to several factors, one of which is the weight of the
automobile. Engineers at Toyota Motors, in an attempt to improve fuel efficiency, have
been asked to study the impact of weight on MPG. They have decided that a
regression model should be used to do this.

Table 4.5 Automobile Weight Versus MPG


MPG    Weight (1,000s lb)    MPG    Weight (1,000s lb)
12     4.58                  20     3.18
13     4.66                  23     2.68
15     4.02                  24     2.65
18     2.53                  33     1.70
19     3.09                  36     1.95
19     3.11                  42     1.92

A sample of 12 new automobiles was selected, and the weight and MPG rating were
recorded. Table 4.5 provides these data. A linear regression line is drawn through the
points. Excel was used to develop a simple linear regression equation to relate the
MPG (Y) to the weight in thousands of pounds (X1) in the form

𝑌̂ = 𝑏0 + 𝑏1 𝑋1

We got the equation

Ŷ = 47.6 − 8.2X1

or

𝑀𝑃𝐺 = 47.6 − 8.2(Weight in thousands of pounds)

The model is useful since the significance level for the F test is small and r2 = 0.7446.
Perhaps a nonlinear relationship exists, and maybe the model should be modified to
account for this. This model would be of the form

𝑀𝑃𝐺 = 𝑏0 + 𝑏1 (𝑊𝑒𝑖𝑔ℎ𝑡) + 𝑏2 (𝑊𝑒𝑖𝑔ℎ𝑡)2

The easiest way to develop this model is to define a new variable

𝑋2 = (𝑊𝑒𝑖𝑔ℎ𝑡)2

This gives us the model

𝑌̂ = 𝑏0 + 𝑏1 𝑋1 + 𝑏2 𝑋2

We can create another column in Excel and again run the regression tool; the new
equation is

Ŷ = 79.8 − 30.2X1 + 3.4X2


The significance level for F is low (0.0002), so the model is useful, and r2 = 0.8478.
The adjusted r2 increased from 0.719 to 0.814, so this new variable definitely improved
the model.
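Reproducing the transformation in code makes the idea plain: a new column equal to weight squared is added, and the same linear fitting routine is reused. A sketch with the Table 4.5 data (NumPy assumed, as before):

```python
import numpy as np

mpg = np.array([12, 13, 15, 18, 19, 19, 20, 23, 24, 33, 36, 42])
weight = np.array([4.58, 4.66, 4.02, 2.53, 3.09, 3.11,
                   3.18, 2.68, 2.65, 1.70, 1.95, 1.92])

# X2 = (weight)^2 makes the quadratic model linear in X1 and X2
A = np.column_stack([np.ones(len(mpg)), weight, weight ** 2])
b0, b1, b2 = np.linalg.lstsq(A, mpg, rcond=None)[0]
print(b0, b1, b2)   # should come out near the 79.8, -30.2, and 3.4 quoted above
```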

This model is good for prediction purposes. However, we should not try to interpret the
coefficients of the variables due to the correlation between X1 (weight) and X2 (weight
squared). Normally, we would interpret the coefficient for X1 as the change in Y that
results from a 1-unit change in X1, while holding all other variables constant. Obviously,
holding one variable constant while changing the other is impossible in this example,
since X2 = X1². If X1 changes, then X2 must change also. This is an example of a
problem that exists when multicollinearity is present.

Other types of nonlinearities can be handled using a similar approach. A number of
transformations exist that may help to develop a linear model from variables with
nonlinear relationships.

Pitfalls in Regression Analysis

This chapter has provided a brief introduction to regression analysis, one of the most
widely used quantitative techniques in business. However, some common errors are
made with regression models, so caution should be observed when using them.

 If the assumptions are not met, the statistical tests may not be valid. Any
interval estimates are also invalid, although the model can still be used for
prediction purposes.
 Correlation does not necessarily mean causation. Two variables (such as the
price of automobiles and your annual salary) may be highly correlated to one
another, but one is not causing the other to change. They may both be
changing due to other factors such as the economy in general or the inflation
rate.
 If multicollinearity is present in a multiple regression model, the model is still
good for prediction, but interpretation of individual coefficients is questionable.
The individual tests on the regression coefficients are not valid.
 Using a regression equation beyond the range of X is very questionable. A
linear relationship may exist within the range of values of X in the sample. What
happens beyond this range is unknown; the linear relationship may become
nonlinear at some point. For example, there is usually a linear relationship
between advertising and sales within a limited range. As more money is spent
on advertising, sales tend to increase even if everything else is held constant.
However, at some point, increasing advertising expenditures will have less
impact on sales unless the company does other things to help, such as opening
new markets or expanding the product offerings. If advertising is increased and
nothing else changes, the sales will probably level off at some point.
 Related to the limitation regarding the range of X is the interpretation of the
intercept (𝑏0 ). Since the lowest value for X in a sample is often much greater
than 0, the intercept is a point on the regression line beyond the range of X.
Therefore, we should not be concerned if the t-test for this coefficient is not
significant, as we should not be using the regression equation to predict a value
of Y when X = 0. This intercept is merely used in defining the line that fits the
sample points the best.
 Using the F test and concluding a linear regression model is helpful in
predicting Y does not mean that this is the best relationship. While this model
may explain much of the variability in Y, it is possible that a nonlinear
relationship might explain even more. Similarly, if it is concluded that no linear
relationship exists, another type of relationship could exist.
 A statistically significant relationship does not mean it has any practical value.
With large enough samples, it is possible to have a statistically significant
relationship, but r2 might be 0.01. This would normally be of little use to a
manager. Similarly, a high r2 could be found due to random chance if the
sample is small. The F test must also show significance to place any value in
r2.
