Chapter 6: Regression Analysis
Chapter Contents
1. Modelling Relationships and Trends in Data
2. Simple Linear Regression
3. Multiple Linear Regression
4. Regression with Categorical Independent
Variables
Modelling Relationships
and Trends in Data
Introduction
• Begin by creating a chart of the data and choosing the appropriate type of functional relationship to incorporate into an analytical model to understand the data.
• For cross-sectional data, we use a scatter chart; for time-series data, we use a line chart.
Types of Mathematical Functions

Linear Function: y = a + bx. Linear functions show steady increases or decreases over the range of x. This is the simplest type of function used in predictive models. It is easy to understand and, over small ranges of values, can approximate behavior rather well.

Logarithmic Function: y = ln(x). Logarithmic functions are used when the rate of change in a variable increases or decreases quickly and then levels out, such as with diminishing returns to scale. Logarithmic functions are often used in marketing models, where constant percentage increases in advertising result in constant, absolute increases in sales.

Polynomial Function: y = ax² + bx + c (second order); y = ax³ + bx² + cx + d (third order). A second-order polynomial is parabolic in nature and has only one hill or valley; a third-order polynomial has one or two hills or valleys. Revenue models that incorporate price elasticity are often polynomial functions.
Types of Mathematical Functions

Power Function: y = axᵇ. Power functions define phenomena that increase at a specific rate. Learning curves that express improving times in performing a task are often modelled with power functions having a > 0 and b < 0.

Exponential Function: y = abˣ. Exponential functions have the property that y rises or falls at constantly increasing rates. E.g., the perceived brightness of a lightbulb grows at a decreasing rate as the wattage increases. In this case, a would be a positive number and b would be between 0 and 1. The exponential function is often defined as y = aeˣ, where b = e, the base of natural logarithms (approximately 2.71828).
Excel Trendline Tool
• The Excel Trendline tool provides a convenient
method for determining the best-fitting
functional relationship among these alternatives
for a set of data.
• First, click the chart to which you wish to add a
trendline; this will display the Chart Tools menu.
• Select the Chart Tools Design tab, and then click
Add Chart Element from the Chart Layouts group.
• From the Trendline submenu, you can select one
of the options (Linear is the most common) or
More Trendline Options. . .
• If you select More Trendline Options, you will get
the Format Trendline pane in the worksheet.
R-square
• Trendlines can be used to model relationships between
variables and understand how the dependent variable
behaves as the independent variable changes.
• E.g., the demand-prediction models introduced in Chapter 1 would generally be developed by analysing data.
• R² (R-squared) is a measure of the “fit” of the line to the data. The value of R² will be between 0 and 1.
• The larger the value of R², the better the fit. We will discuss this further in the context of regression analysis.
Example: Modelling a Price-
Demand Function
Example: Predicting Crude Oil
Prices
• Be cautious when using polynomial functions.
• The R² value will continue to increase as the order of the polynomial increases; i.e., a third-order polynomial will provide a better fit than a second-order polynomial, and so on. However, higher-order polynomials will generally not be very smooth and will be difficult to interpret visually.
• Thus, we don't recommend going beyond a third-order polynomial when fitting data; the sketch below illustrates why.
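For readers who want to reproduce the Trendline behaviour outside Excel, here is a minimal Python sketch (with hypothetical price/demand data, not a textbook file) that fits polynomials of increasing order with numpy and reports R². It shows why R² keeps rising with the order even when the extra terms add little insight.

```python
import numpy as np

# Hypothetical data (x = price, y = demand); replace with your own series.
x = np.array([80, 90, 100, 110, 120, 130, 140, 150], dtype=float)
y = np.array([420, 400, 370, 340, 300, 270, 230, 200], dtype=float)

def r_squared(y_actual, y_predicted):
    """R^2 = 1 - SSE/SST, the same 'fit' measure the Trendline tool reports."""
    sse = np.sum((y_actual - y_predicted) ** 2)
    sst = np.sum((y_actual - np.mean(y_actual)) ** 2)
    return 1 - sse / sst

# Fit first-, second-, and third-order polynomials and compare R^2.
for order in (1, 2, 3):
    coeffs = np.polyfit(x, y, order)   # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)      # predicted values
    print(f"order {order}: R^2 = {r_squared(y, y_hat):.4f}")
# R^2 never decreases as the order grows, which is why a higher R^2 alone
# does not justify a higher-order polynomial.
```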
Simple Linear
Regression
Introduction
• Regression analysis is a tool for building mathematical
and statistical models that characterize relationships
between a dependent variable (which must be a ratio
variable and not categorical) and one or more
independent, or explanatory, variables, all of which are
numerical (but may be either ratio or categorical).
• Two broad categories of regression models are used
often in business settings:
1. Regression models of cross-sectional data and
2. Regression models of time-series data, in which the
independent variables are time or some function of time
and the focus is on predicting the future.
Types of Regression
• A regression model that involves a single
independent variable is called simple regression.
• A regression model that involves two or more
independent variables is called multiple regression.
Simple Linear Regression
• Simple linear regression involves finding a linear relationship
between one independent variable, X, and one dependent variable,
Y.
• The relationship between two variables can assume many forms and
may be linear or nonlinear, or there may be no relationship at all.
• Because we are focusing our discussion on linear regression models, the first thing to do is to verify that the relationship is linear. We would not expect to see the data line up perfectly along a straight line; we simply want to verify that the general relationship is linear.
• If the relationship is clearly nonlinear, then alternative approaches must be used, and if no relationship is evident, then it is pointless to even consider developing a linear regression model.
Simple Linear Regression
• To determine if a linear relationship exists between
the variables, we recommend that you create a
scatter chart that can show the relationship
between variables visually.
Example: Home Market Value
Data
Finding the Best Fitting Line
• The idea behind simple linear regression is to
express the relationship between the dependent
and independent variables by a simple linear
equation, such as:
market value = a + (b x square feet)
• where a is the y-intercept and b is the slope of the
line. If we draw a straight line through the data,
some of the points will fall above the line, some will
fall below it, and a few might fall on the line itself.
Finding the Best Fitting Line
• There are two possible straight lines that pass through the data. Clearly, you would choose A as the better-fitting line over B because all the points are closer to the line and the line appears to be in the middle of the data.
• The only difference between the lines is the value of the slope and intercept; thus, we seek to determine the values of the slope and intercept that provide the best-fitting line.
Example: Using Excel to Find the
Best Regression Line
Example: Using Excel to Find the
Best Regression Line cont.
• We can find the best-fitting line using the Excel Trendline tool (with the linear option chosen).
Least-Squares Regression
• The mathematical basis for the best-fitting regression line is called least-squares regression.
• In regression analysis, we assume that the values of the dependent variable, Y, in the sample data are drawn from some unknown population for each value of the independent variable, X.
• Because we are assuming that a linear relationship exists, the expected value of Y is β₀ + β₁X for each value of X.
• The coefficients β₀ and β₁ are population parameters that represent the intercept and slope, respectively, of the population from which a sample of observations is taken.
• Thus, for a specific value of X, we have many possible values of Y that vary around the mean. To account for this, we add an error term, ε (the Greek letter epsilon), to the mean, giving the population model Y = β₀ + β₁X + ε.
Least-Squares Regression
• However, because we do not know the entire population, we do not know the true values of β₀ and β₁.
• In practice, we must estimate these as best we can from the sample data.
• Thus, the estimated simple linear regression equation is Ŷ = b₀ + b₁X, where b₀ and b₁ are the sample estimates of the intercept and slope. A small worked sketch of these estimates follows below.
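As a companion to the Excel output, here is a minimal Python sketch of the least-squares estimates, using the standard formulas b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄. The square-footage and market-value numbers are placeholders, not the Home Market Value data set.

```python
import numpy as np

# Placeholder data: square feet (X) and market value (Y) for a handful of homes.
sqft  = np.array([1500, 1600, 1700, 1800, 1900, 2000], dtype=float)
value = np.array([85000, 88000, 92000, 96000, 99000, 104000], dtype=float)

x_bar, y_bar = sqft.mean(), value.mean()

# Slope: sum of cross-deviations of X and Y divided by the sum of squared deviations of X.
b1 = np.sum((sqft - x_bar) * (value - y_bar)) / np.sum((sqft - x_bar) ** 2)
# Intercept: forces the fitted line through the point (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

print(f"estimated model: market value = {b0:.2f} + {b1:.2f} * square feet")
```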
Example: Finding Least-Square
Coefficients
Simple Linear Regression with
Excel
• Data > Data Analysis > Regression
• The dialog box is displayed. In Input Y Range, specify the range of
the DV values. In Input X Range, specify the range for the IV
values. Check Labels if your data range contains a descriptive
label.
• You have the option of forcing the intercept to zero by checking
Constant is Zero; however, you will usually not check this box
because adding an intercept term allows a better fit to the data.
• You can set a Confidence Level (the default is 95%) to provide
confidence intervals for the intercept and slope parameters.
• In Residuals, you have the option of including a residuals output
table by checking the boxes for Residuals, Standardized
Residuals, Residual Plots, and Line Fit Plots.
• Residual Plots generates a chart for each independent variable versus the residual, and Line Fit Plots generates a scatter chart with the values predicted by the regression model included.
• Finally, you may choose to have Excel construct a normal probability plot for the DV, which transforms the cumulative probability scale (vertical axis) so that the graph of the cumulative normal distribution is a straight line. The closer the points are to a straight line, the better the fit to a normal distribution.
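If the Analysis ToolPak is not available, much the same output (coefficients, R², Significance F, and 95% confidence intervals) can be reproduced in Python with statsmodels. This is a hedged sketch that reuses the placeholder square-footage data from above rather than the actual worksheet.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for the Home Market Value worksheet.
sqft  = np.array([1500, 1600, 1700, 1800, 1900, 2000], dtype=float)
value = np.array([85000, 88000, 92000, 96000, 99000, 104000], dtype=float)

X = sm.add_constant(sqft)       # adds the intercept term (i.e., "Constant is Zero" left unchecked)
model = sm.OLS(value, X).fit()  # ordinary least squares, as in the Regression dialog

print(model.summary())          # R Square, Standard Error, ANOVA (Significance F),
                                # coefficients, p-values, and 95% confidence intervals
```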
Example: Home Market Value
Key quantities in the regression output:
• Multiple R: the sample correlation coefficient; indicates the correlation between the IV and DV. Values range from -1 to 1.
• R Square: the coefficient of determination; a measure of how well the regression line fits the data. Values range from 0 to 1.
• Adjusted R Square: a statistic that modifies the value of R² by incorporating the sample size and the number of explanatory variables in the model. Useful when comparing this model with other models that include additional IVs.
• Standard Error: the variability of observed Y-values from predicted values (Ŷ). If the data are clustered close to the regression line, the standard error will be small; the more scattered the data, the larger the standard error.
• Significance F is the p-value for the F-test. If Significance F is less than the level of significance (typically 0.05), we would reject the null hypothesis.
• Confidence intervals (the Lower 95% and Upper 95% values in the output) provide information about the unknown values of the true regression coefficients, accounting for sampling error. They tell us what we can reasonably expect the ranges for the population intercept and slope to be at a 95% confidence level.
• To test a hypothesis about a specific slope value B₁ (H₀: β₁ = B₁), we need only check whether B₁ falls within the confidence interval for the slope. If it does not, then we reject the null hypothesis; otherwise, we fail to reject it.
Reject or Do Not Reject the Null
Hypothesis?
• If we reject the null hypothesis, then we may
conclude that the slope of the independent variable
is not zero and, therefore, is statistically significant
in the sense that it explains some of the variation of
the dependent variable around the mean.
• If the value of Significance F, which is the p-value for the F-test, is less than the level of significance (typically 0.05), we would reject the null hypothesis.
Residual Analysis and Regression
Assumptions
• Residuals are the observed errors, which are the differences between the actual
values and the estimated values of the dependent variable using the regression
equation.
The residual is simply the difference
between the actual value of the
dependent variable and the
predicted value, or Yi – Ŷi.
• Standard residuals are residuals divided by their standard deviation. Standard residuals describe how far each residual is from its mean in units of standard deviations (similar to a z-value for a standard normal distribution).
• Standard residuals are useful in checking the assumptions underlying regression analysis and in detecting outliers that may bias the results.
• An outlier is an extreme value that is different from the rest of the data. A single outlier can make a significant difference in the regression equation, changing the slope and intercept and, hence, how they would be interpreted and used in practice.
• Some consider a standardized residual outside of ±2 standard deviations an outlier. A more conservative rule of thumb would be to consider outliers outside of a ±3 standard deviation range. (A short computational sketch follows below.)
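The sketch below, in Python with hypothetical actual and predicted values, computes raw residuals, standardizes them by their sample standard deviation, and flags points outside the ±2 rule of thumb. Excel's "Standard Residuals" use a slightly different scaling, so treat this as an illustration of the idea rather than an exact replica of the ToolPak output.

```python
import numpy as np

# Hypothetical actual and predicted values of the dependent variable.
y_actual    = np.array([85000, 88000, 92000, 96000, 99000, 130000], dtype=float)
y_predicted = np.array([86000, 89000, 91500, 95000, 100000, 103000], dtype=float)

residuals = y_actual - y_predicted                   # observed errors, Yi - Yhat_i
std_residuals = residuals / residuals.std(ddof=1)    # residuals in units of std. dev.

# Flag potential outliers using the +/-2 rule of thumb (use 3 for the conservative rule).
outliers = np.abs(std_residuals) > 2
print("standardized residuals:", np.round(std_residuals, 2))
print("potential outliers at observations:", np.where(outliers)[0])
```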
Example: Interpreting Residual
Output
Assumptions associated with Regression Analysis

Linearity: This is usually checked by examining a scatter diagram of the data or examining the residual plot. If the model is appropriate, then the residuals should appear to be randomly scattered about zero, with no apparent pattern. If the residuals exhibit some well-defined pattern, such as a linear trend, a parabolic shape, etc., then there is good evidence that some other functional form might better fit the data.

Normality of errors: Regression analysis assumes that the errors for each individual value of X are normally distributed, with a mean of zero. This can be verified either by examining a histogram of the standard residuals and inspecting for a bell-shaped distribution or by using more formal goodness-of-fit tests. It is usually difficult to evaluate normality with small sample sizes. However, regression analysis is fairly robust against departures from normality, so in most cases this is not a serious issue.

Homoscedasticity: The third assumption is homoscedasticity, which means that the variation about the regression line is constant for all values of the independent variable. This can also be evaluated by examining the residual plot and looking for large differences in the variances at different values of the independent variable. Caution should be exercised when looking at residual plots. In many applications, the model is derived from limited data, and multiple observations for different values of X are not available, making it difficult to draw definitive conclusions about homoscedasticity. If this assumption is seriously violated, then techniques other than least squares should be used for estimating the regression model.

Independence of errors: Finally, residuals should be independent for each value of the independent variable. For cross-sectional data, this assumption is usually not a problem. However, when time is the IV, this is an important assumption. If successive observations appear to be correlated (for example, by becoming larger over time or exhibiting a cyclical type of pattern), then this assumption is violated. Correlation among successive observations over time is called autocorrelation and can be identified by residual plots having clusters of residuals with the same sign. Autocorrelation can be evaluated more formally using a statistical test based on a measure called the Durbin–Watson statistic. The Durbin–Watson statistic is a ratio of the squared differences in successive residuals to the sum of the squares of all residuals. D will range from 0 to 4. Values below 1 suggest autocorrelation; values between 1.5 and 2.5 suggest no autocorrelation; and values above 2.5 suggest negative autocorrelation. This can become an issue when using regression in forecasting.
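The Durbin–Watson statistic described in the last row is straightforward to compute directly from the residuals. The Python sketch below uses a hypothetical residual series and the definition given above (squared successive differences divided by the sum of squared residuals).

```python
import numpy as np

# Hypothetical residuals from a time-series regression, in time order.
residuals = np.array([1.2, 0.8, 0.5, -0.3, -0.9, -1.1, -0.4, 0.6, 1.0, 0.7])

# D = sum of squared differences of successive residuals / sum of squared residuals.
d = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson D = {d:.2f}")
# Roughly: D below 1 suggests autocorrelation, values between 1.5 and 2.5 suggest none,
# and values above 2.5 suggest negative autocorrelation.
```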
Checking the Regression Assumptions
• When the assumptions of regression are violated, statistical inferences drawn from the hypothesis tests may not be valid. Thus, before drawing inferences about regression models and performing hypothesis tests, these assumptions should be checked.
• However, for model fitting and estimation purposes alone, the assumptions other than linearity are not strictly required.
Multiple Linear
Regression
Introduction
• A linear regression model with more than one
independent variable is called a multiple linear
regression model.
• A multiple linear regression model has the form:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
where the βᵢ are the regression coefficients and ε is the error term.
Example: Colleges and
Universities
• For the college and university data, the proposed model would be:
• Thus, b₂ would represent an estimate of the change in the graduation rate for a unit increase in the acceptance rate while holding all other variables constant.
Example: Colleges and
Universities
Example: Colleges and
Universities
Building a Good Regression
Model
• In the colleges and universities regression example,
all the independent variables were found to be
significant by evaluating the p-values of the
regression analysis.
• This will not always be the case and leads to the
question of how to build good regression models
that include the “best” set of variables.
Building a Good Regression
Model
• A systematic approach to building good regression models (a scripted sketch of this procedure follows below):
1. Construct a model with all available IVs. Check for significance of the IVs by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance.
3. Remove the variable identified in Step 2 from the model and evaluate adjusted R² (don't remove all variables with p-values that exceed α at the same time, but remove only one at a time).
4. Continue until all variables are significant.
Another criterion used to determine if a variable should be removed is the t-statistic. If |t| <
1, then the standard error will decrease and adjusted R2 will increase if the variable is
removed. If |t| > 1, then the opposite will occur.
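The step-by-step procedure above can be scripted. Here is a hedged Python sketch using statsmodels, with hypothetical column names, that repeatedly drops the least significant IV (largest p-value above α) one at a time and reports adjusted R² after each removal; it is an illustration of the idea, not the textbook's exact workflow.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(df, dv, ivs, alpha=0.05):
    """Drop, one at a time, the IV whose p-value is largest and exceeds alpha."""
    ivs = list(ivs)
    model = sm.OLS(df[dv], sm.add_constant(df[ivs])).fit()
    while True:
        pvals = model.pvalues.drop("const")        # p-values of the IVs only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha or len(ivs) == 1:  # stop when all remaining IVs are significant
            return model, ivs
        ivs.remove(worst)                           # remove only one variable per pass
        model = sm.OLS(df[dv], sm.add_constant(df[ivs])).fit()
        print(f"removed {worst}; adjusted R^2 is now {model.rsquared_adj:.4f}")

# Usage (hypothetical data frame and column names, not the textbook file):
# model, kept = backward_eliminate(data, "GradRate", ["SAT", "AcceptRate", "Spend", "Top10"])
```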
Correlation and Multicollinearity
• Correlation, a numerical value between -1 and +1, measures the linear
relationship between pairs of variables.
• The higher the absolute value of the correlation, the greater the strength of the
relationship.
• The sign simply indicates whether variables tend to increase together (positive)
or not (negative).
• However, strong correlations among the IVs can be problematic. This can signify
a phenomenon called multicollinearity, a condition occurring when two or
more IVs in the same regression model contain high levels of the same
information and, consequently, are strongly correlated with one another and
can predict each other better than the DV.
• When significant multicollinearity is present, it becomes difficult to isolate the
effect of one IV on the DV, and the signs of coefficients may be the opposite of
what they should be, making it difficult to interpret regression coefficients.
• Also, p-values can be inflated, resulting in the conclusion not to reject the null hypothesis for significance of regression when it should be rejected.
Some experts suggest that correlations between IVs exceeding an absolute value of 0.7 may indicate multicollinearity. However, multicollinearity is best measured using a statistic called the variance inflation factor (VIF) for each IV. More sophisticated software packages usually compute these; unfortunately, Excel does not.
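Because Excel does not report VIFs, here is a small numpy sketch that computes them directly from the definition VIFⱼ = 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing IV j on the other IVs. The design matrix below is a hypothetical placeholder with two deliberately correlated columns.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j regresses column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)   # regress X_j on the other IVs
        y_hat = others @ beta
        r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Hypothetical design matrix with three IVs; x2 is strongly related to x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=50)
x3 = rng.normal(size=50)
print([round(v, 2) for v in variance_inflation_factors(np.column_stack([x1, x2, x3]))])
```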
Example: Colleges and
Universities
Handy Tips
• It is not easy to identify the best regression model simply by
examining p-values.
• It often requires some experimentation and trial and error.
• From a practical perspective, the IVs selected should make some
sense in attempting to explain the DV (i.e., you should have some
reason to believe that changes in the IV will cause changes in the
DV even though causation cannot be proven statistically).
• Good modelers also try to have as simple a model as possible—an
age-old principle known as parsimony—with the fewest number of
explanatory variables that will provide an adequate interpretation
of the dependent variable.
• In the physical and management sciences, some of the most
powerful theories are the simplest.
Regression with
Categorical Independent
Variables
Introduction
• Some data of interest in a regression study may be
ordinal or nominal.
• This is common when including demographic data, for
example.
• Because regression analysis requires numerical data, we
could include categorical variables by coding the
variables.
• E.g., if one variable represents whether an individual is a college graduate or not, we might code No as 0 and Yes as 1.
• Such variables are often called dummy variables. (A short coding sketch follows below.)
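Coding a two-level categorical variable as 0/1 is a one-liner in pandas. The sketch below uses a hypothetical data frame with a "CollegeGrad" column; the column name and values are illustrative only.

```python
import pandas as pd

# Hypothetical data frame with a two-level categorical variable.
df = pd.DataFrame({"Salary": [52000, 61000, 48000, 75000],
                   "CollegeGrad": ["Yes", "No", "No", "Yes"]})

# Dummy variable: No -> 0, Yes -> 1, ready to be used as a numerical IV.
df["CollegeGrad"] = (df["CollegeGrad"] == "Yes").astype(int)
print(df)
```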
Example: A Model with
Categorical Variables
Example: A Model with
Categorical Variables cont.
Interaction Variable
• An interaction occurs when the effect of one
variable (i.e., the slope) is dependent on another
variable.
• We can test for interactions by defining a new
variable as the product of the two variables, X3 = X1
x X2, and testing whether this variable is significant,
leading to an alternative model.
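Testing for an interaction only requires adding the product column before fitting the regression. This is a hedged sketch with hypothetical Age, MBA, and Salary data (not the textbook example file), fit with statsmodels.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical salary data: Age (numerical) and MBA (dummy-coded 0/1).
df = pd.DataFrame({"Salary": [55000, 62000, 71000, 90000, 83000, 98000, 60000, 105000],
                   "Age":    [28,    33,    38,    45,    41,    50,    30,    55],
                   "MBA":    [0,     0,     1,     1,     0,     1,     1,     1]})

df["Age_x_MBA"] = df["Age"] * df["MBA"]     # interaction term X3 = X1 * X2

X = sm.add_constant(df[["Age", "MBA", "Age_x_MBA"]])
model = sm.OLS(df["Salary"], X).fit()
print(model.pvalues)   # check whether the interaction term is significant
```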
Example: Interaction Variable
• Add the interaction term Age × MBA.
Example: Interaction Variable
cont.
Categorical Variables with More
than Two Levels
• When a categorical variable has only two levels, as
in the previous example, we coded the levels as 0
and 1 and added a new variable to the model.
• However, when a categorical variable has k > 2
levels, we need to add k - 1 additional variables to
the model.
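For a categorical variable with k > 2 levels, pandas can create the k − 1 dummy columns directly. The sketch below uses hypothetical columns loosely inspired by a surface-finish setting; it is not the actual Surface Finish data file.

```python
import pandas as pd

# Hypothetical data with a three-level categorical variable (k = 3).
df = pd.DataFrame({"SurfaceFinish": [3.2, 4.1, 5.0, 3.8, 4.6],
                   "Tool": ["A", "B", "C", "A", "C"]})

# drop_first=True keeps k - 1 = 2 dummy columns; level "A" becomes the baseline.
dummies = pd.get_dummies(df["Tool"], prefix="Tool", drop_first=True).astype(int)
df = pd.concat([df, dummies], axis=1)
print(df)
```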
Example: Surface Finish Data
Regression Models with Non-
Linear Terms
• Linear regression models are not appropriate for
every situation.
• A scatter chart of the data might show a nonlinear
relationship, or the residuals for a linear fit might
result in a nonlinear pattern.
• In such cases, we might propose a nonlinear model
to explain the relationship.
• For instance, a second-order polynomial model would be:
Y = β₀ + β₁X + β₂X² + ε
• Sometimes, this is called a curvilinear regression model. In this model, β₁ represents the linear effect of X on Y, and β₂ represents the curvilinear effect.
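A second-order (curvilinear) model is still fit with ordinary least squares once an X² column is added to the data. Below is a minimal Python sketch with hypothetical data showing a clearly curved relationship.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with a curved relationship between X and Y.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.1, 3.9, 7.2, 11.8, 18.1, 26.3, 36.0, 47.5])

X = np.column_stack([x, x ** 2])    # linear and curvilinear terms
X = sm.add_constant(X)              # intercept beta_0

model = sm.OLS(y, X).fit()
print(model.params)                 # b0, b1 (linear effect), b2 (curvilinear effect)
```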
Example: Modelling Curvilinear
Regression
Example: Cont.
End of Chapter Exercises
1. A consumer products company has collected
some data relating monthly demand to the price
of one of its products:
• What type of model would best represent these data? Use the Trendline tool to find the best among the options provided.
End of Chapter Exercises
2. The managing director of a consulting group has the following
monthly data on total overhead costs and professional labour
hours to bill to clients:
a. Develop a trendline to identify the relationship between billable hours and overhead costs.
b. Interpret the coefficients of your regression model. Specifically, what
does the fixed component of the model mean to the consulting firm?
c. If a special job requiring 1,000 billable hours that would contribute a
margin of $38,000 before overhead was available, would the job be
attractive?
End of Chapter Exercises
3. Using the data in the Excel file Home Market
Value, develop a multiple linear regression model
for estimating the market value as a function of
both the age and size of the house. Predict the
value of a house that is 30 years old and has
1,800 square feet, and one that is 5 years old and
has 2,800 square feet.
Any Questions
End of Chapter 6
