Chapter 6
Analysis
Chapter Contents
1. Modelling Relationships and Trends in Data
2. Simple Linear Regression
3. Multiple Linear Regression
4. Regression with Categorical Independent Variables
Modelling Relationships
and Trends in Data
Introduction
• Begin by creating a chart of the data, then choose an
appropriate type of functional relationship to
incorporate into an analytical model for understanding
the data.
• For cross-sectional data, we use a scatter chart; for
time series data we use a line chart.
Types of Mathematical Functions
Linear Function: y = a + bx. Linear functions show steady increases or decreases
over the range of x. This is the simplest type of function used in predictive
models. It is easy to understand, and over small ranges of values, can
approximate behavior rather well.
Logarithmic Function: y = ln(x). Logarithmic functions are used when the rate of
change in a variable increases or decreases quickly and then levels out, such as
with diminishing returns to scale. Logarithmic functions are often used in
marketing models where constant percentage increases in advertising result in
constant, absolute increases in sales.
Polynomial Function: y = ax^2 + bx + c (second order); y = ax^3 + bx^2 + cx + d
(third order). A second-order polynomial is parabolic in nature and has only one
hill or valley; a third-order polynomial has one or two hills or valleys. Revenue
models that incorporate price elasticity are often polynomial functions.
Types of Mathematical Functions
Power Function: y = ax^b. Power functions define phenomena that increase at a
specific rate. Learning curves that express improving times in performing a task
are often modelled with power functions having a > 0 and b < 0.
Exponential Function: y = ab^x. Exponential functions have the property that y
rises or falls at constantly increasing rates. E.g. the perceived brightness of a
lightbulb grows at a decreasing rate as the wattage increases. In this case, a
would be a positive number and b would be between 0 and 1. The exponential
function is often defined with b = e, the base of natural logarithms
(approximately 2.71828), giving y = ae^x.
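As an illustration only (not part of the original slides), the sketch below fits each of these functional forms to a small hypothetical data set using scipy.optimize.curve_fit and reports R^2 for each, which is one way to compare how well the alternatives fit. The logarithmic form is generalised to y = a + b ln(x), and the data values and starting guesses are assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data (illustrative values only), e.g. x = price, y = demand
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([12.0, 9.5, 8.2, 7.4, 6.9, 6.5, 6.2, 6.0])

# Candidate functional forms from the table above
def linear(x, a, b):       return a + b * x
def logarithmic(x, a, b):  return a + b * np.log(x)
def power(x, a, b):        return a * x ** b
def exponential(x, a, b):  return a * b ** x

# Starting guesses chosen for this made-up, decreasing data set
models = {
    "linear":      (linear,      (1.0, 1.0)),
    "logarithmic": (logarithmic, (1.0, 1.0)),
    "power":       (power,       (10.0, -0.5)),
    "exponential": (exponential, (10.0, 0.9)),
}

# Fit each form and report R^2 so the best-fitting relationship can be chosen
for name, (f, p0) in models.items():
    params, _ = curve_fit(f, x, y, p0=p0, maxfev=10000)
    residuals = y - f(x, *params)
    r2 = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
    print(f"{name:12s} params={np.round(params, 3)}  R^2={r2:.3f}")
```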
Excel Trendline Tool
• The Excel Trendline tool provides a convenient
method for determining the best-fitting
functional relationship among these alternatives
for a set of data.
• First, click the chart to which you wish to add a
trendline; this will display the Chart Tools menu.
• Select the Chart Tools Design tab, and then click
Add Chart Element from the Chart Layouts group.
• From the Trendline submenu, you can select one
of the options (Linear is the most common) or
More Trendline Options…
• If you select More Trendline Options, you will get
the Format Trendline pane in the worksheet.
R-square
• Trendlines can be used to model relationships between
variables and understand how the dependent variable
behaves as the independent variable changes.
• E.g. the demand-prediction models introduced in
Chapter 1 would generally be developed by analysing
data.
• R2 (R-squared) is a measure of the “fit” of the line to the
data. The value of R2 will be between 0 and 1.
• The larger the value of R2 the better the fit. We will
discuss this further in the context of regression analysis.
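A minimal sketch (not part of the original slides) of what the Trendline tool reports for a linear fit: the fitted line and its R^2, here computed with numpy on hypothetical data.

```python
import numpy as np

# Hypothetical data (illustrative values only): x = advertising spend, y = sales
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Fit a linear trendline y = a + b*x (polyfit returns highest-degree coefficient first)
b, a = np.polyfit(x, y, deg=1)

# R^2: the proportion of variation in y explained by the trendline (between 0 and 1)
y_hat = a + b * x
ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(f"trendline: y = {a:.3f} + {b:.3f}x,  R^2 = {r_squared:.3f}")
```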
Example: Modelling a Price-
Demand Function
Example: Predicting Crude Oil
Prices
There are two possible straight lines that pass through the data. Clearly, you would
choose A as the better-fitting line over B because all the points are closer to the
line and the line appears to be in the middle of the data.
The only difference between the lines is the value of the slope and intercept; thus,
we seek to determine the values of the slope and intercept that provide the best-
fitting line.
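A minimal sketch (not part of the original slides) of how the best-fitting slope and intercept are determined by least squares, i.e. by minimising the sum of squared vertical distances from the points to the line; the data values are hypothetical.

```python
import numpy as np

# Hypothetical data (illustrative values only)
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.1, 5.0, 5.7, 7.9, 8.6, 10.9])

# Closed-form least-squares estimates for the line y = a + b*x
x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a = y_bar - b * x_bar                                             # intercept

print(f"best-fitting line: y = {a:.3f} + {b:.3f}x")
```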
Example: Using Excel to Find the
Best Regression Line
Example: Using Excel to Find the
Best Regression Line cont.
• Standard residuals are residuals divided by their std. dev. Standard residuals
describe how far each residual is from its mean in units of std. dev. (similar to a z-
value for a standard normal distribution). Standard residuals are useful in checking
assumptions underlying regression analysis, and to detect outliers that may bias the
results. An outlier is an extreme value that is different from the rest of the data. A
single outlier can make a significant difference in the regression equation, changing
the slope and intercept and, hence, how they would be interpreted and used in
practice. Some consider a standardized residual outside of ±2 std. dev. as an outlier.
A more conservative rule of thumb would be to consider outliers outside of a ±3 std.
dev. range.
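A minimal sketch (not part of the original slides) of computing standard residuals as defined above, residuals divided by their standard deviation, and flagging potential outliers with the ±2 rule; the data are hypothetical, with the last point made deliberately extreme.

```python
import numpy as np

# Hypothetical data (illustrative values only); the last y value is an extreme point
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 4.1, 5.9, 8.3, 9.7, 12.1, 13.8, 22.0])

# Fit the simple regression and compute residuals (observed minus predicted)
b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

# Standard residuals: residuals divided by their standard deviation
std_residuals = residuals / residuals.std(ddof=1)

for xi, yi, sr in zip(x, y, std_residuals):
    flag = "possible outlier (|standard residual| > 2)" if abs(sr) > 2 else ""
    print(f"x={xi:4.1f}  y={yi:5.1f}  standard residual={sr:6.2f}  {flag}")
```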
Example: Interpreting Residual
Output
Assumptions associated with
Regression Analysis
Linearity: This is usually checked by examining a scatter diagram of the data or examining the residual
plot. If the model is appropriate, then the residuals should appear to be randomly scattered about zero,
with no apparent pattern. If the residuals exhibit some well-defined pattern, such as a linear trend or a
parabolic shape, then there is good evidence that some other functional form might better fit the data.
Normality of errors: Regression analysis assumes that the errors for each individual value of X are
normally distributed, with a mean of zero. This can be verified either by examining a histogram of the
standard residuals and inspecting for a bell-shaped distribution or by using more formal goodness-of-fit
tests. It is usually difficult to evaluate normality with small sample sizes. However, regression analysis is
fairly robust against departures from normality, so in most cases this is not a serious issue.
Homoscedasticity: The third assumption is homoscedasticity, which means that the variation about the
regression line is constant for all values of the independent variable. This can also be evaluated by
examining the residual plot and looking for large differences in the variances at different values of the
independent variable. Caution should be exercised when looking at residual plots. In many applications,
the model is derived from limited data, and multiple observations for different values of X are not
available, making it difficult to draw definitive conclusions about homoscedasticity. If this assumption is
seriously violated, then techniques other than least squares should be used for estimating the
regression model.
Independence of errors: Finally, residuals should be independent for each value of the independent
variable. For cross-sectional data, this assumption is usually not a problem. However, when time is the
IV, this is an important assumption. If successive observations appear to be correlated, for example, by
becoming larger over time or exhibiting a cyclical type of pattern, then this assumption is violated.
Correlation among successive observations over time is called autocorrelation and can be identified by
residual plots having clusters of residuals with the same sign. Autocorrelation can be evaluated more
formally using a statistical test based on a measure called the Durbin–Watson statistic. The
Durbin–Watson statistic is the ratio of the sum of squared differences in successive residuals to the sum
of the squares of all residuals; D ranges from 0 to 4. Values below 1 suggest autocorrelation; values
above 1.5 and below 2.5 suggest no autocorrelation; and values above 2.5 suggest negative
autocorrelation. This can become an issue when using regression in forecasting.
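A minimal sketch (not part of the original slides) of computing the Durbin–Watson statistic directly from a series of residuals, matching the ratio described above; the residual values are hypothetical, and the thresholds from the slide can then be applied to the computed D.

```python
import numpy as np

def durbin_watson(residuals: np.ndarray) -> float:
    """Ratio of the sum of squared differences in successive residuals
    to the sum of squared residuals; D ranges from 0 to 4."""
    diffs = np.diff(residuals)
    return float(np.sum(diffs ** 2) / np.sum(residuals ** 2))

# Hypothetical residuals from a time-series regression (illustrative values only)
residuals = np.array([1.2, 0.9, 0.7, 0.2, -0.3, -0.8, -1.1, -0.9, -0.4, 0.5])

d = durbin_watson(residuals)
print(f"Durbin-Watson D = {d:.2f}")  # values below 1 suggest autocorrelation
```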
Checking for Assumptions in
Linear Regression
• When assumptions of regression are violated, then
statistical inferences drawn from the hypothesis tests may
not be valid. Thus, before drawing inferences about
regression models and performing hypothesis tests, these
assumptions should be checked. However, other than
linearity, these assumptions matter for statistical inference
rather than for model fitting and estimation purposes alone.
Multiple Linear
Regression
Introduction
• A linear regression model with more than one
independent variable is called a multiple linear
regression model.
• A multiple linear regression model has the form:
Y = β0 + β1X1 + β2X2 + … + βkXk + ε
where Y is the dependent variable, X1, …, Xk are the
independent variables, β0 is the intercept, β1, …, βk are
the regression coefficients, and ε is the error term.
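A minimal sketch (not part of the original slides) of estimating such a model by least squares, assuming Python with statsmodels; the two independent variables and all data values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data (illustrative values only): two IVs and one DV
x1 = np.array([3.0, 5.0, 2.0, 8.0, 7.0, 6.0, 4.0, 9.0])
x2 = np.array([1.0, 2.0, 1.5, 3.0, 2.5, 2.0, 1.0, 3.5])
y  = np.array([7.2, 11.1, 6.0, 17.3, 14.9, 12.8, 8.9, 19.4])

# Build the design matrix with an intercept column (b0)
X = sm.add_constant(np.column_stack([x1, x2]))

# Fit Y = b0 + b1*X1 + b2*X2 by least squares
model = sm.OLS(y, X).fit()
print(model.params)    # estimated coefficients b0, b1, b2
print(model.rsquared)  # R^2 of the fitted model
print(model.pvalues)   # p-values used to judge significance of each IV
```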
Example: Colleges and
Universities
For the college and university data, the proposed model would be:
Thus, b2 would represent an estimate of the change in the graduation rate for a
unit increase in the acceptance rate while holding all other variables constant.
Example: Colleges and
Universities
Example: Colleges and
Universities
Building a Good Regression
Model
• In the colleges and universities regression example,
all the independent variables were found to be
significant by evaluating the p-values of the
regression analysis.
• This will not always be the case and leads to the
question of how to build good regression models
that include the “best” set of variables.
Building a Good Regression
Model
• A systematic approach to building good regression
models:
1. Construct a model with all available IVs. Check for
significance of the IVs by examining the p-values.
2. Identify the independent variable having the largest p-
value that exceeds the chosen level of significance.
3. Remove the variable identified in Step 2 from the
model and evaluate adjusted R2 (don't remove all
variables with p-values that exceed α at the same time,
but remove only one at a time).
4. Continue until all variables are significant.
Another criterion used to determine if a variable should be removed is the t-statistic. If |t| <
1, then the standard error will decrease and adjusted R2 will increase if the variable is
removed. If |t| > 1, then the opposite will occur.
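A minimal sketch (not part of the original slides) of this backward-elimination idea, assuming Python with pandas and statsmodels: repeatedly drop the single IV with the largest p-value above the significance level and refit, tracking adjusted R2 at each step; the data set is synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y, X: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Drop, one at a time, the IV with the largest p-value exceeding alpha."""
    X = X.copy()
    while True:
        model = sm.OLS(y, sm.add_constant(X)).fit()
        pvals = model.pvalues.drop("const")          # ignore the intercept
        print(f"adjusted R^2 = {model.rsquared_adj:.3f} with IVs {list(X.columns)}")
        if pvals.empty or pvals.max() <= alpha:      # all remaining IVs significant
            return X
        worst = pvals.idxmax()
        print(f"dropping {worst} (p = {pvals[worst]:.3f})")
        X = X.drop(columns=[worst])                  # remove only one variable at a time

# Hypothetical data (illustrative values only); x3 is deliberately irrelevant
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(40, 3)), columns=["x1", "x2", "x3"])
y = 2 + 1.5 * X["x1"] + 0.8 * X["x2"] + rng.normal(scale=0.5, size=40)

kept = backward_eliminate(y, X)
print("variables kept:", list(kept.columns))
```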
Correlation and Multicollinearity
• Correlation, a numerical value between -1 and +1, measures the linear
relationship between pairs of variables.
• The higher the absolute value of the correlation, the greater the strength of the
relationship.
• The sign simply indicates whether variables tend to increase together (positive)
or not (negative).
• However, strong correlations among the IVs can be problematic. This can signify
a phenomenon called multicollinearity, a condition occurring when two or
more IVs in the same regression model contain high levels of the same
information and, consequently, are strongly correlated with one another and
can predict each other better than the DV.
• When significant multicollinearity is present, it becomes difficult to isolate the
effect of one IV on the DV, and the signs of coefficients may be the opposite of
what they should be, making it difficult to interpret regression coefficients.
• Also, p-values can be inflated, resulting in the conclusion not to reject the null
hypothesis for significance of regression when it should be rejected.
Some experts suggest that correlations between IVs exceeding an absolute value of 0.7 may indicate
multicollinearity. However, multicollinearity is best measured using a statistic called the variance inflation
factor (VIF) for each IV. More sophisticated software packages usually compute these; unfortunately,
Excel does not.
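A minimal sketch (not part of the original slides) of computing the VIF for each IV, assuming Python with pandas and statsmodels: regress each IV on the remaining IVs and take VIF = 1 / (1 - R^2); the data are synthetic, with x2 constructed to be nearly a copy of x1 so that multicollinearity shows up clearly.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def variance_inflation_factors(X: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing IV j on the other IVs."""
    vifs = {}
    for col in X.columns:
        others = sm.add_constant(X.drop(columns=[col]))
        r2 = sm.OLS(X[col], others).fit().rsquared
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs)

# Hypothetical data in which x2 is nearly a copy of x1 (strong multicollinearity)
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)
x3 = rng.normal(size=50)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(variance_inflation_factors(X).round(2))  # large VIFs flag collinear IVs
```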
Example: Colleges and
Universities
Handy Tips
• It is not easy to identify the best regression model simply by
examining p-values.
• It often requires some experimentation and trial and error.
• From a practical perspective, the IVs selected should make some
sense in attempting to explain the DV (i.e., you should have some
reason to believe that changes in the IV will cause changes in the
DV even though causation cannot be proven statistically).
• Good modelers also try to have as simple a model as possible—an
age-old principle known as parsimony—with the fewest number of
explanatory variables that will provide an adequate interpretation
of the dependent variable.
• In the physical and management sciences, some of the most
powerful theories are the simplest.
Regression with
Categorical Independent
Variables
Introduction
• Some data of interest in a regression study may be
ordinal or nominal.
• This is common when including demographic data, for
example.
• Because regression analysis requires numerical data, we
could include categorical variables by coding the
variables.
• E.g. if one variable represents whether an individual is a
college graduate or not, we might code No as 0 and Yes
as 1.
• Such variables are often called dummy variables.
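A minimal sketch (not part of the original slides) of coding such a categorical variable as a 0/1 dummy and including it in a regression, assuming Python with pandas and statsmodels; the variable names and data values are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data (illustrative values only): salary vs. years of experience
# and whether the individual is a college graduate (No/Yes)
df = pd.DataFrame({
    "years":    [1, 3, 4, 6, 7, 9, 10, 12],
    "graduate": ["No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"],
    "salary":   [38, 52, 45, 63, 50, 72, 78, 58],
})

# Code the categorical variable as a dummy: No -> 0, Yes -> 1
df["graduate_dummy"] = (df["graduate"] == "Yes").astype(int)

# Fit salary = b0 + b1*years + b2*graduate_dummy
X = sm.add_constant(df[["years", "graduate_dummy"]])
model = sm.OLS(df["salary"], X).fit()
print(model.params)  # the dummy's coefficient shifts the intercept for graduates
```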
Example: A Model with
Categorical Variables
Example: A Model with
Categorical Variables cont.
Interaction Variable
• An interaction occurs when the effect of one
variable (i.e., the slope) is dependent on another
variable.
• We can test for interactions by defining a new
variable as the product of the two variables,
X3 = X1 × X2, and testing whether this variable is
significant, leading to an alternative model.
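A minimal sketch (not part of the original slides) of testing an interaction this way, assuming Python with pandas and statsmodels: add the product X1 × X2 as a third variable and check its p-value; the data are synthetic and constructed so that the slope of x1 depends on x2.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data in which the effect of x1 on y depends on x2
rng = np.random.default_rng(2)
x1 = rng.uniform(0, 10, size=60)
x2 = rng.uniform(0, 5, size=60)
y = 4 + 1.0 * x1 + 0.5 * x2 + 0.8 * x1 * x2 + rng.normal(scale=1.0, size=60)

df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x1 * x2})  # x3 is the interaction term

model = sm.OLS(y, sm.add_constant(df)).fit()
print(model.pvalues)  # a small p-value for x3 indicates a significant interaction
```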
Example: Interaction Variable
A second-order model of the form Y = β0 + β1X + β2X^2 is sometimes called a curvilinear
regression model. In this model, β1 represents the linear effect of X on Y, and β2 represents
the curvilinear effect.
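A minimal sketch (not part of the original slides) of fitting a curvilinear model by regressing Y on X and X^2, assuming Python with pandas and statsmodels; the data are synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data with a clear curvilinear pattern (illustrative values only)
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 30)
y = 5 + 2.0 * x - 0.15 * x**2 + rng.normal(scale=0.5, size=x.size)

# Regress Y on X and X^2: the X^2 coefficient captures the curvilinear effect
X = sm.add_constant(pd.DataFrame({"x": x, "x_squared": x**2}))
model = sm.OLS(y, X).fit()
print(model.params)   # estimates of beta0, beta1 (linear), beta2 (curvilinear)
print(model.rsquared)
```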
Example: Modelling Curvilinear
Regression
Example: Cont.
End of Chapter Exercises
1. A consumer products company has collected
some data relating monthly demand to the price
of one of its products: