Business Analytics
Before we help Disney develop a forecast of its home video units, let’s
build a basic understanding of regression analysis by taking a careful
look at some data on recent residential real estate transactions in the
Boston area. Suppose you are interested in purchasing a single-family
home near Boston and would like to understand the relationship between
selling price and the size of a house. You are confident that larger homes
tend to cost more, but you want to gain a deeper understanding about the
structure of the relationship between selling price and house size.
We'll use Excel's regression tools to actually identify the best fit line through a data set. But to really understand regression, it's important to understand how the regression line is determined.

Let's take another look at the housing data. Clearly we can't draw a single straight line through every point in the data set. This shouldn't surprise us, because house size alone is by no means a perfect predictor of a home's selling price. There are many other factors that influence a home's value. The regression line is the linear relationship that best fits the data, but it won't pass through every point.

Remember when you tried to find the best fit line for the housing data? You probably tried to draw a line that would touch or get close to as many points as possible. This is essentially what Excel does. Broadly speaking, the regression line is the line that minimizes the dispersion of points around that line, and we measure the accuracy of the regression line by measuring that dispersion. We attribute the difference between the actual data points and the values predicted by the regression line either to relationships between selling price and variables other than house size or to chance alone.
Now that we have some understanding of how to find a best fit line, let’s
look at the structure of the equation of this line. In general, a single
variable regression line can be described by the equation
ŷ = a + bx

where a is the y-intercept and b is the slope. (ŷ is pronounced "y-hat" and represents the predicted value of the dependent variable for a given value of the independent variable x.) The y-intercept is the point at which the regression line intersects the vertical axis; in other words, it is the value of ŷ when x = 0. The slope is the change in ŷ associated with a one-unit increase in x.

We write the estimated line as ŷ = a + bx to distinguish it from ŷ = α + βx, the idealized equation that represents the "true" best fit line. Because the best fit line does not perfectly fit even the population data, we add an error term, ε, and write y = α + βx + ε. That is, ε = y − ŷ.
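To see how the coefficients a and b are determined, here is a minimal sketch of the least squares calculation in Python. The house sizes and prices are hypothetical illustrations, not the actual Boston data set:

```python
# Least squares fit of the line ŷ = a + bx, computed by hand.
# Hypothetical data, not the course's Boston housing data set.
sizes = [1000, 1500, 2000, 2500, 3000]   # x: house size (sq ft)
prices = [250, 320, 410, 470, 560]       # y: selling price ($ thousands)

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²  and  a = ȳ - b·x̄
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / \
    sum((x - mean_x) ** 2 for x in sizes)
a = mean_y - b * mean_x

print(a, b)  # intercept ≈ 94.0, slope ≈ 0.154 for this toy data
```

Excel's regression tool carries out exactly this minimization, along with the additional diagnostics discussed later in the module.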
4.2 Summary
Lesson Summary
Excel Summary
Once we have found the regression equation for a given data set, we can
use that equation to obtain a point forecast, in this case, the predicted
selling price for a given house size. For example, we may want to predict
the price of a house on the basis of its size. How much can we expect to
pay for a 1,200 square foot home?
Suppose for a moment that we did not know anything about the
relationship between selling price and house size, that is, suppose we
had only the historical data. In that case, we might simply note that when
a house of that size sold recently, it sold for approximately $266,000. And
so we might predict that we would pay around the same amount for any
1,200 square foot house.
[Spreadsheet excerpt with columns: City, House Size (Sqft), Selling Price ($)]
A single historical data point does not yield the best forecast. Indeed, a
historical data point may not even exist for the house size we are
interested in. Even if it does, the price of that house provides information
on only a single house—it doesn’t reflect information about the other
houses’ sizes and prices. In contrast, regression analysis brings the
power of the entire data set to our prediction. In general, regression
allows us to generate far more accurate predictions than we could make
by inferring a future price from a single data point. Once we have
identified the linear relationship between the two variables, we can use
that regression equation to forecast.
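As a sketch of the arithmetic, suppose (hypothetically) the fitted line were ŷ = 20,000 + 205x; the coefficients below are illustrative placeholders, not Excel's actual output for this data set:

```python
# Point forecast from a fitted single-variable regression line.
# The coefficient values are hypothetical placeholders.
a = 20000.0   # y-intercept ($), hypothetical
b = 205.0     # slope ($ per square foot), hypothetical

def predict_price(size_sqft):
    """Point forecast of selling price for a house of the given size."""
    return a + b * size_sqft

print(predict_price(1200))  # → 266000.0
```

With these placeholder coefficients, the point forecast for a 1,200 square foot home happens to match the roughly $266,000 historical price mentioned above; the regression forecast, however, draws on the entire data set rather than one observation.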
Using the regression equation, we can forecast selling prices for values of the independent variable, x, between 0 and 7,000 square feet. Note that some of these values fall outside the range of historical data, so we must exercise caution.
Specifically, we should look at the range and dispersion of the historical
values of the independent variable (x-values). Since we have no
information about houses outside the historical range, there is greater
uncertainty when predicting selling price for such homes.
In order to use this function we must have the original data. This approach also gives us a
point forecast, but does not provide other helpful values that Excel’s regression tool
produces.
The predictions you just made were point forecasts, or single values, each representing
the expected selling price for a home of a given size.
But there is often a great deal of uncertainty when we use a regression model to
forecast, in part because the regression line does not perfectly fit the data, and in part
because the regression line itself is only an estimate of the true best fit line.
In addition, our forecast uncertainty increases as we near the edges or go outside of the
historical range of our data.
BEGIN GRAPH DESCRIPTION
The rectangular area labeled Range of Historical Data contains most of the data points, which fall between 1,000 and 4,000 square feet and span the entire range of selling prices.
END OF GRAPH DESCRIPTION
Due to this uncertainty, we would rarely use only a point forecast in practice. The point
forecast is a good place to start.
But to make sound managerial decisions, we must try to capture to the best of our ability
the forecast uncertainty.
Suppose we want to forecast the price at which a 2,000 square foot house would sell.
Rather than predicting just a single point, we construct an interval or range around the
point forecast.
We construct this range so it's very likely that the selling price of a 2,000 square foot
house would fall within that range.
The standard error of the regression, in this case, about $151,000, is a reasonable but
conservative estimate of the forecast standard deviation.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8557
R Square 0.7356
Observations 30
We are able to say that we are 95% confident that the actual selling price will fall within
the prediction interval.
As with confidence intervals, the higher the confidence level we select, the wider our prediction interval will be.
Since there is greater uncertainty when we forecast further from the mean of the
independent variable, we can infer that the prediction interval should be wider as we
move away from the average house size.
So although the standard error is a reasonable estimate on which to base our range, the
actual calculation is more complicated.
As we move towards, and then beyond, the edges of the historical data, the width of the
distribution around the point forecast increases.
In this case, a 95% prediction interval for the selling price of a 7,000 square foot home
would be much wider than that for a 2,000 square foot home.
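The widening of the interval can be sketched with the standard prediction interval formula from statistics. The data here are hypothetical, and the 1.96 critical value is a large-sample approximation; Excel-based analyses of small samples would use the t distribution:

```python
import math

# Sketch of a 95% prediction interval around a point forecast.
# Hypothetical data in $ thousands, not the course data set.
xs = [1000, 1500, 2000, 2500, 3000]   # house sizes (sq ft)
ys = [250, 320, 410, 470, 560]        # selling prices ($ thousands)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)

b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
a = mean_y - b * mean_x

# Standard error of the regression: dispersion of points around the line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

def halfwidth_95(x0):
    # The forecast standard deviation grows as x0 moves away from mean_x,
    # so the prediction interval widens near and beyond the data's edges.
    se_forecast = s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return 1.96 * se_forecast

print(halfwidth_95(2000) < halfwidth_95(7000))  # True: wider far from the mean
```

Note that the extra terms under the square root are why the standard error of the regression alone is only a conservative first approximation of the forecast standard deviation.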
BEGIN GRAPH DESCRIPTION
As the distance between the lines above and below the regression line increases as
house size values increase, the length of the vertical line representing the prediction
interval at 7000 square feet is much longer than the length at 2000 square feet.
END OF GRAPH DESCRIPTION
4.3 Summary
Lesson Summary
● Forecasting in Excel
● =SUMPRODUCT(array1, [array2], [array3],…) is a
convenient function for calculating point forecasts.
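For illustration, here is a rough Python equivalent of the =SUMPRODUCT point forecast, using hypothetical coefficient values:

```python
# Excel's =SUMPRODUCT multiplies paired entries and sums the results.
# Here it computes the point forecast ŷ = a·1 + b·x.
coefficients = [20000.0, 205.0]   # [intercept, slope], hypothetical values
inputs = [1, 1200]                # [1, house size in sq ft]

forecast = sum(c * v for c, v in zip(coefficients, inputs))
print(forecast)  # → 266000.0
```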
Now that we have learned how to determine a regression line and use it
to forecast the dependent variable, we turn our attention to how to
evaluate the “fit” of that line. Even when the linear relationship between
two variables is not very strong, there is still a best fit regression line
associated with that relationship—it just won’t fit the data very well or be
particularly useful. It is helpful to measure how well a regression line fits
the historical data so that we can determine how useful a regression
model might be for forecasting and explaining the relationship between
the dependent and independent variables.
Let's think back to when we tried to draw our best estimate of the regression line. Conceptually, we wanted to find the line that would minimize the dispersion of the data points above and below that line. Now let's formalize this process a bit more, and clarify how we measure dispersion around a line.

To quantify how accurately a line fits a data set, we first measure the vertical distance between each data point and the line. We measure vertical distance, rather than perpendicular distance, because we're interested in how well the line predicts the value of the dependent variable. And the dependent variable, in this case selling price, is measured on the vertical axis. We want to know, for a given house size, how close the price predicted by the line is to the historically observed price for a home of that size.

We call the vertical distance between a data point and the line the residual error. This error is the difference between the observed value and the line's prediction for the dependent variable. This difference may be due to other factors that influence selling price, or to chance alone. Collectively, the residuals for all the data points measure how accurately a line fits a data set.

To quantify the total size of the errors, we can't just sum the vertical distances. If we did, positive and negative distances would cancel each other out. Instead, we take the square of each distance, and then add all of those squared terms together. This measure, called the sum of squared errors or the residual sum of squares, gives us a good measure of how accurately a line fits a data set. A regression line is formally defined as the line that minimizes the sum of squared errors.
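The cancellation argument can be checked directly. Using a small hypothetical data set and its least squares line:

```python
# Why we square the residuals: around the least squares line, positive and
# negative residuals cancel, so only the squared errors measure dispersion.
sizes = [1000, 1500, 2000, 2500, 3000]   # hypothetical house sizes (sq ft)
prices = [250, 320, 410, 470, 560]       # hypothetical prices ($ thousands)
a, b = 94.0, 0.154                        # least squares fit for this toy data

residuals = [y - (a + b * x) for x, y in zip(sizes, prices)]
print(sum(residuals))                  # essentially zero: the errors cancel
print(sum(e ** 2 for e in residuals))  # sum of squared errors: about 190
```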
A critical question we ask when we use regression is how much the regression adds to our understanding of the dependent variable. In this case, we would like to know how much our knowledge of the relationship between house size and selling price helps us understand and predict house selling prices. Specifically, we want to determine how much more we know about selling prices if we have data about house size than if we do not.

To determine how much more information we gain from the house size data, we need a benchmark telling us how much we would know about the behavior of prices if we did not have the house size data, that is, if we only had the price data to work with. Using the price data alone, the best predictor of a future selling price would simply be the average of the previous prices. Thus, we use mean price as our benchmark and draw a mean price line through the data.

We already have a measure of how accurately an individual line fits a data set: the sum of squared errors. To find out how much additional value the regression model gives us, we'll compare the accuracy of the regression line with that of the mean price line. Specifically, we'll calculate the sum of squared errors for each of the two lines, and see how much smaller the error is around the regression line than around the mean line.

We've just learned that the sum of squared errors of the regression line is called the residual sum of squares. It's useful to think of this as the variation left unexplained by the regression model. We can also calculate the sum of squared errors for the mean price line. This represents the total variation in the price data, so we call it the total sum of squares.

To determine how much more accurate the regression line is than the mean line, we subtract the residual sum of squares, in this case about 636 billion, from the total sum of squares, in this case about 2.4 trillion. The difference, about 1.77 trillion, is called the regression sum of squares. We can think of the regression sum of squares as measuring the variation in price that's explained by the regression model. In this case, since the regression sum of squares is a large fraction of the total variation, we know that the regression line helps us predict price much more accurately than the price data alone would.
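The decomposition just described can be checked with the approximate figures quoted above:

```python
# Sums-of-squares decomposition, using the approximate totals from the text:
# total SS ≈ 2.4 trillion, residual SS ≈ 636 billion.
total_ss = 2.4e12        # squared error around the mean price line
residual_ss = 636e9      # squared error around the regression line

regression_ss = total_ss - residual_ss   # variation explained by regression
r_squared = regression_ss / total_ss     # fraction of variation explained

print(regression_ss)        # ≈ 1.76 trillion (rounded inputs; text says 1.77)
print(round(r_squared, 3))  # 0.735, close to the R Square of 0.7356 reported
```

The ratio of the regression sum of squares to the total sum of squares is exactly the R squared statistic that Excel reports in the Regression Statistics table.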
It can be misleading to use only R squared to assess whether a linear regression model
is appropriate.
R squared measures how much variation is explained by the regression line, but it does
not reveal exactly how the variables are related.
The regression line is the line that best fits the observed data, but we need a way to test
whether the linear relationship is significant.
If it is not significant, then the true regression line is just the mean line, which has a
slope of 0.
Thus, if we can show that the slope of the true regression line is not zero, we can be
confident that there is a significant linear relationship.
The null hypothesis is that the slope of the true regression line is zero, and the alternative hypothesis is that the true slope is not zero.
As usual, we look at a p-value, in this case, the p-value of the independent variable, to
determine whether or not we can reject the null hypothesis.
Later we'll look at how residual plots can help us determine whether the linear model is
the best fit for the data.
Even though a linear model may explain some of the variation, the true relationship may
be best described by some type of curve, for example.
There are two ways to test whether the slope of the best fit line equals zero.
1. Check whether the confidence interval for the line's slope contains zero
Remember that the coefficients of the regression line are just estimates of the true linear
relationship between the dependent and independent variables. A coefficient’s lower 95%
and upper 95% values give us the lower and upper bounds of the 95% confidence interval
for that coefficient. Recall that if the best fit regression line has a slope of zero, then the
regression line is just a flat line equal to the mean of the dependent variable, indicating
that there is no linear relationship between the two variables. Thus, if the 95% confidence
interval for the slope does not include zero, we can be 95% confident that the true value of
the slope is not zero and thus that a significant relationship exists between the variables.
Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
In this example, we can say we are 95% confident that the true slope of the regression line
describing the relationship between selling price and house size is between 196.10 and
314.63. Because this range does not include the value zero, we can be 95% confident that
there is a significant linear relationship between the variables.
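This decision rule is simple enough to express directly; the bounds are the ones quoted above:

```python
# Significance check via the 95% confidence interval for the slope:
# if the interval excludes zero, the linear relationship is significant
# at the 5% level. Bounds are the values quoted in the text.
lower_95, upper_95 = 196.10, 314.63

significant = not (lower_95 <= 0 <= upper_95)
print(significant)  # → True: the interval excludes zero
```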
2. Check whether the p-value of the slope coefficient is less than 0.05

We test the null hypothesis H0: β = 0 against the alternative hypothesis Ha: β ≠ 0. Recall that the p-value for a hypothesis test is the likelihood that we would select a sample
at least as extreme as the one we observed if the null hypothesis were true. The p-value
associated with a regression coefficient is the likelihood of choosing a sample at least as
extreme as the sample we used to derive the regression equation if the slope of the true
regression line is actually zero, or equivalently, if there is no linear relationship between the
two variables. Since the p-value for house size, 0.0000, is less than 0.05, we reject the null
hypothesis that the slope is zero and can be confident that there is a significant linear
relationship between selling price and house size. (We can ignore the p-value of the
intercept coefficient because the y-intercept is just a constant. It does not represent an
independent variable and thus provides no information about the significance of the
relationship between two variables.)
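The p-value rule works the same way as the interval check; the p-value is the one reported for house size:

```python
# Significance check via the p-value of the slope coefficient.
p_value = 0.0000   # p-value for house size, as reported in the output
alpha = 0.05       # significance level

reject_null = p_value < alpha
print(reject_null)  # → True: reject H0 that the true slope is zero
```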
4.4 Summary
Lesson Summary
We test the null hypothesis H0: β = 0 against the alternative hypothesis Ha: β ≠ 0. (In regression analysis, the hypothesis test's p-value is calculated by assuming the two samples have equal variances.) In this case, since SAT is a dummy variable, this is equivalent to a hypothesis test with the following null and alternative hypotheses:
● H0: The selling price of homes in neighborhoods where the average SAT score is at or above 1700 = the selling price of homes in neighborhoods where the average SAT score is below 1700.
● Ha: The selling price of homes in neighborhoods where the average SAT score is at or above 1700 ≠ the selling price of homes in neighborhoods where the average SAT score is below 1700.
The regression analysis gives us more information than the hypothesis test alone would.
Rather than simply calculating the p-value, rejecting the null hypothesis and concluding
that there is a significant linear relationship, the regression results provide the direction
and magnitude of this relationship.
4.5 Summary
Lesson Summary
● The regression output table is divided into three main parts: the
Regression Statistics table, the ANOVA table, and the Regression
Coefficients table. It is important to be able to identify the most
useful measures and interpret them correctly.
● To study the effects of qualitative variables, we use a type of
variable called a dummy variable. A dummy variable takes on one
of two values, 0 or 1, to indicate which of two categories a data
point falls into.
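A dummy variable is easy to construct from raw data; for example, with hypothetical neighborhood SAT averages:

```python
# Encoding a dummy variable: 1 if the neighborhood's average SAT score
# is at or above 1700, 0 otherwise. The scores below are hypothetical.
avg_sat_scores = [1650, 1720, 1580, 1800]

sat_dummy = [1 if score >= 1700 else 0 for score in avg_sat_scores]
print(sat_dummy)  # → [0, 1, 0, 1]
```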
Excel Summary
So the way that we forecast home video units is fundamentally based on a regression of home video unit sales on gross box office. In our case, the independent variable, gross box office, is our estimate. It depends on what point in time we're doing the analysis. If it's before the title's theatrical release, it's our best estimate of what the gross box office will be for the title's run, which typically lasts three to four months, by which point the final gross will land. And then what we'll produce from the regression will be an estimate of 52-week unit sales at retail.
Earlier in this module, we briefly analyzed the relationship between 2011 gross box
office and home video units data. Now let’s look at the 2012 data.
A scatter plot depicting 2012 home video units versus gross box office. The x-axis is
labeled gross box office in millions of dollars and ranges from 0 to 300 in increments
of 50. The y-axis is labeled home video units in thousands and ranges from 0 to
8,000 in increments of 1,000. The plotted points are loosely clustered in an
upward-sloping pattern from left to right. Most of the points on the graph are in the 0
to 50 million dollar range, which corresponds to 0 to 1,000 home video units. The
next-largest range of points is between 50 and 100 million dollars, which
corresponds to 500 to 2,500 home video units. A few scattered points are in the 100
to 250 million dollar range, which corresponds to 1,000 to 6,900 home video units.
The rightmost point is located at ($290 million, 7,100 home video units).