Business Analytics


In the next two modules, we'll learn about regression analysis, one of the most powerful and commonly used statistical tools. Regression analysis examines relationships among variables. Linear regression investigates linear relationships among variables. In this module, we'll learn about single variable linear regression, which seeks to identify a linear relationship between two variables. In the next module, we'll learn how to use multivariate, or multiple, linear regression to examine relationships among multiple variables.

Single variable linear regression can be seen as an extension of hypothesis testing. We have learned to use hypothesis tests to determine whether or not there is a significant relationship between two variables. Regression also tests whether there is a relationship, but allows us to gain insights into the structure of that relationship and provides measures of how well the data fit that relationship. Such insights can prove extremely valuable for analyzing historical trends and developing forecasts.

Let's take a look at how Disney Studios uses regression analysis to predict DVD and Blu-ray sales for Disney's new movie releases.

The film business is so dynamic, and it's been going through a lot of change. What a lot of people don't realize is that there are several windows in the film distribution landscape, and that when a studio like Disney makes a movie, not much of the cost of making that movie is recouped in the theatrical window. The theatrical window tends to be a very marketing-heavy window. A lot of money is spent on promoting that theatrical release, and the success of that window fuels and sets the stage for all of the downstream markets, like home entertainment and television, where a studio does then hopefully recoup and improve on its investment.

It is really important to us to accurately forecast our physical home entertainment sales for a multitude of reasons. First of all, it's important early on, because when the studio is making a decision about whether to greenlight and make a movie, home entertainment is actually a really important part of a movie's return on investment. So it's really important that we have a fairly accurate forecast upfront, so that the studio can decide how much money it stands to make on a given movie, and whether it wants to move ahead. Movies are very expensive. Does the studio want to put its money into the production of that movie? Is it going to have a good return on investment?

Then after the movie opens theatrically, we'll continue to revise our forecast as we get more information on how many people went to see the movie theatrically and how it was received. We'll continue to revise the home entertainment forecast, and then it comes into play for different reasons, such as deciding how much product to place in stores. Way back when, you placed just one product: a VHS tape. Then it was a DVD. Now you have DVD, Blu-ray, and digital files that you're selling, and lots of different stores are selling different versions. So you have to try to optimize and figure out the correct amount of each of the different SKUs to ship so that you provide consumers what they want and, at the same time, maximize profitability.

So, having acknowledged that an accurate forecast is very important to the financial health of the physical home entertainment business, the problem that we're really trying to solve is how we can best use the window that went before us, the theatrical window: to look at a variable like box office, and use it to accurately predict what our physical home entertainment sales will be.

4.2.1 Visualizing the Relationship

Before we help Disney develop a forecast of its home video units, let’s
build a basic understanding of regression analysis by taking a careful
look at some data on recent residential real estate transactions in the
Boston area. Suppose you are interested in purchasing a single-family
home near Boston and would like to understand the relationship between
selling price and the size of a house. You are confident that larger homes
tend to cost more, but you want to gain a deeper understanding about the
structure of the relationship between selling price and house size.

To help build our understanding of such relationships, we have gathered
data on 30 homes that were sold in the greater Boston area during the
summer of 2013. Before we turn to regression analysis, let’s explore the
data to get a better sense of what the relationship between a house’s
selling price and its size might look like. As we learned earlier in the
course, scatter plots are an excellent tool for visualizing the relationship
between two variables, so let’s create one for the housing data.

4.2.2 The Best Fit Line

As we have seen, displaying data graphically can help us recognize
general trends and relationships. On the graph below, draw the line that
you think best fits the relationship between selling price and house size.
The best fit line, or linear regression line, is the line that best describes
the linear relationship between two variables. As we will see, the line
identifies the expected y-value for each x-value.
Right now we are being a bit imprecise about what it means for a line to
best “fit” the data, so for the time being use your own judgment to find a
line that you think fits best – that is, which line would reduce the “total
distance” between the data points and the line. Shortly, we’ll introduce a
set of concepts and a metric that will help us measure how well a line fits
the data. This metric will be the basis for finding the best fit line.


We'll use Excel's regression tools to actually identify the best fit line through a dataset. But to really understand regression, it's important to understand how the regression line is determined. Let's take another look at the housing data. Clearly we can't draw a single straight line through every point in the data set. This shouldn't surprise us, because house size alone is by no means a perfect predictor of a home's selling price. There are many other factors that influence a home's value.

The regression line is the linear relationship that best fits the data, but it won't pass through every point. Remember when you tried to find the best fit line for the housing data? You probably tried to draw a line that would touch or get close to as many points as possible. This is essentially what Excel does. Broadly speaking, the regression line is the line that minimizes the dispersion of points around that line, and we measure the accuracy of the regression line by measuring that dispersion. We attribute the difference between the actual data points and the values predicted by the regression line either to relationships between selling price and variables other than house size or to chance alone.

4.2.3 The Structure of the Regression Line

Now that we have some understanding of how to find a best fit line, let's
look at the structure of the equation of this line. In general, a single
variable regression line can be described by the equation

ŷ = a + bx

where a is the y-intercept of the line and b is the slope.
ŷ (Dependent Variable): The expected value of y, the value we are trying to predict. (ŷ is pronounced "y-hat".)

x (Independent Variable): The variable we are using to help us predict the dependent variable.

a (y-intercept): The point at which the regression line intersects the vertical axis. In other words, it is the value of ŷ when x = 0.

b (Slope): The average change in the dependent variable as the independent variable increases by one.


As we learned earlier in the course, we typically use Greek letters (like α and β) to refer to the "true" parameters associated with a population and Latin letters (like a and b) to refer to the estimates of those parameters we calculate from sample data. Similarly, we refer to the best fit line we obtain from our sample data as ŷ = a + bx to distinguish it from ŷ = α + βx, the idealized equation that represents the "true" best fit line. Because the best fit line does not perfectly fit even the population data, we add an error term, ε, to the true equation:

y = α + βx + ε

The error term is the difference between the actual value of y and the expected value of y, ŷ. That is,

ε = y − ŷ
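To make the estimates a and b and the residuals concrete, here is a minimal sketch in Python of the least-squares calculations; the five observations are taken from the Boston housing data used in this module, and the variable names are our own.

```python
# Minimal sketch: least-squares estimates a and b for y_hat = a + b*x,
# and the residual e = y - y_hat for each observation. The five points
# are from the Boston housing data used in this module.
sizes = [600, 1194, 1309, 886, 1744]                 # x: house size (sqft)
prices = [211000, 183000, 365000, 380000, 860000]    # y: selling price ($)

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n

# Slope: b = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices)) / \
    sum((x - x_bar) ** 2 for x in sizes)
# Intercept: the least-squares line always passes through (x_bar, y_bar)
a = y_bar - b * x_bar

# Residuals: by construction they sum to zero for the least-squares line
residuals = [y - (a + b * x) for x, y in zip(sizes, prices)]
```

Note that the residuals here play the role of the error term ε: they are the sample counterparts of the differences between actual and expected values.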

4.2 Summary

Lesson Summary

● Single Variable Linear Regression analysis is used to identify the
best fit line between two variables. This analysis builds on two
previous concepts we have used to study relationships between
two variables:
  ● Scatter plots, which are useful for visualizing a relationship
  between two variables.
  ● The correlation coefficient, a value between -1 and 1 that
  measures the strength and direction (positive or negative)
  of the linear relationship between two variables.
● We use regression analysis for two primary purposes:
  ● Studying the magnitude and structure of the relationship
  between two variables.
  ● Forecasting a variable based on its relationship with
  another variable.
● The structure of the single variable linear regression line is ŷ = a + bx.
  ● ŷ is the expected value of y, the dependent variable, for a
  given value of x.
  ● x is the independent variable, the variable we are using to
  help us predict or better understand the dependent
  variable.
  ● a is the y-intercept, the point at which the regression line
  intersects the vertical axis. This is the value of ŷ when the
  independent variable, x, is set equal to 0.
  ● b is the slope, the average change in the dependent variable
  y as the independent variable x increases by one.
● The true relationship between two variables is described by
the equation y = α + βx + ε, where ε is the error term
(ε = y − ŷ). The idealized equation that describes the true
regression line is ŷ = α + βx.

Excel Summary

● Adding the best fit line to a scatter plot



4.3.1 Point Forecasts

Once we have found the regression equation for a given data set, we can
use that equation to obtain a point forecast, in this case, the predicted
selling price for a given house size. For example, we may want to predict
the price of a house on the basis of its size. How much can we expect to
pay for a 1,200 square foot home?

Suppose for a moment that we did not know anything about the
relationship between selling price and house size, that is, suppose we
had only the historical data. In that case, we might simply note that when
a house of that size sold recently, it sold for approximately $266,000. And
so we might predict that we would pay around the same amount for any
1,200 square foot house.

City            House Size (Sqft)   Selling Price ($)
Mansfield       600                 $211,000
Randolph        1,194               $183,000
North Reading   1,309               $365,000
Peabody         886                 $380,000
Belmont         1,744               $860,000
Natick          4,184               $1,070,000
Arlington       4,688               $1,280,500
Ashland         1,388               $358,000
Framingham      1,528               $417,000
Hingham         1,888               $665,000
Wakefield       630                 $210,000
Burlington      2,243               $540,000
Milton          2,202               $447,000
Framingham      1,200               $266,000
Ipswich         1,123               $299,000
Melrose         1,455               $445,000
Raynham         2,216               $365,000
Dracut          1,008               $189,900
Cambridge       1,025               $425,000
Weston          3,391               $1,130,000
Milton          920                 $415,000
Acton           1,878               $470,000
Boxford         1,292               $314,000
Dedham          2,804               $724,500
Groveland       2,204               $420,000
Norwood         864                 $288,000
Framingham      1,332               $150,000
Westford        1,750               $305,000
Holliston       1,458               $407,000
North Reading   1,973               $180,000

A single historical data point does not yield the best forecast. Indeed, a
historical data point may not even exist for the house size we are
interested in. Even if it does, the price of that house provides information
on only a single house—it doesn’t reflect information about the other
houses’ sizes and prices. In contrast, regression analysis brings the
power of the entire data set to our prediction. In general, regression
allows us to generate far more accurate predictions than we could make
by inferring a future price from a single data point. Once we have
identified the linear relationship between the two variables, we can use
that regression equation to forecast.

The interactive below shows the impact of house size on expected
(average) selling price. Choose a value for x between 0 and 7,000 square
feet. Note that some of these values fall outside the range of historical
data, so we must exercise caution. Specifically, we should look at the
range and dispersion of the historical values of the independent variable
(x-values). Since we have no information about houses outside the
historical range, there is greater uncertainty when predicting selling
price for such homes.

Another quick way to forecast is to use Excel’s FORECAST function:

=FORECAST(x, known_y’s, known_x’s)

● x is the data point for which you want to predict a value.


● known_y’s is the dependent array or range of data.
● known_x’s is the independent array or range of data.

In order to use this function we must have the original data. This approach also gives us a
point forecast, but does not provide other helpful values that Excel’s regression tool
produces.
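For readers working outside Excel, the same point forecast can be sketched in a few lines of Python. This is a hand-rolled analogue of FORECAST's least-squares calculation, not Excel's actual implementation; the function name is ours.

```python
# A Python analogue of Excel's =FORECAST(x, known_y's, known_x's) for
# single variable linear regression (a sketch, not Excel's actual code).
def forecast(x_new, known_ys, known_xs):
    n = len(known_xs)
    x_bar = sum(known_xs) / n
    y_bar = sum(known_ys) / n
    # Least-squares slope and intercept
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(known_xs, known_ys))
         / sum((x - x_bar) ** 2 for x in known_xs))
    a = y_bar - b * x_bar
    return a + b * x_new

# On perfectly linear data, the forecast lies exactly on the line:
# forecast(5, [2, 4, 6], [1, 2, 3]) returns 10.0
```

As with Excel's FORECAST, this requires the original data and yields only a point forecast, without the other diagnostic values a full regression output provides.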

The predictions you just made were point forecasts, or single values, each representing
the expected selling price for a home of a given size.
But there is often a great deal of uncertainty when we use a regression model to
forecast, in part because the regression line does not perfectly fit the data, and in part
because the regression line itself is only an estimate of the true best fit line.

BEGIN GRAPH DESCRIPTION


A graph titled SELLING PRICE versus HOUSE SIZE.
The x-axis is House Size in square feet, running from 0 to 7,000 in increments of 1,000.
The y-axis is Selling Price in dollars, rising from 0 to $1,800,000 in increments of
$200,000.
The data points mostly cluster in the space encompassing points (500, $200,000) and
(2500, $600,000), though there are four data points higher in square feet and dollars.
The regression line is drawn rising diagonally upward from about (500, $200,000) to
(7000, $1,800,000).
END OF GRAPH DESCRIPTION

In addition, our forecast uncertainty increases as we near the edges or go outside of the
historical range of our data.
BEGIN GRAPH DESCRIPTION
The rectangular area labeled Range of Historical Data contains most data points, which
are contained within 1000 and 4000 square feet and within the entire range of the Selling
Prices.
END OF GRAPH DESCRIPTION

Due to this uncertainty, we would rarely use only a point forecast in practice. The point
forecast is a good place to start, but to make sound managerial decisions, we must try to
capture the forecast uncertainty to the best of our ability.

Suppose we want to forecast the price at which a 2,000 square foot house would sell.
Rather than predicting just a single point, we construct an interval or range around the
point forecast.
We construct this range so it's very likely that the selling price of a 2,000 square foot
house would fall within that range.

BEGIN GRAPH DESCRIPTION


A vertical line rises from (2000, $200,000) to (2000, $850,000). The actual data point is
at (2000, $200,000) and the regression line corresponding to this data point is at about
(2000, $525,000).
END OF GRAPH DESCRIPTION

Conceptually, this is similar to constructing a confidence interval around the predicted


value.
Due to the assumptions underlying linear regression models, we know that the distribution
of possible outcomes around the point forecast is approximately normal.
The center of the prediction interval is the point forecast, in this case, about $525,000.

BEGIN GRAPH DESCRIPTION


The perspective of the graph changes. The y-axis and x-axis now appear as a plane,
and a third dimension is expressed upward from this plane.
A bell curve appears over the line at 2000 square feet.
The bell curve is centered at $525,000. This amount is calculated from the point forecast:
13,490.45 + 255.36 × 2,000 ≈ $525,000.
END OF GRAPH DESCRIPTION

The standard error of the regression, in this case, about $151,000, is a reasonable but
conservative estimate of the forecast standard deviation.

BEGIN GRAPH DESCRIPTION


The value $151,000 extends to the right of the $525,000 that is at the middle of the bell
curve.
$151,000 is shown as one standard deviation from $525,000.
END OF GRAPH DESCRIPTION
The standard error of the regression is easily found in a regression output table.

BEGIN TABLE DESCRIPTION


The Standard Error with value 150,684.89 is highlighted from the following Summary
Output table.

SUMMARY OUTPUT

Dependent Variable: Selling Price ($)

Regression Statistics

Multiple R 0.8557

R Square 0.7356

Adjusted R Square 0.7262

Standard Error 150,684.89

Observations 30

END OF TABLE DESCRIPTION

As we did when constructing a confidence interval, we have to choose a level of


confidence for our prediction interval.
A 95% prediction interval would run about two standard deviations above and below the
point forecast.
To forecast the price of a 2,000 square foot home, the 95% prediction interval would be
about $525,000, plus or minus 2 times $151,000.

BEGIN EQUATION DESCRIPTION


The 95% Prediction Interval is approximately equal to the Point Forecast plus or minus 2
times the Standard Error.
Therefore approximately equal to $525,000 plus or minus 2 times $151,000.
Therefore approximately equal to $223,000 and $827,000.
END OF EQUATION DESCRIPTION
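The rough interval above can be computed directly from the coefficients and standard error in the regression output. A sketch in Python (the lesson's figures round the point forecast to $525,000 before adding and subtracting):

```python
# Rough 95% prediction interval: point forecast +/- 2 x standard error of
# the regression. Coefficients and standard error come from the summary
# output shown in this lesson.
intercept = 13490.45
slope = 255.36
std_error = 150684.89      # "Standard Error" from the regression output

x = 2000                                  # house size in square feet
point_forecast = intercept + slope * x    # about 524,210, i.e. ~$525,000
lower = point_forecast - 2 * std_error    # about $223,000
upper = point_forecast + 2 * std_error    # about $826,000 (~$827,000 after rounding)
```

Remember that this two-standard-error rule is a reasonable but conservative approximation; as the lesson notes next, the exact prediction interval widens as we move away from the mean of the independent variable.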

We are able to say that we are 95% confident that the actual selling price will fall within
the prediction interval.

BEGIN GRAPH DESCRIPTION


The vertical line at 2000 square feet is highlighted again.
The bottom of the line is labeled $223,000.
The middle of the line, which rests on the regression line, is labeled $525,000.
The top of the line is labeled $827,000.
END OF GRAPH DESCRIPTION

As with confidence intervals, the higher the confidence level we select, the wider our
prediction interval will be.

BEGIN GRAPH DESCRIPTION


The vertical line at 2000 square feet is labeled 95% at its current length.
The line then extends at both top and bottom ends equally, and as it extends, the value
of the label increments until it reaches 99%.
END OF GRAPH DESCRIPTION

Since there is greater uncertainty when we forecast further from the mean of the
independent variable, we can infer that the prediction interval should be wider as we
move away from the average house size.

BEGIN GRAPH DESCRIPTION


Two lines almost parallel to the regression line appear, one above and one below it, but
begin deviating away from it as the house size increases.
END OF GRAPH DESCRIPTION

So although the standard error is a reasonable estimate on which to base our range, the
actual calculation is more complicated.
As we move towards, and then beyond, the edges of the historical data, the width of the
distribution around the point forecast increases.
In this case, a 95% prediction interval for the selling price of a 7,000 square foot home
would be much wider than that for a 2,000 square foot home.
BEGIN GRAPH DESCRIPTION
As the distance between the lines above and below the regression line increases as
house size values increase, the length of the vertical line representing the prediction
interval at 7000 square feet is much longer than the length at 2000 square feet.
END OF GRAPH DESCRIPTION

4.3 Summary

Lesson Summary

● We use regression analysis to forecast the dependent variable, y,
within the historically observed range of the independent
variable, x.
  ● We determine a point forecast by entering the desired
  value of x into the regression equation.
  ● We must be extremely cautious about using regression to
  forecast for values outside of the historically observed
  range of the independent variable (x-values).
● Instead of predicting a single point, we can construct a prediction
interval, an interval around the point forecast that is likely to
contain, for example, the actual selling price of a house of a given
size.
● The width of a prediction interval varies based on the
standard deviation of the regression (the standard error of
the regression), the desired level of confidence, and the
location of the x-value of interest in relation to the historical
values of the independent variable.
● As the confidence level increases, the width of the
prediction interval increases.
● As we move to the edge of, and beyond, the range
of historical data, the width of the prediction interval
increases.
Excel Summary

● Forecasting in Excel
● =SUMPRODUCT(array1, [array2], [array3],…) is a
convenient function for calculating point forecasts.

4.4.1 Quantifying Predictive Power

Now that we have learned how to determine a regression line and use it
to forecast the dependent variable, we turn our attention to how to
evaluate the “fit” of that line. Even when the linear relationship between
two variables is not very strong, there is still a best fit regression line
associated with that relationship—it just won’t fit the data very well or be
particularly useful. It is helpful to measure how well a regression line fits
the historical data so that we can determine how useful a regression
model might be for forecasting and explaining the relationship between
the dependent and independent variables.


Let's think back to when we tried to draw our best estimate of the regression line. Conceptually, we wanted to find the line that would minimize the dispersion of the data points above and below that line. Now let's formalize this process a bit more, and clarify how we measure dispersion around a line.

To quantify how accurately a line fits a data set, we first measure the vertical distance between each data point and the line. We measure vertical distance, rather than perpendicular distance, because we're interested in how well the line predicts the value of the dependent variable. And the dependent variable, in this case selling price, is measured on the vertical axis. We want to know, for a given house size, how close the price predicted by the line is to the historically observed price for a home of that size.

We call the vertical distance between a data point and the line the residual error. This error is the difference between the observed value and the line's prediction for the dependent variable. The difference may be due to other factors that influence selling price, or to chance alone. Collectively, the residuals for all the data points measure how accurately a line fits a data set.

To quantify the total size of the errors, we can't just sum the vertical distances; if we did, positive and negative distances would cancel each other out. Instead we take the square of each distance, and then add all of those squared terms together. This measure, called the sum of squared errors or the residual sum of squares, gives us a good measure of how accurately a line fits a data set. A regression line is formally defined as the line that minimizes the sum of squared errors.

A critical question we ask when we use regression is how much the regression adds to our understanding of the dependent variable. In this case, we would like to know how much our knowledge of the relationship between house size and selling price helps us understand and predict house selling prices. Specifically, we want to determine how much more we know about selling prices if we have data about house size than if we do not.

To determine how much more information we gain from the house size data, we need a benchmark telling us how much we would know about the behavior of prices if we did not have the house size data, that is, if we only had the price data to work with. Using the price data alone, the best predictor of a future selling price would simply be the average of the previous prices. Thus, we use mean price as our benchmark and draw a mean price line through the data.

We already have a measure of how accurately an individual line fits a data set: the sum of squared errors. To find out how much additional value the regression model gives us, we'll compare the accuracy of the regression line with that of the mean price line. Specifically, we'll calculate the sum of squared errors for each of the two lines, and see how much smaller the error is around the regression line than around the mean line.

We've just learned that the sum of squared errors of the regression line is called the residual sum of squares. It's useful to think of this as the variation left unexplained by the regression model. We can also calculate the sum of squared errors for the mean price line. This represents the total variation in the price data, so we call it the total sum of squares.

To determine how much more accurate the regression line is than the mean line, we subtract the residual sum of squares, in this case about 636 billion, from the total sum of squares, in this case about 2.4 trillion. The difference, about 1.77 trillion, is called the regression sum of squares. We can think of the regression sum of squares as measuring the variation in price that's explained by the regression model. In this case, since the regression sum of squares is a large fraction of the total variation, we know that the regression line helps us predict price much more accurately than price alone would.
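The three sums of squares described above can be sketched in a few lines of Python; the function name and the toy values below are illustrative, not taken from the course workbook.

```python
# Sketch: residual, total, and regression sums of squares for a fitted
# line y_hat = a + b*x over a data set.
def sums_of_squares(xs, ys, a, b):
    y_bar = sum(ys) / len(ys)
    # Residual sum of squares: variation left unexplained by the line
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Total sum of squares: variation around the mean line
    sst = sum((y - y_bar) ** 2 for y in ys)
    # Regression sum of squares: variation explained by the line
    ssr = sst - sse
    return sse, sst, ssr

# R squared, discussed next, is the explained fraction: ssr / sst
```

For a perfectly fitting line the residual sum of squares is zero and the regression sum of squares equals the total; for the mean line itself the regression sum of squares is zero.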

It can be misleading to use only R squared to assess whether a linear regression model
is appropriate.
R squared measures how much variation is explained by the regression line, but it does
not reveal exactly how the variables are related.

BEGIN GRAPH DESCRIPTION


A graph labeled R squared approximately equal to 0.7 has the Independent Variable as
the x-axis and the Dependent Variable as the y-axis.
Three scatter plots appear on the graph, each with different point distributions and each
with a regression line of differing length and slope.
END OF GRAPH DESCRIPTION

Thus, we need to look beyond R squared for further insight.


Specifically, we should examine two other important metrics, the p-value of the
independent variable, and a graph known as the residual plot.

The regression line is the line that best fits the observed data, but we need a way to test
whether the linear relationship is significant.
If it is not significant, then the true regression line is just the mean line, which has a
slope of 0.

BEGIN GRAPH DESCRIPTION


A graph labeled Selling Price versus House Size has House Size in Square Feet as the
x-axis and Selling Price in dollars as the y-axis.
A scatter plot shows most points clustered around the bottom left, with a regression line
rising diagonally to the upper right.
Additional points are added to the scatter plot, and the regression line changes to a
horizontal line.
END OF GRAPH DESCRIPTION

Thus, if we can show that the slope of the true regression line is not zero, we can be
confident that there is a significant linear relationship.

We can test this by performing a hypothesis test.


The null hypothesis is that the true slope of the regression line is zero,

BEGIN FORMULA DESCRIPTION
H₀: True Slope = β = 0
END OF FORMULA DESCRIPTION

and the alternative hypothesis is that the true slope is not zero.

BEGIN FORMULA DESCRIPTION
Hₐ: True Slope = β ≠ 0
END OF FORMULA DESCRIPTION

As usual, we look at a p-value, in this case, the p-value of the independent variable, to
determine whether or not we can reject the null hypothesis.

BEGIN TABLE DESCRIPTION

The P-value 0.0000 is highlighted in the following table of summary statistics.

                    Coefficients   Standard Error   t Stat   P-value   Lower 95%     Upper 95%
Intercept           13,490.45      57,518.92        0.23     0.8163    -104,331.72   131,312.62
House Size (Sqft)   255.36         28.93            8.83     0.0000    196.10        314.63

END OF TABLE DESCRIPTION

Later we'll look at how residual plots can help us determine whether the linear model is
the best fit for the data.
Even though a linear model may explain some of the variation, the true relationship may
be best described by some type of curve, for example.

BEGIN GRAPH DESCRIPTIONS


Two scatter plots, one with a u-shaped point distribution and one with an s-shaped point
distribution, have their diagonal ascending lines replaced by curved lines that more
closely approximate the respective scatter plots' points distribution.
END OF GRAPH DESCRIPTIONS

There are two ways to test whether the slope of the best fit line equals zero.

1. Check whether the confidence interval for the line's slope contains zero

Remember that the coefficients of the regression line are just estimates of the true linear
relationship between the dependent and independent variables. A coefficient’s lower 95%
and upper 95% values give us the lower and upper bounds of the 95% confidence interval
for that coefficient. Recall that if the best fit regression line has a slope of zero, then the
regression line is just a flat line equal to the mean of the dependent variable, indicating
that there is no linear relationship between the two variables. Thus, if the 95% confidence
interval for the slope does not include zero, we can be 95% confident that the true value of
the slope is not zero and thus that a significant relationship exists between the variables.
                    Coefficients   Standard Error   t Stat   P-value   Lower 95%     Upper 95%
Intercept           13,490.45      57,518.92        0.23     0.8163    -104,331.72   131,312.62
House Size (Sqft)   255.36         28.93            8.83     0.0000    196.10        314.63

In this example, we can say we are 95% confident that the true slope of the regression line
describing the relationship between selling price and house size is between 196.10 and
314.63. Because this range does not include the value zero, we can be 95% confident that
there is a significant linear relationship between the variables.

2. Check whether the p-value is greater than or equal to 0.05

As we noted earlier, regression analysis builds on hypothesis testing. In fact, a single
variable linear regression analysis is equivalent to the hypothesis test

H₀: β = 0
Hₐ: β ≠ 0

Recall that the p-value for a hypothesis test is the likelihood that we would select a sample
at least as extreme as the one we observed if the null hypothesis were true. The p-value
associated with a regression coefficient is the likelihood of choosing a sample at least as
extreme as the sample we used to derive the regression equation if the slope of the true
regression line is actually zero, or equivalently, if there is no linear relationship between the
two variables. Since the p-value for house size, 0.0000, is less than 0.05, we reject the null
hypothesis that the slope is zero and can be confident that there is a significant linear
relationship between selling price and house size. (We can ignore the p-value of the
intercept coefficient because the y-intercept is just a constant. It does not represent an
independent variable and thus provides no information about the significance of the
relationship between two variables.)

                    Coefficients   Standard Error   t Stat   P-value   Lower 95%     Upper 95%
Intercept           13,490.45      57,518.92        0.23     0.8163    -104,331.72   131,312.62
House Size (Sqft)   255.36         28.93            8.83     0.0000    196.10        314.63

Recall that a significance level of 5% corresponds to a confidence level of 95%, so checking
whether a regression coefficient’s p-value is less than 5% is equivalent to checking whether
the coefficient’s 95% confidence interval contains zero. Both approaches test whether we can
be 95% confident that there is a significant linear relationship between the variables.

4.4.3 R-squared vs. p-value

We should always examine both the R2 and the coefficient’s p-value when assessing the fit
of a linear regression model. Let’s look at a few examples to make sure we understand the
difference between these two measures.
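
A quick way to see the difference is with synthetic data containing a weak but genuine linear signal buried in noise. In this sketch (assuming numpy and scipy are available), the slope is clearly significant even though the line explains little of the variation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

# A weak but real linear relationship: true slope 0.5, heavy noise.
x = np.linspace(0, 10, 200)
y = 0.5 * x + rng.normal(0, 5, size=x.size)

res = stats.linregress(x, y)
print(f"p-value = {res.pvalue:.2g}, R-squared = {res.rvalue**2:.3f}")
```

With 200 observations the p-value comes out well below 0.05 (the relationship is significant), yet R-squared stays low (the line explains only a small share of the variation) because the noise dominates. A low p-value and a high R2 measure different things.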

4.4.4 Residual Analysis

As we mentioned earlier in the module, we need to look beyond R2 in order to assess how
two variables are related and whether the linear regression model is a good fit for the data.
We have already looked at the p-value of the independent variable to gain further insight
into the significance of the relationship. Now let’s learn about the additional insight that
residual plots provide.

4.4 Summary
Lesson Summary

It is important to evaluate several metrics in order to determine whether a single variable
linear regression model is a good fit for a data set, rather than looking at individual metrics
in isolation.

● R2 measures the percent of total variation in the dependent variable, y, that is
explained by the regression line.
● R2 = (Variation explained by the regression line)/(Total variation)
= (Regression Sum of Squares)/(Total Sum of Squares)
● 0 ≤ R2 ≤ 1
● For a single variable linear regression, R2 is equal to the
square of the correlation coefficient.
● In addition to analyzing R2, we must test whether the relationship
between the dependent and independent variable is significant
and whether the linear model is a good fit for the data. We do this
by analyzing the p-value (or confidence interval) associated with
the independent variable and the regression’s residual plot.
● The p-value of an independent variable is the result of the
hypothesis test that tests whether there is a significant linear
relationship between the dependent and independent variable;
that is, it tests whether the slope of the regression line is zero:
H0: β = 0 versus Ha: β ≠ 0.
● If the coefficient’s p-value is less than 0.05, we reject the
null hypothesis and conclude that we have sufficient
evidence to be 95% confident that there is a significant
linear relationship between the dependent and independent
variables.
● Note that the p-value and R2 provide different information.
A linear relationship can be significant (have a low p-value)
but not explain a large percentage of the variation (not
have a high R2.)
● A confidence interval associated with an independent variable’s
coefficient indicates the likely range for that coefficient.
● If the 95% confidence interval does not contain zero, we
can be 95% confident that there is a significant linear
relationship between the variables.
● Residual plots can provide insights into whether a linear model is
a good fit.
● Each observation in a data set has a residual equal to the
historically observed value minus the regression’s
predicted value, that is, ε = y − ŷ.
● Linear regression models assume that the regression’s
residuals follow a normal distribution with a mean of zero
and fixed variance.

4.5.2 Using Dummy Variables

So far, we have constructed regression models using only quantitative (numerical)
variables. However, many variables we study are qualitative, or categorical, variables,
meaning that they do not naturally take on numerical values but can be classified into
categories.

● Quantitative (numerical): Variables that can be counted or
measured and that are naturally represented as numbers.
● Qualitative (categorical): Variables that can be sorted or grouped
into categories. Qualitative variables must be transformed into
dummy variables (defined below).

Note that we could artificially assign numbers to categories. For example, if our categories
are colors, we could assign the number 1 to red and 2 to blue, but these numbers have no
meaning in a mathematical sense; we would not conclude that blue is twice as much as red.
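
Creating a dummy variable is a one-line transformation. A minimal sketch using the color example above (the list of observations is made up for illustration):

```python
# Hypothetical observations of the qualitative variable "color".
colors = ["red", "blue", "blue", "red", "blue"]

# Dummy variable: 1 if the color is blue, 0 if red. The 0/1 coding indicates
# category membership only; it carries no notion of magnitude.
is_blue = [1 if c == "blue" else 0 for c in colors]

print(is_blue)  # -> [0, 1, 1, 0, 1]
```

The resulting 0/1 column can then be used as an independent variable in a regression just like any quantitative variable.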

Relating Dummy Variable Regression to Hypothesis Testing

As we noted earlier, regression analysis builds on hypothesis testing. In fact, a single
variable linear regression analysis is equivalent to the two-sample hypothesis test

H0: β = 0
Ha: β ≠ 0

(In regression analysis, the hypothesis test’s p-value is calculated by assuming the two
samples have equal variances.) In this case, since SAT is a dummy variable, this is
equivalent to a hypothesis test with the following null and alternative hypotheses:

● H0: The mean selling price of homes in neighborhoods where the average SAT score
is at or above 1700 = the mean selling price of homes in neighborhoods where the
average SAT score is below 1700.
● Ha: The mean selling price of homes in neighborhoods where the average SAT score
is at or above 1700 ≠ the mean selling price of homes in neighborhoods where the
average SAT score is below 1700.

The regression analysis gives us more information than the hypothesis test alone would.
Rather than simply calculating the p-value, rejecting the null hypothesis and concluding
that there is a significant linear relationship, the regression results provide the direction
and magnitude of this relationship.
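
This equivalence can be verified numerically. In the sketch below (hypothetical selling prices in thousands, grouped by a 0/1 SAT dummy), the regression slope equals the difference in group means and the regression p-value matches the pooled-variance two-sample t-test exactly:

```python
import numpy as np
from scipy import stats

# Hypothetical selling prices ($000s) by neighborhood group; illustrative only.
price_low  = np.array([212, 198, 240, 225, 210, 233, 205, 219])  # SAT below 1700
price_high = np.array([265, 280, 255, 290, 270, 262, 301, 275])  # SAT at/above 1700

prices = np.concatenate([price_low, price_high])
sat    = np.array([0] * len(price_low) + [1] * len(price_high))  # dummy variable

reg = stats.linregress(sat, prices)
tt  = stats.ttest_ind(price_high, price_low, equal_var=True)     # pooled variance

# The dummy's coefficient is the difference in group means (direction and
# magnitude), and the regression p-value equals the t-test p-value.
print(f"slope = {reg.slope:.2f}, mean diff = {price_high.mean() - price_low.mean():.2f}")
print(f"regression p = {reg.pvalue:.3g}, t-test p = {tt.pvalue:.3g}")
```

The regression thus reports everything the t-test does, plus the size and sign of the difference as a coefficient.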

4.5 Summary

Lesson Summary

● The regression output table is divided into three main parts: the
Regression Statistics table, the ANOVA table, and the Regression
Coefficients table. It is important to be able to identify the most
useful measures and interpret them correctly.
● To study the effects of qualitative variables, we use a type of
variable called a dummy variable. A dummy variable takes on one
of two values, 0 or 1, to indicate which of two categories a data
point falls into.

Excel Summary

● Creating a regression output table using the Data Analysis tool
● Creating regression models with dummy variables
● To perform a regression analysis with an independent
dummy variable, we follow the same steps as we would
when using quantitative variables.

4.6.1 The Disney Studio Model


Now that we have the necessary tools, let’s return to Disney Studios and
take a closer look at its single variable linear regression model.


​ So the way that we
forecast home video units
​ is fundamentally based on a
regression of home video unit
​ sales to gross box office.
​ So in our case, the independent
variable, gross box office,
​ is our estimate.
​ It depends on what point in
time we're doing the analysis.
​ If it's before the title's
release theatrically,
​ it's our best estimate of
what the gross box office will
​ be for a title's run,
which could typically
​ last from maybe three months,
three to four months, where
​ the final gross will land.
​ And then what we'll
produce from the regression
​ will be an estimate of
52-week unit sales at retail.

Earlier in this module, we briefly analyzed the relationship between 2011 gross box
office and home video units data. Now let’s look at the 2012 data.
A scatter plot depicting 2012 home video units versus gross box office. The x-axis is
labeled gross box office in millions of dollars and ranges from 0 to 300 in increments
of 50. The y-axis is labeled home video units in thousands and ranges from 0 to
8,000 in increments of 1,000. The plotted points are loosely clustered in an
upward-sloping pattern from left to right. Most of the points on the graph are in the 0
to 50 million dollar range, which corresponds to 0 to 1,000 home video units. The
next-largest range of points is between 50 and 100 million dollars, which
corresponds to 500 to 2,500 home video units. A few scattered points are in the 100
to 250 million dollar range, which corresponds to 1,000 to 6,900 home video units.
The rightmost point is located at ($290 million, 7,100 home video units).

​ Using the simple linear


regression model actually
​ is the best jumping-off
point for our forecasting
​ of the units because
there's actually
​ a very strong correlation
between the two.
​ So the R-squared in
this case is very high
​ and is a good predictor.
​ It's not perfect, but
it's a good predictor
​ and a starting point for
our analysis and forecast.


​ The business used
to be primarily
​ during the heyday of DVD
and ownership business,
​ probably because that was
the most convenient way
​ that you could consume movies in
the home entertainment market.
​ You'd go to places that
you already shopped, and it
​ was one fee, and it wasn't
that much more than renting it,
​ and you would pick
up a movie and then
​ you could do anything
you wanted with it.
​ But what we've really
seen with pressures
​ in the economy and people being
more cost conscious than ever
​ before, combined with a lot of
the innovations in technology,
​ we've seen a big
shift to digital.
​ And as people have bought
less movies physically
​ on DVD and Blu-ray, and started
doing more things digitally,
​ be it video on demand
or subscription or EST,
​ what's happened is they've
started to rent more than buy.
​ All of this change has
made it really complicated
​ for us to accurately forecast
our physical home entertainment
​ sales, because there are
all these shifts in consumer
​ behavior happening, and
they're happening quickly.
​ And it's hard to stay abreast
of where the consumer's going.


​ The shortcomings of using
the regression model
​ for forecasting are
that, first of all,
​ you need an accurate GBO,
gross box office input.
​ In the case of Frozen, it can be
a little bit of a moving target,
​ so that's one of
the shortcomings.
​ You need that accurate
box office input.
​ Our theatrical forecasts
tend to be very accurate
​ once a movie opens.
​ Before a movie opens, it
can be very inaccurate
​ because there are
so many unknowns.
​ Originally, going in
to Frozen before it
​ had opened in theaters,
our box office assumption
​ was that it was going to do
$175 million domestically.
​ Within the first
week of opening,
​ we knew it was going
to be closer to $190,
​ and within, I'd say,
maybe 10 days of opening,
​ we had taken it
up to $340 million,
​ and it might ultimately
go up to $360 million.
​ And now we're at almost
two months post opening.
​ So it has had stronger,
we call it stronger legs.
​ It's had a longer tail than
originally anticipated.
​ But the other thing
is that there's so
​ many other variables that
ultimately affect performance
​ in the home entertainment
market beyond just
​ the strength of the box office.
​ Seasonality actually plays
an important role in the home
​ entertainment market.
​ Part of that due to the
fact that gift giving
​ is an important part of the
home entertainment business.
​ So we have much
stronger seasonality
​ in October, November,
and December,
​ for example, when people
are buying a lot of Blu-rays
​ and DVD for gifts than we
do in the middle of summer
​ when there isn't that kind
of gift giving opportunity.
​ There are some
additional factors
​ that go into trying to predict
home entertainment performance.
​ Some of the
additional factors are
​ macro trends in the industry.
​ Are people buying fewer DVDs
on the whole, and by how much?
​ Are people buying more Blu-rays
on the whole, and by how much?
​ In addition, genre
is another variable.
​ Rating is another variable.
​ So you can have two
movies with the same box
​ office that will
perform differently
​ because we know things.
​ For example, we know that
animated family movies perform
​ better on a units per box office
dollar ratio than, for example,
​ an adult drama.
​ So that can have an effect.
​ The amount that is invested in
media is an important variable.
​ Word of mouth is becoming a more
and more important variable.
​ It's not just how
many people went
​ to see the movie in
theaters, but how
​ did they feel about
that movie and what
​ did they say about it to other
people through social media
​ and through word of mouth.
​ The next step for
analytics is we're
​ going to continue to try
to refine, and figure out
​ which variables seem
to be more impactful.
​ And we expand it.
​ But it's very difficult,
because there's
​ a lot of subjective factors
when you're taking those
​ into play to figure out, one
person might rate a movie
​ as an A. Another one
might rate it as a B.
​ You really have to look at
all of the different pieces
​ of the puzzle, and you
have to have smart people
​ with different points
of view, and you have
​ to surround yourself with them.
​ And you need to
have open dialogue
​ and listen to them, because
with an industry that
​ has so many different dynamics
and that's changing so rapidly,
​ point of view is very important.
​ The data is certainly
very important,
​ but a qualitative overlay
is very important as well.
