Business Analytics
Before we help Disney develop a forecast of its home video units, let’s
build a basic understanding of regression analysis by taking a careful
look at some data on recent residential real estate transactions in the
Boston area. Suppose you are interested in purchasing a single-family
home near Boston and would like to understand the relationship between
selling price and the size of a house. You are confident that larger homes
tend to cost more, but you want to gain a deeper understanding about the
structure of the relationship between selling price and house size.
We'll use Excel's regression tools to actually identify the best fit line through a data set. But to really understand regression, it's important to understand how the regression line is determined.

Let's take another look at the housing data. Clearly we can't draw a single straight line through every point in the data set. This shouldn't surprise us, because house size alone is by no means a perfect predictor of a home's selling price. There are many other factors that influence a home's value. The regression line is the linear relationship that best fits the data, but it won't pass through every point.

Remember when you tried to find the best fit line for the housing data? You probably tried to draw a line that would touch or get close to as many points as possible. This is essentially what Excel does. Broadly speaking, the regression line is the line that minimizes the dispersion of points around that line, and we measure the accuracy of the regression line by measuring that dispersion. We attribute the difference between the actual data points and the values predicted by the regression line either to relationships between selling price and variables other than house size or to chance alone.
Now that we have some understanding of how to find a best fit line, let’s
look at the structure of the equation of this line. In general, a single
variable regression line can be described by the equation
ŷ = a + bx

where a is the y-intercept and b is the slope. (ŷ is pronounced "y-hat" and represents the predicted value of the dependent variable for a given value of the independent variable x.) The y-intercept is the point at which the regression line intersects the vertical axis; in other words, it is the value of ŷ when x = 0. The slope is the change in ŷ associated with a one-unit increase in x.

We write the estimated line as ŷ = a + bx to distinguish it from ŷ = α + βx, the idealized equation that represents the "true" best fit line. Because the best fit line does not perfectly fit even the population data, we add an error term, ε, and write y = α + βx + ε. That is, ε = y − ŷ.
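To see how the coefficients a and b are determined, here is a minimal sketch of the least squares calculation in Python. The house sizes and prices are hypothetical illustrations, not the actual Boston data set:

```python
# Least squares fit of the line ŷ = a + bx, computed by hand.
# Hypothetical data, not the course's Boston housing data set.
sizes = [1000, 1500, 2000, 2500, 3000]   # x: house size (sq ft)
prices = [250, 320, 410, 470, 560]       # y: selling price ($ thousands)

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²  and  a = ȳ - b·x̄
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / \
    sum((x - mean_x) ** 2 for x in sizes)
a = mean_y - b * mean_x

print(a, b)  # intercept ≈ 94.0, slope ≈ 0.154 for this toy data
```

Excel's regression tool carries out exactly this minimization, along with the additional diagnostics discussed later in the module.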
4.2 Summary
Lesson Summary
Excel Summary
Once we have found the regression equation for a given data set, we can
use that equation to obtain a point forecast, in this case, the predicted
selling price for a given house size. For example, we may want to predict
the price of a house on the basis of its size. How much can we expect to
pay for a 1,200 square foot home?
Suppose for a moment that we did not know anything about the
relationship between selling price and house size, that is, suppose we
had only the historical data. In that case, we might simply note that when
a house of that size sold recently, it sold for approximately $266,000. And
so we might predict that we would pay around the same amount for any
1,200 square foot house.
[Spreadsheet excerpt with columns: City, House Size (Sqft), Selling Price ($)]
A single historical data point does not yield the best forecast. Indeed, a
historical data point may not even exist for the house size we are
interested in. Even if it does, the price of that house provides information
on only a single house—it doesn’t reflect information about the other
houses’ sizes and prices. In contrast, regression analysis brings the
power of the entire data set to our prediction. In general, regression
allows us to generate far more accurate predictions than we could make
by inferring a future price from a single data point. Once we have
identified the linear relationship between the two variables, we can use
that regression equation to forecast.
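As a sketch of the arithmetic, suppose (hypothetically) the fitted line were ŷ = 20,000 + 205x; the coefficients below are illustrative placeholders, not Excel's actual output for this data set:

```python
# Point forecast from a fitted single-variable regression line.
# The coefficient values are hypothetical placeholders.
a = 20000.0   # y-intercept ($), hypothetical
b = 205.0     # slope ($ per square foot), hypothetical

def predict_price(size_sqft):
    """Point forecast of selling price for a house of the given size."""
    return a + b * size_sqft

print(predict_price(1200))  # → 266000.0
```

With these placeholder coefficients, the point forecast for a 1,200 square foot home happens to match the roughly $266,000 historical price mentioned above; the regression forecast, however, draws on the entire data set rather than one observation.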
Using the regression equation, we can forecast selling prices for values of the independent variable, x, between 0 and 7,000 square feet. Note that some of these values fall outside the range of historical data, so we must exercise caution.
Specifically, we should look at the range and dispersion of the historical
values of the independent variable (x-values). Since we have no
information about houses outside the historical range, there is greater
uncertainty when predicting selling price for such homes.
In order to use this function we must have the original data. This approach also gives us a
point forecast, but does not provide other helpful values that Excel’s regression tool
produces.
The predictions you just made were point forecasts, or single values, each representing
the expected selling price for a home of a given size.
But there is often a great deal of uncertainty when we use a regression model to
forecast, in part because the regression line does not perfectly fit the data, and in part
because the regression line itself is only an estimate of the true best fit line.
In addition, our forecast uncertainty increases as we near the edges or go outside of the
historical range of our data.
BEGIN GRAPH DESCRIPTION
The rectangular area labeled Range of Historical Data contains most of the data points, which fall between 1,000 and 4,000 square feet and span the entire range of selling prices.
END OF GRAPH DESCRIPTION
Due to this uncertainty, we would rarely use only a point forecast in practice. The point
forecast is a good place to start.
But to make sound managerial decisions, we must try to capture to the best of our ability
the forecast uncertainty.
Suppose we want to forecast the price at which a 2,000 square foot house would sell.
Rather than predicting just a single point, we construct an interval or range around the
point forecast.
We construct this range so it's very likely that the selling price of a 2,000 square foot
house would fall within that range.
The standard error of the regression, in this case, about $151,000, is a reasonable but
conservative estimate of the forecast standard deviation.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.8557
R Square 0.7356
Observations 30
We are able to say that we are 95% confident that the actual selling price will fall within
the prediction interval.
As with confidence intervals, the higher the confidence level we select, the wider our prediction interval will be.
Since there is greater uncertainty when we forecast further from the mean of the
independent variable, we can infer that the prediction interval should be wider as we
move away from the average house size.
So although the standard error is a reasonable estimate on which to base our range, the
actual calculation is more complicated.
As we move towards, and then beyond, the edges of the historical data, the width of the
distribution around the point forecast increases.
In this case, a 95% prediction interval for the selling price of a 7,000 square foot home
would be much wider than that for a 2,000 square foot home.
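The widening of the interval can be sketched with the standard prediction interval formula from statistics. The data here are hypothetical, and the 1.96 critical value is a large-sample approximation; Excel-based analyses of small samples would use the t distribution:

```python
import math

# Sketch of a 95% prediction interval around a point forecast.
# Hypothetical data in $ thousands, not the course data set.
xs = [1000, 1500, 2000, 2500, 3000]   # house sizes (sq ft)
ys = [250, 320, 410, 470, 560]        # selling prices ($ thousands)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)

b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
a = mean_y - b * mean_x

# Standard error of the regression: dispersion of points around the line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

def halfwidth_95(x0):
    # The forecast standard deviation grows as x0 moves away from mean_x,
    # so the prediction interval widens near and beyond the data's edges.
    se_forecast = s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)
    return 1.96 * se_forecast

print(halfwidth_95(2000) < halfwidth_95(7000))  # True: wider far from the mean
```

Note that the extra terms under the square root are why the standard error of the regression alone is only a conservative first approximation of the forecast standard deviation.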
BEGIN GRAPH DESCRIPTION
As the distance between the lines above and below the regression line increases as
house size values increase, the length of the vertical line representing the prediction
interval at 7000 square feet is much longer than the length at 2000 square feet.
END OF GRAPH DESCRIPTION
4.3 Summary
Lesson Summary
● Forecasting in Excel
● =SUMPRODUCT(array1, [array2], [array3],…) is a
convenient function for calculating point forecasts.
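For illustration, here is a rough Python equivalent of the =SUMPRODUCT point forecast, using hypothetical coefficient values:

```python
# Excel's =SUMPRODUCT multiplies paired entries and sums the results.
# Here it computes the point forecast ŷ = a·1 + b·x.
coefficients = [20000.0, 205.0]   # [intercept, slope], hypothetical values
inputs = [1, 1200]                # [1, house size in sq ft]

forecast = sum(c * v for c, v in zip(coefficients, inputs))
print(forecast)  # → 266000.0
```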
Now that we have learned how to determine a regression line and use it
to forecast the dependent variable, we turn our attention to how to
evaluate the “fit” of that line. Even when the linear relationship between
two variables is not very strong, there is still a best fit regression line
associated with that relationship—it just won’t fit the data very well or be
particularly useful. It is helpful to measure how well a regression line fits
the historical data so that we can determine how useful a regression
model might be for forecasting and explaining the relationship between
the dependent and independent variables.
Let's think back to when we tried to draw our best estimate of the regression line. Conceptually, we wanted to find the line that would minimize the dispersion of the data points above and below that line. Now let's formalize this process a bit more, and clarify how we measure dispersion around a line.

To quantify how accurately a line fits a data set, we first measure the vertical distance between each data point and the line. We measure vertical distance, rather than perpendicular distance, because we're interested in how well the line predicts the value of the dependent variable. And the dependent variable, in this case selling price, is measured on the vertical axis. We want to know, for a given house size, how close the price predicted by the line is to the historically observed price for a home of that size.

We call the vertical distance between a data point and the line the residual error. This error is the difference between the observed value and the line's prediction for the dependent variable. This difference may be due to other factors that influence selling price, or to chance alone. Collectively, the residuals for all the data points measure how accurately a line fits a data set.

To quantify the total size of the errors, we can't just sum the vertical distances. If we did, positive and negative distances would cancel each other out. Instead, we take the square of each distance, and then add all of those squared terms together. This measure, called the sum of squared errors or the residual sum of squares, gives us a good measure of how accurately a line fits a data set. A regression line is formally defined as the line that minimizes the sum of squared errors.
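The cancellation argument can be checked directly. Using a small hypothetical data set and its least squares line:

```python
# Why we square the residuals: around the least squares line, positive and
# negative residuals cancel, so only the squared errors measure dispersion.
sizes = [1000, 1500, 2000, 2500, 3000]   # hypothetical house sizes (sq ft)
prices = [250, 320, 410, 470, 560]       # hypothetical prices ($ thousands)
a, b = 94.0, 0.154                        # least squares fit for this toy data

residuals = [y - (a + b * x) for x, y in zip(sizes, prices)]
print(sum(residuals))                  # essentially zero: the errors cancel
print(sum(e ** 2 for e in residuals))  # sum of squared errors: about 190
```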
A critical question we ask when we use regression is how much the regression adds to our understanding of the dependent variable. In this case, we would like to know how much our knowledge of the relationship between house size and selling price helps us understand and predict house selling prices. Specifically, we want to determine how much more we know about selling prices if we have data about house size than if we do not.

To determine how much more information we gain from the house size data, we need a benchmark telling us how much we would know about the behavior of prices if we did not have the house size data, that is, if we only had the price data to work with. Using the price data alone, the best predictor of a future selling price would simply be the average of the previous prices. Thus, we use mean price as our benchmark and draw a mean price line through the data.

We already have a measure of how accurately an individual line fits a data set: the sum of squared errors. To find out how much additional value the regression model gives us, we'll compare the accuracy of the regression line with that of the mean price line. Specifically, we'll calculate the sum of squared errors for each of the two lines, and see how much smaller the error is around the regression line than around the mean line.

We've just learned that the sum of squared errors of the regression line is called the residual sum of squares. It's useful to think of this as the variation left unexplained by the regression model. We can also calculate the sum of squared errors for the mean price line. This represents the total variation in the price data, so we call it the total sum of squares.

To determine how much more accurate the regression line is than the mean line, we subtract the residual sum of squares, in this case about 636 billion, from the total sum of squares, in this case about 2.4 trillion. The difference, about 1.77 trillion, is called the regression sum of squares. We can think of the regression sum of squares as measuring the variation in price that's explained by the regression model. In this case, since the regression sum of squares is a large fraction of the total variation, we know that the regression line helps us predict price much more accurately than the price data alone would.
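The decomposition just described can be checked with the approximate figures quoted above:

```python
# Sums-of-squares decomposition, using the approximate totals from the text:
# total SS ≈ 2.4 trillion, residual SS ≈ 636 billion.
total_ss = 2.4e12        # squared error around the mean price line
residual_ss = 636e9      # squared error around the regression line

regression_ss = total_ss - residual_ss   # variation explained by regression
r_squared = regression_ss / total_ss     # fraction of variation explained

print(regression_ss)        # ≈ 1.76 trillion (rounded inputs; text says 1.77)
print(round(r_squared, 3))  # 0.735, close to the R Square of 0.7356 reported
```

The ratio of the regression sum of squares to the total sum of squares is exactly the R squared statistic that Excel reports in the Regression Statistics table.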
It can be misleading to use only R squared to assess whether a linear regression model
is appropriate.
R squared measures how much variation is explained by the regression line, but it does
not reveal exactly how the variables are related.
The regression line is the line that best fits the observed data, but we need a way to test
whether the linear relationship is significant.
If it is not significant, then the true regression line is just the mean line, which has a
slope of 0.
Thus, if we can show that the slope of the true regression line is not zero, we can be
confident that there is a significant linear relationship.
The null hypothesis is that the slope of the true regression line is zero, and the alternative hypothesis is that the true slope is not zero.
As usual, we look at a p-value, in this case, the p-value of the independent variable, to
determine whether or not we can reject the null hypothesis.
Later we'll look at how residual plots can help us determine whether the linear model is
the best fit for the data.
Even though a linear model may explain some of the variation, the true relationship may
be best described by some type of curve, for example.
There are two ways to test whether the slope of the best fit line equals zero.
1. Check whether the confidence interval for the line's slope contains zero
Remember that the coefficients of the regression line are just estimates of the true linear
relationship between the dependent and independent variables. A coefficient’s lower 95%
and upper 95% values give us the lower and upper bounds of the 95% confidence interval
for that coefficient. Recall that if the best fit regression line has a slope of zero, then the
regression line is just a flat line equal to the mean of the dependent variable, indicating
that there is no linear relationship between the two variables. Thus, if the 95% confidence
interval for the slope does not include zero, we can be 95% confident that the true value of
the slope is not zero and thus that a significant relationship exists between the variables.
Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
In this example, we can say we are 95% confident that the true slope of the regression line
describing the relationship between selling price and house size is between 196.10 and
314.63. Because this range does not include the value zero, we can be 95% confident that
there is a significant linear relationship between the variables.
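This decision rule is simple enough to express directly; the bounds are the ones quoted above:

```python
# Significance check via the 95% confidence interval for the slope:
# if the interval excludes zero, the linear relationship is significant
# at the 5% level. Bounds are the values quoted in the text.
lower_95, upper_95 = 196.10, 314.63

significant = not (lower_95 <= 0 <= upper_95)
print(significant)  # → True: the interval excludes zero
```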
2. Check whether the p-value of the slope coefficient is less than 0.05

We test the null hypothesis H0: β = 0 against the alternative hypothesis Ha: β ≠ 0. Recall that the p-value for a hypothesis test is the likelihood that we would select a sample
at least as extreme as the one we observed if the null hypothesis were true. The p-value
associated with a regression coefficient is the likelihood of choosing a sample at least as
extreme as the sample we used to derive the regression equation if the slope of the true
regression line is actually zero, or equivalently, if there is no linear relationship between the
two variables. Since the p-value for house size, 0.0000, is less than 0.05, we reject the null
hypothesis that the slope is zero and can be confident that there is a significant linear
relationship between selling price and house size. (We can ignore the p-value of the
intercept coefficient because the y-intercept is just a constant. It does not represent an
independent variable and thus provides no information about the significance of the
relationship between two variables.)
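The p-value rule works the same way as the interval check; the p-value is the one reported for house size:

```python
# Significance check via the p-value of the slope coefficient.
p_value = 0.0000   # p-value for house size, as reported in the output
alpha = 0.05       # significance level

reject_null = p_value < alpha
print(reject_null)  # → True: reject H0 that the true slope is zero
```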
4.4 Summary
Lesson Summary
We test the null hypothesis H0: β = 0 against the alternative hypothesis Ha: β ≠ 0. (In regression analysis, the hypothesis test's p-value is calculated by assuming the two samples have equal variances.) In this case, since SAT is a dummy variable, this is equivalent to a hypothesis test with the following null and alternative hypotheses:
● H0: The selling price of homes in neighborhoods where the average SAT score is at or above 1700 = the selling price of homes in neighborhoods where the average SAT score is below 1700.
● Ha: The selling price of homes in neighborhoods where the average SAT score is at or above 1700 ≠ the selling price of homes in neighborhoods where the average SAT score is below 1700.
The regression analysis gives us more information than the hypothesis test alone would.
Rather than simply calculating the p-value, rejecting the null hypothesis and concluding
that there is a significant linear relationship, the regression results provide the direction
and magnitude of this relationship.
4.5 Summary
Lesson Summary
● The regression output table is divided into three main parts: the
Regression Statistics table, the ANOVA table, and the Regression
Coefficients table. It is important to be able to identify the most
useful measures and interpret them correctly.
● To study the effects of qualitative variables, we use a type of
variable called a dummy variable. A dummy variable takes on one
of two values, 0 or 1, to indicate which of two categories a data
point falls into.
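A dummy variable is easy to construct from raw data; for example, with hypothetical neighborhood SAT averages:

```python
# Encoding a dummy variable: 1 if the neighborhood's average SAT score
# is at or above 1700, 0 otherwise. The scores below are hypothetical.
avg_sat_scores = [1650, 1720, 1580, 1800]

sat_dummy = [1 if score >= 1700 else 0 for score in avg_sat_scores]
print(sat_dummy)  # → [0, 1, 0, 1]
```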
Excel Summary
So the way that we forecast home video units is fundamentally based on a regression of home video unit sales on gross box office. In our case, the independent variable, gross box office, is our estimate. It depends on what point in time we're doing the analysis. If it's before the title's theatrical release, it's our best estimate of what the gross box office will be for the title's run, which typically lasts three to four months, by which point the final gross will land. And then what we'll produce from the regression will be an estimate of 52-week unit sales at retail.
Earlier in this module, we briefly analyzed the relationship between 2011 gross box
office and home video units data. Now let’s look at the 2012 data.
A scatter plot depicting 2012 home video units versus gross box office. The x-axis is
labeled gross box office in millions of dollars and ranges from 0 to 300 in increments
of 50. The y-axis is labeled home video units in thousands and ranges from 0 to
8,000 in increments of 1,000. The plotted points are loosely clustered in an
upward-sloping pattern from left to right. Most of the points on the graph are in the 0
to 50 million dollar range, which corresponds to 0 to 1,000 home video units. The
next-largest range of points is between 50 and 100 million dollars, which
corresponds to 500 to 2,500 home video units. A few scattered points are in the 100
to 250 million dollar range, which corresponds to 1,000 to 6,900 home video units.
The rightmost point is located at ($290 million, 7,100 home video units).