Introduction To Management Science: Post Mid Sessions 2 & 3 November 4 and 6 2019


Introduction to Management Science
Post Mid Sessions 2 & 3
November 4th and 6th, 2019
Regression
• Regression Analysis is used to estimate a function f( ) that
describes the relationship between a continuous dependent
variable and one or more independent variables.
Y = f(X1, X2, X3,…, Xn) + e
Note:
• f( ) describes systematic variation in the relationship.
• e represents the unsystematic variation (or random error) in the
relationship.
Advertising – Sales Example
• Consider the relationship between advertising (X1) and
sales (Y) for a company.
• There probably is a relationship...
...as advertising increases, sales should increase.
• But how would we measure and quantify this relationship?

Obs   Advertising (in $1000s)   Actual Sales (in $1000s)
 1            30                       184.4
 2            40                       279.1
 3            40                       244.0
 4            50                       314.2
 5            60                       382.2
 6            70                       450.2
 7            70                       423.6
 8            70                       410.2
 9            80                       500.4
10            90                       505.3
Scatter Plot
[Figure: scatter plot of Actual Sales in $1000s (vertical axis, 0–600) against Advertising in $1000s (horizontal axis, 20–100) for the ten observations]
Simple Linear Regression Model
 The scatter plot shows a linear relation between
advertising and sales.
 So the following regression model is suggested by the
data,
Yi = β0 + β1X1i + εi
This refers to the true relationship between the entire
population of advertising and sales values.
 The estimated regression function (based on our
sample) will be represented as,
  b b X
Yi 0 1 1i

Yˆ i is the estimated (of fitted) value of Y at a given level of X


How do you determine the best fit?
 Numerical values must be assigned to b0 and b1

 The method of “least squares” selects the values


that minimize:
ESS = Σi=1..n (Yi − Ŷi)² = Σi=1..n (Yi − (b0 + b1X1i))²

 If ESS=0 our estimated function fits the data


perfectly.
 We could solve this problem using
 Solver
 In-built Regression Tool
 Trend() function
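Before turning to Excel's tools, the least-squares idea can be sketched directly. The following Python snippet (a sketch, not part of the course materials) applies the closed-form least-squares formulas to the advertising–sales data from the table above:

```python
# Least-squares fit of the advertising-sales data shown earlier.
# The closed-form solution minimizes ESS = sum((Yi - (b0 + b1*X1i))**2).
x = [30, 40, 40, 50, 60, 70, 70, 70, 80, 90]           # advertising ($1000s)
y = [184.4, 279.1, 244.0, 314.2, 382.2, 450.2,
     423.6, 410.2, 500.4, 505.3]                       # actual sales ($1000s)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# b1 = Sxy / Sxx and b0 = y_bar - b1 * x_bar minimize the ESS
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
# b0 ≈ 36.342, b1 ≈ 5.550, matching the estimated function used later
```

The same values are what Solver or the built-in Regression tool would return for this data.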
TREND Function
TREND(Y-range, X-range, X-value for prediction)
where:
Y-range is the spreadsheet range containing the dependent Y
variable,
X-range is the spreadsheet range containing the independent X
variable(s),
X-value for prediction is a cell (or cells) containing the values for
the independent X variable(s) for which we want an estimated
value of Y.

Note: The TREND( ) function is dynamically updated whenever any


inputs to the function change. However, it does not provide the
statistical information provided by the regression tool. It is best to
use these two approaches to regression in conjunction with one
another.
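A rough Python stand-in for TREND( ) can be written with numpy; this is an illustrative sketch, not Excel's own implementation:

```python
import numpy as np

# A rough equivalent of Excel's TREND(Y-range, X-range, X-value):
# fit a least-squares line, then evaluate it at the new X value.
def trend(y_range, x_range, x_new):
    b1, b0 = np.polyfit(x_range, y_range, deg=1)  # slope, intercept
    return b0 + b1 * x_new

x = [30, 40, 40, 50, 60, 70, 70, 70, 80, 90]
y = [184.4, 279.1, 244.0, 314.2, 382.2, 450.2, 423.6, 410.2, 500.4, 505.3]

# Estimated sales when $65,000 is spent on advertising
estimate = trend(y, x, 65)
# ≈ 397.1 (the slides' 397.092 uses coefficients rounded to three decimals)
```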
Nature of Statistical Relationship
[Figure: the regression curve traces the mean of Y at each level of X; probability distributions for Y are shown at different levels of X]
R² Statistic
• The R² statistic indicates how well an estimated
regression function fits the data.
• 0 ≤ R² ≤ 1
• It measures the proportion of the total variation
in Y around its mean that is accounted for by the
estimated regression equation.
• To understand this better, consider the following
graph...
Understanding R²
[Figure: each actual value Yi is decomposed about the estimated line Ŷ = b0 + b1X: the total deviation Yi − Ȳ splits into the error Yi − Ŷi and the explained deviation Ŷi − Ȳ]
Understanding R²
Σi=1..n (Yi − Ȳ)² = Σi=1..n (Yi − Ŷi)² + Σi=1..n (Ŷi − Ȳ)²
or,
TSS = ESS + RSS

R² = RSS/TSS = 1 − ESS/TSS
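The decomposition can be verified numerically on the advertising data. A minimal sketch (not from the slides) using numpy:

```python
import numpy as np

# Check TSS = ESS + RSS and compute R² for the advertising data.
x = np.array([30, 40, 40, 50, 60, 70, 70, 70, 80, 90], dtype=float)
y = np.array([184.4, 279.1, 244.0, 314.2, 382.2, 450.2,
              423.6, 410.2, 500.4, 505.3])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y - y_hat) ** 2)         # error sum of squares
rss = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares

r2 = rss / tss                         # equivalently 1 - ess/tss
# tss ≈ ess + rss, and r2 ≈ 0.969 for this data
```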
Predicting the Dependent Value
 Suppose we want to estimate the average
levels of sales expected if $65,000 is spent on
advertising.
  36.342  5550
Y . X1
i i

 Estimated Sales = 36.342 + 5.550 * 65


= 397.092
 So when $65,000 is spent on advertising, we
expect the average sales level to be
$397,092.
Standard Error
 The standard error measures the scatter of the
actual data around the estimated regression line.

Se = √( Σi=1..n (Yi − Ŷi)² / (n − k − 1) )
where k = the number of independent variables

 For our example, Se = 20.421


 This is helpful in making predictions...
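The Se = 20.421 figure can be reproduced from the advertising data; the sketch below (assuming k = 1, since there is one independent variable) is illustrative:

```python
import numpy as np

# Standard error of the estimate for the advertising regression:
# Se = sqrt( sum((Yi - Yhat_i)^2) / (n - k - 1) ), with k = 1 here.
x = np.array([30, 40, 40, 50, 60, 70, 70, 70, 80, 90], dtype=float)
y = np.array([184.4, 279.1, 244.0, 314.2, 382.2, 450.2,
              423.6, 410.2, 500.4, 505.3])

b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)
n, k = len(x), 1
se = np.sqrt(np.sum(residuals ** 2) / (n - k - 1))
# se ≈ 20.421, matching the value reported in the slides
```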
Approximate Prediction Interval
• An approximate 95% prediction interval for a new
value of Y when X1 = X1h is given by
Ŷh ± 2Se
where:
Ŷh = b0 + b1X1h

 Example: If $65,000 is spent on advertising:


95% lower prediction interval = 397.092 - 2*20.421 = 356.250
95% upper prediction interval = 397.092 + 2*20.421 = 437.934
 If we spend $65,000 on advertising we are
approximately 95% confident actual sales will be
between $356,250 and $437,934.
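The interval calculation can be sketched end to end in Python (illustrative only; the slides' figures use coefficients rounded to three decimals, so the results differ slightly):

```python
import numpy as np

# Approximate 95% prediction interval Yhat_h ± 2*Se at X1 = 65 ($65,000).
x = np.array([30, 40, 40, 50, 60, 70, 70, 70, 80, 90], dtype=float)
y = np.array([184.4, 279.1, 244.0, 314.2, 382.2, 450.2,
              423.6, 410.2, 500.4, 505.3])

b1, b0 = np.polyfit(x, y, deg=1)
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (len(x) - 2))

y_h = b0 + b1 * 65
lower, upper = y_h - 2 * se, y_h + 2 * se
# lower ≈ 356.3, upper ≈ 438.0 (in $1000s); the slides report
# 356.250 and 437.934 because they round the coefficients first.
```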
Beware of Extrapolation …
• Predictions made using an estimated
regression function may have little or no
validity for values of the independent
variables that are substantially different from
those represented in the sample.
Multiple Regression Analysis
 Most regression problems involve more than one
independent variable.
 If each independent variable varies in a linear manner
with Y, the estimated regression function in this case is:
Ŷi = b0 + b1X1i + b2X2i + … + bkXki

 The optimal values for the bi can again be found by


minimizing the ESS.
 The resulting function fits a hyperplane to our
sample data.
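Fitting such a hyperplane is again a least-squares problem. The sketch below uses a small hypothetical data set (not the real-estate data that follows) to show the mechanics with a design matrix:

```python
import numpy as np

# Sketch: fit Yhat = b0 + b1*X1 + b2*X2 by minimizing the ESS,
# using hypothetical data generated from an exact plane so the
# recovered coefficients are easy to check.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 10.0 + 3.0 * x1 + 2.0 * x2   # noiseless, so the fit is perfect

# Design matrix: a column of ones for the intercept, then X1 and X2
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
# b ≈ [10, 3, 2]: the hyperplane Yhat = 10 + 3*X1 + 2*X2
```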
Regression Surface with 2 IVs
[Figure: sample data points scattered about a regression plane (hyperplane) fitted over the X1–X2 plane]
Real Estate Example
• A real estate appraiser wants to develop a model to
help predict the fair market values of residential
properties.
• Three independent variables will be used to estimate
the selling price of a house:
– total square footage
– number of bedrooms
– size of the garage
Selecting the Model
• We want to identify the simplest model that
adequately accounts for the systematic variation in
the Y variable.
• Arbitrarily using all the independent variables may
result in overfitting.
• A sample reflects characteristics:
– representative of the population
– specific to the sample
• We want to avoid fitting sample specific
characteristics -- or overfitting the data.
Models with One IV
• With simplicity in mind, suppose we fit three
simple linear regression functions:
  b b X
Y i 0 1 1i
  b b X
Yi 0 2 2i
  b b X
Yi 0 3 3i
 Key regression results are:
Variables Adjusted Parameter
in the Model R2 R2 Se Estimates
X1 0.870 0.855 10.299 b0=9.503, b1=56.394
X2 0.759 0.731 14.030 b0=78.290, b2=28.382
X3 0.793 0.770 12.982 b0=16.250, b3=27.607
 The model using X1 accounts for 87% of the
variation in Y, leaving 13% unaccounted for.
Multiple Regression in Excel
When using more than one independent
variable, all variables for the X-range must be
in one contiguous block of cells (that is, in
adjacent columns).
Models with 2 IVs
• Now suppose we fit the following models with two
independent variables:
  b b X b X
Yi 0 1 1i 2 2i
  b b X b X
Yi 0 1 1 3 3 i i

 Key regression results are:


Variables Adjusted Parameter
in the Model R2 R2 Se Estimates
X1 0.870 0.855 10.299 b0=9.503, b1=56.394
X1 & X2 0.939 0.924 7.471 b0=27.684, b1=38.576 b2=12.875
X1 & X3 0.877 0.847 10.609 b0=8.311, b1=44.313 b3=6.743

 The model using X1 and X2 accounts for 93.9%


of the variation in Y, leaving 6.1% unaccounted
for.
Adjusted R²
• As additional independent variables are added to a
model:
– The R2 statistic can only increase.
– The Adjusted-R2 statistic can increase or decrease.

 The R2 statistic can be artificially inflated by


adding any independent variable to the model.
 We can compare adjusted-R2 values as a
heuristic to tell if adding an additional
independent variable really helps.
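The adjustment applies the standard formula Adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1). A small sketch reproduces a table entry, assuming n = 11 observations (a sample size consistent with the reported values, but not stated on the slides):

```python
# Adjusted R² penalizes each additional independent variable:
# Adj R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)
def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Assuming n = 11: the X1 & X2 model with R² = 0.939 and k = 2
adj = adjusted_r2(0.939, 11, 2)  # ≈ 0.924, as in the table
```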
Multicollinearity
• It should not be surprising that adding X3 (# of bedrooms)
to the model with X1 (total square footage) did not
significantly improve the model.
• Both variables represent the same (or very similar)
things -- the size of the house.
• These variables are highly correlated (or collinear).
• Multicollinearity should be avoided.
Model with 3 IVs
• Now suppose we fit the following model with three
independent variables:
  b b X b X b X
Yi 0 1 1 2 2 3 3
i i i

 Key regression results are:


Variables Adjusted Parameter
in the Model R2 R2 Se Estimates
X1 0.870 0.855 10.299 b0=9.503, b1=56.394
X1 & X 2 0.939 0.924 7.471 b0=27.684, b1=38.576, b2=12.875
X1, X2 & X3 0.943 0.918 7.762 b0=26.440, b1=30.803,
b2=12.567, b3=4.576
 The model using X1 and X2 appears to be
best:
– Highest adjusted-R2
– Lowest Se (most precise prediction intervals)
Making Predictions
• Let’s estimate the average selling price of a house with
2,100 square feet and a 2-car garage:
Ŷi = b0 + b1X1i + b2X2i
Ŷi = 27.684 + 38.576 × 2.1 + 12.875 × 2 = 134.444

 The estimated average selling price is $134,444

 A 95% prediction interval for the actual selling
price is approximately:
Ŷh ± 2Se
95% lower prediction interval = 134.444 − 2 × 7.471 = $119,502
95% upper prediction interval = 134.444 + 2 × 7.471 = $149,386
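The arithmetic above can be checked directly, using the X1 & X2 model's coefficients and Se from the earlier table (values in $1000s):

```python
# Check of the house-price prediction and its approximate 95% interval,
# using the X1 & X2 model's reported coefficients and standard error.
b0, b1, b2 = 27.684, 38.576, 12.875
se = 7.471

y_hat = b0 + b1 * 2.1 + b2 * 2   # 2,100 sq. ft. and a 2-car garage
lower = y_hat - 2 * se
upper = y_hat + 2 * se
# y_hat ≈ 134.444, interval ≈ (119.502, 149.386)
```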
