
LINEAR REGRESSION

➢ Managerial decisions are often based on the relationship between two or more variables.
   • Example: After considering the relationship between advertising expenditures and sales, a marketing manager might attempt to predict sales for a given level of advertising expenditures.
➢ Sometimes a manager will rely on intuition to judge how two variables are related.
➢ If data can be obtained, a statistical procedure called regression analysis can be used to develop an equation showing how the variables are related.
➢ Dependent variable or response
   • The variable being predicted.
➢ Independent variables or predictor variables
   • The variables being used to predict the value of the dependent variable.
➢ Linear regression
   • A regression analysis involving one independent variable and one dependent variable.
➢ In statistical notation:
   • y = dependent variable
   • x = independent variable
➢ Simple linear regression:
   • A regression analysis for which any one-unit change in the independent variable, x, is assumed to result in the same change in the dependent variable, y.
➢ Multiple linear regression:
   • A regression analysis involving two or more independent variables.

I. The Simple Linear Regression Model

A. Regression Model
B. Estimated Regression Equation

A. Regression Model
➢ The equation that describes how y is related to x and an error term.
   • Simple linear regression model:

     y = β0 + β1x + ε

   • Parameters: the characteristics of the population, β0 and β1.
   • Random variable: the error term, ε.
   • The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.
   • The parameter values are usually not known and must be estimated using sample data.
   • Sample statistics (denoted b0 and b1) are computed as estimates of the population parameters β0 and β1.

B. Estimated Regression Equation
   • The equation obtained by substituting the values of the sample statistics b0 and b1 for β0 and β1 in the regression equation.
   • Estimated simple linear regression equation:

     ŷ = b0 + b1x

   • ŷ = point estimator of E(y|x)
   • b0 = estimated y-intercept
   • b1 = estimated slope coefficient
➢ Estimated regression line
   - The graph of the estimated simple linear regression equation.

Note (on the figure of possible regression lines, Panels A–C):
   • The regression line in Panel A shows that the mean value of y is related positively to x, with larger values of E(y|x) associated with larger values of x.
   • In Panel B, the mean value of y is related negatively to x, with smaller values of E(y|x) associated with larger values of x.
   • In Panel C, the mean value of y is not related to x; that is, E(y|x) is the same for every value of x.
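
The distinction between the population model and the estimated equation can be illustrated with a short simulation. The following Python sketch is not from the source material; the parameter values, sample size, and error spread are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Population parameters (unknown in practice; assumed here for illustration).
    beta0, beta1 = 1.5, 0.07
    x = rng.uniform(40, 100, size=50)    # independent variable
    eps = rng.normal(0, 0.5, size=50)    # error term: mean 0, constant variance
    y = beta0 + beta1 * x + eps          # simple linear regression model

    # Sample statistics b0 and b1 are computed as estimates of beta0 and beta1.
    b1, b0 = np.polyfit(x, y, deg=1)     # polyfit returns [slope, intercept]
    print(f"estimated equation: y_hat = {b0:.4f} + {b1:.4f} x")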
II. Least Squares Method

A. Least Squares Estimates of the Regression Parameters
B. Using Excel's Chart Tools to Compute the Estimated Regression Equation

➢ Least squares method
   - A procedure for using sample data to find the estimated regression equation.
   - It determines the values of b0 and b1.
➢ Interpretation of b0 and b1:
   • The slope b1 is the estimated change in the mean of the dependent variable y that is associated with a one-unit increase in the independent variable x.
   • The y-intercept b0 is the estimated value of the dependent variable y when the independent variable x is equal to 0.
➢ Least squares criterion:

     min Σ(yi − ŷi)² = min Σei²,  with the sums running over i = 1, . . . , n

   • yi = observed value of the dependent variable for the ith observation
   • ŷi = predicted value of the dependent variable for the ith observation
   • n = total number of observations
   • ith residual = the error made using the regression model to estimate the mean value of the dependent variable for the ith observation, denoted ei = yi − ŷi
➢ We are finding the regression that minimizes the sum of squared errors.

A. Least Squares Estimates of the Regression Parameters
➢ For the Butler Trucking Company:
   • Estimated slope: b1 = 0.0678
   • Estimated y-intercept: b0 = 1.2739
   • The estimated simple linear regression model:

     ŷ = 1.2739 + 0.0678x

➢ Interpretation of b1: If the length of a driving assignment were 1 unit (1 mile) longer, the mean travel time for that driving assignment would be 0.0678 units (0.0678 hours, or approximately 4 minutes) longer.
➢ Interpretation of b0: If the driving distance for a driving assignment were 0 units (0 miles), the mean travel time would be 1.2739 units (1.2739 hours, or approximately 76 minutes).
➢ Experimental region
   - The range of values of the independent variables in the data used to estimate the model.
   - The regression model is valid only over this region.
➢ Extrapolation
   - Prediction of the value of the dependent variable outside the experimental region.
   - It is risky.

B. Using Excel's Chart Tools to Compute the Estimated Regression Equation
   • After constructing a scatter chart with Excel's chart tools:
     1. Right-click on any data point and select Add Trendline…
     2. In the Format Trendline task pane, in the Trendline Options area:
        • Select Linear
        • Select Display Equation on chart
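
As an alternative to Excel's trendline, the least squares estimates can be computed directly from their closed-form formulas. This Python sketch uses made-up miles/hours pairs, not the actual Butler Trucking sample, so the estimates it prints will differ from those above.

    import numpy as np

    # Illustrative miles/hours pairs (assumed data, not the Butler Trucking sample).
    x = np.array([100, 50, 100, 100, 50, 80, 75, 65, 90, 90], dtype=float)  # miles
    y = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1])        # hours

    # Closed-form least squares estimates:
    #   b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2),  b0 = ybar - b1*xbar
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()

    y_hat = b0 + b1 * x          # predicted values
    e = y - y_hat                # residuals e_i = y_i - y_hat_i
    print(f"y_hat = {b0:.4f} + {b1:.4f} x, SSE = {np.sum(e**2):.4f}")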
Assessing the Fit of the Simple Linear Regression Model

A. The Sums of Squares
B. The Coefficient of Determination
C. Using Excel's Chart Tools to Compute the Coefficient of Determination

A. The Sums of Squares
➢ Sum of Squares due to Error (SSE)
   - The value of SSE is a measure of the error in using the estimated regression equation to predict the values of the dependent variable in the sample.
➢ Total Sum of Squares (SST)
   - The corresponding sum of squares based on using ȳ to predict the values of the dependent variable.
   - Butler Trucking example: For the ith driving assignment in the sample, the difference yi − ȳ provides a measure of the error involved in using ȳ to predict travel time for the ith driving assignment.
➢ Sum of Squares due to Regression (SSR)
   - The portion of SST explained by the estimated regression equation (SST = SSR + SSE).

B. The Coefficient of Determination
➢ The ratio SSR/SST is used to evaluate the goodness of fit of the estimated regression equation:

     r² = SSR/SST

   • It takes values between zero and one.
   • It is interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation.
➢ It is the square of the correlation between the yi and the ŷi.
➢ It is referred to as the simple coefficient of determination in simple regression.
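
A minimal sketch of the sums-of-squares decomposition and r², again using assumed data rather than the Butler Trucking sample:

    import numpy as np

    # Illustrative data (assumed); fit by least squares, then decompose the fit.
    x = np.array([100, 50, 100, 100, 50, 80, 75, 65, 90, 90], dtype=float)
    y = np.array([9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1])
    b1, b0 = np.polyfit(x, y, deg=1)
    y_hat = b0 + b1 * x

    sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
    sse = np.sum((y - y_hat) ** 2)          # sum of squares due to error
    ssr = np.sum((y_hat - y.mean()) ** 2)   # sum of squares due to regression

    print(np.isclose(sst, ssr + sse))       # SST = SSR + SSE for a least squares fit
    print(f"r^2 = SSR/SST = {ssr / sst:.4f}")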
The Multiple Regression Model

A. Regression Model
B. Estimated Multiple Regression Equation
C. Least Squares Method and Multiple Regression
D. Butler Trucking Company and Multiple Regression
E. Using Excel's Regression Tool to Develop the Estimated Multiple Regression Equation

A. Regression Model
➢ The multiple regression equation that describes how the mean value of y is related to x1, x2, . . . , xq:

     E(y | x1, x2, . . . , xq) = β0 + β1x1 + β2x2 + . . . + βqxq

➢ Interpretation of the slope coefficient βj
   - It represents the change in the mean value of the dependent variable y that corresponds to a one-unit increase in the independent variable xj, holding the values of all other independent variables in the model constant.

B. Estimated Multiple Regression Equation
➢ Estimated multiple linear regression equation with two independent variables (Butler Trucking):

     ŷ = b0 + b1x1 + b2x2

   • ŷ = estimated mean travel time
   • x1 = distance traveled
   • x2 = number of deliveries
➢ The SST, SSR, SSE, and R² are computed using the formulas discussed earlier.

Inference and Regression

A. Conditions Necessary for Valid Inference in the Least Squares Regression Model
B. Testing Individual Regression Parameters
C. Addressing Nonsignificant Independent Variables
D. Multicollinearity
E. Inference and Very Large Samples

Inference and Regression
➢ Statistical inference
   - The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population.
➢ In regression, inference is commonly used to estimate and draw conclusions about:
   • The regression parameters β0, β1, β2, . . . , βq
   • The mean value and/or the predicted value of the dependent variable y for specific values of the independent variables x1, x2, . . . , xq
➢ We consider both hypothesis testing and interval estimation.

A. Conditions Necessary for Valid Inference in the Least Squares Regression Model
   • For any given combination of values of the independent variables x1, x2, . . . , xq, the population of potential error terms ε is normally distributed with a mean of 0 and a constant variance.
   • The values of ε are statistically independent.
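
These conditions are commonly checked informally with residual plots. The following sketch is an assumed workflow, not a procedure from the source material; the simulated data are illustrative.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.uniform(40, 100, size=60)                  # illustrative data
    y = 1.3 + 0.07 * x + rng.normal(0, 0.5, size=60)
    b1, b0 = np.polyfit(x, y, deg=1)
    fitted = b0 + b1 * x
    residuals = y - fitted

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(fitted, residuals)      # look for mean 0, constant spread,
    ax1.axhline(0, linestyle="--")      # and no pattern across fitted values
    ax1.set(xlabel="fitted values", ylabel="residuals")
    ax2.hist(residuals, bins=12)        # look for a roughly normal shape
    ax2.set(xlabel="residuals")
    plt.tight_layout()
    plt.show()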

B. Testing Individual Regression Parameters
   • To determine whether statistically significant relationships exist between the dependent variable y and each of the independent variables x1, x2, . . . , xq individually:
   • If βj = 0, there is no linear relationship between the dependent variable y and the independent variable xj.
   • If βj ≠ 0, there is a linear relationship between y and xj.
➢ Use a t-test to test the hypothesis that a regression parameter βj is zero.
➢ The test statistic for this t-test:

     t = bj / s_bj

   • s_bj = estimated standard deviation of bj
➢ As the magnitude of t increases (as t deviates from zero in either direction), we are more likely to reject the hypothesis that the regression parameter βj is zero.
➢ A confidence interval can also be used to test whether each of the regression parameters β0, β1, β2, . . . , βq is equal to zero.
➢ Confidence interval
   - An estimate of a population parameter that provides an interval believed to contain the value of the parameter at some level of confidence.
➢ Confidence level
   - Indicates how frequently interval estimates based on samples of the same size taken from the same population using identical sampling techniques will contain the true value of the parameter we are estimating.
➢ Testing for an overall regression relationship:
   • Use an F test based on the F probability distribution.
   • If the F test leads us to reject the hypothesis that the values of β1, β2, . . . , βq are all zero, conclude that there is an overall regression relationship; otherwise, conclude that there is no overall regression relationship.
   • The test statistic generated by the sample data for this test is:

     F = (SSR / q) / (SSE / (n − q − 1))

   • SSR = sum of squares due to regression
   • SSE = sum of squares due to error
   • q = the number of independent variables in the regression model
   • n = the number of observations in the sample
   • Larger values of F provide stronger evidence of an overall regression relationship.
   • The numerator of this test statistic is a measure of the variability in the dependent variable y that is explained by the independent variables x1, x2, . . . , xq.
   • The denominator is a measure of the variability in the dependent variable y that is not explained by the independent variables x1, x2, . . . , xq.
   • Statistical software will generally report a p-value for this test statistic.
   • For a given value of F, the p-value represents the probability of collecting a sample of the same size from the same population that yields a larger F statistic, given that the values of β1, β2, . . . , βq are all actually zero.
   • Thus, smaller p-values indicate stronger evidence against the hypothesis that the values of β1, β2, . . . , βq are all zero (i.e., stronger evidence of an overall regression relationship).
   • The hypothesis is rejected when the p-value is smaller than some predetermined value (usually 0.05 or 0.01) that is referred to as the level of significance.

C. Addressing Nonsignificant Independent Variables
➢ If practical experience dictates that a nonsignificant independent variable has a relationship with the dependent variable, the independent variable should be left in the model.
➢ If the model sufficiently explains the dependent variable without the nonsignificant independent variable, then consider rerunning the regression without that variable.
➢ The appropriate treatment of the inclusion or exclusion of the y-intercept when b0 is not statistically significant may require special consideration.

D. Multicollinearity
➢ Multicollinearity refers to the correlation among the independent variables in multiple regression analysis.
➢ In t-tests for the significance of individual parameters, the difficulty caused by multicollinearity is that it is possible to conclude that a parameter associated with one of the multicollinear independent variables is not significantly different from zero when the independent variable actually has a strong relationship with the dependent variable.
➢ This problem is avoided when there is little correlation among the independent variables.

E. Inference and Very Large Samples
➢ Because virtually all relationships between independent variables and the dependent variable will be statistically significant if the sample size is sufficiently large, inference can no longer be used to discriminate between meaningful and specious relationships.
   • This is because the variability in potential values of an estimator bj of a regression parameter βj depends on two factors:
     1. How closely the members of the population adhere to the relationship between xj and y that is implied by βj.
     2. The size of the sample on which the value of the estimator bj is based.
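
The t-tests, confidence intervals, and overall F test described in this section are all reported by standard regression software. A minimal Python sketch using statsmodels, with simulated illustrative data (the variable names and parameter values are assumptions):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    n = 100
    x1 = rng.uniform(40, 100, size=n)      # e.g., miles traveled
    x2 = rng.integers(1, 6, size=n)        # e.g., number of deliveries
    y = 1.3 + 0.07 * x1 + 0.7 * x2 + rng.normal(0, 0.5, size=n)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()

    print(fit.tvalues)                # t = b_j / s_bj for each parameter
    print(fit.pvalues)                # p-values for H0: beta_j = 0
    print(fit.conf_int(alpha=0.05))   # 95% confidence intervals for the beta_j
    print(fit.fvalue, fit.f_pvalue)   # overall F test of H0: beta_1 = ... = beta_q = 0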
Categorical Independent Variables

A. Butler Trucking Company and Rush Hour
B. Interpreting the Parameters
C. More Complex Categorical Variables

A. Butler Trucking Company and Rush Hour
   • Dependent variable, y: travel time
   • Independent variables: miles traveled (x1) and number of deliveries (x2)
   • Categorical (dummy) variable: rush hour (x3)
     ❖ 0 if an assignment did not include travel on the congested segment of highway during afternoon rush hour
     ❖ 1 if an assignment included travel on the congested segment of highway during afternoon rush hour
   • The mean or expected value of travel time for driving assignments given no rush hour driving:

     E(y | x3 = 0) = β0 + β1x1 + β2x2 + β3(0)
                   = β0 + β1x1 + β2x2

   • The mean or expected value of travel time for driving assignments given rush hour driving:

     E(y | x3 = 1) = β0 + β1x1 + β2x2 + β3(1)
                   = (β0 + β3) + β1x1 + β2x2

B. Interpreting the Parameters
➢ Using the estimated multiple regression equation:

     ŷ = -0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3

   • When x3 = 0:  ŷ = -0.3302 + 0.0672x1 + 0.6735x2
   • When x3 = 1:  ŷ = (-0.3302 + 0.9980) + 0.0672x1 + 0.6735x2 = 0.6678 + 0.0672x1 + 0.6735x2
➢ The model estimates that travel time increases by:
   • 0.0672 hours for every increase of 1 mile traveled, holding constant the number of deliveries and whether the driving assignment route requires the driver to travel on the congested segment of highway during the afternoon rush hour period.
   • 0.6735 hours for every delivery, holding constant the number of miles traveled and whether the driving assignment route requires the driver to travel on the congested segment of highway during the afternoon rush hour period.
   • 0.9980 hours if the driving assignment route requires the driver to travel on the congested segment of highway during the afternoon rush hour period, holding constant the number of miles traveled and the number of deliveries.
➢ R² = 0.8838 indicates that the regression model explains approximately 88.4 percent of the variability in travel time for the driving assignments in the sample.

C. More Complex Categorical Variables
➢ If a categorical variable has k levels, k − 1 dummy variables are required, with each dummy variable corresponding to one of the levels of the categorical variable and coded as 0 or 1.
➢ Example:
   • Suppose a manufacturer of vending machines organized the sales territories for a particular state into three regions: A, B, and C.
   • The managers want to use regression analysis to help predict the number of vending machines sold per week.
   • Suppose the managers believe sales region is one of the important factors in predicting the number of units sold.
   • Sales region: a categorical variable with three levels (A, B, and C)
   • Number of dummy variables = 3 − 1 = 2
   • Each dummy variable is coded 0 or 1:

     x1 = 1 if sales Region B, 0 otherwise
     x2 = 1 if sales Region C, 0 otherwise

   • The regression equation relating the expected value of the number of units sold, E(y | x1, x2), to the dummy variables:

     E(y | x1, x2) = β0 + β1x1 + β2x2

   • Observations corresponding to Sales Region A are coded x1 = 0, x2 = 0:

     E(y | x1 = 0, x2 = 0) = E(y | Sales Region A) = β0 + β1(0) + β2(0) = β0

   • Observations corresponding to Sales Region B are coded x1 = 1, x2 = 0:

     E(y | x1 = 1, x2 = 0) = E(y | Sales Region B) = β0 + β1(1) + β2(0) = β0 + β1

   • Observations corresponding to Sales Region C are coded x1 = 0, x2 = 1:

     E(y | x1 = 0, x2 = 1) = E(y | Sales Region C) = β0 + β1(0) + β2(1) = β0 + β2

   • β0 = mean or expected value of sales for Region A
   • β1 = difference between the mean number of units sold in Region B and the mean number of units sold in Region A
   • β2 = difference between the mean number of units sold in Region C and the mean number of units sold in Region A
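
A minimal sketch of the k − 1 dummy-variable coding for the three-level sales region example. The sales values below are made up for illustration; the column names are assumptions produced by pandas.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical weekly sales by region (illustrative values only).
    df = pd.DataFrame({
        "region":     ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "units_sold": [10,  14,  12,  11,  15,  13,   9,  16,  12],
    })

    # k = 3 levels -> k - 1 = 2 dummy variables; Region A is the baseline.
    dummies = pd.get_dummies(df["region"], prefix="x", drop_first=True).astype(float)
    X = sm.add_constant(dummies)            # columns: const, x_B, x_C
    fit = sm.OLS(df["units_sold"], X).fit()

    print(fit.params)   # const = b0 (mean for Region A);
                        # x_B = b1, x_C = b2 (differences from Region A)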
Modeling Nonlinear Relationships

A. Quadratic Regression Models
B. Piecewise Linear Regression Models
C. Interaction Between Independent Variables

A. Quadratic Regression Models
➢ An estimated regression equation of the form:

     ŷ = b0 + b1x1 + b2x1²

➢ In the Reynolds example, to account for the curvilinear relationship between months employed and sales, we could include the square of the number of months the salesperson has been employed in the model as a second independent variable.

B. Piecewise Linear Regression Models
➢ For the Reynolds data, an alternative to a quadratic regression model:
   • Recognize that below some value of Months Employed, the relationship between Months Employed and Sales appears to be positive and linear, whereas the relationship appears to be negative and linear for the remaining observations.
➢ Piecewise linear regression model
   - This model allows us to fit these relationships as two linear regressions that are joined at the value of Months Employed at which the relationship between Months Employed and Sales changes.
➢ Knot
   - The value of the independent variable at which the relationship between the dependent variable and the independent variable changes.
   - For the Reynolds data, the knot is the value of Months Employed at which the relationship between Months Employed and Sales changes.
✓ Once we have decided on the location of the knot, we define a dummy variable that is equal to zero for any observation for which the value of Months Employed is less than or equal to the value of the knot, and equal to one for any observation for which the value of Months Employed is greater than the value of the knot.
✓ The interpretation of this model is similar to the interpretation of the quadratic regression model.

C. Interaction Between Independent Variables
➢ Interaction
   - This occurs when the relationship between the dependent variable and one independent variable is different at various values of a second independent variable.
➢ The model is given by:

     y = β0 + β1x1 + β2x2 + β3x1x2 + ε

➢ The estimated model is given by:

     ŷ = b0 + b1x1 + b2x2 + b3x1x2

✓ In this multiple linear regression model, y is the dependent variable and x1 and x2 are independent variables.
✓ The product x1x2 captures the interaction between the two independent variables.
✓ When interaction between two variables is present, we cannot study the relationship between one independent variable and the dependent variable y independently of the other variable.
✓ Once we obtain the estimated regression equation, we can use the p-value approach to conclude whether the interaction is significant.
✓ The interpretation of the other coefficients is the same as in multiple linear regression.
✓ Note that we can combine a quadratic effect with interaction to produce a second-order polynomial model with interaction between the two independent variables.
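
All three nonlinear forms can be fit by adding constructed columns to an ordinary least squares design matrix. The sketch below uses simulated data; the knot location (60 months) and all parameter values are assumptions, not taken from the Reynolds data.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    months = rng.uniform(1, 100, size=120)   # e.g., Months Employed
    sales = 50 + 2.5 * months - 0.02 * months**2 + rng.normal(0, 8, size=120)

    # Quadratic model: y_hat = b0 + b1*x1 + b2*x1^2
    X_quad = sm.add_constant(np.column_stack([months, months**2]))
    quad_fit = sm.OLS(sales, X_quad).fit()

    # Piecewise linear model with an assumed knot at 60 months: the dummy d is
    # 0 at or below the knot and 1 above it, so the slope changes past the knot.
    knot = 60.0
    d = (months > knot).astype(float)
    X_pw = sm.add_constant(np.column_stack([months, (months - knot) * d]))
    pw_fit = sm.OLS(sales, X_pw).fit()

    # Interaction between two independent variables: include the product x1*x2.
    x2 = rng.integers(0, 2, size=120).astype(float)
    X_int = sm.add_constant(np.column_stack([months, x2, months * x2]))
    int_fit = sm.OLS(sales, X_int).fit()

    print(quad_fit.rsquared, pw_fit.rsquared, int_fit.rsquared)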
is removed from the model.
Model Fitting

A. Variable Selection Procedures
B. Overfitting

A. Variable Selection Procedures
   a. Forward selection
   b. Backward selection
   c. Stepwise selection
   d. Best subsets

a. Forward selection procedure:
➢ The analyst establishes a criterion for allowing independent variables to enter the model.
➢ Example: the independent variable j with the smallest p-value associated with the test of the hypothesis βj = 0, subject to some predetermined maximum p-value for which a potential independent variable will be allowed to enter the model.
➢ First step: The independent variable that best satisfies the criterion is added to the model.
➢ Each subsequent step: The remaining independent variables not in the current model are evaluated, and the one that best satisfies the criterion is added to the model.
➢ Procedure stops: When there are no independent variables not currently in the model that meet the criterion for being added to the regression model.

b. Backward selection procedure:
➢ The analyst establishes a criterion for allowing independent variables to remain in the model.
➢ Example: the largest p-value associated with the test of the hypothesis βj = 0, subject to some predetermined minimum p-value for which a potential independent variable will be allowed to remain in the model.
➢ First step: The independent variable that violates this criterion to the greatest degree is removed from the model.
➢ Each subsequent step: The independent variables in the current model are evaluated, and the one that violates this criterion to the greatest degree is removed from the model.
➢ Procedure stops: When there are no independent variables currently in the model that violate the criterion for remaining in the regression model.

c. Stepwise procedure:
➢ The analyst establishes both a criterion for allowing independent variables to enter the model and a criterion for allowing independent variables to remain in the model.
➢ In the first step, the independent variable that best satisfies the criterion for entering the model is added.
➢ In each subsequent step: first, the remaining independent variables not in the current model are evaluated, and the one that best satisfies the criterion for entering is added to the model; then, the independent variables in the current model are evaluated, and the one that violates the criterion for remaining in the model to the greatest degree is removed.
➢ The procedure stops when no independent variables not currently in the model meet the criterion for being added to the regression model, and no independent variables currently in the model violate the criterion for remaining in the regression model.

d. Best-subsets procedure:
➢ Simple linear regressions for each of the independent variables under consideration are generated; then the multiple regressions with all combinations of two independent variables under consideration are generated, and so on.
➢ Once a regression has been generated for every possible subset of the independent variables under consideration, output that provides some criteria for selecting among the regression models is produced for all models generated.

B. Overfitting
➢ Overfitting results from creating an overly complex model to explain idiosyncrasies in the sample data.
➢ It results from the use of complex functional forms or independent variables that do not have meaningful relationships with the dependent variable.
➢ If a model is overfit to the sample data, it will perform better on the sample data used to fit the model than it will on other data from the population.
➢ Thus, an overfit model can be misleading about its predictive capability and its interpretation.

How does one avoid overfitting a model?
➢ Use only independent variables that you expect to have real and meaningful relationships with the dependent variable.
➢ Use complex models, such as quadratic models and piecewise linear regression models, only when you have a reasonable expectation that such complexity provides a more accurate depiction of what you are modeling.
➢ Do not let software dictate your model; use iterative modeling procedures, such as the stepwise and best-subsets procedures, only for guidance and not to generate your final model.
➢ Cross-validation
   - If you have access to a sufficient quantity of data, assess your model on data other than the sample data that were used to generate the model.
➢ It is recommended to divide the original sample data into training and validation sets.
➢ Training set
   - The data set used to build the candidate models that appear to make practical sense.
➢ Validation set
   - The set of data used to compare model performances and ultimately pick a model for predicting values of the dependent variable.
➢ Holdout method
   - The sample data are randomly divided into mutually exclusive and collectively exhaustive training and validation sets.
➢ k-fold cross-validation
   - The sample data are randomly divided into k equal-sized, mutually exclusive, and collectively exhaustive subsets called folds, and k iterations are executed.
➢ Leave-one-out cross-validation
   - For a sample of n observations, an iteration consists of estimating the model on n − 1 observations and evaluating the model on the single observation that was omitted from the training data.