Emerging Market Institute Fall 2024

Business Data Analysis

Jin Chen, PhD.


School of Statistics,
Beijing Normal University
10/30/2024

Statistics for Business and Economics © Cengage Learning 2017


No dissemination without permission or outside the class.
Lecture 11 Multiple Regression
 Multiple Regression
 Multiple regression model
 Least squares method
 Multiple coefficient of determination
 Model assumptions
 Testing for significance
 Estimation and prediction based on the estimated regression
equation
 Residual analysis
 Categorical independent variables
 Model curvilinear relationship

 Multiple regression model
 Two or more independent variables
 Describes how the dependent variable is related to the independent
variables and an error term:
y = β0 + β1x1 + β2x2 + . . . + βpxp + ε
where: β0, β1, β2, . . . , βp are the parameters, and ε is a random variable
called the error term
 Multiple regression equation
E(y) = β0 + β1x1 + β2x2 + . . . + βpxp
 Estimated multiple regression equation
ŷ = b0 + b1x1 + b2x2 + . . . + bpxp
 Least squares method
min ∑(yi − ŷi)²

 Assumptions about the error term
 The error ε is a random variable with mean zero
 The variance of ε, denoted by σ², is the same for all values of
the independent variables
 The values of ε are independent
 The error ε is a normally distributed random variable reflecting
the deviation between the observed y value and the expected
value of y given by β0 + β1x1 + β2x2 + . . . + βpxp

 Testing for significance
 The F test is used to determine whether a significant
relationship exists between the dependent variable and the set
of all the independent variables
 The F test is referred to as the test for overall significance
 If the F test shows overall significance, the t test is used to
determine whether each of the individual independent variables
is significant
 A separate t test is conducted for each of the independent
variables in the model. Each t test is referred to as a test for
individual significance.

 Testing for Significance: F Test
 Hypotheses:
H0: β1 = β2 = . . . = βp = 0
Ha: One or more of the parameters is not equal to zero
 Test statistic:
F = MSR/MSE, where MSR = SSR/p and MSE = SSE/(n − p − 1)
 Rejection rule:
Reject H0 if p-value ≤ α or if F ≥ Fα, where Fα is based on an F
distribution with p d.f. in the numerator and n − p − 1 d.f. in the
denominator

 Testing for Significance: t Test
 Hypotheses: H0: βi = 0 Ha: βi ≠ 0
 Test statistic: t = bi / sbi, where sbi is the estimated standard error of bi
 Rejection rule:
Reject H0 if p-value ≤ α or if t ≤ −tα/2 or t ≥ tα/2,
where tα/2 is based on a t distribution with n − p − 1 degrees
of freedom.

 Testing for Significance: Multicollinearity
 Multicollinearity refers to the correlation among the independent
variables
 When the independent variables are highly correlated (e.g., |r| > .7), it is
difficult to determine the separate effect of any particular
independent variable on the dependent variable
 If the estimated regression equation is to be used only for predictive
purposes, multicollinearity is usually not a serious problem
 Every attempt should be made to avoid including independent variables
that are highly correlated
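A quick way to screen for multicollinearity is to compute the sample correlation between each pair of independent variables and compare it with the |r| > .7 rule of thumb. The sketch below does this for the miles and deliveries variables of the Butler Trucking example. The ten data values are assumed from the textbook's data table; they do not appear on these slides.

```python
from math import sqrt

# Assumed textbook values for the Butler Trucking example (not shown on the slides).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
deliveries = [4, 3, 4, 2, 2, 2, 3, 4, 3, 2]


def correlation(x, y):
    """Sample (Pearson) correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = correlation(miles, deliveries)
highly_correlated = abs(r) > 0.7  # rule of thumb from the slide

print(f"r(miles, deliveries) = {r:.2f}")
print("potential multicollinearity problem:", highly_correlated)
```

With these values the correlation is far below the .7 threshold, so multicollinearity is not a concern for this pair of variables.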

 Example: Butler Trucking Company
The business of Butler Trucking Company, an independent trucking
company in southern California, involves deliveries throughout its local
area. The managers want to predict the total daily travel time for their
drivers. They believed that the total daily travel time would be closely
related to the number of miles traveled and the number of deliveries
made daily. A simple random sample of 10 driving assignments provided
the data for the analysis.
Solution:
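y = β0 + β1x1 + β2x2 + ε
where y denotes total daily travel time, x1 denotes the number of miles
traveled, and x2 denotes the number of deliveries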

 ŷ = −0.869 + 0.061x1 + 0.923x2
 Interpretation of the coefficients
 bi represents an estimate of the change in y corresponding to a one-unit change in xi when
all other independent variables are held constant.
 b1 = 0.061 is an estimate of the expected increase in travel time corresponding to an
increase of one mile in the distance traveled when the number of deliveries is held constant.
 b2 = 0.923 is an estimate of the expected increase in travel time corresponding to one more
delivery made when the number of miles traveled is held constant.
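The estimated equation above can be reproduced by solving the least squares normal equations (X'X)b = X'y. The following is a minimal sketch in plain Python, not the software the lecture uses. The ten observations are assumed from the textbook's Butler Trucking table (they are not on the slides), and the helper names solve and least_squares are illustrative.

```python
# Assumed textbook values for the Butler Trucking example (not shown on the slides).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
deliveries = [4, 3, 4, 2, 2, 2, 3, 4, 3, 2]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy; inputs untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def least_squares(columns, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    rows = [[1.0] + [float(v) for v in vals] for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


b0, b1, b2 = least_squares([miles, deliveries], hours)
print(f"y-hat = {b0:.3f} + {b1:.4f} x1 + {b2:.3f} x2")
```

Under the assumed data, the fitted coefficients agree with the equation on the slide to the precision shown there.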
 Multiple coefficient of determination
 SST = SSR + SSE
where SST = total sum of squares = ∑(yi − ȳ)²
SSR = sum of squares due to regression = ∑(ŷi − ȳ)²
SSE = sum of squares due to error = ∑(yi − ŷi)²
 R² = SSR/SST = 21.6006/23.900 = .9038
Therefore, 90.38% of the variability in travel time y is explained by the
estimated multiple regression equation.
 In general, R² increases (and never decreases) as independent variables are added to the
model
 Many analysts prefer adjusting R² for the number of independent variables to
avoid overestimating the impact of adding an independent variable on the
amount of variability explained by the estimated regression equation:
Ra² = 1 − (1 − R²)(n − 1)/(n − p − 1) = 1 − (1 − .9038)(10 − 1)/(10 − 2 − 1) = .8763
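The arithmetic behind R² and adjusted R² can be checked directly from the sums of squares. This short sketch uses the SSR and SST values reported on the slide, with n = 10 driving assignments and p = 2 independent variables; the variable names are illustrative.

```python
# Sums of squares as reported for the Butler Trucking example.
sst = 23.900
ssr = 21.6006
n, p = 10, 2  # sample size and number of independent variables

sse = sst - ssr                                    # SST = SSR + SSE
r_sq = ssr / sst                                   # multiple coefficient of determination
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - p - 1)  # adjusted for the number of predictors

print(f"SSE       = {sse:.4f}")
print(f"R-sq      = {r_sq:.4f}")
print(f"R-sq(adj) = {r_sq_adj:.4f}")
```

The adjusted value is always at most R², and the gap widens as p grows relative to n.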

 F Test for overall significance
F = 32.88 with p-value < .01 indicates that we can reject H0: β1 = β2 = 0
and conclude that a significant relationship is present between travel
time y and the two independent variables.
 t Test for individual significance
b1 = .06113, sb1 = .00989, t = .06113/.00989 = 6.18, p < .01
b2 = .923, sb2 = .221, t = .923/.221 = 4.18, p < .01
We reject both H0: β1 = 0 and H0: β2 = 0
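The test statistics above follow from F = MSR/MSE and t = bi/sbi. The sketch below recomputes them from the reported sums of squares, estimates, and standard errors. The critical values are hard-coded from standard F and t tables for α = .01 with 2 and 7 degrees of freedom; they are an assumption of this sketch, since the slides report p-values rather than critical values.

```python
# Reported quantities for the Butler Trucking example.
sst, ssr = 23.900, 21.6006
n, p = 10, 2

msr = ssr / p                    # mean square due to regression
mse = (sst - ssr) / (n - p - 1)  # mean square error
f_stat = msr / mse

# (estimate, standard error) pairs as reported on the slide.
estimates = {"miles": (0.06113, 0.00989), "deliveries": (0.923, 0.221)}
t_stats = {name: b / se for name, (b, se) in estimates.items()}

# Assumed table values at alpha = .01: F(.01; 2, 7) and t(.005; 7).
F_CRIT = 9.55
T_CRIT = 3.499

print(f"F = {f_stat:.2f}, reject H0: {f_stat >= F_CRIT}")
for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f}, reject H0: {abs(t) >= T_CRIT}")
```

Both the overall test and the two individual tests reject H0 at the .01 level, in line with the p-values on the slide.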

Computer solution with Minitab
 Estimation and Prediction based on Estimated Regression
Equation
 Obtain the point estimate ŷ by substituting the given values x1,
x2, . . . , xp into the estimated regression equation
 The formulas required to develop interval estimates for the
mean value of y and for an individual value of y are beyond the
scope of the textbook. Software packages often provide these
estimates.
 Example: Butler Trucking Company
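A point estimate is obtained by plugging values into the estimated regression equation. The sketch below uses the rounded Butler Trucking coefficients from the earlier slide. The input values (100 miles, 2 deliveries) are hypothetical and chosen only for illustration, and the function name predict is not from the lecture.

```python
def predict(coefficients, x_values):
    """Evaluate y-hat = b0 + b1*x1 + ... + bp*xp."""
    b0, *slopes = coefficients
    if len(slopes) != len(x_values):
        raise ValueError("need one x value per slope coefficient")
    return b0 + sum(b * x for b, x in zip(slopes, x_values))


# Rounded coefficients from the estimated equation: (b0, b1, b2).
butler = (-0.869, 0.061, 0.923)

# Hypothetical assignment: 100 miles traveled, 2 deliveries.
y_hat = predict(butler, (100, 2))
print(f"estimated travel time: {y_hat:.3f}")
```

The same number serves as the point estimate of both the mean travel time and an individual travel time for such an assignment; only the interval estimates differ.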

 Residual Analysis
 For simple linear regression, the residual plot against x and the
residual plot against ŷ provide the same information
 In multiple regression analysis, it is preferable to use the
residual plot against ŷ to determine if the model assumptions
are satisfied.
 Standardized residual plot against ŷ
 Can identify outliers (typically, standardized residuals less than −2 or greater than +2)
 Can provide insight about the assumption that the error term ε has a
normal distribution

 Example: Butler Trucking Company
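The standardized residuals for this example can be computed as ei / (s √(1 − hi)), where hi is the leverage of observation i. The sketch below is one way to produce the numbers behind a standardized residual plot against ŷ. It is not the lecture's software output, and the data values are assumed from the textbook's Butler Trucking table (they are not on the slides).

```python
from math import sqrt

# Assumed textbook values for the Butler Trucking example (not shown on the slides).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
deliveries = [4, 3, 4, 2, 2, 2, 3, 4, 3, 2]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


rows = [[1.0, float(m), float(d)] for m, d in zip(miles, deliveries)]
k = len(rows[0])  # p + 1 estimated coefficients
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * y for r, y in zip(rows, hours)) for i in range(k)]
b = solve(xtx, xty)

fitted = [sum(bj * xj for bj, xj in zip(b, r)) for r in rows]
resid = [y - f for y, f in zip(hours, fitted)]
s = sqrt(sum(e * e for e in resid) / (len(hours) - k))  # sqrt of MSE
# Leverage h_i = x_i' (X'X)^-1 x_i, obtained by solving (X'X) z = x_i.
leverage = [sum(xj * zj for xj, zj in zip(r, solve(xtx, r))) for r in rows]
std_resid = [e / (s * sqrt(1 - h)) for e, h in zip(resid, leverage)]

for f, z in zip(fitted, std_resid):
    note = "  possible outlier" if abs(z) > 2 else ""
    print(f"y-hat = {f:6.3f}   standardized residual = {z:6.2f}{note}")
```

Plotting the second column against the first gives the standardized residual plot; under the assumed data, no observation falls outside the ±2 band.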
 Categorical independent variables
 Categorical independent variables such as gender (male, female), method of
payment (cash, check, credit card), etc. are commonly used in regression
analysis
 When including categorical independent variables in the regression model,
dummy or indicator variables should be used.
 If a categorical variable has k levels, k - 1 dummy variables are required, with
each dummy variable being coded as 0 or 1.
 For example, x2 might represent gender where x2=0 indicates male and x2=1
indicates female.
 For another example, method of payment (cash, check, credit card) should
be represented by two dummy variables in the regression model: x1 = 1 and
x2 = 0 indicates cash, x1 = 0 and x2 = 1 indicates check, and x1 = 0 and x2 = 0
indicates credit card.
 When a categorical variable has more than two levels, the dummy variable
coefficients are less straightforward to interpret: each one measures the
difference from the baseline level (the level coded with all zeros).
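The k − 1 coding rule can be sketched as a small function that turns one categorical value into its dummy variables. Treating the last listed level as the baseline (all zeros) is a choice made here to match the payment example above, and the name dummy_code is illustrative.

```python
def dummy_code(value, levels):
    """Return k - 1 indicator (0/1) variables for a variable with k levels.

    The last entry of `levels` is the baseline and is coded as all zeros.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return tuple(int(value == level) for level in levels[:-1])


payment_levels = ["cash", "check", "credit card"]  # k = 3, so 2 dummy variables

for method in payment_levels:
    x1, x2 = dummy_code(method, payment_levels)
    print(f"{method:12s} x1 = {x1}, x2 = {x2}")
```

Using only k − 1 dummies avoids a perfect linear dependence with the intercept, which a full set of k indicators would create.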

 Example: Johnson Filtration, Inc.
Johnson Filtration, Inc., provides maintenance service for water-
filtration systems throughout southern Florida. To estimate the service
time and the service cost, Johnson managers want to predict the repair
time necessary for each maintenance request. Repair time is believed to
be related to the number of months since the last maintenance service
and the type of repair problem (mechanical or electrical). Develop and
estimate a regression model showing how repair time is related to
number of months since the last maintenance service and the type of
repair problem.
Solution:
y = β0 + β1x1 + β2x2 +ε
where x1 denotes number of months, x2 denotes type of repair with 1
indicating electrical and 0 indicating mechanical

 ŷ = .93 + .3876x1 + 1.263x2
 At the .05 level of significance, the p-value of .001 associated with the F test (F=21.36)
indicates that the regression relationship is significant.
 The t test results indicate that both months since last service (p-value<.001) and type of repair
(p-value=.005) are statistically significant.
 R-sq=85.92% and R-sq(adj)=81.9% indicate that the estimated regression equation does a
good job of explaining the variability in repair time.
 b2 = 1.263 estimates the difference between the mean repair time for an electrical
repair and the mean repair time for a mechanical repair, for the same number of months
since the last service.
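Because x2 is a dummy variable, the estimated equation is really two parallel lines, one for each repair type. The sketch below evaluates both using the coefficients reported above. The value of 4 months is hypothetical, and the function name is illustrative.

```python
def estimated_repair_time(months, electrical):
    """y-hat = .93 + .3876*x1 + 1.263*x2, with x2 = 1 for electrical, 0 for mechanical."""
    return 0.93 + 0.3876 * months + 1.263 * electrical


months = 4  # hypothetical number of months since the last service

mechanical = estimated_repair_time(months, 0)
electrical = estimated_repair_time(months, 1)

print(f"mechanical: {mechanical:.4f}")
print(f"electrical: {electrical:.4f}")
print(f"difference: {electrical - mechanical:.3f}")
```

The difference equals b2 no matter which value of months is used, which is exactly the interpretation of the dummy variable coefficient.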
 Model curvilinear relationship
 Example: Sales of Laboratory Scales
 A manufacturer of laboratory scales wants to investigate the relationship between
the length of employment of their salespeople and the number of scales sold.
 We first develop a simple linear regression model and a scatter diagram to see
how the model fits the data


 The scatter diagram suggests a possible curvilinear relationship between
the length of time employed and the number of scales sold.
 We then develop a multiple regression model with two independent
variables, x and x²:
y = β0 + β1x + β2x² + ε
 This model is often referred to as a second-order polynomial or a
quadratic model.
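Fitting a quadratic model needs no new machinery: x² is simply added as a second column and ordinary least squares is applied. The sketch below illustrates this on made-up data generated exactly from y = 2 + 3x − 0.5x², so the fit should recover those coefficients. These are not the laboratory scales data, which are not reproduced on the slides.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


# Hypothetical data lying exactly on y = 2 + 3x - 0.5x^2 (illustration only).
x = [1, 2, 3, 4, 5, 6]
y = [4.5, 6.0, 6.5, 6.0, 4.5, 2.0]

# Design matrix with columns 1, x, x^2: the squared term is just another variable.
rows = [[1.0, float(v), float(v) ** 2] for v in x]
k = len(rows[0])
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]

b0, b1, b2 = solve(xtx, xty)
print(f"y-hat = {b0:.3f} + {b1:.3f} x + {b2:.3f} x^2")
```

The model is still linear in the parameters β0, β1, and β2, which is why the least squares method and the F and t tests from the earlier slides carry over unchanged.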