
Applied Business Statistics, 7th ed.
by Ken Black

Chapter 14
Building Multiple Regression Models

Copyright 2011 John Wiley & Sons, Inc.
Learning Objectives

Analyze and interpret nonlinear variables in multiple regression analysis.
Understand the role of qualitative variables and how to use them in multiple regression analysis.
Learn how to build and evaluate multiple regression models.
Learn how to detect influential observations in regression analysis.
Explain when to use logistic regression and interpret results.

General Linear Regression Model

Regression models presented thus far are based on the general linear regression model, which has the form

Y = β0 + β1X1 + β2X2 + β3X3 + . . . + βkXk + ε

where:
Y = the value of the dependent (response) variable
β0 = the regression constant
β1 = the partial regression coefficient of independent variable 1
β2 = the partial regression coefficient of independent variable 2
βk = the partial regression coefficient of independent variable k
k = the number of independent variables
ε = the error of prediction

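A minimal fitting sketch for a model of this form, assuming Python with statsmodels (the slides themselves use Excel and Minitab); the data here are invented purely for illustration.

import numpy as np
import statsmodels.api as sm

# Illustrative data only: two predictors and a known linear signal
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 10, 30)
x2 = rng.uniform(0, 5, 30)
y = 3 + 2 * x1 - 1.5 * x2 + rng.normal(0, 1, 30)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the beta_0 column
fit = sm.OLS(y, X).fit()                        # ordinary least squares
print(fit.params)    # estimates of beta_0, beta_1, beta_2
print(fit.pvalues)   # t-test p-values for each coefficient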
General Linear Regression Model

In the general linear model, the parameters, βi, are linear.
However, the dependent variable, y, is not necessarily linearly related to the predictor variables.
Multiple regression response surfaces are not restricted to linear surfaces and may be curvilinear.
Regression models can be developed for more than two predictors.
Polynomial Regression

Regression models in which the highest power of any predictor variable is 1 and in which there are no interaction terms are referred to as first-order models.
If a second independent variable is added, the model is referred to as a first-order model with two independent variables.
Polynomial regression models are regression models that are second- or higher-order models; they contain squared, cubed, or higher powers of the predictor variable(s).
Nonlinear Models: Mathematical Transformation

Y = β0 + β1X1 + β2X2 + ε                                First-order with Two Independent Variables

Y = β0 + β1X1 + β2X1² + ε                               Second-order with One Independent Variable

Y = β0 + β1X1 + β2X2 + β3X1X2 + ε                       Second-order with an Interaction Term

Y = β0 + β1X1 + β2X2 + β3X1² + β4X2² + β5X1X2 + ε       Second-order with Two Independent Variables

Sales Data and Scatter Plot
for 13 Manufacturing Companies

Consider the table on the next slide. The table contains sales for 13 manufacturing companies along with the number of manufacturing representatives associated with each firm. A simple regression analysis to predict sales by the number of manufacturer's representatives results in the Excel output shown two slides ahead.

Sales Data and Scatter Plot
for 13 Manufacturing Companies

Manufacturer   Sales ($1,000,000)   Number of Manufacturing Representatives
 1                2.1                 2
 2                3.6                 1
 3                6.2                 2
 4               10.4                 3
 5               22.8                 4
 6               35.6                 4
 7               57.1                 5
 8               83.5                 5
 9              109.4                 6
10              128.6                 7
11              196.8                 8
12              280.0                10
13              462.3                11

[Scatter plot: Sales vs. Number of Representatives]

Excel Simple Linear Regression Output
for the Manufacturing Example
Regression Statistics
Multiple R 0.933
R Square 0.870
Adjusted R Square 0.858
Standard Error 51.10
Observations 13

            Coefficients  Standard Error  t Stat  P-value
Intercept   -107.03       28.737          -3.72   0.003
numbers       41.026       4.779           8.58   0.000

ANOVA
df SS MS F Significance F
Regression 1 192395 192395 73.69 0.000
Residual 11 28721 2611
Total 12 221117

Sales Data and Scatter Plot
for 13 Manufacturing Companies

The researcher creates a second predictor variable, (number of manufacturer's representatives)², to use in the regression analysis to predict sales along with the number of manufacturer's representatives. This variable can be created to explore second-order (parabolic) relationships by squaring the data from the independent variable of the linear model and entering it into the analysis. With the new data, a multiple regression model can be developed, as in the sketch below.
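A minimal sketch of this step, assuming Python with statsmodels (the slides use Excel), using the 13-company data from the table:

import numpy as np
import statsmodels.api as sm

sales = np.array([2.1, 3.6, 6.2, 10.4, 22.8, 35.6, 57.1,
                  83.5, 109.4, 128.6, 196.8, 280.0, 462.3])
reps = np.array([2, 1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 10, 11], dtype=float)

# Create the squared predictor and fit the quadratic (second-order) model
X = sm.add_constant(np.column_stack([reps, reps ** 2]))
quad = sm.OLS(sales, X).fit()
print(quad.rsquared)  # should be close to the 0.973 reported on the later slide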

Manufacturing Data
with Newly Created Variable

Manufacturer   Sales ($1,000,000)   No. Mfgr Reps, X1   (No. Mfgr Reps)², X2 = (X1)²
1 2.1 2 4
2 3.6 1 1
3 6.2 2 4
4 10.4 3 9
5 22.8 4 16
6 35.6 4 16
7 57.1 5 25
8 83.5 5 25
9 109.4 6 36
10 128.6 7 49
11 196.8 8 64
12 280.0 10 100
13 462.3 11 121

Computer Output for Quadratic
Model to Predict Sales
Regression Statistics
Multiple R 0.986
R Square 0.973
Adjusted R Square 0.967
Standard Error 24.593
Observations 13

            Coefficients  Standard Error  t Stat  P-value
Intercept     18.067      24.673           0.73   0.481
MfgrRp       -15.723       9.5450         -1.65   0.131
MfgrRpSq       4.750       0.776           6.12   0.000

ANOVA
df SS MS F Significance F
Regression 2 215069 107534 177.79 0.000
Residual 10 6048 605
Total 12 221117

Tukey’s Ladder of Transformations

Tukey’s ladder of expressions can be used to straighten out a plot of x and y. Tukey used a four-quadrant approach to show which expressions on the ladder are more appropriate for a given situation.
If the scatter plot of x and y indicates a shape like that shown in the upper left quadrant, recoding should move “down the ladder” for the x variable (toward √x, log x, −1/√x, …) or “up the ladder” for the y variable (toward y², y³, …).
If the scatter plot of x and y indicates a shape like that of the lower right quadrant, the recoding should move “up the ladder” for the x variable (toward x², x³, …) or “down the ladder” for the y variable (toward √y, log y, −1/√y, …).
A sketch of this trial-and-error re-expression idea follows.
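One rough way to automate the ladder idea, sketched in Python (my own illustration, not the book's procedure; a real analysis would inspect the re-expressed scatter plots):

import numpy as np

def best_ladder_reexpression(x, y):
    """Try common ladder re-expressions of x and return the one whose
    straight-line correlation with y is strongest."""
    ladder = {
        "x^2": lambda v: v ** 2,      # up the ladder
        "x": lambda v: v,             # no re-expression
        "sqrt(x)": np.sqrt,           # down the ladder
        "log(x)": np.log,             # further down
        "-1/x": lambda v: -1.0 / v,   # bottom rungs
    }
    scores = {name: abs(np.corrcoef(f(x), y)[0, 1])
              for name, f in ladder.items()}
    return max(scores, key=scores.get)

x = np.arange(1.0, 9.0)
y = np.array([2.1, 3.6, 6.2, 10.4, 22.8, 35.6, 57.1, 83.5])  # convex growth
print(best_ladder_reexpression(x, y))  # convex data tends to favor an up-the-ladder choice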

Tukey’s Four Quadrant Approach

Regression Models with Interaction

When two different independent variables are used in a regression analysis, an interaction may occur between the two variables. Interaction can be examined as a separate independent variable: an interaction predictor variable can be designed by multiplying the data values of one variable by the values of another variable, thereby creating a new variable, as in the sketch below.
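A minimal sketch of building such a product term, assuming Python with statsmodels and invented data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.uniform(1, 10, 40)
x2 = rng.uniform(1, 10, 40)
y = 5 + 2 * x1 + 3 * x2 + 0.8 * x1 * x2 + rng.normal(0, 2, 40)

inter = x1 * x2  # the new interaction predictor, built by multiplication
X = sm.add_constant(np.column_stack([x1, x2, inter]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # the last coefficient estimates the interaction effect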

Example – Three Stocks

Suppose the data in the following table represent the closing stock prices for three corporations over a period of 15 months. An investment firm wants to use the prices for stocks 2 and 3 to develop a regression model to predict the price of stock 1.

Prices of Three Stocks over
a 15-Month Period
Stock 1 Stock 2 Stock 3
41 36 35
39 36 35
38 38 32
45 51 41
41 52 39
43 55 55
47 57 52
49 58 54
41 62 65
35 70 77
36 72 75
39 74 74
33 83 81
28 101 92
31 107 91
Regression Models for the Three Stocks

First-order with Two Independent Variables:
Y = β0 + β1X1 + β2X2
where: Y = price of stock 1
       X1 = price of stock 2
       X2 = price of stock 3

Second-order with an Interaction Term:
Y = β0 + β1X1 + β2X2 + β3X1X2
equivalently, Y = β0 + β1X1 + β2X2 + β3X3
where: Y = price of stock 1
       X1 = price of stock 2
       X2 = price of stock 3
       X3 = X1X2 (the interaction term)

Regression for Three Stocks:
First-order, Two Independent Variables
The regression equation is
Stock 1 = 50.9 - 0.119 Stock 2 - 0.071 Stock 3
Predictor Coef StDev T P
Constant 50.855 3.791 13.41 0.000
Stock 2 -0.1190 0.1931 -0.62 0.549
Stock 3 -0.0708 0.1990 -0.36 0.728
S = 4.570 R-Sq = 47.2% R-Sq(adj) = 38.4%
Analysis of Variance
Source DF SS MS F P
Regression 2 224.29 112.15 5.37 0.022
Error 12 250.64 20.89
Total 14 474.93

Regression for Three Stocks:
Second-order With an Interaction Term
The regression equation is
Stock 1 = 12.0 + 0.879 Stock 2 + 0.220 Stock 3 – 0.00998 Inter
Predictor Coef StDev T P
Constant 12.046 9.312 1.29 0.222
Stock 2 0.8788 0.2619 3.36 0.006
Stock 3 0.2205 0.1435 1.54 0.153
Inter -0.009985 0.002314 -4.31 0.001
S = 2.909 R-Sq = 80.4% R-Sq(adj) = 75.1%
Analysis of Variance
Source DF SS MS F P
Regression 3 381.85 127.28 15.04 0.000
Error 11 93.09 8.46
Total 14 474.93

Nonlinear Regression Models:
Model Transformation

Exponential model:   y = β0 β1^x

Taking logarithms:   log y = log β0 + x log β1

which is linear in x:   ŷ′ = β0′ + β1′x

where: ŷ′ = log ŷ
       β0′ = log β0
       β1′ = log β1
Data Set for Model Transformation Example
to Predict Sales by Adv. Expenditure

ORIGINAL DATA                     TRANSFORMED DATA

Company     Y        X            Company    LOG Y       X
1 2580 1.2 1 3.41162 1.2
2 11942 2.6 2 4.077077 2.6
3 9845 2.2 3 3.993216 2.2
4 27800 3.2 4 4.444045 3.2
5 18926 2.9 5 4.277059 2.9
6 4800 1.5 6 3.681241 1.5
7 14550 2.7 7 4.162863 2.7
Y = Sales ($ million/year) X = Advertising ($ million/year)

Regression Output for Model
Transformation Example
Regression Statistics
Multiple R 0.990
R Square 0.980
Adjusted R Square 0.977
Standard Error 0.054
Observations 7

            Coefficients  Standard Error  t Stat  P-value
Intercept     2.9003      0.0729          39.80   0.000
X             0.4751      0.0300          15.82   0.000

ANOVA
df SS MS F Significance F
Regression 1 0.7392 0.7392 250.36 0.000
Residual 5 0.0148 0.0030
Total 6 0.7540

Prediction with the Transformed Model

Ŷ = b0 · b1^x

log Ŷ = log b0 + x log b1 = 2.9003 + 0.4751x

For x = 2:
log Ŷ = 2.9003 + (2)(0.4751) = 3.8505
Ŷ = antilog(3.8505) ≈ 7087.61 ($ million)

A code sketch of this fit-and-back-transform cycle follows.

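A minimal sketch, assuming Python with statsmodels and using the advertising data from the earlier table:

import numpy as np
import statsmodels.api as sm

y = np.array([2580, 11942, 9845, 27800, 18926, 4800, 14550], dtype=float)
x = np.array([1.2, 2.6, 2.2, 3.2, 2.9, 1.5, 2.7])

# Fit a straight line to log10(y), then predict and take the antilog
fit = sm.OLS(np.log10(y), sm.add_constant(x)).fit()
b0, b1 = fit.params                 # roughly 2.9003 and 0.4751, as in the output
log_yhat = b0 + b1 * 2.0            # prediction at x = 2
print(10 ** log_yhat)               # antilog: roughly 7088 ($ million)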
Indicator (Dummy) Variables

Some variables are referred to as qualitative variables.
Qualitative variables do not yield quantifiable outcomes; they yield nominal- or ordinal-level information and are used more to categorize items.
In regression, such variables are coded as indicator, or dummy, variables, as in the sketch below.
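A minimal sketch of dummy coding, assuming Python with statsmodels; the salary figures are invented stand-ins, since the slide's data table appears only as an image:

import numpy as np
import statsmodels.api as sm

age = np.array([25, 32, 41, 28, 37, 45, 30, 52, 39, 44], dtype=float)
gender = np.array([0, 1, 1, 0, 0, 1, 1, 0, 1, 0], dtype=float)  # 0 = female, 1 = male
rng = np.random.default_rng(2)
salary = 1.7 + 0.011 * age + 0.46 * gender + rng.normal(0, 0.05, 10)

X = sm.add_constant(np.column_stack([age, gender]))
fit = sm.OLS(salary, X).fit()
print(fit.params)  # the gender coefficient estimates the male-female salary gap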

Monthly Salary Example

As an example, consider the issue of sex discrimination in the salary earnings of workers in some industries. In examining this issue, suppose a random sample of 15 workers is drawn from a pool of employed laborers in a particular industry and the workers’ average monthly salaries are determined, along with their age and gender. The data are shown in the following table. As sex can be only male or female, this variable is coded as a dummy variable with 0 = female, 1 = male.

Data for the Monthly Salary Example

Regression Output
for the Monthly Salary Example
The regression equation is
Salary = 1.732 + 0.111 Age + 0.459 Gender
Predictor Coef StDev T P
Constant 1.7321 0.2356 7.35 0.000
Age 0.11122 0.07208 1.54 0.149
Gender 0.45868 0.05346 8.58 0.000
S = 0.09679 R-Sq = 89.0% R-Sq(adj) = 87.2%
Analysis of Variance
Source DF SS MS F P
Regression 2 0.90949 0.45474 48.54 0.000
Error 12 0.11242 0.00937
Total 14 1.02191

Regression Output
for the Monthly Salary Example

MODEL-BUILDING

Suppose a researcher wants to develop a multiple regression model to predict the world production of crude oil. The researcher decides to use as predictors the following five independent variables:
U.S. energy consumption
Gross U.S. nuclear electricity generation
U.S. coal production
Total U.S. dry gas (natural gas) production
Fuel rate of U.S.-owned automobiles

Data for Multiple Regression
to Predict Crude Oil Production

Y  = World Crude Oil Production
X1 = U.S. Energy Consumption
X2 = U.S. Nuclear Generation
X3 = U.S. Coal Production
X4 = U.S. Dry Gas Production
X5 = U.S. Fuel Rate for Autos

  Y     X1     X2      X3     X4     X5
 55.7   74.3   83.5   598.6   21.7   13.30
 55.7   72.5  114.0   610.0   20.7   13.42
 52.8   70.5  172.5   654.6   19.2   13.52
 57.3   74.4  191.1   684.9   19.1   13.53
 59.7   76.3  250.9   697.2   19.2   13.80
 60.2   78.1  276.4   670.2   19.1   14.04
 62.7   78.9  255.2   781.1   19.7   14.41
 59.6   76.0  251.1   829.7   19.4   15.46
 56.1   74.0  272.7   823.8   19.2   15.94
 53.5   70.8  282.8   838.1   17.8   16.65
 53.3   70.5  293.7   782.1   16.1   17.14
 54.5   74.1  327.6   895.9   17.5   17.83
 54.0   74.0  383.7   883.6   16.5   18.20
 56.2   74.3  414.0   890.3   16.1   18.27
 56.7   76.9  455.3   918.8   16.6   19.20
 58.7   80.2  527.0   950.3   17.1   19.87
 59.9   81.3  529.4   980.7   17.3   20.31
 60.6   81.3  576.9  1029.1   17.8   21.02
 60.2   81.1  612.6   996.0   17.7   21.69
 60.2   82.1  618.8   997.5   17.8   21.68
 60.6   83.9  610.3   945.4   18.2   21.04
 60.9   85.6  640.4  1033.5   18.9   21.48

Regression Analysis for
Crude Oil Production

All Possible Regressions with Five
Independent Variables

Single Predictor   Two Predictors   Three Predictors   Four Predictors   Five Predictors
X1 X1,X2 X1,X2,X3 X1,X2,X3,X4 X1,X2,X3,X4,X5
X2 X1,X3 X1,X2,X4 X1,X2,X3,X5
X3 X1,X4 X1,X2,X5 X1,X2,X4,X5
X4 X1,X5 X1,X3,X4 X1,X3,X4,X5
X5 X2,X3 X1,X3,X5 X2,X3,X4,X5
X2,X4 X1,X4,X5
X2,X5 X2,X3,X4
X3,X4 X2,X3,X5
X3,X5 X2,X4,X5
X4,X5 X3,X4,X5

Model-Building: Search Procedures

Search procedures are processes whereby more than one multiple regression model is developed for a given database, and the models are compared and sorted by different criteria, depending on the given procedure:
All Possible Regressions (sketched in code below)
Stepwise Regression
Forward Selection
Backward Elimination
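A sketch of the first procedure, all possible regressions, assuming Python with statsmodels and invented data; with k = 5 candidate predictors there are 2^5 − 1 = 31 models, matching the table shown earlier:

import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X_full = rng.normal(size=(30, 5))                 # invented stand-ins for X1..X5
y = X_full @ np.array([2.0, 0, 1.0, 0, 0]) + rng.normal(size=30)

results = {}
for r in range(1, 6):
    for subset in itertools.combinations(range(5), r):  # all 31 subsets
        X = sm.add_constant(X_full[:, list(subset)])
        fit = sm.OLS(y, X).fit()
        results[tuple(f"X{j + 1}" for j in subset)] = fit.rsquared_adj

print(max(results, key=results.get))  # the subset with the highest adjusted R^2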

Stepwise Regression

Stepwise regression is a step-by-step process that begins by developing a regression model with a single predictor variable and adds and deletes predictors one step at a time:
Perform k simple regressions; select the best as the initial model.
Evaluate each variable not in the model.
If none meet the criterion, stop.
Otherwise, add the best variable to the model; reevaluate the variables already in the model, and drop any that are no longer significant.
Return to the previous step.
A sketch of this loop follows.
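A compact sketch of that loop, assuming Python with statsmodels; the entry and removal alphas are illustrative defaults, not the book's:

import numpy as np
import statsmodels.api as sm

def stepwise(y, X_full, alpha_in=0.05, alpha_out=0.10):
    """Enter the most significant candidate each step, then drop any
    previously entered predictor that is no longer significant."""
    included = []
    while True:
        candidates = [j for j in range(X_full.shape[1]) if j not in included]
        pvals = {}
        for j in candidates:
            X = sm.add_constant(X_full[:, included + [j]])
            pvals[j] = sm.OLS(y, X).fit().pvalues[-1]  # p-value of the new term
        if not pvals or min(pvals.values()) > alpha_in:
            return included                            # nothing qualifies: stop
        included.append(min(pvals, key=pvals.get))
        fit = sm.OLS(y, sm.add_constant(X_full[:, included])).fit()
        worst = int(fit.pvalues[1:].argmax())          # skip the constant
        if fit.pvalues[1:][worst] > alpha_out:
            included.pop(worst)                        # drop a lapsed predictor

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.5, 0, 0, 2.0, 0]) + rng.normal(size=40)
print(stepwise(y, X))  # typically selects columns 0 and 3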

Stepwise: Step 1 - Simple Regression
Results for Each Independent Variable

Dependent Independent
Variable Variable t-Ratio R2
Y X1 11.77 85.2%
Y X2 4.43 45.0%
Y X3 3.91 38.9%
Y X4 1.08 4.6%
Y X5 3.54 34.2%

Stepwise Regression

Step 2: Two Predictors
Step 3: Three Predictors

[Figure: regression output for steps 2 and 3]
Minitab Stepwise Regression Output

Forward Selection

Forward selection is like stepwise regression, but once a variable is entered into the process, it is never dropped. Forward selection begins by finding the independent variable that will produce the largest absolute value of t (and the largest R²) in predicting y. A minimal sketch follows.
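A sketch under the same assumptions as the stepwise code above (Python, statsmodels); the only change is that nothing is ever removed:

import statsmodels.api as sm

def forward_selection(y, X_full, alpha_in=0.05):
    included = []
    while True:
        candidates = [j for j in range(X_full.shape[1]) if j not in included]
        pvals = {}
        for j in candidates:
            X = sm.add_constant(X_full[:, included + [j]])
            pvals[j] = sm.OLS(y, X).fit().pvalues[-1]
        if not pvals or min(pvals.values()) > alpha_in:
            return included                    # no remaining candidate qualifies
        included.append(min(pvals, key=pvals.get))  # entered for good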

Backward Elimination

Start with the “full model” (all k predictors).
If all predictors are significant, stop.
Otherwise, eliminate the most non-significant predictor and return to the previous step.
A sketch of this loop follows.
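A sketch of the elimination loop, assuming Python with statsmodels and an illustrative significance level:

import statsmodels.api as sm

def backward_elimination(y, X_full, alpha_out=0.05):
    included = list(range(X_full.shape[1]))      # start with the full model
    while included:
        fit = sm.OLS(y, sm.add_constant(X_full[:, included])).fit()
        worst = int(fit.pvalues[1:].argmax())    # most non-significant term
        if fit.pvalues[1:][worst] <= alpha_out:
            return included                      # all significant: stop
        included.pop(worst)                      # eliminate it and refit
    return included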

Backward Elimination: Oil Production
Step 1:

Step 2:

Backward Elimination

Step 3:

Step 4:

Multicollinearity

Multicollinearity is a condition that occurs when two or more of the independent variables of a multiple regression model are highly correlated. Its symptoms include:
Difficulty interpreting the estimates of the regression coefficients
Inordinately small t values for the regression coefficients
Overestimated standard deviations of the regression coefficients
Predictor coefficients whose signs are opposite of what is expected
A diagnostic sketch follows.
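One standard diagnostic not shown on the slide is the variance inflation factor; a minimal sketch, assuming Python with statsmodels and deliberately collinear invented data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)   # nearly a copy of x1
x3 = rng.normal(size=50)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):             # skip the constant column
    print(f"VIF(x{i}) =", variance_inflation_factor(X, i))
# x1 and x2 should show very large VIFs; values above about 10 are a common red flag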

Correlations among Oil
Production Predictor Variables

                     Energy
                     Consumption  Nuclear   Coal    Dry Gas  Fuel Rate
Energy Consumption     1           0.856    0.791    0.057    0.796
Nuclear                0.856       1        0.952   -0.404    0.972
Coal                   0.791       0.952    1       -0.448    0.968
Dry Gas                0.057      -0.404   -0.448    1       -0.423
Fuel Rate              0.796       0.972    0.968   -0.423    1

Logistic Regression Model

The logistic model is of the form:

f(x) = p = e^u / (1 + e^u)

Also called an “S-shaped” curve.

When there is only one predictor variable:
u = β0 + β1x1

For multiple predictor variables:
u = β0 + β1x1 + ... + βkxk

A fitting sketch follows.

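A minimal fitting sketch, assuming Python with statsmodels; the auto club data on the following slides appear only as images, so the binary response here is simulated:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
age = rng.uniform(20, 80, 200)
p_true = 1 / (1 + np.exp(-(-6 + 0.12 * age)))   # an s-shaped curve in age
returned = rng.binomial(1, p_true)              # 1 = member returned the form

fit = sm.Logit(returned, sm.add_constant(age)).fit(disp=0)
print(fit.params)              # beta_0 and beta_1 on the logit (log-odds) scale
print(np.exp(fit.params[1]))   # odds multiplier per additional year of age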
Scatterplot of Auto Club Data
with and without Logistic Model

Logistic Regression Output

Determining the Logistic Regression Model

The fitted logistic model has constant −20.7540 and Age coefficient 0.433680.

The log of the odds ratio, or logit, equation:
ln(S) = −20.7540 + 0.433680 Age

For a member aged 50, ln(S) = −20.7540 + 0.433680(50) = 0.93. The antilog of this value gives the odds, S:
S = e^0.93 = 2.535
so the odds are 2.535 to 1 (that is, 0.7171/0.2829).

The probability that a member will return the form:
f(x) = p = e^(−20.7540 + 0.433680 Age) / (1 + e^(−20.7540 + 0.433680 Age)) = 2.535 / 3.535 = 0.7171