Multiple Regression

Revised 9/16/2013
Summary
Data Input
Analysis Summary
Analysis Options
Component Effects Plot
Conditional Sums of Squares
Observed versus Predicted
Residual Plots
Unusual Residuals
Influential Points
Confidence Intervals
Correlation Matrix
Reports
Interval Plots
Autocorrelated Data
Save Results
Calculations
Summary
The Multiple Regression procedure is designed to construct a statistical model describing the
impact of two or more quantitative factors X on a dependent variable Y. The procedure
includes an option to perform a stepwise regression, in which a subset of the X variables is
selected. The fitted model may be used to make predictions, including confidence limits and/or
prediction limits. Residuals may also be plotted and influential observations identified.
The procedure contains additional options for transforming the data using either a Box-Cox or
Cochrane-Orcutt transformation. The first option is useful for stabilizing the variability of the
data, while the second is useful for handling time series data in which the residuals exhibit serial
correlation.
Sample Data
The file 93cars.sgd contains information on 26 variables for n = 93 models of automobiles, taken
from Lock (1993). The table below shows a partial list of 4 columns from that file:
A model is desired that can predict MPG Highway from Weight, Horsepower, Wheelbase, and
Drivetrain.
2013 by StatPoint Technologies, Inc. Multiple Regression - 1
STATGRAPHICS – Rev. 9/16/2013
Data Input
The data input dialog box requests the names of the columns containing the dependent variable
Y and the independent variables X:
X: numeric columns containing the n values for the independent variables X. Either column
names or STATGRAPHICS expressions may be entered.
In the example, note the use of the expression Weight^2 to add a second-order term involving the
weight of the vehicle. This was added after examining an X-Y plot that showed significant
curvature with respect to Weight. The categorical factor Drivetrain has also been introduced into
the model through the Boolean expression Drivetrain="front", which sets up an indicator
variable that takes the value 1 if true and 0 if false. The model to be fit thus takes the form:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_3 + \beta_5 X_4$$

where Y = MPG Highway, X1 = Weight, X2 = Horsepower, X3 = Wheelbase, and X4 is the indicator
variable for Drivetrain="front".
Analysis Summary
The Analysis Summary shows information about the fitted model.
Parameter              Estimate        Standard Error   T Statistic   P-Value
CONSTANT               49.8458         10.5262          4.73539       0.0000
Weight                 -0.0273685      0.00530942       -5.1547       0.0000
Weight^2               0.00000261405   8.383E-7         3.11827       0.0025
Horsepower             0.0145764       0.009668         1.50769       0.1353
Wheelbase              0.338687        0.103479         3.273         0.0015
Drive Train="front"    0.632343        0.73879          0.855918      0.3944
Analysis of Variance
Source Sum of Squares Df Mean Square F-Ratio P-Value
Model 1902.18 5 380.435 46.41 0.0000
Residual 713.136 87 8.19696
Total (Corr.) 2615.31 92
Variables: identification of the dependent variable. The general form of the model is

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$$

Coefficients: the estimated coefficients, standard errors, t-statistics, and P values. The
estimates of the model coefficients can be used to write the fitted equation, which in the
example is

MPG Highway = 49.8458 - 0.0273685 Weight + 0.00000261405 Weight^2
+ 0.0145764 Horsepower + 0.338687 Wheelbase + 0.632343 (Drivetrain="front")
R-squared - represents the percentage of the variability in Y which has been explained by the
fitted regression model, ranging from 0% to 100%. For the sample data, the regression has
accounted for about 72.7% of the variability in the miles per gallon. The remaining 27.3% is
attributable to deviations from the model, which may be due to other factors, to measurement
error, or to a failure of the current model to fit the data adequately.
Adjusted R-Squared – the R-squared statistic, adjusted for the number of coefficients in the
model. This value is often used to compare models with different numbers of coefficients.
Standard Error of Est. – the estimated standard deviation of the residuals (the deviations
around the model). This value is used to create prediction limits for new observations.
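Outside of STATGRAPHICS, the same model can be reproduced with standard least-squares software. Below is a minimal sketch using Python's statsmodels, assuming the 93cars data have been exported to a hypothetical CSV file with the column names shown:

```python
# A minimal sketch of the fit above, assuming 93cars.sgd has been exported
# to "93cars.csv" with columns MPG_Highway, Weight, Horsepower, Wheelbase,
# and Drivetrain (the file and column names are assumptions, not part of
# the manual).
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("93cars.csv")
model = smf.ols(
    "MPG_Highway ~ Weight + I(Weight**2) + Horsepower + Wheelbase"
    " + I(Drivetrain == 'front')",
    data=cars,
).fit()
print(model.summary())                      # coefficients, t-stats, P-values
print(model.rsquared, model.rsquared_adj)   # R-squared statistics
```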
Analysis Options
Fitting Procedure – specifies the method used to fit the regression model. The options are:
o Ordinary Least Squares – fits a model using all of the independent variables.
o Box-Cox Optimization – fits a model involving all of the independent variables. The
dependent variable, however, is modified by raising it to a power. The method of Box
and Cox is used to determine the optimum power. Box-Cox transformations are a way
of dealing with situations in which the deviations from the regression model do not
have a constant variance.
Power – Specifies the power to which the dependent variable is raised. The default value of
1.0 implies no power transformation.
Addend – Specifies an amount that is added to the dependent variable before it is raised to
the specified power.
Autocorrelation – Specifies the lag 1 autocorrelation of the residuals. The default value of
0.0 implies that the residuals are assumed to be independent. If the Cochrane-Orcutt
procedure is used, this value provides the starting value for the procedure.
F-to-Remove - In a stepwise regression, variables will be removed from the model at a given
step if their F values are less than the F-to-Remove value specified.
P-to-Enter - In a stepwise regression, variables will be entered into the model at a given step
if their P values are less than or equal to the P-to-Enter value specified.
P-to-Remove - In a stepwise regression, variables will be removed from the model at a given
step if their P values are greater than the P-to-Remove value specified.
Max Steps – maximum number of steps permitted when doing a stepwise regression.
Display – whether to display the results at each step when doing a stepwise regression.
Force in… - after a model has been fit using stepwise selection, push this button to force one
or more variables that were not selected to be included in the model.
Force out… - after a model has been fit using any method, push this button to forcibly
remove one or more variables from the model.
Forward selection – Begins with a model involving only a constant term and enters
one variable at a time based on its statistical significance if added to the current
model. At each step, the algorithm brings into the model the variable that will be the
most statistically significant if entered. Selection of variables is based on either an F-
to-enter test or a P-to-enter test In the former case, as long as the most significant
variable has an F value greater or equal to that specified on the Analysis Summary
dialog box, it will be brought into the model. When no variable has a large enough F
value, variable selection stops. In addition, variables brought into the model early in
the procedure may be removed later if their F value falls below the F-to-remove
criterion.
Backward selection – Begins with a model involving all the variables specified on
the data input dialog box and removes one variable at a time based on its statistical
significance in the current model. At each step, the algorithm removes from the
model the variable that is the least statistically significant. Removal of variables is
based on either an F-to-remove test or a P-to-remove test. In the former case, if the
least significant variable has an F value less than that specified on the Analysis
Summary dialog box, it will be removed from the model. When all remaining
variables have large F values, the procedure stops. In addition, variables removed
from the model early in the procedure may be re-entered later if their F values reach
the F-to-enter criterion.
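The following sketch illustrates the backward-selection logic using the P-to-remove criterion; it is an approximation of the procedure described above, not STATGRAPHICS's internal implementation (for single-coefficient terms the F and P criteria drop the same variable).

```python
# Sketch of backward selection with a P-to-remove criterion, assuming X is
# a pandas DataFrame of candidate predictors and y is the response.
import statsmodels.api as sm

def backward_select(X, y, p_to_remove=0.05):
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()              # least significant variable
        if pvals[worst] <= p_to_remove:     # all survivors are significant
            return fit, cols
        cols.remove(worst)
    return None, []
```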
Stepwise regression
Method: backward selection
F-to-enter: 4.0
F-to-remove: 4.0
Step 0:
5 variables in the model. 87 d.f. for error.
R-squared = 72.73% Adjusted R-squared = 71.17% MSE = 8.19696
Step 1:
Removing variable Drive Train="front" with F-to-remove =0.732595
4 variables in the model. 88 d.f. for error.
R-squared = 72.50% Adjusted R-squared = 71.25% MSE = 8.17206
Step 2:
Removing variable Horsepower with F-to-remove =2.22011
3 variables in the model. 89 d.f. for error.
R-squared = 71.81% Adjusted R-squared = 70.86% MSE = 8.28409
In the first step, Drivetrain is removed since it is the least significant. At the second step,
Horsepower is removed. The algorithm then stops, since all remaining variables have F-to-
remove values greater than 4, and all previously removed variables have F-to-enter values less
than 4.
Parameter    Estimate        Standard Error   T Statistic   P-Value
CONSTANT     51.8628         10.2179          5.07569       0.0000
Weight       -0.0245435      0.00506191       -4.84867      0.0000
Weight^2     0.00000236841   8.25606E-7       2.86869       0.0051
Wheelbase    0.28345         0.0899993        3.14947       0.0022
Analysis of Variance
Source Sum of Squares Df Mean Square F-Ratio P-Value
Model 1878.03 3 626.009 75.57 0.0000
Residual 737.284 89 8.28409
Total (Corr.) 2615.31 92
NOTE: from here forward in this document, the results will be based on the reduced model
without Drivetrain or Horsepower.
The Box-Cox method is based on the class of power transformations

$$Y' = (Y + \lambda_2)^{\lambda_1} \qquad (5)$$

in which the data is raised to a power λ1 after shifting it by a certain amount λ2. Often, the shift
parameter λ2 is set equal to 0. This class includes square roots (λ1 = 0.5), logarithms (λ1 = 0),
reciprocals (λ1 = -1), and other common transformations, depending on the power.
Using Analysis Options, you can specify the values for λ1 or λ2, or specify just λ2 and have the
program find an optimal value for λ1 using the methods proposed by Box and Cox (1964).
For the sample data, a plot of the residuals versus predicted values does show some change in
variability as the predicted value changes:
[Residual Plot: Studentized residuals versus predicted MPG Highway]
The smaller cars tend to be somewhat more variable than the larger cars. Asking the program to
optimize the Box-Cox transformation yields:
Analysis of Variance
Source Sum of Squares Df Mean Square F-Ratio P-Value
Model 1568.28 3 522.761 74.99 0.0000
Residual 620.444 89 6.97128
Total (Corr.) 2188.73 92
Apparently, an inverse square root of MPG Highway improves the properties of the residuals, as
illustrated in the new residual plot:
[Residual Plot after the Box-Cox transformation: Studentized residuals versus predicted MPG Highway]
Note: some caution is necessary here, since the transformation may be heavily
influenced by one or two outliers. To simplify the discussion that follows, the rest of
this document will work with the untransformed model.
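For readers who wish to replicate the optimization step outside STATGRAPHICS, the sketch below scans candidate powers and evaluates the Box-Cox profile log-likelihood of the regression; it assumes an unweighted model with design matrix X (including the constant column) and a strictly positive response y.

```python
# Sketch of Box-Cox power selection for a regression (Box and Cox, 1964),
# assuming X includes a constant column and y > 0; grid search over powers.
import numpy as np

def boxcox_regression_lambda(X, y, lambdas=np.linspace(-2, 2, 81)):
    n = len(y)
    logy = np.log(y)
    best_ll, best_lam = -np.inf, None
    for lam in lambdas:
        z = logy if abs(lam) < 1e-12 else (y**lam - 1.0) / lam
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        sse = np.sum((z - X @ beta) ** 2)
        # profile log-likelihood, up to an additive constant
        ll = -(n / 2.0) * np.log(sse / n) + (lam - 1.0) * logy.sum()
        if ll > best_ll:
            best_ll, best_lam = ll, lam
    return best_lam
```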
Component Effects Plot

[Component Effects Plot: component effect versus Wheelbase]
The component effect of the selected variable is computed as

$$\hat{\beta}_j (x_j - \bar{x}_j) \qquad (6)$$

where $\hat{\beta}_j$ is the estimated regression coefficient for variable j, $x_j$ represents the value of variable
j as plotted on the horizontal axis, and $\bar{x}_j$ is the average value of the selected independent
variable amongst the n observations used to fit the model. You can judge the importance of a
factor by noting how much the component effect changes over the range of the selected variable.
The points on the above plot represent each of the n = 93 automobiles in the dataset. The vertical
positions are equal to the component effect plus the residual from the fitted model. This allows
you to gauge the relative importance of a factor compared to the residuals. In the above plot, some
of the residuals are as large as, if not larger than, the effect of Wheelbase, indicating that other
important factors may be missing from the model.
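Continuing the hypothetical statsmodels fit from earlier, the plotted quantities can be computed directly:

```python
# Sketch: component effect of Wheelbase plus residuals, assuming `model`
# and `cars` are the (hypothetical) objects from the earlier sketch.
effect = model.params["Wheelbase"] * (
    cars["Wheelbase"] - cars["Wheelbase"].mean()
)
points = effect + model.resid   # vertical positions of the plotted points
```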
Pane Options

Conditional Sums of Squares
The table decomposes the model sum of squares SSR into contributions due to each coefficient
by showing the increase in SSR as each term is added to the model. These sums of squares are
often called Type I sums of squares. The F-Ratios compare the mean square for each term to the
MSE of the fitted model. These sums of squares are useful when fitting polynomial models, as
discussed in the Polynomial Regression documentation.
In the above table, all variables are statistically significant at the 1% significance level since
their P-Values are well below 0.01.
Observed versus Predicted

[Plot of observed MPG Highway versus predicted values]
If the model fits well, the points should be randomly scattered around the diagonal line. Any
change in variability from low values of Y to high values of Y might indicate the need to
transform the dependent variable before fitting a model to the data. In the above plot, the
variability appears to increase somewhat as the predicted values get large.
Residual Plots
As with all statistical models, it is good practice to examine the residuals. In a regression, the
residuals are defined by

$$e_i = y_i - \hat{y}_i \qquad (7)$$

i.e., the residuals are the differences between the observed data values and the fitted model.
The residuals may be plotted:
1. versus X.
2. versus predicted value Ŷ.
3. versus row number.
[Residual Plot: Studentized residuals versus Weight]
[Residual Plot: Studentized residuals versus predicted MPG Highway]
Heteroscedasticity occurs when the variability of the data changes as the mean changes, and
might necessitate transforming the data before fitting the regression model. It is usually
evidenced by a funnel-shaped pattern in the residual plot. In the plot above, some increased
variability in miles per gallon can be seen at high predicted values, which corresponds to the
smaller cars. For the smaller cars, the miles per gallon appears to vary more than for the larger
cars.
[Residual Plot: Studentized residuals versus row number]
If the data are arranged in chronological order, any pattern in the data might indicate an outside
influence. In the above plot, no obvious trend is present, although there is a standardized residual
in excess of 3.5, indicating that it is more than 3.5 standard deviations from the fitted curve.
Pane Options
Plot versus: the independent variable to plot on the horizontal axis, if relevant.
Unusual Residuals
Once the model has been fit, it is useful to study the residuals to determine whether any outliers
exist that should be removed from the data. The Unusual Residuals pane lists all observations
that have Studentized residuals of 2.0 or greater in absolute value.
Unusual Residuals
Row   Y      Predicted Y   Residual    Studentized Residual
31    33.0   40.1526       -7.15265    -2.81
36    20.0   26.9631       -6.96309    -2.62
39    50.0   43.4269       6.5731      2.72
42    46.0   36.4604       9.53958     3.66
60    26.0   32.8753       -6.8753     -2.50
73    41.0   35.3266       5.67338     2.04
Studentized residuals greater than 3 in absolute value correspond to points more than 3 standard
deviations from the fitted model, which is a very rare event for a normal distribution. In the
sample data, row #42 is more than 3.5 standard deviations out. Row #42 is a Honda Civic, which was
listed in the dataset as achieving 46 miles per gallon, while the model predicts less than 37.
Points can be removed from the fit while examining any of the residual plots by clicking on a
point and then pressing the Exclude/Include button on the analysis toolbar.
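In the hypothetical statsmodels setting used earlier, the same screen can be computed from the externally Studentized (deleted) residuals:

```python
# Sketch: flag Studentized residuals of 2.0 or more in absolute value,
# assuming `model` is the earlier (hypothetical) statsmodels fit.
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

student = OLSInfluence(model).resid_studentized_external
unusual = np.abs(student) >= 2.0
print(np.flatnonzero(unusual) + 1, student[unusual])   # 1-based row numbers
```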
Influential Points
In fitting a regression model, all observations do not have an equal influence on the parameter
estimates in the fitted model. Those with unusual values of the independent variables tend to
have more influence than the others. The Influential Points pane displays any observations that
have high influence on the fitted model:
Influential Points
Row   Leverage    Mahalanobis Distance   DFITS
19    0.139122    13.5555                0.18502
28    0.246158    28.3994                0.685044
31    0.156066    15.6544                -1.225
36    0.0961585   8.58597                -0.849931
39    0.250016    29.0136                1.89821
60    0.0298891   1.78389                -0.463748
73    0.0352144   2.29596                0.451735
83    0.102406    9.27903                0.573505
Average leverage of single data point = 0.0434783
Points are placed on this list for one of the following reasons:
Leverage – measures how distant an observation is from the mean of all n observations in
the space of the independent variables. The higher the leverage, the greater the impact of the
point on the fitted values ŷ. Points are placed on the list if their leverage is more than 3 times
that of an average data point.
Mahalanobis Distance – measures the distance of a point from the center of the collection of
points in the multivariate space of the independent variables. Since this distance is related to
leverage, it is not used to select points for the table.
DFITS – measures the difference between the predicted values $\hat{y}_i$ when the model is fit with
and without the i-th data point. Points are placed on the list if the absolute value of DFITS
exceeds $2\sqrt{p/n}$, where p is the number of coefficients in the fitted model.
In the sample data, rows #28 and #39 show a leverage value of nearly 6 times that of an average
data point. Rows #31 and #39 have the largest values of DFITS. Removing high influence points
is not recommended on a routine basis. However, it is important to be aware of their impact on
the estimated model.
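The same screens can be sketched with statsmodels' influence diagnostics (again assuming the hypothetical `model` from earlier):

```python
# Sketch: leverage and DFITS screens matching the rules described above.
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(model)
p, n = len(model.params), int(model.nobs)
high_leverage = infl.hat_matrix_diag > 3 * p / n   # 3x average leverage
dffits, _ = infl.dffits                            # values and a threshold
big_dffits = np.abs(dffits) > 2 * np.sqrt(p / n)
```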
Confidence Intervals
The Confidence Intervals pane shows the potential estimation error associated with each
coefficient in the model.
Pane Options
Correlation Matrix
The Correlation Matrix displays estimates of the correlation between the estimated coefficients.
This table can be helpful in determining how well the effects of different independent variables
have been separated from each other. Note the high correlation between the coefficients of
Weight and Weight^2. This is normal whenever fitting a non-centered polynomial and simply
means that the coefficients could change dramatically if a polynomial of a different order were fit.
Reports
The Reports pane creates predictions using the fitted least squares model. By default, the table
includes a line for each row in the datasheet that has complete information on the X variables
and a missing value for the Y variable. This allows you to add rows to the bottom of the
datasheet corresponding to levels at which you want predictions without affecting the fitted
model.
For example, suppose a prediction is desired for a car with a Weight of 3500 and a Wheelbase of
105. In row #94 of the datasheet, these values would be added but the MPG Highway column
would be left blank. The resulting table is shown below:
Fitted Value - the predicted value of the dependent variable using the fitted model.
Standard Error for Forecast - the estimated standard error for predicting a single new
observation.
Confidence Limits for Forecast - prediction limits for new observations at the selected
level of confidence.
Confidence Limits for Mean - confidence limits for the mean value of Y at the selected
level of confidence.
For row #94, the predicted miles per gallon is 24.7. Models with those features can be expected
to achieve between 18.9 and 30.5 miles per gallon in highway driving.
Using Pane Options, additional information about the predicted values and residuals for the data
used to fit the model can also be included in the table.
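The analogous computation in the hypothetical statsmodels setting uses get_prediction, assuming `reduced` is a formula-based fit of the reduced model so that Weight^2 is derived automatically:

```python
# Sketch: confidence and prediction limits for a new car, assuming
# `reduced` is a (hypothetical) statsmodels formula fit of the reduced model.
import pandas as pd

new = pd.DataFrame({"Weight": [3500], "Wheelbase": [105]})
frame = reduced.get_prediction(new).summary_frame(alpha=0.05)
# mean_ci_lower/upper = confidence limits for the mean value of Y;
# obs_ci_lower/upper  = prediction limits for a single new observation.
print(frame)
```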
Interval Plots
The Interval Plots pane can create a number of interesting types of plots. The plot below shows
how precisely the miles per gallon of an automobile can be predicted.
[Interval Plot: prediction limits plotted versus predicted value]
Pane Options
Plot Limits For: type of limits to be included. Predicted Values plots prediction limits at
settings of the independent variables corresponding to each of the n observations used to fit
the model. Means plots confidence limits for the mean value of Y corresponding to each of
the n observations. Forecasts plots prediction limits for rows of the datasheet that have
missing values for Y. Forecast Means plots confidence limits for the mean value of Y
corresponding to each row in the datasheet with a missing value for Y.
Autocorrelated Data
When regression models are used to fit data that is recorded over time, the deviations from the
fitted model are often not independent. This can lead to inefficient estimates of the underlying
regression model coefficients and P-values that overstate the statistical significance of the fitted
model.
As an illustration, consider the following data from Neter et al. (1996), contained in the file
company.sgd:
Regressing company sales against industry sales results in a very good linear fit, with a very
high R-squared:
Parameter        Estimate    Standard Error   T Statistic   P-Value
CONSTANT         -1.45475    0.214146         -6.79326      0.0000
industry sales   0.176283    0.00144474       122.017       0.0000
Analysis of Variance
Source Sum of Squares Df Mean Square F-Ratio P-Value
Model 110.257 1 110.257 14888.14 0.0000
Residual 0.133302 18 0.00740568
Total (Corr.) 110.39 19
However, the Durbin-Watson statistic is very significant, and the estimated lag 1 residual
autocorrelation equals 0.626. A plot of the residuals versus row number shows marked swings
around zero:
[Residual Plot: Studentized residuals versus row number]
Clearly, the residuals are not randomly distributed around the regression line.
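Both diagnostics are easy to reproduce; the sketch below assumes `fit` is a (hypothetical) ordinary least squares fit of company sales on industry sales:

```python
# Sketch: Durbin-Watson statistic and lag 1 residual autocorrelation,
# assuming `fit` is a (hypothetical) statsmodels OLS result.
import numpy as np
from statsmodels.stats.stattools import durbin_watson

e = np.asarray(fit.resid)
dw = durbin_watson(e)                        # near 2 for independent errors
r1 = np.sum(e[1:] * e[:-1]) / np.sum(e**2)   # lag 1 autocorrelation
```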
To account for the autocorrelations of the deviations from the regression line, a more
complicated error structure can be assumed. A logical extension of the random error model is to
let the errors have a first-order autoregressive structure, in which the deviation at time t is
dependent upon the deviation at time t-1 in the following manner:

$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t \qquad (8)$$

$$\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t \qquad (9)$$

where the $u_t$ are independent random errors.
The Analysis Options dialog box allows you to fit a model of the above form using the
Cochrane-Orcutt procedure:
You may either specify the value of ρ in the Autocorrelation field and select Ordinary Least
Squares, or select Cochrane-Orcutt Optimization and let the value of ρ be determined
iteratively using the specified value as a starting point. In the latter case, the following procedure
is used:
Step 1: The model is fit using transformed values of the variables based on the initial
value of ρ.
Step 2: The value of ρ is re-estimated using the values of εt obtained from the fit in Step
1. Steps 1 and 2 are then repeated until the estimate of ρ converges.
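A sketch of this iteration is shown below, assuming x and y are one-dimensional NumPy arrays; statsmodels' GLSAR class with iterative_fit() provides a comparable built-in routine.

```python
# Sketch of the Cochrane-Orcutt iteration for the model of equations (8)-(9).
import numpy as np
import statsmodels.api as sm

def cochrane_orcutt(y, x, rho=0.0, n_iter=25, tol=1e-6):
    for _ in range(n_iter):
        # Step 1: fit the regression to the quasi-differenced variables
        y_star = y[1:] - rho * y[:-1]
        x_star = x[1:] - rho * x[:-1]
        fit = sm.OLS(y_star, sm.add_constant(x_star)).fit()
        # Step 2: recover the original coefficients and re-estimate rho
        b0, b1 = fit.params
        e = y - b0 / (1.0 - rho) - b1 * x    # residuals of original model
        new_rho = np.sum(e[1:] * e[:-1]) / np.sum(e**2)
        if abs(new_rho - rho) < tol:         # repeat until rho converges
            return fit, new_rho
        rho = new_rho
    return fit, rho
```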
Analysis of Variance
Source Sum of Squares Df Mean Square F-Ratio P-Value
Model 7.7425 1 7.7425 1750.11 0.0000
Residual 0.0752083 17 0.00442402
Total (Corr.) 7.81771 18
The above output shows that, at the final value of ρ = 0.766, the Durbin-Watson statistic and the
lag 1 residual autocorrelation, computed using the residuals from the regression involving the
transformed variables, are much more in line with that expected if the errors are random. The
model also changed somewhat.
Save Results
The following results may be saved to the datasheet:
Calculations
Regression Model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k \qquad (13)$$

Coefficient Estimates

The coefficients are estimated by weighted least squares, $b = (X'WX)^{-1} X'WY$, with estimated
covariance matrix

$$s^2(\hat{\beta}) = MSE\,(X'WX)^{-1} \qquad (17)$$
$$MSE = \frac{SSE}{n-p} \qquad (18)$$

where p is the number of coefficients in the model and the residual sum of squares is

$$SSE = Y'WY - b'X'WY$$

with n − k − 1 degrees of freedom.
R-Squared

$$R^2 = 100\,\frac{SSR}{SSR+SSE}\,\% \qquad (19)$$

Adjusted R-Squared

$$R^2_{adj} = 100\left[1-\left(\frac{n-1}{n-p}\right)\frac{SSE}{SSR+SSE}\right]\% \qquad (20)$$

Standard Error of Estimate

$$\hat{\sigma} = \sqrt{MSE} \qquad (21)$$
Residuals

Mean absolute error:

$$MAE = \frac{\sum_{i=1}^{n} w_i\,|e_i|}{\sum_{i=1}^{n} w_i} \qquad (23)$$
Durbin-Watson Statistic

$$D = \frac{\sum_{i=2}^{n}(e_i-e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} \qquad (24)$$

For n > 500,

$$D^* = \frac{D-2}{\sqrt{4/n}} \qquad (25)$$

is compared to a standard normal distribution. For 100 < n ≤ 500, D/4 is compared to a beta
distribution with parameters

$$\alpha = \beta = \frac{n-1}{2} \qquad (26)$$
For smaller sample sizes, D/4 is compared to a beta distribution with parameters which are based
on the trace of certain matrices related to the X matrix, as described by Durbin and Watson
(1951) in section 4 of their classic paper.
The lag 1 residual autocorrelation is estimated by

$$r_1 = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=1}^{n} e_i^2} \qquad (27)$$
Leverage

$$h_i = w_i\,\mathrm{diag}\!\left[X_i (X'WX)^{-1} X_i'\right] \qquad (28)$$

$$\bar{h} = \frac{p}{n} \qquad (29)$$
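Equation (28) reduces, for unweighted data, to the diagonal of the hat matrix, which can be computed directly from the design matrix:

```python
# Sketch: leverage values for unweighted data (W = I), assuming X is the
# n x p design matrix including the constant column.
import numpy as np

h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
h_bar = X.shape[1] / X.shape[0]   # average leverage p/n, equation (29)
```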
Studentized Residuals
$$d_i = \frac{e_i\sqrt{w_i}}{\sqrt{MSE_{(i)}\,(1-h_i)}} \qquad (30)$$

where MSE(i) is the mean squared error computed with the i-th observation deleted.
Mahalanobis Distance
$$MD_i = \frac{\left(h_i - w_i/\sum_{i=1}^{n} w_i\right) n\,(n-2)}{(1-h_i)(n-1)} \qquad (31)$$
DFITS
$$DFITS_i = d_i \sqrt{\frac{h_i}{w_i\,(1-h_i)}} \qquad (32)$$
Confidence Limits for Mean

$$\hat{Y}_h \pm t_{\alpha/2,\,n-p}\sqrt{MSE\;X_h (X'WX)^{-1} X_h'} \qquad (35)$$