Example of Simple and Multiple Regression
The purpose of this introductory example from pages 149-158 of the text is
to demonstrate the basic concepts of regression analysis as one attempts to
develop a predictive equation containing several independent variables. The
dependent variable is the number of credit cards held by a family. The
independent variables are family size and family income.
The data for this problem is in the SPSS data set CreditCardData.Sav.
With no information other than the number of credit cards per family, i.e. knowing only the
values of the dependent variable, Number of Credit Cards, our best estimate of the number of
cards held by any family is the mean. The cumulative amount of error in our guesses for all
subjects in the data set is the sum of squared errors (the squared deviations from the mean).
Recall that the sample variance equals the sum of squared errors divided by the number of
cases minus one (the degrees of freedom). While we cannot obtain the sum of squared errors
directly, we can compute the variance and multiply it by the number of cases in our sample
minus one.
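Outside SPSS, a minimal Python sketch (using numpy, with eight illustrative credit-card counts
assumed here so that the sample variance matches the 3.143 reported below) confirms that the
sample variance multiplied by (n - 1) recovers the sum of squared deviations from the mean:

    import numpy as np

    # Illustrative credit-card counts for eight families; assumed values,
    # chosen so that the sample variance matches the 3.143 reported below.
    ncards = np.array([4, 6, 6, 7, 8, 7, 8, 10])

    n = len(ncards)
    sample_variance = ncards.var(ddof=1)            # sum of squared deviations / (n - 1)
    sum_squared_errors = sample_variance * (n - 1)  # recovers the sum of squared deviations

    print(round(sample_variance, 3))                # 3.143
    print(round(sum_squared_errors, 1))             # 22.0
    print(((ncards - ncards.mean()) ** 2).sum())    # 22.0, computed directly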
In the Descriptives dialog box, move the variable 'Number of Credit Cards (ncards)' to the
'Variable(s)' list, then click the 'Options...' button to request specific statistics. Click the
'Continue' button to close the 'Descriptives: Options' dialog box, then click the OK button to
complete the request.
In the SPSS Output Navigator, we see that the variance is 3.143. If we multiply the
variance by 7 (the number of cases in the study minus one, 8 - 1 = 7), we compute the sum of
squared errors to be 22, which agrees with the text on page 151. If we use the
mean for our best guess for each case, our measure of error is 22 units. The goal of
regression is to use information from independent variables to reduce the amount of
error associated with our guesses for the value of the dependent variable.
The regression coefficients are shown in the column titled 'B' of the Coefficients table.
The coefficient for the independent variable Family Size is .971, and the intercept, labeled
(Constant), is 2.871. If we write out the regression equation, it is:

Predicted Number of Credit Cards = 2.871 + 0.971 × Family Size
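As a check on these values, here is a minimal Python sketch of the one-predictor least-squares
computation. The family sizes and credit-card counts are assumed here to match the eight cases
tabled in the text; with those values the computation reproduces the coefficient (.971) and the
constant (2.871) reported in the output:

    import numpy as np

    # Assumed data for the eight families (family size, number of credit cards).
    famsize = np.array([2, 2, 4, 4, 5, 5, 6, 6])
    ncards  = np.array([4, 6, 6, 7, 8, 7, 8, 10])

    # Ordinary least squares with one predictor:
    #   slope = sum of cross-products / sum of squares of the predictor
    dev_x = famsize - famsize.mean()
    dev_y = ncards - ncards.mean()
    slope = (dev_x * dev_y).sum() / (dev_x ** 2).sum()
    intercept = ncards.mean() - slope * famsize.mean()

    print(round(slope, 3))       # 0.971
    print(round(intercept, 3))   # 2.871

    # Predicted number of credit cards for a family of four:
    print(round(intercept + slope * 4, 2))   # 6.76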
The ANOVA table provides the information on the sum of squared errors.
The ratio of the sum of squares attributed to the regression relationship (16.514)
to the total sum of squares (22.0) is equal to the value of R Square in the Model
Summary Table, i.e. 16.514 / 22.0 = 0.751.
We would say that the pattern of variance in the independent variable, Family Size,
explains 75.1% of the variance in the dependent variable, Number of Credit Cards.
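The same ratio can be verified directly from the sums of squares reported in the ANOVA table:

    # R Square as the ratio of the regression sum of squares to the total sum of squares
    ss_regression = 16.514
    ss_total = 22.0
    print(round(ss_regression / ss_total, 3))   # 0.751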
To add 'Family Income (famincom)' as a second independent variable, click the 'Dialog Recall'
tool to reopen the regression dialog box, move 'Family Income' to the 'Independent(s)' list, and
click the OK button to produce the output.
If the independent variables are not correlated at all, their predictive power simply adds: the
variance they explain together is the sum of the variance each explains individually. If the
independent variables are perfectly correlated (collinear), either one does an equally good job
of predicting the dependent variable and the other is superfluous. When the intercorrelation
falls between these extremes, the variance that the independent variables share can only be
counted once in the regression relationship. When the second, intercorrelated independent
variable is added to the analysis, its contribution will appear weaker than its individual
correlation with the dependent variable suggests, because only the variance it shares with the
dependent variable over and above what the first variable already explains is added to the
analysis, as the simulation sketch below illustrates.
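A small Python simulation (purely illustrative, not the credit-card data) makes this concrete:
when two predictors are correlated with each other, the R Square of the two-variable model is
larger than either single-variable R Square but smaller than their sum:

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: x2 shares variance with x1, and y depends on both.
    x1 = rng.normal(size=500)
    x2 = 0.7 * x1 + rng.normal(scale=0.7, size=500)
    y  = 1.0 * x1 + 0.5 * x2 + rng.normal(size=500)

    def r_square(predictors, y):
        # R Square from an ordinary least-squares fit with an intercept
        X = np.column_stack([np.ones(len(y))] + predictors)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        return 1 - residuals.var() / y.var()

    print(r_square([x1], y))        # R Square using x1 alone
    print(r_square([x2], y))        # R Square using x2 alone
    print(r_square([x1, x2], y))    # joint R Square: larger than either alone,
                                    # but smaller than their sum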
From the correlation matrix produced by the regression command above, we see that there
is a strong correlation between 'Family Size' and 'Family Income' of 0.673. We expect 'Family
Income' to improve our ability to predict 'Number of Credit Cards', but by a smaller amount
than the 0.829 correlation between 'Number of Credit Cards' and 'Family Income' would
suggest.
The remaining output from the regression command is shown below. The R Square measure
of the strength of the relationship between the dependent variable and the independent
variables increased by 0.110, from 0.751 for the single-variable regression to 0.861.
The significance of the F statistic produced by the Analysis of Variance test (.007) indicates
that there is a relationship between the dependent variable and the set of independent
variables. The Sum of Squares for the residual indicates that we reduced our measure of
error from 5.486 for the one variable equation, to 3.050 for the two variable equation.
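The arithmetic linking the residual sums of squares to the R Square values can be checked
directly from the figures in the output:

    # Sums of squares reported for the two regression runs
    ss_total = 22.0
    ss_residual_one_variable = 5.486    # Family Size only
    ss_residual_two_variables = 3.050   # Family Size and Family Income

    r_square_one = 1 - ss_residual_one_variable / ss_total    # about 0.751
    r_square_two = 1 - ss_residual_two_variables / ss_total   # about 0.861
    print(round(r_square_two - r_square_one, 2))               # about 0.11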
However, the significance tests of the individual coefficients in the 'Coefficients' table tell us
that the variable 'Family Income' does not have a statistically significant individual
relationship with the dependent variable (Sig = 0.102). If we used an alpha level of 0.05, we
would fail to reject the null hypothesis that its coefficient B is equal to 0.
When we interpret multiple regression output, we must examine both the significance test for
the relationship between the dependent variable and the set of independent variables (the
ANOVA test of the regression model) and the significance tests for the individual variables in
the Coefficients table. Understanding the patterns of relationships that exist among the
variables requires that we consider the combined results of all of these significance tests.
To add a third independent variable, click the 'Dialog Recall' tool to reopen the regression
dialog box and move the variable 'Number of Automobiles Owned (numautos)' to the
'Independent(s):' list box. ('Number of Credit Cards (ncards)' should still be in the
'Dependent:' variable text box, and 'Family Size (famsize)' and 'Family Income (famincom)'
should still be listed in the 'Independent(s)' list box.) Then click the OK button to request
the output.
The significance of the F statistic produced by the Analysis of Variance test (.029)
indicates that there is a relationship between the dependent variable and the set of
three independent variables. The Sum of Squares for the residual indicates that we
reduced our measure of error from 3.050 for the two variable equation, to 2.815 for the
three variable equation.
From this example, we see that the objective of multiple regression is to add
independent variables to the regression equation that improve our ability to predict the
dependent variable, by reducing the residual sum of squared errors between the
predicted and actual values for the dependent variable. Ideally, all of the variables in
our regression equation would have a significant individual relationship to the
dependent variable.