Multiple Regression

Overview for Multiple Regression


Use Multiple Regression to model the linear relationship between a continuous response and up to 12
continuous predictors and 1 categorical predictor.
For example, real estate appraisers want to see how the sales price of urban apartments is associated with several
predictor variables including the square footage, the number of available units, the age of the building, and the
distance from the city center. The appraisers can use multiple regression to determine which predictors are
significantly related to sales price.
Where to find this analysis
• Mac: Statistics > Regression > Multiple Regression
• PC: STATISTICS > Regression > Multiple Regression
When to use an alternate analysis
• If you have one continuous predictor, you can use Simple Regression.
• If you have one categorical predictor and no continuous predictors, use One-Way ANOVA.
• If you have two categorical predictors and no continuous predictors, use Two-Way ANOVA.

Data considerations for Multiple Regression


To ensure that your results are valid, consider the following guidelines when you collect data, perform the
analysis, and interpret your results.
The predictors can be continuous or categorical
You can have 1 to 12 continuous predictors and, optionally, 1 categorical predictor.
A continuous variable can be measured and ordered, and has an infinite number of values between any two
values. For example, the diameter of a tire is a continuous variable.
Categorical variables contain a finite, countable number of categories or distinct groups. Categorical data might
not have a logical order. For example, categorical predictors include gender, material type, and payment method.
If you have a discrete variable, you can decide whether to treat it as a continuous or categorical predictor. A
discrete variable can be measured and ordered but it has a countable number of values. For example, the number
of people that live in a household is a discrete variable. The decision to treat a discrete variable as continuous
or categorical depends on the number of levels, as well as the purpose of the analysis. For more information, go
to What are categorical, discrete, and continuous variables?.
The response variable should be continuous
If the response variable is categorical, your model is less likely to meet the assumptions of the analysis, to
accurately describe your data, or to make useful predictions. If you have a categorical response variable, use
logistic regression, which is available in Minitab Statistical Software.
Collect data using best practices
To ensure that your results are valid, consider the following guidelines:
• Make sure the data represent the population of interest.
• Collect enough data to provide the necessary precision.
• Measure variables as accurately and precisely as possible.
• Record the data in the order that they are collected.
The correlation among the predictors, also known as multicollinearity, should not be severe
If multicollinearity is severe, you may not easily be able to determine which predictors to include in the model.
To determine the severity of the multicollinearity, use the variance inflation factors (VIF) in the coefficients
table of the regression output.
The model should provide a good fit to the data
If the model does not fit the data, then the results can be misleading. In the output, use residual plots, diagnostic
statistics for unusual observations, and model summary statistics to determine how well the model fits the data.

Example of Multiple Regression


A research chemist wants to understand how several predictors are associated with the wrinkle resistance of
cotton cloth. The chemist examines 32 pieces of cotton cellulose produced at different settings of curing time,
curing temperature, formaldehyde concentration, and catalyst ratio. The durable press rating, a measure of
wrinkle resistance, is recorded for each piece of cotton.
The chemist performs a multiple regression analysis to fit a model with the predictors and eliminate the
predictors that do not have a statistically significant relationship with the response.
1. Open the sample data, WrinkleResistance.MTW.
2. Open the Multiple Regression dialog box.
• Mac: Statistics > Regression > Multiple Regression
• PC: STATISTICS > Regression > Multiple Regression
3. In Response, enter Rating.
4. In Continuous predictors, enter Conc Ratio Temp Time.
5. On the Graphs tab, do the following:
a. Select Residual plots.
b. Select Residuals versus variables, and enter Conc Ratio Temp Time.
6. Click OK.
Interpret the results
The predictors temperature, catalyst ratio, and formaldehyde concentration have p-values that are less than the
significance level of 0.05. These results indicate that these predictors have a statistically significant effect on
wrinkle resistance. The p-value for time is greater than 0.05, which indicates that there is not enough evidence
to conclude that time is related to the response. The chemist may want to refit the model without this predictor.
The residual plots indicate that there may be problems with the model.
• The points on the residuals versus fits plot do not appear to be randomly distributed about zero. There
appear to be clusters of points that could represent different groups in the data. The chemist should
investigate the groups to determine their cause.
• The plot of the residuals versus ratio shows curvature, which suggests a curvilinear relationship between
catalyst ratio and wrinkle resistance. The chemist should consider adding a quadratic term for ratio to the model.

Regression Equation

Rating = −0.7560 + 0.15453 Conc + 0.21705 Ratio + 0.010806 Temp + 0.09464 Time
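As a quick sanity check outside Minitab, the fitted equation can be evaluated directly. The function below simply restates the regression equation above; the example predictor values are invented for illustration.

```python
# Evaluate the fitted regression equation from the output above.
# Coefficients are taken verbatim from the Regression Equation.
def predict_rating(conc, ratio, temp, time):
    """Predicted durable press rating for one piece of cotton."""
    return (-0.7560 + 0.15453 * conc + 0.21705 * ratio
            + 0.010806 * temp + 0.09464 * time)

# Hypothetical settings: concentration 5, ratio 10, temperature 150, time 3
print(round(predict_rating(5, 10, 150, 3), 5))  # → 4.09197
```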

Analysis of Variance

Source      DF  Adj SS   Adj MS   F-Value  P-Value
Regression   4  47.9096  11.9774    18.17  <0.0001
Error       27  17.7953   0.6591
Total       31  65.7049

Model Summary

S         R-sq    R-sq(adj)
0.811840  72.92%  68.90%

Coefficients

Term      Coef      SE Coef   95% CI                 T-Value  P-Value
Constant  -0.7560   0.7361    (-2.2665, 0.7544)        -1.03   0.3135
Conc       0.15453  0.06334   (0.02457, 0.28448)        2.44   0.0215
Ratio      0.21705  0.03164   (0.15214, 0.28197)        6.86  <0.0001
Temp       0.010806 0.004622  (0.001323, 0.020290)      2.34   0.0270
Time       0.09464  0.05455   (-0.01729, 0.20657)       1.73   0.0942
Fits and Diagnostics for Unusual Observations

Obs  Rating  Fit      Resid    Std Resid
9    4.8     3.17828  1.62172  2.06 R

R  Large residual
Perform the analysis
Enter your data for Multiple Regression (Data tab)
On the Data tab of the Multiple Regression dialog box, specify the data for your analysis and determine whether
you want to include the intercept in the regression equation.
In This Topic
• Enter your data
• Fit intercept
Enter your data
Complete the following steps to specify the columns of data that you want to analyze.
1. In Response, enter the columns of numeric data that you want to explain or predict. The response is
also called the Y variable.
2. In Continuous predictors, enter the columns of numeric data that may explain or predict changes in
the response. The predictors are also called X variables.
3. (Optional) In Categorical predictor, enter the categorical classifications or group assignments, such
as a type of raw material, that may explain or predict changes in the response. You can enter one
categorical predictor. The predictors are also called X variables.
In this worksheet, Strength is the response and contains the strength measurements of a sample of synthetic
fibers. Temperature is a continuous predictor and Machine is a categorical predictor. The predictors may
explain differences in fiber strength. The first row of the worksheet shows that the first sample of fiber has a
strength measurement of 40, has a temperature of 136, and was produced on Machine A.

C1 C2 C3
Strength Temperature Machine
40 136 A
53 142 A
32 119 B
36 127 A
42 151 B
45 121 B
Fit intercept
Select Fit intercept to include the intercept (also called the constant) in the regression model. In most cases,
you should include the constant in the model.
A possible reason to remove the constant is when you can assume that the response is 0 when the predictor
values equal 0. For example, consider a model that predicts calories based on the fat, protein, and carbohydrate
contents of a food. When the fat, protein, and carbohydrates are 0, the number of calories will also be 0 (or very
close to 0).
When you compare models that do not include the constant, use S instead of the R2 statistics to evaluate the fit
of models.
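The point about using S for no-constant models can be sketched with a small least-squares fit (a NumPy illustration on made-up data, not a Minitab procedure):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 5 + 2 * x + rng.normal(0, 1, 50)   # true intercept is 5, far from 0

s_for = {}
for label, X in [("with intercept", np.column_stack([np.ones_like(x), x])),
                 ("no intercept", x[:, None])]:
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    # S = sqrt(MSE); both models report it in the units of the response
    s_for[label] = float(np.sqrt(resid @ resid / (len(y) - X.shape[1])))
    print(label, round(s_for[label], 3))
```

Because the true intercept is far from 0, the no-intercept model fits poorly, and S reflects that directly, whereas the R² definitions differ between the two model forms.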
Specify confidence intervals for Multiple Regression (Options tab)
On the Options tab of the Multiple Regression dialog box, specify confidence intervals for the regression
coefficients.
In This Topic
• Confidence level
• Type of interval
Confidence level
From Confidence level, select the level of confidence for the confidence intervals for the regression
coefficients. The confidence intervals (CI) are ranges of values that are likely to contain the true value of the
coefficient for each term in the model.
Usually, a confidence level of 95% works well. A 95% confidence level indicates that if you took 100 random
samples from the population, the confidence intervals for approximately 95 of the samples would contain the
true value of the coefficient. For a given set of data, a lower confidence level produces a narrower interval, and
a higher confidence level produces a wider interval.
Type of interval
From Type of interval, select a two-sided interval or a one-sided bound. For the same confidence level, a bound
is closer to the point estimate than the interval. The upper bound does not give a likely lower value. The lower
bound does not give a likely upper value.
For example, the coefficient for a predictor is 13.2 mg/L. The 95% confidence interval for the coefficient is 12.8
mg/L to 13.6 mg/L. The 95% upper bound for the coefficient is 13.5 mg/L, which is more precise because the
bound is closer to the point estimate of the coefficient.
Two-sided
Use a two-sided confidence interval to estimate both likely upper and lower values for the coefficient.
Lower bound
Use a lower confidence bound to estimate a likely lower value for the coefficient.
Upper bound
Use an upper confidence bound to estimate a likely higher value for the coefficient.
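The two-sided interval and a one-sided bound can be reproduced for the Conc coefficient from the example output (0.15453, SE 0.06334, 27 error degrees of freedom). This sketch assumes SciPy is available; it is not part of Minitab.

```python
from scipy import stats

coef, se, df_error = 0.15453, 0.06334, 27   # Conc row; error DF = 32 - 4 - 1

# Two-sided 95% interval: coef ± t(0.975, df) * SE
t_two = stats.t.ppf(0.975, df_error)
lower, upper = coef - t_two * se, coef + t_two * se

# One-sided 95% upper bound uses t(0.95, df), so it sits closer to the estimate
t_one = stats.t.ppf(0.95, df_error)
upper_bound = coef + t_one * se

print(round(lower, 5), round(upper, 5), round(upper_bound, 5))
```

The computed interval matches the (0.02457, 0.28448) shown in the coefficients table, and the one-sided bound is tighter than the two-sided upper limit.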

Perform stepwise regression for Multiple Regression (Stepwise tab)


On the Stepwise tab of the Multiple Regression dialog box, select the stepwise regression method.
Stepwise removes and adds terms to the model for the purpose of identifying a useful subset of the terms. For
more information, go to Basics of stepwise regression.
Specify the method that is used to fit the model.
• Stepwise: This method starts with an empty model, or includes the terms you specified to include in
the initial model or in every model. Then, Minitab adds or removes a term for each step. You can specify
terms to include in the initial model or to force into every model. Minitab stops when all variables not
in the model have p-values that are greater than the specified Alpha to enter value and when all
variables in the model have p-values that are less than or equal to the specified Alpha to remove value.
• Forward selection: This method starts with an empty model, or includes the terms you specified to
include in the initial model or in every model. Then, Minitab adds the most significant term for each
step. Minitab stops when all variables not in the model have p-values that are greater than the
specified Alpha to enter value.
• Backward selection: This method starts with all potential terms in the model and removes the least
significant term for each step. Minitab stops when all variables in the model have p-values that are less
than or equal to the specified Alpha to remove value.
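The forward-selection logic above can be sketched in a few lines. This is an illustrative Python loop on synthetic data, not Minitab's implementation; the function name and variable names are invented, and SciPy supplies the t-distribution p-values.

```python
import numpy as np
from scipy import stats

def forward_select(X, y, names, alpha_enter=0.25):
    """Add, at each step, the candidate predictor with the smallest p-value;
    stop when no candidate's p-value is below alpha_enter."""
    n = len(y)
    chosen = []
    while True:
        best = None
        for j in range(X.shape[1]):
            if names[j] in chosen:
                continue
            cols = [names.index(c) for c in chosen] + [j]
            Xc = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
            b, *_ = np.linalg.lstsq(Xc, y, rcond=None)
            resid = y - Xc @ b
            df = n - Xc.shape[1]
            s2 = resid @ resid / df
            cov = s2 * np.linalg.inv(Xc.T @ Xc)
            t = b[-1] / np.sqrt(cov[-1, -1])     # t for the newly added term
            p = 2 * stats.t.sf(abs(t), df)
            if best is None or p < best[1]:
                best = (names[j], p)
        if best is None or best[1] > alpha_enter:
            return chosen
        chosen.append(best[0])

# Demo: only x0 and x2 actually affect the response
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.5 * rng.normal(size=100)
print(forward_select(X, y, ["x0", "x1", "x2", "x3"]))
```

Full stepwise selection would additionally re-test terms already in the model against the alpha-to-remove threshold after each addition.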

Select the residual plots for Multiple Regression (Graphs tab)


On the Graphs tab of the Multiple Regression dialog box, select the residual plots to include in your output.
Residual plots
Select to display residual plots, including the residuals versus the fitted values, the residuals versus the order of
the data, a normal plot of the residuals, and a histogram of the residuals. Use these plots to determine whether
your model meets the assumptions of the analysis.
Residuals versus variables
Enter one or more variables to plot versus the residuals. You can plot the following types of variables:
• Predictors that are already in the current model, to look for curvature in the residuals.
• Important variables that are not in the current model, to determine whether they are related to the
response.

Interpret the key results for Multiple Regression


Complete the following steps to interpret a regression analysis. Key output includes the p-value, R2, and residual
plots.
In This Topic
• Step 1: Determine whether the association between the response and the term is statistically significant
• Step 2: Determine how well the model fits your data
• Step 3: Determine whether your model meets the assumptions of the analysis
Step 1: Determine whether the association between the response and the term is statistically significant
To determine whether the association between the response and each term in the model is statistically
significant, compare the p-value for the term to your significance level to assess the null hypothesis. The null
hypothesis is that the term's coefficient is equal to zero, which indicates that there is no association between the
term and the response. Usually, a significance level (denoted as α or alpha) of 0.05 works well. A significance
level of 0.05 indicates a 5% risk of concluding that an association exists when there is no actual association.
P-value ≤ α: The association is statistically significant
If the p-value is less than or equal to the significance level, you can conclude that there is a statistically
significant association between the response variable and the term.
P-value > α: The association is not statistically significant
If the p-value is greater than the significance level, you cannot conclude that there is a statistically significant
association between the response variable and the term. You may want to refit the model without the term.
If there are multiple predictors without a statistically significant association with the response, you must reduce
the model by removing terms one at a time. For more information on removing terms from the model, go
to Model reduction.
If a model term is statistically significant, the interpretation depends on the type of term. The interpretations are
as follows:
• If a continuous predictor is significant, you can conclude that the coefficient for the predictor does not
equal zero.
• If a categorical predictor is significant, you can conclude that not all the level means are equal.

Coefficients

Term      Coef      SE Coef   95% CI                 T-Value  P-Value
Constant  -0.7560   0.7361    (-2.2665, 0.7544)        -1.03   0.3135
Conc       0.15453  0.06334   (0.02457, 0.28448)        2.44   0.0215
Ratio      0.21705  0.03164   (0.15214, 0.28197)        6.86  <0.0001
Temp       0.010806 0.004622  (0.001323, 0.020290)      2.34   0.0270
Time       0.09464  0.05455   (-0.01729, 0.20657)       1.73   0.0942

Key Result: P-Value


In these results, the relationships between rating and concentration, ratio, and temperature are statistically
significant because the p-values for these terms are less than the significance level of 0.05. The relationship
between rating and time is not statistically significant at the significance level of 0.05.
Step 2: Determine how well the model fits your data
To determine how well the model fits your data, examine the goodness-of-fit statistics in the model summary
table.
S
Use S to assess how well the model describes the response. Use S instead of the R2 statistics to compare the fit
of models that have no constant.
S is measured in the units of the response variable and represents how far the data values fall from the fitted
values. The lower the value of S, the better the model describes the response. However, a low S value by itself
does not indicate that the model meets the model assumptions. You should check the residual plots to verify the
assumptions.
R-sq
R2 is the percentage of variation in the response that is explained by the model. The higher the R2 value, the
better the model fits your data. R2 is always between 0% and 100%.
R2 always increases when you add additional predictors to a model. For example, the best five-predictor model
will always have an R2 that is at least as high as that of the best four-predictor model. Therefore, R2 is most useful when
you compare models of the same size.
R-sq (adj)
Use adjusted R2 when you want to compare models that have different numbers of predictors. R2 always
increases when you add a predictor to the model, even when there is no real improvement to the model. The
adjusted R2 value incorporates the number of predictors in the model to help you choose the correct model.
R-sq (pred)
Use predicted R2 to determine how well your model predicts the response for new observations. Models that
have larger predicted R2 values have better predictive ability.
A predicted R2 that is substantially less than R2 may indicate that the model is over-fit. An over-fit model occurs
when you add terms for effects that are not important in the population, although they may appear important in
the sample data. The model becomes tailored to the sample data and therefore, may not be useful for making
predictions about the population.
Predicted R2 can also be more useful than adjusted R2 for comparing models because it is calculated with
observations that are not included in the model calculation.
Consider the following points when you interpret the R2 values:
• Small samples do not provide a precise estimate of the strength of the relationship between the response
and predictors. If you need R2 to be more precise, you should use a larger sample (typically, 40 or more).
• R2 is just one measure of how well the model fits the data. Even when a model has a high R2, you should
check the residual plots to verify that the model meets the model assumptions.
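These goodness-of-fit statistics follow directly from the sums of squares. A NumPy sketch on synthetic data (the data and dimensions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 3                        # 40 observations, 3 predictors (terms)
X = rng.normal(size=(n, p))
y = 1 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)

Xd = np.column_stack([np.ones(n), X])     # design matrix with constant
b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
fitted = Xd @ b
sse = np.sum((y - fitted) ** 2)           # SS Error
sst = np.sum((y - y.mean()) ** 2)         # SS Total

r2 = 1 - sse / sst
r2_adj = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
print(round(float(r2), 4), round(float(r2_adj), 4))
```

Adjusted R² penalizes the error sum of squares by its degrees of freedom, so it never exceeds R².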

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value


Regression 4 47.9096 11.9774 18.17 <0.0001
Error 27 17.7953 0.6591
Total 31 65.7049
Key Results: S, R-sq, R-sq (adj), R-sq (pred)
In these results, the model explains 72.92% of the variation in the wrinkle resistance rating of the cloth samples.
For these data, the R2 value indicates the model provides a good fit to the data. If additional models are fit with
different predictors, use the adjusted R2 values and the predicted R2 values to compare how well the models fit
the data.
Step 3: Determine whether your model meets the assumptions of the analysis
Use the residual plots to help you determine whether the model is adequate and meets the assumptions of the
analysis. If the assumptions are not met, the model may not fit the data well and you should use caution when
you interpret the results.
Residuals versus fits plot
Use the residuals versus fits plot to verify the assumption that the residuals are randomly distributed and have
constant variance. Ideally, the points should fall randomly on both sides of 0, with no recognizable patterns in
the points.
The patterns in the following table may indicate that the model does not meet the model assumptions.
Pattern                                                            What the pattern may indicate
Fanning or uneven spreading of residuals across fitted values      Nonconstant variance
Curvilinear                                                        A missing higher-order term
A point that is far away from zero                                 An outlier
A point that is far away from the other points in the x-direction  An influential point
In this residuals versus fits plot, the data do not appear to be randomly distributed about zero. There appear to
be clusters of points that may represent different groups in the data. Investigate the groups to determine their
cause.

Residuals versus order plot


Use the residuals versus order plot to verify the assumption that the residuals are independent from one another.
Independent residuals show no trends or patterns when displayed in time order. Patterns in the points may
indicate that residuals near each other may be correlated, and thus, not independent. Ideally, the residuals on the
plot should fall randomly around the center line:

If you see a pattern, investigate the cause. The following types of patterns may indicate that the residuals are
dependent:
• Trend
• Shift
• Cycle
In this residuals versus order plot, the residuals do not appear to be randomly distributed about zero. The
residuals appear to systematically decrease as the observation order increases. You should investigate the trend
to determine the cause.

For more information on how to handle patterns in the residual plots, go to Interpret all statistics and graphs for
Multiple Regression and click the name of the residual plot in the list at the top of the page.
Normality plot of the residuals
Use the normal probability plot of residuals to verify the assumption that the residuals are normally distributed.
The normal probability plot of the residuals should approximately follow a straight line.
The patterns in the following table may indicate that the model does not meet the model assumptions.

Pattern What the pattern may indicate

Not a straight line Nonnormality

A point that is far away from the line An outlier

Changing slope An unidentified variable


In this normal probability plot, the points generally follow a straight line. There is no evidence of nonnormality,
outliers, or unidentified variables.
Methods and formulas for Multiple Regression
Select the method or formula of your choice.
In This Topic
• Adj MS – Error
• Adj MS – Regression
• Adj MS – Total
• Adj SS
• Coefficient (Coef)
• Degrees of freedom (DF)
• Fit
• F-value
• Mallows' Cp
• P-value – Coefficients table
• P-value – Analysis of variance table
• R-sq
• R-sq (adj)
• R-sq (pred)
• Regression equation
• Residual (Resid)
• S
• Standard error of the coefficient (SE Coef)
• Standardized residual (Std Resid)
• T-value
• Variance inflation factor (VIF)
Adj MS – Error
The Mean Square of the error (also abbreviated as MS Error or MSE, and denoted as s2) is the variance around
the fitted regression line. The formula is:
MS Error = Σ(yi − ŷi)² / (n − p − 1)
Notation

Term Description
yi   ith observed response value
ŷi   ith fitted response
n    number of observations
p    number of coefficients in the model, not counting the constant
Adj MS – Regression
The formula for the Mean Square (MS) of the regression is:
MS Regression = Σ(ŷi − ȳ)² / p
Notation

Term Description
ȳ    mean response
ŷi   ith fitted response
p    number of terms in the model
Adj MS – Total
The formula for the total Mean Square (MS) is:
MS Total = Σ(yi − ȳ)² / (n − 1)
Notation

Term Description
ȳ    mean response
yi   ith observed response value
n    number of observations
Adj SS
The sum of the squared distances. SS Regression is the portion of the variation explained by the model. SS Error
is the portion not explained by the model and is attributed to error. SS Total is the total variation in the data.
Formula
SS Regression = Σ(ŷi − ȳ)²
SS Error = Σ(yi − ŷi)²
SS Total = Σ(yi − ȳ)²
Notation

Term Description
yi   ith observed response value
ŷi   ith fitted response
ȳ    mean response
Coefficient (Coef)
The formula for the coefficient or slope in simple linear regression is:
b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
The formula for the intercept (b0) is:
b0 = ȳ − b1x̄
In matrix terms, the formula that calculates the vector of coefficients in multiple regression is:
b = (X'X)−1X'y
Notation

Term Description
yi   ith observed response value
ȳ    mean response
xi   ith predictor value
x̄    mean predictor
X    design matrix
y    response vector
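The matrix formula for the coefficients can be verified against NumPy's least-squares solver on made-up data (a sketch, not Minitab's internal computation):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])  # constant + 2 predictors
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=30)

# b = (X'X)^-1 X'y, the normal-equations solution from the formula above
b_normal = np.linalg.inv(X.T @ X) @ X.T @ y
# Same fit via the numerically preferred solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_normal, b_lstsq))   # True
```

In practice, solvers based on QR or SVD decompositions are preferred over explicitly inverting X'X, which is numerically fragile when predictors are nearly collinear.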
Mallows' Cp
Formula
Cp = SSEp / MSEm − (n − 2p)
Notation

Term Description
SSEp  sum of squared errors for the model under consideration
MSEm  mean square error for the model with all predictors
n     number of observations
p     number of terms in the model, including the constant
Degrees of freedom (DF)
The degrees of freedom for each component of the model are:

Sources of variation DF
Regression p
Error n–p–1
Total n–1
If your data meet certain criteria and the model includes at least one continuous predictor or more than one
categorical predictor, then Minitab uses some degrees of freedom for the lack-of-fit test. The criteria are as
follows:
• The data contain multiple observations with the same predictor values.
• The data contain the correct points to estimate additional terms that are not in the model.
Notation

Term Description
n number of observations
p number of coefficients in the model, not counting the constant

Fit
The fitted value is:
ŷ = b0 + b1x1 + b2x2 + … + bkxk
Notation

Term Description
ŷ    fitted value
xk   kth term. Each term can be a single predictor, a polynomial term, or an interaction term.
bk   estimate of kth regression coefficient

F-value
The formulas for the F-statistics are as follows:
F(Regression) = MS Regression / MS Error
F(Term) = MS Term / MS Error
F(Lack-of-fit) = MS Lack-of-fit / MS Pure error
Notation
Term Description
MS Regression A measure of the variation in the response that the
current model explains.
MS Error A measure of the variation that the model does not
explain.
MS Term A measure of the amount of variation that a term
explains after accounting for the other terms in the
model.
MS Lack-of-fit A measure of variation in the response that could be
modeled by adding more terms to the model.
MS Pure error A measure of the variation in replicated response data.
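These relationships can be checked directly against the example's Analysis of Variance table, since Adj MS = Adj SS / DF and F(Regression) = MS Regression / MS Error:

```python
# Values from the example ANOVA table
ss_reg, df_reg = 47.9096, 4
ss_err, df_err = 17.7953, 27

ms_reg = ss_reg / df_reg     # Adj MS for Regression
ms_err = ss_err / df_err     # Adj MS for Error
f_value = ms_reg / ms_err    # F-Value
print(round(ms_reg, 4), round(ms_err, 4), round(f_value, 2))  # 11.9774 0.6591 18.17
```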

P-value – Coefficients table
The two-sided p-value for the null hypothesis that a regression coefficient equals 0 is:
P = 2 × (1 − T(|tj|))
The degrees of freedom are the degrees of freedom for error, as follows:
n − p − 1
Notation

Term Description
T    the cumulative distribution function of the t-distribution with degrees of freedom equal to the degrees of freedom for error
tj   the t statistic for the jth coefficient
n    the number of observations in the data set
p    the sum of the degrees of freedom for the terms. The terms do not include the constant.

P-value – Analysis of variance table


This p-value is for the test of the null hypothesis that all of the coefficients that are in the model equal zero,
except for the constant coefficient. The p-value is a probability that is calculated from an F-distribution with the
degrees of freedom (DF) as follows:
Numerator DF
sum of the degrees of freedom for the term or the terms in the test
Denominator DF
degrees of freedom for error
Formula
1 − P(F ≤ fj)
Notation
Term Description
P(F ≤ fj) cumulative distribution function for the F-distribution
fj f-statistic for the test

Regression equation
For a model with multiple predictors, the equation is:
y = β0 + β1x1 + … + βkxk + ε
The fitted equation is:
ŷ = b0 + b1x1 + … + bkxk
In simple linear regression, which includes only one predictor, the model is:
y = β0 + β1x1 + ε
Using regression estimates b0 for β0, and b1 for β1, the fitted equation is:
ŷ = b0 + b1x1
Notation

Term Description
y    response
xk   kth term. Each term can be a single predictor, a polynomial term, or an interaction term.
βk   kth population regression coefficient
ε    error term that follows a normal distribution with a mean of 0
bk   estimate of kth population regression coefficient
ŷ    fitted response

Residual (Resid)
Formula
ei = yi − ŷi
Notation

Term Description
ei   ith residual
yi   ith observed response value
ŷi   ith fitted response
R-sq
R2 is also known as the coefficient of determination.
Formula
R2 = 1 − (SS Error / SS Total) = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²
Notation

Term Description
yi   ith observed response value
ȳ    mean response
ŷi   ith fitted response
R-sq (adj)
Formula
R2(adj) = 1 − (MS Error / MS Total) = 1 − [Σ(yi − ŷi)² / (n − p − 1)] / [Σ(yi − ȳ)² / (n − 1)]
While the calculations for adjusted R2 can produce negative values, Minitab displays zero for these cases.
Notation

Term Description
yi   ith observed response value
ŷi   ith fitted response
ȳ    mean response
n    number of observations
p    number of terms in the model
R-sq (pred)
Formula
R2(pred) = 1 − PRESS / SS Total, where PRESS = Σ (ei / (1 − hi))²
While the calculations for R2(pred) can produce negative values, Minitab displays zero for these cases.
Notation

Term Description
yi   ith observed response value
ȳ    mean response
n    number of observations
ei   ith residual
hi   ith diagonal element of X(X'X)−1X'
X    design matrix
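PRESS and R2(pred) can be sketched with the hat-matrix diagonal (a NumPy illustration on invented data):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b                                   # residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverage values h_i
press = np.sum((e / (1 - h)) ** 2)              # leave-one-out residuals, squared
sst = np.sum((y - y.mean()) ** 2)

r2 = 1 - np.sum(e ** 2) / sst
r2_pred = 1 - press / sst
print(round(float(r2), 4), round(float(r2_pred), 4))
```

Because each residual is inflated by 1/(1 − hi), PRESS is at least as large as SS Error, so R2(pred) never exceeds R2.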
S
Formula
S = √MSE
Notation

Term Description
MSE  mean square error

Standard error of the coefficient (SE Coef)
For simple linear regression, the standard error of the coefficient is:
SE(b1) = s / √Σ(xi − x̄)²
The standard errors of the coefficients for multiple regression are the square roots of the diagonal elements of
this matrix:
s2(X'X)−1
Notation

Term Description
xi   ith predictor value
x̄    mean of the predictor
X    design matrix
X'   transpose of the design matrix
s2   mean square error
Standardized residual (Std Resid)
Standardized residuals are also called "internally Studentized residuals."
Formula
Std Residi = ei / (s √(1 − hi))
Notation

Term Description
ei   ith residual
hi   ith diagonal element of X(X'X)−1X'
s2   mean square error
X    design matrix
X'   transpose of the design matrix
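Standardized residuals can be computed the same way and flagged when they exceed 2 in absolute value, matching the "R Large residual" note in the example output. This is an illustrative NumPy sketch with a deliberately planted unusual observation:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + 0.5 * rng.normal(size=n)
y[4] += 4.0                                     # plant one unusual observation

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s = np.sqrt(e @ e / (n - X.shape[1]))           # sqrt of MSE
std_resid = e / (s * np.sqrt(1 - h))

flagged = np.flatnonzero(np.abs(std_resid) > 2)  # indices Minitab would mark "R"
print(flagged)
```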
T-value
Formula
tj = bj / SE(bj)
Notation

Term Description
tj      test statistic for the jth coefficient
bj      jth estimated coefficient
SE(bj)  standard error of the jth estimated coefficient

Variance inflation factor (VIF)
Minitab calculates the VIF by regressing each predictor on the remaining predictors and noting the R2 value.
Formula
For predictor xj, the VIF is:
VIF = 1 / (1 − R2(xj))
Notation

Term Description
R2(xj)  coefficient of determination with xj as the response variable and the other terms
in the model as the predictors
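The description above translates directly into code. This NumPy sketch (function and column names invented for illustration) regresses each column on the others and applies 1 / (1 − R²):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress x_j on the remaining columns
    (plus a constant) and apply 1 / (1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(others, X[:, j], rcond=None)[0]
        resid = X[:, j] - others @ b
        sst = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1 - np.sum(resid ** 2) / sst
        out.append(1 / (1 - r2))
    return np.array(out)

# Demo: x2 is nearly a copy of x1, so both show severe multicollinearity
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)
x3 = rng.normal(size=100)
print(np.round(vif(np.column_stack([x1, x2, x3])), 2))
```

A common rule of thumb treats VIF values above 5 or 10 as evidence of problematic multicollinearity.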
