Stat 448
Notes 8 Examples (March 5)
Examine the relationship between Price and the other continuous variables in the data set by generating plots of Price versus each of them.
ods graphics on;
proc sgscatter data=st.cars;
   plot price*(citympg hwympg cylinders enginesize
               horsepower fueltank luggage weight);
run;
ods graphics off;
Citympg appears to have a relationship with Price, but it does not appear to be linear. A quadratic
term might be appropriate.
Hwympg appears to have a relationship with Price, but it does not appear to be linear. A quadratic term
might be appropriate.
Cylinders might not be a true continuous variable because it only takes on a few values. However, the
values are ordinal in nature, so it could be used as a numeric variable in a regression.
EngineSize appears to have a positive linear relationship with Price.
FuelTank appears to have a positive relationship with Price. The relationship might be curvilinear.
Weight seems to have a positive relationship with Price. A quadratic term might be useful.
The next step in data exploration is to generate the correlations between the variables.
proc corr data=st.cars;
   var price citympg hwympg cylinders enginesize
       horsepower fueltank luggage weight;
run;
Horsepower has the strongest correlation with Price (0.81667), so you conclude that this variable
would be one of the best variables to include in a regression model. However, recall that the correlation
statistic measures the linear relationship between variables. Citympg, Hwympg, FuelTank, and
Weight all appeared to have a relationship with Price that was not linear. These relationships will be
missed by a correlation analysis. It might be that these variables will be better predictors of Price if the
nature of their relationship is considered.
Also, many of the potential independent variables are highly correlated with one another. Correlation among the independent variables can cause model instability. This should be taken into consideration when you develop the model.
Create squared terms for the variables whose relationship with Price appeared to be curvilinear:

data st.cars2;
   set st.cars;   /* build cars2 from the original cars data */
   Citympg2=Citympg*Citympg;
   Hwympg2=Hwympg*Hwympg;
   FuelTank2=FuelTank*FuelTank;
   Weight2=Weight*Weight;
run;
After the data preparation is complete, you can use PROC REG to identify candidate models. It is often helpful to use more than one of the selection options when attempting to identify the models. For example, you might choose to do forward and backward selection as well as one or two of the all-possible-regressions techniques.
ods graphics on;
proc reg data=st.cars2 plots=criteria(unpack label);
   backward:
   model price=citympg citympg2 hwympg hwympg2 cylinders
               enginesize horsepower fueltank fueltank2
               luggage weight weight2 / selection=backward slstay=0.1;
   forward:
   model price=citympg citympg2 hwympg hwympg2 cylinders
               enginesize horsepower fueltank fueltank2
               luggage weight weight2 / selection=forward slentry=0.1;
   Rsquared:
   model price=citympg citympg2 hwympg hwympg2 cylinders
               enginesize horsepower fueltank fueltank2
               luggage weight weight2
               / selection=rsquare adjrsq cp sbc aic best=3;
   plot cp.*np. / vaxis=0 to 30 by 5 haxis=0 to 12 by 1
                  cmallows=red nostat nomodel;
   symbol v=circle w=4 h=1;
   Adjusted_R2:
   model price=citympg citympg2 hwympg hwympg2 cylinders
               enginesize horsepower fueltank fueltank2
               luggage weight weight2
               / selection=adjrsq rsquare cp sbc aic best=10;
run;
quit;
ods graphics off;
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        4442.69069     1480.89690      65.41    <.0001
Error              77        1743.20808       22.63907
Corrected Total    80        6185.89877

Parameter Estimates

Variable      Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept                4.76390           2.30147      97.00049       4.28    0.0418
Citympg                 -0.79763           0.21364     315.57157      13.94    0.0004
Citympg2                 0.03625           0.01180     213.71551       9.44    0.0029
Horsepower               0.09174           0.01734     633.83279      28.00    <.0001
Backward elimination results in a model with three variables, Citympg, Citympg2, and
Horsepower. The model has an R2 of 0.7182 and a Cp that is less than the number of parameters.
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        4333.86956     2166.93478      91.26    <.0001
Error              78        1852.02921       23.74396
Corrected Total    80        6185.89877

Parameter Estimates

Variable      Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept                4.73535           2.52464      83.53262       3.52    0.0644
Horsepower               0.10006           0.01774     755.30263      31.81    <.0001
FuelTank                 0.88156           0.29774     208.15122       8.77    0.0041
The forward selection procedure results in a model with two variables, Horsepower and FuelTank.
The R2 is 0.7006. However, Cp is not less than p.
The forward and backward selection procedures resulted in models with completely different variables.
This is often the case when the variables being considered for the model are highly correlated with one
another.
Model   Number in              Adjusted
Index   Model       R-Square   R-Square   C(p)     AIC        SBC         Variables in Model
14      5           0.7331     0.7153     2.8129   256.2007   270.56737   Hwympg Hwympg2 Horsepower FuelTank Weight
15      5           0.7327     0.7149     2.9080   256.3079   270.67457   Hwympg Hwympg2 Cylinders Horsepower FuelTank
16      6           0.7387     0.7175     3.2971   256.4727   273.23385   Citympg Hwympg2 EngineSize Horsepower FuelTank2 Weight
17      6           0.7380     0.7167     3.4880   256.6923   273.45348   Citympg Citympg2 EngineSize Horsepower FuelTank2 Weight
18      6           0.7362     0.7148     3.9617   257.2348   273.99596   Hwympg Hwympg2 EngineSize Horsepower FuelTank Weight
The models are ranked by their R2, and the best three models for each possible number of variables are shown. Based on Mallows' Cp criterion, you should consider models with three or four variables. The best three-variable model includes the variables Citympg, Hwympg2, and Horsepower. This presents a problem because it includes Hwympg2 without the first power of the variable. In this case, you might choose to use either the second or third model listed with three variables.
It is usually desirable to construct hierarchically well-formulated models. This means that a model that includes a variable raised to a power should also include all lower powers of that variable. Likewise, a model that includes a cross-product term should also include each of the individual variable terms. An exception would be an overriding physical reason that leads you to conclude that the true population model does not include the lower-order term.
The best four-variable model includes the variables Hwympg, Hwympg2, Horsepower, and
FuelTank. The four-variable model has a higher R2 than the three-variable models, but that could be
solely due to the increased number of variables in the model. However, the adjusted R2 is also better for
the four-variable model. The Cp and AIC are slightly smaller, and therefore better, for the four-variable
model. The SBC is slightly smaller for the three-variable model.
Note that the number of parameters in each model is one more than the number listed in the output: the output reports the number of variables, and the number of parameters is the number of variables plus one for the intercept.
The plot shows that the minimum Cp is reached with five-parameter models. Beyond five parameters, Cp begins to increase. However, recall that Mallows suggested using the model with the fewest variables that meets the criterion Cp <= p. At least one model with four parameters, or three predictor variables, meets this criterion.
The model with the best adjusted R2 is the same as the best four-variable model found with the R2
selection criteria.
The standardized estimate for each variable can be thought of as a relative measure of the importance of the predictor variables. The standardized estimate for Horsepower is the largest in absolute value (0.56), which indicates that Horsepower affects Price more than the other variables do.
Now that you have narrowed down the models under consideration, you can evaluate the models by checking whether the assumptions of the regression were met. In addition, influential observations should be identified and the model should be checked for collinearity.
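The PROC REG step that produces the diagnostics on the following pages is not shown here. A minimal sketch, assuming the three-variable model chosen above and an output data set named check (the name used by a later step); the plot request is an assumption about how the influence graphs were produced:

ods graphics on;
proc reg data=st.cars2 plots(label)=(cooksd dffits);
   /* request variance inflation factors, collinearity diagnostics,
      and influence statistics for the candidate model */
   model price=hwympg hwympg2 horsepower / vif collin influence;
   output out=check r=residual p=pred;   /* residuals and predicted values for later checks */
run;
quit;
ods graphics off;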
Parameter Estimates

Variable      DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Variance Inflation
Intercept      1               4.03949           2.17024       1.86      0.0665                     0
Hwympg         1              -0.80407           0.21378      -3.76      0.0003               4.06937
Hwympg2        1               0.04350           0.01430       3.04      0.0032               2.26764
Horsepower     1               0.09730           0.01614       6.03      <.0001               2.36905
As expected, the overall F-test is significant and the adjusted R2 is 0.7082. The model equation is

Price = 4.03949 - 0.80407 * Hwympg + 0.04350 * Hwympg2 + 0.09730 * Horsepower

The largest variance inflation factor is 4.06937, which is less than 10.
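For illustration, the equation can be evaluated directly; the input values below are hypothetical, not taken from the data:

data _null_;
   /* hypothetical input values, chosen only to illustrate the equation */
   hwympg=30; horsepower=150;
   price_hat=4.03949 - 0.80407*hwympg + 0.04350*hwympg**2 + 0.09730*horsepower;
   put price_hat=;   /* prints approximately 33.66 */
run;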
Collinearity Diagnostics

                                        --------- Proportion of Variation ---------
Number   Eigenvalue   Condition Index   Intercept   Hwympg    Hwympg2   Horsepower
1        2.17779      1.00000           0.01125     0.00133   0.03345   0.00845
2        1.52787      1.19389           0.00125     0.09241   0.06816   0.00352
3        0.26890      2.84585           0.03127     0.32286   0.68972   0.00003866
4        0.02544      9.25245           0.95623     0.58341   0.20867   0.98800
None of the condition index values is even greater than 10. Multicollinearity does not appear to be a
problem with this model.
Look at the graphs that were generated by the PLOTS option.
The normal quantile plot does not indicate any serious problems with the normality assumption. To
further evaluate the normality assumption, you can generate a test of normality, as well as additional
graphs, using the UNIVARIATE procedure.
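A minimal sketch of such a check, assuming the check output data set and the residual variable from the regression step above:

proc univariate data=check normal;   /* NORMAL requests tests of normality */
   var residual;
   histogram residual / normal;                    /* histogram with a normal curve overlay */
   qqplot residual / normal(mu=est sigma=est);     /* normal Q-Q plot with estimated parameters */
run;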
The plot of the residuals versus the predicted values might cause some concern about the model. The
smaller predicted values appear to have less variability than the larger predicted values. This is a violation
of the assumption of equal variances. Remedial measures are discussed later in the course.
The constant variance assumption can be evaluated with the Spearman rank correlation coefficient
between the absolute values of the residuals and the predicted values.
data check;
   set check;   /* check is assumed to hold the residuals from the earlier OUTPUT statement */
   abserror=abs(residual);   /* absolute residuals for the Spearman check */
run;
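The correlation step itself does not appear on this page; a minimal sketch, assuming the variable names abserror and pred:

proc corr data=check spearman;   /* SPEARMAN requests rank correlations */
   var abserror pred;
run;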
The Spearman rank correlation coefficient between the absolute values of the residuals and the predicted
values is about 0.603. The highly significant p-value (<.0001) indicates a strong correlation between the
absolute values of the residuals and the predicted values. The positive correlation coefficient indicates that
the residuals increase as the predicted values increase.
The following graphs illustrate how to use influence statistics and the associated cutoff values to identify influential observations.
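The conventional cutoffs can be computed directly; a small sketch, using n = 81 observations (from the earlier output) and p = 4 parameters:

data _null_;
   n=81; p=4;                  /* sample size and number of model parameters */
   cookd_cut=4/n;              /* common cutoff for Cook's D: about 0.049 */
   dffits_cut=2*sqrt(p/n);     /* common cutoff for |DFFITS|: about 0.444 */
   put cookd_cut= dffits_cut=;
run;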
There are three observations that appear to be influential based on Cook’s D and DFFITS statistics. The
data for these observations should be checked to ensure that no transcription or data entry errors have
occurred. If the data is erroneous, correct the errors and re-analyze the data.
It is possible that the model is not adequate. There might be another variable, not considered when the model was developed, that is important in explaining these unusual observations.
Another possibility is that the observation, though valid, might be unusual. If you had a larger sample
size, there might be more observations like the unusual ones. You might have to collect more data to
confirm the relationship suggested by the influential observation.
In general, do not exclude data. In many circumstances, some of the unusual observations contain
important information. If you do choose to exclude some observations, you should include a description
of the types of observations you exclude and provide an explanation. You should also discuss the
limitations of your conclusions, given the exclusions, as part of your report or presentation.
The variability of the residuals appears to increase as the predicted value increases. The Spearman rank
correlation coefficient you computed earlier indicates a significant correlation between the absolute value
of the residuals and the predicted values (0.603). In an attempt to stabilize the variances, try several
transformations on the response variable and evaluate the models.
First, try the square root transformation to stabilize the variance:
title 'Square Root Transformation';
data st.cars2;
   set st.cars2;
   sqrt=sqrt(price);   /* square root of the response */
run;

ods graphics on;
proc reg data=st.cars2 plots=diagnostics(unpack);
   model sqrt=hwympg hwympg2 horsepower;
   output out=sqrt r=residuals p=predicted;   /* save residuals and predicted values */
run;
quit;
ods graphics off;
data sqrt;
   set sqrt;
   abserror=abs(residuals);
run;
The residual plot appears to be slightly better than the previous plot using the original variable price.
Spearman Correlation Coefficients, N = 81
Prob > |r| under H0: Rho=0

                               abserror    predicted
abserror                        1.00000      0.37688
                                              0.0005

predicted                       0.37688      1.00000
(Predicted Value of sqrt)        0.0005
The Spearman correlation coefficient is reduced to 0.377, with a significant p-value of 0.0005. This indicates that the square root transformation helped somewhat but might not be enough. You might want to try a stronger transformation, such as the log transformation.
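The log-transformation step itself falls on pages that contain only graphs. A minimal sketch, paralleling the square root code above; the variable name logprice and the ods output line are assumptions needed by the later steps, which reference a data set named ANOVATable and a predicted-value variable named pred (the out data set read in the final step presumably came from a similar OUTPUT statement):

title 'Log Transformation';
data st.cars2;
   set st.cars2;
   logprice=log(price);   /* hypothetical name for the log-transformed response */
run;

ods graphics on;
proc reg data=st.cars2 plots=diagnostics(unpack);
   model logprice=hwympg hwympg2 horsepower;
   ods output ANOVA=ANOVATable;          /* saves the ANOVA table for the detransformation step */
   output out=log r=residuals p=pred;    /* residuals and predicted log prices */
run;
quit;
ods graphics off;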
data log;
   set log;
   abserror=abs(residuals);
run;
The residual plot shows a random scatter around the reference line.
The Spearman correlation coefficient is reduced to 0.19 with a p-value of 0.08. The correlation is not
significant at a 0.05 alpha level.
Examine the output for the model using the log transformation.
Parameter Estimates

Variable      DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept      1               2.10223           0.10536      19.95      <.0001
Hwympg         1              -0.04193           0.01038      -4.04      0.0001
Hwympg2        1               0.00160        0.00069418       2.30      0.0240
Horsepower     1               0.00491        0.00078364       6.26      <.0001
The model is significant with an R2 of 0.7470. All parameter estimates are significant. However, when
you transform the dependent variable, it is not uncommon to observe a change in the relationship with
one or more of the independent variables. In practice, after you transform the dependent variable you
should consider starting the process of variable selection over again.
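A sketch of what restarting the selection might look like, assuming the logprice variable from the sketch above:

proc reg data=st.cars2;
   model logprice=citympg citympg2 hwympg hwympg2 cylinders
                  enginesize horsepower fueltank fueltank2
                  luggage weight weight2 / selection=backward slstay=0.1;
run;
quit;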
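The final step detransforms the predicted values back to the original price scale. If log(Price) is normally distributed with mean mu and variance sigma^2, then the mean of Price is exp(mu + sigma^2/2), so half the error mean square is added to the predicted log price before exponentiating. The step below assumes the ANOVATable and out data sets from earlier output; out is assumed to contain price and the predicted log values in pred.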
data _null_;
   set ANOVATable;   /* assumed to come from ods output ANOVA=ANOVATable; in the PROC REG step */
   if source='Error' then call symput('var', MS);   /* store the error mean square in a macro variable */
run;

data out;
   set out;
   estimate = exp(pred + &var/2);    /* detransform: exp(predicted log price + MSE/2) */
   difference = price - estimate;    /* observed price minus detransformed estimate */
run;
[Scatter plot of difference versus estimate: difference on the vertical axis (about -30 to 20) and estimate on the horizontal axis (0 to 60).]
There is at least one observation that seems to have a large difference (in absolute value) between the observed price and the estimated value based on the model and the detransformation.
Remember that the independent variables were selected before the response variable was transformed. Sometimes a different set of independent variables is more appropriate for the transformed dependent variable.