Notes 8 - Examples (March 5)

The document presents an analysis of a dataset containing information about 1993 model cars, focusing on the relationship between vehicle price and various attributes. Initial data exploration reveals potential non-linear relationships between price and several continuous variables, leading to the use of regression techniques to identify candidate models. Both backward and forward selection methods yield different models, highlighting the impact of multicollinearity among the independent variables.


Stat 448

Notes 8 Examples

Example 1: Initial Data Exploration


The SAS data set st.cars contains information about a sample of 1993 model cars. Eighty-one 4-, 5-,
and 6-passenger models were selected from the 1993 Cars Annual Auto Issue published by Consumer
Reports and from Pace New Car and Truck 1993 Buying Guide.
We are interested in examining the relationship between the price of the vehicles and various other car
attributes. The variables in the data set are
Manufacturer name of the manufacturer
Model name of the model
Type type of vehicle (Compact, Large, Midsize, Small, Sporty, or Van)
Price average of the variables MinPrice and MaxPrice
Citympg average city miles per gallon (EPA rating)
Hwympg average highway miles per gallon (EPA rating)
Cylinders number of cylinders
EngineSize engine displacement size (in liters)
Horsepower maximum horsepower
FuelTank fuel tank capacity (in gallons)
Passengers passenger capacity
Luggage luggage capacity (in cubic feet)
Weight weight of the vehicle (in pounds)
Origin origin of the vehicle (US or non-US).

Examine the relationship between Price and the other continuous variables in the data set by generating
plots of Price versus the other variables.
ods graphics on;
proc sgscatter data=st.cars;
plot price*(citympg hwympg cylinders enginesize
horsepower fueltank luggage weight);
run;
ods graphics off;


Citympg and Hwympg each appear to have a relationship with Price, but neither relationship appears
to be linear. Quadratic terms might be appropriate.
Cylinders might not be a true continuous variable because it only takes on a few values. However, the
values are ordinal in nature, so it could be used as a numeric variable in a regression.
EngineSize appears to have a positive linear relationship with Price.

Horsepower appears to have a positive linear relationship with Price.

FuelTank appears to have a positive relationship with Price. The relationship might be curvilinear.

There does not appear to be a relationship between Luggage and Price.

Weight seems to have a positive relationship with Price. A quadratic term might be useful.


The next step in data exploration is to generate the correlations between the variables.
proc corr data=st.cars;
var price citympg hwympg cylinders enginesize
horsepower fueltank luggage weight;
run;

Partial PROC CORR Output

Pearson Correlation Coefficients, N = 81
Prob > |r| under H0: Rho=0

              Price   Citympg    Hwympg  Cylinders  EngineSize  Horsepower  FuelTank   Luggage    Weight
Price       1.00000  -0.66596  -0.65672    0.71848     0.69586     0.81667   0.76059   0.39548   0.78695
                       <.0001    <.0001     <.0001      <.0001      <.0001    <.0001    0.0003    <.0001
Citympg    -0.66596   1.00000   0.94518   -0.67818    -0.74586    -0.70294  -0.79558  -0.49358  -0.83455
             <.0001              <.0001     <.0001      <.0001      <.0001    <.0001    <.0001    <.0001
Hwympg     -0.65672   0.94518   1.00000   -0.63872    -0.66457    -0.68257  -0.73645  -0.36963  -0.77411
             <.0001    <.0001               <.0001      <.0001      <.0001    <.0001    0.0007    <.0001
Cylinders   0.71848  -0.67818  -0.63872    1.00000     0.88770     0.78812   0.72227   0.58568   0.83014
             <.0001    <.0001    <.0001                 <.0001      <.0001    <.0001    <.0001    <.0001
EngineSize  0.69586  -0.74586  -0.66457    0.88770     1.00000     0.77345   0.82035   0.68017   0.91680
             <.0001    <.0001    <.0001     <.0001                  <.0001    <.0001    <.0001    <.0001
Horsepower  0.81667  -0.70294  -0.68257    0.78812     0.77345     1.00000   0.79511   0.35766   0.85875
             <.0001    <.0001    <.0001     <.0001      <.0001                <.0001    0.0010    <.0001
FuelTank    0.76059  -0.79558  -0.73645    0.72227     0.82035     0.79511   1.00000   0.61270   0.89932
             <.0001    <.0001    <.0001     <.0001      <.0001      <.0001              <.0001    <.0001
Luggage     0.39548  -0.49358  -0.36963    0.58568     0.68017     0.35766   0.61270   1.00000   0.63697
             0.0003    <.0001    0.0007     <.0001      <.0001      0.0010    <.0001              <.0001
Weight      0.78695  -0.83455  -0.77411    0.83014     0.91680     0.85875   0.89932   0.63697   1.00000
             <.0001    <.0001    <.0001     <.0001      <.0001      <.0001    <.0001    <.0001

Horsepower has the strongest correlation with Price (0.81667), so you conclude that this variable
would be one of the best variables to include in a regression model. However, recall that the correlation
statistic measures the linear relationship between variables. Citympg, Hwympg, FuelTank, and
Weight all appeared to have a relationship with Price that was not linear. These relationships will be
missed by a correlation analysis. It might be that these variables will be better predictors of Price if the
nature of their relationship is considered.
Also, many of the potential independent variables are highly correlated with one another.
Correlation among the independent variables can cause model instability. This should be taken into
consideration when you develop the model.
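To see concretely why a linear correlation can miss a curved relationship, consider this small Python sketch. The data are made up for illustration and are not from st.cars:

```python
from statistics import mean

def pearson(x, y):
    # Sample Pearson correlation coefficient.
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / (sum((a - mx) ** 2 for a in x) ** 0.5
                  * sum((b - my) ** 2 for b in y) ** 0.5)

# A perfect quadratic relationship that Pearson's r misses entirely:
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]
print(round(pearson(x, y), 4))  # 0.0 -- y is fully determined by x, yet r = 0
```

A scatter plot would show the relationship immediately, which is why the notes begin with PROC SGSCATTER rather than with PROC CORR alone.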


Example 2: Select Candidate Models


Recall that there are four variables, Citympg, Hwympg, FuelTank, and Weight, that appear to have
a polynomial relationship with Price. Prior to generating potential regression models, it might be wise
to center these variables to reduce collinearity problems with the models. In addition, you must square the
variables prior to using them in the MODEL statement of PROC REG.
proc stdize data=st.cars method=mean out=st.cars2;
var Citympg Hwympg FuelTank Weight;
run;

data st.cars2;
set st.cars2;
Citympg2=Citympg*Citympg;
Hwympg2=Hwympg*Hwympg;
FuelTank2=FuelTank*FuelTank;
Weight2=Weight*Weight;
run;
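The reason centering before squaring helps can be sketched in a few lines of Python, using illustrative numbers rather than the st.cars data:

```python
from statistics import mean

def pearson(x, y):
    # Sample Pearson correlation coefficient.
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / (sum((a - mx) ** 2 for a in x) ** 0.5
                  * sum((b - my) ** 2 for b in y) ** 0.5)

mpg = list(range(10, 46))          # illustrative mpg-like values, not the st.cars data
sq_raw = [v * v for v in mpg]
print(round(pearson(mpg, sq_raw), 3))  # close to 1: x and x**2 are nearly collinear

c = [v - mean(mpg) for v in mpg]   # center first (what PROC STDIZE METHOD=MEAN does)...
sq_cen = [v * v for v in c]        # ...then square, as in the DATA step above
print(round(pearson(c, sq_cen), 3))    # essentially 0: the collinearity is gone
```

Without centering, a variable and its square are almost perfectly correlated, which inflates standard errors in models containing both terms.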
After the data preparation is complete, you can use PROC REG to identify candidate models. It is often
helpful to use more than one of the selection options when identifying candidate models. For example,
you might choose to run forward and backward selection as well as one or two of the all-possible-regressions
techniques.
ods graphics on;
proc reg data= st.cars2 plots=criteria(unpack label);
backward:
model price=citympg citympg2 hwympg hwympg2 cylinders
enginesize horsepower fueltank fueltank2
luggage weight weight2 / selection=backward slstay=0.1;
forward:
model price=citympg citympg2 hwympg hwympg2 cylinders
enginesize horsepower fueltank fueltank2
luggage weight weight2 / selection=forward slentry=.1;
Rsquared:
model price=citympg citympg2 hwympg hwympg2 cylinders
enginesize horsepower fueltank fueltank2
luggage weight weight2
/ selection=rsquare adjrsq cp sbc aic best=3;
plot cp.*np. / vaxis=0 to 30 by 5 haxis=0 to 12 by 1
cmallows=red nostat nomodel;
symbol v=circle w=4 h=1;
Adjusted_R2:
model price=citympg citympg2 hwympg hwympg2 cylinders
enginesize horsepower fueltank fueltank2
luggage weight weight2
/ selection=adjrsq rsquare cp sbc aic best=10;
run;
quit;
ods graphics off;


Selected MODEL statement options:


SELECTION= specifies the method used to select the model. FORWARD, BACKWARD,
STEPWISE, MAXR, MINR, RSQUARE, ADJRSQ, CP, or NONE (use the full model)
can be specified. The default method is NONE.
CP computes Mallows' Cp for each model.
SBC computes the Schwarz’s Bayesian Criterion (SBC) statistic for each model.
AIC computes the Akaike Information Criterion (AIC) statistic for each model.
BEST=n limits the output to only the best n models. If SELECTION=CP or
SELECTION=ADJRSQ is specified, the BEST= option specifies the maximum number
of subset models to be displayed. For SELECTION=RSQUARE, the BEST= option
requests the maximum number of subset models for each size.
Examine the plot first.

Partial PROC REG Output for the Backward Elimination Model


Backward Elimination: Step 9

Variable EngineSize Removed: R-Square = 0.7182 and C(p) = 2.8133

Analysis of Variance

                          Sum of        Mean
Source            DF     Squares      Square   F Value  Pr > F
Model              3  4442.69069  1480.89690     65.41  <.0001
Error             77  1743.20808    22.63907
Corrected Total   80  6185.89877

                 Parameter  Standard
Variable          Estimate     Error   Type II SS  F Value  Pr > F
Intercept          4.76390   2.30147     97.00049     4.28  0.0418
Citympg           -0.79763   0.21364    315.57157    13.94  0.0004
Citympg2           0.03625   0.01180    213.71551     9.44  0.0029
Horsepower         0.09174   0.01734    633.83279    28.00  <.0001

Backward elimination results in a model with three variables, Citympg, Citympg2, and
Horsepower. The model has an R2 of 0.7182 and a Cp that is less than the number of parameters.


Summary of Backward Elimination

                          Number   Partial     Model
Step  Variable Removed   Vars In  R-Square  R-Square     C(p)   F Value  Pr > F
  1   Luggage                11    0.0003    0.7470   11.0679     0.07   0.7952
  2   Weight2                10    0.0003    0.7467    9.1367     0.07   0.7925
  3   Hwympg                  9    0.0007    0.7461    7.3186     0.19   0.6669
  4   Hwympg2                 8    0.0010    0.7451    5.5783     0.27   0.6050
  5   FuelTank                7    0.0039    0.7412    4.6220     1.10   0.2987
  6   Cylinders               6    0.0032    0.7380    3.4880     0.91   0.3438
  7   FuelTank2               5    0.0080    0.7300    3.6353     2.25   0.1375
  8   Weight                  4    0.0090    0.7210    4.0469     2.49   0.1188
  9   EngineSize              3    0.0028    0.7182    2.8133     0.78   0.3811

Partial PROC REG Output for the Forward Selection Model


Forward Selection: Step 2

Variable FuelTank Entered: R-Square = 0.7006 and C(p) = 5.5460

Analysis of Variance

                          Sum of        Mean
Source            DF     Squares      Square   F Value  Pr > F
Model              2  4333.86956  2166.93478     91.26  <.0001
Error             78  1852.02921    23.74396
Corrected Total   80  6185.89877

                 Parameter  Standard
Variable          Estimate     Error   Type II SS  F Value  Pr > F
Intercept          4.73535   2.52464     83.53262     3.52  0.0644
Horsepower         0.10006   0.01774    755.30263    31.81  <.0001
FuelTank           0.88156   0.29774    208.15122     8.77  0.0041


Summary of Forward Selection

                          Number   Partial     Model
Step  Variable Entered   Vars In  R-Square  R-Square     C(p)   F Value  Pr > F
  1   Horsepower              1    0.6670    0.6670   12.5987   158.21   <.0001
  2   FuelTank                2    0.0336    0.7006    5.5460     8.77   0.0041

The forward selection procedure results in a model with two variables, Horsepower and FuelTank.
The R2 is 0.7006. However, Cp is not less than p.
The forward and backward selection procedures resulted in models with completely different variables.
This is often the case when the variables being considered for the model are highly correlated with one
another.

Partial PROC REG Output for R2 Model Selection


Model   Number in      R-     Adjusted
Index     Model      Square   R-Square      C(p)       AIC        SBC     Variables in Model
1 1 0.6670 0.6627 12.5987 266.1241 270.91297 Horsepower
2 1 0.6193 0.6145 25.4227 276.9593 281.74819 Weight
3 1 0.5785 0.5732 36.3947 285.2023 289.99121 FuelTank
4 2 0.7006 0.6929 5.5460 259.4966 266.67998 Horsepower FuelTank
5 2 0.6949 0.6871 7.0856 261.0302 268.21357 Horsepower Weight
6 2 0.6854 0.6773 9.6343 263.5070 270.69032 Hwympg Horsepower
7 3 0.7208 0.7099 2.1174 255.8447 265.42253 Citympg Hwympg2
Horsepower
8 3 0.7192 0.7082 2.5523 256.3123 265.89009 Hwympg Hwympg2
Horsepower
9 3 0.7182 0.7072 2.8133 256.5917 266.16949 Citympg Citympg2
Horsepower
10 4 0.7319 0.7178 1.1314 254.5591 266.53139 Hwympg Hwympg2
Horsepower FuelTank
11 4 0.7284 0.7141 2.0776 255.6148 267.58700 Hwympg Hwympg2
Horsepower Weight
12 4 0.7271 0.7127 2.4281 256.0023 267.97456 Citympg Hwympg Hwympg2
Horsepower
13 5 0.7334 0.7157 2.7115 256.0862 270.45291 Citympg Hwympg Hwympg2
Horsepower FuelTank


Model   Number in      R-     Adjusted
Index     Model      Square   R-Square      C(p)       AIC        SBC     Variables in Model
14 5 0.7331 0.7153 2.8129 256.2007 270.56737 Hwympg Hwympg2
Horsepower FuelTank Weight
15 5 0.7327 0.7149 2.9080 256.3079 270.67457 Hwympg Hwympg2 Cylinders
Horsepower FuelTank
16 6 0.7387 0.7175 3.2971 256.4727 273.23385 Citympg Hwympg2
EngineSize Horsepower
FuelTank2 Weight
17 6 0.7380 0.7167 3.4880 256.6923 273.45348 Citympg Citympg2
EngineSize Horsepower
FuelTank2 Weight
18 6 0.7362 0.7148 3.9617 257.2348 273.99596 Hwympg Hwympg2
EngineSize Horsepower
FuelTank Weight
The models are ranked by their R2, and the best three models for each possible number of variables are
shown. Based on Mallows' Cp criterion, you should consider models with three or four variables. The best
three-variable model includes the variables Citympg, Hwympg2, and Horsepower. This presents a
problem because it includes Hwympg2 without the first power of the variable. In this case, you might
choose to use either the second or the third three-variable model listed.
It is usually desirable to construct hierarchically well-formulated models. This means that a model that
includes a variable to a power should also include all lower powers of the variable. Likewise, a model that
includes a cross-product term should also include each of the individual variable terms. An exception to
this would be if there were an overriding physical reason that leads you to conclude that the true
population model does not include the lower order term.
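The hierarchy rule is mechanical enough to check in code. The sketch below assumes, as in these notes, that squared terms are named by appending 2 to the base variable name; is_hierarchical is a hypothetical helper written for illustration, not part of SAS:

```python
import re

def is_hierarchical(terms):
    # Check that every squared term (named like 'Hwympg2' in these notes)
    # is accompanied by its first-order term.
    names = set(terms)
    for t in terms:
        m = re.fullmatch(r"(.+)2", t)
        if m and m.group(1) not in names:
            return False
    return True

print(is_hierarchical(["Citympg", "Hwympg2", "Horsepower"]))  # False
print(is_hierarchical(["Hwympg", "Hwympg2", "Horsepower"]))   # True
```

The first model above is exactly the problematic best three-variable model from the output: it contains Hwympg2 but not Hwympg.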
The best four-variable model includes the variables Hwympg, Hwympg2, Horsepower, and
FuelTank. The four-variable model has a higher R2 than the three-variable models, but that could be
solely due to the increased number of variables in the model. However, the adjusted R2 is also better for
the four-variable model. The Cp and AIC are slightly smaller, and therefore better, for the four-variable
model. The SBC is slightly smaller for the three-variable model.
The number of parameters in each model is one more than the number in the model listed in the output.
The output number refers to the number of variables. The number of parameters is the number of
variables plus one parameter for the intercept.


The plot shows that the minimum Cp is reached with five-parameter models. Beyond five parameters the
Cp begins to increase. However, recall that Mallows suggested using the model with the fewest number of
variables that meets the criteria Cp <= p. At least one model with four parameters, or three predictor
variables, meets this criterion.
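Mallows' Cp is computed as SSE_p / MSE_full - (n - 2p), where p counts parameters including the intercept. A Python sketch with inputs on the scale of this example; note that the full-model MSE below is an assumed value for illustration, not taken from the output:

```python
def mallows_cp(sse_p, mse_full, n, p):
    # Cp = SSE_p / MSE_full - (n - 2p); p includes the intercept.
    return sse_p / mse_full - (n - 2 * p)

# Hypothetical values: n = 81 cars, a candidate model with p = 4 parameters
# (3 predictors plus intercept), and an assumed full-model MSE of 23.0.
cp = mallows_cp(sse_p=1743.2, mse_full=23.0, n=81, p=4)
print(round(cp, 2))  # about 2.79
print(cp <= 4)       # True: meets Mallows' criterion Cp <= p
```

A model whose Cp is close to p has little estimated bias from the omitted variables; values far above p suggest important predictors are missing.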


PROC REG Output for the Adjusted R2 Selection Method


Model   Number in   Adjusted      R-
Index     Model     R-Square    Square      C(p)       AIC        SBC     Variables in Model
1 4 0.7178 0.7319 1.1314 254.5591 266.53139 Hwympg Hwympg2
Horsepower FuelTank
2 6 0.7175 0.7387 3.2971 256.4727 273.23385 Citympg Hwympg2 EngineSize
Horsepower FuelTank2 Weight
3 7 0.7170 0.7418 4.4747 257.5195 276.67505 Citympg Hwympg2 Cylinders
EngineSize Horsepower
FuelTank2 Weight
4 8 0.7168 0.7451 5.5783 258.4675 280.01759 Citympg Citympg2 Cylinders
EngineSize Horsepower
FuelTank FuelTank2 Weight
5 6 0.7167 0.7380 3.4880 256.6923 273.45348 Citympg Citympg2 EngineSize
Horsepower FuelTank2 Weight
6 7 0.7164 0.7412 4.6220 257.6911 276.84665 Citympg Citympg2 Cylinders
EngineSize Horsepower
FuelTank2 Weight
7 8 0.7159 0.7443 5.7972 258.7257 280.27577 Citympg Hwympg2 Cylinders
EngineSize Horsepower
FuelTank FuelTank2 Weight
8 5 0.7157 0.7334 2.7115 256.0862 270.45291 Citympg Hwympg Hwympg2
Horsepower FuelTank
9 7 0.7157 0.7405 4.8016 257.8997 277.05530 Citympg Hwympg Hwympg2
EngineSize Horsepower
FuelTank2 Weight
10 7 0.7155 0.7404 4.8336 257.9368 277.09238 Citympg Citympg2 EngineSize
Horsepower FuelTank
FuelTank2 Weight

The model with the best adjusted R2 is the same as the best four-variable model found with the R2
selection criteria.


Example 3: Evaluate the Importance of the Parameters


Assume that you chose to use the model with three variables, Hwympg, Hwympg2, and Horsepower.
You might want to evaluate the relative importance among these three variables in explaining the variance
in Price.
proc reg data=st.cars2;
model price=hwympg hwympg2 horsepower / stb;
run;
quit;
Selected MODEL statement option:
STB produces standardized regression coefficients. A standardized regression coefficient is computed
by dividing a parameter estimate by the ratio of the sample standard deviation of the dependent
variable to the sample standard deviation of the regressor.
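For a single-predictor model, the standardized coefficient reduces to the Pearson correlation, which makes the definition easy to verify in a short Python sketch (the horsepower and price numbers below are invented for illustration):

```python
from statistics import mean, stdev

def ols_slope(x, y):
    # Simple-regression OLS slope: cov(x, y) / var(x).
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return num / (sum((a - mx) ** 2 for a in x) ** 0.5
                  * sum((b - my) ** 2 for b in y) ** 0.5)

hp    = [100, 130, 160, 200, 255, 300]        # horsepower-like values (made up)
price = [10.0, 13.5, 15.0, 21.0, 26.5, 33.0]  # price-like values (made up)

b = ols_slope(hp, price)
b_std = b * stdev(hp) / stdev(price)  # what PROC REG's STB option reports
# With one predictor the standardized slope equals Pearson's r:
print(round(b_std, 4) == round(pearson(hp, price), 4))  # True
```

Because every standardized coefficient is on the same unit-free scale, their absolute values can be compared across predictors, which is how the output below is interpreted.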

Partial PROC REG Output


                 Parameter  Standard                    Standardized
Variable    DF    Estimate     Error  t Value  Pr > |t|     Estimate
Intercept    1     4.03949   2.17024     1.86    0.0665            0
Hwympg       1    -0.80407   0.21378    -3.76    0.0003     -0.45822
Hwympg2      1     0.04350   0.01430     3.04    0.0032      0.27668
Horsepower   1     0.09730   0.01614     6.03    <.0001      0.56031

The standardized estimate for each variable can be thought of as a relative measure of the importance of
the predictor variables. The standardized estimate for Horsepower is the largest in absolute value
(0.56), which suggests that Horsepower contributes more to explaining Price than the other variables
in the model.

Now that you have narrowed down the models under consideration, you can evaluate them by checking
whether the regression assumptions are met. In addition, influential observations should be
identified and the model should be checked for collinearity.


Example 4: Model Diagnostics – Normality, Constant Variance, Collinearity, and Influential Observations


Assume that you chose to use the model with three variables, Hwympg, Hwympg2, and Horsepower.
Evaluate this model by checking for violations of the assumptions and collinearity. Also identify any
observations that appear to be influential.
ods graphics on;
proc reg data= st.cars2 plots=(diagnostics(unpack) cooksd dffits);
model price=hwympg hwympg2 horsepower
/ stb vif collin collinoint influence;
output out=check r=residuals p=predicted;
run;
quit;
ods graphics off;

Partial PROC REG Output


Analysis of Variance

                          Sum of        Mean
Source            DF     Squares      Square   F Value  Pr > F
Model              3  4448.69341  1482.89780     65.73  <.0001
Error             77  1737.20536    22.56111
Corrected Total   80  6185.89877

Root MSE          4.74985   R-Square  0.7192
Dependent Mean   18.64321   Adj R-Sq  0.7082
Coeff Var        25.47766

Parameter Estimates

                 Parameter  Standard                     Variance
Variable    DF    Estimate     Error  t Value  Pr > |t|  Inflation
Intercept    1     4.03949   2.17024     1.86    0.0665          0
Hwympg       1    -0.80407   0.21378    -3.76    0.0003    4.06937
Hwympg2      1     0.04350   0.01430     3.04    0.0032    2.26764
Horsepower   1     0.09730   0.01614     6.03    <.0001    2.36905

As expected, the overall F-test is significant and the adjusted R2 is 0.7082. The model equation is
Price = 4.03949 – 0.80407 * Hwympg + 0.0435 * Hwympg2 + 0.09730 * Horsepower

The largest variance inflation factor is 4.06937, which is less than 10.
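The VIF for a predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. A Python sketch of the formula, using a hypothetical correlation of 0.87 between two predictors (with only two predictors, that R^2 is just the squared correlation):

```python
def vif_from_r2(r2):
    # Variance inflation factor for one regressor, given the R^2 from
    # regressing it on the remaining predictors.
    return 1.0 / (1.0 - r2)

r = 0.87  # hypothetical correlation between two predictors
print(round(vif_from_r2(r * r), 2))  # 4.11 -- below the usual cutoff of 10
```

A VIF of 10 corresponds to R_j^2 = 0.9, that is, 90% of a predictor's variance being explained by the other predictors.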


Collinearity Diagnostics

                      Condition  --------- Proportion of Variation ---------
Number  Eigenvalue      Index    Intercept   Hwympg   Hwympg2   Horsepower
  1       2.17779     1.00000     0.01125   0.00133   0.03345      0.00845
  2       1.52787     1.19389     0.00125   0.09241   0.06816      0.00352
  3       0.26890     2.84585     0.03127   0.32286   0.68972   0.00003866
  4       0.02544     9.25245     0.95623   0.58341   0.20867      0.98800

None of the condition index values even reaches 10. Multicollinearity does not appear to be a
problem with this model.
Look at the graphs that were generated by the PLOTS option.

The normal quantile plot does not indicate any serious problems with the normality assumption. To
further evaluate the normality assumption, you can generate a test of normality, as well as additional
graphs, using the UNIVARIATE procedure.


The plot of the residuals versus the predicted values might cause some concern about the model. The
smaller predicted values appear to have less variability than the larger predicted values. This is a violation
of the assumption of equal variances. Remedial measures are discussed later in the course.
The constant variance assumption can be evaluated with the Spearman rank correlation coefficient
between the absolute values of the residuals and the predicted values.
data check;
set check;
abserror=abs(residuals);
run;

proc corr data=check spearman nosimple;
var abserror predicted;
run;

PROC CORR Output


The CORR Procedure

2 Variables: abserror predicted

Spearman Correlation Coefficients, N = 81
Prob > |r| under H0: Rho=0

             abserror  predicted
abserror      1.00000    0.60274
                          <.0001
predicted     0.60274    1.00000
               <.0001

The Spearman rank correlation coefficient between the absolute values of the residuals and the predicted
values is about 0.603, with a highly significant p-value (<.0001). The positive coefficient indicates that
the absolute residuals increase as the predicted values increase, which is evidence against the
constant variance assumption.
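The Spearman coefficient used here is simply Pearson's correlation computed on the ranks of the data. A pure-Python sketch, with made-up residual values rather than the model's actual output:

```python
from statistics import mean

def ranks(v):
    # Average ranks (handles ties), 1-based.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    # Pearson correlation applied to the ranks.
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Residual spread growing with the predicted value gives a positive coefficient:
pred   = [5, 10, 15, 20, 25, 30]          # hypothetical predicted values
abserr = [0.5, 1.1, 1.8, 2.0, 3.5, 4.2]   # hypothetical absolute residuals
print(round(spearman(pred, abserr), 3))   # 1.0: a perfectly monotone increase
```

Because it uses ranks, Spearman's coefficient picks up any monotone trend in the residual spread, not just a linear one.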
The following graphs illustrate how to use influence statistics and the associated cutoff values to
identify influential observations.


There are three observations that appear to be influential based on Cook’s D and DFFITS statistics. The
data for these observations should be checked to ensure that no transcription or data entry errors have
occurred. If the data is erroneous, correct the errors and re-analyze the data.
It is possible that the model is not adequate. There might be another variable, not considered in
developing the model, that is important in explaining these unusual observations.
Another possibility is that the observation, though valid, might be unusual. If you had a larger sample
size, there might be more observations like the unusual ones. You might have to collect more data to
confirm the relationship suggested by the influential observation.
In general, do not exclude data. In many circumstances, some of the unusual observations contain
important information. If you do choose to exclude some observations, you should include a description
of the types of observations you exclude and provide an explanation. You should also discuss the
limitations of your conclusions, given the exclusions, as part of your report or presentation.


Example 5: Transforming Variables


Recall the graph of the residuals versus the predicted values from the model for the st.cars2 data.

The variability of the residuals appears to increase as the predicted value increases. The Spearman rank
correlation coefficient you computed earlier indicates a significant correlation between the absolute value
of the residuals and the predicted values (0.603). In an attempt to stabilize the variances, try several
transformations on the response variable and evaluate the models.
First, try the square root transformation to stabilize the variance.
title 'Square Root Transformation';
data st.cars2;
set st.cars2;
sqrt=sqrt(price);
run;
ods graphics on;
proc reg data=st.cars2 plots=diagnostics(unpack);
model sqrt=hwympg hwympg2 horsepower;
output out=sqrt r=residuals p=predicted;
run;
quit;
ods graphics off;

data sqrt;
set sqrt;
abserror=abs(residuals);
run;

proc corr data=sqrt spearman nosimple;
var abserror predicted;
run;

Partial PROC REG Output


The residual plot appears to be slightly better than the previous plot using the original variable price.
Spearman Correlation Coefficients, N = 81
Prob > |r| under H0: Rho=0

             abserror  predicted
abserror      1.00000    0.37688
                          0.0005
predicted     0.37688    1.00000
               0.0005

The Spearman correlation coefficient is reduced to 0.377, with a significant p-value of 0.0005. This
indicates that the square root transformation helped some but might not be enough. You might want to try
a stronger transformation, such as the logarithm transformation.
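Why is the logarithm "stronger" than the square root? It compresses the upper tail of the data more aggressively. A Python sketch with made-up, price-like values (a crude tail-spread measure, for illustration only):

```python
from math import log, sqrt

def tail_ratio(v):
    # Gap between the two largest values relative to the gap between the
    # two smallest: a crude measure of how stretched the upper tail is.
    s = sorted(v)
    return (s[-1] - s[-2]) / (s[1] - s[0])

prices = [7.4, 12.0, 18.6, 30.0, 61.9]  # skewed, price-like values (made up)
for name, f in [("raw", lambda p: p), ("sqrt", sqrt), ("log", log)]:
    print(name, round(tail_ratio([f(p) for p in prices]), 2))
```

The ratio shrinks under the square root and shrinks further under the log, which matches the Spearman coefficients observed here: 0.603 untransformed, 0.377 after the square root, and (as shown next) smaller still after the log.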


title 'Log Transformation';

data st.cars2;
set st.cars2;
logprice=log(price);
run;

proc reg data=st.cars2;
model logprice=hwympg hwympg2 horsepower;
output out=log r=residuals p=predicted;
run;
quit;


data log;
set log;
abserror=abs(residuals);
run;

proc corr data=log spearman nosimple;
var abserror predicted;
run;
title;

Partial PROC REG Output


The residual plot shows a random scatter around the reference line.

PROC CORR Output


Spearman Correlation Coefficients, N = 81
Prob > |r| under H0: Rho=0

             abserror  predicted
abserror      1.00000    0.19492
                          0.0812
predicted     0.19492    1.00000
               0.0812

The Spearman correlation coefficient is reduced to 0.19 with a p-value of 0.08. The correlation is not
significant at a 0.05 alpha level.
Examine the output for the model using the log transformation.


Partial PROC REG Output


Analysis of Variance

                          Sum of      Mean
Source            DF     Squares    Square   F Value  Pr > F
Model              3    12.08755   4.02918     75.78  <.0001
Error             77     4.09424   0.05317
Corrected Total   80    16.18179

Root MSE         0.23059   R-Square  0.7470
Dependent Mean   2.82388   Adj R-Sq  0.7371
Coeff Var        8.16573

Parameter Estimates

                 Parameter    Standard
Variable    DF    Estimate       Error  t Value  Pr > |t|
Intercept    1     2.10223     0.10536    19.95    <.0001
Hwympg       1    -0.04193     0.01038    -4.04    0.0001
Hwympg2      1     0.00160  0.00069418     2.30    0.0240
Horsepower   1     0.00491  0.00078364     6.26    <.0001

The model is significant with an R2 of 0.7470. All parameter estimates are significant. However, when
you transform the dependent variable, it is not uncommon to observe a change in the relationship with
one or more of the independent variables. In practice, after you transform the dependent variable you
should consider starting the process of variable selection over again.


Example 6: Detransformation of the Model


In order to obtain estimates of the mean for the original data, you must detransform the model, including
the low-bias adjustment factor.
First, rerun the regression and create output data sets with the mean squared error and the predicted
values. Then use a DATA step to compute the low-bias adjusted mean values.
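The adjustment factor comes from the lognormal distribution: if log(Price) is normal with mean pred and variance sigma^2 (estimated by the model MSE), the mean of Price itself is exp(pred + sigma^2/2), not exp(pred). A Python sketch with values roughly on the scale of this model (illustrative, not the actual fit):

```python
from math import exp
from random import lognormvariate, seed
from statistics import mean

mu, sigma2 = 2.8, 0.053  # roughly the scale of the log-price fit (illustrative)

seed(1)
sample = [lognormvariate(mu, sigma2 ** 0.5) for _ in range(200_000)]

naive = exp(mu)                  # back-transforming the log-scale mean directly
adjusted = exp(mu + sigma2 / 2)  # lognormal mean: the adjustment exp(pred + MSE/2)

# The simulated mean lands near the adjusted value, not the naive one:
print(round(naive, 2), round(adjusted, 2), round(mean(sample), 2))
```

Omitting the sigma^2/2 term systematically underestimates the mean price, which is why the DATA step below adds &var/2 before exponentiating.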
proc reg data=st.cars2;
model logprice = hwympg hwympg2 horsepower;
ods output ANOVA=ANOVATable;
output out=out p=pred;
run;
quit;

data _null_;
set ANOVATable;
if source='Error' then call symput('var', MS);
run;

data out;
set out;
estimate = exp(pred + &var/2);
difference = price - estimate;
run;

proc print data=out;
var manufacturer model hwympg horsepower price estimate difference;
run;

proc gplot data=out;
plot difference*estimate / vref=0;
run;
quit;


PROC GPLOT Output

[Scatter plot of difference versus estimate, with a reference line at difference = 0.
Estimate ranges from 0 to about 60; difference ranges from about -30 to 20.]

There is at least one observation with a large difference (in absolute value) between the observed
price and the estimate produced by the model after detransformation.
Remember that the independent variables were selected before the response variable was transformed.
A different set of independent variables might be more appropriate for the transformed dependent
variable.
