Stat 448
Notes 8 Examples (March 5)
Examine the relationship between Price and the other continuous variables in the data set by generating plots of Price versus each of them.
ods graphics on;
proc sgscatter data=st.cars;
   plot price*(citympg hwympg cylinders enginesize
               horsepower fueltank luggage weight);
run;
ods graphics off;
Citympg appears to have a relationship with Price, but it does not appear to be linear. A quadratic
term might be appropriate.
Hwympg appears to have a relationship with Price, but it does not appear to be linear. A quadratic term
might be appropriate.
Cylinders might not be a true continuous variable because it only takes on a few values. However, the
values are ordinal in nature, so it could be used as a numeric variable in a regression.
EngineSize appears to have a positive linear relationship with Price.
FuelTank appears to have a positive relationship with Price. The relationship might be curvilinear.
Weight seems to have a positive relationship with Price. A quadratic term might be useful.
The next step in data exploration is to generate the correlations between the variables.
proc corr data=st.cars;
   var price citympg hwympg cylinders enginesize
       horsepower fueltank luggage weight;
run;
Horsepower has the strongest correlation with Price (0.81667), so you conclude that this variable
would be one of the best variables to include in a regression model. However, recall that the correlation
statistic measures the linear relationship between variables. Citympg, Hwympg, FuelTank, and
Weight all appeared to have a relationship with Price that was not linear. These relationships will be
missed by a correlation analysis. It might be that these variables will be better predictors of Price if the
nature of their relationship is considered.
Also, many of the potential independent variables are highly correlated with one another. Correlation among the independent variables can cause model instability. This should be taken into consideration when you develop the model.
Create squared terms for the variables whose relationship with Price appeared to be curvilinear:

data st.cars2;
   set st.cars;   /* build cars2 from the original cars data */
   Citympg2=Citympg*Citympg;
   Hwympg2=Hwympg*Hwympg;
   FuelTank2=FuelTank*FuelTank;
   Weight2=Weight*Weight;
run;
After the data preparation is complete, you can use PROC REG to identify candidate models. It is often helpful to use more than one of the selection options when attempting to identify the models. For example, you might choose to do forward and backward selection as well as one or two of the all-possible-regressions techniques.
ods graphics on;
proc reg data=st.cars2 plots=criteria(unpack label);
   backward:
   model price=citympg citympg2 hwympg hwympg2 cylinders
               enginesize horsepower fueltank fueltank2
               luggage weight weight2 / selection=backward slstay=0.1;
   forward:
   model price=citympg citympg2 hwympg hwympg2 cylinders
               enginesize horsepower fueltank fueltank2
               luggage weight weight2 / selection=forward slentry=0.1;
   Rsquared:
   model price=citympg citympg2 hwympg hwympg2 cylinders
               enginesize horsepower fueltank fueltank2
               luggage weight weight2
               / selection=rsquare adjrsq cp sbc aic best=3;
   plot cp.*np. / vaxis=0 to 30 by 5 haxis=0 to 12 by 1
                  cmallows=red nostat nomodel;
   symbol v=circle w=4 h=1;
   Adjusted_R2:
   model price=citympg citympg2 hwympg hwympg2 cylinders
               enginesize horsepower fueltank fueltank2
               luggage weight weight2
               / selection=adjrsq rsquare cp sbc aic best=10;
run;
quit;
ods graphics off;
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               3        4442.69069     1480.89690      65.41    <.0001
Error              77        1743.20808       22.63907
Corrected Total    80        6185.89877

Parameter Estimates

Variable      Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept                4.76390           2.30147      97.00049       4.28    0.0418
Citympg                 -0.79763           0.21364     315.57157      13.94    0.0004
Citympg2                 0.03625           0.01180     213.71551       9.44    0.0029
Horsepower               0.09174           0.01734     633.83279      28.00    <.0001
Backward elimination results in a model with three variables, Citympg, Citympg2, and
Horsepower. The model has an R2 of 0.7182 and a Cp that is less than the number of parameters.
Analysis of Variance

Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               2        4333.86956     2166.93478      91.26    <.0001
Error              78        1852.02921       23.74396
Corrected Total    80        6185.89877

Parameter Estimates

Variable      Parameter Estimate    Standard Error    Type II SS    F Value    Pr > F
Intercept                4.73535           2.52464      83.53262       3.52    0.0644
Horsepower               0.10006           0.01774     755.30263      31.81    <.0001
FuelTank                 0.88156           0.29774     208.15122       8.77    0.0041
The forward selection procedure results in a model with two variables, Horsepower and FuelTank.
The R2 is 0.7006. However, Cp is not less than p.
The forward and backward selection procedures resulted in models with completely different variables.
This is often the case when the variables being considered for the model are highly correlated with one
another.
Model   Number in              Adjusted
Index   Model       R-Square   R-Square   C(p)     AIC        SBC         Variables in Model
14      5           0.7331     0.7153     2.8129   256.2007   270.56737   Hwympg Hwympg2 Horsepower FuelTank Weight
15      5           0.7327     0.7149     2.9080   256.3079   270.67457   Hwympg Hwympg2 Cylinders Horsepower FuelTank
16      6           0.7387     0.7175     3.2971   256.4727   273.23385   Citympg Hwympg2 EngineSize Horsepower FuelTank2 Weight
17      6           0.7380     0.7167     3.4880   256.6923   273.45348   Citympg Citympg2 EngineSize Horsepower FuelTank2 Weight
18      6           0.7362     0.7148     3.9617   257.2348   273.99596   Hwympg Hwympg2 EngineSize Horsepower FuelTank Weight
The models are ranked by their R2, and the best three models for each possible number of variables are shown. Based on Mallows' Cp criterion, you should consider models with three or four variables. The best three-variable model includes the variables Citympg, Hwympg2, and Horsepower. This presents a problem because it includes Hwympg2 without the first power of the variable. In this case, you might choose to use either the second or third model listed with three variables.
It is usually desirable to construct hierarchically well-formulated models. This means that a model that includes a variable raised to a power should also include all lower powers of that variable. Likewise, a model that includes a cross-product term should also include each of the individual variable terms. An exception would be an overriding physical reason that leads you to conclude that the true population model does not include the lower-order term.
The best four-variable model includes the variables Hwympg, Hwympg2, Horsepower, and
FuelTank. The four-variable model has a higher R2 than the three-variable models, but that could be
solely due to the increased number of variables in the model. However, the adjusted R2 is also better for
the four-variable model. The Cp and AIC are slightly smaller, and therefore better, for the four-variable
model. The SBC is slightly smaller for the three-variable model.
Note that the number of parameters in each model is one more than the number listed in the output: the output reports the number of variables, and the number of parameters is the number of variables plus one for the intercept.
The plot shows that the minimum Cp is reached with five-parameter models. Beyond five parameters, Cp begins to increase. However, recall that Mallows suggested using the model with the fewest variables that meets the criterion Cp <= p. At least one model with four parameters, or three predictor variables, meets this criterion.
The model with the best adjusted R2 is the same as the best four-variable model found with the R2
selection criteria.
The standardized estimate for each variable can be thought of as a relative measure of the importance of the predictor variables. The standardized estimate for Horsepower is the largest in absolute value (0.56), which indicates that Horsepower affects Price more than the other variables do.
Now that you have narrowed down the models under consideration, you can evaluate the models by checking whether the assumptions of the regression were met. In addition, influential observations should be identified and the model should be checked for collinearity.
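The PROC REG step that produces the diagnostics on the following pages is not shown here. A minimal sketch, assuming the three-variable model chosen above and an output data set named check (the name used by a later step); the plot request is an assumption about how the influence graphs were produced:

ods graphics on;
proc reg data=st.cars2 plots(label)=(cooksd dffits);
   /* request variance inflation factors, collinearity diagnostics,
      and influence statistics for the candidate model */
   model price=hwympg hwympg2 horsepower / vif collin influence;
   output out=check r=residual p=pred;   /* residuals and predicted values for later checks */
run;
quit;
ods graphics off;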
Parameter Estimates

Variable      DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Variance Inflation
Intercept      1               4.03949           2.17024       1.86      0.0665                     0
Hwympg         1              -0.80407           0.21378      -3.76      0.0003               4.06937
Hwympg2        1               0.04350           0.01430       3.04      0.0032               2.26764
Horsepower     1               0.09730           0.01614       6.03      <.0001               2.36905
As expected, the overall F-test is significant and the adjusted R2 is 0.7082. The model equation is

Price = 4.03949 - 0.80407 * Hwympg + 0.04350 * Hwympg2 + 0.09730 * Horsepower

The largest variance inflation factor is 4.06937, which is less than 10.
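For illustration, the equation can be evaluated directly; the input values below are hypothetical, not taken from the data:

data _null_;
   /* hypothetical input values, chosen only to illustrate the equation */
   hwympg=30; horsepower=150;
   price_hat=4.03949 - 0.80407*hwympg + 0.04350*hwympg**2 + 0.09730*horsepower;
   put price_hat=;   /* prints approximately 33.66 */
run;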
Collinearity Diagnostics

                                        --------- Proportion of Variation ---------
Number   Eigenvalue   Condition Index   Intercept   Hwympg    Hwympg2   Horsepower
1        2.17779      1.00000           0.01125     0.00133   0.03345   0.00845
2        1.52787      1.19389           0.00125     0.09241   0.06816   0.00352
3        0.26890      2.84585           0.03127     0.32286   0.68972   0.00003866
4        0.02544      9.25245           0.95623     0.58341   0.20867   0.98800
None of the condition index values is even greater than 10. Multicollinearity does not appear to be a
problem with this model.
Look at the graphs that were generated by the PLOTS option.
The normal quantile plot does not indicate any serious problems with the normality assumption. To
further evaluate the normality assumption, you can generate a test of normality, as well as additional
graphs, using the UNIVARIATE procedure.
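A minimal sketch of such a check, assuming the check output data set and the residual variable from the regression step above:

proc univariate data=check normal;   /* NORMAL requests tests of normality */
   var residual;
   histogram residual / normal;                    /* histogram with a normal curve overlay */
   qqplot residual / normal(mu=est sigma=est);     /* normal Q-Q plot with estimated parameters */
run;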
The plot of the residuals versus the predicted values might cause some concern about the model. The
smaller predicted values appear to have less variability than the larger predicted values. This is a violation
of the assumption of equal variances. Remedial measures are discussed later in the course.
The constant variance assumption can be evaluated with the Spearman rank correlation coefficient
between the absolute values of the residuals and the predicted values.
data check;
   set check;   /* check is assumed to hold the residuals from the earlier OUTPUT statement */
   abserror=abs(residual);   /* absolute residuals for the Spearman check */
run;
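The correlation step itself does not appear on this page; a minimal sketch, assuming the variable names abserror and pred:

proc corr data=check spearman;   /* SPEARMAN requests rank correlations */
   var abserror pred;
run;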
The Spearman rank correlation coefficient between the absolute values of the residuals and the predicted
values is about 0.603. The highly significant p-value (<.0001) indicates a strong correlation between the
absolute values of the residuals and the predicted values. The positive correlation coefficient indicates that
the residuals increase as the predicted values increase.
The following graphs illustrate how to use influence statistics and the associated cutoff values to identify influential observations.
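The conventional cutoffs can be computed directly; a small sketch, using n = 81 observations (from the earlier output) and p = 4 parameters:

data _null_;
   n=81; p=4;                  /* sample size and number of model parameters */
   cookd_cut=4/n;              /* common cutoff for Cook's D: about 0.049 */
   dffits_cut=2*sqrt(p/n);     /* common cutoff for |DFFITS|: about 0.444 */
   put cookd_cut= dffits_cut=;
run;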
There are three observations that appear to be influential based on Cook’s D and DFFITS statistics. The
data for these observations should be checked to ensure that no transcription or data entry errors have
occurred. If the data is erroneous, correct the errors and re-analyze the data.
It is possible that the model is not adequate. There might be another variable, not considered when the model was developed, that is important in explaining these unusual observations.
Another possibility is that the observation, though valid, might be unusual. If you had a larger sample
size, there might be more observations like the unusual ones. You might have to collect more data to
confirm the relationship suggested by the influential observation.
In general, do not exclude data. In many circumstances, some of the unusual observations contain
important information. If you do choose to exclude some observations, you should include a description
of the types of observations you exclude and provide an explanation. You should also discuss the
limitations of your conclusions, given the exclusions, as part of your report or presentation.
The variability of the residuals appears to increase as the predicted value increases. The Spearman rank
correlation coefficient you computed earlier indicates a significant correlation between the absolute value
of the residuals and the predicted values (0.603). In an attempt to stabilize the variances, try several
transformations on the response variable and evaluate the models.
First, try the square root transformation to stabilize the variance:
title 'Square Root Transformation';
data st.cars2;
   set st.cars2;
   sqrt=sqrt(price);   /* square root of the response */
run;

ods graphics on;
proc reg data=st.cars2 plots=diagnostics(unpack);
   model sqrt=hwympg hwympg2 horsepower;
   output out=sqrt r=residuals p=predicted;   /* save residuals and predicted values */
run;
quit;
ods graphics off;
data sqrt;
   set sqrt;
   abserror=abs(residuals);
run;
The residual plot appears to be slightly better than the previous plot using the original variable price.
Spearman Correlation Coefficients, N = 81
Prob > |r| under H0: Rho=0

                               abserror    predicted
abserror                        1.00000      0.37688
                                              0.0005

predicted                       0.37688      1.00000
(Predicted Value of sqrt)        0.0005
The Spearman correlation coefficient is reduced to 0.377, with a significant p-value of 0.0005. This indicates that the square root transformation helped somewhat but might not be enough. You might want to try a stronger transformation, such as the log transformation.
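The log-transformation step itself falls on pages that contain only graphs. A minimal sketch, paralleling the square root code above; the variable name logprice and the ods output line are assumptions needed by the later steps, which reference a data set named ANOVATable and a predicted-value variable named pred (the out data set read in the final step presumably came from a similar OUTPUT statement):

title 'Log Transformation';
data st.cars2;
   set st.cars2;
   logprice=log(price);   /* hypothetical name for the log-transformed response */
run;

ods graphics on;
proc reg data=st.cars2 plots=diagnostics(unpack);
   model logprice=hwympg hwympg2 horsepower;
   ods output ANOVA=ANOVATable;          /* saves the ANOVA table for the detransformation step */
   output out=log r=residuals p=pred;    /* residuals and predicted log prices */
run;
quit;
ods graphics off;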
data log;
   set log;
   abserror=abs(residuals);
run;
The residual plot shows a random scatter around the reference line.
The Spearman correlation coefficient is reduced to 0.19 with a p-value of 0.08. The correlation is not
significant at a 0.05 alpha level.
Examine the output for the model using the log transformation.
Parameter Estimates

Variable      DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept      1               2.10223           0.10536      19.95      <.0001
Hwympg         1              -0.04193           0.01038      -4.04      0.0001
Hwympg2        1               0.00160        0.00069418       2.30      0.0240
Horsepower     1               0.00491        0.00078364       6.26      <.0001
The model is significant with an R2 of 0.7470. All parameter estimates are significant. However, when
you transform the dependent variable, it is not uncommon to observe a change in the relationship with
one or more of the independent variables. In practice, after you transform the dependent variable you
should consider starting the process of variable selection over again.
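A sketch of what restarting the selection might look like, assuming the logprice variable from the sketch above:

proc reg data=st.cars2;
   model logprice=citympg citympg2 hwympg hwympg2 cylinders
                  enginesize horsepower fueltank fueltank2
                  luggage weight weight2 / selection=backward slstay=0.1;
run;
quit;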
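The final step detransforms the predicted values back to the original price scale. If log(Price) is normally distributed with mean mu and variance sigma^2, then the mean of Price is exp(mu + sigma^2/2), so half the error mean square is added to the predicted log price before exponentiating. The step below assumes the ANOVATable and out data sets from earlier output; out is assumed to contain price and the predicted log values in pred.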
data _null_;
   set ANOVATable;   /* assumed to come from ods output ANOVA=ANOVATable; in the PROC REG step */
   if source='Error' then call symput('var', MS);   /* store the error mean square in a macro variable */
run;

data out;
   set out;
   estimate = exp(pred + &var/2);    /* detransform: exp(predicted log price + MSE/2) */
   difference = price - estimate;    /* observed price minus detransformed estimate */
run;
[Scatter plot of difference versus estimate: difference on the vertical axis (about -30 to 20) and estimate on the horizontal axis (0 to 60).]
There is at least one observation that seems to have a large difference (in absolute value) between the observed price and the estimated value based on the model and the detransformation.
Remember that the independent variables were selected before the response variable was transformed. Sometimes a different set of independent variables is more appropriate for the transformed dependent variable.