
Big Data and Predictive Analysis

Assignment 4 (Lab 2 Part 2)


Predictive Modeling Using Regression-SAS Miner

REGRESSION EXERCISE
1. Predictive Modeling Using Regression
a. Return to the Chapter 3 Organics diagram in My Project. Use the StatExplore tool on the
ORGANICS data source.
1) A StatExplore node is connected to the ORGANICS node.

2) The StatExplore node results are generated.

After running the StatExplore node, the results below show that there are missing values in the
selected variables.

b. In order to prepare for regression, missing values are imputed. Why do you think we should
impute?

Imputation creates a synthetic value for each missing value. For interval variables, missing
values are replaced with the mean of the non-missing values in the dataset; for class
variables, they are assigned a distinct level. This matters because regression discards any
case with a missing input, so imputation keeps every observation usable. Therefore,
imputation is done before building the model to avoid bias and loss of data.
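The Impute node's behavior can be sketched in a few lines of pandas. This is an illustrative stand-in, not the actual SAS code: the column names and toy values are assumptions, but the logic (overall mean for interval inputs, a "U" level for class inputs, plus 0/1 imputation indicators) mirrors the node settings used later in step d.

```python
import pandas as pd
import numpy as np

# Toy data standing in for two ORGANICS inputs (names/values are illustrative)
df = pd.DataFrame({
    "DemAffl": [10.0, np.nan, 8.0, 12.0],  # interval input
    "DemGender": ["F", "M", np.nan, "F"],  # class input
})

# Missing-value indicators (the Impute node's "imputation indicators")
df["M_DemAffl"] = df["DemAffl"].isna().astype(int)
df["M_DemGender"] = df["DemGender"].isna().astype(int)

# Interval variables: replace missing values with the overall mean
df["IMP_DemAffl"] = df["DemAffl"].fillna(df["DemAffl"].mean())

# Class variables: replace missing values with the level "U" (unknown)
df["IMP_DemGender"] = df["DemGender"].fillna("U")

print(df[["IMP_DemAffl", "M_DemAffl", "IMP_DemGender", "M_DemGender"]])
```

The indicator columns matter because *whether* a value was missing can itself be predictive, which is exactly why M_DemAge and M_DemGender later appear in the final model.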
c. What changed after imputing?

The missing values were replaced with the mean of each variable. The result
below shows there is no more missing data.
SAS Diagram:

d. Add an Impute node from the Modify tab into the diagram and connect it to the Data Partition
node. Set the node to impute U for unknown class variable values and the overall mean for
unknown interval variable values. Create imputation indicators for all imputed inputs.
e. Add a Regression node to the diagram and connect it to the Impute node.

f. Choose stepwise as the selection model and the validation error as the selection criterion.
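The stepwise selection just configured can be sketched as a toy loop. This is a simplified, forward-only sketch: the variable names and the `fake_ase` scores are invented, and real stepwise selection in SAS Miner also reconsiders removing previously entered effects at each step, which this omits. The `score` function stands in for "fit on training data with these inputs, then measure ASE on the validation partition".

```python
def stepwise_select(candidates, score):
    """Greedily add the input that most lowers validation error; stop
    when no remaining input improves it."""
    selected, best_err = [], score([])
    improved = True
    while improved and candidates:
        improved = False
        # Try adding each remaining candidate; keep the best improvement
        trials = {v: score(selected + [v]) for v in candidates}
        var, err = min(trials.items(), key=lambda kv: kv[1])
        if err < best_err:
            selected.append(var)
            candidates.remove(var)
            best_err = err
            improved = True
    return selected, best_err

# Invented validation-ASE values: DemAffl and DemAge help, PromTime hurts
fake_ase = {(): 0.25, ("DemAffl",): 0.18, ("DemAffl", "DemAge"): 0.14,
            ("DemAffl", "DemAge", "PromTime"): 0.15}
score = lambda s: fake_ase.get(tuple(sorted(s)), 0.20)

print(stepwise_select(["DemAffl", "DemAge", "PromTime"], score))
# → (['DemAffl', 'DemAge'], 0.14)
```

Using validation error (rather than training error) as the criterion is what stops the loop before it adds inputs that only fit noise.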
g. Run the Regression node and view the results. Maximize the Effect Plot.

Iteration Plot: The selected model (based on minimum error) occurred in step 6.

h. Which variables are included in the final model? Which variables are important in this model?
What is the validation ASE?

Variables in the final model: IMP_DemAffl, IMP_DemAge, IMP_DemGender,
M_DemAffl, M_DemAge, M_DemGender.
All of these variables are important: the stepwise selection retained only significant effects.
The validation ASE is 0.137156.
Final model output:

Average Squared Error (ASE):


i) Go to line 664 in the Output window.

j) The odds ratios indicate the effect that each input has on the logit score.

The odds ratio estimates help you to interpret the model. Below are the odds ratio estimates for
the variables in the model.
k) Interpret the odds ratio estimates:

• The odds ratio estimate for IMP_DemAffl is 1.283. Each one-unit increase in affluence
grade multiplies the odds of purchase by 1.283, a 28.3% increase in the odds.
• The odds ratio estimate for IMP_DemAge is 0.947. Each additional year of age multiplies
the odds of purchase by 0.947, roughly a 5.3% decrease in the odds.
• For IMP_DemGender (F vs U), the odds ratio estimate is 6.967. The odds of purchase
for female cases are 6.967 times the odds for cases with the imputed Unknown (U)
gender level.
• For IMP_DemGender (M vs U), the odds ratio estimate is 2.899. The odds of purchase
for male cases are 2.899 times the odds for cases with the imputed Unknown (U)
gender level.
• For M_DemAffl (0 vs 1), the odds ratio estimate is 0.708. Cases whose affluence grade
was observed (indicator 0) have 0.708 times the odds of purchase of cases whose value
was imputed (indicator 1), about 29.2% lower odds.
• For M_DemAge (0 vs 1), the odds ratio estimate is 0.796. Cases whose age was observed
have 0.796 times the odds of purchase of cases whose age was imputed, about 20.4%
lower odds.
• For M_DemGender (0 vs 1), the odds ratio estimate is 0.685. Cases whose gender was
observed have 0.685 times the odds of purchase of cases whose gender was imputed,
about 31.5% lower odds.
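The link between the regression output and these odds ratios is simply exponentiation: for a logistic model, the odds ratio of an effect is exp(coefficient). The coefficients below are back-calculated from the reported odds ratios purely for illustration; the actual SAS parameter estimates may differ in later decimal places.

```python
import math

# Illustrative logit coefficients chosen so exp(beta) reproduces the
# odds ratios reported above (not copied from the SAS output)
coefs = {
    "IMP_DemAffl": 0.2492,   # exp(0.2492) ≈ 1.283
    "IMP_DemAge": -0.0545,   # exp(-0.0545) ≈ 0.947
}

for name, beta in coefs.items():
    odds_ratio = math.exp(beta)
    pct_change = (odds_ratio - 1) * 100
    print(f"{name}: OR = {odds_ratio:.3f} "
          f"({pct_change:+.1f}% change in odds per unit increase)")
```

A positive coefficient gives an odds ratio above 1 (odds rise with the input); a negative coefficient gives an odds ratio below 1 (odds fall), which is exactly the pattern seen for DemAffl versus DemAge.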

l) The validation ASE is given in the Fit Statistics window.


The validation ASE is 0.137156
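The fit statistic itself is straightforward to compute by hand: ASE is the mean squared difference between the 0/1 target and the model's predicted probability, evaluated here on the validation partition. The targets and probabilities below are invented toy numbers, not values from the ORGANICS data.

```python
def average_squared_error(actual, predicted):
    """ASE: mean of squared (target - predicted probability)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

y_valid = [1, 0, 0, 1]          # illustrative validation targets
p_valid = [0.8, 0.3, 0.1, 0.6]  # illustrative predicted probabilities

print(average_squared_error(y_valid, p_valid))  # → 0.075
```

Because it is computed on held-out data, a lower validation ASE means better generalization, which is why it is used both to pick the stepwise step and to compare the three models in this assignment.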
PART 2

a. In preparation for regression, are any transformations of the data warranted? Why or why not?
Regression models are sensitive to extreme or outlying values in the input space. Highly
skewed inputs can dominate the fit, or be selected over inputs that actually yield better
predictions, which defeats the goal of the analysis. Therefore, a log transformation is
applied to reduce skewness in the data and produce a better model.
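The effect of the log transform on skewness can be demonstrated on synthetic data. The lognormal sample below is an assumption standing in for a right-skewed input such as Affluence Grade; `log(x + 1)` is used so zero values stay defined (SAS Miner's Log method similarly adds an offset when values are at or below zero).

```python
import math
import random

random.seed(0)
# Right-skewed synthetic sample standing in for a skewed input
x = [random.lognormvariate(0, 1) for _ in range(1000)]

def skewness(data):
    """Sample skewness: third standardized moment."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((v - m) ** 2 for v in data) / n)
    return sum(((v - m) / s) ** 3 for v in data) / n

# Log transform; +1 keeps zeros in the input's domain
x_log = [math.log(v + 1) for v in x]

print(f"skewness before: {skewness(x):.2f}")
print(f"skewness after:  {skewness(x_log):.2f}")
```

The transformed distribution is far more symmetric, which is the same "nicely symmetric" shape observed in the Explore window after step g.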

i. Open the Variables window of the Regression node. Select the imputed interval inputs.

ii. Select Explore. The Explore window appears.


b. Both Card Tenure and Affluence Grade have moderately skewed distributions. Applying a log
transformation to these inputs might improve the model fit.

Card Tenure:
Affluence Grade:

c. Disconnect the Impute node from the Data Partition node.


d. Add a Transform Variables node from the Modify tab to the diagram and connect it to the
Data Partition node.
e. Connect the Transform Variables node to the Impute node.
f. Apply a log transformation to the DemAffl and PromTime inputs.
i. Open the Variables window of the Transform Variables node.
ii. Select Method → Log for the DemAffl and PromTime inputs. Select OK to close the Variables
window.

g. Run the Transform Variables node. Explore the exported training data. Did the transformations result
in less skewed distributions?
i. The easiest way to explore the created inputs is to open the Variables window in the subsequent
Impute node. Make sure that you update the Impute node before opening its Variables window.

ii. With the LOG_DemAffl and LOG_PromTime inputs selected, select Explore.

LOG_PromTime (Card Tenure):


LOG_DemAffl:

The distributions are nicely symmetric.

h. Rerun the Regression node. Do the selected variables change? How about the validation ASE?

The selected variables changed: two were replaced with the transformed LOG variable and
its imputation indicator. However, the number of variables did not change.
The new variables in the model are IMP_DemAge, IMP_DemGender, IMP_LOG_DemAffl,
M_DemAge, M_DemGender, and M_LOG_DemAffl.
Model Iteration Plot:
The selected model (based on minimum error) occurred in step 6.

The validation ASE is 0.138204. The average squared error for this model is
slightly higher than that of the model with untransformed inputs (ASE 0.137156).
Fit Statistic:
i. Go to line 664 of the Output window.
Below are the independent variables in line 664.

i. Apparently the log transformation actually increased the validation ASE slightly.
j. Create a full second-degree polynomial model. How does the validation average squared error for the
polynomial model compare to the original model?
i. Add another Regression node to the diagram and rename it Polynomial Regression.
ii. Make the indicated changes to the Polynomial Regression Properties panel and run the node.

iii. Go to line 1598 of the results output window.


iv. The polynomial regression node adds additional interaction terms.

Iteration Plot:
The model ran 14 Iterations and the selected model (based on minimum error) occurred in step 7.
v. Examine the Fit Statistics window.

The validation ASE for the polynomial regression model is 0.134038, a slight
improvement over both the transformed-input model (0.138204) and the original model (0.137156).
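The "additional interaction terms" the Polynomial Regression node adds can be enumerated directly: a full second-degree model augments the main effects with every squared term and pairwise product of the interval inputs. The input names below are taken from the model above, but the helper function itself is an illustrative sketch, not SAS Miner's internal term builder.

```python
from itertools import combinations_with_replacement

def second_degree_terms(names):
    """All degree-2 terms (squares and pairwise interactions) that a full
    second-degree polynomial model adds on top of the main effects."""
    return [f"{a}*{b}" for a, b in combinations_with_replacement(names, 2)]

inputs = ["IMP_DemAffl", "IMP_DemAge"]  # two of the model's interval inputs
print(second_degree_terms(inputs))
# → ['IMP_DemAffl*IMP_DemAffl', 'IMP_DemAffl*IMP_DemAge', 'IMP_DemAge*IMP_DemAge']
```

The term count grows quadratically with the number of inputs, which is why the polynomial model can capture curvature and interactions but also raises the risk of overfitting noted below.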

k) In your own words, describe what you did in this assignment and why each of these steps was
necessary. Also, how would you describe the IVs that have an impact on the DV?

The objective of this exercise is to create the best model for predicting the purchase of organic
products. A stepwise logistic regression model was used to analyze the data set, which contains 13
variables with over 22,000 observations. The model has one dependent (target) variable,
TargetBuy; the other variables are the independent variables that help us predict the target.

The first step was to create and run a stepwise regression model. The stepwise method specifies
how the independent variables enter the analysis. Missing values were identified and replaced
through imputation (the mean for interval variables). Out of the nine variables that entered the
model as candidates, only six were significant in predicting the target variable. In interpreting
the model, we look at the ASE and the odds ratio estimates. The Average Squared Error (ASE)
for the validation data is 0.137156. The odds ratio estimates for the independent variables show
that Gender and Affluence Grade are the strongest predictors of purchase. While this might be
considered a good model, it is important to explore the data further and run multiple iterations
to find the best model.
As a second step, the variables were explored to check for skewness, and two variables
(Card Tenure and Affluence Grade) were moderately skewed. To reduce bias in the model and
improve the model fit, a log transformation was applied to reduce the skewness of the data. The
output of the second model shows that the set of independent variables changed: two of the
variables in the previous model were replaced with their transformed LOG counterparts. This
model delivers a validation Average Squared Error (ASE) of 0.138204, slightly higher than the
previous model's. The transformation of the data therefore did not improve the model, so we
needed to create and run more iterations to find the best prediction model.

The third and final step in the process was polynomial regression. Polynomial regression enables
the prediction to better match the true input/target association, but it also increases the chance
of overfitting while reducing the interpretability of the predictions. This model ran 14 iterations,
and the selected model (based on minimum error) occurred in step 7. The final validation ASE of
the polynomial regression equals 0.134038, a slight improvement compared with the ASE of the
model with the transformed inputs.

The independent variables that have an impact on the dependent variable are best described by
analyzing the p-values and the odds ratio estimates. The p-value shows which independent
variables are significant in predicting the dependent variable. The odds ratio estimate shows how
strongly each independent variable raises or lowers the odds of the outcome.

Summary of the Polynomial Regression Model:


P-Value and Odds Ratio Estimate:

Odds Ratio Estimate


P-Value
