
Big Data and Predictive Analysis

Assignment 4 (Lab 2 Part 2)


Predictive Modeling Using Regression-SAS Miner

REGRESSION EXERCISE
1. Predictive Modeling Using Regression
a. Return to the Chapter 3 Organics diagram in My Project. Use the StatExplore tool on the
ORGANICS data source.
1) A StatExplore node is connected to the ORGANICS node.

2) The StatExplore node results are generated.

After running the StatExplore node, the results below show that there are missing values in the
selected variables.

b. In order to prepare for regression, missing values are imputed. Why do you think we should
impute?

Imputation creates a synthetic value for each missing value. For interval variables, missing
values are replaced with the mean of the non-missing values in the dataset; for class
variables, they are assigned a distinct level. This matters because regression discards any
case with a missing input, so imputation keeps every observation usable. Therefore,
imputation is done before building the model to avoid bias and loss of data.
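The Impute node's behavior can be sketched in a few lines of pandas. This is an illustrative stand-in, not the actual SAS code: the column names and toy values are assumptions, but the logic (overall mean for interval inputs, a "U" level for class inputs, plus 0/1 imputation indicators) mirrors the node settings used later in step d.

```python
import pandas as pd
import numpy as np

# Toy data standing in for two ORGANICS inputs (names/values are illustrative)
df = pd.DataFrame({
    "DemAffl": [10.0, np.nan, 8.0, 12.0],  # interval input
    "DemGender": ["F", "M", np.nan, "F"],  # class input
})

# Missing-value indicators (the Impute node's "imputation indicators")
df["M_DemAffl"] = df["DemAffl"].isna().astype(int)
df["M_DemGender"] = df["DemGender"].isna().astype(int)

# Interval variables: replace missing values with the overall mean
df["IMP_DemAffl"] = df["DemAffl"].fillna(df["DemAffl"].mean())

# Class variables: replace missing values with the level "U" (unknown)
df["IMP_DemGender"] = df["DemGender"].fillna("U")

print(df[["IMP_DemAffl", "M_DemAffl", "IMP_DemGender", "M_DemGender"]])
```

The indicator columns matter because *whether* a value was missing can itself be predictive, which is exactly why M_DemAge and M_DemGender later appear in the final model.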
c. What changed after imputing?

The missing values were replaced with the mean of each variable. The result
below shows there is no more missing data.
SAS Diagram:

d. Add an Impute node from the Modify tab into the diagram and connect it to the Data Partition
node. Set the node to impute U for unknown class variable values and the overall mean for
unknown interval variable values. Create imputation indicators for all imputed inputs.
e. Add a Regression node to the diagram and connect it to the Impute node.

f. Choose stepwise as the selection model and the validation error as the selection criterion.
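The stepwise selection just configured can be sketched as a toy loop. This is a simplified, forward-only sketch: the variable names and the `fake_ase` scores are invented, and real stepwise selection in SAS Miner also reconsiders removing previously entered effects at each step, which this omits. The `score` function stands in for "fit on training data with these inputs, then measure ASE on the validation partition".

```python
def stepwise_select(candidates, score):
    """Greedily add the input that most lowers validation error; stop
    when no remaining input improves it."""
    selected, best_err = [], score([])
    improved = True
    while improved and candidates:
        improved = False
        # Try adding each remaining candidate; keep the best improvement
        trials = {v: score(selected + [v]) for v in candidates}
        var, err = min(trials.items(), key=lambda kv: kv[1])
        if err < best_err:
            selected.append(var)
            candidates.remove(var)
            best_err = err
            improved = True
    return selected, best_err

# Invented validation-ASE values: DemAffl and DemAge help, PromTime hurts
fake_ase = {(): 0.25, ("DemAffl",): 0.18, ("DemAffl", "DemAge"): 0.14,
            ("DemAffl", "DemAge", "PromTime"): 0.15}
score = lambda s: fake_ase.get(tuple(sorted(s)), 0.20)

print(stepwise_select(["DemAffl", "DemAge", "PromTime"], score))
# → (['DemAffl', 'DemAge'], 0.14)
```

Using validation error (rather than training error) as the criterion is what stops the loop before it adds inputs that only fit noise.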
g. Run the Regression node and view the results. Maximize the Effect Plot.

Iteration Plot: The selected model (based on minimum error) occurred in step 6.

h. Which variables are included in the final model? Which variables are important in this model?
What is the validation ASE?

Variables in the final model: IMP_DemAffl, IMP_DemAge, IMP_DemGender,
M_DemAffl, M_DemAge, M_DemGender.
All of these variables are important: the stepwise selection retained only significant effects.
The validation ASE is 0.137156.
Final model output:

Average Squared Error (ASE):


i) Go to line 664 in the Output window.

j) The odds ratios indicate the effect that each input has on the logit score.

The odds ratio estimates help you to interpret the model. Below are the odds ratio estimates for
the variables in the model.
k) Interpret the odds ratio estimates:

• The odds ratio estimate for IMP_DemAffl is 1.283. Each one-unit increase in affluence
grade multiplies the odds of purchase by 1.283, a 28.3% increase in the odds.
• The odds ratio estimate for IMP_DemAge is 0.947. Each additional year of age multiplies
the odds of purchase by 0.947, roughly a 5.3% decrease in the odds.
• For IMP_DemGender (F vs U), the odds ratio estimate is 6.967. The odds of purchase
for female cases are 6.967 times the odds for cases with the imputed Unknown (U)
gender level.
• For IMP_DemGender (M vs U), the odds ratio estimate is 2.899. The odds of purchase
for male cases are 2.899 times the odds for cases with the imputed Unknown (U)
gender level.
• For M_DemAffl (0 vs 1), the odds ratio estimate is 0.708. Cases whose affluence grade
was observed (indicator 0) have 0.708 times the odds of purchase of cases whose value
was imputed (indicator 1), about 29.2% lower odds.
• For M_DemAge (0 vs 1), the odds ratio estimate is 0.796. Cases whose age was observed
have 0.796 times the odds of purchase of cases whose age was imputed, about 20.4%
lower odds.
• For M_DemGender (0 vs 1), the odds ratio estimate is 0.685. Cases whose gender was
observed have 0.685 times the odds of purchase of cases whose gender was imputed,
about 31.5% lower odds.
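The link between the regression output and these odds ratios is simply exponentiation: for a logistic model, the odds ratio of an effect is exp(coefficient). The coefficients below are back-calculated from the reported odds ratios purely for illustration; the actual SAS parameter estimates may differ in later decimal places.

```python
import math

# Illustrative logit coefficients chosen so exp(beta) reproduces the
# odds ratios reported above (not copied from the SAS output)
coefs = {
    "IMP_DemAffl": 0.2492,   # exp(0.2492) ≈ 1.283
    "IMP_DemAge": -0.0545,   # exp(-0.0545) ≈ 0.947
}

for name, beta in coefs.items():
    odds_ratio = math.exp(beta)
    pct_change = (odds_ratio - 1) * 100
    print(f"{name}: OR = {odds_ratio:.3f} "
          f"({pct_change:+.1f}% change in odds per unit increase)")
```

A positive coefficient gives an odds ratio above 1 (odds rise with the input); a negative coefficient gives an odds ratio below 1 (odds fall), which is exactly the pattern seen for DemAffl versus DemAge.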

l) The validation ASE is given in the Fit Statistics window.


The validation ASE is 0.137156
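The fit statistic itself is straightforward to compute by hand: ASE is the mean squared difference between the 0/1 target and the model's predicted probability, evaluated here on the validation partition. The targets and probabilities below are invented toy numbers, not values from the ORGANICS data.

```python
def average_squared_error(actual, predicted):
    """ASE: mean of squared (target - predicted probability)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

y_valid = [1, 0, 0, 1]          # illustrative validation targets
p_valid = [0.8, 0.3, 0.1, 0.6]  # illustrative predicted probabilities

print(average_squared_error(y_valid, p_valid))  # → 0.075
```

Because it is computed on held-out data, a lower validation ASE means better generalization, which is why it is used both to pick the stepwise step and to compare the three models in this assignment.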
PART 2

a. In preparation for regression, are any transformations of the data warranted? Why or why not?
Regression models are sensitive to extreme or outlying values in the input space. Highly
skewed inputs can dominate the fit, or be selected over inputs that actually yield better
predictions, which defeats the goal of the analysis. Therefore, a log transformation is
applied to reduce skewness in the data and produce a better model.
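The effect of the log transform on skewness can be demonstrated on synthetic data. The lognormal sample below is an assumption standing in for a right-skewed input such as Affluence Grade; `log(x + 1)` is used so zero values stay defined (SAS Miner's Log method similarly adds an offset when values are at or below zero).

```python
import math
import random

random.seed(0)
# Right-skewed synthetic sample standing in for a skewed input
x = [random.lognormvariate(0, 1) for _ in range(1000)]

def skewness(data):
    """Sample skewness: third standardized moment."""
    n = len(data)
    m = sum(data) / n
    s = math.sqrt(sum((v - m) ** 2 for v in data) / n)
    return sum(((v - m) / s) ** 3 for v in data) / n

# Log transform; +1 keeps zeros in the input's domain
x_log = [math.log(v + 1) for v in x]

print(f"skewness before: {skewness(x):.2f}")
print(f"skewness after:  {skewness(x_log):.2f}")
```

The transformed distribution is far more symmetric, which is the same "nicely symmetric" shape observed in the Explore window after step g.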

i. Open the Variables window of the Regression node. Select the imputed interval inputs.

ii. Select Explore. The Explore window appears.


b. Both Card Tenure and Affluence Grade have moderately skewed distributions. Applying a log
transformation to these inputs might improve the model fit.

Card Tenure:
Affluence Grade:

c. Disconnect the Impute node from the Data Partition node.


d. Add a Transform Variables node from the Modify tab to the diagram and connect it to the
Data Partition node.
e. Connect the Transform Variables node to the Impute node.
f. Apply a log transformation to the DemAffl and PromTime inputs.
i. Open the Variables window of the Transform Variables node.
ii. Select Method → Log for the DemAffl and PromTime inputs. Select OK to close the Variables
window.

g. Run the Transform Variables node. Explore the exported training data. Did the transformations result
in less skewed distributions?
i. The easiest way to explore the created inputs is to open the Variables window in the subsequent
Impute node. Make sure that you update the Impute node before opening its Variables window.

ii. With the LOG_DemAffl and LOG_PromTime inputs selected, select Explore.

LOG_PromTime (Card Tenure):


LOG_DemAffl:

The distributions are nicely symmetric.

h. Rerun the Regression node. Do the selected variables change? How about the validation ASE?

The selected variables changed: two were replaced with the transformed LOG variable and
its imputation indicator. However, the number of variables did not change.
The new variables in the model are IMP_DemAge, IMP_DemGender, IMP_LOG_DemAffl,
M_DemAge, M_DemGender, and M_LOG_DemAffl.
Model Iteration Plot:
The selected model (based on minimum error) occurred in step 6.

The validation ASE is 0.138204. The average squared error for this model is
slightly higher than that of the model with untransformed inputs (ASE 0.137156).
Fit Statistic:
i. Go to line 664 of the Output window.
Below are the independent variables in line 664.

i. Apparently the log transformation actually increased the validation ASE slightly.
j. Create a full second-degree polynomial model. How does the validation average squared error for the
polynomial model compare to the original model?
i. Add another Regression node to the diagram and rename it Polynomial Regression.
ii. Make the indicated changes to the Polynomial Regression Properties panel and run the node.

iii. Go to line 1598 of the results output window.


iv. The polynomial regression node adds additional interaction terms.

Iteration Plot:
The model ran 14 Iterations and the selected model (based on minimum error) occurred in step 7.
v. Examine the Fit Statistics window.

The validation ASE for the polynomial regression model is 0.134038, a slight
improvement over both the transformed-input model (0.138204) and the original model (0.137156).
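The "additional interaction terms" the Polynomial Regression node adds can be enumerated directly: a full second-degree model augments the main effects with every squared term and pairwise product of the interval inputs. The input names below are taken from the model above, but the helper function itself is an illustrative sketch, not SAS Miner's internal term builder.

```python
from itertools import combinations_with_replacement

def second_degree_terms(names):
    """All degree-2 terms (squares and pairwise interactions) that a full
    second-degree polynomial model adds on top of the main effects."""
    return [f"{a}*{b}" for a, b in combinations_with_replacement(names, 2)]

inputs = ["IMP_DemAffl", "IMP_DemAge"]  # two of the model's interval inputs
print(second_degree_terms(inputs))
# → ['IMP_DemAffl*IMP_DemAffl', 'IMP_DemAffl*IMP_DemAge', 'IMP_DemAge*IMP_DemAge']
```

The term count grows quadratically with the number of inputs, which is why the polynomial model can capture curvature and interactions but also raises the risk of overfitting noted below.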

k) In your own words, describe what you did in this assignment and why each of these steps was
necessary. Also, how would you describe the IVs that have an impact on the DV?

The objective of this exercise is to create the best model for predicting the purchase of organic
products. A stepwise logistic regression model was used to analyze the data set, which contains 13
variables with over 22,000 observations. The model has one dependent (target) variable,
TargetBuy; the other variables are the independent variables that help us predict the target.

The first step was to create and run a stepwise regression model. The stepwise method specifies
how the independent variables enter the analysis. Missing values were identified and replaced
through imputation (the mean for interval variables). Out of the nine variables that entered the
model as candidates, only six were significant in predicting the target variable. In interpreting
the model, we look at the ASE and the odds ratio estimates. The Average Squared Error (ASE)
for the validation data is 0.137156. The odds ratio estimates for the independent variables show
that Gender and Affluence Grade are the strongest predictors of purchase. While this might be
considered a good model, it is important to explore the data further and run multiple iterations
to find the best model.
As a second step, the variables were explored to check for skewness, and two variables
(Card Tenure and Affluence Grade) were moderately skewed. To reduce bias in the model and
improve the model fit, a log transformation was applied to reduce the skewness of the data. The
output of the second model shows that the set of independent variables changed: two of the
variables in the previous model were replaced with their transformed LOG counterparts. This
model delivers a validation Average Squared Error (ASE) of 0.138204, slightly higher than the
previous model's. The transformation of the data therefore did not improve the model, so we
needed to create and run more iterations to find the best prediction model.

The third and final step in the process was polynomial regression. Polynomial regression enables
the prediction to better match the true input/target association, but it also increases the chance
of overfitting while reducing the interpretability of the predictions. This model ran 14 iterations,
and the selected model (based on minimum error) occurred in step 7. The final validation ASE of
the polynomial regression equals 0.134038, a slight improvement compared with the ASE of the
model with the transformed inputs.

The independent variables that have an impact on the dependent variable are best described by
analyzing the p-values and the odds ratio estimates. The p-value shows which independent
variables are significant in predicting the dependent variable. The odds ratio estimate shows how
strongly each independent variable raises or lowers the odds of the outcome.

Summary of the Polynomial Regression Model:


P-Value and Odds Ratio Estimate:

Odds Ratio Estimate


P-Value
