Final Exam For SAS Enterprise Miner


Big Data Analysis Final Project

Student’s name Student ID

Tergelsaran Ser-Od 110034066

Nomin Erdene-Ochir 110034067

Erdenebolor Erdenetsogt 110034068

Munkhsaruul Altantulkhuur 110035109

Lesson 3 Decision Tree Predictive Model


1. Initial data exploration
a. Create a new diagram named Organics.

b. 1) Set the model roles for the analysis variables as shown above.
2) The proportion of individuals who purchased organic products was about 25% of the
sample.
3) The variable DemClusterGroup contains collapsed levels of the variable
DemCluster. Presume that, based on previous experience, you believe that
DemClusterGroup is sufficient for this type of modeling effort. Set the model role for
DemCluster to Rejected.

4) As noted above, only TargetBuy can be used for this analysis, and it should have a role
of Target.
TargetAmt could in principle show whether someone purchased organic food, since anyone
who spent money on it made a purchase, but that makes it redundant with TargetBuy.
We do not care how much they spent; we only want to know whether they bought
anything, so that we can decide whom to market to in order to elicit further
purchases. TargetAmt might be used indirectly to achieve the same purpose.
5) Finish the Organics data source definition.
c. Add the AAEM.ORGANICS data source to the Organics diagram workspace.

d. Add a Data Partition node to the diagram and connect it to the Data Source node.
Assign 50% of the data for training and 50% for validation.
e. Add a Decision Tree node to the diagram and connect it to the Data Partition node.

f. Create a decision tree model autonomously. Use average square error as
the model assessment statistic.
1) There are 4 leaves in an optimal tree.

2) The replacement variable for age was used to make the first split.
The first split sends cases with age less than 44.5 to one branch and cases with age
greater than or equal to 44.5 to the other.
g. Add a second Decision Tree node to the diagram and connect it to the Data Partition
node.

1) In the Properties panel of the new Decision Tree node, change the maximum
number of branches from a node to 3 to enable three-way splits.
2) Create a decision tree model. Use average square error as the
model assessment statistic.

3) There are 6 leaves in the maximal tree.


h. Based on average square error, which of the decision tree models appears to be
better?

The average squared error for the maximal (three-way split) tree is 0.1413, compared
with 0.1508 for the first tree. A lower average squared error means the predicted
values are closer to the actual values, so the three-way tree appears to be the
better model, although the difference between the two is small.

Only 25% of the sample population we modeled purchased organic food. People who
buy organic food are often younger than 44.5 years old, have an affluence grade
greater than 9.5, and are predominantly female. This suggests that if we want to attract
purchasers for our organic goods, we should target moderately affluent women under the
age of 45. We would sell and market our organic food in parts of the city with a
younger population, and we would place advertisements in locations frequented by younger
women, such as shopping centers and internet sites popular with women.
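The average squared error used to compare the two trees in step h can be sketched outside Enterprise Miner as follows. This is a minimal illustration, assuming ASE is the mean of the squared differences between the 0/1 outcome and the predicted probability. The cases and the leaf probabilities for `tree_a` and `tree_b` are hypothetical and are not taken from the Organics results.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between 0/1 outcomes and predicted probabilities."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one case is required")
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


# Hypothetical validation cases: 1 = bought organics, 0 = did not.
actual = [1, 0, 0, 1, 0, 0, 0, 1]

# Hypothetical predicted probabilities from two competing trees.
tree_a = [0.6, 0.2, 0.2, 0.6, 0.2, 0.6, 0.2, 0.6]
tree_b = [0.8, 0.1, 0.1, 0.5, 0.1, 0.5, 0.1, 0.8]

ase_a = average_squared_error(actual, tree_a)
ase_b = average_squared_error(actual, tree_b)

# The model with the lower ASE is preferred.
better = "tree_b" if ase_b < ase_a else "tree_a"
print(f"tree_a ASE = {ase_a:.4f}, tree_b ASE = {ase_b:.4f}, better: {better}")
```

The same comparison applies to the reported values: 0.1413 is lower than 0.1508, so the three-way tree is preferred on this statistic.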
2. Use the assigned data set for the final project in this course (Virtual Exchange
students, please pick any data set you prefer). Repeat the procedure and answer
the questions, except in step b. Moreover, in step b define the data set as
AAEM.(assigned data set's name).
Lesson 4. Regression Predictive Model
Predictive Modeling Using Regression
a. Return to the lesson 3 Organics diagram. Attach the StatExplore tool to the ORGANICS
data source and run it.

b. In preparation for regression, is any missing value imputation needed? If yes, should
you do this imputation before generating the decision tree models? Why or why not?
Yes. Imputation is necessary for regression: cases with missing input values are
otherwise left out of the fit, which can bias the model. Imputation replaces the
missing values so that those cases can be used.

c. Add an Impute node to the diagram and connect it to the Data Partition node. Set the
node to impute U for Unknown class variable values and the overall mean for unknown
interval variable values. Create imputation indicators for all imputed inputs.

Type U Indicator-Role-Input

d. Add a Regression node to the diagram and connect it to the Impute node.
e. Choose Stepwise as the selection model and Validation Error as the selection criterion.

f. Run the Regression node and view the results. Which variables are included in the final
model? Which variables are important in this model? What is the validation ASE (average
squared error)?
Result

g. In preparation for regression, are any transformations of the data warranted?

Yes. Regression models can be affected by extreme values, so inputs with highly
skewed distributions should be transformed to improve the model.

h. Disconnect the Impute node from the Data Partition node. -Done
i. Add a Transform Variables node to the diagram and connect it to the Data Partition
node. -Done
j. Connect the Transform Variables node to the Impute node.

k. Apply a log transformation to the DemAffl and PromTime inputs.


l. Run the Transform Variables node. Explore the exported training data. Did the
transformations result in less skewed distributions?
-Yes. The transformations produce less skewed distributions.

m. Rerun the Regression node. Do the selected variables change? How about the
validation ASE?
-Validation ASE changed from 0.1371 to 0.1382
n. Create a full second-degree polynomial model. How does the validation average
squared error for the polynomial model compare to the original regression model?
-The additional terms reduce the validation ASE slightly (0.134038 for the polynomial
model versus 0.137156 for the original regression model).
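The data preparation used in this lesson (the imputation settings in step c and the log transformation in step k) can be sketched in plain Python. This is an illustration under stated assumptions, not what the Impute and Transform Variables nodes do internally: the `affluence` and `gender` values are hypothetical, missing values are represented as `None`, and the `log(x + 1)` offset is an assumption made here so that zero values stay defined.

```python
import math


def impute_interval(values):
    """Replace None with the mean of the observed values; also return 0/1 indicators."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    indicators = [1 if v is None else 0 for v in values]
    return imputed, indicators


def impute_class(values, fill="U"):
    """Replace missing class levels with a constant level; also return 0/1 indicators."""
    imputed = [fill if v is None else v for v in values]
    indicators = [1 if v is None else 0 for v in values]
    return imputed, indicators


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed interval input and class input with missing entries.
affluence = [2, 3, 3, 4, None, 5, 6, 8, None, 30]
gender = ["F", "M", None, "F", "F", "M", None, "F", "M", "F"]

# Transform first (missing values stay missing), then impute, as in steps h to j.
log_affluence = [None if v is None else math.log(v + 1) for v in affluence]
log_affluence_imputed, affluence_flag = impute_interval(log_affluence)
gender_imputed, gender_flag = impute_class(gender)

# Compare skewness on the observed values only.
raw_observed = [v for v in affluence if v is not None]
log_observed = [v for v in log_affluence if v is not None]
print("skewness before log:", round(skewness(raw_observed), 3))
print("skewness after log: ", round(skewness(log_observed), 3))
print("imputed classes:", gender_imputed)
```

The indicator lists play the role of the imputation indicators requested in step c: they let a model distinguish imputed values from observed ones.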
2. Use the assigned data set for the final project in this course (Virtual Exchange students,
please pick any data set you prefer). Repeat the procedure and answer the questions above.

Lesson 5. Neural Networks Predictive Model


a. In preparation for a neural network model, is imputation of missing values
needed? -Yes
Why or why not? -We need a complete record for both modeling and scoring.
b. In preparation for a neural network model, is data transformation generally
needed? -No
Why or why not? -Neural network models can create transformations of inputs for
use in a regression-like model.
c. Add a Neural Networks tool to the Organics diagram. Connect the Impute node to the
Neural Networks node. -
d. Set the model selection criterion to Average Error. -
e. Run the Neural Networks node and examine the validation average squared
error. How does it compare to other models? -The validation ASE is lower on the
neural network model (0.133209) than in the regression model (0.137156) and
the full second-degree polynomial model (0.134038).
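The answer to question b, that a neural network creates its own transformations of the inputs for use in a regression-like model, can be illustrated with a small forward pass. This is a sketch assuming one hidden layer of tanh units feeding a logistic output; the weights, biases, and input values are hypothetical and are not the ones estimated by the Neural Network node.

```python
import math


def logistic(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer of tanh units feeding a logistic output.

    Each hidden unit is a nonlinear transformation of the inputs; the output
    layer is a logistic regression on those hidden units.
    """
    hidden = [
        math.tanh(bias + sum(w * x for w, x in zip(weights, inputs)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    z = output_bias + sum(w * h for w, h in zip(output_weights, hidden))
    return hidden, logistic(z)


# Hypothetical standardized inputs, for example age and affluence.
inputs = [-0.8, 1.2]

# Hypothetical weights for three hidden units (one row per unit).
hidden_weights = [[1.5, -0.4], [-0.7, 0.9], [0.2, 0.3]]
hidden_biases = [0.1, -0.2, 0.0]
output_weights = [-1.1, 0.8, 0.5]
output_bias = -0.9

hidden, probability = forward(
    inputs, hidden_weights, hidden_biases, output_weights, output_bias
)
print("hidden unit values:", [round(h, 4) for h in hidden])
print("predicted probability:", round(probability, 4))
```

Because the hidden units already bend the inputs nonlinearly, a separate transformation step is generally not required, whereas missing values still have to be imputed before the weighted sums can be computed.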

Lesson 6. Model Assessment


1. Assessing Models
a. Add a Model Comparison tool and connect all models in the ORGANICS diagram
to Model Comparison node.

b. Run the Model Comparison node and view the results.

Which model was selected? Based on what criteria?


-The Decision Tree model was selected, based on the validation misclassification rate
(Valid: Misclassification Rate).
Which model has the best ROC curve?
-Decision tree 2 has the best ROC curve, based on the validation data

c. Open the exported data from the Model Comparison node. Explore the RANK
data set. What is the number of event cases for each model at a selection depth
of 5%?

Decision Tree 2: 469.0044
Decision Tree: 288.2516
Polynomial: 476
Regression: 479.4
Neural Network: 477
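The idea behind counting events at a selection depth of 5% can be sketched as follows: rank the cases by predicted probability, keep the top 5%, and count how many of them are events. This is a simplified illustration with hypothetical data and hypothetical model scores. It does no special handling of tied scores, so it always returns whole numbers, unlike some of the values in the RANK data set above.

```python
def events_at_depth(actual, scores, depth=0.05):
    """Count events among the top `depth` fraction of cases ranked by score."""
    if len(actual) != len(scores):
        raise ValueError("actual and scores must have the same length")
    # Highest predicted probability first.
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, round(depth * len(ranked)))
    return sum(outcome for _, outcome in ranked[:n_selected])


# Hypothetical validation set of 40 cases: 1 = event, 0 = non-event.
actual = [1 if (i % 3 == 0 or i >= 38) else 0 for i in range(40)]

# Hypothetical scores from two models; model A ranks the last cases highest,
# model B ranks the first cases highest.
model_a_scores = [i / 40 for i in range(40)]
model_b_scores = [(39 - i) / 40 for i in range(40)]

events_a = events_at_depth(actual, model_a_scores)
events_b = events_at_depth(actual, model_b_scores)
print("events in top 5%, model A:", events_a)
print("events in top 5%, model B:", events_b)
```

A model that places more events in the top 5% of its ranking captures more buyers for the same number of contacts, which is why this count is a useful comparison alongside misclassification rate and the ROC curve.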
