PM Mock Test

SVKM’s NMIMS University

Date: Total Marks: 30


Time: 3 Hours
Instructions:
1. All questions are compulsory.
2. The answer to each new question is to be started on a fresh page.
3. Figures in brackets on the right-hand side indicate full marks.
4. Create a new project, with your name inscribed on your answer book.
5. Whenever the best model is asked for, use the appropriate assessment criterion and state its value for the best model; for example, if the criterion is MSE, state the best MSE value in the answer. If the target is categorical, the validation misclassification rate is the assessment criterion; if the target is continuous, use average squared error. (A small sketch of how each criterion is computed appears after this list.)
   - If you are predicting claim occurrence (Target_ClaimsInd, binary): use the validation misclassification rate.
   - If you are predicting claim amount (Target_Claim_Amount, rejected): use average squared error (ASE).
6. Open data source carinsure1 and score file carinsuretest1.
7. Make any assumptions wherever necessary.
8. Follow the instructions carefully; failure to do so may lead to incorrect results.
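
A small illustrative sketch in Python (not SAS Miner code) of how the two assessment criteria are computed; the function names and sample values below are hypothetical.

import numpy as np

def misclassification_rate(y_true, y_pred):
    """Fraction of validation cases whose predicted class is wrong."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def average_squared_error(y_true, y_hat):
    """Mean squared residual (ASE) for a continuous target."""
    y_true, y_hat = np.asarray(y_true, float), np.asarray(y_hat, float)
    return np.mean((y_true - y_hat) ** 2)

# Binary target (Target_ClaimsInd): validation misclassification rate.
print(misclassification_rate([1, 0, 1, 0], [1, 1, 1, 0]))    # 0.25
# Continuous target (Target_Claim_Amount): ASE.
print(average_squared_error([100.0, 250.0], [120.0, 240.0]))  # 250.0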

CarZuma: Car Insurance Claim Case Study


A car insurance company, CarZuma, has collected past data about its customers, covering both the customers' attributes and the attributes of the vehicles they insured. The company has heard a lot about data analytics and is confident that its competitors are already using such techniques to gain a competitive edge, so it must stay focused on its target segment. You are expected to predict which customers will make a claim.

Variable              Role      Measurement Level   Description
Policy Number         ID        Nominal             Unique policy identifier
Year                  Input     Nominal             Year of manufacture
IDV                   Input     Interval            Insured Declared Value of car
City                  Input     Nominal             City of registration of vehicle
State                 Input     Nominal             State of registration
Cubic Capacity        Input     Nominal             Capacity of engine
Mfr_Model             Input     Nominal             Manufacturer and model of car
Premium               Input     Interval            Total premium paid for the policy at the beginning of the term
Type                  Input     Binary              Source of lead
Gender                Input     Binary              M/F
Channel               Input     Binary              Lead generation channel
Age                   Input     Nominal             Age of applicant
Cover Type            Input     Binary              Third Party or Comprehensive
PaymentFrequency      Input     Binary              Annual payment or monthly instalments
Target_ClaimsInd      Target    Binary              Claims Y/N
Target_Claim_Amount   Rejected  Interval            Total claim amount
Create a new project with the name PMMock.

Open data source carinsure1 and score file carinsuretest1 from the sasuser library, with roles and measurement levels as given in the table above. Make sure you open the main file with role Raw and the score file with role Score. Create a new diagram.
Q. 1 Answer the following questions

For this question, add a StatExplore node after the File Import node and run it.

A. For each quantitative (interval) variable, fill in the following table:


Variable | Role | Mean | Standard Deviation | Missing | Minimum | Median | Maximum | Skewness | Kurtosis

B. What is the number of levels for each of the following variables: Age, City, State, Cubic Capacity and Mfr_Model? What is the mode of each of these variables?

C. Which variable has the highest variable worth? Which variable has the lowest variable worth?

D. What is the percentage of the primary target in the dataset? (Give up to four decimal places.)

Level 1 is the primary target and level 2 is the secondary. (A pandas sketch of these summaries appears below.)
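
A rough pandas equivalent of the StatExplore summaries asked for in Q.1; the filename carinsure1.csv and the column names are assumptions based on the variable table above.

import pandas as pd

df = pd.read_csv("carinsure1.csv")

# A: descriptive statistics for the interval inputs
interval_vars = ["IDV", "Premium"]
summary = pd.DataFrame({
    "Mean": df[interval_vars].mean(),
    "Std Dev": df[interval_vars].std(),
    "Missing": df[interval_vars].isna().sum(),
    "Minimum": df[interval_vars].min(),
    "Median": df[interval_vars].median(),
    "Maximum": df[interval_vars].max(),
    "Skewness": df[interval_vars].skew(),
    "Kurtosis": df[interval_vars].kurtosis(),
})
print(summary)

# B: number of levels and mode of the nominal inputs
for col in ["Age", "City", "State", "Cubic Capacity", "Mfr_Model"]:
    print(col, df[col].nunique(), df[col].mode().iloc[0])

# D: percentage of the primary target (level 1), to four decimal places
print(round((df["Target_ClaimsInd"] == 1).mean() * 100, 4))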

Perform the following steps. (A rough sklearn analogue of steps (a) and (b) appears after this list.)

a) Sample -> Data Partition next to File Import -> partition the data: Train = 70%, Validation = 30%.

b) Add a Decision Tree node after partitioning, with a maximum of 2 branches. (Change the assessment measure to misclassification throughout the paper; if the target were continuous, use ASE.) This is because the dependent variable (Target_ClaimsInd) is categorical.

How to see the branches: View -> Model -> Subtree Assessment Plot; the output also reports the misclassification rate.

c) Add a Decision Tree node after partitioning, with a maximum of 3 branches, using the same assessment measure.

How to see the branches: View -> Model -> Subtree Assessment Plot; the output also reports the misclassification rate.

d) Make sure you use the appropriate model assessment method for a binary target. If two trees have the same assessment value, compare their complexity.
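
A minimal sklearn sketch of the stratified 70/30 partition and a misclassification-assessed decision tree; the filename and column names are assumptions, and note that sklearn trees are binary-split only, so the 3-branch tree of step (c) has no direct equivalent here.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("carinsure1.csv")                      # assumed filename
y = df["Target_ClaimsInd"]
X = pd.get_dummies(df.drop(columns=["Policy Number", "Target_ClaimsInd",
                                    "Target_Claim_Amount"])).fillna(0)

# (a) stratified 70/30 train/validation partition
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=1)

# (b) decision tree assessed on validation misclassification
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
misclass = 1 - tree.score(X_valid, y_valid)
print(f"leaves: {tree.get_n_leaves()}, validation misclassification: {misclass:.4f}")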

Q. 2 Answer the following questions

A. How many leaves are in the final model of the 2-branch tree?

2-branch tree -> Results -> View -> Model -> Subtree Assessment Plot -> number of leaves = 5 (the output also shows the misclassification rate).

B. How many leaves are in the final model of the 3-branch tree?

3-branch tree -> Results -> View -> Model -> Subtree Assessment Plot -> misclassification rate.

C. Which of the above trees is the better model?

The two trees have exactly the same misclassification values, but since the 2-branch model has fewer branches, it is the simpler and therefore better model.

D. Do you see any abnormality in the data? Do you need to do any replacements, imputations or transformations before using trees? Why?

Abnormality here means missing data and high skewness in the quantitative/interval inputs (two such variables: IDV and Premium).

None of these steps is needed: trees are non-parametric, handle missing data, and are not affected by skewness or outliers.

In this case we do no imputation; in general, if there are outliers, transform first and then impute.

Perform the following steps. (A rough Python sketch of steps (a)-(c) appears after this list.)

If any imputation is needed, do it after transformation, since outliers need to be handled first. Typical order: Transform -> Replace -> Impute -> parametric models such as neural networks and regressions.

a) Transformation: use the Maximum Normal transformation (Modify -> Transform Variables -> set interval inputs to Maximum Normal). This feeds the answer to Q.3 A.

b) Regression: make sure that in stepwise selection the entry significance level is 1 and the stay significance level is 0.5, and that no more than 20 variables are allowed in the final model. (Model -> Stepwise -> place after the Transform node -> validation misclassification -> Use Selection Defaults: No -> open Selection Options (the three dots) and change the settings according to the question; if a maximum number of variables is specified, change the maximum number of steps at the end.)

c) Polynomial regression of degree 2: use the same stepwise settings (entry probability 1, stay probability 0.5, no more than 20 variables in the final model). (Copy the Regression node -> Polynomial Terms = Yes -> Two-Factor Interactions = Yes.)
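
A hedged Python sketch of these three steps: SAS Miner's Maximum Normal transformation chooses the power transform that makes an input most nearly normal, for which Box-Cox is a close analogue, and stepwise selection is approximated here with forward sequential selection; the filename, column names and the use of logistic regression are all assumptions.

import pandas as pd
from scipy import stats
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

df = pd.read_csv("carinsure1.csv")

# (a) Box-Cox as a stand-in for Maximum Normal on the interval inputs;
# also answers Q.3 A by checking skewness after the transform
for col in ["IDV", "Premium"]:
    transformed, _ = stats.boxcox(df[col].dropna() + 1)  # shift keeps values positive
    print(col, "skewness after transform:", stats.skew(transformed))

# (b) forward selection capped at 20 inputs (stand-in for stepwise)
y = df["Target_ClaimsInd"]
X = pd.get_dummies(df.drop(columns=["Policy Number", "Target_ClaimsInd",
                                    "Target_Claim_Amount"])).fillna(0)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=min(20, X.shape[1] - 1),
    direction="forward", scoring="accuracy")
selector.fit(X, y)
print("selected:", list(X.columns[selector.get_support()]))

# (c) degree-2 polynomial / two-factor interaction terms (first few
# columns only, to keep the demo small)
X_poly = PolynomialFeatures(degree=2).fit_transform(X.iloc[:, :5])
print("polynomial design matrix shape:", X_poly.shape)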

Q. 3 Answer the following questions

A. Which variables have skewness greater than 2 after transformation?

Add a Transform Variables node directly after the Data Partition node, set IDV and Premium to Maximum Normal, then run.
Ans. None; the output shows 0 such variables.

B. How many variables are in the final model of the Regression node?

In the output, go to the end and look for the stepwise selection summary.

7 variables.

C. How many variables are in the final model of the Polynomial Regression node? Are there any interaction terms in the final model? If yes, which terms?

Read the final model from the output selected on misclassification.

D. Which of the above two models is the better model?

Assess -> Model Comparison -> set the selection statistic to misclassification rate -> set the selection table to validation.

Perform the following steps. (Please note that in neural networks convergence needs to be achieved; if a network does not converge in the default number of iterations, you may increase the maximum to 500 iterations to achieve convergence. A rough sklearn sketch of these steps appears after this list.)

MODEL: NEURAL NETWORK (must converge)


a) Insert a Neural Network model after the Regression node. Do not enable preliminary training; set the number of hidden units to 3 and the optimization maximum iterations to 500; set the model selection criterion to misclassification. Verify that the run converged (search the output with Ctrl+F).

b) Insert a Neural Network model after the Regression node. Do not enable preliminary training; set the number of hidden units to 6.

c) Insert an AutoNeural network model after the Regression node with number of hidden units = 1, Tolerance = Low, and only the tanh activation function selected.

d) Insert a Neural Network model after the 2-branch tree node. Do not enable preliminary training; set the number of hidden units to 3.

e) Insert a Neural Network model after the 2-branch tree node. Do not enable preliminary training; set the number of hidden units to 6.

f) Insert an AutoNeural network model after the 2-branch tree node with number of hidden units = 1, Tolerance = Low, and only the tanh activation function selected.
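
A minimal sklearn analogue of the 3- and 6-hidden-unit networks, assuming the X_train/X_valid split from the partition sketch above; tanh mirrors the AutoNeural activation setting and max_iter=500 mirrors raising the iteration cap. Inputs are standardized first, since MLPs converge poorly on unscaled data.

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

for hidden in (3, 6):
    nn = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(hidden,), activation="tanh",
                      max_iter=500, random_state=1))
    nn.fit(X_train, y_train)
    print(hidden, "hidden units, validation misclassification:",
          1 - nn.score(X_valid, y_valid))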
Q. 4 Answer the following questions (8 Marks)

A. Which of the above six models is best? (4 Marks)

B. Would you insert a neural network after a decision tree? Why or why not?
Answer:
Yes, in this case, because there was no missing data and we did not perform any imputations.

In general, no: you typically would not insert a neural network after a decision tree in SAS Miner if any imputations were performed.
Reasoning:
- Decision trees are non-parametric models that segment data into rules and conditions. Once a decision tree is built, it does not provide a continuous transformation of the input variables that a neural network can benefit from.
- Neural networks work well with raw, numeric, or transformed data, but are not typically applied to tree-generated outputs.
- If the decision tree is pruned, it might lose some important patterns, and applying a neural network afterward would not regain the lost information.
- Instead, you could use ensemble techniques like boosting or bagging, or you could try feature engineering and preprocessing before applying a neural network.
C. Would you insert a neural network after the polynomial regression? Why or why not?
Answer: No, you typically would not insert a neural network after polynomial regression.
Reasoning:
- Polynomial regression already models non-linearity by introducing polynomial terms (e.g., x^2, x^3).
- Neural networks are also designed to capture complex, nonlinear relationships, so applying a neural network after polynomial regression is redundant.
- Overfitting risk: polynomial regression may already overfit if the degree is high; adding a neural network might amplify this issue.
- Better approach: instead of chaining them, use either:
  o polynomial regression for simpler problems with clear curve fitting, or
  o neural networks for complex relationships when polynomial regression is insufficient.

Use an Ensemble node and connect (1) the 2-branch tree, (2) the regression, (3) the neural network with 6 hidden units after the decision tree node, (4) the AutoNeural after the regression and (5) the polynomial regression node to this node; set the criterion to voting. (A rough sklearn sketch follows.)
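
A small sklearn sketch of hard voting (the "voting" criterion above); the three component models are stand-ins for the five Miner nodes, and X_train/X_valid are assumed from the partition sketch.

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=1)),
        ("reg", make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000))),
        ("nn6", make_pipeline(StandardScaler(),
                              MLPClassifier(hidden_layer_sizes=(6,),
                                            max_iter=500, random_state=1))),
    ],
    voting="hard")  # each model casts one vote; the majority class wins
ensemble.fit(X_train, y_train)
print("validation misclassification:", 1 - ensemble.score(X_valid, y_valid))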

Use a Model Comparison node and connect all of the above models (including the ensemble) to it, using the appropriate selection statistic for a binary target.

Use a Score node to score the data and find the accuracy on the scored data. Answer the following questions.
Q. 5 Answer the following questions

1. Which is the best model for the binary target? Neural3.

From the Model Comparison node.

2. Without changing the assessment criteria within the existing models, which model would you prefer if you want to maximize cumulative lift on the validation set? Why is this method not appropriate for this data? Neural4.
Lift is used when we are interested in only a top fraction of the population, whereas our objective here is to identify every observation correctly. Lift should be used only when the problem is a marketing or buyer-targeting type of problem.

Imbalanced data issue:

- If claim occurrences (Target_ClaimsInd) are rare (which is often the case in insurance datasets), cumulative lift may be misleading.

- A high lift may come from overfitting to a small group of claimants rather than generalizing well.

Business decision perspective:

- Lift is useful for marketing applications where ranking matters (e.g., who is most likely to respond to an offer).

- In insurance claims prediction, accuracy (misclassification rate) and false positive/false negative control are more important than just ranking.

Existing assessment criteria conflict:

- The dataset already defines the validation misclassification rate as the primary criterion.

- Switching to lift for model selection might lead to a model that ranks well but misclassifies claims, which is risky in an insurance context. (A short sketch of how cumulative lift is computed follows.)
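
A short sketch making cumulative lift concrete: rank cases by predicted claim probability and compare the claim rate in the top depth fraction to the overall rate; the arrays below are hypothetical.

import numpy as np

def cumulative_lift(y, p, depth=0.2):
    """Claim rate in the top depth fraction (by score) over the base rate."""
    order = np.argsort(p)[::-1]                    # highest scores first
    top = np.asarray(y)[order][: int(len(y) * depth)]
    return top.mean() / np.mean(y)

y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]                # 20% base claim rate
p = [0.9, 0.8, 0.1, 0.2, 0.7, 0.1, 0.3, 0.2, 0.1, 0.05]
print(cumulative_lift(y, p))  # 2.5: top 20% has 2.5x the base claim rate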

3. Give the classification table for the score data. What is the accuracy of the best model on the score data?

Drag a Score node and connect it to the score dataset -> change the role to Score in the input data -> run it -> open the exported data -> copy it to Excel and build a pivot table from it. (A pandas alternative is sketched below.)
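
A pandas sketch of the classification table without the Excel pivot; the exported filename is an assumption, and I_Target_ClaimsInd follows the usual SAS Miner naming for the predicted-class column.

import pandas as pd

scored = pd.read_csv("scored.csv")  # assumed export of the Score node
table = pd.crosstab(scored["Target_ClaimsInd"], scored["I_Target_ClaimsInd"],
                    rownames=["Actual"], colnames=["Predicted"])
print(table)

accuracy = (scored["Target_ClaimsInd"] == scored["I_Target_ClaimsInd"]).mean()
print(f"Accuracy on score data: {accuracy:.4f}")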
4. If you create a profit/cost matrix where the cost of not identifying a claim is 8000, which is the best model? What is the total expected cost for the score data, assuming a cutoff of 0.5? Neural2.

For this part, do not make any changes to the original file; copy it and make the changes in the copy.

Import data -> select the role as Score -> then bring a Score node from Assess -> correct the paths of both inputs.

Connect File Import -> Decisions -> Data Partition.

Decisions node -> Apply Decisions: Yes -> Custom Editor (three dots) -> Build -> Decisions -> Decision Weights -> minimize costs (or maximize profits).

Change the assessment criterion of all decision trees to average squared error.

For regression and polynomial regression, use validation profit/loss.

For all neural networks, use profit/loss.

Model comparison:

Ensemble: voting. (A sketch of the cost calculation follows.)
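
A sketch of the Q.5.4 cost calculation: with a cost of 8000 for each claim the model fails to identify (a false negative) and a 0.5 cutoff, the total expected cost is 8000 times the number of missed claimants; y and p below are assumed arrays of actual outcomes and predicted claim probabilities.

import numpy as np

COST_MISSED_CLAIM = 8000.0

def total_expected_cost(y, p, cutoff=0.5):
    """Cost of false negatives at the given probability cutoff."""
    pred = (np.asarray(p) >= cutoff).astype(int)
    false_negatives = np.sum((np.asarray(y) == 1) & (pred == 0))
    return false_negatives * COST_MISSED_CLAIM

y = [1, 0, 1, 0, 0]
p = [0.9, 0.2, 0.3, 0.6, 0.1]
print(total_expected_cost(y, p))  # 8000.0: one claimant scored below 0.5

Note that if the cost matrix pushes a model to classify everyone as a claimant, false negatives (and hence this cost) drop to zero, which is exactly the pathology noted in the next answer.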
5. Based on the above question, can you conclude that the model can be deployed?

No. The cost optimization drives the model to classify everyone as a claimant: the last column of the classification table contains no cases classified as non-claimants, and hence there is no false-negative cost. The model should not be deployed.

Best of luck!

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

You might also like