Project - Data Mining: Bank - Marketing - Part1 - Data - CSV
Dear Participants,
1. Answer Report: Submit the answers to all the questions in sequential order. It should include a detailed explanation of the approach used, insights, inferences, and all code outputs such as graphs, tables, etc. Your report should not be filled with code; you will be evaluated on the business report.
2. Jupyter Notebook file: This is mandatory and will be used for reference during evaluation.
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months. You are given the task to identify the segments based on credit card usage.
1.1 Read the data, perform the necessary initial steps, and carry out exploratory data analysis (univariate, bivariate, and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify your answer.
1.3 Apply hierarchical clustering to the scaled data. Identify the optimum number of clusters using a dendrogram and briefly describe them.
1.4 Apply K-Means clustering on the scaled data and determine the optimum number of clusters using the elbow curve. Explain the results properly, and interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies
for different clusters.
Dataset for Problem 1: bank_marketing_part1_Data.csv
Data Dictionary for Market Segmentation:
1. spending: Amount spent by the customer per month (in 1000s)
2. advance_payments: Amount paid by the customer in advance by cash (in 100s)
3. probability_of_full_payment: Probability of payment done in full by the customer to the
bank
4. current_balance: Balance amount left in the account to make purchases (in 1000s)
5. credit_limit: Limit of the amount on the credit card (in 10000s)
6. min_payment_amt: Minimum amount paid by the customer while making payments for purchases made monthly (in 100s)
7. max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)
Problem 2: CART-RF-ANN
An insurance firm providing tour insurance is facing a higher claim frequency. The management decides to collect data from the past few years. You are assigned the task of building a model that predicts the claim status and of providing recommendations to the management. Use CART and RF and compare the models' performance on the train and test sets.
2.1 Read the data, perform the necessary initial steps, and carry out exploratory data analysis (univariate, bivariate, and multivariate analysis).
2.2 Data Split: Split the data into train and test sets, and build classification models using CART and Random Forest.
2.3 Performance Metrics: Check and comment on the performance of the predictions on the train and test sets using accuracy, the confusion matrix, and classification reports for each model.
2.4 Final Model: Compare all the models and write an inference about which model is
best/optimized.
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?
Dataset for Problem 2: insurance_part2_data-1.csv
Attribute Information:
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency_Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration in days)
7. Destination of the tour (Destination)
8. Amount of sales per customer in procuring tour insurance policies, in rupees (in 100s)
9. Commission received by the tour insurance firm (Commission, as a percentage of sales)
10. Age of the insured (Age)
Important Note: Please reflect on all that you have learned while working on this project.
This step is critical in cementing all your concepts and closing the loop. Please write down
your thoughts here.
All the very best!
Regards,
Program Office
Criteria and Points
1.1 Read the data and do exploratory data analysis (3 pts). Describe the data briefly and interpret the inferences for each (3 pts). (Total: 6 points)
Initial steps like head(), .info(), data types, etc., and a null value check. Distribution plots (histograms) or similar plots for the continuous columns; box plots; correlation plots; appropriate plots for categorical variables; inferences on each plot. Summary stats, skewness, and the proportion of outliers should be discussed, and inferences from the plots above should be included. There is no restriction on how the learner wishes to implement this, but the code should produce the correct output and the inferences should be logical and correct.
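For reference, a minimal EDA sketch along these lines might look as follows, assuming the file name from the problem statement and the column names from the data dictionary; the plot choices are illustrative, not prescriptive.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Initial steps: load the data, inspect structure, check for nulls
df = pd.read_csv("bank_marketing_part1_Data.csv")
print(df.head())
print(df.info())
print(df.isnull().sum())
print(df.describe().T)              # summary statistics
print(df.skew(numeric_only=True))   # skewness of each column

# Univariate analysis: distribution and box plots for each continuous column
for col in df.select_dtypes("number").columns:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=axes[0])
    sns.boxplot(x=df[col], ax=axes[1])
    fig.suptitle(col)
    plt.show()

# Bivariate / multivariate analysis: correlation heatmap and pair plot
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
sns.pairplot(df)
plt.show()
```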
1.2 Do you think scaling is necessary for clustering in this case? Justify. (Total: 2 points)
The learner is expected to check and comment on the difference in scale of the different features on the basis of an appropriate measure, for example standard deviation, variance, etc. They should justify whether there is a necessity for scaling, state which method they are using to do the scaling, and may also comment on how that method works.
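A small sketch of the scale check and standardization meant here, continuing from the data frame loaded above and using sklearn's StandardScaler as one possible method (z-score or min-max scaling would be equally acceptable if justified):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Features sit on very different scales (1000s vs 100s), visible in the std devs
print(df.std())

# Standardize so every feature contributes comparably to the distance metric
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.std())  # roughly 1 for every column after scaling
```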
1.3 Apply hierarchical clustering to scaled data (3 pts). Identify the optimum number of clusters using a dendrogram and briefly describe them (4 pts). (Total: 7 points)
Students are expected to apply hierarchical clustering; it can be obtained via fcluster or AgglomerativeClustering. The report should discuss the criterion, affinity, and linkage used. The report must contain a dendrogram, a logical reason behind choosing the optimum number of clusters, and inferences on the dendrogram. Customer segmentation can be visualized using limited features or the whole data, but it should be clear, correct, and logical. Use appropriate plots to visualize the clusters.
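One possible sketch of this step, continuing from the scaled data frame above and using scipy for the dendrogram; Ward linkage with Euclidean distance and a cut at 3 clusters are assumptions here, to be replaced by whatever the dendrogram actually supports:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import AgglomerativeClustering

# Build the linkage matrix on the scaled data and draw the dendrogram
Z = linkage(scaled_df, method="ward", metric="euclidean")
plt.figure(figsize=(12, 5))
dendrogram(Z)
plt.title("Dendrogram (Ward linkage, Euclidean distance)")
plt.show()

# Option 1: cut the tree at the chosen number of clusters with fcluster
df["h_cluster"] = fcluster(Z, t=3, criterion="maxclust")  # 3 clusters is illustrative

# Option 2: the same idea via AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
df["h_cluster_sklearn"] = agg.fit_predict(scaled_df)
```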
1.4 Apply K-Means clustering on scaled data and determine the optimum clusters (2 pts). Apply the elbow curve (3 pts). Interpret the inferences from the model (2.5 pts). (Total: 7 points)
Apply K-Means clustering code with different numbers of clusters and calculate the WSS (inertia) for each value of k. The elbow method must be applied and visualized with different values of k, and the reasoning behind the selection of the optimal value of k must be explained properly. The report must contain logical and correct explanations for choosing the optimum clusters using the elbow method. Append the cluster labels obtained from K-Means clustering to the original data frame. Customer segmentation can be visualized using appropriate graphs.
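A brief sketch of the elbow-method loop described above; the range of k and the final choice of 3 clusters are placeholders to be justified from the actual curve, and random_state is fixed only for reproducibility:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# WSS (inertia) for k = 1..10, then plot the elbow curve
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(scaled_df)
    wss.append(km.inertia_)

plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WSS (inertia)")
plt.title("Elbow curve")
plt.show()

# Fit the final model at the chosen k and append the labels to the original data
final_km = KMeans(n_clusters=3, random_state=42, n_init=10)  # k = 3 is illustrative
df["km_cluster"] = final_km.fit_predict(scaled_df)
```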
1.5 Describe cluster profiles for the clusters defined (2.5 pts). Recommend different promotional strategies for different clusters in the context of the business problem in hand (2.5 pts). (Total: 5 points)
After adding the final clusters to the original dataframe, do the cluster profiling: divide the data into the finalized groups, check their variable means, and explain each group briefly. There should be at least 3-4 recommendations. Recommendations should be easily understandable and business specific; students should not give any technical suggestions. Full marks will only be allotted if the recommendations are correct and business specific. Students should explain the profiles and suggest a mechanism to approach each cluster; any logical explanation is acceptable.
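The cluster profiling described here can be as simple as a group-by on the appended labels; the promotional strategies themselves should then be phrased in business terms from these group means:

```python
# Mean of every variable per K-Means cluster, plus the size of each cluster
profile = df.groupby("km_cluster").mean(numeric_only=True)
profile["count"] = df["km_cluster"].value_counts().sort_index()
print(profile.round(2))
```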
2.1 Read the data and do exploratory data analysis (4 pts). Describe the data briefly and interpret the inferences for each (2 pts). (Total: 6 points)
Initial steps like head(), .info(), data types, etc., and a null value check. Distribution plots (histograms) or similar plots for the continuous columns; box plots; correlation plots; appropriate plots for categorical variables; inferences on each plot. Summary stats, skewness, and the proportion of outliers should be discussed, and inferences from the plots above should be included. There is no restriction on how the learner wishes to implement this, but the code should produce the correct output and the inferences should be logical and correct.
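The EDA for Problem 2 follows the same pattern as 1.1; the main addition is the categorical columns. A minimal sketch, assuming the parenthetical names from the attribute list match the actual CSV headers (verify against the file before running):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ins = pd.read_csv("insurance_part2_data-1.csv")
print(ins.head())
print(ins.info())
print(ins.isnull().sum())
print(ins.describe(include="all").T)

# Count plots for the categorical columns, split by the target
for col in ["Agency_Code", "Type", "Channel", "Product", "Destination"]:
    sns.countplot(data=ins, x=col, hue="Claimed")
    plt.xticks(rotation=45)
    plt.show()
```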
2.2 Data Split: Split the data into test and train (0.5 pts), build classification models CART (2.5 pts) and Random Forest (2.5 pts). (Total: 5.5 points)
Object data should be converted into categorical/numerical data to fit the models (e.g. pd.Categorical(...).codes or pd.get_dummies(drop_first=True)). The data split and the ratio defined for the train-test split should be discussed; any reasonable split is acceptable. Use of a random state is mandatory. Each model must be implemented successfully, with a logical reason behind the selection of the different values for the parameters involved in each model. Apply grid search for each model, build models on best_params, and report the feature importance for each model.
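A condensed sketch of the encoding, split, and model-building flow this criterion describes, continuing from the data frame loaded above; the 70:30 split and grid values are assumptions to be tuned, and random_state is fixed as required:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Encode the object columns and separate the target from the features
X = pd.get_dummies(ins.drop(columns=["Claimed"]), drop_first=True)
y = ins["Claimed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# CART with a small illustrative grid, refit on best_params
cart_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [4, 6, 8], "min_samples_leaf": [10, 25, 50]},
    cv=5)
cart_grid.fit(X_train, y_train)
cart = cart_grid.best_estimator_

# Random Forest with a small illustrative grid, refit on best_params
rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [5, 8]},
    cv=5)
rf_grid.fit(X_train, y_train)
rf = rf_grid.best_estimator_

# Feature importance for each fitted model
for name, model in [("CART", cart), ("Random Forest", rf)]:
    imp = pd.Series(model.feature_importances_, index=X.columns)
    print(name, "\n", imp.sort_values(ascending=False).head(10))
```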
2.4 Final Model: Compare all models on the basis of the performance metrics in a structured tabular manner (2.5 pts). Describe which model is best/optimized (1.5 pts). (Total: 4 points)
Provide a table containing all the values of accuracy, precision, recall, and F1 score. Compare the different (final) models on the basis of the values in that table, and state which model suits the problem in hand best on the basis of the different measures. Comment on the final model.
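The performance check asked for in 2.3 and the comparison table in 2.4 might be assembled along these lines, continuing from the fitted models above; pos_label assumes the claimed class is labelled "Yes", so adjust it to the actual target values:

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report,
                             precision_score, recall_score, f1_score)

rows = []
for name, model in [("CART", cart), ("Random Forest", rf)]:
    for split, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_)
        print(f"\n{name} ({split})")
        print(confusion_matrix(y_, pred))
        print(classification_report(y_, pred))
        rows.append({
            "model": name, "set": split,
            "accuracy": accuracy_score(y_, pred),
            "precision": precision_score(y_, pred, pos_label="Yes"),  # assumed label
            "recall": recall_score(y_, pred, pos_label="Yes"),
            "f1": f1_score(y_, pred, pos_label="Yes"),
        })

# Structured table for the final-model comparison
comparison = pd.DataFrame(rows).round(3)
print(comparison)
```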
2.5 Based on your analysis and work on the business problem, detail appropriate insights and recommendations to help the management solve the business objective. (Total: 4.5 points)
There should be at least 3-4 recommendations and insights in total. Recommendations should be easily understandable and business specific; students should not give any technical suggestions. Full marks should only be allotted if the recommendations are correct and business specific.
Please reflect on all that you learnt and fill in this reflection report: https://docs.google.com/forms/d/e/1FAIpQLSd7e_bJVCiFpZAYbBTMtKrr9TLRnl8kuvZT7IsZ5MSjRtfjcQ/viewform?usp=sf_link (0 points)
Total Points: 60