Data Mining Project
Problem 1: Clustering
Problem Statement:
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They
have collected a sample that summarizes the activities of users during the past few months. You are given the
task of identifying the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Exploratory Data Analysis:
Exploratory data analysis is a way of examining data sets in order to summarise their key features, which
typically involves the use of statistical graphics and other data visualisation techniques.
Samples of Data:
• The data contains 210 rows and 7 columns. None of the columns has null values.
• All the columns have integer values.
• There are no duplicate rows.
• There are no major outliers; only a few are seen in “min_payment_amt”.
• We observe from the summary statistics above that “spending” and “advance_payments” have large
values compared to the other columns.
• For clustering analysis we will need to scale the data, because of the differences in column scales
explained in the two points above.
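A minimal sketch of the loading and scaling step, assuming the seven columns described above (the file name credit_card_data.csv is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("credit_card_data.csv")   # hypothetical file name; 210 rows x 7 columns
scaler = StandardScaler()                  # z-score scaling: (x - mean) / std
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.describe().round(2))          # every column now has mean ~0 and std ~1
```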
Correlation Heatmap:
A correlation heatmap displays the correlation matrix of the variables using coloured cells. Correlation
heatmaps are great for comparing the associations between pairs of variables.
• We observe that many columns are highly positively correlated.
• Values higher than 0.80 denote high correlation, meaning that when one factor goes up the other
goes up with it.
• We see high correlation between spending and advance_payments, credit_limit and current_balance,
which is natural: people with high spending will naturally have high balances, credit limits and advance
payments.
• Similarly, we see good correlation between spending and max_spent_in_single_day, which indicates
that people with high spending power also have a high maximum spend in a single day.
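A minimal sketch of how such a heatmap can be produced, assuming df is the DataFrame loaded above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()                           # pairwise Pearson correlations
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.show()
```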
Univariate/Bivariate Analysis:
• When we draw the univariate and bivariate graphs, we observe the same correlations depicted in the
heatmap.
• Wherever there is high correlation, as with spending versus advance_payments, credit_limit,
current_balance and max_spent_in_single_day, the plots trend upward and downward together with spending.
(a) Dendrogram using “Average” method (b) Dendrogram using “Ward” method
• We generated dendrograms using the “Average” and “Ward” linkage methods, and observe there is not much
difference between them.
• From the dendrograms we see that 3 (three) clusters optimally segregate the customers.
• Using the Ward linkage method, clusters C1, C2 and C3 have 70, 67 and 73 customers respectively.
• Let us examine the cluster groups created by the Ward dendrogram for further analysis (a code sketch
follows the observations below).
• The three groups C1, C2 and C3 are primarily segregated by spending power, giving three groups we can
call High, Medium and Low spending power.
• The separation is clearest on ‘spending’, ‘max_spent_in_single_day’ and ‘advance_payments’, which define
the High, Medium and Low spending groups.
• C1 = High spending power, C2 = Low spending power and C3 = Medium spending power.
• Another observation is that “min_payment_amt” is highest in the C2 group.
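A minimal sketch of the dendrogram and the three-cluster cut, assuming scaled is the standardized data from earlier (the plotting options are assumptions):

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
import matplotlib.pyplot as plt

Z = linkage(scaled, method="ward")          # swap in method="average" to compare
dendrogram(Z, truncate_mode="lastp", p=25)  # condensed view of the tree
plt.show()

df["Cluster_ward"] = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
print(df["Cluster_ward"].value_counts())    # sizes reported above: 70 / 67 / 73
print(df.groupby("Cluster_ward").mean().round(2))  # profile each cluster
```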
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette
score. Explain the results properly. Interpret and write inferences on the finalized clusters.
(a) Elbow-curve for K-means (b) Silhouette scores for 3 and 4 clusters
• We observe from the elbow curve above that 3 clusters are sufficient, as the curve bends (the “elbow”) at 3.
• Also, comparing the silhouette scores for 3 and 4 clusters, the difference is insignificant, so 3 is the
optimum number of clusters.
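A minimal sketch of the elbow curve, silhouette comparison and final labelling, assuming scaled is the standardized data (random_state is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

wss = [KMeans(n_clusters=k, random_state=1).fit(scaled).inertia_
       for k in range(1, 11)]               # within-cluster sum of squares
plt.plot(range(1, 11), wss, marker="o")     # the bend ("elbow") appears at k = 3
plt.show()

for k in (3, 4):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled)
    print(k, round(silhouette_score(scaled, labels), 3))

df["Clus_kmeans3"] = KMeans(n_clusters=3, random_state=1).fit_predict(scaled)
```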
The last column, “Clus_kmeans3”, shows the assigned cluster for each row.
• Here we observe that 3 clusters, labelled 0, 1 and 2, are formed. The customers are segregated into these
clusters as 72, 67 and 71 respectively.
• The clusters are again formed based on spending power, just as in hierarchical clustering
(dendrograms).
• As before, the separation is clearest on ‘spending’, ‘max_spent_in_single_day’ and ‘advance_payments’,
which define the High, Medium and Low spending groups.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different
clusters.
As the clusters are profiled with respect to their spending power, the marketing strategy for promotional
offers should be based on the same. Below are some recommendations based on the cluster profile analysis.
High Spending Power:
• These are big, premium customers who spend more and spend lavishly.
• Credit limit: Raise their credit limit so they can spend more
• Loyalty rewards: Offer loyalty reward points to increase their spending and make them more loyal
• Cross-sell: As they have high spending and good repayment, cross-sell personal loans
• Subscription tie-ups: Promote subscription-based offers and regular monthly subscriptions to increase
engagement and spending
• Fewer discounts: These customers do not need many discounts, but offer some, e.g. a discount on
full payment, or a discount if they spend ‘x’ amount in a month, where ‘x’ is at least 20% more
than their current monthly spend
Medium Spending Power:
• These are medium, regular customers who routinely spend on bills or purchases.
• These are prospective customers who can be motivated into the High-spending group
• Credit limit: Raise their credit limits a little so that they can spend more
• Discount offers: Study their spending habits and promote discounts on items they regularly buy
• Loyalty rewards: Offer loyalty reward points to increase their spending
• Surveys/Reviews: Increase engagement by asking them to fill in surveys, to learn where they need
more discount/service support
• Discount offers: These customers respond to major discounts and offers on the stores/bills they generally pay
• Payment offers: Offers/discounts related to payments, such as early payment or full payment
Problem 2: CART, RF & ANN
Problem Statement:
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Sample of the data
• We can see issues in the Duration column; there are 2 wrong values:
o We observe a minimum value of -1, which is not possible. Let’s update it to 1, assuming the
‘-’ sign is a typo (the value could be 0 or 1; we keep it as 1)
o We see a maximum value of 4580, which is extremely high compared to the others. It looks like
another typo with an extra ‘0’, so let’s update it to 458
• After correcting the Duration values “-1” and “4580”, we see the corrected statistical summary above.
• There are no major changes in the statistical metrics, so we are fine to move ahead.
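A minimal sketch of the two Duration corrections described above, assuming the insurance data is loaded into a pandas DataFrame df:

```python
df.loc[df["Duration"] == -1, "Duration"] = 1      # assume the '-' sign was a typo
df.loc[df["Duration"] == 4580, "Duration"] = 458  # assume an extra trailing '0'
print(df["Duration"].describe())                  # re-check the summary statistics
```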
Univariate, Bi-variate, and multivariate analysis:
Numerical Columns
Categorical Columns
• Besides ‘Age’, all the other numerical columns are right-skewed.
• The ‘EPX’ agency has the maximum bookings, while ‘JZI’ has the least; the difference between them is high
• The ‘Customised Plan’ is the most opted plan and the ‘Gold Plan’ the least; the difference between them is high
• ‘Asia’ is the most preferred destination, while ‘Europe’ is the least
• The ‘C2B’ agency has the highest insurance claim percentage, while the ‘JZI’ and ‘EPX’ agencies have
the least
• The ‘Silver Plan’ and ‘Gold Plan’ have the maximum claim percentages, while the ‘Cancellation Plan’ has the least
• Almost all bookings are through the ‘Online’ channel; only 46 rows are from the ‘Offline’ channel
Figure: Insurance Claimed % (by Agency)
Agency   Claimed   Not Claimed
C2B      61%       39%
CWT      30%       70%
EPX      14%       86%
JZI      13%       87%
Figure: Insurance Claimed % (by Plan)
Plan               Claimed   Not Claimed
Bronze Plan        39%       61%
Cancellation Plan  6%        94%
Customised Plan    22%       78%
Gold Plan          64%       36%
Silver Plan        72%       28%
Sample of data after converting all object values to Categorical codes and dropping target column “Claimed”
• We are not scaling the data, as tree-based models are not much impacted by unscaled features.
• We observe that “Claimed” is split roughly 30/70 between Yes and No. We will divide the data into Train
and Test sets in a 70/30 ratio, which is also a standard split for classification problems.
• After dividing the dataset into Train and Test sets, we have 2100 rows in the Training set and 900 in
the Test set
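A minimal sketch of the encoding and the 70/30 split described above; the random_state and the use of stratification are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Object columns -> integer category codes (tree models handle these directly).
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes

X = df.drop("Claimed", axis=1)             # features
y = df["Claimed"]                          # target, roughly 30% Yes / 70% No
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)         # (2100, ...) and (900, ...)
```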
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
• We see that Agency_Code and Sales are the two most significant
contributors in predicting Claimed vs Not Claimed
• Product_Name is the third, with ~12% importance
• Destination, Channel, Type, Age and Commission play almost no
role in the classification
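A minimal sketch of the CART model and its feature importances; the hyperparameters shown are assumptions (the notebook may use tuned values):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_train, y_train)
imp = pd.Series(dt.feature_importances_, index=X_train.columns)
print(imp.sort_values(ascending=False))    # Agency_Code and Sales dominate
```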
(a) Confusion Matrix (Training Set) (b) Confusion Matrix (Test Set)
(a) Classification Report (Training Set) (b) Classification Report (Test set)
Summary – CART
• The overall accuracy of the Decision Tree is 79% in the Training phase and 78.6% in the Testing phase
• Precision (how many of the predicted positives are correct) is 0.70 and 0.68 on Train/Test
• Recall (how many of the actual positives the model captures) is 0.56 and 0.57 on Train/Test. This score is low
• So overall, the model performs very similarly on the Train and Test datasets, which indicates it is not overfitting.
(a) ROC curve of Training set AUC= 0.849 (b) ROC curve of Test set AUC=0.817
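A minimal sketch of how these Train/Test metrics can be produced, shown for the fitted CART model dt; the same calls apply to the RF and ANN models below:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

for name, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = dt.predict(X_)
    proba = dt.predict_proba(X_)[:, 1]     # probability of the positive class
    print(name, "accuracy:", round(accuracy_score(y_, pred), 3))
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print("AUC:", round(roc_auc_score(y_, proba), 3))
```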
(a) Confusion Matrix (Training set) (b) Confusion Matrix (Testing set)
(a) Classification Report (Training set) (b) Classification Report (Test set)
Summary – Random Forest
• The overall accuracy of the Random Forest is 80.0% in the Training phase and 78.8% in the Testing phase
• Precision (how many of the predicted positives are correct) is 0.71 and 0.67 on Train/Test. This
score is higher than the CART classifier
• Recall (how many of the actual positives the model captures) is 0.60 and 0.62 on Train/Test
• So overall, the model performs well on Train with only a slight decrease on Test, so it does not look
like overfitting or underfitting: a stable model
• The difference between the FP and FN counts is negligible, indicating a robust model
(a) ROC curve for Training set AUC=0.806 (b) ROC curve for Test set AUC= 0.796
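A minimal sketch of the Random Forest model; n_estimators and max_depth are assumptions and may differ from the tuned values in the notebook:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, max_depth=6, random_state=1)
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train), rf.score(X_test, y_test))  # ~0.80 / ~0.79
```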
(a) Confusion Matrix for Training set (b) Confusion Matrix for Test set
(a) Classification Report for Training set (b) Classification Report for Test set
Summary – ANN
• The overall accuracy of the ANN is 77.8% in the Training phase and 77.6% in the Testing phase
• Precision (how many of the predicted positives are correct) is 0.69 and 0.70 on Train/Test
• Recall (how many of the actual positives the model captures) is 0.51 and 0.53 on Train/Test
• So overall, the model performs very similarly on the Train and Test datasets, so it is not overfitting
• The major contributing features cannot be determined in an ANN model; it works as a black box.
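A minimal sketch of the ANN, using scikit-learn's MLPClassifier as an assumed stand-in for the notebook's network; all hyperparameters are assumptions:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Unlike the tree models, an MLP does benefit from scaled inputs,
# so a scaler is included in the pipeline here.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000, random_state=1))
ann.fit(X_train, y_train)
print(ann.score(X_train, y_train), ann.score(X_test, y_test))
```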
2.4 Final Model: Compare all the models and write an inference about which model is best/optimized.
Let us compare the Classification Metrics and ROC curves for all 3 models and then decide.
(a) ROC curve comparison for all three models – Train (b) ROC curve comparison for all three models – Test
• Random Forest has the maximum accuracy: 80% on Train and 78.8% on Test
• We will choose Random Forest for claim prediction for the following reasons
❖ It has the maximum accuracy on Train and Test (how accurately the model classifies)
❖ It provides information about the essential features used in classification.
❖ These critical features will support further analysis and bring more clarity to the
solution of this business problem
❖ It has high precision (0.61), meaning the share of predicted positives that are correct is high, so
we will predict true positives more accurately
❖ Its recall (0.56) is also slightly higher than the other models’ (0.51), though the difference is small
❖ It also has the highest F1-score; the F1-score is the harmonic mean of precision and recall.
❖ Highest AUC – AUROC denotes the total measure of performance across all possible
classification thresholds, so the higher the value, the more accurately the model
distinguishes between the positive and negative classes. If AUC = 1, the model can
perfectly separate all the positive and negative class points. So a high AUC of 0.86 is
good compared to the others.
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
We can get insights and recommendations from the EDA and Important Feature set identified by the
Classifiers.
- From the classifiers, we see that Agency and Product Plan play a significant role. We also see the same
in the illustrated decision tree diagram.
- The dataset is highly imbalanced between the “Online” and “Offline” modes; we need to collect more data
for the Offline mode. Similarly, ‘Asia’ dominates the destinations, with significantly less information
about Europe and the Americas, so more data should be collected there too.
- Agency ‘C2B’: We need to analyse this agency further. It has the highest claim percentage and
the second-highest bookings, so it is important.
o Surprisingly, no ‘Cancellation Plan’ is offered by C2B. They should add the
‘Cancellation Plan’ to their offering, as it has the fewest claim cases.
o There is no “Airlines” mode offered, and no destination other than ‘Asia’. Since Asia has the
highest claim percentage, they should expand their bookings to other regions.
Images:
- The images in the Introduction sections of both problems are taken from:
https://thenextweb.com/news/6-ways-to-save-big-on-business-travel
https://www.roamright.com/travel-insurance-blog/filing-claim/
https://towardsdatascience.com/segmenting-credit-card-customers-with-machine-learning-ed4dbcea009c
- Graphs and ROC curves are from the attached Jupyter notebooks
- Some graphs have been re-created in MS Excel for better clarity