
Machine Learning - Project

Ashit Debdas
BACP-2020

Table of Contents
1 Project Objective
1.1 Problem One: Clustering
1.2 Problem Two: CART-RF-ANN
2 Exploratory Data Analysis – Step by Step Approach
2.1 Install Necessary Packages and Invoke Library
2.2 Set up Working Directory
2.3 Import and Read the Data Sets
3 Variable Identification
4 Missing Value Treatment
5 Insights from Problem One
5.1 Read the data and do exploratory data analysis
5.2 Do you think scaling is necessary for clustering in this case?
5.3 Apply hierarchical clustering to scaled data, identify the optimum number of clusters using the dendrogram and briefly describe them
5.4 Apply K-Means clustering on scaled data and determine the optimum clusters
5.5 Describe cluster profiles for the clusters defined and recommend different promotional strategies for different clusters
6 Insights from Problem Two
6.1 Read the dataset, do the descriptive statistics and the null value condition check, and write an inference
6.2 Data Split: Split the data into test and train, build classification models CART, Random Forest and ANN
6.3 Performance Metrics: Check the performance of predictions on train and test sets using the confusion matrix
6.4 Final Model: Compare all the models and write an inference on which model is optimized
6.5 Inference: Based on these predictions, what are the business insights and recommendations

1 Project Objective
1.1 Problem One
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They collected a sample that summarizes the activities of users during the past few months. You are given the task to identify the segments based on credit card usage. The data of the last few months is in "bank_marketing_part1_Data (2).csv".

1.2 Problem Two
An insurance firm providing tour insurance is facing a higher claim frequency. The management decides to collect data from the past few years. You are assigned the task of building a model that predicts the claim status and of providing recommendations to the management. Use CART, RF & ANN and compare the models' performance on the train and test sets. The data of the last few years is in "insurance_part2_data (1).csv".

2 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps.
2.1 Install necessary packages and invoke Library
Before starting this section, the necessary packages are installed and the associated libraries are invoked. Keeping all the package loads in the same place makes the code more readable and efficient.
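A minimal sketch of this step is shown below; the package list is an assumption based on the techniques used later in the report (clustering, CART, random forest, ANN and confusion matrices), since the report does not list the exact packages.

# Install once (commented out after the first run), then load the libraries
# install.packages(c("cluster", "factoextra", "rpart", "rpart.plot",
#                    "randomForest", "neuralnet", "caret"))
library(cluster)        # hierarchical clustering, silhouette
library(factoextra)     # cluster visualisation, fviz_nbclust
library(rpart)          # CART decision trees
library(rpart.plot)     # decision tree plots
library(randomForest)   # random forest, tuneRF
library(neuralnet)      # artificial neural network
library(caret)          # confusionMatrix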
2.2 Set up Working Directory
Setting a working directory at the start of the R session makes importing and exporting data files and code files easier. The working directory is the location/folder on the PC where the data and code related to the project are kept, which keeps the project organised.
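A minimal sketch, using a placeholder path that should be replaced with the actual project folder:

# Set the working directory to the project folder (placeholder path)
setwd("C:/Projects/Machine-Learning-Project")
getwd()   # confirm the current working directory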
2.3 Import and Read the data
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the file.
 Problem One for Clustering: bank_marketing_part1_Data (2).csv
 Problem Two for CART-RF-ANN: insurance_part2_data (1).csv
3 Variable Identification
The dataset is analyzed for a basic understanding of the features and the data it contains. It is usually an activity by which data is explored and organized.
• Variable classes:
Problem One has 210 rows and 7 columns.
Problem Two has 3000 rows and 10 columns.
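A short sketch of how these dimensions and classes can be checked, assuming the data frames customer_segm and Insurance created with read.csv as shown in the next section:

# Rows and columns of each dataset
dim(customer_segm)   # expected: 210 7
dim(Insurance)       # expected: 3000 10
# Class of each variable
sapply(customer_segm, class)
sapply(Insurance, class)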

4 Missing Value Treatment


Missing value treatment is an important step in exploratory data analysis. Missing data in the training set can reduce the power of a model or lead to a biased model, because the behavior of and relationships with other variables have not been analyzed correctly, and this can lead to wrong predictions or classifications. Neither of the two datasets under scrutiny has any missing values.

Problem one:

> customer_segm = read.csv("bank_marketing_part1_Data (2).csv", header = TRUE)


> anyNA(customer_segm)
[1] FALSE

Problem two:

> Insurance = read.csv("insurance_part2_data (1).csv", header = TRUE)


> anyNA(Insurance)
[1] FALSE

5 Insights from Problem one


5.1 Read the data and do exploratory data analysis.
> summary(customer_segm)
spending advance_payments probability_of_full_payment current_balance credit_limit min_payment_amt
Min. :10.59 Min. :12.41 Min. :0.8081 Min. :4.899 Min. :2.630 Min. :0.7651
1st Qu.:12.27 1st Qu.:13.45 1st Qu.:0.8569 1st Qu.:5.262 1st Qu.:2.944 1st Qu.:2.5615
Median :14.36 Median :14.32 Median :0.8734 Median :5.524 Median :3.237 Median :3.5990
Mean :14.85 Mean :14.56 Mean :0.8710 Mean :5.629 Mean :3.259 Mean :3.7002
3rd Qu.:17.30 3rd Qu.:15.71 3rd Qu.:0.8878 3rd Qu.:5.980 3rd Qu.:3.562 3rd Qu.:4.7687
Max. :21.18 Max. :17.25 Max. :0.9183 Max. :6.675 Max. :4.033 Max. :8.4560
max_spent_in_single_shopping
Min. :4.519
1st Qu.:5.045
Median :5.223
Mean :5.408
3rd Qu.:5.877
Max. :6.550
As we can see from the data summary, there are 7 columns. Every column has a unique name, and the minimum, quartiles, median, mean and maximum can be read off for each variable.

> str(customer_segm)
'data.frame': 210 obs. of 7 variables:
$ spending : num 19.9 16 18.9 10.8 18 ...
$ advance_payments : num 16.9 14.9 16.4 13 15.9 ...
$ probability_of_full_payment : num 0.875 0.906 0.883 0.81 0.899 ...
$ current_balance : num 6.67 5.36 6.25 5.28 5.89 ...
$ credit_limit : num 3.76 3.58 3.75 2.64 3.69 ...
$ min_payment_amt : num 3.25 3.34 3.37 5.18 2.07 ...
$ max_spent_in_single_shopping: num 6.55 5.14 6.15 5.18 5.84 ...

The accompanying correlation plot helps in understanding the strength of the correlation between the variables.

5.2 Do you think scaling is necessary for clustering in this case?
Normalization is used to eliminate redundant data and ensures that good-quality clusters are generated, which can improve the efficiency of clustering algorithms. It therefore becomes an essential step before clustering, because Euclidean distance is very sensitive to differences in scale, and all dimensions should be treated as equally important.

In this market segmentation data, the variables are measured on different scales:
1. spending: amount spent by the customer per month (in 1000s)
2. advance_payments: amount paid by the customer in advance by cash (in 100s)
3. probability_of_full_payment: probability of payment done in full by the customer to the bank
4. current_balance: balance amount left in the account to make purchases (in 1000s)
5. credit_limit: limit of the amount on the credit card (in 10000s)
6. min_payment_amt: minimum amount paid by the customer while making payments for purchases made monthly (in 100s)
7. max_spent_in_single_shopping: maximum amount spent in one purchase (in 1000s)

The snapshot below shows the data after scaling.

> head(customer_segm_scale)
spending advance_payments probability_of_full_payment current_balance credit_limit min_payment_amt
[1,] 1.7501726 1.8076485 0.1778050 2.3618888 1.3353877 -0.2980937
[2,] 0.3926441 0.2532349 1.4981931 -0.5993122 0.8561898 -0.2422262
[3,] 1.4099313 1.4247880 0.5036700 1.3981443 1.3142077 -0.2209434
[4,] -1.3807350 -1.2246066 -2.5856995 -0.7911583 -1.6351103 0.9855289
[5,] 1.0800003 0.9959842 1.1934881 0.5901336 1.1527101 -1.0855596
[6,] -0.7380569 -0.8800322 0.6941106 -1.0055745 -0.4437341 3.1630318
max_spent_in_single_shopping
[1,] 2.3234463
[2,] -0.5372979
[3,] 1.5055095
[4,] -0.4538765
[5,] 0.8727275
[6,] -0.8302902
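For reference, a minimal sketch of how such a scaled matrix can be produced in base R (the exact scaling call is not shown in the report):

# Standardise each variable to mean 0 and standard deviation 1
customer_segm_scale <- scale(customer_segm)
head(customer_segm_scale)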

5.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.

Since clustering is unsupervised learning, after computing the distance matrix and plotting the dendrogram we can see that three clusters would be the optimum number.

The dendrogram gives a clear picture: looking at the merge heights returned by hclust and at the visual graph, we can see where the various merges happen. The last merges show significant drops in height, and after the third merge there are no further significant drops, so we can consider three clusters optimum.

The clusplot visualization also shows that the first two components explain 88.93% of the variability, so we can conclude that three clusters would be the best fit.
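A minimal sketch of the hierarchical clustering step; Euclidean distance and Ward's linkage are assumptions, since the report does not state the linkage method used:

# Distance matrix on the scaled data
dist_mat <- dist(customer_segm_scale, method = "euclidean")
# Hierarchical clustering and dendrogram
hc <- hclust(dist_mat, method = "ward.D2")
plot(hc, labels = FALSE, main = "Dendrogram of scaled customer data")
rect.hclust(hc, k = 3, border = "red")   # highlight the 3-cluster solution
# Assign each customer to one of the three clusters
hc_clusters <- cutree(hc, k = 3)
table(hc_clusters)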

5.4 Apply K-Means clustering on scaled data and determine the optimum clusters.

K-Means clustering gives 3 clusters of sizes 72, 71 and 67. Various graphical plots, shown below, support this: the cluster plot, the WSS (elbow) method, the silhouette method, and the gap statistic (bootstrapping) method. Every graphical method shows that three clusters is the optimal number.
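A minimal sketch of the K-Means step and the graphical checks mentioned above, assuming the factoextra package; the seed and nstart values are assumptions:

set.seed(123)
km <- kmeans(customer_segm_scale, centers = 3, nstart = 25)
km$size                      # cluster sizes (report: 72, 71, 67)
# Graphical checks for the optimal number of clusters
fviz_nbclust(customer_segm_scale, kmeans, method = "wss")         # elbow / WSS
fviz_nbclust(customer_segm_scale, kmeans, method = "silhouette")  # silhouette
fviz_nbclust(customer_segm_scale, kmeans, method = "gap_stat")    # gap statistic
fviz_cluster(km, data = customer_segm_scale)                      # cluster plot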

The Hubert index is a graphical method of determining the number of clusters. In the plot of the Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure, i.e. the significant peak in the Hubert index second differences plot.

The D index is a graphical method of determining the number of clusters. In the plot of the D index, we seek a significant knee (the significant peak in the D index second differences plot) that corresponds to a significant increase of the value of the measure.

According to the majority rule, the best number of clusters is 3.

5.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies
for different clusters.

In hierarchical clustering, the first group of clusters shows clearly different values on the variables from the second and third groups.

Running the silhouette function, we can observe each cluster's size and average silhouette width, and that the clusters do not overlap. We can also observe that cluster 1's closest neighbour is cluster 2, and cluster 2's closest neighbour is cluster 3. Based on the hierarchical clustering, the cluster 1 group of customers spends more, usually makes more advance payments, and has a higher probability of full payment compared with the cluster 3 group.

From a business perspective, we can target the cluster 1 customers with the most attractive offers, followed by cluster 2 and then cluster 3.

K-Means clustering is one of the simplest and most popular unsupervised machine learning algorithms. The K-Means algorithm identifies k centroids and then allocates every data point to the nearest cluster. For this problem statement, just as in hierarchical clustering, the group 1 customers spend more money and make more advance payments compared with the other clusters.
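A minimal sketch of how the cluster profiles and silhouette values described above can be produced, assuming the K-Means object km and the scaled data from the earlier steps:

# Mean of each original variable within each cluster (cluster profile)
aggregate(customer_segm, by = list(cluster = km$cluster), FUN = mean)
# Silhouette widths: cluster sizes, average widths and nearest neighbours
sil <- silhouette(km$cluster, dist(customer_segm_scale))
summary(sil)
plot(sil)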

6 Insights from Problem Two


6.1 Read the dataset. Do the descriptive statistics and do null value condition check, write an
inference?

As we can see, the data frame has 3000 observations of 10 variables.

The summary reveals the 10 columns with their mean, median and quartile values, and all the necessary outputs can be viewed from it.

So far, we do not have any null values in this data set.

6.2 Data Split: Split the data into test and train, build classification model CART, Random Forest and
Artificial Neural Network
The data set is successfully split in an 80:20 ratio. We now have a training set and a test set: the training set has 2400 observations and the test set has 600 observations.

We can observe an almost similar percentage of claimed status in both data sets. Overall, there are 924 claimed records (30.80%) and 2076 not-claimed records (69.20%).
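A minimal sketch of an 80/20 split, assuming a data frame named Insurance with the target column Claimed; the report does not show the exact splitting code, so the seed and method are assumptions:

set.seed(123)
# 80% of row indices for training, the rest for testing
train_idx <- sample(seq_len(nrow(Insurance)), size = floor(0.8 * nrow(Insurance)))
train_set <- Insurance[train_idx, ]
test_set  <- Insurance[-train_idx, ]
nrow(train_set)   # expected: 2400
nrow(test_set)    # expected: 600
# Proportion of claimed vs not-claimed records in each set
prop.table(table(train_set$Claimed))
prop.table(table(test_set$Claimed))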
Model Building – CART
Decision trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent) variable, here Claimed, based on the values of several input (or independent) variables. Classification trees are used where the target variable is categorical, to identify the "class" into which the target variable would most likely fall; regression trees are used where the target variable is continuous, to predict its value.

The arguments passed to rpart.control are checked against the list of valid arguments when creating the decision tree model. The visual plot represents the decision tree model.

Here we can see the various nsplit and xerror values. After the 7th split there is a significant increasing trend in xerror, from 0.71448 up to 0.72936 at the 10th split.

Using a post-pruning technique, we can cut the tree back, since xerror was increasing after the 7th split.
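A minimal sketch of the CART model and the post-pruning step described above, assuming the rpart package and the train/test split sketched earlier; the control parameters are assumptions:

# Grow a full classification tree for Claimed on all predictors
cart_model <- rpart(Claimed ~ ., data = train_set, method = "class",
                    control = rpart.control(minsplit = 20, cp = 0))
printcp(cart_model)      # cp table: nsplit, rel error and xerror
rpart.plot(cart_model)
# Prune at the cp value with the lowest cross-validated error (xerror)
best_cp <- cart_model$cptable[which.min(cart_model$cptable[, "xerror"]), "CP"]
cart_pruned <- prune(cart_model, cp = best_cp)
rpart.plot(cart_pruned)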
Model Building - Random Forest
A supervised classification algorithm which, as the name suggests, creates a forest of many trees grown on random samples. In general, the more trees in the forest, the more robust the forest is; in the random forest classifier, a higher number of trees gives higher-accuracy results.
Some advantages of using random forest are as follows:
• The same random forest algorithm (random forest classifier) can be used for both classification and regression tasks.
• The random forest classifier can handle missing values.
• With more trees in the forest, the random forest classifier is less likely to overfit the model.
• The random forest classifier can also model categorical values.
The model is built with Claimed as the dependent variable, considering all independent variables.

The random forest algorithm is a classifier based primarily on two methods: bagging and the random subspace method.

Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating to subsample the data used for training. Out-of-bag estimates help avoid the need for an independent validation dataset.

In this model the OOB estimate of the error rate is 21.96%, and the model shows a significant decrease in the error rate as the number of trees is increased. OOB error is a combined measure across the claim statuses (yes and no). It is observed that as the number of trees increases, the OOB error rate starts decreasing.

In random forests, the number of variables available for splitting at each tree node is referred to as the mtry parameter. The optimum number of variables is obtained using the tuneRF function; the optimum mtry here is 9.
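A minimal sketch of the random forest and the mtry tuning step, assuming the randomForest package and that Claimed is a factor; ntree and the tuneRF settings are assumptions:

set.seed(123)
rf_model <- randomForest(Claimed ~ ., data = train_set,
                         ntree = 500, importance = TRUE)
print(rf_model)          # includes the OOB estimate of error rate
plot(rf_model)           # OOB error rate versus number of trees
# Tune mtry: the number of variables tried at each split
set.seed(123)
tuned <- tuneRF(x = train_set[, setdiff(names(train_set), "Claimed")],
                y = train_set$Claimed,
                stepFactor = 1.5, improve = 0.01,
                ntreeTry = 500, trace = TRUE, plot = TRUE)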

Model Building – Artificial Neural Network


Artificial neural networks (ANNs) are statistical models directly inspired by, and partially modeled on biological neural
networks. They are capable of modeling and processing nonlinear relationships between inputs and outputs in parallel.
Artificial neural networks are characterized by containing adaptive weights along paths between neurons that can be tuned
by a learning algorithm that learns from observed data in order to improve the model. In addition to the learning algorithm
itself, one must choose an appropriate cost function.
The cost function is what is used to learn the optimal solution to the problem being solved. This involves determining the best values for all of the tunable model parameters, with the neuron-path adaptive weights being the primary target, along with algorithm tuning parameters such as the learning rate. It is usually done through optimization techniques such as gradient descent or stochastic gradient descent.
These optimization techniques try to make the ANN solution as close as possible to the optimal solution; when this is successful, the ANN is able to solve the intended problem with high performance.

In this artificial neural network, training converged after 6312 steps, when the error had reduced to 144.49637 at the minimum threshold; the corresponding network visualization graph is shown as well.
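A minimal sketch of the neural network step, assuming the neuralnet package; the hidden layer size, the 0/1 encoding of Claimed and the scaling of predictors are assumptions, since the report only quotes the final step count and error:

# neuralnet needs numeric inputs: one-hot encode predictors, scale them,
# and encode the target as 0/1 (assuming levels "Yes"/"No")
train_x <- scale(model.matrix(Claimed ~ . - 1, data = train_set))
colnames(train_x) <- make.names(colnames(train_x))
train_nn <- data.frame(train_x, Claimed = as.numeric(train_set$Claimed == "Yes"))
# Build the formula from the predictor names, then fit the network
f <- as.formula(paste("Claimed ~", paste(colnames(train_x), collapse = " + ")))
nn_model <- neuralnet(f, data = train_nn, hidden = 5,
                      linear.output = FALSE, stepmax = 1e5)
plot(nn_model)                                    # network diagram with weights
nn_model$result.matrix[c("error", "steps"), ]     # final error and step count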
6.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Confusion Matrix
The visualization below reflects the CART model confusion matrices, which show an accuracy of 79% on the training set and 77% on the test set. Since the insurer is facing a higher claim frequency, it is worth noting that the majority claim status is "No" in both the train and test data sets, even though the current study shows a significant increase in insurance claims.
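A minimal sketch of how these confusion matrices and accuracies can be produced, assuming the caret package and the pruned CART model sketched earlier:

# CART: predicted classes on train and test sets
cart_pred_train <- predict(cart_pruned, newdata = train_set, type = "class")
cart_pred_test  <- predict(cart_pruned, newdata = test_set,  type = "class")
confusionMatrix(cart_pred_train, train_set$Claimed)   # training accuracy
confusionMatrix(cart_pred_test,  test_set$Claimed)    # test accuracy
# The same pattern applies to the random forest and ANN predictions,
# with class probabilities thresholded where needed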

Confusion Matrix = CART

Confusion Matrix = Random Forest

The random forest model shows a different accuracy on the train and test data: the training data gives an accuracy of 90%, but the test data gives 77%. The training data therefore shows noticeably better accuracy.
Confusion Matrix = Artificial Neural Network

In the artificial neural network we can observe a similar kind of trend: the test data has 77% accuracy and the training data has 81%.

6.4 Final Model: Compare all the models and write an inference on which model is best/optimized

The CART method has given poor performance compared with Random Forest and ANN. Looking at the percentage deviation between the training and testing datasets, it looks like the model is overfit. The Random Forest method has the best performance (best accuracy) among all three models, and its percentage deviation between the training and testing datasets is also reasonably under control, suggesting a robust model. The Neural Network has given relatively secondary performance compared with Random Forest, although better than CART; however, its percentage deviation between the training and testing datasets is the smallest among the three models.

6.5 Inference: Basis on these predictions, what are the business insights and recommendations
The main objective of the project was to develop a predictive model for an insurance firm providing tour insurance that is facing a higher claim frequency, so that claim status can be predicted in advance. Based on the current AUC (area under the ROC curve), there is a reasonable probability of identifying, using machine learning tools, which customers are likely to claim and which customers will respond positively to a promotion or an offer.

