Machine Learning - Project
Ashit Debdas
BACP-2020
Table of Contents
1 Project Objective
  1.1 Problem One: Clustering
  1.2 Problem Two: CART-RF-ANN
2 Exploratory Data Analysis – Step-by-Step Approach
  2.1 Install Necessary Packages and Invoke Libraries
  2.2 Set Up Working Directory
  2.3 Import and Read Data Set
3 Variable Identification
4 Missing Value Treatment
5 Insights from Problem One
  5.1 Read the data and do exploratory data analysis
  5.2 Do you think scaling is necessary for clustering in this case?
  5.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using the dendrogram and briefly describe them
  5.4 Apply K-Means clustering on scaled data and determine optimum clusters
  5.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
6 Insights from Problem Two
  6.1 Read the dataset. Do the descriptive statistics and a null-value condition check; write an inference
  6.2 Data Split: Split the data into test and train; build classification models CART, Random Forest and ANN
  6.3 Performance Metrics: Check the performance of predictions on train and test sets using the confusion matrix
  6.4 Final Model: Compare all the models and write an inference on which model is best/optimized
  6.5 Inference: Based on these predictions, what are the business insights and recommendations?
1 Project Objective
1.1 Problem one
A leading bank wants to develop customer segmentation to give promotional offers to its customers. It collected a sample that summarises the activities of users during the past few months. You are given the task of identifying the segments based on credit card usage. The data for the last few months is in "bank_marketing_part1_Data (2).csv".
1.2 Problem two
An insurance firm providing tour insurance is facing higher claim frequency. The task is to build and compare CART, Random Forest and ANN classification models to predict the claim status (see Section 6).
5 Insights from Problem One
5.1 Read the data and do exploratory data analysis
> str(customer_segm)
'data.frame': 210 obs. of 7 variables:
$ spending : num 19.9 16 18.9 10.8 18 ...
$ advance_payments : num 16.9 14.9 16.4 13 15.9 ...
$ probability_of_full_payment : num 0.875 0.906 0.883 0.81 0.899 ...
$ current_balance : num 6.67 5.36 6.25 5.28 5.89 ...
$ credit_limit : num 3.76 3.58 3.75 2.64 3.69 ...
$ min_payment_amt : num 3.25 3.34 3.37 5.18 2.07 ...
$ max_spent_in_single_shopping: num 6.55 5.14 6.15 5.18 5.84 ...
5.2 Do you think scaling is necessary for clustering in this case?
Yes. The variables are measured on very different scales (spending is around 19 while probability_of_full_payment is below 1), so they are standardised before the distance-based clustering; the first rows of the scaled data are shown below.
> head(customer_segm_scale)
spending advance_payments probability_of_full_payment current_balance credit_limit min_payment_amt
[1,] 1.7501726 1.8076485 0.1778050 2.3618888 1.3353877 -0.2980937
[2,] 0.3926441 0.2532349 1.4981931 -0.5993122 0.8561898 -0.2422262
[3,] 1.4099313 1.4247880 0.5036700 1.3981443 1.3142077 -0.2209434
[4,] -1.3807350 -1.2246066 -2.5856995 -0.7911583 -1.6351103 0.9855289
[5,] 1.0800003 0.9959842 1.1934881 0.5901336 1.1527101 -1.0855596
[6,] -0.7380569 -0.8800322 0.6941106 -1.0055745 -0.4437341 3.1630318
max_spent_in_single_shopping
[1,] 2.3234463
[2,] -0.5372979
[3,] 1.5055095
[4,] -0.4538765
[5,] 0.8727275
[6,] -0.8302902
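As a minimal, hedged sketch, the output above could have been produced along the following lines; the file name comes from the problem statement, while the read.csv defaults and the use of scale() are assumptions.

# Read the credit-card usage sample (file name from the problem statement)
customer_segm <- read.csv("bank_marketing_part1_Data (2).csv")
str(customer_segm)

# Standardise each variable to mean 0 / sd 1 so that no single variable
# dominates the distance calculations used by clustering
customer_segm_scale <- scale(customer_segm)
head(customer_segm_scale)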
5.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using the dendrogram and briefly describe them.
Since clustering is unsupervised learning, we compute a distance matrix and plot the dendrogram; from it we can see that 3 clusters would be optimal.
The dendrogram gives us a clear picture: looking at the merge heights from hclust, the last merges show significant drops in height, and beyond three clusters there are no comparable drops. So we can consider three clusters optimal.
The clusplot visualisation also shows that the first two components explain 88.93% of the point variability, so we can conclude that three clusters would be the best fit.
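A minimal sketch of this step, assuming Euclidean distance and Ward linkage (the report does not state which linkage was used):

# Distance matrix on the scaled data
dist_mat <- dist(customer_segm_scale, method = "euclidean")

# Hierarchical clustering and dendrogram
hc <- hclust(dist_mat, method = "ward.D2")
plot(hc)   # look for the last large drop in merge height

# Cut into the three clusters identified above
clusters_hc <- cutree(hc, k = 3)
table(clusters_hc)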
5.4 Apply K-Means clustering on scaled data and determine optimum clusters.
K-means clustering yields 3 clusters of sizes 72, 71 and 67. Various graphical checks support this: the cluster plot, the elbow (WSS) method, the silhouette method, and the gap-statistic (bootstrapping) method. Every graphical method shows that three clusters are optimal; a sketch of these steps follows the note on the Hubert index below.
The Hubert index is a graphical method of determining the number of clusters. In the plot of the Hubert index, we seek a significant knee that corresponds to a significant increase in the value of the measure, i.e. the significant peak in the Hubert index second-differences plot.
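A hedged sketch of the K-means fit and the diagnostics named above, assuming the factoextra and NbClust packages; the seed value is illustrative.

library(factoextra)   # fviz_nbclust / fviz_cluster
library(NbClust)      # Hubert index among other indices

set.seed(123)   # illustrative seed
km <- kmeans(customer_segm_scale, centers = 3, nstart = 25)
km$size         # cluster sizes, reported above as 72, 71 and 67

fviz_cluster(km, data = customer_segm_scale)                      # cluster plot
fviz_nbclust(customer_segm_scale, kmeans, method = "wss")         # elbow / WSS
fviz_nbclust(customer_segm_scale, kmeans, method = "silhouette")  # silhouette
fviz_nbclust(customer_segm_scale, kmeans, method = "gap_stat")    # gap statistic

# NbClust prints the Hubert index plot discussed above
NbClust(customer_segm_scale, min.nc = 2, max.nc = 10, method = "kmeans")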
5.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies
for different clusters.
In hierarchical clustering, each group of clusters shows a profile on the variables that differs from the first group, and similarly for the second and third groups.
Running the silhouette function, we can observe each cluster's size and average silhouette width, and that the clusters do not overlap. We can also observe that cluster 1's closest neighbour is cluster 2, and cluster 2's closest neighbour is cluster 3.
Using hierarchical clustering, we can say that the cluster 1 group of people spends more, usually makes larger advance payments, and has a higher probability of full payment compared with the cluster 3 group.
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. The K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster. For this problem statement, just as with hierarchical clustering, the group 1 people spend more money and make larger advance payments compared with the other clusters.
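To obtain the profiles described above, one common approach (a sketch; the variable names follow the str() output in Section 5.1) is to average each variable within each cluster and inspect the silhouette summary:

library(cluster)

# Mean of every variable within each k-means cluster
aggregate(customer_segm, by = list(cluster = km$cluster), FUN = mean)

# Per-cluster sizes, average silhouette widths and nearest neighbours
sil <- silhouette(km$cluster, dist(customer_segm_scale))
summary(sil)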
6 Insights from Problem Two
6.1 Read the dataset. Do the descriptive statistics and a null-value condition check; write an inference.
The summary reveals that the data has 10 columns and shows their means, medians and quartiles, along with all the other necessary outputs.
So far, we do not have any null values in this data set.
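A minimal sketch of the checks behind these two observations; the file name insurance_part2_data.csv is an assumption, as the report does not show it.

# Read the insurance data and inspect descriptive statistics
insurance <- read.csv("insurance_part2_data.csv")   # assumed file name
summary(insurance)                                  # mean, median, quartiles per column

# Null-value condition check: all zeros confirms no missing values
colSums(is.na(insurance))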
6.2 Data Split: Split the data into test and train; build classification models CART, Random Forest and Artificial Neural Network
The data set is successfully split in an 80:20 ratio, giving a training set of 2,400 observations and a test set of 600 observations.
We can observe an almost identical percentage of the Claimed status in both data sets. In the full data, total Claimed records: 924 (30.80%); total Not Claimed records: 2,076 (69.20%).
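A hedged sketch of the 80:20 split using base R sample(); the seed, and the use of base R rather than a package such as caret or caTools, are assumptions.

set.seed(123)                                       # illustrative seed
insurance$Claimed <- as.factor(insurance$Claimed)   # target used by all three models

idx   <- sample(seq_len(nrow(insurance)), size = 0.8 * nrow(insurance))
train <- insurance[idx, ]    # 2,400 observations
test  <- insurance[-idx, ]   # 600 observations

prop.table(table(train$Claimed))   # roughly the same 69/31 split as the full data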
Model Building – CART
Decision trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent) variable, here Claimed, based on the values of several input (or independent) variables.
In classification trees the target variable is categorical, and the tree is used to identify the "class" within which the target variable would most likely fall. In regression trees the target variable is continuous, and the tree is used to predict its value.
The arguments passed to rpart.control are checked against the list of valid arguments when creating a decision-tree model. The visual plot represents the fitted decision tree.
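A sketch of the CART fit, assuming the rpart and rpart.plot packages; the control values shown are illustrative, not the exact ones used in the report.

library(rpart)
library(rpart.plot)

# Grow a full classification tree first; cp = 0 lets it grow fully so
# that pruning can be applied afterwards (minsplit is illustrative)
cart_ctrl  <- rpart.control(minsplit = 100, cp = 0)
cart_model <- rpart(Claimed ~ ., data = train, method = "class", control = cart_ctrl)

rpart.plot(cart_model)   # the visual plot of the decision tree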
From the cp table we can see the various nsplit and xerror values. After the 7th split there is a significant increasing trend in xerror, from 0.71448, rising to 0.72936 by the 10th split.
Applying a post-pruning technique, we can cut the tree back, since xerror was increasing after the 7th split.
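A sketch of the post-pruning step; printcp() prints the nsplit/xerror table discussed above, and prune() cuts the tree at the chosen complexity parameter.

printcp(cart_model)   # nsplit and xerror per split

# Pick the cp with the lowest cross-validated error and prune there
best_cp     <- cart_model$cptable[which.min(cart_model$cptable[, "xerror"]), "CP"]
pruned_cart <- prune(cart_model, cp = best_cp)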
Model Building – Random Forest
A supervised classification algorithm which, as the name suggests, creates a forest with a number of trees grown in random order. In general, the more trees in the forest, the more robust it looks; likewise, in the random forest classifier, a higher number of trees tends to give higher accuracy.
Some advantages of using Random Forest are as follows:
The same random forest algorithm, or random forest classifier, can be used for both classification and regression tasks.
The random forest classifier can handle missing values.
With more trees in the forest, the random forest classifier is less likely to overfit the model.
The random forest classifier can also model categorical values.
The model is built with Claimed as the dependent variable, considering all independent variables.
In random forests, the number of variables available for splitting at each tree node is referred to as the mtry parameter. The optimum value is obtained using the tuneRF function; the optimum mtry here is 9.
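A sketch of the Random Forest build, assuming the randomForest package; ntree and the tuneRF tuning settings are illustrative, while mtry = 9 comes from the report.

library(randomForest)

set.seed(123)   # illustrative seed

# tuneRF searches over mtry for the lowest out-of-bag error
tuned <- tuneRF(x = train[, names(train) != "Claimed"],
                y = train$Claimed,
                stepFactor = 1.5, improve = 0.01,
                trace = TRUE, plot = TRUE)

# Final forest with the optimum mtry of 9 reported above
rf_model <- randomForest(Claimed ~ ., data = train,
                         ntree = 501, mtry = 9, importance = TRUE)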
6.3 Performance Metrics: Check the performance of predictions on train and test sets using the confusion matrix
Confusion Matrix – CART
Confusion Matrix – Random Forest
The Random Forest model shows different accuracy across the train and test data: the training data gives 90% accuracy, but the test data gives 77%. The model performs noticeably better on the training data.
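A sketch of how those train/test accuracies can be read off, assuming the caret package for confusionMatrix():

library(caret)

# Predicted classes on train and test, then confusion matrices
rf_pred_train <- predict(rf_model, newdata = train)
rf_pred_test  <- predict(rf_model, newdata = test)

confusionMatrix(rf_pred_train, train$Claimed)   # ~90% accuracy reported above
confusionMatrix(rf_pred_test,  test$Claimed)    # ~77% accuracy reported above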
Confusion Matrix – Artificial Neural Network
In the Artificial Neural Network we can observe a similar kind of trend: the test data has 77% accuracy and the train data has 81%.
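The report does not show which neural-network package was used; the sketch below assumes nnet, with illustrative size/decay values (in practice the predictors would be scaled first).

library(nnet)

# Single-hidden-layer network on the same train/test split
set.seed(123)   # illustrative seed
nn_model <- nnet(Claimed ~ ., data = train, size = 10, decay = 0.01, maxit = 500)

nn_pred <- predict(nn_model, newdata = test, type = "class")
mean(nn_pred == test$Claimed)   # test accuracy (~77% reported above)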
6.4 Final Model: Compare all the models and write an inference on which model is best/optimized
The CART method has given poor performance compared with Random Forest and ANN; looking at the percentage deviation between the training and testing datasets, the model appears to be overfit. The Random Forest method has the best performance (best accuracy) among the three models, and its percentage deviation between training and testing data is reasonably under control, suggesting a robust model. The Neural Network gives the second-best performance: below Random Forest, but better than CART. Its percentage deviation between training and testing data is, however, the smallest of the three models.
6.5 Inference: Based on these predictions, what are the business insights and recommendations?
The main objective of the project was to develop a predictive model for an insurance firm providing tour insurance that is facing higher claim frequency: using machine learning tools, to predict which customers are likely to claim and which customers will respond positively to a promotion or an offer. The models were also compared on AUC, the area under the ROC curve.
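Since the inference above mentions AUC, here is a hedged sketch of an ROC/AUC check for the recommended Random Forest model, assuming the pROC package:

library(pROC)

# Class-1 probabilities from the random forest, then ROC / AUC
rf_prob <- predict(rf_model, newdata = test, type = "prob")[, 2]
roc_rf  <- roc(response = test$Claimed, predictor = rf_prob)

auc(roc_rf)    # area under the ROC curve
plot(roc_rf)   # ROC curve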