Data Mining Project: Cluster Analysis and Dimensionality Reduction in R Using Bank Marketing Data Set
Authors:
Kinga Włodarczyk
Kingsley Ikani
2.2 PAM
While K-means uses centroids (artificial points in the data set), the PAM method uses medoids, which are actual points in the data set. There is an extension of PAM called CLARA (Clustering Large Applications), which applies the PAM algorithm to a small sample of the data rather than to the entire data set.
The PAM algorithm proceeds as follows (a minimal R sketch follows the list):
• Choose k random objects as an initial set of medoids.
• Assign each data point to the closest medoid, based on Euclidean distance or another dissimilarity measure.
• Try to improve the quality of the clustering by exchanging selected objects (medoids) with unselected objects.
• Repeat steps 2 and 3 until the average distance of objects from their medoids is minimised.
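The sketch below shows how PAM can be run in R with the cluster package; the synthetic data and the choice of k are purely illustrative, not values from the report.

```r
# PAM on a small synthetic two-dimensional data set (illustrative only).
library(cluster)

set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))

pam_fit <- pam(X, k = 2)      # the k medoids are chosen among the actual points
pam_fit$medoids               # the selected medoids
table(pam_fit$clustering)     # cluster sizes
```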
2.3 CLARA
CLARA (Clustering Large Applications) relies on a sampling approach to handle large data sets. Instead of finding medoids for the entire data set, CLARA draws a small sample from the data set and applies the PAM algorithm to generate an optimal set of medoids for that sample.
The algorithm is as follows (a short R sketch follows the list):
• Randomly create, from the original data set, multiple subsets of fixed size (the sample size).
• Run the PAM algorithm on each subset and choose the corresponding k representative objects (medoids). Assign each observation of the entire data set to the closest medoid.
• Calculate the mean (or the sum) of the dissimilarities of the observations to their closest medoid. This is used as a measure of the goodness of the clustering.
• Retain the sub-dataset for which the mean (or sum) is minimal. Further analysis is carried out on the final partition.
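A minimal sketch of CLARA in R with cluster::clara(); the data set, the number of samples and the sample size are illustrative assumptions.

```r
# CLARA on an illustrative "large" data set.
library(cluster)

set.seed(2)
X <- matrix(rnorm(2000), ncol = 2)

clara_fit <- clara(X, k = 3, samples = 20, sampsize = 100)
clara_fit$medoids              # medoids of the best sample
table(clara_fit$clustering)    # assignment of every observation
```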
2.5.1 Silhouette
Silhouette gives a visualization of how well each object lies within its cluster and is calculated using the formula

S(i) = (b(i) − a(i)) / max(a(i), b(i)),

where a(i) is the average distance from point i to all other data points in its own cluster and b(i) is the minimum of the average distances from i to the points of the other clusters. The measure has a range of [−1, 1].
Generally, silhouette values close to 1 are considered best, as they indicate that the sample lies far away from the neighboring clusters; such high positive values indicate a good clustering.
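A short sketch of computing the silhouette in R: silhouette() comes from the cluster package and fviz_silhouette() from factoextra; the k-means call and the data are only illustrative.

```r
library(cluster)
library(factoextra)

set.seed(3)
X <- matrix(rnorm(400), ncol = 2)            # illustrative data

km  <- kmeans(X, centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(X))       # S(i) for every observation
mean(sil[, "sil_width"])                     # average silhouette width
fviz_silhouette(sil)                         # silhouette plot
```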
2.5.2 Average Within-Cluster Distance to Centroid
Another method commonly used to compare different numbers of clusters K is the average distance between the data points and their cluster centroid. As the number of clusters increases, this statistic decreases, so our main goal is to find the point where the rate of decrease changes sharply, i.e. the elbow point.
Note that there are many more metrics for choosing the optimal number of clusters. In R, the package NbClust provides 30 indices for determining the optimal number of clusters.
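The elbow statistic and the NbClust indices can be obtained as sketched below, here on the matrix X from the previous sketch; the cluster range and the method argument are illustrative assumptions.

```r
library(factoextra)
library(NbClust)

fviz_nbclust(X, kmeans, method = "wss")       # elbow plot of the total within sum of squares
nb <- NbClust(X, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "kmeans")
nb$Best.nc                                    # number of clusters suggested by each index
```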
3 DATA
The data come from the UCI Machine Learning Repository and are called the Bank Marketing Data Set. The data are related to direct marketing campaigns of a Portuguese banking institution; the campaigns were based on phone calls. In this part of our report we decided to analyze a smaller data set, named 'bank.csv', which contains 4521 observations and the same 17 variables as before. The data describe the whole campaign as well as individual clients. For the client data we include the following features: 'age', 'marital', 'education', 'default', 'balance', 'housing', 'loan'. In the cluster analysis we limit ourselves to these variables; among them there are two continuous features, two categorical features, and three binary features.
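A sketch of loading the data and selecting the client features; the UCI file is semicolon-separated, and the file path is an assumption.

```r
bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)
dim(bank)        # 4521 observations, 17 variables

client <- bank[, c("age", "marital", "education", "default",
                   "balance", "housing", "loan")]
str(client)      # two numeric, two categorical and three binary features
```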
4 DATA PREPARATION
Due to the length of the calculations, we decided to reduce the number of observations. For this purpose, we rejected all observations for which the contact with the client was unknown, and we kept data for only one month, May. In the next step, we plotted histograms of age and balance to check the distributions of our data, and we saw that age has outliers. Picture 1 shows the initial distributions of age and balance, and picture 2 shows the distributions after removing outliers.
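The filtering can be done as sketched below; the 1.5 * IQR rule for removing outliers is an assumption, since the exact rule used in the report is not stated.

```r
# Keep only known contacts and the month of May.
bank_may <- subset(bank, contact != "unknown" & month == "may")

# Illustrative outlier removal with a 1.5 * IQR rule.
iqr_keep <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  x >= q[1] - 1.5 * IQR(x) & x <= q[2] + 1.5 * IQR(x)
}
bank_may <- bank_may[iqr_keep(bank_may$age) & iqr_keep(bank_may$balance), ]
```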
Figure 2: Density of age and balance after removing outliers.
In our analysis we consider two approaches for each clustering method: the first is based only on age and balance, and the second uses all previously selected features. Before clustering, we need to scale the values of age and balance so that they have a similar range. Values before scaling are shown in the scatter plot in 3a, and values after scaling in plot 3b. In both, age is on the x-axis and balance on the y-axis.
Figure 3: Scatter plot of age and balance, before and after scaling.
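The scaling step can be done with the base R scale() function; bank_may refers to the filtered data frame from the preparation sketch above.

```r
# Standardise age and balance to zero mean and unit variance.
X <- scale(bank_may[, c("age", "balance")])
apply(X, 2, range)    # both columns now span a comparable range
```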
5 ANALYSIS OF SELECTED METHODS FOR CLUSTERING
5.1 K-means
5.1.1 Choosing optimal number of clusters
In the beginning, we decided to check the optimal number of clusters with the built-in function fviz_nbclust from the package factoextra. We chose two methods: the average silhouette and the 'wss' method, which is the within-cluster sum of squares statistic. In picture 4 we present the obtained plot of the average silhouette for different numbers of clusters. Observing the graph, one can clearly conclude that the optimal number of clusters for the K-means method is 3. We will therefore cluster with this number of clusters; however, to examine the behavior for other numbers of clusters, we also decided to perform clustering with 4 and 5 groups.
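The two criteria can be produced with fviz_nbclust as sketched below; the exact calls are assumptions consistent with the description above, with X the scaled age/balance matrix.

```r
library(factoextra)

fviz_nbclust(X, kmeans, method = "silhouette")   # average silhouette (picture 4)
fviz_nbclust(X, kmeans, method = "wss")          # total within sum of squares (figure 5)
```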
In figure 5, analyzing the chart from right to left, we can see that when the number of clusters (K) drops from 2 to 1 there is a large increase in the sum of squares, bigger than any previous increase. That means that passing from 2 clusters to 1 brings a clear loss in clustering compactness (by compactness we mean the similarity within a cluster). Our goal is not to achieve 100% similarity, for which we would have to take each observation as its own group/cluster; our main purpose is to find a fair number of clusters that satisfactorily explains a considerable part of the data. The lower the total within sum of squares, the more compact the grouping, but our goal is to find the location of a bend (knee) in the plot, which is generally considered an indicator of the appropriate number of clusters. In fig. 5 we locate this bend at 3 clusters, which is consistent with the result obtained in the average silhouette analysis.
Figure 5: Values of total within sum of squares for different numbers of clusters.
5.1.2 Clustering
As stated in the previous section, we use the k-means clustering method for k = 3, 4, 5. The resulting partitions are shown in plots 6, 7 and 8; here we cluster only the scaled values of age and balance.
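A sketch of the corresponding k-means calls; the seed and the number of random starts are assumptions.

```r
set.seed(123)
km_fits <- lapply(c(3, 4, 5), function(k) kmeans(X, centers = k, nstart = 25))

library(factoextra)
fviz_cluster(km_fits[[1]], data = X)   # partition for k = 3
```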
Figure 7: Results of clustering using k-means method for k = 3.
After clustering our data, we made silhouette plots for all cases, presented in pictures 9, 10 and 11. Comparing these plots, one can conclude that figure 9 fits the data best, since the higher the average silhouette width, the better the fit, and clusters above the average score are considered a good choice. As we increased the number of clusters to 4 and 5, the average silhouette score decreased to around 0.37. Moreover, the thickness of the silhouette plot started showing wide fluctuations; the thickness gives an indication of how big each cluster is. Also, the data points for the clustering with 3 groups appear to be well matched to their own clusters. A good number of clusters would have an average silhouette score well above 0.5, with all of the clusters above that average score; unfortunately, ours are below an average silhouette score of 0.5.
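The silhouette plots can be produced directly from the k-means fits; the loop below is a sketch using the objects from the previous code block.

```r
library(cluster)
library(factoextra)

for (fit in km_fits) {
  print(fviz_silhouette(silhouette(fit$cluster, dist(X))))
}
```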
Figure 11: Silhouette plot for k-means clustering with k = 5.
5.2 PAM
5.2.1 Choosing optimal number of clusters
We now conduct the same analysis for the PAM method. The average silhouette and the within-cluster sum of squares statistics are visualised below.
Figure 12: Values of average silhouette width for different number of clusters for PAM
method.
The analysis of the average silhouette and the within-cluster sum of squares statistics leads to the conclusion that the optimal number of clusters for PAM is 3, similarly to K-means.
Figure 13: Values of total within sum of squares for different numbers of clusters for the PAM method.
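The two criteria for PAM can be computed in the same way, passing cluster::pam as the clustering function; the calls are assumptions consistent with the figures.

```r
library(cluster)
library(factoextra)

fviz_nbclust(X, pam, method = "silhouette")   # figure 12
fviz_nbclust(X, pam, method = "wss")          # figure 13
```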
5.2.2 Clustering
Now we present the results of PAM clustering for 3, 4 and 5 clusters. The results are shown in plots 14, 15 and 16; they are quite similar to the results obtained with k-means clustering.
Figure 15: Results of clustering by PAM method with 4 clusters.
As in the k-means case, we print silhouette plots for each of the numbers of clusters above; the results are shown in plots 17, 18 and 19. As stated before, the best clustering according to the average silhouette is obtained for 3 clusters (pictures 14 and 17).
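A sketch of the PAM fits and their silhouette plots, analogous to the k-means sketch above.

```r
library(cluster)
library(factoextra)

pam_fits <- lapply(c(3, 4, 5), function(k) pam(X, k = k))
fviz_cluster(pam_fits[[1]])      # partition for k = 3
fviz_silhouette(pam_fits[[1]])   # silhouette plot for k = 3
```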
Figure 17: Silhouette plot for PAM method with 3 clusters.
Figure 19: Silhouette plot for PAM method with 5 clusters.
Figure 20: Scatter plot of age and balance obtained by PAM clustering of all selected features with 3 clusters.
Figure 21: Clustering by PAM with 3 clusters, presented on the values of the dissimilarity matrix.
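Clustering the full set of client features mixes numeric, categorical and binary variables. One way to do this, sketched below, is a Gower dissimilarity matrix from cluster::daisy passed directly to pam; the report does not state which dissimilarity it used, so this is only an assumption.

```r
library(cluster)

client_may <- bank_may[, c("age", "marital", "education", "default",
                           "balance", "housing", "loan")]
diss    <- daisy(client_may, metric = "gower")   # mixed-type dissimilarities
pam_all <- pam(diss, k = 3)
table(pam_all$clustering)
```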
Figure 22: Optimal Number of Clusters of the data distribution
5.3 CLARA
5.3.1 Choosing the optimal number of clusters
To estimate the optimal number of clusters in the data, it is also possible to use the average silhouette method. The R function fviz_nbclust() [factoextra package] provides a convenient way to do this. We again obtain that the best clustering for 'age' and 'balance' uses 3 clusters, so we show these results in plot 26 and the silhouette plot in figure 27.
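A sketch of the corresponding CLARA calls; the number of samples is an illustrative assumption.

```r
library(cluster)
library(factoextra)

fviz_nbclust(X, clara, method = "silhouette")   # choice of the number of clusters
clara_fit <- clara(X, k = 3, samples = 50)
fviz_cluster(clara_fit)                         # plot 26
fviz_silhouette(clara_fit)                      # figure 27
```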
5.3.2 More features
We performed CLARA clustering just for 3 clusters, and later we decided to run the clustering on more features: 'age', 'job', 'marital', 'education' and 'balance'. Again we prepared an average silhouette plot for different numbers of clusters, which can be seen in figure 28. As we can see, the optimal number of clusters is now 10.
5.3.3 Clustering
In figure 30 we see age and balance clustered into 10 groups, and the picture looks rather messy. But when we look at the same clustering presented in terms of dissimilarities, where more of the input features are taken into account, the clustering seems to make sense. These results are shown in figure 31.
The next picture shows the silhouette plot with 10 clusters for CLARA. The data points appear to be well matched to their clusters, and the sizes of the clusters, both horizontally and vertically, are expressed as percentages. The average silhouette width is 0.39. This plot appears to be quite good, as a fair number of clusters are well above the average silhouette and the fluctuations are not wide, so this number of clusters is a good pick for our data.
5.4 AGNES
5.4.1 Choosing the optimal number of clusters
The last clustering method is the hierarchical method called AGNES (Agglomerative Nesting). As before, we first plot the average silhouette for just two features, 'age' and 'balance'; the plots are shown below. Looking at plots 33 and 34, one can clearly see that the optimal number of clusters K is 3, and at this point we have a good fit of the clustering. We also see that as the number of clusters decreases from 3 to 1, the total within sum of squares increases sharply, which marks the elbow at which the optimal number of clusters is obtained.
Figure 33: Average silhouette plot for AGNES clustering for 20 different numbers of
clusters.
Figure 34: Total within sum of squares plot for AGNES clustering for 20 different numbers of clusters.
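The same two criteria can be obtained for hierarchical clustering by passing factoextra::hcut as the clustering function; note that hcut's default agglomeration settings are only a stand-in for the AGNES call used in the report.

```r
library(factoextra)

fviz_nbclust(X, hcut, method = "silhouette")   # plot 33
fviz_nbclust(X, hcut, method = "wss")          # plot 34
```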
5.4.2 Clustering
At the beginning we present the clustering for just the features 'age' and 'balance'. In picture 35 we have the dendrogram for all samples in our data set; we use the complete-linkage method. Picture 35 shows how our sample data are decomposed into several levels of nested partitioning (called a cluster tree). We can see that the clustering pattern for the complete data creates compact clusters of clusters. As is known for hierarchical clustering, different distance measures, e.g. Euclidean distance and correlation distance, produce very different dendrograms. In order to identify sub-groups (i.e., clusters), we can cut the dendrogram with cutree(); the height of the cut controls the number of clusters obtained. AGNES does not tell us how many clusters there are, or where to cut the dendrogram to form clusters; the dendrogram itself covers all of our data samples, and a sketch of this workflow is shown below.
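A sketch of AGNES with complete linkage, the dendrogram and a cut into 3 clusters; cutree() operates on the hclust representation of the agnes object.

```r
library(cluster)

ag <- agnes(X, method = "complete")       # agglomerative nesting, complete linkage
plot(as.hclust(ag), labels = FALSE)       # dendrogram (picture 35)
groups <- cutree(as.hclust(ag), k = 3)    # cut the tree into 3 clusters
table(groups)
```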
Plot 36 shows the silhouette plot for three clusters. The average silhouette width is 0.29, which is not the best result: the higher the average silhouette, the better the clustering, and 0.29 is below the value of about 0.5 generally considered to indicate a good clustering and fit of the data. Later we made a dendrogram for a smaller number of samples because it is more readable - plot 37. Plot 37 shows the AGNES dendrogram for 50 of our samples. The height of the cut controls the number of clusters obtained, and the plot lets us see how the analysed data samples are grouped in tree form.
Finally, we perform hierarchical clustering for more features: 'age', 'job', 'marital', 'education', 'balance' - plot 38. Plot 39 shows the silhouette plot for AGNES clustering with 3 clusters on this larger feature set; the clusters above the average silhouette width of 0.29 are considered to fit the data better.
Figure 37: Dendrogram for AGNES clustering for 50 samples.
Feature selection techniques could help reduce the risk of overfitting; however, using feature extraction techniques instead leads to other advantages, such as overfitting risk reduction, a speed-up in training, improved data visualization, accuracy improvement and an increase in the explainability of the model. For this reason we use feature extraction techniques (specifically the PCA method) in this project to reduce our data to dimensions that help improve the accuracy and the visualization of our data analysis.
Figure 39: Silhouette plot for AGNES clustering with 3 clusters for more features.
7 Feature Extraction
Feature extraction aims at deriving a reduced number of new features from the existing ones and then discarding the original features. This new, reduced set of features summarizes most of the information contained in the original set of features.
"The greatest value of a picture is when it forces us to notice what we never expected
to see." The last quote is relevant and reinforces the necessity of data visualization. So
after the feature extraction of our data analysis, visualising the result is the key to really
understand what is happening and creates room for easy reading of our analysed data.
We are looking for the number of components that explains more than 0.8 of the data variability. We read this value off the cumulative variance plot and see that the optimal number of components is 5.
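A sketch of the PCA and of reading the cumulative explained variance; Z stands for the numeric matrix built from the selected features (with categorical variables encoded numerically), which is an assumption since the exact encoding is not stated in the report.

```r
pca <- prcomp(Z, center = TRUE, scale. = TRUE)

cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_var, type = "b",
     xlab = "Number of components", ylab = "Cumulative explained variance")
which(cum_var > 0.8)[1]    # smallest number of components explaining > 0.8
```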
Figure 42: Cumulative variance of all components for PCA.
Next we repeat the analysis on the subset with only age and balance, so that we are able to compare the results of clustering. In this subset of the data we have just 2 numerical components, so we expect to be able to reduce them to one. As in the last section, we start by computing the importance of the components; the results are shown in picture 43. Our data after PCA look as presented in picture 44. To assess the clustering results we use the R function clValid, and the results are presented in figure 45.
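A sketch of the validation step with the clValid package; the list of methods and the range of cluster numbers are assumptions, and X is the scaled age/balance matrix from the earlier sketches.

```r
library(clValid)

pca2   <- prcomp(X)                       # PCA of the scaled age/balance matrix
scores <- pca2$x[, 1, drop = FALSE]       # keep only the first component
cv <- clValid(scores, nClust = 2:6,
              clMethods = c("kmeans", "pam", "clara", "agnes"),
              validation = "internal")
summary(cv)
```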
In most cases, the best clustering result is obtained with 3 clusters. Looking at the average silhouette values, we see that they are mostly about 0.4-0.5. We obtained the same values when clustering the data in the previous section, so the dimensionality reduction did not change our results much.
11 Summarizing
For the best option, our clustering mostly gives an average silhouette of about 0.4 to 0.5, which is not such a bad result. We also performed clustering with more features, including the categorical ones. This made our clustering different but also more interesting: we know that age and balance can be correlated, but they can also be correlated with marital status or job. So the results obtained using more features are also satisfying.
Figure 44: Data after PCA.