


Wrocław University of Science and Technology
Faculty of Pure and Applied Mathematics
Applied Mathematics

Data Mining Project: Cluster Analysis


and Dimensionality Reduction in R using
Bank Marketing Data Set

Lecture: Prof. Adam Zagdanski, PhD


Lab: Prof. Adam Zagdanski, PhD

Authors:
Kinga Włodarczyk
Kingsley Ikani

March 11, 2020


1 Introduction
In the first part of this report we deal with cluster analysis and with the quality assessment of the clustering results; in the second part we deal with feature extraction and the visualization of multidimensional (reduced) data, and we explain how the selected methods are connected to classification and clustering. Next, we explain the methods selected for our cluster analysis and their algorithm steps. It is appropriate to begin by defining clustering and cluster analysis.
Clustering
Clustering is the process of grouping abstract objects into classes of similar objects. A cluster of data objects can be treated as one group.
Cluster Analysis
In cluster analysis, we partition the set of collected data into groups based on data similarity and then assign labels to the groups.

2 DEFINITIONS OF THE SELECTED METHODS FOR CLUSTER ANALYSIS
2.1 K-means
K-means divides data into groups called 'clusters' by finding K centroids. Centroids are introduced or existing points that represent the centers of the clusters. The main aim of K-means is to identify the K centroids and assign every object to the nearest cluster.
The K-means algorithm consists of the following steps (a minimal R sketch is given after the list):
• Specify the number of clusters K.
• Initialise the centroids by first shuffling the dataset and then randomly selecting K data points as centroids, without replacement.
• Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters stops changing:
• Compute the sum of the squared distances between the data points and all centroids.
• Assign each data point to the closest cluster (centroid).
• Recompute the centroid of each cluster as the average of all data points that belong to it.
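A minimal R sketch of this procedure, using the built-in kmeans() function rather than a hand-rolled loop; the input object name clients_scaled and the value of nstart are illustrative assumptions, not taken from the report.

# Assumed input: clients_scaled, a data frame or matrix of scaled numeric
# features (illustrative name used throughout these sketches).
set.seed(123)                              # reproducible random initialisation
km <- kmeans(clients_scaled, centers = 3,  # K = 3 clusters
             nstart = 25)                  # 25 random starts, keep the best solution

km$centers                                 # coordinates of the 3 centroids
km$cluster                                 # cluster assignment of each observation
km$tot.withinss                            # total within-cluster sum of squares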

2.2 PAM
While K-means uses centroids (artificial points in the data space), the PAM method uses medoids, which are actual points of the dataset. There is an extension of the PAM method called CLARA (Clustering Large Applications); CLARA applies the PAM algorithm to a small sample of the data rather than to the entire data set.
The PAM algorithm consists of the following steps (an R sketch is given after the list):
• Choose K random objects as an initial set of medoids.
• Assign each data point to the closest medoid, based on the Euclidean distance or another dissimilarity measure.
• Try to improve the quality of the clustering by exchanging selected objects with unselected objects.
• Repeat steps 2 and 3 until the average distance of objects from their medoids is minimised.
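In R, PAM is available as pam() in the cluster package; a minimal sketch, again assuming the illustrative scaled data frame clients_scaled:

library(cluster)

# Assumed input: clients_scaled, a scaled numeric data frame (illustrative name).
pam_res <- pam(clients_scaled, k = 3)  # PAM with 3 medoids, Euclidean distance by default

pam_res$medoids                        # the actual observations chosen as medoids
pam_res$clustering                     # cluster membership of each observation
pam_res$silinfo$avg.width              # average silhouette width of the partition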

2.3 CLARA
CLARA (Clustering Large Applications) relies on a sampling approach to handle large data sets. Instead of finding medoids for the whole data set, CLARA draws a small sample from the data and applies the PAM algorithm to it in order to generate an optimal set of medoids for the sample.
The algorithm is as follows (an R sketch is given after the list):
• Randomly create, from the original dataset, multiple subsets of fixed size (the sample size).
• Run the PAM algorithm on each subset and choose the corresponding k representative objects (medoids). Assign each observation of the entire data set to the closest medoid.
• Calculate the mean (or the sum) of the dissimilarities of the observations to their closest medoid; this is used as a measure of the goodness of the clustering.
• Retain the sub-dataset for which the mean (or sum) is minimal. Further analysis is carried out on the resulting final partition.
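A hedged R sketch using clara() from the cluster package; the values of samples and sampsize are illustrative defaults, not taken from the report.

library(cluster)

# Assumed input: clients_scaled (illustrative name).
clara_res <- clara(clients_scaled, k = 3,
                   samples = 50,     # number of random subsets drawn
                   sampsize = 100)   # size of each subset

clara_res$medoids                    # medoids of the best subset
clara_res$clustering                 # assignment of all observations
clara_res$silinfo$avg.width          # average silhouette width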

2.4 Agglomerative Nesting


This method is also known as AGNES; it is a bottom-up approach that constructs a tree of clusters (nodes). The criteria used in this method for merging clusters are the minimum distance, maximum distance, average distance and centre distance.
The steps of this method are (an R sketch follows the list):
• Initially, every object is its own cluster, i.e. a leaf.
• The algorithm recursively merges the nodes (clusters) that have the maximum similarity between them.
• At the end of the process all the objects belong to the same cluster, known as the root of the tree structure.
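A minimal R sketch using agnes() from the cluster package; the object name is illustrative, and complete linkage is chosen here because it matches the linkage used later in the report.

library(cluster)

# Assumed input: clients_scaled (illustrative name).
ag <- agnes(clients_scaled, method = "complete")

ag$ac                           # agglomerative coefficient (closer to 1 = stronger structure)
plot(ag, which.plots = 2,       # which.plots = 2 draws the dendrogram
     main = "AGNES dendrogram (complete linkage)")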

2.5 CHOOSING THE OPTIMAL NUMBER OF CLUSTERS K
There are many statistics useful for choosing the optimal K, but in general there is no method that determines its exact value. The algorithms of the selected methods described above divide the data into clusters for a chosen number K of clusters.

2.5.1 Silhouette
The silhouette gives a visualization of how well each object lies within its cluster and is calculated using the formula

S(i) = (b(i) − a(i)) / max(a(i), b(i)),

where a(i) is the average distance from object i to all other data points in its cluster and b(i) is the minimum, over the other clusters, of the average distance from i to their points. The measure has a range of [−1, 1]. Positive silhouette values are generally considered good, as they indicate that the sample lies far from the neighbouring clusters, and such values are deemed to indicate a good clustering. An R sketch of this computation is given below.
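As an illustration, the average silhouette width of a given partition can be computed with the cluster package; km and clients_scaled are the illustrative objects from the earlier sketches.

library(cluster)

# Assumed inputs: km (a kmeans result) and clients_scaled, both illustrative
# names carried over from the previous sketches.
d   <- dist(clients_scaled)        # Euclidean distance matrix
sil <- silhouette(km$cluster, d)   # per-observation silhouette values S(i)

summary(sil)$avg.width             # average silhouette width of the partition
plot(sil, border = NA)             # standard silhouette plot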

2.5.2 Average Within-Cluster Distance to Centroid
Another method commonly used to compare different numbers of clusters K is the average distance between the data points and their cluster centroid. As the number of clusters increases, this statistic decreases, so our main goal here is to find the point where the rate of decrease changes sharply, i.e. the elbow point.
Note that there are many more metrics for choosing the optimal number of clusters. In R, the package NbClust provides 30 indices for determining the optimal number of clusters; a hedged sketch of its use is shown below.
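A hedged sketch of calling NbClust; the data object name and the evaluated range of K are illustrative assumptions.

library(NbClust)

# Assumed input: clients_scaled (illustrative name). index = "all" evaluates the
# full battery of indices and can be slow on larger data.
nb <- NbClust(clients_scaled, distance = "euclidean",
              min.nc = 2, max.nc = 10,  # range of cluster counts to evaluate
              method = "kmeans", index = "all")

nb$Best.nc                              # best number of clusters suggested by each index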

3 DATA
The data used is from the UCI Machine Learning Repository and is called the Bank Marketing Data Set. The data is related to direct marketing campaigns of a Portuguese banking institution; the marketing campaigns were based on phone calls. In this part of our report we decided to analyse a smaller data set: we chose the file named 'bank.csv', which is smaller than the full dataset. It contains 4521 observations and, as before, 17 variables. The data describe the whole campaign as well as the individual clients. The client data include the following features: 'age', 'marital', 'education', 'default', 'balance', 'housing', 'loan'. In the cluster analysis we limit ourselves to these variables; among them there are two continuous features, two categorical features, and three binary features. A minimal loading sketch is given below.
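A minimal sketch of loading the data and selecting the client features; the semicolon separator follows the UCI distribution of bank.csv, and the object names are illustrative.

# Load the data and keep only the client-level features.
bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)

client_vars <- c("age", "marital", "education", "default",
                 "balance", "housing", "loan")
clients <- bank[, client_vars]

str(clients)   # 4521 observations of the 7 selected client features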

4 DATA PREPARATION
Due to the length of the calculations, we decided to reduce the number of observations. For this purpose, we rejected all observations in which the contact with the client was unknown, and we kept the data for only one month, May. In the next step we plotted histograms of age and balance to check how these variables are distributed; we see that age has outliers. Picture 1 shows the initial distributions of age and balance, and picture 2 shows the distributions after removing the outliers. A hedged sketch of this preparation step is given below.
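A hedged sketch of this preparation step; the 1.5 × IQR rule used for the outliers is an illustrative choice, since the report does not state which outlier rule was applied.

# Keep observations with a known contact type collected in May, then drop
# outliers in age and balance using a 1.5 * IQR rule (illustrative choice).
prep <- subset(bank, contact != "unknown" & month == "may")

iqr_keep <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  x >= q[1] - 1.5 * IQR(x) & x <= q[2] + 1.5 * IQR(x)
}
prep <- prep[iqr_keep(prep$age) & iqr_keep(prep$balance), ]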

Figure 1: Density of age and balance before removing outliers.

Figure 2: Density of age and balance after removing outliers.

In our analysis we consider two approaches to each clustering method: the first is based just on age and balance, and the second uses all the previously selected variables. Before this, we need to scale the values of age and balance so that they have a similar range (a minimal sketch of this step is given below). The values before scaling are shown in the scatter plot 3a, and after scaling in plot 3b; in both, age is on the x-axis and balance on the y-axis.
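A minimal sketch of the scaling step, continuing from the illustrative prep object above.

# Centre age and balance and scale them to unit variance.
clients_scaled <- scale(prep[, c("age", "balance")])

plot(prep$age, prep$balance,                               # before scaling (cf. plot 3a)
     xlab = "age", ylab = "balance")
plot(clients_scaled[, "age"], clients_scaled[, "balance"], # after scaling (cf. plot 3b)
     xlab = "age (scaled)", ylab = "balance (scaled)")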

(a) Before scaling (b) After scaling

Figure 3: Scatter plot of age and balance, before and after scaling.

5 ANALYSIS OF SELECTED METHODS FOR CLUSTERING
5.1 K-means
5.1.1 Choosing optimal number of clusters
In the beginning, we decided to check the optimal number of clusters with the built-in function fviz_nbclust from the package factoextra. We chose two methods: the average silhouette, and the 'wss' method, which is the within-cluster sums of squares statistic. Picture 4 presents the obtained plot of the average silhouette for different numbers of clusters; a hedged sketch of the call is given after figure 4. Observing the graph, one can clearly conclude that the optimal number of clusters for the K-means method is 3. We will therefore cluster for that number of clusters; however, to examine the behaviour for other numbers of clusters, we also decided to perform clustering into 4 and 5 groups.

Figure 4: Values of average silhouette width for different number of clusters.
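A hedged sketch of how these two diagnostics can be produced with factoextra; the data object name is the illustrative one used in the earlier sketches.

library(factoextra)

# Assumed input: clients_scaled (illustrative name).
fviz_nbclust(clients_scaled, kmeans, method = "silhouette", k.max = 10) # average silhouette (cf. figure 4)
fviz_nbclust(clients_scaled, kmeans, method = "wss", k.max = 10)        # within-cluster sum of squares (cf. figure 5)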

In figure 5, analysing the chart from right to left, we can see that when the number of clusters (K) is reduced from 2 to 1 there is a large increase in the sum of squares, bigger than any previous increase. That means that when passing from 2 clusters to 1 there is a strong loss of clustering compactness (by compactness we mean the similarity within a cluster). If our goal were 100% compactness we would take each observation as its own group/cluster, but that is not the aim; our main purpose is to find a fair number of clusters that explains a considerable part of the data satisfactorily. The lower the within-cluster sum of squares, the more compact the grouping, but our goal is to find the location of a bend (knee) in the plot, which is generally considered an indicator of the appropriate number of clusters. In fig. 5 we locate this bend at 3 clusters, which is consistent with the result obtained in the average silhouette analysis.

Figure 5: Values of total within sum of squares for different number of cluster.

5.1.2 Clustering
As we said in the previous section, we decided to use the K-means clustering method for k = 3, 4, 5. Our data divided into clusters can be seen in plots 6, 7 and 8; here we cluster just the scaled values of age and balance. A hedged sketch of the clustering and its visualization is given after figure 8.

Figure 6: Results of clustering using k-means method for k = 3.

Figure 7: Results of clustering using k-means method for k = 4.

Figure 8: Results of clustering using k-means method for k = 5.
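A hedged sketch of producing such a cluster plot with factoextra (k = 3 shown; k = 4 and k = 5 are analogous); object names are illustrative.

library(factoextra)

# Assumed input: clients_scaled (illustrative name).
set.seed(123)
km3 <- kmeans(clients_scaled, centers = 3, nstart = 25)

fviz_cluster(km3, data = clients_scaled,   # scatter plot coloured by cluster
             geom = "point", ellipse.type = "convex",
             main = "K-means clustering, k = 3")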

After clustering our data we decided to make silhouette plots for all of the cases, which are presented in pictures 9, 10 and 11. Comparing these plots, one can conclude that figure 9 fits the data best, since the higher the average silhouette width, the better the fit, and all clusters above the average score are considered a good choice. As we increased the number of clusters to 4 and 5, the average silhouette score decreased to around 0.37. Moreover, the thickness of the silhouette plot started showing wide fluctuations; the thickness gives an indication of how big each cluster is. Also, the data points in the clustering with 3 groups appear to be well matched to their own clusters. A good clustering would have an average silhouette score well above 0.5, with all of the clusters above that average, but unfortunately ours are below a 0.5 average silhouette score.

Figure 9: Silhouette plot for k-means clustering with k = 3.

Figure 10: Silhouette plot for k-means clustering with k = 4.

Figure 11: Silhouette plot for k-means clustering with k = 5.

5.2 PAM
5.2.1 Choosing optimal number of clusters
We now conduct the same analysis for the PAM method; the average silhouette and the within-cluster sums of squares statistics are visualised in figures 12 and 13.

Figure 12: Values of average silhouette width for different number of clusters for PAM
method.

The analysis of the average silhouette and of the within-cluster sums of squares statistics leads to the conclusion that the optimal number of clusters for PAM is 3; similarly to K-means, the chosen number of clusters for the PAM method is therefore also 3.

Figure 13: Values of total within sum of squares for different numbers of clusters for the PAM method.

5.2.2 Clustering
We now present the results of PAM clustering for 3, 4 and 5 clusters. The results can be seen in plots 14, 15 and 16; they are quite similar to the results obtained by K-means clustering.

Figure 14: Results of clustering by PAM method with 3 clusters.

Figure 15: Results of clustering by PAM method with 4 clusters.

Figure 16: Results of clustering by PAM method with 5 clusters.

As in the K-means case, we print silhouette plots for each of the numbers of clusters above; the results can be seen in plots 17, 18 and 19. As stated before, the best clustering according to the average silhouette is obtained for 3 clusters (pictures 14 and 17).

Figure 17: Silhouette plot for PAM method with 3 clusters.

Figure 18: Silhouette plot for PAM method with 4 clusters.

5.2.3 More features


Now we take all the client data and compute the dissimilarity matrix of their values. Using this matrix as input to PAM, we obtain the results in figure 20, which presents the clustering in the original data space; next, figure 21 shows the clustering on the dissimilarity values. A hedged sketch of building such a mixed-type dissimilarity matrix is given below.
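A hedged sketch of building a mixed-type dissimilarity matrix and feeding it to PAM; Gower's distance is an illustrative choice for the mixed numeric, categorical and binary client features, since the report does not state which dissimilarity was used.

library(cluster)

# Assumed input: clients, the data frame of the 7 client features (illustrative
# name from the earlier loading sketch).
d_mixed <- daisy(clients, metric = "gower")    # dissimilarity object for mixed data types

pam_mixed <- pam(d_mixed, k = 3, diss = TRUE)  # PAM run directly on the dissimilarities
pam_mixed$clustering                           # cluster membership of each observation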
We see that the dissimilarity values are grouped into small groups, so we decided to check how many clusters we would need to obtain the best clustering, and we discovered it was 32 clusters, which looks somewhat irrational. In figure 22 we plot the average silhouette for 1 to 50 clusters, and we see that the optimal number of clusters K is 32. We therefore plotted the clustering for 32 clusters; the result is shown in figure 23.

Figure 19: Silhouette plot for PAM method with 5 clusters.

Figure 20: Results in scatter plot of age and balance obtained by PAM clustering of all
selected features data with 3 clusters.

Figure 21: Clustering by PAM with 3 clusters presented on values from dissimilarity
matrix.

Figure 22: Optimal Number of Clusters of the data distribution

Figure 23: Clustering by PAM for large data distribution

5.3 CLARA
5.3.1 Choosing the optimal number of clusters
To estimate the optimal number of clusters in the data, it is also possible to use the average silhouette method; the R function fviz_nbclust() [factoextra package] provides a solution that facilitates this step.

Figure 24: Optimal number of Silhouette clusters for CLARA

Figure 25: Optimal number of Clusters for CLARA

We again obtain that the best clustering for 'age' and 'balance' uses 3 clusters, so we print these results in plot 26 and the corresponding silhouette plot in figure 27.

Figure 26: Clustering of the selected features' data for CLARA

Figure 27: Silhouette clustering plot for CLARA

5.3.2 More features
We performed CLARA clustering just for 3 clusters, and later we decided to cluster with more features, namely 'age', 'job', 'marital', 'education' and 'balance'. Again we prepared the average silhouette plot for different numbers of clusters, which can be seen in figure 28. As we can see, the optimal number of clusters is now 10.

Figure 28: Optimal Silhouette number of clusters for CLARA

Figure 29: Optimal number of clusters for CLARA

5.3.3 Clustering
In figure 30 we see age and balance clustered into 10 groups, which looks rather messy. But when we look at the same clustering presented on the dissimilarity values, where more of the inputs are taken into account, the clustering seems to make sense. These results are shown in figure 31.

Figure 30: Real data clustered by Clara for more features.

Figure 31: Dissimilarities clustered by Clara for more features.

Picture 32 shows the silhouette plot with 10 clusters for CLARA. The data points appear to be well matched to their clusters, and the dimensions of the clustering, both horizontal and vertical, are measured as percentages.

Figure 32: Silhouette plot for Clara for more features.

The plot shows the silhouette for CLARA; the average silhouette is 0.39. This plot appears to be quite good, as a fair number of clusters are well above the average silhouette and the fluctuations are not wide, so this number of clusters is a good pick for our data.

5.4 AGNES
5.4.1 Choosing the optimal number of clusters
The last clustering method is the hierarchical method called AGNES (Agglomerative Nesting). As before, at the beginning we plot the average silhouette for just the two features 'age' and 'balance'; the plots are shown below. Looking at plots 33 and 34, one can clearly see that the optimal number of clusters K is 3, and at this point we have a good fit of the clustering. We also see that as the number of clusters decreases from 3 to 1, the total within sum of squares increases sharply, which supports choosing the optimal number of clusters at that point.

Figure 33: Average silhouette plot for AGNES clustering for 20 different numbers of
clusters.

Figure 34: Average silhouette plot for AGNES clustering for 20 different numbers of
clusters.

5.4.2 Clustering
At the beginning we present clustering just for the features 'age' and 'balance'. In picture 35 we have the dendrogram for all samples in our data set; we use the complete-linkage method.

Figure 35: Dendrogram for AGNES clustering on all samples.

Picture 35 shows how our sample data are decomposed into several levels of nested partitioning (called a cluster tree). We can see that the clustering pattern for the complete data creates compact clusters of clusters. As is known from the study of distances, the Euclidean distance and the correlation distance can produce very different dendrograms. In order to identify sub-groups (i.e. clusters), we can cut the dendrogram with cutree(); the height of the cut controls the number of clusters obtained. AGNES does not tell us how many clusters there are, or where to cut the dendrogram to form them, but the dendrogram itself represents the clustering hierarchy over our whole data sample. A minimal sketch of the cutting step is given below.
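A minimal sketch of the cutting step, continuing from the illustrative agnes object ag introduced earlier; the cut height of 5 is purely illustrative.

# Assumed input: ag, the agnes result from the earlier sketch (illustrative name).
hc <- as.hclust(ag)             # convert to an hclust object for cutree()

groups_k <- cutree(hc, k = 3)   # cut so that exactly 3 clusters are obtained
groups_h <- cutree(hc, h = 5)   # or cut at a chosen height of the dendrogram
table(groups_k)                 # sizes of the resulting clusters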

Figure 36: Silhouette plot for AGNES clustering with 3 clusters.

Figure 36 shows the silhouette plot for three clusters. The average silhouette width is 0.29, which does not appear to be the best result, since the higher the silhouette average, the better the clustering of the data; an average of 0.29 is below the generally accepted value of 0.5 considered to indicate a good clustering and fit. Later we made a graph for a smaller number of samples because it is more readable - plot 37.
Plot 37 shows the AGNES dendrogram for 50 of our sample data points. The height of the cut to the dendrogram controls the number of clusters obtained, and the plot helps to identify how the analysed data samples are grouped in the tree.
Finally, we perform hierarchical clustering for more features: 'age', 'job', 'marital', 'education', 'balance' - plot 38.
Figure 39 shows the silhouette plot for AGNES clustering with 3 clusters for these additional features; the clusters above the average silhouette width of 0.29 are considered better for data fitting.

Figure 37: Dendrogram for AGNES clustering for 50 samples.

Figure 38: Dendrogram for AGNES clustering for more features.

6 DIMENSIONALITY REDUCTION OF DATASETS (PART II)
It is becoming quite common to work with datasets of hundreds or even thousands of features. If the number of features becomes similar to, or even bigger than, the number of observations stored in a dataset, this can lead to a machine learning model suffering from overfitting. To avoid this kind of problem, it is necessary to apply either regularization or dimensionality reduction techniques (the latter also called feature extraction). Regularization techniques can help reduce the risk of overfitting; however, using feature extraction techniques instead brings additional advantages such as a reduced risk of overfitting, a speed-up in training, improved data visualization, accuracy improvement, and increased explainability of the model. For this reason we use feature extraction (specifically the PCA method) in this project to reduce our data to dimensions that help improve the accuracy and visualization of our data analysis.

Figure 39: Silhouette plot for AGNES clustering with 3 clusters for more features.

7 Feature Extraction
Feature extraction aims at deriving a reduced number of new features from the existing ones and then discarding the original features. The new, reduced set of features summarizes most of the information contained in the original set of features.

7.1 Selected Method of Data Dimensionality Reduction


7.1.1 Principal Component Analysis (PCA)
PCA is one of the most widely used linear dimensionality reduction techniques. It works by taking the original data as input and trying to find a combination of the input features which best summarizes the original data distribution, so as to reduce its original dimensionality. It does this by maximising variance and minimizing the reconstruction error, looking at pairwise distances. It is an unsupervised learning algorithm; as such, it does not take the data labels into account but only the variation, though in some cases this may lead to misclassification of data.

8 Visualization of Multidimensional data


Data visualization is a powerful tool and has been widely adopted by organizations owing to its effectiveness in abstracting out the right information and in understanding and interpreting results clearly and easily. However, dealing with multi-dimensional datasets, typically with more than two attributes, starts causing problems, since our medium of data analysis and communication is typically restricted to two dimensions. In this project we explore some effective strategies for visualizing data in multiple dimensions (ranging from 1-D up to 6-D). It is said that "A picture is worth a thousand words"; that popular quote serves as a motivation and an inspiration for understanding and leveraging data visualization as an effective tool in our analysis. John Tukey also said that "The greatest value of a picture is when it forces us to notice what we never expected to see." This quote is relevant and reinforces the necessity of data visualization. So, after the feature extraction step of our analysis, visualising the result is key to really understanding what is happening and makes it easy to read the analysed data.

9 Data Dimensionality Reduction in R


In our data we have 7 numeric features, and we selected them for dimensionality reduction. We use the prcomp function in R to obtain a summary of the components; a hedged sketch of the call is given below. The importance of the components can be seen in figure 40. In addition, we print barplots of the variance and of the cumulative variance of all components of the PCA, shown in pictures 41 and 42 respectively.
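A hedged sketch of the PCA step; bank_numeric is an illustrative name for the data frame of the 7 numeric features, which the report does not list explicitly.

# Assumed input: bank_numeric, a data frame of the 7 numeric features (illustrative name).
pca <- prcomp(bank_numeric, center = TRUE, scale. = TRUE)

summary(pca)                                  # importance of components (cf. figure 40)
var_explained <- pca$sdev^2 / sum(pca$sdev^2) # proportion of variance per component
barplot(var_explained)                        # variance of each component (cf. figure 41)
barplot(cumsum(var_explained))                # cumulative variance (cf. figure 42)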

Figure 40: Importance of components

Figure 41: Variance of all components for PCA.

We are looking for the number of components that explains more than 0.8 of the data variability; reading this value off the cumulative variance plot, we see that the optimal number of components is 5.

Figure 42: Cumulative variance of all components for PCA.

10 Data Dimensionality Reduction for clustering

Because in the clustering part we limited ourselves to the client data only, here we also focus on this subset. On it we perform dimensionality reduction in order to be able to compare the clustering results. In this subset of data we have just 2 numerical components, so we expect to be able to reduce them to one. As in the last section, we start by computing the importance of the components; the results are in picture 43.

Figure 43: Importance of components.

Our data after PCA look as presented in picture 44. To see the clustering results we use the R function clValid, and we present the results in figure 45; a hedged sketch of the call is given below. In the clustering, the best results are mostly obtained with 3 clusters. Looking at the values of the average silhouette, we see that they are mostly about 0.4-0.5. The same values were obtained when clustering the data in the previous section, so the dimensionality reduction did not change our results very much.
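A hedged sketch of the clValid call; the data object name, the set of clustering methods and the validation type are illustrative assumptions, as the report only states that clValid was used.

library(clValid)

# Assumed input: clients_pca, the PCA-reduced client data as a numeric matrix
# (illustrative name).
cv <- clValid(clients_pca, nClust = 2:5,
              clMethods = c("kmeans", "pam", "clara", "agnes"),
              validation = "internal")   # connectivity, Dunn index and silhouette

summary(cv)                              # validation measures per method and number of clusters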

11 Summarizing
For the best option, our clustering mostly gives an average silhouette of about 0.4 to 0.5, which is not such a bad clustering. We also performed clustering with more features, where we included categorical features; this made our clustering different but also more interesting. We know that age and balance can be correlated, but they can also be correlated with marital status or job, so the results obtained using more features are also satisfying.

Figure 44: Data after PCA.

Figure 45: Results of clustering after PCA.

