Dynamic Customer Segmentation Using Unsupervised Machine Learning in Python
In the modern era everything and everyone is innovative, and everyone competes to be
better than others. The emergence of many entrepreneurs, competitors, and
business-minded people has created insecurity and tension among competing businesses,
which must find new customers and hold on to old ones. Because of this, exceptional
customer service becomes essential, irrespective of the scale of the business. It is
equally important to understand the needs of specific customers, in order to provide
better support and to advertise the most appropriate products to them. In the pool of
online products, customers are confused about what to buy and what not to buy, and
businesses are confused about which section of customers to target when selling a
particular type of product. This confusion can be resolved by the process called
CUSTOMER SEGMENTATION: grouping customers with similar interests and similar
shopping behavior into the same segment, and customers with different interests and
shopping patterns into different segments. Customer segmentation and pattern
extraction are major aspects of a business decision support system; each segment
contains customers who most probably share the same kind of interests and shopping
patterns. In this paper, we perform this customer segmentation using three different
clustering algorithms, namely the K-means, Mini-Batch K-means, and hierarchical
clustering algorithms, and compare these algorithms on their efficiency and
clustering-quality scores.
CHAPTER 1
INTRODUCTION
1.1 OVERVIEW
Data is very precious in today's ever-competitive world. Every day, organizations and
people encounter large amounts of data. An efficient way to handle this data is to
classify or categorize it into clusters, groups, or partitions. Classification methods
are either supervised or unsupervised, depending on whether or not they use labeled
datasets. Unsupervised classification is exploratory data analysis: there is no
training set, and hidden patterns must be extracted from data with no labeled
responses. A supervised learning model, by contrast, performs the machine learning
task of deducing a function from a labeled training set. The main focus of clustering
is to enhance the propinquity, or closeness, of data points belonging to the same
group and to increase the variance among different groups, all through some measure
of similarity. Exploratory data analysis spans a wide range of applications, such as
engineering, text mining, pattern recognition, bioinformatics, spatial data analysis,
voice mining, textual document collection, artificial intelligence, and image
segmentation. This diversity explains the importance of clustering in scientific
research, but it can also lead to contradictions due to differing purposes and
nomenclature.
Customer segmentation is important because it enables tailoring marketing programs to
each segment, supports business decisions, identifies the products associated with
each customer segment, helps manage the demand and supply of those products, predicts
customer defection, identifies and targets the potential customer base, and gives
direction in finding solutions. Clustering is an iterative process of knowledge
discovery from huge amounts of raw, unorganized data. It is a kind of exploratory
data mining used in several applications, such as classification, machine learning,
and pattern recognition.
Machine learning (ML) is the study of computer algorithms that can improve
automatically through experience and by the use of data. It is seen as a part
of artificial intelligence. Machine learning algorithms build a model based on sample
data, known as training data, to make predictions or decisions without being
explicitly programmed to do so. Machine learning algorithms are used in a wide
variety of applications, such as in medicine, email filtering, speech recognition,
and computer vision, where it is difficult or unfeasible to develop conventional
algorithms to perform the needed tasks.
Data mining is a related field of study, focusing on exploratory data analysis
through unsupervised learning. Some implementations of machine learning use data and
neural networks in a way that mimics the working of a biological brain. In its
application across business problems, machine learning is also referred to as
predictive analytics.
Machine learning is a subfield of artificial intelligence (AI). The goal of machine
learning is typically to understand the structure of data and fit that data into
models that can be understood and used by people. Although machine learning is a
field within computer science, it differs from traditional computational approaches.
For example, a machine learning algorithm for classifying data may use computer
vision of moles coupled with supervised learning to train it to classify cancerous
moles, whereas a machine learning algorithm for stock trading may inform the trader
of future potential predictions.
In machine learning, tasks are typically classified into broad categories. These
categories are based on how learning is received or how feedback on the learning is
given to the developed system. Two of the most widely adopted machine learning
methods are supervised learning, which trains algorithms on example input and output
data labeled by humans, and unsupervised learning, which provides the algorithm with
no labeled data, allowing it to find structure within its input data.
Machine learning approaches are traditionally divided into three broad categories,
depending on the nature of the "signal" or "feedback" available to the learning
system:
Supervised learning: The computer is presented with example inputs and their
desired outputs, given by a "teacher", and the goal is to learn a general rule that
maps inputs to outputs.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on
its own to find structure in its input.

Reinforcement learning: A computer program interacts with a dynamic environment in
which it must perform a certain goal. As it navigates its problem space, the program
is provided feedback analogous to rewards, which it tries to maximize.
1.3.2 Unsupervised Learning
Unsupervised learning is commonly used for transactional data. You may have a large
dataset of customers and their purchases, but as a person you will probably not be
able to work out what similar attributes can be drawn from customer profiles and
their types of purchases. With this data fed into an unsupervised learning algorithm,
it may be determined that women of a certain age range who buy unscented soaps are
likely to be pregnant, and so a marketing campaign related to pregnancy and baby
products can be targeted at this audience.
Figure 1.2: Machine Learning Tasks
1.4 CLUSTERING
Clustering is the task of dividing data points into distinct groups such that the
data points in the same group have similar characteristics or similar behavior; in
short, it segregates the data points into clusters based on their similar traits. The
type of algorithm used decides how the clusters are created. The inferences drawn
from the datasets also depend on the user, as there is no universal criterion for
good clustering.
Clustering itself can be categorized into two types, viz. hard clustering and soft
clustering. In hard clustering, one data point can belong to one cluster only. In
soft clustering, the output is a probability or likelihood of a data point belonging
to each of a pre-defined number of clusters.
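To make the distinction concrete, the sketch below contrasts hard assignments from
K-means with soft, probabilistic assignments from a Gaussian mixture model; the
mixture model is used here only as a common example of soft clustering, and the toy
data is illustrative rather than drawn from this project.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # toy 2-D data

hard_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)          # one cluster per point
soft_probs = GaussianMixture(n_components=3).fit(X).predict_proba(X)  # one probability per cluster

print(hard_labels[:3])                 # e.g. [1 0 2]
print(soft_probs[:3].round(2))         # each row sums to 1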
The task of clustering is subjective, which means there are many ways of achieving
the goal of clustering. Each methodology has its own set of rules for segregating
data points into different clusters. There are numerous clustering algorithms; among
the most used are the K-means clustering algorithm, hierarchical clustering
algorithms, and the Mini-Batch K-means clustering algorithm.
1.4.2 Density-based

In this method, clusters are created based upon the density of the data points
represented in the data space. Regions that become dense due to a huge number of data
points residing in them are considered clusters. The data points in sparse regions
(regions where the data points are very few) are considered noise or outliers. The
clusters created by these methods can be of arbitrary shape.
1.4.3 Hierarchical

Agglomerative clustering starts with each point in its own cluster and repeatedly
merges the closest clusters. Divisive clustering is the opposite of agglomerative: it
starts with all the points in one cluster and divides them to create more clusters.
These algorithms create a distance matrix of all the existing clusters and perform
linkage between the clusters depending on the linkage criterion. The clustering of
the data points is represented using a dendrogram.
1.4.4 Centroid-based

Centroid-based clustering is the one you probably hear about most. It is a little
sensitive to its initial parameters, but it is fast and efficient. These algorithms
separate data points based on multiple centroids in the data; each data point is
assigned to a cluster based on its squared distance from the centroid. This is the
most commonly used type of clustering.
K-means clustering is one of the most widely used algorithms. It partitions the data
points into k clusters based on the distance metric used for the clustering; the
value of k is defined by the user. The distance is calculated between the data points
and the centroids of the clusters, and each data point is assigned to the cluster
whose centroid is closest. After an iteration, the centroids of the clusters are
computed again, and the process continues until a pre-defined number of iterations is
completed or the centroids of the clusters no longer change between iterations.
Discovery of clusters with arbitrary shape − Some clustering algorithms
determine clusters depending on Euclidean or Manhattan distance
measures. Algorithms based on such distance measures tend to discover
spherical clusters with the same size and density. However, a cluster can be
of any shape. It is essential to develop algorithms that can identify clusters of
arbitrary shapes.
Ability to deal with noisy data − Some real-world databases include outliers
or missing, unknown, or erroneous records. Some clustering algorithms are
sensitive to such data and may lead to clusters of poor quality.
CHAPTER 2
LITERATURE SURVEY
Kai Peng (Member, IEEE), Victor C. M. Leung (Fellow, IEEE), and Qinghai Huang in [3]
examine the mini-batch K-means clustering algorithm in detail, covering its
advantages, disadvantages, and implementation.

Fionn Murtagh and Pedro Contreras in [4] studied hierarchical clustering algorithms.
Their paper shows how the clusters are formed, discusses the advantages and
disadvantages of the approach, and compares it with other clustering algorithms.

Manju Kaushik and Bhawana Mathur in [6] examine two clustering algorithms, K-means
and hierarchical clustering, in detail, perform customer segmentation using both,
compare the results, and decide which of the two is better suited to customer
segmentation.

Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Fairuz Amalina in [7] studied two
clustering algorithms, K-means and Mini-Batch K-means, in detail, performed customer
segmentation using both, compared the results, and determined the better of the two
for customer segmentation.

Asith Ishantha in [8] studied several clustering algorithms in detail, including
K-means, Mini-Batch K-means, and hierarchical clustering, performed customer
segmentation using each, compared the results, and determined the best clustering
algorithm among them.

Onur Dogan, Ejder Aycin, and Zeki Atil Bulut in [9] studied customer segmentation in
detail using the RFM model and several clustering algorithms.

Juni Nurma Sari, Ridi Ferdiana, Lukito Nugroho, and Paulus Insap Santosa in [10]
reviewed customer segmentation techniques.

Shi Na, Liu Xumin, and Guan Yong in [11] studied the K-means clustering algorithm in
detail and observed its pros and cons.

Francesco Musumeci, Cristina Rottondi, Avishek Nag, et al. in [12] give an overall
overview of applications of machine learning techniques and their implementation.

Şükrü Ozan et al. in [13] present a case study on customer segmentation using machine
learning methods.

Tushar Kansal, Suraj Bahuguna, Vishal Singh, and Tanupriya Choudhury in [14] studied
customer segmentation, primarily using the K-means clustering algorithm.

Ina Maryani, Dwiza Riana, Rachmawati Darma Astuti, Ahmad Ishaq, Sutrisno, and Eva
Argarini Pratama in [15] studied different clustering techniques.
CHAPTER 3
METHODOLOGY
The existing model for customer segmentation is based on the K-means clustering
algorithm, which comes under centroid-based clustering. A suitable k value for the
given dataset is selected, representing the predefined number of clusters. Raw,
unlabeled data is taken as input and divided into clusters until the best clusters
are found. The centroid-based algorithm used in this model is efficient but sensitive
to initial conditions and outliers.
In the proposed system, the customer segmentation model includes not only
centroid-based but also hierarchical clustering:

• The three clustering algorithms, K-means, Mini-Batch K-means, and hierarchical
clustering, have been selected from the literature survey.
• By deploying the three different algorithms, the clusters are formed and analyzed
respectively.
• The most effective and efficient algorithm is determined by comparing and
evaluating the precision rate among the three algorithms.
The emergence of many competitors and entrepreneurs has caused a lot of tension among
competing businesses to find new buyers and keep the old ones. As a result, the need
for exceptional customer service becomes appropriate regardless of the size of the
business. Furthermore, the ability of a business to understand the needs of each of
its customers allows it to provide greater customer support, deliver targeted
customer services, and develop customized customer service plans. This understanding
is made possible through structured customer segmentation.
3.4 SOFTWARE AND HARDWARE REQUIREMENTS
Python
Anaconda
Jupyter Notebook
RAM: 8GB
OS: Windows
3.4.3 Libraries:
Pandas Profiling: a free, open-source Python library used for data analysis. We used
it to generate a report of the dataset.
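A minimal usage sketch, assuming the pandas-profiling package is installed (newer
releases of the same library are published as ydata-profiling):

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("Mall_Customers.csv")
report = ProfileReport(df, title="Customer Data Report")  # the title is an illustrative choice
report.to_file("report.html")                             # writes a self-contained HTML report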
3.5. PROGRAMMING LANGUAGES
3.5.1 Python
Easy to learn and understand: the syntax of Python is simple, hence it is relatively
straightforward, even for beginners, to learn and understand the language.
3.5.2 Domain
3.6. SYSTEM ARCHITECTURE
A. Collect Data
This is the data preparation phase. Feature scaling helps bring all data items to a
standard range, which improves the performance of clustering algorithms [12].
Normalization techniques such as min-max scaling, decimal scaling, and Z-score
standardization are common strategies for scaling features before clustering; with
Z-score standardization, most scaled values fall between about -2 and +2. While
analyzing the dataset, you should also start collecting your data in the right shape
and format. It could be the same format as the reference dataset (if that fits your
purpose), or, if the difference is quite substantial, some other format.
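As a sketch of such scaling with scikit-learn (the toy feature matrix is
illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.random.rand(10, 3)                  # toy feature matrix
X_mm = MinMaxScaler().fit_transform(X)     # min-max: rescales each feature to [0, 1]
X_z = StandardScaler().fit_transform(X)    # Z-score: zero mean, unit variance per feature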
Data is usually divided into two types: structured and unstructured. The simplest
example of structured data is an .xls or .csv file where every column stands for an
attribute of the data. Unstructured data could be represented by a set of text files,
photos, or video files. Often, the business dictates how to organize the collection
and storage of data.
B. Explore Data
Data exploration refers to the initial step in data analysis, in which data analysts
use data visualization and statistical techniques to describe dataset
characteristics, such as size, quantity, and accuracy, to better understand the
nature of the data.
C. Select and Deploy Algorithms
Considering the knowledge gained from the literature survey, the three most used and
efficient algorithms are taken into account for clustering the customers: the K-means
clustering algorithm, the Mini-Batch K-means clustering algorithm, and the
hierarchical clustering algorithm. The three algorithms are deployed on the dataset
respectively.
D. Cluster Results
By deploying the three selected algorithms on the dataset, the customer data is
clustered. Further analysis of the clusters formed yields results for each of the
three deployed algorithms. Because clustering is unsupervised, no ground "truth" is
available to verify the results, and this absence complicates assessing quality.
E. Comparison and Determination of Precise Algorithm
Checking the quality of clustering is not a rigorous process because clustering lacks
ground "truth". With no target to aim for, it is not possible to calculate an
accuracy score; the aim instead is to create clusters with distinct, unique
characteristics. The two most common metrics for measuring the distinctness of
clusters are the Silhouette Coefficient and the Davies-Bouldin Index. Comparing the
metric scores produced by the three algorithms, the most precise algorithm is
determined.
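A minimal sketch of computing both metrics with scikit-learn; the data and cluster
count here are illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

X = np.random.rand(200, 3)                              # toy feature matrix
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)
print(silhouette_score(X, labels))                      # in [-1, 1]; higher is better
print(davies_bouldin_score(X, labels))                  # >= 0; lower is better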
3.7.1 K-Means Clustering

The algorithm takes the unlabelled dataset as input, divides the dataset into k
clusters, and repeats the process until the best clusters are found; the value of k
must be predetermined. It assigns each data point to its closest k-center, and the
data points near a particular k-center form a cluster.
The working of the K-means algorithm is explained in the steps below:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids (they need not come from the input
dataset).
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.

Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third and fourth steps, reassigning each data point to the new
closest centroid of each cluster, until the centroids stabilize.
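A minimal sketch of these steps using scikit-learn, which runs the assign-and-update
loop internally; the toy data and k = 5 are illustrative:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                  # unlabelled data points
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(X)              # steps 2-5 repeated until convergence
print(kmeans.cluster_centers_)              # final centroids of the 5 clusters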
3.7.2 Hierarchical Clustering

The distance between two clusters is crucial for hierarchical clustering. There are
various ways to calculate the distance between two clusters, and the chosen way
decides the rule for clustering. These measures are called linkage methods.
Single Linkage: the shortest distance between the closest points of the clusters.

Complete Linkage: the farthest distance between two points of two different clusters.
It is one of the popular linkage methods, as it forms tighter clusters than single
linkage.
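A short sketch of building a dendrogram under a chosen linkage method with SciPy; the
toy data and the single-linkage choice are illustrative:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.rand(30, 2)                   # toy data points
Z = linkage(X, method="single")             # try "complete" or "ward" as well
dendrogram(Z)                               # visualize the merge hierarchy
plt.ylabel("Distance")
plt.show()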
3.7.3 Mini-Batch K-Means Clustering

There is no doubt that K-means is one of the most popular clustering algorithms
because of its performance and low time cost, but as the size of the datasets being
analyzed grows, the computation time of K-means increases. To overcome this, a
different approach called the Mini-Batch K-means algorithm was introduced. Its main
idea is to divide the whole dataset into small, fixed-size batches of data: each
iteration takes a new random mini-batch from the dataset and updates the clusters,
and this is repeated until convergence.
The Mini-Batch K-means algorithm's main idea is to use small random batches of data
of a fixed size, so that they can be stored in memory. In each iteration, a new
random sample from the dataset is obtained and used to update the clusters, and this
is repeated until convergence. Each mini-batch updates the clusters using a convex
combination of the values of the prototypes and the data, applying a learning rate
that decreases with the number of iterations. This learning rate is the inverse of
the number of data points assigned to a cluster during the process. As the number of
iterations increases, the effect of new data is reduced, so convergence can be
detected when no changes in the clusters occur over several consecutive iterations.
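A minimal Mini-Batch K-means sketch with scikit-learn; the dataset size and batch
size are illustrative choices:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(10000, 2)                # toy "large" dataset
mbk = MiniBatchKMeans(n_clusters=5, batch_size=100, random_state=0)
labels = mbk.fit_predict(X)                 # each iteration updates centroids from one random mini-batch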
3.7.4 Elbow Method
Determining the optimal number of clusters for a given dataset is the most
fundamental step for any unsupervised algorithm. The Elbow method helps to determine
the best value of k: in a graph of the sum of squared distances between the data
points and their assigned cluster centroids, plotted against k, the k value is
selected at the point where the curve starts to flatten out, forming an "elbow".
Therefore, the optimal number of clusters is determined.

The Elbow method is one of the most popular ways to find the optimal number of
clusters. It uses the concept of the WCSS value, where WCSS stands for Within-Cluster
Sum of Squares and measures the total variation within the clusters. The formula to
calculate the value of WCSS (for 3 clusters) is given below:
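In standard notation, for three clusters C1, C2, C3 with centroids c1, c2, c3:

$$\mathrm{WCSS} = \sum_{P_i \in C_1} d(P_i, c_1)^2 + \sum_{P_i \in C_2} d(P_i, c_2)^2 + \sum_{P_i \in C_3} d(P_i, c_3)^2$$

In scikit-learn, WCSS is exposed as the fitted model's inertia_ attribute, so the
elbow curve can be sketched as follows (toy data):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
wcss = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), wcss, marker="o")    # the "elbow" suggests the best k
plt.xlabel("K value")
plt.ylabel("WCSS")
plt.show()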
3.8 MODULES
Train and test the model: We used three clustering algorithms, K-means, hierarchical
clustering, and Mini-Batch K-means, to train on the dataset. After training, we
tested each model and found its clusters, Silhouette score, and Davies-Bouldin score.

Deploy the models: The models were deployed to obtain the clusters formed. The
clusters show the different segments of customers based on many attributes, and the
Silhouette score and Davies-Bouldin score of each model are produced as output.
Following are the steps to do this project (use Jupyter Notebook):
E) Test the model and find the clusters, the Silhouette score, and the Davies-Bouldin
score.
G) Based on the scores, predict which algorithm is best for customer segmentation and
proceed with that clustering algorithm.
CHAPTER 4
Silhouette Coefficient:
The silhouette score is a measure of the average similarity of the objects within a
cluster and their distance to the other objects in the other clusters.
For each data point i, we first define a(i) as the mean distance between i and all
other points in its own cluster. Secondly, we define b(i) as the mean distance
between i and the points of the nearest neighbouring cluster. The silhouette value is
then

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
This score ranges between -1 and 1, where the clusters are more well-defined and
distinct with higher scores.
Davies-Bouldin Index:
$$D_{ij} = \frac{\bar{d}_i + \bar{d}_j}{d_{ij}}$$

is the "within-to-between cluster distance ratio" for the i-th and j-th clusters,
where $\bar{d}_i$ is the average distance between every data point in cluster i and
its centroid (similarly for $\bar{d}_j$), and $d_{ij}$ is the Euclidean distance
between the centroids of the two clusters. The index averages the worst-case ratio
over the N clusters:

$$DB = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} D_{ij}$$
Contrary to the Silhouette score, this index measures the similarity among clusters,
so the lower the score, the better the clusters that are formed.
CHAPTER 5
The clusters obtained by deploying the three different clustering algorithms on the
customers' data were compared using the metrics that measure the distinctness and
uniqueness of the clusters. It is observed that the K-means algorithm produces the
best clusters, obtaining the highest Silhouette score and the lowest Davies-Bouldin
score, followed by hierarchical clustering and Mini-Batch K-means.

It cannot be said that K-means is the most effective clustering algorithm every time;
this depends on various factors such as the size of the data, the attributes of the
data, and so on. This project can be further enhanced by including different
clustering algorithms that may prove more proficient, and by considering larger
datasets, which in turn increases the efficiency.
REFERENCES
[7] Ali Feizollah, Nor Badrul Anuar, Rosli Salleh, and Fairuz Amalina, "Comparative
Study of K-means and Mini Batch K-means Clustering Algorithms", Computer System and
Technology Department, University of Malaya, Kuala Lumpur, Malaysia, International
Journal of Software and Hardware Research in Engineering.

[9] Onur Dogan (Dokuz Eylul University), Ejder Aycin (Kocaeli University), and Zeki
Atil Bulut (Dokuz Eylul University), "Customer Segmentation by Using RFM Model and
Clustering Methods: A Case Study in Retail Industry", International Journal of
Contemporary Economics and Administrative Sciences, July 2018.

[10] Juni Nurma Sari, Ridi Ferdiana, Lukito Nugroho, and Paulus Insap Santosa,
"Review on Customer Segmentation Technique", Department of Electrical Engineering and
Information Technology, University of Gadjah Mada, Jogjakarta, Indonesia; Department
of Informatics Technology, Polytechnic Caltex Riau, Pekanbaru, Indonesia.

[12] Francesco Musumeci, Cristina Rottondi, Avishek Nag, Irene Macaluso, Darko Zibar,
Marco Ruffini, and Massimo Tornatore, "An Overview on Application of Machine Learning
Techniques".

[13] Şükrü Ozan, "A Case Study on Customer Segmentation by Using Machine Learning
Methods", 2018 International Conference on Artificial Intelligence and Data
Processing (IDAP), IEEE.

[15] Ina Maryani, Dwiza Riana, Rachmawati Darma Astuti, Ahmad Ishaq, Sutrisno, and
Eva Argarini Pratama, "Customer Segmentation Based on RFM Model and Clustering
Techniques with K-Means Algorithm", 2018 Third International Conference on
Informatics and Computing (ICIC), 2018, pp. 1-6.
APPENDICES
A. SOURCE CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv("Mall_Customers.csv")
df.head()
df.shape
df.describe()
df.dtypes
df.isnull().sum()
df.drop(["CustomerID"],axis=1, inplace=True)
df.head()
plt.figure(1, figsize=(15,6))
n = 0
for x in ['Age', 'No. of Purchases', 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1, 3, n)
    plt.subplots_adjust(hspace=0.5, wspace=0.5)
    sns.distplot(df[x], bins=20)
    plt.title('Distplot of {}'.format(x))
plt.show()
plt.figure(figsize=(15,5))
sns.countplot(y='Gender',data=df)
plt.show()
plt.figure(1, figsize=(15,7))
n = 0
for cols in ['Age', 'No. of Purchases', 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1, 3, n)
    sns.set(style="whitegrid")
    plt.subplots_adjust(hspace=0.5, wspace=0.5)
    sns.violinplot(x=cols, y='Gender', data=df)
    plt.ylabel('Gender' if n == 1 else '')
    plt.title('Violin Plot')
plt.show()
age_18_25 = df.Age[(df.Age >= 18) & (df.Age <= 25)]
age_26_35 = df.Age[(df.Age >= 26) & (df.Age <= 35)]
age_36_45 = df.Age[(df.Age >= 36) & (df.Age <= 45)]
age_46_55 = df.Age[(df.Age >= 46) & (df.Age <= 55)]
age_55above = df.Age[df.Age >= 56]
agex = ["18-25","26-35","36-45","46-55","55+"]
agey = [len(age_18_25.values), len(age_26_35.values), len(age_36_45.values), len(age_46_55.values), len(age_55above.values)]
plt.figure(figsize=(15,6))
# The age bar plot and the spending-score bin definitions were lost in the
# original; reconstructed below to match the surrounding code.
sns.barplot(x=agex, y=agey)
plt.title("Customer Age Groups")
plt.xlabel("Age")
plt.ylabel("Number of Customers")
plt.show()
ss = df["Spending Score (1-100)"]
ss_1_20 = ss[(ss >= 1) & (ss <= 20)]
ss_21_40 = ss[(ss >= 21) & (ss <= 40)]
ss_41_60 = ss[(ss >= 41) & (ss <= 60)]
ss_61_80 = ss[(ss >= 61) & (ss <= 80)]
ss_81_100 = ss[(ss >= 81) & (ss <= 100)]
ssx = ["1-20", "21-40", "41-60", "61-80", "81-100"]
ssy = [len(ss_1_20.values), len(ss_21_40.values), len(ss_41_60.values),
len(ss_61_80.values), len(ss_81_100.values)]
plt.figure(figsize=(15,6))
sns.barplot(x=ssx, y=ssy, palette="rocket")
plt.title("Spending Scores")
plt.xlabel("Score")
plt.ylabel("Number of Customer Having the Score")
plt.show()
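# The lines defining the income bins were lost in the original; the
# definitions below are reconstructed to match the aix labels that follow
# (the column name follows the rest of this script).
ai = df["No. of Purchases"]
ai0_30 = ai[(ai >= 0) & (ai <= 30)]
ai31_60 = ai[(ai >= 31) & (ai <= 60)]
ai61_90 = ai[(ai >= 61) & (ai <= 90)]
ai91_120 = ai[(ai >= 91) & (ai <= 120)]
ai121_150 = ai[(ai >= 121) & (ai <= 150)]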
aix = ["$ 0 - 30,000", "$ 30,001 - 60,000", "$ 60,001 - 90,000", "$ 90,001 - 120,000", "$ 120,001 - 150,000"]
aiy = [len(ai0_30.values), len(ai31_60.values), len(ai61_90.values), len(ai91_120.values),
len(ai121_150.values)]
plt.figure(figsize=(15,6))
sns.barplot(x=aix, y=aiy, palette="Spectral")
plt.title("No. of Purchases")
plt.xlabel("Income")
plt.ylabel("Number of Customer")
plt.show()
X1 = df.loc[:, ["Age", "Spending Score (1-100)"]].values
# The KMeans import and model construction were lost in the original;
# reconstructed here with n_clusters=5 to match the later sections.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=5)
label = kmeans.fit_predict(X1)
print(label)
print(kmeans.cluster_centers_)
plt.scatter(X1[:,0],X1[:,1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1], color='black')
plt.title('Clusters of Customers')
plt.xlabel('Age')
plt.ylabel('Spending Score(1-100)')
plt.show()
X2=df.loc[:, ["No. of Purchases","Spending Score (1-100)"]].values
# The elbow-method loop for X2 was lost in the original; reconstructed
# to mirror the identical X3 section below.
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init="k-means++")
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)
plt.figure(figsize=(12,6))
plt.grid()
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker = "8")
plt.xlabel("K Value")
plt.ylabel("WCSS")
plt.show()
kmeans = KMeans(n_clusters=5)
label = kmeans.fit_predict(X2)
print(label)
print(kmeans.cluster_centers_)
plt.scatter(X2[:,0], X2[:,1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black')
plt.title('Clusters of Customers')
plt.xlabel('No. of Purchases')
plt.ylabel('Spending Score(1-100)')
plt.show()
X3=df.iloc[:,1:]
wcss = []
for k in range(1,11):
kmeans = KMeans(n_clusters=k, init="k-means++")
kmeans.fit(X3)
wcss.append(kmeans.inertia_)
plt.figure(figsize=(12,6))
plt.grid()
plt.plot(range(1,11),wcss, linewidth=2, color="red", marker ="8")
plt.xlabel("K Value")
plt.ylabel("WCSS")
plt.show()
kmeans = KMeans(n_clusters = 5)
label = kmeans.fit_predict(X3)
print(label)
print(kmeans.cluster_centers_)
clusters = kmeans.fit_predict(X3)
df["label"] = clusters
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(df.Age[df.label == 0], df["No. of Purchases"][df.label == 0], df["Spending Score
(1-100)"][df.label == 0], c='blue',s=60)
ax.scatter(df.Age[df.label == 1], df["No. of Purchases"][df.label == 1], df["Spending Score
(1-100)"][df.label == 1], c='red',s=60)
ax.scatter(df.Age[df.label == 2], df["No. of Purchases"][df.label == 2], df["Spending Score
(1-100)"][df.label == 2], c='green',s=60)
ax.scatter(df.Age[df.label == 3], df["No. of Purchases"][df.label == 3], df["Spending Score
(1-100)"][df.label == 3], c='orange',s=60)
ax.scatter(df.Age[df.label == 4], df["No. of Purchases"][df.label == 4], df["Spending Score
(1-100)"][df.label == 4], c='purple',s=60)
ax.view_init(30, 185)
plt.xlabel("Age")
plt.ylabel("No. of Purchases")
ax.set_zlabel('Spending Score (1-100)')
plt.show()
from sklearn.cluster import MiniBatchKMeans
from sklearn import metrics
minikm = MiniBatchKMeans(n_clusters=5,init='random',batch_size=100000)
minikm_labels = minikm.fit_predict(X3)
print(minikm_labels)
mini_cluster = minikm.fit_predict(X3)
df["minikm_labels"] = mini_cluster
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')
# Color the 3-D scatter by the Mini-Batch K-means labels stored in
# df["minikm_labels"] (not the earlier K-means labels).
ax.scatter(df.Age[df.minikm_labels == 0], df["No. of Purchases"][df.minikm_labels == 0], df["Spending Score (1-100)"][df.minikm_labels == 0], c='blue', s=60)
ax.scatter(df.Age[df.minikm_labels == 1], df["No. of Purchases"][df.minikm_labels == 1], df["Spending Score (1-100)"][df.minikm_labels == 1], c='red', s=60)
ax.scatter(df.Age[df.minikm_labels == 2], df["No. of Purchases"][df.minikm_labels == 2], df["Spending Score (1-100)"][df.minikm_labels == 2], c='green', s=60)
ax.scatter(df.Age[df.minikm_labels == 3], df["No. of Purchases"][df.minikm_labels == 3], df["Spending Score (1-100)"][df.minikm_labels == 3], c='orange', s=60)
ax.scatter(df.Age[df.minikm_labels == 4], df["No. of Purchases"][df.minikm_labels == 4], df["Spending Score (1-100)"][df.minikm_labels == 4], c='purple', s=60)
ax.view_init(30, 185)
plt.xlabel("Age")
plt.ylabel("No. of Purchases")
ax.set_zlabel('Spending Score (1-100)')
plt.show()
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
agglo_clustering = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
agglo_clustering_labels = agglo_clustering.fit_predict(X3)
agglo_clusters = agglo_clustering.fit_predict(X3)
df["agglo_clustering_labels"] = agglo_clusters
Z = linkage(X3,method = 'ward')
dendro = dendrogram(Z)
plt.title('Dendrogram')
plt.ylabel('Euclidean distance')
plt.show()
algorithms = ["K-Means", "Hierarchical Clustering", "MiniBatch K-Means"]
# Scores are listed in the same order as `algorithms`: K-means labels,
# then agglomerative labels, then mini-batch labels.
# Silhouette Score
ss = [metrics.silhouette_score(X3, label), metrics.silhouette_score(X3, agglo_clustering_labels), metrics.silhouette_score(X3, minikm_labels)]
# Davies-Bouldin Score
db = [metrics.davies_bouldin_score(X3, label), metrics.davies_bouldin_score(X3, agglo_clustering_labels), metrics.davies_bouldin_score(X3, minikm_labels)]
comparison = {"ALGORITHMS": algorithms, "SILHOUETTE SCORE": ss, "DAVIES BOULDIN SCORE": db}
compdf = pd.DataFrame(comparison)
display(compdf.sort_values(by=["SILHOUETTE SCORE"], ascending=False))
B. SCREENSHOTS
B.1. Dataset
B.4. Violin plot for features
B.5. 3D plot for K-means clustering
C. PLAGIARISM REPORT
D. JOURNAL PAPER