Customer Categorization by Data Analysis Using Clustering Algorithms of Machine Learning
CLUSTERING TECHNIQUES:
*K-means Clustering:
K-means is the simplest clustering algorithm and is based on the partitioning principle. The
algorithm is sensitive to the initial positions of the centroids. The number of centroids, K,
is determined by the elbow method (discussed in a later section). After the K centroids are
initialized, each data point is assigned to its closest centroid in terms of Euclidean
distance, forming the clusters. Once the clusters are formed, the barycentres are recalculated
as the means of the clusters, and this process is repeated until there is no change in the
centroid positions.
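To make the assignment and update steps concrete, the following is a minimal NumPy sketch of a
single K-means iteration; the function and variable names are illustrative, not the
implementation used in this work.

    import numpy as np

    def kmeans_iteration(points, centroids):
        # Assignment step: label each point with the index of its closest
        # centroid in terms of Euclidean distance.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Update step: recompute each centroid (barycentre) as the mean of
        # the points assigned to it.
        new_centroids = np.array([points[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        return labels, new_centroids

In practice this iteration is repeated until new_centroids equals centroids (with a guard for
clusters that end up empty).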
*Agglomerative Clustering:
Agglomerative clustering builds a hierarchy represented by a dendrogram (discussed in a later
section). The dendrogram acts as a memory for the algorithm, recording how the clusters are
formed. Clustering starts by treating the N data points as N clusters and then merges the two
closest clusters at each step, so that the current step contains one cluster fewer than the
previous one.
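The merge history that the dendrogram records can be inspected directly with SciPy; the toy
data below is hypothetical and only illustrates how N points start as N singleton clusters and
are merged pairwise.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Hypothetical toy data: 5 points, so the algorithm starts from 5 clusters.
    points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6]])

    # Each row of Z records one merge of the two closest clusters
    # ('ward' linkage merges the pair giving the smallest variance increase).
    Z = linkage(points, method='ward')
    print(Z)        # the merge history, i.e. the dendrogram in tabular form
    dendrogram(Z)
    plt.show()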
*Mean Shift Clustering:
Mean shift is a non-parametric, iterative clustering algorithm that treats the data points in
the feature space as samples from an empirical probability density function. The algorithm
clusters the data by letting each data point converge to a region of local maxima: a window is
fixed around each data point, the mean of the points inside the window is computed, the window
is shifted to that mean, and these steps are repeated until all the data points converge,
forming the clusters.
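The shift of a single point towards its local density maximum can be sketched as follows; this
assumes a flat (uniform) kernel, and the function name and tolerance are illustrative only.

    import numpy as np

    def mean_shift_point(x, points, bandwidth, tol=1e-3, max_iter=300):
        # Repeatedly replace x with the mean of all points lying inside the
        # window of radius `bandwidth` centred on x, until x stops moving.
        for _ in range(max_iter):
            in_window = points[np.linalg.norm(points - x, axis=1) < bandwidth]
            new_x = in_window.mean(axis=0)
            if np.linalg.norm(new_x - x) < tol:   # converged to a local maximum
                break
            x = new_x
        return x

Data points that converge to the same local maximum are assigned to the same cluster.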
*Elbow Method:
The elbow method is used to find the optimal value of K for the K-means clustering algorithm.
It works by computing the sum of squared errors (SSE) between each data point and its nearest
centroid for different values of K. As K increases the SSE decreases; the value of K at which
the decline in SSE slows down sharply forms the elbow of the curve, the point at which we
should stop dividing the data further.
METHODOLOGY:
Data Collection:
The dataset has been taken from a local retail shop and consists of two features: the average
number of visits to the shop and the average amount of shopping done, both on a yearly basis.
Feature Scaling:
The data has been scaled using StandardScaler [9]; applying the standard scaler centres the
data around 0 with a standard deviation of 1.
x_scaled = (x - mean(X)) / stdev(X)
x = an entry in the feature set, x ∈ X
mean(X) = mean of the feature set X
stdev(X) = standard deviation of X
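As a minimal sketch of this step using scikit-learn (the numbers below are hypothetical, not
the shop's actual data):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Hypothetical rows of [average yearly visits, average yearly spend]
    X = np.array([[12, 450.0], [3, 90.0], [25, 1200.0], [8, 300.0]])

    scaler = StandardScaler()            # applies (x - mean(X)) / stdev(X) per feature
    X_scaled = scaler.fit_transform(X)
    print(X_scaled.mean(axis=0))         # approximately 0 for each feature
    print(X_scaled.std(axis=0))          # approximately 1 for each feature

X_scaled, the scaled customer dataset, is reused in the sketches that follow.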
K-means Clustering:
Choosing the optimal number of clusters:
The elbow method is applied to calculate the value of K for the dataset.
Step-1: Run the algorithm for various values of K, i.e. varying K from 1 to 10.
Step-2: Calculate the within-cluster sum of squared errors for each K.
Step-3: Plot the calculated error; the bend where an elbow-like structure forms gives the
optimal number of clusters (a sketch of this follows the steps).
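A minimal scikit-learn sketch of these three steps is given below; it assumes X_scaled holds
the full scaled customer dataset, and the random_state value is an arbitrary choice.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # X_scaled: the standardised customer dataset (scaled as in the previous sketch)
    sse = []
    k_values = range(1, 11)                                 # Step 1: vary K from 1 to 10
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
        sse.append(km.inertia_)                             # Step 2: within-cluster SSE
    plt.plot(k_values, sse, marker='o')                     # Step 3: look for the bend
    plt.xlabel('K')
    plt.ylabel('SSE')
    plt.show()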
Algorithm:
Step-1: Initialize the K (= 5) centroids.
Step-2: Assign each data point to its closest centroid.
Step-3: Recalculate each centroid position as the mean of the cluster formed around it.
Step-4: Repeat steps 2 and 3 until the centroid positions remain unchanged between the
previous and current iterations (a scikit-learn sketch follows these steps).
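A minimal sketch of running these steps with scikit-learn, again assuming X_scaled from the
feature-scaling step; the random_state value is an arbitrary choice.

    from sklearn.cluster import KMeans

    # K = 5 as chosen by the elbow method; steps 1-4 run internally
    # until the centroids stop moving.
    kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)   # cluster label for each customer
    print(kmeans.cluster_centers_)          # final centroid positions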
Agglomerative Clustering:
Choosing the optimal number of clusters:
The number of clusters for this algorithm has been determined from the dendrogram.
Algorithm:
Step-1: Each data point is taken to be a cluster.
Step-2: Merge the two closest clusters.
Step-3: Step 2 is repeated until all the data points are merged into a single cluster.
However, as we have defined the value of K as 5, the algorithm stops when every data point
belongs to one of the 5 clusters (see the sketch after these steps).
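A minimal sketch of this step with SciPy and scikit-learn, assuming X_scaled from the
feature-scaling step:

    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    from sklearn.cluster import AgglomerativeClustering

    # X_scaled: the standardised customer dataset
    # Dendrogram used to read off a suitable number of clusters.
    dendrogram(linkage(X_scaled, method='ward'))
    plt.show()

    # Agglomerative clustering stopped once 5 clusters remain.
    agg = AgglomerativeClustering(n_clusters=5, linkage='ward')
    labels = agg.fit_predict(X_scaled)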
Mean Shift Clustering:
This non-parametric clustering method is applied to look for a different pattern in the
dataset, since K-means and agglomerative clustering gave almost the same result. There is no
need to choose the number of clusters; however, the method needs one input parameter, the
bandwidth (radius), which is calculated using the K-nearest-neighbour algorithm. The algorithm
follows an iterative approach in which a point of local maxima of the probability density
function is found around each data point, and it iterates until all the data points have
converged up the hill (created by the PDF); it is therefore also known as a 'hill-climbing'
algorithm.
Algorithm:
Step-1: A window, defined over the PDF, is placed around each data point.
Step-2: The mean of the points within the window is calculated.
Step-3: The window is moved towards the newly calculated mean.
Step-4: Steps 2 and 3 are repeated until all the data points converge to local maxima,
resulting in the clusters (a scikit-learn sketch follows these steps).
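A minimal scikit-learn sketch of this procedure, assuming X_scaled from the feature-scaling
step; the quantile value passed to the bandwidth estimator is a hypothetical choice.

    from sklearn.cluster import MeanShift, estimate_bandwidth

    # X_scaled: the standardised customer dataset
    # Bandwidth (window radius) estimated from nearest-neighbour distances.
    bandwidth = estimate_bandwidth(X_scaled, quantile=0.2)
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    labels = ms.fit_predict(X_scaled)
    print(len(ms.cluster_centers_))   # number of clusters found automatically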
CONCLUSION:
In this data science project, we built a customer segmentation model using a class of machine
learning known as unsupervised learning. Specifically, we made use of a clustering algorithm
called K-means clustering. We analyzed and visualized the data and then proceeded to implement
our algorithm. We opted for internal clustering validation rather than external clustering
validation, which depends on external data such as labels. Internal cluster validation can be
used to choose the clustering algorithm that best suits the dataset and correctly assigns each
data point to its appropriate cluster.