PeerEval Unsupervised
• Spending Score: Score assigned by the mall based on customer behavior and spending nature
Since the data is already clean, no data-cleaning step is required, and no scaling is applied. We drop the CustomerID column, as it is just a unique identifier with no relevance to our analysis. Categorical features are then encoded using the pandas.get_dummies function. The next step is to apply a clustering algorithm to the data and evaluate the quality of the resulting clusters.
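The preparation steps above can be sketched as follows. This is a minimal illustration, assuming the standard mall-customers layout with a "CustomerID" column and a categorical "Gender" column; the exact column names are assumptions, not confirmed by the report.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop the identifier column: it carries no information for clustering.
    df = df.drop(columns=["CustomerID"])
    # One-hot encode any categorical columns with pandas.get_dummies.
    return pd.get_dummies(df)

# Tiny synthetic frame standing in for the mall dataset:
raw = pd.DataFrame({
    "CustomerID": [1, 2],
    "Gender": ["Male", "Female"],
    "Age": [19, 21],
    "Spending Score": [39, 81],
})
encoded = preprocess(raw)
print(list(encoded.columns))
```

After encoding, every column is numeric, which is what the distance-based clustering algorithms below require.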
Clustering Algorithm
We will use two different clustering algorithms in this report and apply hyperparameter tuning to obtain our results:
• Kmeans Clustering
The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum of squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a wide range of application areas in many different fields. The k-means algorithm divides a set of samples into disjoint clusters, each described by the mean of the samples in the cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from the dataset, although they live in the same space.
The K-means algorithm aims to choose centroids that minimise the inertia, i.e. the within-cluster sum-of-squares criterion. Several techniques exist for finding the best value of K; we use the elbow method, which identifies the optimal number of clusters from the total within-cluster sum of squares.
From the elbow plot above, we see that K = 5, 6, or 7 are the best candidate values for our clustering.
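The elbow method described above can be sketched as a loop over candidate K values, recording each model's inertia. The synthetic data below stands in for the mall dataset used in the report.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with a known cluster structure.
X, _ = make_blobs(n_samples=300, centers=6, random_state=42)

# Fit KMeans for a range of K and record the inertia
# (within-cluster sum of squares) for each fit.
inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# Inertia always decreases as K grows; the "elbow" is the K at which
# the marginal drop flattens out.
for k in sorted(inertias):
    print(k, round(inertias[k], 1))
```

In practice these values are plotted against K, and the elbow is read off the curve by eye.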
• Agglomerative Clustering
Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging
or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The
root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only
one sample. The AgglomerativeClustering object performs a hierarchical clustering using a bottom up
approach: each observation starts in its own cluster, and clusters are successively merged together.
The linkage criterion determines the metric used for the merge strategy:
• Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing
approach and in this sense is similar to the k-means objective function but tackled with an
agglomerative hierarchical approach.
• Maximum or complete linkage minimizes the maximum distance between observations of pairs of
clusters.
• Average linkage minimizes the average of the distances between all observations of pairs of
clusters.
• Single linkage minimizes the distance between the closest observations of pairs of clusters.
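A minimal sketch of the bottom-up approach described above, using scikit-learn's AgglomerativeClustering with the ward linkage; the data is synthetic and stands in for the report's dataset.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in data with three well-separated groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Each observation starts in its own cluster; clusters are merged
# successively until n_clusters remain, using ward linkage
# (variance-minimizing, similar in spirit to the k-means objective).
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X)
print(sorted(set(labels)))
```

Swapping `linkage` for "complete", "average", or "single" selects the other merge strategies listed above.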
Key Findings
The effectiveness of a clustering algorithm can be quantified using various metrics; here we use the silhouette score. The Silhouette Coefficient (sklearn.metrics.silhouette_score) is an example of such an evaluation, where a higher Silhouette Coefficient relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
• a: The mean distance between a sample and all other points in the same class.
• b: The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient s for a single sample is then given as: (b - a) / max(a, b).
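The formula above can be checked on a toy example: two tight, well-separated clusters should give a silhouette close to 1. The data below is illustrative only and is not from the report.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight, well-separated clusters on a line.
X = np.array([[0.0], [0.1], [10.0], [10.1]])
labels = np.array([0, 0, 1, 1])

# Manual s for the first sample, following (b - a) / max(a, b):
a = 0.1                # mean distance to its own cluster (just [0.1])
b = (10.0 + 10.1) / 2  # mean distance to the nearest other cluster
s_first = (b - a) / max(a, b)

# Library value: mean silhouette over all samples.
score = silhouette_score(X, labels)
print(round(s_first, 3), round(score, 3))
```

Both values are near 1, as expected for compact, well-separated clusters; overlapping clusters would push the score toward 0 or below.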
In this assignment we have used K-means with the number of clusters set to 5, 6, and 7, and Agglomerative clustering with ward and average linkage. The corresponding silhouette scores are shown in the table above. We can see that K-means with the number of clusters equal to 6 performs best.
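The comparison described above can be sketched as a loop over the five configurations, scoring each with the silhouette coefficient. The data is a synthetic stand-in; the report's actual scores come from the mall dataset, so the winner here need not match.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=6, random_state=42)

scores = {}
# K-means with K = 5, 6, 7, as in the report.
for k in (5, 6, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[f"kmeans_k{k}"] = silhouette_score(X, labels)
# Agglomerative clustering with ward and average linkage.
for linkage in ("ward", "average"):
    labels = AgglomerativeClustering(n_clusters=6, linkage=linkage).fit_predict(X)
    scores[f"agglo_{linkage}"] = silhouette_score(X, labels)

for name, s in scores.items():
    print(f"{name}: {s:.3f}")
print("best:", max(scores, key=scores.get))
```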
Figure 9: 3D plot of the K-means clustering with K = 6