LP I Assignment A4 Clustering
Group: A
Assignment No. 04
Clustering
Download the customer dataset from the link below.
Data Set: https://www.kaggle.com/shwetabh123/mall-customers
This dataset records the income and spending of customers visiting a shopping mall. It
contains Customer ID, Gender, Age, Annual Income, and Spending Score. As the mall owner,
you need to identify the groups of customers who are most profitable for the mall. Apply at
least two clustering algorithms (based on Spending Score) to find these customer groups.
a. Apply data pre-processing techniques (Label Encoding, Data Transformation, etc.) if
necessary.
b. Perform data preparation (Train-Test Split).
c. Apply the Machine Learning Algorithm.
d. Evaluate the Model.
e. Apply Cross-Validation and Evaluate the Model.
Objective:
• Perform Data Preparation, Model Evaluation, and Cross-Validation
• Apply and Evaluate Clustering Algorithms on Customer Spending Data
Outcome:
Students will be able to:
• Prepare the data, apply clustering algorithms, and validate the results to ensure accurate
and actionable customer segmentation.
• Use clustering algorithms to segment customers and evaluate these segments to identify
profitable customer groups.
Theory:
Clustering: Clustering, or cluster analysis, is a machine learning technique that groups an
unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters,
each consisting of similar data points." Objects that are similar to one another stay in the same
group, which has little or no similarity to the other groups.
Applications of Clustering: Market Segmentation, Statistical data analysis, Social network
analysis, Image segmentation, Anomaly detection, etc.
K-Means Clustering:
K-Means clustering is one of the most widely used unsupervised learning algorithms. It is used
when the data is unlabelled, that is, data without defined categories or groups. The algorithm
partitions a given data set into a fixed number of clusters, k, chosen in advance, by assigning
each data point to the cluster whose centre (centroid) is nearest.
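The partitioning described above can be illustrated with a minimal sketch of the standard K-Means iteration (Lloyd's algorithm), written with the Python standard library only. The function name `kmeans`, its parameters, and the six (annual income, spending score) pairs are made up for illustration and are not taken from the Mall Customers dataset.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical (annual income, spending score) pairs forming two clear groups.
customers = [(15, 80), (16, 82), (17, 79), (85, 15), (86, 17), (88, 14)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

On real data the result depends on the initial centroids, so library implementations usually repeat the run with several initialisations and keep the best one.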
K-Means Algorithm:
import pandas as pd

# Load the mall customer dataset and preview the first rows
df = pd.read_csv("Mall_Customers.csv")
df.head()
# Checking outliers
# Checking distribution of Annual Income
# Checking the relationship between variables Annual Income, Spending score and Gender
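The outlier check listed above does not say which method to use. One common option, assumed here, is the 1.5 × IQR rule that box plots are based on. The sketch below applies it to a made-up list of annual incomes; `iqr_outliers` and the sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up annual incomes; one value is far above the rest.
annual_income = [15, 16, 17, 18, 19, 20, 21, 137]
outliers = iqr_outliers(annual_income)
print(outliers)
```

With the real dataset the same function could be called on a column converted to a list, for example `df["Annual Income (k$)"].tolist()`, assuming that is the column name in the downloaded file.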
Step 3: Preprocess the data by scaling the features
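Scaling matters because K-Means relies on distances, so a feature with a larger numeric range would otherwise dominate. A minimal sketch of z-score standardisation (subtract the mean, divide by the standard deviation) is given below, assuming that this is the kind of scaling intended; scikit-learn's StandardScaler performs the same kind of transformation. The helper `standardize` and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up feature columns on different scales.
incomes = [15, 16, 17, 85, 86, 88]
scores = [80, 82, 79, 15, 17, 14]

scaled_incomes = standardize(incomes)
scaled_scores = standardize(scores)
print(scaled_incomes)
print(scaled_scores)
```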
Step 6: Find the optimal number of clusters using techniques such as the Elbow Method and the
Silhouette Score
The Elbow Method is a common technique for determining the optimal number of clusters k
for K-Means clustering. This method involves plotting the sum of squared distances from
each point to its assigned cluster center (known as the Within-Cluster Sum of Squares or
WCSS) as a function of the number of clusters. The "elbow" of the plot, where the rate of
decrease sharply changes, suggests the optimal number of clusters.
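As a rough illustration of the WCSS curve described above, the sketch below clusters a made-up list of spending scores for k = 1 to 5 and prints the WCSS for each k. It uses a simplified one-dimensional K-Means with a deterministic initialisation; `kmeans_1d`, `wcss`, and the data are assumptions for this example only.

```python
def kmeans_1d(values, k, n_iter=50):
    """Simplified 1-D K-Means; returns the final cluster centres."""
    data = sorted(values)
    # Deterministic initialisation: spread the centres across the sorted data.
    centres = [data[(2 * j + 1) * len(data) // (2 * k)] for j in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        new_centres = [
            sum(c) / len(c) if c else centres[j] for j, c in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres


def wcss(values, k):
    """Within-Cluster Sum of Squares: squared distance to the nearest centre."""
    centres = kmeans_1d(values, k)
    return sum(min((v - c) ** 2 for c in centres) for v in values)


# Made-up spending scores with three visible groups (low, medium, high).
spending_scores = [5, 6, 8, 10, 48, 50, 52, 55, 90, 92, 95, 97]
wcss_by_k = {k: wcss(spending_scores, k) for k in range(1, 6)}
for k, value in wcss_by_k.items():
    print(k, round(value, 2))
```

For this toy data the WCSS falls steeply up to k = 3 and changes little afterwards, which is the "elbow" one would look for in the plot.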
Silhouette Score
The Silhouette Score is a metric used to evaluate the quality of clustering. It measures how
similar an object is to its own cluster compared to other clusters. The Silhouette Score ranges
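from -1 to +1:
+1 indicates that the object is well matched to its own cluster and far from the
neighboring clusters.
0 indicates that the object is on or very close to the decision boundary between two
neighboring clusters.
-1 indicates that the object may have been assigned to the wrong cluster.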
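To make the score concrete, here is a minimal sketch of the usual silhouette definition: for each point, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the point's score is (b - a) / max(a, b). The overall score is the mean over all points. The function below is a hypothetical standard-library implementation using Euclidean distance; the points and labels are made up.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Made-up (annual income, spending score) pairs forming two clear groups.
customers = [(15, 80), (16, 82), (17, 79), (85, 15), (86, 17), (88, 14)]
good = silhouette_score(customers, [0, 0, 0, 1, 1, 1])  # matches the groups
bad = silhouette_score(customers, [0, 1, 0, 1, 0, 1])   # mixes the groups
print(round(good, 3), round(bad, 3))
```

A labelling that follows the two visible groups scores close to +1, while a labelling that mixes them scores noticeably lower.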