LP I Assignment A4 Clustering

Uploaded by

riya2004p

Course: Laboratory Practice -I (Machine Learning) Course Code: 314448

Class: T. E. [IT] Division: A


Name of the Student: Roll No:

Group: A
Assignment No. 04
Clustering

Date of Performance: Marks:

Sign with Date of Checking


Group A
Assignment 4
Clustering
Problem Statement:

Assignment on Clustering Techniques

Download the customer dataset from the following link:
Data Set: https://www.kaggle.com/shwetabh123/mall-customers

This dataset records the income and spending of customers visiting a shopping mall.
It contains Customer ID, Gender, Age, Annual Income, and Spending Score.
As the mall owner, you need to identify the groups of customers who are the most
profitable for the mall. Apply at least two clustering algorithms (based on Spending
Score) to find these customer groups.
a. Apply data pre-processing techniques (Label Encoding, Data Transformation, …) if
necessary.
b. Perform data-preparation (Train-Test Split)
c. Apply Machine Learning Algorithm
d. Evaluate Model.
e. Apply Cross-Validation and Evaluate Model

Objective:
• Perform Data Preparation, Model Evaluation, and Cross-Validation
• Apply and Evaluate Clustering Algorithms on Customer Spending Data

Outcome:
Students will be able to:
• Prepare the data, apply clustering algorithms, and validate the results to ensure accurate
and actionable customer segmentation.
• Use clustering algorithms to segment customers and evaluate these segments to identify
profitable customer groups.

Theory:

Clustering: Clustering, or cluster analysis, is a machine learning technique that groups an
unlabelled dataset. It can be defined as "a way of grouping the data points into different clusters,
each consisting of similar data points." Objects that are similar to one another stay in the same
group, and that group has little or no similarity with the other groups.
Applications of Clustering: Market Segmentation, Statistical data analysis, Social network
analysis, Image segmentation, Anomaly detection, etc.
K-Means Clustering:
K-Means clustering is one of the most popular unsupervised learning algorithms. It is used when we
have unlabelled data, that is, data without defined categories or groups. The algorithm partitions
a given data set into a fixed number of clusters, K, chosen in advance.

K-Means Algorithm:

Step-1: Choose the number of clusters, K.

Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)
Step-3: Assign each data point to its closest centroid; this forms the K clusters.
Step-4: Compute the mean of the points in each cluster and place that cluster's new centroid there.
Step-5: Repeat step 3, that is, reassign each data point to the new closest centroid.
Step-6: If any reassignment occurred, go to step 4; otherwise go to FINISH.
Step-7: The model is ready.
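
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the assignment's reference solution: the function name kmeans, its parameters, and the toy array X_toy are assumptions made up for this example, and the initial centroids are drawn from the data points themselves.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means following Steps 1-7; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 3 / Step 5: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when no point changed cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Toy data (made up for illustration): two well-separated groups of points.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                  [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids, labels = kmeans(X_toy, k=2)
print(labels)     # the first three points share one label, the last three the other
print(centroids)  # one centroid per group
```

In practice a library implementation such as scikit-learn's KMeans is usually preferred, since it also handles repeated initialisations.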

K-Means Clustering Intuition:

1. Centroid: A centroid is the point at the centre of a cluster: the mean of the cluster's points,
which need not be one of the data points. In centroid-based clustering, each cluster is
represented by its centroid. The algorithm requires the number of clusters K and the data
set as input. The data set is a collection of features for each data point. The algorithm starts with
initial estimates for the K centroids.
2. Data Assignment Step: Each centroid defines one of the clusters. In this step, each data point
is assigned to its nearest centroid, based on the squared Euclidean distance. So, if c_i is a
centroid in the set of centroids C, each data point x is assigned to the cluster whose centroid c_i
gives the minimum squared Euclidean distance to x.
3. Centroid update Step: In this step, the centroids are recomputed and updated. This is done by
taking the mean of all data points assigned to that centroid’s cluster.
4. Choosing the value of K: The K-Means algorithm finds the clusters and data labels for a
pre-defined value of K. We should choose the value of K that gives the best performance.
There are different techniques available to find the optimal value of K. The
most common technique is the elbow method.
5. The elbow method: The elbow method is used to determine the optimal number of clusters in
K-Means clustering.
6. WCSS: The elbow method uses the WCSS value. WCSS stands for Within-Cluster Sum of
Squares: the sum of squared distances between each point and the centroid of its cluster, which
measures the total variation within the clusters. To find the optimal number of clusters, the elbow
method runs K-Means for a range of K values, computes the WCSS for each K, plots WCSS
against K, and takes the K at the bend ("elbow") of the curve.
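
To make the WCSS definition concrete, here is a small sketch that computes it by hand. The helper name wcss and the four one-dimensional points are made up for illustration; they are not taken from the mall dataset.

```python
import numpy as np


def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its own cluster's centroid."""
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())


# Made-up 1-D points for illustration.
X_demo = np.array([[1.0], [3.0], [9.0], [11.0]])

# K = 1: a single cluster whose centroid is the overall mean, 6.0.
one_cluster = wcss(X_demo, np.array([0, 0, 0, 0]), np.array([[6.0]]))

# K = 2: clusters {1, 3} and {9, 11} with centroids 2.0 and 10.0.
two_clusters = wcss(X_demo, np.array([0, 0, 1, 1]), np.array([[2.0], [10.0]]))

print(one_cluster)   # 25 + 9 + 9 + 25 = 68.0
print(two_clusters)  # (1 + 1) + (1 + 1) = 4.0
```

The sharp drop from K = 1 to K = 2 is the kind of change the elbow method looks for.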

Python Implementation of K-means Clustering Algorithm:


Step 1: Data Exploration and Preprocessing

# Importing the required libraries


import numpy as np  # for numeric operations
import pandas as pd  # for DataFrame operations
import matplotlib.pyplot as plt  # for plotting and visualization
import seaborn as sns

# Loading the dataset

df = pd.read_csv("Mall_Customers.csv")

# Reading the first and last n rows of the dataset.

df.head()
df.tail()

# Checking the dimensions of the data frame.

# Checking data types


# Renaming column name

# Printing information about a DataFrame.

# Providing descriptive statistics for dataset.

# Checking for missing values in the data.


# Checking duplicates

# Checking no of unique values for each column

# Dropping features containing unique values
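
The checks listed above might look like the following sketch. It is only an assumption about how they could be coded: a tiny made-up DataFrame stands in for the CSV file so the example runs on its own, and the column names (including the renamed ones) are assumed and may differ from the downloaded file.

```python
import pandas as pd

# Made-up rows standing in for pd.read_csv("Mall_Customers.csv");
# the column names are assumed and may differ in the downloaded file.
df = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 21, 35, 52],
    "Annual Income (k$)": [15, 16, 60, 88],
    "Spending Score (1-100)": [39, 81, 50, 13],
})

print(df.shape)               # dimensions: (rows, columns)
print(df.dtypes)              # data type of each column
df = df.rename(columns={      # shorter names (a hypothetical choice)
    "Annual Income (k$)": "Annual_Income",
    "Spending Score (1-100)": "Spending_Score",
})
df.info()                     # column summary and non-null counts
print(df.describe())          # descriptive statistics for numeric columns
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows
print(df.nunique())           # unique values per column
# CustomerID is unique for every row, so it carries no grouping information.
df = df.drop(columns=["CustomerID"])
```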

Step 2: Exploratory Data Analysis

# Checking outliers
# Checking distribution of Annual Income

# Checking distribution of Age

# Checking distribution of Spending score


# Checking distribution of categorical feature ‘Gender’

# Checking the relationship between variables Annual Income, Spending score and Gender
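
One possible way to carry out these checks is sketched below. The 1.5 * IQR rule for outliers is a common convention rather than something the assignment prescribes, and the helper names, column names, and rows are all made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def iqr_outliers(series):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return series[mask]


def plot_distributions(df, numeric_cols, category_col):
    """Histogram per numeric column plus a bar chart of category counts."""
    fig, axes = plt.subplots(1, len(numeric_cols) + 1, figsize=(14, 3))
    for ax, col in zip(axes, numeric_cols):
        ax.hist(df[col], bins=10)
        ax.set_title(col)
    counts = df[category_col].value_counts()
    axes[-1].bar(counts.index.astype(str), counts.values)
    axes[-1].set_title(category_col)
    fig.tight_layout()
    return fig


# Made-up rows; column names are assumed, not taken from the real file.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [19, 21, 35, 52, 40, 23],
    "Annual_Income": [15, 16, 60, 88, 62, 300],
    "Spending_Score": [39, 81, 50, 13, 55, 60],
})

print(iqr_outliers(df["Annual_Income"]))  # flags the income of 300
# Relationship between income, spending score and gender: group means.
print(df.groupby("Gender")[["Annual_Income", "Spending_Score"]].mean())
fig = plot_distributions(df, ["Age", "Annual_Income", "Spending_Score"], "Gender")
```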
Step 3: Preprocess the data by scaling the features
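
A minimal sketch of this step, assuming scikit-learn's StandardScaler is the scaler of choice (the assignment does not name one). The four rows are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up [Annual Income, Spending Score] rows for illustration only.
X = np.array([[15.0, 39.0], [16.0, 81.0], [60.0, 50.0], [88.0, 13.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column becomes (value - mean) / std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

Scaling matters for distance-based methods such as K-Means because a feature with a larger numeric range would otherwise dominate the Euclidean distance.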

Step 4: Checking correlation among features
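
The correlation check could be sketched as follows, using the Pearson correlation that pandas computes by default. The column names and values are made up for illustration.

```python
import pandas as pd

# Made-up rows; column names are assumed, not taken from the real file.
df = pd.DataFrame({
    "Age": [19, 21, 35, 52, 40],
    "Annual_Income": [15, 16, 60, 88, 62],
    "Spending_Score": [39, 81, 50, 13, 55],
})

numeric_cols = ["Age", "Annual_Income", "Spending_Score"]
corr = df[numeric_cols].corr()  # pairwise Pearson correlation matrix
print(corr.round(2))
```

A heatmap such as sns.heatmap(corr, annot=True) is one common way to display this matrix.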

Step 5: Applying ‘LabelEncoder’ for encoding binary categories in ‘Gender’ column
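
A short sketch of the encoding step with scikit-learn's LabelEncoder. The four rows are made up; the 'Gender' column name is taken from the step description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

encoder = LabelEncoder()
df["Gender"] = encoder.fit_transform(df["Gender"])

print(encoder.classes_)       # classes in sorted order: ['Female' 'Male']
print(df["Gender"].tolist())  # [1, 0, 0, 1]
```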

Step 6: Finding optimal number of clusters using techniques like the elbow method and
Silhouette Score
The Elbow Method is a common technique for determining the optimal number of clusters k
for K-Means clustering. This method involves plotting the sum of squared distances from
each point to its assigned cluster center (known as the Within-Cluster Sum of Squares or
WCSS) as a function of the number of clusters. The "elbow" of the plot, where the rate of
decrease sharply changes, suggests the optimal number of clusters.
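
The elbow procedure might be coded as in the sketch below. It assumes scikit-learn's KMeans, whose inertia_ attribute holds the WCSS of the fitted model. The data is synthetic (three generated groups of points), not the mall dataset, and the range of k values is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration: three groups of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

k_values = range(1, 8)
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)  # inertia_ is scikit-learn's name for WCSS

# Plot WCSS against k and look for the bend in the curve.
fig, ax = plt.subplots()
ax.plot(list(k_values), wcss, marker="o")
ax.set_xlabel("Number of clusters k")
ax.set_ylabel("WCSS")
ax.set_title("Elbow method")
```

For this synthetic data the curve flattens after k = 3, which matches the three groups that were generated.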

Silhouette Score

The Silhouette Score is a metric used to evaluate the quality of clustering. It measures how
similar an object is to its own cluster compared to other clusters. The Silhouette Score ranges
from -1 to +1:

A value near +1 indicates that the object is well matched to its own cluster and far from
neighboring clusters.

A value near 0 indicates that the object is on or very close to the decision boundary between
two neighboring clusters.

A value near -1 indicates that the object has probably been assigned to the wrong cluster.
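
As a sketch of how the score can guide the choice of k, the example below uses scikit-learn's silhouette_score, which returns the mean silhouette over all samples. The data is synthetic (three generated groups), and the tested range of k is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data for illustration: three groups of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```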


Step 7: Applying K-means clustering algorithm
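
A hedged sketch of this step with scikit-learn's KMeans. The data is synthetic (two made-up customer groups), the column names are assumed, and n_clusters = 2 suits only this toy data; on the real dataset the value would come from the elbow method or the silhouette score.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data: two made-up groups.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Annual_Income": np.concatenate([rng.normal(30, 5, 40), rng.normal(80, 5, 40)]),
    "Spending_Score": np.concatenate([rng.normal(20, 5, 40), rng.normal(80, 5, 40)]),
})

n_clusters = 2  # chosen for this toy data only
features = ["Annual_Income", "Spending_Score"]
X_scaled = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)

# Average income and spending per cluster, to spot the high-spending group.
summary = df.groupby("Cluster")[features].mean()
print(summary)
top_cluster = summary["Spending_Score"].idxmax()
print("Cluster with the highest average spending score:", top_cluster)
```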

Step 8: Visualize the resulting clusters
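
One way to draw the clusters is a scatter plot coloured by cluster label, with the centroids marked. The sketch below uses synthetic data and made-up axis labels; the choice of three clusters matches only this generated data.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration: three groups of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[20, 20], [80, 80], [20, 80]])
X = np.vstack([c + rng.normal(scale=5.0, size=(30, 2)) for c in centers])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

fig, ax = plt.subplots(figsize=(6, 4))
# Points coloured by their assigned cluster.
ax.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap="viridis", s=20)
# Centroids drawn on top as large red crosses.
ax.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
           c="red", marker="X", s=120, label="Centroids")
ax.set_xlabel("Annual Income (synthetic)")
ax.set_ylabel("Spending Score (synthetic)")
ax.set_title("K-Means clusters")
ax.legend()
```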

Step 9: Apply Hierarchical Clustering
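
A minimal sketch of agglomerative (bottom-up) hierarchical clustering, assuming SciPy's linkage and dendrogram functions and scikit-learn's AgglomerativeClustering. Ward linkage and the choice of three clusters are assumptions for this synthetic data, not settings given in the assignment.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.cluster import AgglomerativeClustering

# Synthetic data for illustration: three groups of 10 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# The linkage matrix records which clusters are merged and at what distance.
Z = linkage(X, method="ward")  # shape (n_samples - 1, 4)

# Build the dendrogram layout; drop no_plot=True to draw it with matplotlib.
tree = dendrogram(Z, no_plot=True)

# Cut the tree into 3 flat clusters (labels run from 1 to 3).
labels_scipy = fcluster(Z, t=3, criterion="maxclust")

# The same idea through scikit-learn's estimator interface.
labels_sklearn = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

print(labels_scipy)
print(labels_sklearn)
```

Unlike K-Means, the merge history in Z is computed once, and different numbers of clusters can be obtained by cutting the same tree at different heights.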


Conclusion: The clustering analysis successfully segmented customers into distinct groups,
revealing valuable insights into their spending behavior.

Write short answers to the following questions:


1. What is the purpose of scaling the data before applying clustering algorithms?
2. How does the Elbow Method help in determining the optimal number of clusters for
K-Means clustering?
3. What is the significance of the Silhouette Score in clustering analysis?
4. How does hierarchical clustering differ from K-Means clustering?
5. What does the dendrogram in hierarchical clustering represent?
6. Explain the purpose of the linkage function and how it is used in hierarchical clustering in
the code.
