Clustering Mall Data Students

The document outlines a Python code implementation for clustering mall customer data using KMeans. It includes data loading, feature selection, scaling, and determining the optimal number of clusters through the Elbow Method and silhouette scores. The final clusters are visualized using PCA for dimensionality reduction.



In [3]: import numpy as np


import matplotlib.pyplot as plt
import pandas as pd
import sklearn

In [4]: df = pd.read_csv('Mall_Customers.csv')
df

Out[4]:      CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
        0             1    Male   19                  15                      39
        1             2    Male   21                  15                      81
        2             3  Female   20                  16                       6
        3             4  Female   23                  16                      77
        4             5  Female   31                  17                      40
        ..          ...     ...  ...                 ...                     ...
        195         196  Female   35                 120                      79
        196         197  Female   45                 126                      28
        197         198    Male   32                 126                      74
        198         199    Male   32                 137                      18
        199         200    Male   30                 137                      83

        200 rows × 5 columns


In [5]: # 2. Select relevant features and scale


features = ['Annual Income (k$)', 'Spending Score (1-100)']
X = df[features]

In [6]: # Import the necessary class


from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
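
As a quick optional check (not part of the original notebook), the standardized features should come out with roughly zero mean and unit standard deviation:

# Sanity check (assumption: X_scaled is the array produced just above).
print(X_scaled.mean(axis=0).round(3))  # expected ~[0, 0]
print(X_scaled.std(axis=0).round(3))   # expected ~[1, 1]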

In [7]: # Import the necessary class


from sklearn.cluster import KMeans # Import KMeans
from sklearn.metrics import silhouette_score

# 3. Find optimal k using Elbow Method


inertia = []
silhouette_scores = []
k_range = range(2, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

# Plot Elbow Method


plt.figure(figsize=(10, 5))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Sum of Squared Distances)')
plt.grid()
plt.show()


In [8]: # Plot Silhouette Scores


plt.figure(figsize=(10, 5))
plt.plot(k_range, silhouette_scores, marker='o', color='orange')
plt.title('Silhouette Scores for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.grid()
plt.show()
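
One optional way to read the silhouette curve programmatically is to take the k with the highest score; this is only a sketch, and the notebook still picks k by inspecting the plots:

# Sketch: choose the k with the maximum silhouette score
# (uses k_range and silhouette_scores from the loop above).
best_k = k_range[int(np.argmax(silhouette_scores))]
print(f'Best k by silhouette score: {best_k}')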


In [9]: # 4. Apply KMeans with the optimal k


optimal_k = 5 # Choose based on elbow/silhouette analysis
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)
df


Out[9]:      CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  Cluster
        0             1    Male   19                  15                      39        4
        1             2    Male   21                  15                      81        2
        2             3  Female   20                  16                       6        4
        3             4  Female   23                  16                      77        2
        4             5  Female   31                  17                      40        4
        ..          ...     ...  ...                 ...                     ...      ...
        195         196  Female   35                 120                      79        1
        196         197  Female   45                 126                      28        3
        197         198    Male   32                 126                      74        1
        198         199    Male   32                 137                      18        3
        199         200    Male   30                 137                      83        1

        200 rows × 6 columns
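
A small optional addition (not in the original notebook) is to profile the clusters on the unscaled columns, which makes the segments easier to interpret:

# Mean income/spending and size of each cluster, in original units.
print(df.groupby('Cluster')[features].mean().round(1))
print(df['Cluster'].value_counts().sort_index())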

In [10]: import matplotlib.pyplot as plt


import numpy as np

# Assuming you have optimal_k, X_scaled, and kmeans defined from your previous code

plt.figure(figsize=(10, 7))

# Define a list of colors for the clusters


colors = ['deeppink', 'green', 'red', 'purple', 'orange'] # Adjust colors as needed

# Plot each cluster with a different color


for cluster in range(optimal_k):
    cluster_points = X_scaled[df['Cluster'] == cluster]  # Select points in the current cluster
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1],
                c=colors[cluster], label=f'Cluster {cluster}')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='black', marker='*', label='Centroids')

plt.title('Customer Segments Visualization')
plt.xlabel('Annual Income (k$, scaled)')
plt.ylabel('Spending Score (scaled)')
plt.legend()
plt.grid()
plt.show()
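
The scatter above is in scaled units; as an optional sketch, the fitted scaler can map the centroids back to the original units (k$ and score points) for easier interpretation:

# Cluster centres expressed in original units via the inverse transform.
centers_original = scaler.inverse_transform(kmeans.cluster_centers_)
print(pd.DataFrame(centers_original, columns=features).round(1))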


In [11]: df = pd.read_csv('Mall_Customers.csv')

In [12]: # Assuming 'df' is your DataFrame and 'Genre' is the categorical column
encoded_columns = pd.get_dummies(df['Genre'], prefix='Genre') # 'Genre' is used as a prefix for new columns

# Concatenate the encoded columns to the DataFrame


df = pd.concat([df, encoded_columns], axis=1)

# Remove the original 'Genre' column (optional)


df = df.drop('Genre', axis=1)
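
As a side note, pd.get_dummies also accepts drop_first=True, which keeps a single gender indicator instead of two mutually redundant columns; a minimal sketch (not used in this notebook, re-reading the CSV since 'Genre' was just dropped):

# Alternative encoding sketch: one indicator column instead of two.
genre_onehot = pd.get_dummies(pd.read_csv('Mall_Customers.csv')['Genre'],
                              prefix='Genre', drop_first=True)
print(genre_onehot.head())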

In [13]: # 2. Select relevant features and scale


features = ['Genre_Male', 'Genre_Female', 'Age', 'Annual Income (k$)', 'Spending Score (1-100)'] # Update features
X = df[features]


In [14]: # Import the necessary class


from sklearn.preprocessing import StandardScaler

# Now your existing code should work:


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [15]: # Import the necessary class


from sklearn.cluster import KMeans # Import KMeans
from sklearn.metrics import silhouette_score

# 3. Find optimal k using Elbow Method


inertia = []
silhouette_scores = []
k_range = range(2, 9)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, kmeans.labels_))

# Plot Elbow Method


plt.figure(figsize=(10, 5))
plt.plot(k_range, inertia, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Sum of Squared Distances)')

plt.grid()
plt.show()
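
The loop above also collected silhouette scores for this five-feature clustering; an optional sketch to print them alongside the elbow plot:

# Silhouette scores for k = 2..8 on the five scaled features.
for k, score in zip(k_range, silhouette_scores):
    print(f'k={k}: silhouette={score:.3f}')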

In [16]: # 4. Apply KMeans with the optimal k


optimal_k = 4 # Choose based on elbow/silhouette analysis
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df['Cluster'] = kmeans.fit_predict(X_scaled)
df


Out[16]:      CustomerID  Age  Annual Income (k$)  Spending Score (1-100)  Genre_Female  Genre_Male  Cluster
         0             1   19                  15                      39         False        True        3
         1             2   21                  15                      81         False        True        3
         2             3   20                  16                       6          True       False        2
         3             4   23                  16                      77          True       False        1
         4             5   31                  17                      40          True       False        2
         ..          ...  ...                 ...                     ...           ...         ...      ...
         195         196   35                 120                      79          True       False        1
         196         197   45                 126                      28          True       False        2
         197         198   32                 126                      74         False        True        3
         198         199   32                 137                      18         False        True        0
         199         200   30                 137                      83         False        True        3

         200 rows × 7 columns

In [17]: from sklearn.decomposition import PCA

# Apply PCA to reduce to 2 dimensions


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
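
# Optional check (an addition, not in the original notebook): how much of the
# variance in the five scaled features the two principal components retain.
print(pca.explained_variance_ratio_.round(3), pca.explained_variance_ratio_.sum().round(3))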

# Modify the plotting part to use X_pca:


for cluster in range(optimal_k):
    cluster_points = X_pca[df['Cluster'] == cluster]  # Select points in the current cluster
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1],
                c=colors[cluster], label=f'Cluster {cluster}')

plt.scatter(pca.transform(kmeans.cluster_centers_)[:, 0], pca.transform(kmeans.cluster_centers_)[:, 1],
            s=300, c='black', marker='*', label='Centroids')

plt.title('Customer Segments Visualization')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend()
plt.grid()
plt.show()
