Customer Segmentation With K-Means Clustering and Visualization - Colab

The document outlines a data analysis process using Python, focusing on customer segmentation based on spending and transaction frequency from an online retail dataset. It includes data cleaning, handling missing values, and applying KMeans clustering to identify customer groups. The results are visualized using plots, and a summary of each cluster's average total spend and number of transactions is provided.

Uploaded by

Bhavesh

import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.read_excel('/OnlineRetail.xlsx')

print(data.head())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6
1    536365     71053                  WHITE METAL LANTERN         6
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6

          InvoiceDate  UnitPrice  CustomerID         Country
0 2010-12-01 08:26:00       2.55     17850.0  United Kingdom
1 2010-12-01 08:26:00       3.39     17850.0  United Kingdom
2 2010-12-01 08:26:00       2.75     17850.0  United Kingdom
3 2010-12-01 08:26:00       3.39     17850.0  United Kingdom
4 2010-12-01 08:26:00       3.39     17850.0  United Kingdom

print(data.isnull().sum())

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

# Assign the result back to the column. Calling fillna(..., inplace=True) on a
# column selection raises a FutureWarning about chained assignment, and pandas
# states that this pattern will stop working in pandas 3.0.
data['Description'] = data['Description'].fillna('Unknown')

# Same fix as above: assign back instead of using inplace=True on a column selection.
# Note: every row without a CustomerID now shares the placeholder ID 0, so the
# groupby below treats all of those rows as one single customer.
data['CustomerID'] = data['CustomerID'].fillna(0)

print("\nMissing values after handling:")
print(data.isnull().sum())

Missing values after handling:
InvoiceNo      0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
UnitPrice      0
CustomerID     0
Country        0
dtype: int64
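
Filling missing CustomerID values with 0 keeps every row, but it also merges all anonymous purchases into one placeholder customer. The sketch below contrasts that with dropping the rows instead, which is a different choice from the one made in this notebook. The `toy` DataFrame and its values are made up for illustration and are not taken from OnlineRetail.xlsx.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the retail table (not real rows)
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A4"],
    "CustomerID": [17850.0, 17850.0, np.nan, np.nan, 13047.0],
    "Quantity": [6, 8, 2, 1, 3],
    "UnitPrice": [2.55, 2.75, 10.0, 5.0, 4.0],
})

# Approach used in this notebook: missing IDs become the placeholder 0
filled = toy.assign(CustomerID=toy["CustomerID"].fillna(0))

# Alternative (an assumption, not the notebook's method): keep only known customers
known = toy.dropna(subset=["CustomerID"])

n_customers_filled = filled["CustomerID"].nunique()
n_customers_known = known["CustomerID"].nunique()
print("customers with placeholder:", n_customers_filled)
print("customers after dropping:  ", n_customers_known)
```

Which option is appropriate depends on the goal: dropping rows loses revenue that cannot be attributed to a customer, while the placeholder adds one artificial customer to the segmentation.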

data['TotalSpend'] = data['Quantity'] * data['UnitPrice']


customer_summary = data.groupby('CustomerID').agg(
    TotalSpend=('TotalSpend', 'sum'),
    NumTransactions=('InvoiceNo', 'nunique')
).reset_index()

X = customer_summary[['TotalSpend', 'NumTransactions']]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
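
Scaling matters here because TotalSpend and NumTransactions live on very different ranges, and KMeans relies on Euclidean distance. As a minimal sketch, the snippet below checks on made-up numbers that `StandardScaler` matches the usual z-score, (x - mean) / std, computed per column with the population standard deviation. The array `X_small` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 resembles spend, column 1 resembles a count
X_small = np.array([[1000.0, 2.0],
                    [2000.0, 4.0],
                    [3000.0, 6.0]])

scaled = StandardScaler().fit_transform(X_small)

# Manual z-score per column; np.std uses ddof=0 (population std) by default
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```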

inertia = []
for k in range(1, 11):  # Check for 1 to 10 clusters
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), inertia, marker='o', color='b')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
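
Reading an elbow from the inertia curve is a judgment call, so a second criterion can be a useful cross-check. The sketch below uses the silhouette score, which is a different technique from the elbow method used in this notebook, on synthetic data with three well-separated groups. The names `X_demo`, `scores`, and `best_k` are hypothetical, and the result says nothing about the right k for the retail data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three tight, well-separated groups (illustration only)
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# The silhouette score needs at least 2 clusters, so start at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo_scaled)
    scores[k] = silhouette_score(X_demo_scaled, labels)

# Higher is better; pick the k with the largest score
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On the customer data, the same loop could be run over `X_scaled`, though the silhouette score can be slow on large inputs because it compares pairs of points.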

 

optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42)

customer_summary['Cluster'] = kmeans.fit_predict(X_scaled)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=customer_summary['TotalSpend'],
                y=customer_summary['NumTransactions'],
                hue=customer_summary['Cluster'],
                palette='Set2', s=100, alpha=0.6)
plt.title('Customer Segmentation based on Total Spend and Number of Transactions')
plt.xlabel('Total Spend')
plt.ylabel('Number of Transactions')
plt.legend(title='Cluster')
plt.show()
 

cluster_summary = customer_summary.groupby('Cluster')[['TotalSpend', 'NumTransactions']].mean()

print("\nCluster Summary (Mean values for each cluster):")
print(cluster_summary)

Cluster Summary (Mean values for each cluster):
           TotalSpend  NumTransactions
Cluster
0        1.342493e+03         4.458728
1        1.447682e+06      3710.000000
2        3.416323e+04        57.853659
3        1.821820e+05        89.000000
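
Cluster means alone do not show how many customers sit in each cluster, and a whole-number mean such as 3710.000000 may point to a cluster with very few members. The sketch below adds a per-cluster count and median, and converts the fitted centers back to original units with `inverse_transform`. It runs on made-up numbers; `toy_summary`, `profile`, and the other names are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-customer table with one extreme spender (illustration only)
toy_summary = pd.DataFrame({
    "TotalSpend": [100.0, 120.0, 90.0, 5000.0, 5200.0, 100000.0],
    "NumTransactions": [1, 2, 1, 20, 22, 300],
})
features = ["TotalSpend", "NumTransactions"]

toy_scaler = StandardScaler()
toy_X = toy_scaler.fit_transform(toy_summary[features])

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
toy_summary["Cluster"] = toy_kmeans.fit_predict(toy_X)

# Size, mean and median per cluster; the size exposes single-customer clusters
profile = toy_summary.groupby("Cluster").agg(
    Customers=("TotalSpend", "size"),
    MeanSpend=("TotalSpend", "mean"),
    MedianSpend=("TotalSpend", "median"),
    MeanTransactions=("NumTransactions", "mean"),
)

# Cluster centers are in scaled space; map them back to spend and counts
centers_original_units = pd.DataFrame(
    toy_scaler.inverse_transform(toy_kmeans.cluster_centers_),
    columns=features,
)

print(profile)
print(centers_original_units)
```

Applied to `customer_summary`, the same `agg` call would show whether clusters 1 and 3 describe customer groups or only a handful of outliers.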
