Intro Qugates

The document outlines a data normalization and clustering process using the Standard Scaler and Agglomerative Clustering. It details the creation of a synthetic customer dataset, the normalization of features, and the clustering of customers into four groups based on selected features. Additionally, it includes steps for visualizing the data distributions and cluster characteristics through various plots.


Data Normalization (Standard Scaler)

Before performing clustering, it is crucial to normalize the data, especially when features are on
different scales. For example, age might range from 18 to 70, whereas monthly_spending could
range from 50 to 500.
StandardScaler is used here to standardize each feature by subtracting its mean and scaling it to unit
variance, making the features directly comparable in the distance computations used by the clustering algorithm.
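
As a quick illustration (an addition, not part of the original code), the snippet below shows that StandardScaler produces the same result as computing the z-score by hand on a single column: subtract the column mean and divide by the population standard deviation. The toy monthly_spending values are made up for the example.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column standing in for monthly_spending (values are illustrative only)
toy = pd.DataFrame({'monthly_spending': [50.0, 120.0, 300.0, 480.0]})

# Manual z-score: subtract the mean, divide by the population standard deviation (ddof=0),
# which is what StandardScaler uses internally
manual = (toy['monthly_spending'] - toy['monthly_spending'].mean()) / toy['monthly_spending'].std(ddof=0)

# Same transformation via StandardScaler
scaled = StandardScaler().fit_transform(toy[['monthly_spending']]).ravel()

print(np.allclose(manual.to_numpy(), scaled))  # True: both are zero-mean, unit-variance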

Clustering Process

Feature Selection: Only a subset of features (age, tenure, monthly_spending) is selected for clustering.
Agglomerative Clustering:
- Uses linkage='ward' to minimize the within-cluster variance at each merge step.
- The n_clusters=4 parameter specifies that the data will be grouped into 4 clusters.
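
As a side note (an addition, not from the original document), the same ward-linkage grouping can also be obtained from SciPy's linkage matrix by cutting the tree into a fixed number of flat clusters with fcluster. This can serve as a cross-check on the labels produced by AgglomerativeClustering; the small random matrix X below merely stands in for the scaled feature matrix built later in the code.

import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # stand-in for scaled (age, tenure, monthly_spending)

Z = sch.linkage(X, method='ward')                           # linkage matrix (also used for the dendrogram)
scipy_labels = sch.fcluster(Z, t=4, criterion='maxclust')   # cut the tree into 4 flat clusters

sk_labels = AgglomerativeClustering(linkage='ward', n_clusters=4).fit_predict(X)

# The two label arrays normally describe the same partition, up to a renaming of cluster ids
print(scipy_labels)
print(sk_labels)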

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

# Step 1: Data Aggregation


# 1.1 Create a synthetic customer dataset with random values
np.random.seed(42)
n = 500 # Number of customers

# Features: age, tenure, monthly_spending, number of products


data = {
    'age': np.random.randint(18, 70, size=n),                 # Age between 18 and 70
    'tenure': np.random.randint(1, 10, size=n),               # Tenure between 1 and 10 years
    'monthly_spending': np.random.uniform(50, 500, size=n),   # Monthly spending between 50 and 500
    'num_products': np.random.randint(1, 6, size=n)           # Number of products between 1 and 5
}

df = pd.DataFrame(data)

# 1.2 Introduce missing values in 'monthly_spending' column


df.loc[::10, 'monthly_spending'] = np.nan # Set every 10th value as NaN

# 1.3 Handle missing values by filling them with the mean of the column
df['monthly_spending'] = df['monthly_spending'].fillna(df['monthly_spending'].mean())

# 1.4 Visualize the distribution of key features


sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))

# Plot the distribution of key features


for i, feature in enumerate(['age', 'tenure', 'monthly_spending', 'num_products'], 1):
    plt.subplot(2, 2, i)
    sns.histplot(df[feature], kde=True, color="teal")
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel("Frequency")

plt.tight_layout()
plt.show()

# 1.5 Normalize the numerical columns (age, tenure, monthly_spending, num_products) using StandardScaler
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Step 2: Clustering Using Hierarchical Clustering
# 2.1 Select features for clustering (scaled data)
X = df_scaled[['age', 'tenure', 'monthly_spending']]

# 2.2 Perform Agglomerative Clustering


agg_clust = AgglomerativeClustering(linkage='ward', n_clusters=4) # We start with 4 clusters
df['cluster'] = agg_clust.fit_predict(X)

# 2.3 Plot a dendrogram to visualize the hierarchical clustering process


plt.figure(figsize=(10, 7))
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram of Customer Segments')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distance')
plt.show()

# Step 3: Cluster Evaluation


# 3.1 Analyze the characteristics of each cluster using mean, median, and std deviation
cluster_means = df.groupby('cluster')[['age', 'tenure', 'monthly_spending', 'num_products']].mean()
cluster_medians = df.groupby('cluster')[['age', 'tenure', 'monthly_spending', 'num_products']].median()
cluster_std = df.groupby('cluster')[['age', 'tenure', 'monthly_spending', 'num_products']].std()

# 3.2 Print out the cluster statistics (mean, median, std)


print("Cluster Means:\n", cluster_means)
print("\nCluster Medians:\n", cluster_medians)
print("\nCluster Standard Deviations:\n", cluster_std)

# Step 4: Cluster Profiling


# 4.1 Visualize the clusters using a pairplot, color-coded by cluster labels
sns.pairplot(df[['age', 'tenure', 'monthly_spending', 'num_products', 'cluster']], hue='cluster', palette='Set2')
plt.suptitle("Pairplot of Customer Features by Cluster", y=1.02)
plt.show()

# 4.2 Visualize clusters using scatter plots


# Scatter plot of Age vs Monthly Spending
plt.figure(figsize=(10, 6))
sns.scatterplot(x='age', y='monthly_spending', hue='cluster', data=df, palette='Set2', s=100, alpha=0.7)
plt.title('Customer Segments: Age vs Monthly Spending')
plt.xlabel('Age')
plt.ylabel('Monthly Spending')
plt.show()

# Scatter plot of Tenure vs Number of Products


plt.figure(figsize=(10, 6))
sns.scatterplot(x='tenure', y='num_products', hue='cluster', data=df, palette='Set2', s=100, alpha=0.7)
plt.title('Customer Segments: Tenure vs Number of Products')
plt.xlabel('Tenure (Years)')
plt.ylabel('Number of Products')
plt.show()
