ML Review PPT 2


SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTING TECHNOLOGIES
21CSC305P- MINOR PROJECT

Customer Segmentation
Batch ID: 16

Guide name: Dr. Poornima S
Designation: Associate Professor
Department: C.Tech

Reg. No: RA2211003011093    Name: M. Durga Prasad
Reg. No: RA221100301104     Name: K. Yaswanth
Reg. No: RA2211003011107    Name: S. Tejaswi
Reg. No: RA2211003011123    Name: V. Yaswanth

Introduction
In today's competitive market, businesses must understand their customers at a granular
level to effectively meet their needs and enhance customer loyalty. Customer
segmentation is a powerful data-driven approach that divides a company's customer
base into distinct groups based on similar characteristics, such as purchasing behavior,
demographic information, and preferences. By identifying these segments, companies
can develop personalized marketing strategies, optimize product offerings, and
improve customer engagement. The objective of this project is to leverage data analysis
to create actionable customer segments, enabling more targeted and efficient business
decisions.

31/08/2024
Problem Statement
The traditional approach to customer segmentation is often manual, subjective, and
inefficient, leading to missed opportunities for personalized marketing and customer
engagement.

As businesses grow, understanding large and diverse customer bases becomes
increasingly difficult, resulting in generalized marketing strategies that fail to address
specific customer needs and preferences.

Without a clear understanding of different customer segments, companies risk losing
valuable customers by not providing targeted offers or services that resonate with their
unique behaviors and characteristics.

The challenge is to develop an automated system that can efficiently analyze customer
data, identify distinct segments, and provide actionable insights, allowing businesses to
tailor their marketing efforts, improve customer satisfaction, and drive growth.

Objectives
• Automate Customer Segmentation: Develop a machine learning system to automatically analyze
and segment customers based on behavioral, demographic, and transactional data, reducing manual
effort in identifying key segments.

• Enhance Personalization: Improve the precision of marketing strategies by identifying distinct
customer groups, enabling the delivery of personalized offers, products, and communications that
resonate with each segment.

• Optimize Resource Allocation: Help businesses allocate marketing and sales resources more
efficiently by focusing efforts on the most valuable customer segments, maximizing ROI.

• Scalability: Ensure the system can process and analyze large volumes of customer data, making it
adaptable to businesses of varying sizes and industries.

• Real-time Insights: Provide real-time or near-real-time customer segmentation, allowing businesses
to quickly adapt to changing customer behaviors and market trends.

Literature Review

1. Title: "Customer Segmentation Using Hierarchical Clustering" (2024)
   Authors: Areeba Afzal, Laiba Khan, Muhammad Zunnurain Hussain, Muzzamil Mustafa, Aqsa Khalid, Nawaz Khan
   Inference: Our dataset encompasses a diverse range of mall customers, spanning demographics and behavioral attributes. Hierarchical clustering systematically groups customers into clusters, revealing distinct segments within the mall's customer base. A comprehensive analysis of these clusters unveils profound insights into customer tendencies, preferences, and purchasing habits. By leveraging hierarchical clustering for mall customer segmentation, businesses can enhance customer satisfaction, drive sales, and foster lasting customer relationships.
   Link: https://ieeexplore.ieee.org/document/10543349

2. Title: "Customer Segmentation using K-means Clustering" (2018)
   Authors: Tushar Kansal, Suraj Bahuguna, Vishal Singh, Tanupriya Choudhury
   Inference: The process of placing customers with similar behaviours into the same segment and customers with different patterns into different segments is called customer segmentation. In this paper, 3 different clustering algorithms (k-Means, Agglomerative, and Meanshift) are implemented to segment the customers, and the clusters obtained from the algorithms are then compared.
   Link: https://ieeexplore.ieee.org/document/8769171

3. Title: "An efficiency analysis on the TPA clustering methods for intelligent customer segmentation" (2017)
   Authors: Ananthi Sheshasaayee, Santhosh S, L. Logeshwari
   Inference: Nowadays, commercial marketing growth is improved by customer segmentation models. The literature uses data mining technology to review customer segmentation and its effectiveness. Stages of CRM have been used in most of the cases. Based on RFM, demographic, and LTV data, the paper is prepared using data mining tools for new customer segmentation.
   Link: https://ieeexplore.ieee.org/document/7975573

4. Title: "Market segmentation using ML" (2023)
   Authors: Juhi Singh, Kritika Jaiswal, Minal Singh, Muskan Sama, Swasti Singhal
   Inference: Market segmentation is an approach whose aim is to identify and outline the market segments that an organization can target in its marketing plans. Market segmentation is used not only for selling commodities and services but also plays a crucial role in meeting customers' needs, because without customers there is no business. Satisfying a customer's need is therefore really important, hence the need for market segmentation. The general objective of this research is to analyze various factors which influence the student's admission process in various private institutions.
   Link: https://ieeexplore.ieee.org/document/10150639
Proposed System / Methodology
Approach
The objective is to perform customer segmentation using unsupervised machine learning techniques. The two
primary clustering methods—K-Means Clustering and Hierarchical Clustering—will be applied to identify
distinct customer groups based on their behaviors and characteristics. The approach will include the following
steps:

• Data Collection and Understanding:
    • Load the customer segmentation dataset.
    • Understand the data structure, data types, and identify key features relevant to segmentation.
• Data Preprocessing:
    • Handle missing values by removing or imputing them.
    • Standardize the data using techniques like StandardScaler to ensure that all features contribute equally to the clustering process.
    • Select only the numeric columns that are meaningful for clustering (e.g., customer age, income, spending score, etc.).
• Clustering Analysis:
    • K-Means Clustering:
        • Apply the Elbow Method to determine the optimal number of clusters.
        • Fit the K-Means model with the chosen number of clusters.
        • Assign cluster labels to each customer and analyze the results.
    • Hierarchical Clustering:
        • Generate a dendrogram to visualize hierarchical relationships and determine the optimal number of clusters.
        • Fit the Hierarchical Clustering model based on the determined number of clusters.
        • Assign cluster labels to each customer and analyze the results.

• Model Evaluation and Comparison:
    • Evaluate the clustering performance using the Silhouette Score to measure the quality of clusters.
    • Compare the results of K-Means and Hierarchical Clustering based on visualization, interpretability, and silhouette score.
    • Select the best-performing model to understand distinct customer segments.
• Insights and Interpretation:
    • Interpret the clusters formed by the chosen model to derive actionable insights for marketing strategies, customer targeting, and personalized promotions.
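
The methodology steps above can be sketched end to end as follows. This is only a minimal sketch under stated assumptions, not the project's notebook: synthetic data from make_blobs stands in for the customer dataset, the value of chosen_k is fixed by hand where one would normally read it off the elbow plot or dendrogram, and all variable names are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the numeric customer features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Preprocessing: standardize so every feature contributes equally.
X_scaled = StandardScaler().fit_transform(X)

# Elbow Method: record the within-cluster sum of squares (inertia) per k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

# Assumption: k picked by hand; in practice it is read off the elbow plot.
chosen_k = 4

# Fit both clustering methods with the same number of clusters.
kmeans_labels = KMeans(
    n_clusters=chosen_k, n_init=10, random_state=42
).fit_predict(X_scaled)
hier_labels = AgglomerativeClustering(
    n_clusters=chosen_k, linkage="ward"
).fit_predict(X_scaled)

# Evaluation: compare the two models on the same data with the same metric.
scores = {
    "K-Means": silhouette_score(X_scaled, kmeans_labels),
    "Hierarchical": silhouette_score(X_scaled, hier_labels),
}
best_model = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: silhouette = {score:.4f}")
print(f"Best model by silhouette score: {best_model}")
```

Scoring both models on the same scaled matrix keeps the comparison fair; plotting inertias against k (for example with Matplotlib) gives the elbow curve.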

Architectural Diagram

Technologies/Tool Used:
• Python Programming Language:
• Python will be used as the primary programming language due to its wide range of libraries and tools for
data analysis and machine learning.
• Jupyter Lab/Notebook:
• Jupyter Lab or Notebook will be used as the development environment for running Python code,
visualizing results, and documenting the analysis process interactively.
• Data Manipulation and Analysis Libraries:
• Pandas: For loading and preprocessing the dataset (handling missing values, data manipulation).
• NumPy: For numerical operations and mathematical computations.
• Data Visualization Libraries:
• Matplotlib: For creating basic visualizations (Elbow method plot, scatter plots).
• Seaborn: For enhanced visualization (plotting clusters, dendrograms).
• Machine Learning and Clustering Libraries:
• Scikit-Learn (sklearn): For implementing K-Means Clustering, Hierarchical Clustering, data scaling
(StandardScaler), and evaluation metrics (Silhouette Score).
• SciPy: For hierarchical clustering and dendrogram plotting.
• Data Standardization:
• StandardScaler from sklearn.preprocessing to standardize features (zero mean, unit variance) before applying clustering algorithms.
Conclusion

• Enhanced Customer Targeting: Our customer segmentation model identifies distinct
customer groups based on behavioral, demographic, and transactional data, enabling
businesses to craft personalized marketing strategies tailored to each segment.
• Optimized Resource Allocation: By grouping customers into meaningful segments,
businesses can allocate their marketing, sales, and service resources more efficiently, focusing
on high-value groups to maximize ROI.
• Real-Time Insights: Our solution provides up-to-date customer segmentation that adapts to
changing behaviors, helping businesses stay agile, anticipate customer needs, and improve
overall customer satisfaction and retention.

References:-
https://ieeexplore.ieee.org/document/9777194
https://ieeexplore.ieee.org/document/8769171
https://ieeexplore.ieee.org/document/7975573
https://ieeexplore.ieee.org/document/10150639

REVIEW - 2
K-Means Clustering

Algorithm

1. Initialize centroids: Randomly select k initial centroids.

2. Assign points to the nearest cluster: Assign each data point to the closest centroid using
Euclidean distance.

3. Update centroids: Recalculate centroids by averaging the data points in each cluster.

4. Repeat: Repeat assignment and updating steps until convergence or max iterations.

5. Evaluate the clustering: Use metrics like the silhouette score to evaluate clustering
performance.
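
Steps 1 to 4 above can be written out directly in NumPy. The following is a minimal sketch for illustration, not the implementation used in the project (which relies on scikit-learn's KMeans); the function name kmeans, its parameters, and the toy data are assumptions.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy data (assumption): two tight, well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(X, k=2)
print("labels:", labels)
print("centroids:", centroids)
```

Step 5 (evaluation) is left out here; it would apply the silhouette score to the returned labels.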

Silhouette Score Calculation

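1. Cohesion a(i): The mean distance from point i to the other points in its own cluster; smaller values mean the point sits close to its cluster.

2. Separation b(i): The mean distance from point i to the points of the nearest other cluster; larger values mean the point is far from neighboring clusters.

3. Score Range: The per-point score s(i) = (b(i) - a(i)) / max(a(i), b(i)) ranges from -1 to 1. The reported silhouette score is the mean of s(i) over all points, with values closer to 1 indicating well-separated clusters.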
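
To make the calculation concrete, the sketch below computes the silhouette score by hand for a tiny one-dimensional example. It is illustrative only: the function name silhouette_samples_manual and the toy data are assumptions, it expects at least two clusters, and in the project the score comes from sklearn.metrics.silhouette_score.

```python
import numpy as np


def silhouette_samples_manual(X, labels):
    """Per-point silhouette values s(i) = (b - a) / max(a, b)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # common convention: a singleton cluster scores 0
        # Cohesion: mean distance to the other points of the same cluster
        # (dist[i, i] is 0, so dividing by n_same - 1 excludes the point itself).
        a = dist[i, same].sum() / (n_same - 1)
        # Separation: smallest mean distance to any other cluster.
        b = min(
            dist[i, labels == c].mean()
            for c in np.unique(labels)
            if c != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores


# Toy data (assumption): two clusters on a line, {0, 1} and {10, 11}.
X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
per_point = silhouette_samples_manual(X, labels)
print("per-point scores:", per_point)
print("mean silhouette score:", per_point.mean())
```

For the point at 0, a = 1 and b = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5, roughly 0.905; the point at 1 has b = 9.5 and s = 8.5 / 9.5, roughly 0.895.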

Code:-
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Load dataset (update the file path if necessary)
data = pd.read_csv('customer_segmentation_data.csv')

# Selecting only numeric columns for clustering
numeric_features = data.select_dtypes(include=['float64', 'int64'])

# Scale the numeric data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_features)

# Store silhouette scores for each clustering method
silhouette_scores = {}

# K-Means Clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(scaled_data)
data['KMeans_Cluster'] = kmeans_labels

# Calculate Silhouette Score for K-Means
kmeans_silhouette = silhouette_score(scaled_data, kmeans_labels)
silhouette_scores['K-Means'] = kmeans_silhouette
print(f"K-Means Silhouette Score: {kmeans_silhouette}")
Output

Hierarchical Clustering

Algorithm
1. Compute linkage matrix: Calculate distances between clusters using methods like
ward, single, or complete.

2. Plot dendrogram: Visualize the hierarchical clustering structure with a dendrogram.

3. Cut the dendrogram: Select the number of clusters by cutting the dendrogram at a
specific height.

4. Assign cluster labels: Use the fcluster function to assign data points to clusters.

5. Evaluate the clustering: Calculate the silhouette score to measure cluster cohesion
and separation.

Code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

file_path = 'customer_segmentation_data.csv'
data = pd.read_csv(file_path)

# Select relevant numerical features for clustering
numerical_features = ['Age', 'Income Level', 'Coverage Amount', 'Premium Amount']
data_numerical = data[numerical_features]

# Sample a smaller subset of the data for clustering (to avoid memory issues)
data_sample = data_numerical.sample(n=1000, random_state=42)

# Normalize the sampled data
scaler = StandardScaler()
data_sample_normalized = scaler.fit_transform(data_sample)

# Perform hierarchical clustering using the 'ward' method
linkage_matrix = linkage(data_sample_normalized, method='ward')

# Plot the dendrogram to visualize the hierarchical clustering
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram (Sample)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

# Extract clusters by specifying the number of clusters (e.g., 5 clusters)
num_clusters = 5
cluster_labels = fcluster(linkage_matrix, num_clusters, criterion='maxclust')

# Add the cluster labels to the original sampled data
data_sample_with_clusters = data_sample.copy()
data_sample_with_clusters['Cluster'] = cluster_labels

# Calculate the silhouette score for the clustering
silhouette_avg = silhouette_score(data_sample_normalized, cluster_labels)
print(f'Silhouette Score for {num_clusters} clusters: {silhouette_avg}')

# Display the first few rows of the data with the cluster labels
print(data_sample_with_clusters.head())

# Optionally, save the results to a new CSV file
data_sample_with_clusters.to_csv('customer_segmentation_with_clusters.csv', index=False)

Output

DBSCAN Clustering

Algorithm
1. Initialize DBSCAN: Set parameters eps (maximum distance between two points for them to count as
neighbors) and min_samples (minimum number of points within eps for a point to be a core point).

2. Fit and predict: Apply DBSCAN to scaled_data to assign cluster labels, where -1 represents
outliers.

3. Assign cluster labels: Store the cluster labels in the original dataset.

4. Check cluster count: Ensure multiple clusters are formed (i.e., more than one unique label).

5. Calculate silhouette score: Compute silhouette score to evaluate clustering quality (if there
are multiple clusters).

6. Handle outliers: If only one cluster or all points are outliers, the silhouette score is set as
"N/A."
Code:-
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Load dataset (update the file path if necessary)
data = pd.read_csv('customer_segmentation_data.csv')

# Selecting only numeric columns for clustering
numeric_features = data.select_dtypes(include=['float64', 'int64'])

# Scale the numeric data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_features)

# Store silhouette scores for each clustering method
silhouette_scores = {}

# DBSCAN Clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(scaled_data)
data['DBSCAN_Cluster'] = dbscan_labels

# Calculate Silhouette Score for DBSCAN (if not all labels are outliers)
# Note: silhouette_score treats the noise label -1 as a cluster of its own.
if len(set(dbscan_labels)) > 1:  # Ensure there are multiple clusters
    dbscan_silhouette = silhouette_score(scaled_data, dbscan_labels)
    silhouette_scores['DBSCAN'] = dbscan_silhouette
else:
    dbscan_silhouette = "N/A (Only one cluster or all points are considered outliers)"

print(f"DBSCAN Silhouette Score: {dbscan_silhouette}")
Output

Gaussian Mixture Model Clustering

Algorithm
1. Initialize GMM: Set the number of components (clusters) and a random state for
reproducibility.

2. Fit and predict: Apply the GMM to the scaled_data to assign cluster labels.

3. Assign cluster labels: Store the cluster labels in the original dataset.

4. Calculate silhouette score: Compute the silhouette score to evaluate the quality of clustering.

5. Store silhouette score: Save the silhouette score for comparison with other models.

6. Print the result: Output the silhouette score for GMM clustering.

Code:-
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Load dataset (update the file path if necessary)
data = pd.read_csv('customer_segmentation_data.csv')

# Selecting only numeric columns for clustering
numeric_features = data.select_dtypes(include=['float64', 'int64'])

# Scale the numeric data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(numeric_features)

# Store silhouette scores for each clustering method
silhouette_scores = {}

# Gaussian Mixture Model (GMM)
gmm = GaussianMixture(n_components=5, random_state=42)
gmm_labels = gmm.fit_predict(scaled_data)
data['GMM_Cluster'] = gmm_labels

# Calculate Silhouette Score for GMM
gmm_silhouette = silhouette_score(scaled_data, gmm_labels)
silhouette_scores['GMM'] = gmm_silhouette
print(f"GMM Silhouette Score: {gmm_silhouette}")
Output

Best Model Based on Silhouette Scores
K-Means has the highest silhouette score of 0.1512, which suggests it performs the best in terms
of cluster cohesion and separation compared to the other models. Higher silhouette scores
indicate better-defined clusters with more distinct boundaries between them.

Why Other Models Are Not as Good:

Hierarchical Clustering (Silhouette Score: 0.1493):

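Explanation: Though close to K-Means in performance, the slightly lower silhouette score
suggests that the clusters are marginally less well-defined. This could be due to the hierarchical
nature of the algorithm: merges are made greedily and never undone, so an early merge can fix a
suboptimal boundary. If this score was produced by the hierarchical clustering code shown earlier,
it was computed on a 1,000-row sample of four selected features, whereas the K-Means score uses
all numeric columns of the full dataset, so the two values are not strictly comparable.

Limitation: The chosen linkage criterion might not be as suitable for the given dataset, leading
to clusters that overlap or don't separate as cleanly.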

DBSCAN (Silhouette Score: 0.0708):

Explanation: DBSCAN has the lowest silhouette score, indicating that it struggles to form well-separated
clusters. This could be due to the algorithm treating many points as outliers or forming dense clusters that do
not align well with the data's natural distribution.

Limitation: DBSCAN is sensitive to the choice of eps and min_samples. In this case, these parameters might
not have been optimal, leading to a poor clustering structure.

Gaussian Mixture Model (GMM) (Silhouette Score: 0.1435):

Explanation: GMM assumes that each cluster follows a Gaussian distribution, which might not match the
structure of the data well enough, leading to a slightly lower score than K-Means. It performs well but falls
short of creating clusters with strong separation.

Limitation: The probabilistic nature of GMM can sometimes lead to overlapping clusters if the data doesn't
fit the Gaussian distribution well.
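
The overlap mentioned in the GMM limitation can be inspected through the model's soft assignments. The sketch below is a hedged illustration on synthetic data, not a result from the project: the two overlapping one-dimensional Gaussian groups and the 0.9 confidence cut-off are assumptions chosen for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumption: two overlapping 1-D Gaussian groups stand in for customer data.
rng = np.random.default_rng(42)
X = np.concatenate([
    rng.normal(0.0, 1.0, 200),
    rng.normal(2.0, 1.0, 200),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)

# Soft assignment: one membership probability per component and point.
probs = gmm.predict_proba(X)

# Hard assignment: the most likely component for each point.
hard_labels = gmm.predict(X)

# Share of points whose best component has under 90% probability
# (assumption: 0.9 is an arbitrary cut-off for "ambiguous").
ambiguous_share = float((probs.max(axis=1) < 0.9).mean())
print(f"Share of ambiguous points: {ambiguous_share:.2%}")
```

A large ambiguous share is one sign that the components overlap, which tends to pull the silhouette score toward 0.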

1. Why does K-Means perform better in this case?

K-Means tends to work well when clusters are spherical and of similar size, which might align
with the structure of the scaled data. The simplicity and direct minimization of intra-cluster
variance may have led to more distinct, well-separated clusters compared to other models.

2. Why is DBSCAN's score lower despite its strength in detecting outliers?

DBSCAN is highly sensitive to parameter settings (eps and min_samples). If these are not chosen
carefully, it may either label too many points as outliers or fail to detect clusters properly. In this
case, the lower score indicates that DBSCAN likely failed to capture the true structure of the
data.
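
One common heuristic for choosing eps is to look at each point's distance to its k-th nearest neighbor, with k tied to min_samples. The sketch below is an assumption-laden illustration rather than the tuning used in the project: it runs on synthetic data, and it picks eps as the 95th percentile of the sorted k-distances where one would normally inspect the curve for a knee.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the scaled customer features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

min_samples = 5

# Distance from every point to its min_samples-th nearest neighbor
# (each point counts as its own first neighbor when querying the training set).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])

# Assumption: a crude automatic pick; plotting k_distances and reading off
# the knee is the usual manual alternative.
eps = float(np.percentile(k_distances, 95))

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
noise_fraction = float(np.mean(labels == -1))

print(f"eps = {eps:.3f}, clusters = {n_clusters}, noise = {noise_fraction:.1%}")
```

With eps chosen this way, most points qualify as core points, so only a small fraction is labeled as noise.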

3. What does the difference in silhouette scores tell us about the data?

The relatively low scores across all models suggest that the data may not have clear, well-
separated clusters, or the scaling/preprocessing may not be optimal. The slight differences
between the models highlight how sensitive each algorithm is to the underlying data structure.

4. What is a good silhouette score threshold?

In general, a silhouette score close to 1 indicates excellent clustering, while a score around 0
indicates overlapping clusters or poorly defined boundaries. Scores near 0.15 suggest that
clusters are somewhat defined, but there may be significant overlap or suboptimal separation.
