Report
BELGAUM 590014
FRESHERSLABS
INTERNSHIP REPORT ON
Submitted in partial fulfilment for the requirements of the VII Semester degree of
BACHELOR OF ENGINEERING
IN
COMPUTER SCIENCE & ENGINEERING
During the Academic year 2023-2024
Submitted By
MOURYA H M (1DB20CS072)
CERTIFICATE
This is to certify that the Automata Research Laboratory Internship entitled “Customer
Segmentation: Use clustering algorithms to segment customers for a retail
business” is a bona fide report carried out by MOURYA H M (1DB20CS072), student of DON
BOSCO INSTITUTE OF TECHNOLOGY, in partial fulfilment for the award of the degree of
Bachelor of Engineering in Computer Science and Engineering of the Visvesvaraya Technological
University, Belgaum, during the academic year 2023-24. It is certified that all corrections /
suggestions indicated for Internal Assessment have been incorporated in the report deposited in the
departmental library. The technical seminar has been approved as it satisfies the academic
requirements in respect of the technical seminar prescribed for the Bachelor of Engineering Degree.
The satisfaction and euphoria that accompany the successful completion of any internship would be
incomplete without the mention of the people who made it possible, whose constant support and
encouragement made my effort fruitful.
First and foremost, I pay my due regards to this institute, which provided me with a platform and
gave me an opportunity to display my skills through the medium of project work. I express my heartfelt
thanks to our beloved Principal, Dr. B S Nagabhushana, Don Bosco Institute of Technology, Bangalore,
for his encouragement throughout my graduate studies and for providing me with the infrastructure.
I express my deep sense of gratitude and thanks to Dr. K B Shivakumar, Head of the Department,
Computer Science and Engineering, for his valuable insight and suggestions offered during
the course of this technical seminar.
It is my utmost pleasure to acknowledge the kind help extended by my guide, Prof. Ranjeet Kumar,
Assistant Professor, Department of Computer Science, and also my technical seminar coordinator, Dr.
Thippeswamy G R, Prof., Dept. of CSE, for their excellent guidance and cooperation, which
resulted in the successful completion of the technical seminar.
Last but not least, I would like to thank all my friends and family for their help and support in
completing this technical seminar.
MOURYA H M (1DB20CS072)
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
DON BOSCO INSTITUTE OF TECHNOLOGY
BANGALORE-560074
DECLARATION
I, MOURYA H M student of seventh semester B.E, Computer Science and Engineering, Don
Bosco Institute of Technology, Bengaluru declare that the internship entitled “Customer
In today's competitive retail landscape, understanding and effectively catering to the unique
needs and preferences of customers is paramount for business success. This project explores
the application of clustering algorithms to segment customers for a retail business. Customer
segmentation aims to categorize a diverse customer base into distinct groups based on shared
characteristics, allowing businesses to tailor their marketing strategies, product offerings, and
customer service to better meet individual customer needs.
This project leverages the power of data science and machine learning to analyze customer
data, including demographics, purchase history, and browsing behavior. Various clustering
algorithms, such as K-means, hierarchical clustering, and DBSCAN, are employed to partition
customers into meaningful groups. By identifying common patterns and behaviors within each
segment, retailers can gain insights into the specific preferences of each group.
CONTENTS
SL. NO. CHAPTERS
1. INTRODUCTION
2. PROBLEM STATEMENT
3. LITERATURE SURVEY
4. OBJECTIVES
5. SYSTEM REQUIREMENT SPECIFICATION
6. SYSTEM ARCHITECTURE
7. METHODOLOGY
8. TESTING
9. RESULTS
10. CONCLUSION
BIBLIOGRAPHY
Customer Segmentation: Use clustering algorithms to segment customers for a retail business
CHAPTER 1
INTRODUCTION
Customer segmentation is a crucial aspect of any retail business strategy. It involves dividing
a customer base into distinct groups based on certain characteristics or behaviors. This
segmentation helps businesses tailor their marketing efforts, product offerings, and customer
service to better meet the specific needs and preferences of each segment. Clustering
algorithms automatically group similar data points together based on the features or
attributes provided.
CHAPTER 2
PROBLEM STATEMENT
In order to optimize marketing strategies, product offerings, and customer experiences, our retail
business aims to effectively segment our diverse customer base. By leveraging clustering algorithms,
we seek to group customers with similar characteristics, behaviors, and preferences into distinct
segments. The goal is to enhance our ability to target and engage customers with tailored approaches
that meet their specific needs.
CHAPTER 3
LITERATURE SURVEY
Over the years, as competition in the business world has intensified, organizations have had to enhance
their profits and business by satisfying the demands of their existing customers and attracting new customers
according to their needs. Identifying customers and satisfying the demands of each one is a very complex
and tedious task, because customers differ in their demands, tastes, preferences,
and so on. Instead of a “one-size-fits-all” approach, customer segmentation clusters the customers into groups
sharing the same properties or behavioural characteristics. Customer segmentation is thus a strategy
of dividing the market into homogeneous groups.
The data used to divide customers into groups depends on various
factors such as geographical conditions, economic conditions, demographic conditions, and
behavioural patterns. Customer segmentation allows a business to make better use of its
marketing budget, gain a competitive edge over rival companies, and demonstrate better knowledge
of customer needs. It also helps an organization increase its marketing efficiency, identify
new market opportunities, build a better brand strategy, and improve customer retention.
Clustering algorithms generate clusters such that the objects within a cluster are similar with respect to some
characteristics. Similarity is defined in terms of how close the objects are in space. The K-means algorithm is
one of the most popular centroid-based algorithms. Suppose a data set, D, contains n objects in space. Partitioning
methods distribute the objects in D into k clusters, C1,...,Ck, such that Ci ⊂ D and Ci ∩ Cj = ∅ for 1 ≤ i, j ≤ k, i ≠ j.
A centroid-based partitioning technique uses the centroid of a cluster, Ci, to represent that cluster. Conceptually, the
centroid of a cluster is its center point. The difference between an object p ∈ Ci and ci, the representative of
the cluster, is measured by dist(p,ci), where dist(x,y) is the Euclidean distance between two points x and y.
Algorithm: The k-means algorithm for partitioning, where each cluster’s center is represented by the mean
value of the objects in the cluster.
Input: k: the number of clusters; D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the
objects in the cluster;
(4) update the cluster means, that is, calculate the mean value of the objects for each cluster;
(5) until no change.
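The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this project: the function name `k_means`, the `max_iter` and `seed` parameters, and the toy data are assumptions introduced here.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Partition `points` (tuples of floats) into k clusters using the mean of each cluster."""
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):  # Step 2: repeat
        # Step 3: (re)assign each object to the nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda i: math.dist(p, centers[i])) for p in points
        ]
        # Step 5: stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: update each center to the mean of its current members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old center if a cluster becomes empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Toy data (made up): two well-separated groups of (income, spending score) pairs.
toy_points = [(15.0, 80.0), (16.0, 82.0), (17.0, 78.0),
              (90.0, 10.0), (92.0, 12.0), (88.0, 14.0)]
toy_centers, toy_labels = k_means(toy_points, k=2)
```

On this toy input the two groups end up in separate clusters, with centers at (16, 80) and (90, 12). In practice a library implementation such as `sklearn.cluster.KMeans`, which this report imports in Chapter 8, is usually preferable.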
CHAPTER 4
OBJECTIVES
The project objectives for customer segmentation using clustering algorithms in a retail business typically
revolve around gaining insights into customer behavior, improving marketing strategies, and enhancing
overall business performance. Here are some specific objectives for such a project:
CHAPTER 5
SYSTEM REQUIREMENT SPECIFICATION
➢ Hardware
Processor:- Intel(R) Celeron® CPU [email protected]
Installed memory (RAM):- 4.00GB
System type:- 64-bit operating system, x64-based processor
➢ Software
OS:- Windows 10
Version:- 10.0.17134.829
➢ Python installation
Anaconda installers
Jupyter Notebook
Windows
Python 3.8
CHAPTER 6
SYSTEM ARCHITECTURE
• Data Collection and Storage: Collect customer data from various sources.
• Monitoring and Feedback: Continuously monitor model performance and customer segments.
CHAPTER 7
METHODOLOGY
The data set used to implement clustering with the K-means algorithm was collected from a store in a shopping mall.
The data set contains 5 attributes and 200 tuples, representing the data of 200 customers. The attributes are
CustomerId, gender, age, annual income (k$), and spending score (on a scale of 1-100).
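As a rough illustration of this layout, the sketch below builds a three-row table with the five attributes and runs a few sanity checks. The exact column headers, the shortened names `Annual_Income` and `Spending_Score`, and all values are assumptions made for the example. They are not taken from the actual file.

```python
import pandas as pd

# Three made-up rows in the layout described above (values are illustrative only).
sample = pd.DataFrame({
    "CustomerId": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 35, 52],
    "Annual Income (k$)": [15, 60, 88],
    "Spending Score (1-100)": [39, 55, 13],
})

# Assumed renaming: shorter, identifier-friendly column names.
sample = sample.rename(columns={
    "Annual Income (k$)": "Annual_Income",
    "Spending Score (1-100)": "Spending_Score",
})

# Basic sanity checks before clustering: shape, missing values, value range.
n_rows, n_cols = sample.shape
missing = int(sample.isna().sum().sum())
score_in_range = bool(sample["Spending_Score"].between(1, 100).all())
```

For the real data set the same checks would be expected to report 200 rows and 5 columns.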
CHAPTER 8
TESTING
1. Importing the libraries and the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns
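data = pd.read_csv('data/Mall_Customers.csv')  # forward slash avoids the invalid '\M' escape sequence
# Assumed step (not shown in the original listing): use the short column names that the plots below rely on.
# rename() leaves the frame unchanged if the CSV already uses these names.
data = data.rename(columns={'Annual Income (k$)': 'Annual_Income', 'Spending Score (1-100)': 'Spending_Score'})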
fig, ax = plt.subplots(figsize=(10,8))
sns.set(font_scale=1.5)
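corr = data.select_dtypes(include='number').corr()  # correlation matrix of the numeric columns (missing from the listing)
ax = sns.heatmap(corr, cmap = 'Reds', annot = True, linewidths=0.5, linecolor='black', ax=ax)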
plt.title('Heatmap for the Data', fontsize = 20)
plt.show()
data['Gender'].head()
# data['Gender'].unique()
Counts of each type in the Gender Column using value_counts().
# data['Gender'].value_counts()
Plotting Gender Distribution on Bar graph and the ratio of distribution using Pie Chart.
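# Count customers per gender; ascending order puts the smaller group first.
values = data['Gender'].value_counts(ascending=True)
labels = values.index  # take the labels from the counts so they always match the values
# Reconstructed: the listing omits the figure creation and the bar plot that the text describes.
fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(15,8))
sns.countplot(x=data['Gender'], ax=ax0)
ax0.set_ylim(0,130)
# Look counts up by label instead of by position, so each line is drawn at the right height.
ax0.axhline(y=values['Female'], color='#d400ad', linestyle='--', label=f"Female ({values['Female']})")
ax0.axhline(y=values['Male'], color='#42a7f5', linestyle='--', label=f"Male ({values['Male']})")
ax0.legend()
ax1.pie(values, labels=labels, colors=['#42a7f5','#d400ad'], autopct='%1.1f%%')
ax1.set(title='Ratio of Gender Distribution')
fig.suptitle('Gender Distribution', fontsize=30)
plt.show()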
maxi = data[data['Gender']=='Female'].Age.value_counts().max()
mean = data[data['Gender']=='Female'].Age.value_counts().mean()
mini = data[data['Gender']=='Female'].Age.value_counts().min()
fig, ax = plt.subplots(figsize=(20,8))
sns.set(font_scale=1.5)
ax = sns.countplot(x=data[data['Gender']=='Female'].Age, palette='spring')
ax.axhline(y=maxi, linestyle='--',color='#c90404', label=f'Max Age Count ({maxi})')
ax.axhline(y=mean, linestyle='--',color='#eb50db', label=f'Average Age Count ({mean:.1f})')
ax.axhline(y=mini, linestyle='--',color='#046ebf', label=f'Min Age Count ({mini})')
ax.set_ylabel('No. of Customers')
ax.legend(loc ='right')
plt.title('Age Distribution in Female Customers', fontsize = 20)
plt.show()
fig, ax = plt.subplots(figsize=(5,8))
sns.set(font_scale=1.5)
ax = sns.boxplot(y=data["Annual_Income"], color="#f73434")
ax.axhline(y=data["Annual_Income"].max(), linestyle='--',color='#c90404', label=f'Max Income ({data.Annual_Income.max()})')
# quantile() replaces positional indexing into describe(), which is fragile and deprecated in recent pandas.
ax.axhline(y=data["Annual_Income"].quantile(0.75), linestyle='--',color='#f74343', label=f'75% Income ({data.Annual_Income.quantile(0.75):.2f})')
ax.axhline(y=data["Annual_Income"].median(), linestyle='--',color='#eb50db', label=f'Median Income ({data.Annual_Income.median():.2f})')
ax.axhline(y=data["Annual_Income"].quantile(0.25), linestyle='--',color='#eb50db', label=f'25% Income ({data.Annual_Income.quantile(0.25):.2f})')
ax.axhline(y=data["Annual_Income"].min(), linestyle='--',color='#046ebf', label=f'Min Income ({data.Annual_Income.min()})')
ax.legend(fontsize='xx-small', loc='upper right')
ax.set_ylabel('Annual Income (in Thousand USD)')  # the y-axis shows income, not a customer count
plt.title('Annual Income (in Thousand USD)', fontsize = 20)
plt.show()
data['Annual_Income'].value_counts().head()
Visualizing Annual Income count value distribution on a histogram.
fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
ax = sns.histplot(data['Annual_Income'], bins=15, ax=ax, color='orange')
ax.set_xlabel('Annual Income (in Thousand USD)')
plt.title('Annual Income count Distribution of Customers', fontsize = 20)
plt.show()
fig, ax = plt.subplots(figsize=(15,7))
sns.set(font_scale=1.5)
ax = sns.scatterplot(y=data['Spending_Score'], x=data['Age'], hue=data['Gender'], palette='seismic', s=70,edgecolor='black',
linewidth=0.3)
ax.set_ylabel('Spending Scores')
ax.legend(loc ='upper right')
plt.title('Spending Score per Age by Gender', fontsize = 20)
plt.show()
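The cluster plots in this chapter use a fitted model `kms` and a DataFrame `clusters` with a `Cluster_Prediction` column, but the listing does not show how they were created. The sketch below is one plausible way to obtain them, assuming k = 5 (five clusters are plotted) and the two features `Annual_Income` and `Spending_Score` (as the centroid indexing suggests). To stay self-contained it runs on synthetic stand-in data, not the mall data set. The elbow check is a common heuristic for choosing k and is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in data: five synthetic blobs (NOT the mall data set), so the sketch runs on its own.
rng = np.random.default_rng(0)
blob_centers = [(25, 20), (25, 80), (55, 50), (88, 15), (88, 82)]
demo = pd.DataFrame(
    np.vstack([rng.normal(c, 4.0, size=(40, 2)) for c in blob_centers]),
    columns=["Annual_Income", "Spending_Score"],
)

# Elbow check: inertia (within-cluster sum of squares) for k = 1..10.
inertia = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(demo).inertia_
    for k in range(1, 11)
]

# Fit the final model with k = 5 and attach the predicted label to each row.
kms = KMeans(n_clusters=5, n_init=10, random_state=0)
clusters = demo.copy()
clusters["Cluster_Prediction"] = kms.fit_predict(demo)
```

With the real data, `demo` would be replaced by `data[['Annual_Income', 'Spending_Score']]`. Note that the integer labels assigned by K-means are arbitrary, which is why the plots map them to display names such as 'Cluster 1' by hand.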
fig, ax = plt.subplots(figsize=(15,7))
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 4]['Annual_Income'],
y=clusters[clusters['Cluster_Prediction'] == 4]['Spending_Score'],
s=70,edgecolor='black', linewidth=0.3, c='orange', label='Cluster 1')
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 0]['Annual_Income'],
y=clusters[clusters['Cluster_Prediction'] == 0]['Spending_Score'],
s=70,edgecolor='black', linewidth=0.3, c='deepskyblue', label='Cluster 2')
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 2]['Annual_Income'],
y=clusters[clusters['Cluster_Prediction'] == 2]['Spending_Score'],
s=70,edgecolor='black', linewidth=0.2, c='Magenta', label='Cluster 3')
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 1]['Annual_Income'],
y=clusters[clusters['Cluster_Prediction'] == 1]['Spending_Score'],
s=70,edgecolor='black', linewidth=0.3, c='red', label='Cluster 4')
plt.scatter(x=clusters[clusters['Cluster_Prediction'] == 3]['Annual_Income'],
y=clusters[clusters['Cluster_Prediction'] == 3]['Spending_Score'],
s=70,edgecolor='black', linewidth=0.3, c='lime', label='Cluster 5')
plt.scatter(x=kms.cluster_centers_[:, 0], y=kms.cluster_centers_[:, 1], s = 120, c = 'yellow', label = 'Centroids',edgecolor='black',
linewidth=0.3)
plt.legend(loc='right')
plt.xlim(0,140)
plt.ylim(0,100)
plt.xlabel('Annual Income (in Thousand USD)')
plt.ylabel('Spending Score')
plt.title('Clusters', fontsize = 20)
plt.show()
fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(15,20))
# (predicted label, colour, display name) for each panel, matching the combined plot above.
# The panels for Cluster 2 and Cluster 4 are reconstructed; the original listing lost those lines.
panels = [(4, 'orange', 'Cluster 1'), (0, 'deepskyblue', 'Cluster 2'), (2, 'magenta', 'Cluster 3'),
          (1, 'red', 'Cluster 4'), (3, 'lime', 'Cluster 5')]
for axis, (label, colour, name) in zip(ax.flat, panels):
    members = clusters[clusters['Cluster_Prediction'] == label]
    axis.scatter(x=members['Annual_Income'], y=members['Spending_Score'],
                 s=40, edgecolor='black', linewidth=0.3, c=colour, label=name)
    # Label the centroid only once so the figure legend has a single 'Centroids' entry.
    axis.scatter(x=kms.cluster_centers_[label,0], y=kms.cluster_centers_[label,1],
                 s=120, c='yellow', edgecolor='black', linewidth=0.3,
                 label='Centroids' if name == 'Cluster 5' else None)
    axis.set(xlim=(0,140), ylim=(0,100), xlabel='Annual Income', ylabel='Spending Score', title=name)
fig.delaxes(ax[2,1])  # remove the unused sixth panel
fig.legend(loc='right')
fig.suptitle('Individual Clusters')
plt.show()
CHAPTER 9
RESULTS
CHAPTER 10
CONCLUSION
From the above visualization it can be observed that Cluster 1 denotes customers with a high annual
income as well as a high yearly spend. Cluster 2 represents customers with a high annual income and a low
annual spend. Cluster 3 represents customers with a low annual income and a low annual spend. Cluster 5
denotes customers with a low annual income but a high yearly spend. Cluster 4 and Cluster 6 denote customers with
medium income and a medium spending score.
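A reading of this kind can be cross-checked numerically by averaging income and spending score per cluster. The sketch below uses made-up labelled rows, and the thresholds in the helper `level` are hypothetical. Neither comes from the project's results.

```python
import pandas as pd

# Made-up labelled customers; in the project these would come from the K-means output.
labelled = pd.DataFrame({
    "Annual_Income":      [85, 90, 88, 92, 20, 25, 22, 28, 55, 60],
    "Spending_Score":     [80, 85, 15, 10, 18, 12, 78, 82, 50, 48],
    "Cluster_Prediction": [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
})

# Mean income and spending score per cluster.
profile = labelled.groupby("Cluster_Prediction")[["Annual_Income", "Spending_Score"]].mean()

def level(value, low=40, high=70):
    """Hypothetical thresholds for turning a mean into low / medium / high."""
    if value < low:
        return "low"
    return "high" if value > high else "medium"

profile["Income_Level"] = profile["Annual_Income"].map(level)
profile["Spending_Level"] = profile["Spending_Score"].map(level)
print(profile)
```

Each row of `profile` then describes one segment, for example high income with low spending, which is the kind of summary a marketing team can act on.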
BIBLIOGRAPHY
[1] I. S. Dhillon and D. M. Modha, “Concept decompositions for large sparse text data using clustering,”
Machine Learning, vol. 42, issue 1, pp. 143-175, 2001.
[2] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient
K-means clustering algorithm,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24,
pp. 881-892, 2002.
[3] D. MacKay, “An Example Inference Task: Clustering,” Information Theory, Inference and
Learning Algorithms, Cambridge University Press, pp. 284-292, 2003.
[4] J. Han, M. Kamber, and J. Pei, “Data Mining: Concepts and Techniques,” Third Edition.
[5] D. Aloise, A. Deshpande, P. Hansen, and P. Popat, “NP-hardness of Euclidean sum-of-squares
clustering,” Machine Learning, vol. 75, pp. 245-249, 2009.
[6] S. Dasgupta and Y. Freund, “Random Trees for Vector Quantization,” IEEE Trans. on Information Theory,
vol. 55, pp. 3229-3242, 2009.
[7] P. Premkanth, “Market Segmentation and Its Impact on Customer Satisfaction with Especial
Reference to Commercial Bank of Ceylon PLC,” Global Journal of Management and Business.