Final
The goal of our study is to segment customers based on their demographics.
Businesses today face many competitors, each trying to outdo the others.
Customers have many options of the same kind and can be confused about what to
buy, because every person has different preferences. Machine learning offers a
way to address this problem: we can apply algorithms to a customer dataset and
find the target group. Without machine learning, finding a group of customers
with similar preferences would be time-consuming. To segment the customers, we
use K-Means, an unsupervised learning algorithm. K-Means groups data points
with similar attributes, which helps businesses target their customers and
grow. K-Means clustering divides an unlabeled dataset into clusters, where K
specifies the number of clusters to be produced and is fixed in advance. It is
an iterative approach that separates the unlabeled dataset into K distinct
clusters such that each data point belongs to exactly one cluster and the
points in a cluster share a set of characteristics.
TABLE OF CONTENTS
CERTIFICATE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
CHAPTER 1 INTRODUCTION
1.1 INTRODUCTION
1.2 SCOPE
1.3 SOFTWARE DEVELOPMENT METHODOLOGY
1.4 LITERATURE REVIEW
REFERENCES
APPENDIX
LIST OF FIGURES
5.1 ScreenShots
CHAPTER 1 INTRODUCTION
1.1 INTRODUCTION
Work from home (WFH) and study from home are two phrases that emerged as a
result of the COVID-19 global pandemic [1]. Both mean that individuals should
restrict their outdoor activities and remain indoors. To preserve revenue
during the pandemic, hypermarkets developed online shopping platforms, which
have become increasingly popular among consumers for purchasing necessities.
In the present circumstances, this is helpful [2].

Customer segmentation refers to the segmentation of customers based on
demographics and behaviour. Demographics alone do not capture a customer's
individuality, as people in the same age group may have dissimilar interests.
The behavioural side is therefore a better perspective for segmenting
customers and helps produce the right segmentation. The clustering technique
treats data tuples as objects and groups them so that objects within a cluster
are similar to one another and different from objects in other clusters. The
goal of this document is to find consumer subgroups using a data mining
strategy and the K-Means clustering technique, a partitioning algorithm.

The value of customer segmentation lies largely in a business's ability to
tailor a marketing strategy to each customer category. It also supports
identifying the products associated with individual segments and managing
supply and demand. Further tasks include estimating customer attrition,
identifying the customers most likely to experience problems, and raising
further market research questions along with clues to their solutions.

Over the years, increasing competition between businesses and the availability
of large-scale historical data have resulted in the extensive use of data
mining techniques to discover important, strategic information hidden in
organizational data. Data mining is the process of extracting meaningful
information from a dataset and presenting it in a human-accessible way for
decision support. Data mining techniques draw on fields such as statistics,
artificial intelligence, and machine learning, and are applied in areas such as
bioinformatics, weather forecasting, fraud detection, financial analysis, and
customer segmentation. The aim of this paper is to identify customer segments
in a commercial business using a data mining method.

Customer segmentation is the division of a business's customer base into
groups called customer segments, such that each segment consists of customers
who share similar market characteristics. These distinctions are based on
factors that can directly or indirectly influence the market or business, such
as product preferences or expectations, location, and behaviour. The
importance of customer segmentation includes, inter alia: the ability of a
business to customize marketing plans for each segment of its customers;
support for business decisions in risky situations, such as credit
relationships with customers; identification of the products related to
individual segments and management of demand and supply; revealing
interdependencies and interactions between customers, between products, or
between customers and products that the business may not be aware of; and the
ability to predict customer attrition, to identify which customers are most
likely to have problems, and to raise further market research questions and
provide clues to their solutions.

Clustering has proved effective for detecting subtle patterns or relationships
buried in a repository of unlabeled data. This mode of learning is classified
as unsupervised learning. Clustering algorithms include the k-Means algorithm,
the k-nearest algorithm, the Self-Organizing Map (SOM), and more. Without prior
knowledge of the data, these algorithms identify clusters by repeatedly
comparing input patterns until stable clusters are obtained in the training
examples, depending on the subject matter or the process. Each cluster contains
data points that are very similar to one another but differ greatly from the
data points of other clusters.
1.2 SCOPE
In general, the methods used to gather the data for this project can easily be
extended into other relevant contexts/analyses. While there is clear value in
using the same data to investigate purchasing patterns or to build an item based
collaborative filtering recommender system, neither of these is the focus for this
paper. The scope of the paper is limited to the following four intertwined goals:
1. To cluster customers based on common purchasing behaviors for future
operations/marketing projects.
2. To incorporate best mathematical, visual, programming, and business practices
into a thoughtful analysis that is understood across a variety of contexts
and disciplines.
3. To investigate how similar data and algorithms could be used in future data
mining projects.
4. To create an understanding and inspiration of how data science can be used
to solve real-world problems.
Before delving into the details of the project and its implications, the next
chapter discusses what customer segmentation analysis is and the reasons for its
importance.
1.3 SOFTWARE DEVELOPMENT METHODOLOGY
1. Planning: In this stage, the project team defines the goals and objectives of
the project, identifies the data sources and algorithms to be used for clustering,
and establishes the project scope and timeline.
2. Data Collection: In this stage, the project team gathers and cleanses the customer
data to be used for clustering. This data may include demographic information
such as age, gender, income, education level, and location.
3. Data Exploration and Preparation: In this stage, the project team explores the
customer data to identify patterns and trends that may be useful for clustering.
They may also preprocess the data to remove outliers, normalize data, and impute
missing values.
4. Algorithm Selection: In this stage, the project team selects the appropriate
clustering algorithm to be used based on the project goals and data characteristics.
Popular clustering algorithms include K-means, hierarchical clustering, and DBSCAN.
algorithm using programming languages such as Python or R. They also validate the
accuracy of the clustering results.
6. Testing: In this stage, the project team tests the clustering results to ensure they are
accurate and reliable. They may use performance metrics such as silhouette score,
clustering stability, or accuracy rate to evaluate the clustering model.
for use by stakeholders. The project team may also provide documentation and training
materials to facilitate user adoption.
8. Maintenance: In this stage, the project team provides ongoing support and maintenance
for the clustering model. This may include updating the model with new data, fixing bugs
or issues, and providing user support.
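The data exploration and preparation stage listed above (imputing missing
values, removing outliers, normalizing) could be sketched roughly as follows.
This is a minimal illustration under stated assumptions: the `age` and `income`
columns, their values, the median imputation, and the 1.5 x IQR outlier rule are
all hypothetical choices, not the project's actual data or procedure.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical demographic records; the values are made up for illustration.
raw = pd.DataFrame({
    "age":    [23, 35, np.nan, 41, 29, 52, 38, 47],
    "income": [30_000, 52_000, 48_000, np.nan, 39_000, 1_000_000, 61_000, 58_000],
})

# 1. Impute missing values with the column median.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)

# 2. Remove outliers: keep rows that lie within 1.5 * IQR of the quartiles
#    in every column (an assumed rule of thumb).
q1, q3 = imputed.quantile(0.25), imputed.quantile(0.75)
iqr = q3 - q1
within = ((imputed >= q1 - 1.5 * iqr) & (imputed <= q3 + 1.5 * iqr)).all(axis=1)
cleaned = imputed[within]

# 3. Normalize: rescale each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(cleaned)

print(cleaned.shape)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Scaling matters for K-means in particular because the algorithm relies on
Euclidean distance, so an unscaled income column would otherwise dominate age.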
Fig 1.3 SOFTWARE DEVELOPMENT METHODOLOGY
Elbow Method:
The elbow method is based on the observation that increasing the number of
clusters reduces the total within-cluster variance, because more clusters
capture finer groups of data objects that are more similar to each other. To
find the optimal number of clusters, we first run the clustering algorithm for
values of k from 1 to 10 and compute the total within-cluster sum of squares
(WCSS) for each value. We then plot the WCSS against the number of clusters.
The optimal number of clusters is indicated by the bend (the "elbow") in the
plot, beyond which adding clusters yields only small reductions in WCSS.
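As a rough illustration of the procedure, the sketch below computes the WCSS for
k = 1 to 10 and prints the relative drop between consecutive values instead of
plotting. The synthetic data from `make_blobs` and the `drops` helper list are
assumptions made for this example only; they are not the project's dataset or
code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-feature data with four groups (an assumption for illustration).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(X_demo)
    # inertia_ is the sum of squared distances of samples to their closest centroid.
    wcss.append(model.inertia_)

# Relative drop in WCSS when going from k to k + 1 clusters.
drops = [(wcss[i] - wcss[i + 1]) / wcss[i] for i in range(len(wcss) - 1)]
for k, (w, d) in enumerate(zip(wcss, drops), start=1):
    print(f"k={k}: WCSS={w:.1f}, relative drop to k={k + 1}: {d:.2f}")
```

On data with clear group structure, the relative drop typically becomes small
once k passes the number of groups, which is the bend one looks for in the plot.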
not the user has mental illness. These surveys are illness-specific, with one for
depression, another for stress, and so on.
The model aims to identify, analyze, and characterize a person's current state
through a mood tracker, a chatbot, and tests. Python and machine learning were
used for this model. The model develops systems for mental health monitoring,
virtual counselling, precision therapy, and diagnosis by reviewing chatbots
and virtual counselling. The technologies used were AI, machine learning, and
natural language processing for text analysis. The smartphone accesses and
monitors sleep, depression, and anxiety. The study shows early associations
between behaviours and sleep parameters, and agreement between clinic-based
assessments, active smartphone data capture, and passively collected data. The
technologies used in this model were AI, machine learning, and Java. User input
was taken as multiple-choice answers or speech. The text was then passed to a
personality insights API, which generated a JSON file. A chart was prepared
according to the user input, and a critical value was set by the doctor; if the
user's value fell below the critical range, the doctor was notified via SMS.
The operating system used for this model was Linux/Windows. The programming
language was Python 3.6, with Flask 0.12.2 and Pygal 2.4.0 as frameworks. The
databases used were SQLite 3.8.2 and MongoDB 3.6.0. Situ Man logic uses LTA
(Location, Time, and Activity) logic. The location, time, and activity were
obtained directly from the device, and a notification was sent by the Mood
Buster. This notification typically requests patients to rate their levels of
mood, anxiety, and sleep quality. From these situation-aware notifications, the
Mood Buster may be able to correlate the patient's status with their situation.
The technology used for this model was machine learning. The application was
created around interaction between the patient and the smart device to connect
with a psychologist. Heart rate was calculated using the camera sensor. By
answering some questions, users can measure their anxiety level. The
technologies used for this model were machine learning and signal processing.
CHAPTER 2 EFFORT AND COST ESTIMATION
Estimating the effort and cost of a project like Clustering of Customers Based
on Demographics would depend on several factors, such as the scope of the
project, the complexity of the algorithms involved, the hardware and software
requirements, and the team size and expertise.
Here is a high-level breakdown of the effort and cost estimation for a project
of this nature:
1. Project Scope: The first step is to define the scope of the project, which
involves determining the specific features and functionalities required, such as
the ability to detect and recognize customer’s demand.
4. Team Size and Expertise: The size and expertise of the development team
will also impact the effort and cost estimation. A larger team with more
experienced developers may be able to complete the project more quickly, but
may also increase the overall cost of the project. Based on these factors, here
is a rough estimate of the effort and cost involved in a clustering of customers
based on demographics.
CHAPTER 3 SOFTWARE REQUIREMENT
SPECIFICATION
3.1 INTRODUCTION
The SRS typically starts with an introduction section that provides an overview of the
software project. In this section, the purpose, goals, and objectives of the software
system are described concisely. It also includes information about the intended
audience and stakeholders who will be involved in the project. The introduction sets
the context for the entire document, giving readers a clear understanding of the
software's purpose and the problems it aims to solve.
Furthermore, the introduction section may briefly discuss the background and
motivation behind the development of the software. It can provide insights into the
existing challenges or inefficiencies that the software seeks to address. This helps
stakeholders understand the rationale behind the project and its potential benefits.
3.2 INTENDED AUDIENCE AND READING SUGGESTIONS
Intended Audience:
1. Development team: The document gives the team a clear grasp of the specs
and requirements for the system. It provides the framework for the system's
design and development.
b. Use cases: From the viewpoint of the user, the use cases section describes
challenging for stakeholders to understand. As a result, the following reading
suggestions are offered: The aims of the project and the system are covered in
detail in the summary section. To understand the objectives and scope of the
project, it is advised that all stakeholders read this section.
a. Use cases: The use cases section describes the system's functioning from
the perspective of the user. It clearly illustrates how the technology will be
used in real-world scenarios.
b. Functional requirements: The system's unique features and functionalities
are covered in this section. It is recommended that the development team and
stakeholders read this.
3. Feature Selection: After preprocessing, the next step is to select the most
relevant features or variables that will be used to cluster customers. This may
involve using statistical methods to identify the most significant features or
using domain knowledge to select the most important variables.
the selected features. There are many clustering algorithms available, such as
k-means, hierarchical clustering, and DBSCAN, each with its own strengths
and weaknesses.
5. Visualization: Once the clustering has been completed, the results need to
be visualized and presented in a way that is easy to understand and interpret.
This may involve using graphs, charts, or other visualizations to display the
clusters and their characteristics.
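The feature selection step described above could, for example, use a simple
statistical filter. The sketch below drops features with zero variance using
scikit-learn's `VarianceThreshold`. The table, its column names, and the choice
of this particular filter are assumptions for illustration; domain knowledge may
suggest a different selection.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical customer table; "country_code" is constant and carries no information.
customers = pd.DataFrame({
    "age":          [23, 35, 41, 29, 52],
    "income":       [30_000, 52_000, 48_000, 39_000, 61_000],
    "country_code": [1, 1, 1, 1, 1],
})

# Drop features whose variance is zero (identical for every customer).
selector = VarianceThreshold(threshold=0.0)
selector.fit(customers)
selected = customers.columns[selector.get_support()].tolist()
print(selected)
```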
2. Data Preprocessing: The software should be able to clean and preprocess the
customer data to ensure that it is accurate, consistent, and ready for analysis.
This may involve removing duplicates, handling missing values, and
transforming the data into a suitable format.
3. Feature Selection: The software should allow for the selection of relevant
features or variables that will be used to cluster customers. This may involve
using statistical methods to identify the most significant features or using
domain knowledge to select the most important variables.
9. Export: The software should allow for the export of clustering results in a
suitable format, such as a CSV or Excel file.
10. Integration: The software should be able to integrate with other systems, such
as customer relationship management (CRM) or marketing automation tools, to
allow for the application of the clustering results in real-world scenarios.
11. Security: The software should ensure the security of customer data and protect
against unauthorized access or data breaches.
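The export requirement above could be met with pandas. The following is a
minimal sketch, assuming a hypothetical result table with `customer_id` and
`cluster` columns; it writes CSV to an in-memory buffer so that it runs without
touching the file system.

```python
import io

import pandas as pd

# Hypothetical clustering result: one row per customer with its assigned cluster.
results = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "cluster": [0, 2, 1],
})

# Write to an in-memory buffer here; passing a file path instead of the buffer
# would write the same content to disk.
buffer = io.StringIO()
results.to_csv(buffer, index=False)
print(buffer.getvalue())
```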
3.4.2 NON-FUNCTIONAL REQUIREMENTS
e. Security: The software should ensure the security of customer data and
protect against unauthorized access or data breaches.
3.5 FEASIBILITY STUDY
3.5.2 TECHNICAL FEASIBILITY
8. Security: The software should ensure the security of customer data and
protect against unauthorized access or data breaches.
10. Maintenance: The software should be easy to maintain and update, with
a clear and well-documented codebase that allows for future modifications
and improvements.
segmentation, more targeted marketing campaigns, increased customer
retention, and higher revenues.
6. Risk Assessment: The business should assess the potential risks associated
with implementing and operating the clustering software, and take steps to
mitigate these risks.
including cleaning and transforming it, to ensure that it is suitable for
clustering.
6. User Interface: The software should have a user-friendly interface that allows
users to easily interact with the software and perform clustering tasks without
requiring technical expertise.
8. Security: The software should ensure the security of customer data and
protect against unauthorized access or data breaches.
10. Support and Maintenance: The software vendor should provide ongoing
support and maintenance, including software upgrades and bug fixes, to ensure that
the software continues to meet the business's needs over time.
3.6.2 HARDWARE REQUIREMENTS
1. Processor: The processor should be able to handle the computational load of the
clustering algorithm. A multi-core processor or a processor with a high clock speed
is recommended for faster processing.
2. Memory (RAM): The amount of RAM required depends on the size of the data
set. The larger the data set, the more RAM is required for efficient processing. At
least 8 GB of RAM is recommended for most clustering applications.
3. Storage: Adequate storage is required to store the data set, intermediate results,
and output files. A solid-state drive (SSD) is recommended for faster read and write
speeds.
4. Graphics Card (GPU): A graphics card can speed up certain clustering
algorithms, such as those that rely on distance calculations or matrix
computations, provided the clustering library supports GPU execution. In that
case, a high-end GPU with a large number of cores and high memory bandwidth is
recommended.
3.7 USER REQUIREMENTS DOCUMENT (URD)
3. Customer: This represents the customer entity, which contains information about
each customer. It includes attributes like Customer ID and Demographics.
4. Clustering: This represents the clustering process, which takes customer data as
input and applies clustering algorithms to group customers based on their
demographics. It maintains a list of clustered customers.
6. Data Output: This represents the component responsible for receiving the
clustered data as output from the clustering system. It includes the data that has
been segmented or grouped based on demographics.
Fig 3.7.2 ACTIVITY DIAGRAM OF CLUSTERING OF CUSTOMERS
EXPLANATION
The diagram starts with the “Start” node. The first activity is to gather the data.
Once data processing and analysis are done, the diagram moves to the next
activity, which applies the k-means algorithm.
If the above steps complete correctly, the diagram moves to the final activity,
which is to display the customer clustering results.
3.8 SYSTEM DESIGN
3.8.1 INTRODUCTION
The following are some steps that can be taken to design a system for clustering
customers based on demographics:
3. Feature Extraction: The next step is to extract relevant features from the
preprocessed data. This involves selecting the most important demographic
variables that can be used to create customer segments.
4. Clustering Algorithm Selection: Once the features are extracted, the next
step is to select a clustering algorithm that best suits your data and objectives.
Some popular clustering algorithms include k-means, hierarchical clustering,
and DBSCAN.
6. Cluster Visualization: Finally, the clusters can be visualized to help understand the
characteristics of each cluster and how they differ from each other.
1. Customer Demographics Data: This is the source of customer data that will be
used to perform clustering based on demographic attributes such as age, gender,
income, education, and location.
2. Extract and preprocess data: In this step, the relevant customer data
is extracted from the source and preprocessed to make it suitable for
clustering. This may involve removing missing or irrelevant data,
normalizing the data, and transforming it into a suitable format.
4. Apply clustering algorithm: The preprocessed data is fed into the clustering
algorithm, which groups similar customers together based on their demographic
attributes. The resulting clustered data is returned, along with the labels for each
cluster.
5. Cluster Analysis: This is the final step in the process, where the clustered data is
analyzed to gain insights into customer behavior and preferences. This analysis can
help businesses tailor their marketing and sales strategies to specific customer
segments, improving customer engagement and loyalty.
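The flow described above (preprocess, apply the clustering algorithm, analyze
the clusters) could be sketched end to end as follows. The toy `customers`
table, its column names, and the choice of k = 2 are assumptions made only so
that the example is small and self-contained.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: a younger, lower-income group and an older, higher-income group.
customers = pd.DataFrame({
    "age":    [22, 25, 27, 24, 51, 55, 58, 53],
    "income": [28_000, 31_000, 30_000, 29_000, 82_000, 90_000, 87_000, 85_000],
})

# Extract and preprocess: scale so that age and income contribute comparably.
features = StandardScaler().fit_transform(customers[["age", "income"]])

# Apply the clustering algorithm; k = 2 is chosen only because the toy data has two groups.
model = KMeans(n_clusters=2, n_init=10, random_state=42)
customers["cluster"] = model.fit_predict(features)

# Cluster analysis: size and average profile of each segment.
profile = customers.groupby("cluster").agg(
    size=("age", "size"),
    mean_age=("age", "mean"),
    mean_income=("income", "mean"),
)
print(profile)
```

A per-cluster profile like this is usually the starting point for describing
each segment in business terms.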
Here is a brief explanation of each step:
1. The Customer System sends a request to the Clustering System for customer data.
5. The Clustering System returns the cluster information to the Customer System.
The Demographic class represents a set of demographic attributes that can be used to
cluster customers. In this example, we have included attributes such as age, gender,
and income. Depending on the specific use case, other attributes may be included as
well.
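One way such a class might look in Python is sketched below. The field names
follow the attributes named above; the field types, the `to_vector` helper, and
the numeric encoding of gender are hypothetical choices for illustration and do
not come from the project's class diagram.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demographic:
    """Demographic attributes of one customer (hypothetical layout)."""
    age: int
    gender: str      # e.g. "female" or "male"; the encoding below is an assumption
    income: float    # annual income, in arbitrary currency units

    def to_vector(self) -> List[float]:
        """Encode the attributes as numbers so a clustering algorithm can use them."""
        gender_code = 1.0 if self.gender.lower() == "female" else 0.0
        return [float(self.age), gender_code, float(self.income)]


customer = Demographic(age=34, gender="Female", income=52_000)
print(customer.to_vector())  # [34.0, 1.0, 52000.0]
```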
CHAPTER 4 IMPLEMENTATION
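import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# NOTE: the data-loading step is missing from this listing. The file name and
# column names below are assumptions; adjust them to match the actual dataset.
customer_data = pd.read_csv('customers.csv')
X = customer_data[['Annual Income', 'Spending Score']].values
print(X)

# Elbow method: compute the within-cluster sum of squares (WCSS) for k = 1..10
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

sns.set()
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Point Graph')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

# NOTE: the step that defines Y is also missing from this listing. Five clusters
# are assumed here because the plot below draws five groups.
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
Y = kmeans.fit_predict(X)
print(Y)

# plotting all the clusters and their centroids
plt.figure(figsize=(8, 8))
colors = ['green', 'red', 'yellow', 'violet', 'blue']
for cluster, color in enumerate(colors):
    plt.scatter(X[Y == cluster, 0], X[Y == cluster, 1], s=50, c=color,
                label=f'Cluster {cluster + 1}')
# The centroid markers are an assumed completion based on the comment above.
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=100, c='cyan', label='Centroids')
plt.title('Customer Groups')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()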
CHAPTER 5 SCREENSHOTS
CHAPTER 6 TECHNOLOGY USED
6.1 PYTHON
Python is a popular programming language for machine learning due to its
simplicity, ease of use, and the availability of a vast number of libraries and
frameworks specifically designed for machine learning. Machine learning in
Python involves the use of various algorithms and techniques to build models
that can make predictions or take actions based on data.
6.2 K-MEANS
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids. (They need not be
points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which forms the K
clusters.
Step-4: Compute the mean of the points in each cluster and place the new
centroid of that cluster there.
Step-5: Repeat the third step, that is, reassign each data point to the new
closest centroid.
Step-6: If any reassignment occurs, go to Step-4; otherwise go to FINISH.
Step-7: The model is ready.
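The steps above can be sketched directly in NumPy. This is a minimal
illustration rather than the implementation used in Chapter 4 (which relies on
scikit-learn's `KMeans` with `k-means++` initialization); the function name
`k_means`, its parameters, and the toy data are assumptions made for this
example, and the initial centroids are drawn from the data points themselves.

```python
import numpy as np


def k_means(points, k, max_iter=100, seed=0):
    """Plain K-Means (Lloyd's algorithm) following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop when no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Toy data: two tight groups around (0, 0) and (10, 10).
data = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.4],
                 [10.0, 10.0], [10.3, 9.8], [9.9, 10.2]])
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the
grouping itself, not the label numbers, is meaningful.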
CHAPTER 7 TESTING AND INTEGRATION
Testing and integration are crucial phases in the software development lifecycle that
ensure the quality, reliability, and functionality of a software system. Test case
descriptions play a vital role in guiding the testing and integration processes, outlining
the steps to be taken, expected results, and ensuring comprehensive test coverage.
Test case descriptions provide detailed instructions for executing specific tests on the
software system. Each test case focuses on a particular aspect or functionality of the
system to validate its behavior against expected outcomes. The descriptions typically
include the following components:
1. Test case identifier: A unique identifier that helps in tracking and referencing
the test case.
2. Test case name: A brief but descriptive name that reflects the purpose or
objective of the test.
5. Test data: The specific input data or conditions required for executing the test case.
6. Expected results: The anticipated outcome or behavior of the software when the
test case is executed successfully.
7. Actual results: The observed results during test execution, which are
compared against the expected results.
8. Pass/fail criteria: The criteria that determine whether the test case has passed
or failed based on the comparison of actual and expected results.
11. Test priority: The priority level assigned to the test case, indicating its
relative importance in the testing process.
By documenting test cases in detail, testers can ensure that all aspects of the
software system are thoroughly tested. Test case descriptions serve as a reference for
testers to execute tests consistently and aid in identifying and resolving any issues or
defects found during testing. Moreover, integration test case descriptions facilitate
the seamless integration of individual components or modules into a cohesive
software system, ensuring their compatibility and proper functioning as a whole.
The testing types for clustering of customers based on demographics can include:
2. Integration Testing: Integration testing for clustering of customers based on
demographics involves verifying how different modules or components of the
clustering algorithm work together to ensure they integrate correctly. This type
of testing may include testing how different distance metrics or clustering
algorithms work together to generate accurate customer segments.
7.3 TEST CASES
Here are some test cases for clustering of customers based on demographics:
Test Case id - 3 Cluster Purity
Executed By: Chirag Bisht
Test Case id - 5 Scalability
Developed By: Riddhi Kaushik
Executed By: Chirag Bisht
Here are some potential future enhancements for clustering of customers based on
demographics:
CONCLUSION
As our dataset was unlabeled, in this paper we opted for internal clustering
validation rather than external clustering validation, which depends on
external data such as labels. Internal cluster validation can be used to choose
the clustering algorithm that best suits the dataset and assigns each data
point to its appropriate cluster.
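Internal validation needs only the data and the predicted cluster labels. The
sketch below computes two common internal indices available in scikit-learn,
the silhouette score and the Davies-Bouldin index, for several values of k. The
synthetic data and the choice of these two indices are assumptions for
illustration; the text above does not state which index was used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic unlabeled data (the generated labels are discarded).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = {
        # Silhouette: higher is better, range [-1, 1].
        "silhouette": silhouette_score(X_demo, labels),
        # Davies-Bouldin: lower is better, minimum 0.
        "davies_bouldin": davies_bouldin_score(X_demo, labels),
    }

for k, s in scores.items():
    print(f"k={k}: silhouette={s['silhouette']:.3f}, davies_bouldin={s['davies_bouldin']:.3f}")
```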
Based on the above information, we now know that the Jumbo Bag Red Retrospot is
the best-selling item among our highest-spending customer group. With that
information available, we can make recommendations for other potential
customers in this segment.
may not always provide a complete picture of customer behavior and
preferences. Other factors such as psychographics, buying behavior, and
individual preferences may also need to be taken into account. Additionally, it
is important to ensure that any clustering analysis is done ethically and with
respect for customers' privacy and data protection rights.
Overall, clustering based on demographics can be a useful tool for businesses to gain
insights into customer behavior and preferences, but it should be used in
combination with other data analysis techniques and with consideration for ethical
and privacy concerns.
REFERENCES
[1]. Sayyida, S.; Hartini, S.; Gunawan, S.; Husin, S.N. The Impact of the Covid-19
Pandemic on Retail Consumer Behaviour. Aptisi Trans. Manag. (ATM) 2021,
5, 79–88.
[2]. Prof. Nikhil Patankar, Soham Dixit, Akshay Bhamare, Ashutosh Darpel and
Ritik Raina. Customer segmentation refers to the segmentation of customers
based on demographics. Dept. of Information Technology, Sanjivani College
of Engineering, Kopargaon 423601 (MH), India.
[3]. Aman Banduni, Prof. Ilavendhan A. Identifying and meeting the needs and
requirements. School of Computing Science & Engineering, Galgotias University,
Greater Noida, U.P.
[4]. A.K. Jain, M.N. Murty and P.J. Flynn. "Data Clustering: A Review". ACM
Computing Surveys. 1999. Vol. 31, No. 3.
[6]. Omar Kettani, Faycal Ramdani, Benaissa Tadili. "An Agglomerative Clustering
Method for Large Data Sets". IJCA, 2014.
APPENDIX
Demographic segmentation is one of the most commonly used methods for customer
segmentation. It divides customers into groups based on demographic characteristics
such as age, gender, income, education, marital status, and occupation. Clustering is a
useful technique for grouping customers based on their demographic attributes, as it
allows marketers to identify patterns and similarities in customer behavior and
preferences.
Clustering algorithms can be divided into two main types: hierarchical clustering and
partitioning clustering. Hierarchical clustering, in its common agglomerative form, is
a bottom-up approach that starts with each customer as a separate cluster and
gradually merges clusters based on their similarity. Partitioning clustering, on the
other hand, starts with a fixed number of clusters and iteratively assigns each
customer to the nearest cluster based on similarity.
There are several clustering algorithms that can be used for customer segmentation,
including k-means, hierarchical agglomerative clustering, and DBSCAN. K-means
clustering is a partitioning algorithm that separates customers into k clusters based on
their distance from the center of each cluster. Hierarchical agglomerative clustering,
on the other hand, is a hierarchical algorithm that creates a dendrogram to visualize
the clustering process. DBSCAN is a density-based clustering algorithm that groups
customers based on their density in the data space.
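To make the comparison concrete, the sketch below runs the three algorithms
named above on the same toy data using scikit-learn. The data points and the
parameter values (`n_clusters=2`, `eps=0.5`, `min_samples=3`) are illustrative
assumptions, not tuned settings from this project.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# Toy two-feature data: two dense groups plus one isolated point (values are made up).
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [5.2, 5.1],
    [9.0, 1.0],
])

# Partitioning: the number of clusters must be given in advance.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(points)
# Hierarchical (agglomerative): merges clusters bottom-up until two remain.
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(points)
# Density-based: no cluster count is given; points in sparse regions get label -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

print("k-means:      ", kmeans_labels)
print("agglomerative:", agglo_labels)
print("DBSCAN:       ", dbscan_labels)
```

With these settings DBSCAN marks the isolated point as noise, whereas k-means
and agglomerative clustering must place every point in one of the two clusters.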
Once customers are clustered based on their demographics, marketers can use these
clusters to develop targeted marketing campaigns and personalized communication
strategies. For example, customers in the same cluster may have similar preferences
and behaviors, making it easier to tailor products and services to their needs.