
Course title: Introduction to Data Science

FINAL EXAM ASSIGNMENT


Topic: Student Analyst

Lecturer : Do Trung Tuan


Group member: Pham Phuong Anh - 22070410
Dong Huu Khanh Duy - 22070571
Le Thi Bich Ngoc - 22070439
Phan Thi Thanh Tam - 22070983
Tran Thi Trang - 22070501
Class : DS9-23

Hanoi, November 2023

Acknowledgement
We would like to express our appreciation to all those who helped us complete this report
successfully, especially Associate Professor Do Trung Tuan, Ph.D., for supporting and
encouraging us and teaching us throughout the study process.
Many thanks to VNU-IS for including this subject in the curriculum, helping students like us
understand what Data Science is and gain a deeper look into how it is used in life and its
importance in the 21st century.
A sincere thank you to all the lecturers who put in their best efforts in guiding our team to
achieve our goal, and deep gratitude to our classmates, especially those who took the time
to help and support us whenever we needed it during this project.

Table of content
Acknowledgement...............................................................................................ii
Table of content..................................................................................................iii
List of figures.....................................................................................................iv
Participants.........................................................................................................vi
Chapter 1. INTRODUCTION.............................................................................1
1.1. About the research problem.....................................................1
1.2. About Dataset...........................................................................................1
1.3. Motivation and purpose of the research...................................................2
Chapter 2. Analyze Dataset.................................................................................3
2.1. Data preparation.......................................................................................3
2.2. Data cleaning............................................................................................4
2.3. Deriving New Features.............................................................................6
2.4. Data descriptive........................................................................................9
Chapter 3. Modeling..........................................................................................13
3.1. Data Processing......................................................................................13
3.2. Clustering................................................................................................17
3.3. Evaluating Models..................................................................................20
Chapter 4. Result communication and recommendation...................................25
4.1. Result......................................................................................................25
4.2. Recommendation....................................................................................26
Chapter 5. Conclusion.......................................................................................27
Reference...........................................................................................................28

List of figures
Figure 2.1. Importing Libraries...........................................................................3
Figure 2.2. Read CSV..........................................................................................3
Figure 2.3. Data set..............................................................................................4
Figure 2.4. Remove the NA values......................................................................5
Figure 2.5. Create a feature from Dt. Customer..................................................5
Figure 2.6. Create a feature Customer_For.........................................................6
Figure 2.7. Visualize............................................................................................6
Figure 2.8. Result.................................................................................................7
Figure 2.9. Create new features...........................................................................8
Figure 2.10. Data describe...................................................................................9
Figure 2.10. Result...............................................................................................9
Figure 2.11. Create Graph.................................................................................10
Figure 2.11. Visualize Graph.............................................................................11
Figure 2.12. Remove the outliers data...............................................................11
Figure 2.13. Correlation Matrix.........................................................................11
Figure 2.14. Create Heatmap.............................................................................12
Figure 2.15. List of categorical..........................................................................13
Figure 2.16. Encode the object dtypes...............................................................13
Figure 2.17. Create a copy of data.....................................................................14
Figure 2.18.........................................................................................................14
Figure 2.19.........................................................................................................15
Figure 2.20.........................................................................................................15
Figure 2.21.........................................................................................................15
Figure 2.22.........................................................................................................15
Figure 2.23.........................................................................................................15

Figure 2.24. 3D Projection of data in the reduced dimension...........................16
Figure 2.25. Result.............................................................................................17
Figure 2.26.........................................................................................................17
Figure 2.27.........................................................................................................18
Figure 2.28.........................................................................................................18
Figure 2.29.........................................................................................................19
Figure 2.30.........................................................................................................20
Figure 2.31.........................................................................................................20
Figure 2.32.........................................................................................................21
Figure 2.33.........................................................................................................21
Figure 2.34.........................................................................................................22
Figure 2.35.........................................................................................................22
Figure 2.36.........................................................................................................22
Figure 2.37.........................................................................................................23
Figure 2.38.........................................................................................................23
Figure 2.39.........................................................................................................24

Participants
No | Name               | Student ID | Contribution | Comments
1  | Pham Phuong Anh    | 22070410   | 20%          | Leader, Analyze Dataset, Introduction
2  | Dong Huu Khanh Duy | 22070571   | 20%          | Analyze Dataset
3  | Le Thi Bich Ngoc   | 22070439   | 20%          | Modeling and conclusion
4  | Phan Thi Thanh Tam | 22070983   | 20%          | Abstract, Introduction
5  | Tran Thi Trang     | 22070501   | 20%          | Modeling and conclusion

Chapter 1. INTRODUCTION
1.1. About the research problem
In the dynamic landscape of today's business world, understanding and connecting
with customers on a personal level is pivotal for success. A strategic approach to this is
Customer Personality Analysis, a comprehensive examination of a company's ideal
customers. By delving into the intricacies of their needs, behaviors, and concerns, businesses
can tailor their products to resonate with different customer segments, fostering a more
personalized and targeted marketing strategy.
In this project, the focus is on performing an unsupervised clustering of data extracted
from a groceries firm's database. Through customer segmentation, the aim is to identify and
group customers who exhibit similarities in their preferences, behaviors, and purchasing
patterns. This segmentation enables businesses to assign greater significance to each
customer group, allowing for the customization of products that align with the unique needs
and behaviors of those segments.

1.2. About Dataset


The "Customer Segmentation" dataset that we analyze is taken from the Kaggle dataset
repository. Its attributes are:
- Customer’s Information:
1. ID
2. Year_Birth
3. Education
4. Marital_Status
5. Income
6. Kidhome
7. Teenhome
8. Dt_Customer
9. Recency
10. Complain
- Products: Amount spent on different products in last 2 years
1. MntWines
2. MntFruits
3. MntMeatProducts
4. MntFishProducts
5. MntSweetProducts
6. MntGoldProds
- Place:
1. NumWebPurchases
2. NumCatalogPurchases
3. NumStorePurchases
4. NumWebVisitsMonth
- Promotion:
1. NumDealsPurchases
2. AcceptedCmp1
3. AcceptedCmp2
4. AcceptedCmp3
5. AcceptedCmp4
6. AcceptedCmp5
7. Response

1.3. Motivation and purpose of the research


Customer Personality Analysis is a detailed analysis of a company’s ideal customers.
It helps a business to better understand its customers and makes it easier for them to modify
products according to the specific needs, behaviors and concerns of different types of
customers.

Customer personality analysis helps a business to modify its product based on its
target customers from different types of customer segments. For example, instead of spending
money to market a new product to every customer in the company’s database, a company can
analyze which customer segment is most likely to buy the product and then market the
product only on that particular segment.

Several benefits and improvements that the firm can create from the results of this
research include:

- Understanding customers' behaviors.

- Developing different customer types and personas.

- Promoting targeted, effective marketing campaigns.

- Developing more relevant policies for sales and pricing.

- Improving customer service.

Chapter 2. Analyze Dataset

2.1. Data preparation


First, we import the required libraries.

Figure 2.1. Importing Libraries


Next, we load the dataset by reading the CSV file.

Figure 2.2. Read CSV


In order to get a full grasp of what steps we should take to clean the dataset, let us
have a look at the information in the data.
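Since the code figures (Figures 2.1–2.3) are screenshots not reproduced here, a minimal self-contained sketch of loading and inspecting the data follows. The inline CSV text is an illustrative stand-in; in the actual project this would be `pd.read_csv` on the downloaded Kaggle file (which is tab-separated):

```python
import io
import pandas as pd

# Tiny stand-in for the real file so the snippet is self-contained;
# in the project: data = pd.read_csv("marketing_campaign.csv", sep="\t")
csv_text = (
    "ID\tYear_Birth\tIncome\tDt_Customer\n"
    "1\t1980\t58138\t04-09-2012\n"
    "2\t1990\t\t08-03-2014\n"
)
data = pd.read_csv(io.StringIO(csv_text), sep="\t")

# Inspect dtypes, non-null counts, and memory usage
data.info()
print("Missing Income values:", data["Income"].isna().sum())
```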

Figure 2.3. Data set
From the above output, we can conclude and note that:
1. There are missing values in income
2. Dt_Customer that indicates the date a customer joined the database is not parsed as
DateTime
3. There are some categorical features in our data frame; as there are some features in
dtype: object. So we will need to encode them into numeric forms later.

2.2. Data cleaning


First of all, for the missing values, I am simply going to drop the rows that have missing
income values.

Figure 2.4. Remove the NA values

This command cleans data by removing rows that contain at least one missing value, then
provides an indication of the amount of data remaining.
- data.dropna(): It is used to drop rows containing at least one NA value.
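The removal step (Figure 2.4) can be sketched as follows, using a tiny illustrative frame in place of the real data:

```python
import pandas as pd

# Illustrative frame with one missing Income value
data = pd.DataFrame({"ID": [1, 2, 3], "Income": [58138.0, None, 71613.0]})

# Drop every row containing at least one NA value
data = data.dropna()
print("The total number of data-points after removing the rows with missing values are:", len(data))
```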
Create a feature from "Dt_Customer" that displays the number of days the customer is
registered in the company database. However, for simplicity's sake, I'm taking this value
relative to the most recent customer on file.

Figure 2.5. Create a feature from Dt. Customer


pd.to_datetime(data["Dt_Customer"]): Use the pd.to_datetime function to convert the
"Dt_Customer" column from object data format to datetime data format. This makes it
possible to perform operations and calculations with temporal data.
An empty list dates is created to store date values.
The loop goes through each value in the "Dt_Customer" column and retrieves the date from
each datetime object.
The date values are added to the dates list.

Thus, this code converts the column "Dt_Customer" to datetime format and then extracts the
date from each datetime object, finally printing out the newest and oldest registration dates in
the dataset.
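The conversion described above (Figure 2.5) might look like the sketch below; the sample date strings are assumptions, and `dayfirst=True` reflects the day-month-year format used in the Kaggle file:

```python
import pandas as pd

# Illustrative stand-in for the "Dt_Customer" column
data = pd.DataFrame({"Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"]})

# Convert from object dtype to datetime so we can do date arithmetic
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], dayfirst=True)

# Collect the date part of each datetime object
dates = [d.date() for d in data["Dt_Customer"]]
print("Newest customer's enrolment date:", max(dates))
print("Oldest customer's enrolment date:", min(dates))
```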

2.3. Deriving New Features


Creating a feature ("Customer_For") of the number of days the customers started to shop in
the store relative to the last recorded date.

Figure 2.6. Create a feature Customer_For


 days: This list stores the number of days each customer has been associated with the
business.
 d1 = max(dates): d1 is the latest date in the customer date list, representing the
newest customer.
 for i in dates: delta = d1 - i; days.append(delta): A loop calculates the number of
days from the latest date (newest customer) to each customer's date and appends the
delta value to the days list.
 data["Customer_For"] = days: Adds a new column "Customer_For" to the data set
with values representing the number of days each customer has been associated with
the business.
 pd.to_numeric(data["Customer_For"], errors="coerce"): Converts the "Customer_For"
column to numeric format, with errors="coerce" turning values that cannot be
converted into NaN.
In short, this code creates a new feature "Customer_For" to represent the number of days
between the customer's registration date and the latest registration date in the data set.
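The steps above can be sketched as a short snippet; the sample dates are illustrative, standing in for the parsed "Dt_Customer" values:

```python
import pandas as pd
from datetime import date

data = pd.DataFrame({"ID": [1, 2, 3]})
dates = [date(2012, 9, 4), date(2014, 3, 8), date(2013, 8, 21)]  # assumed sample

days = []
d1 = max(dates)  # the newest customer on file
for i in dates:
    delta = d1 - i
    days.append(delta.days)  # days between this customer and the newest one

data["Customer_For"] = days
# Coerce to numeric so any unconvertible values become NaN
data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")
print(data["Customer_For"].tolist())
```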
Find out the unique values in the categorical features to get a clear idea of the data.

Figure 2.7. Visualize
- data["Education"].value_counts(): This function counts the occurrences of each unique
value in the "Education" column and returns a Series with the count of each value (the
same is done for "Marital_Status").
Print result:
1. Total categories in the feature Marital_Status: Prints the number of unique values in
the "Marital_Status" column and the number of occurrences of each value.
2. Total categories in the feature Education: Prints the number of unique values in the
"Education" column and the number of occurrences of each value.

Figure 2.8. Result


In the next bit, we will be performing the following steps to engineer some new features:
1. Extract the "Age" of a customer by the "Year_Birth" indicating the birth year of the
respective person.
2. Create another feature "Spent" indicating the total amount spent by the customer in
various categories over the span of two years.
3. Create another feature "Living_With" out of "Marital_Status" to extract the living
situation of couples.
4. Create a feature "Children" to indicate total children in a household that is, kids and
teenagers.
5. To get further clarity of household, Creating feature indicating "Family_Size"
6. Create a feature "Is_Parent" to indicate parenthood status
7. Lastly, I will create three categories in the "Education" by simplifying its value
counts.
8. Dropping some of the redundant features

Figure 2.9. Create new features
In the above code snippet, a series of feature-engineering decisions are made on the
DataFrame 'data':
1. Age of current customers: Create a new feature "Age", representing the customer's
age in 2021 based on their year of birth.
2. Total cost for different items: Create a new feature "Spent" by adding up the total
amount of money spent on the different product types.
3. Determine living situation based on marital status: Create a new feature
"Living_With" to specify a living situation based on marital status. For example,
"Alone" is assigned to marital statuses such as "Absurd," "Widow," "YOLO,"
"Divorced," and "Single."
4. Children living in the household: Create a new feature "Children" by adding the
columns "Kidhome" and "Teenhome."
5. Total number of family members: Create a new feature "Family_Size" based on the
living situation: for "Alone" the base size is 1, for "Partner" it is 2, and the number
of children is added to these values.
6. Parenthood: Create a new feature "Is_Parent" using a binary indicator (1 or 0)
based on whether the customer has children.
7. For clarity: Some column names are renamed to make them easier to understand.
8. Eliminate redundant features: Some columns are removed because they are
considered redundant.

These engineering steps aim to generate meaningful and informative features from the
existing data set, which can improve the performance of machine learning models trained on
this data.
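These engineering steps (Figure 2.9) can be sketched as below; the two sample rows and their values are illustrative, not taken from the real dataset:

```python
import pandas as pd

# Two illustrative customers
data = pd.DataFrame({
    "Year_Birth": [1980, 1995],
    "Marital_Status": ["Married", "Single"],
    "Kidhome": [1, 0], "Teenhome": [1, 0],
    "MntWines": [100, 20], "MntFruits": [10, 5], "MntMeatProducts": [50, 10],
    "MntFishProducts": [20, 5], "MntSweetProducts": [5, 2], "MntGoldProds": [15, 3],
})

# 1. Age relative to 2021
data["Age"] = 2021 - data["Year_Birth"]
# 2. Total amount spent across product categories
data["Spent"] = (data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"]
                 + data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"])
# 3. Living situation derived from marital status
data["Living_With"] = data["Marital_Status"].replace(
    {"Married": "Partner", "Together": "Partner", "Absurd": "Alone",
     "Widow": "Alone", "YOLO": "Alone", "Divorced": "Alone", "Single": "Alone"})
# 4. Total children in the household
data["Children"] = data["Kidhome"] + data["Teenhome"]
# 5. Family size = adults (1 or 2) + children
data["Family_Size"] = data["Living_With"].map({"Alone": 1, "Partner": 2}) + data["Children"]
# 6. Binary parenthood indicator
data["Is_Parent"] = (data["Children"] > 0).astype(int)
print(data[["Age", "Spent", "Family_Size", "Is_Parent"]])
```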

Figure 2.10. Data describe


The describe() function in Pandas is used to create a statistical summary of the quantitative
characteristics of a DataFrame. Here is a detailed explanation of the results you can see from
data.describe().

Figure 2.10. Result


The above stats show some discrepancies in the mean and max values of Income and Age.
Note that the max Age is 128 years: Age was calculated relative to today (i.e. 2021),
while the data itself is old.

2.4. Data descriptive


I must take a look at the broader view of the data. I will plot some of the selected features.

Figure 2.11. Create Graph
These functions are used to configure colors for graph objects. rc (resource
configuration) is used to set axis and figure properties. pallet is a list of colors used
to create the palette for the graph, and cmap is a Colormap object for use in graphs.
Draw Scatter Graph:
1. To_Plot is the list of features you want to plot.
2. sns.pairplot(data[To_Plot],hue="Is_Parent",palette=(["#682F2F","#F3AB60"])): Use
Seaborn's pairplot function to plot the scatter plot for pairs of features in To_Plot list.
The hue column is set to "Is_Parent", which means the color of the points will be
based on the value of the "Is_Parent" column. Colors are selected from a preset
palette.
3. plt.show(): Display the graph.

Figure 2.11. Visualize Graph
There are a few outliers in the Income and Age features. We will be deleting the outliers
in the data.

Figure 2.12. Remove the outliers data


The above code removes outliers from the DataFrame data by setting thresholds for the
"Age" and "Income" columns:
1. Remove outliers for the "Age" column: retain rows only if the value of the "Age"
column is less than 90; rows with an "Age" value greater than or equal to 90 are
discarded.
2. Remove outliers for the "Income" column: similarly, retain rows only if the value of
the "Income" column is less than 600,000; rows with an "Income" value greater than
or equal to 600,000 are removed.
3. Print the amount of data remaining after removing outliers.
The result of this code is that the DataFrame data contains only rows that fall within
the thresholds set for the "Age" and "Income" columns.
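A minimal sketch of this filtering (Figure 2.12), with assumed sample rows:

```python
import pandas as pd

data = pd.DataFrame({"Age": [35, 128, 41], "Income": [58138, 71613, 666666]})

# Keep rows below the chosen thresholds
data = data[data["Age"] < 90]
data = data[data["Income"] < 600000]
print("The total number of data-points after removing the outliers are:", len(data))
```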

Figure 2.13. Correlation Matrix


- plt.figure(figsize=(20, 20)): This command sets the size of the figure to create the
heatmap. Adjusting image size provides better visualization, especially when there are
many variables.
- sns.heatmap(corrmat, annot=True, cmap=cmap, center=0): This command creates a
heatmap using the Seaborn library. corrmat is the correlation matrix, annot=True adds
a numeric value to the heatmap, cmap specifies the colormap, and center=0 sets the
center of the colormap to zero.
The resulting heatmap provides a visual representation of the correlation between different
pairs of variables in the data set. Positive correlations are indicated by light colors, negative
correlations by dark colors, and correlations close to zero are indicated by neutral colors.
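The correlation computation behind the heatmap (Figures 2.13–2.14) can be sketched on synthetic data; the two generated columns and their relationship are assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(50000, 10000, 200)
spent = income * 0.01 + rng.normal(0, 50, 200)  # deliberately positively correlated
df = pd.DataFrame({"Income": income, "Spent": spent})

# Pairwise Pearson correlation matrix
corrmat = df.corr()
print(corrmat)

# Plotting (requires matplotlib/seaborn):
# plt.figure(figsize=(20, 20))
# sns.heatmap(corrmat, annot=True, cmap=cmap, center=0)
```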

Figure 2.14. Create Heatmap
We can see that there is a high correlation between Spent and Wines (0.89), Income
(0.79), Meat (0.85), NumCatalogPurchases (0.78), and NumStorePurchases (0.68); we can
use these variables to predict the customer segmentation. However, some correlations are
weak or negative, such as Children and Income (-0.34) and Spent and NumDealsPurchases
(-0.066); these are not strong enough to be good predictors.

Chapter 3. Modeling
3.1. Data Processing
Before performing the clustering analysis, data preprocessing is important to ensure
that the data is normalized and suitable for modeling.
The following steps are applied to preprocess the data:
1. Label encoding the categorical features
2. Scaling the features using the standard scaler
3. Creating a subset dataframe for dimensionality reduction

Figure 2.15. List of categorical


Categorical variables in the dataset: ['Education', 'Living_With']

Figure 2.16. Encode the object dtypes


All features are now numerical
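The label-encoding step (Figure 2.16) might look like the sketch below; the sample category values are assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    "Education": ["Graduate", "Postgraduate", "Undergraduate"],
    "Living_With": ["Partner", "Alone", "Partner"],
})

# Find the object-dtype (categorical) columns
object_cols = [c for c in data.columns if data[c].dtype == "object"]

# Encode each categorical column into integer codes
LE = LabelEncoder()
for col in object_cols:
    data[col] = LE.fit_transform(data[col])
print("All features are now numerical")
```

LabelEncoder assigns codes in sorted order of the category names, so e.g. "Alone" becomes 0 and "Partner" becomes 1.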

Figure 2.17. Create a copy of data
1. It creates a copy of the original dataset data and assigns it to the variable ds.
2. It creates a subset of the DataFrame by removing the specified features related to
accepted campaigns (AcceptedCmp3, AcceptedCmp4, AcceptedCmp5,
AcceptedCmp1, AcceptedCmp2), customer complaints (Complain), and customer
responses (Response).
3. It uses the StandardScaler from scikit-learn to standardize the remaining features in
the dataset. The scaled data is stored in the DataFrame scaled_ds.
4. A print statement indicating that all features have been successfully scaled.
All features are now scaled
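The copy-and-scale step (Figure 2.17) can be sketched as below; the two columns are illustrative stand-ins for the remaining numeric features:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Copy of the data with campaign/complaint/response columns already dropped
ds = pd.DataFrame({"Income": [40000, 50000, 60000], "Spent": [200, 500, 800]})

# Standardize: each column gets mean 0 and unit variance
scaler = StandardScaler()
scaler.fit(ds)
scaled_ds = pd.DataFrame(scaler.transform(ds), columns=ds.columns)
print("All features are now scaled")
```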

Figure 2.18
Dataframe to be used for further modeling:

Figure 2.19

Figure 2.20

Figure 2.21
Dimensionality Reduction

Figure 2.22

Figure 2.23
1. pca = PCA(n_components=3): Initializes PCA with the desired number of
components set to 3.
2. pca.fit(scaled_ds): Fits the PCA model to the scaled data.

3. PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=["col1", "col2",
"col3"]): Transforms the scaled data into a new DataFrame named PCA_ds,
containing the first three principal components as columns labeled "col1", "col2", and
"col3".
4. PCA_ds.describe().T: Provides a summary statistics table for the transformed data,
transposing the table for better readability.
The resulting PCA_ds DataFrame contains the reduced-dimensional representation of
the original scaled data.
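The PCA steps above can be sketched on random standardized data (the 50×6 input is an assumption; the real input is the scaled dataset):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
scaled_ds = pd.DataFrame(rng.normal(size=(50, 6)))  # stand-in for the scaled data

# Reduce to the first three principal components
pca = PCA(n_components=3)
pca.fit(scaled_ds)
PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=["col1", "col2", "col3"])

# Transposed summary statistics of the reduced data
print(PCA_ds.describe().T)
```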
The 3D scatter plot serves as a powerful visualization tool, offering a comprehensive
representation of the dataset in its reduced dimensionality. In this plot, each data point is
depicted as a marker within the three-dimensional space, with the x, y, and z axes
corresponding to the first three principal components obtained through Principal Component
Analysis (PCA).

Figure 2.24. 3D Projection of data in the reduced dimension


1. x = PCA_ds["col1"], y = PCA_ds["col2"], z = PCA_ds["col3"]: Extracts the values of
the first three principal components from the PCA_ds DataFrame for plotting.
2. fig = plt.figure(figsize=(10, 8)): Creates a new figure with a specified size.
3. ax = fig.add_subplot(111, projection="3d"): Adds a 3D subplot to the figure.
4. ax.scatter(x, y, z, c="maroon", marker="o"): Plots the 3D scatter plot using the
extracted components, with maroon-colored markers.
5. ax.set_title("A 3D Projection Of Data In The Reduced Dimension"): Sets the title of
the plot.
6. plt.show(): Displays the 3D projection plot.

The resulting plot visually represents the data in the reduced dimension using a 3D
scatter plot:

Figure 2.25. Result

3.2. Clustering
First, we will use the elbow criterion method to find the optimum number of clusters. Let’s
set a limit for the number of clusters at 10 and plot the distortion score value of each number.
The elbow method is useful for determining the optimal number of clusters by identifying the
point where further partitioning of data provides diminishing returns. In the visualization, the
"elbow" is typically the point where the distortion starts to decrease at a slower rate.

Figure 2.26
- print('Elbow Method to determine the number of clusters to be formed:'): A print
statement indicating the purpose of the following code.
- Elbow_M = KElbowVisualizer(KMeans(), k=10): Initializes the KElbowVisualizer
with the KMeans clustering model and sets the maximum number of clusters (k) to
10.
- Elbow_M.fit(PCA_ds): Fits the visualizer to the reduced-dimension dataset obtained
through PCA.
- Elbow_M.show(): Displays the elbow method plot, where the x-axis represents the
number of clusters, and the y-axis indicates the distortion (inertia) of the clusters.
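The report uses Yellowbrick's KElbowVisualizer; the same elbow curve can be computed with a plain KMeans inertia loop, shown here on synthetic two-blob data (the blobs are an assumption to make the elbow visible):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two well-separated blobs in 3-D, standing in for PCA_ds
X = np.vstack([rng.normal(0, 0.5, (40, 3)), rng.normal(5, 0.5, (40, 3))])

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # distortion for this k

# The "elbow": inertia drops sharply up to the true cluster count, then flattens
print(inertias)
```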

Figure 2.27
Based on the distortion score result on the plot, we will fit k-means with 4 clusters. Next, we
will be fitting the Agglomerative Clustering Model to get the final clusters.

Figure 2.28
- AC = AgglomerativeClustering(n_clusters=4): Initializes the Agglomerative
Clustering model with the specified number of clusters, in this case, 4.

- yhat_AC = AC.fit_predict(PCA_ds): Fits the Agglomerative Clustering model to the
reduced-dimension dataset obtained through PCA and predicts the cluster assignments
for each data point.
- PCA_ds["Clusters"] = yhat_AC: Adds a new column "Clusters" to the reduced-
dimension dataset (PCA_ds) containing the cluster assignments for each data point.
- data["Clusters"] = yhat_AC: Adds the "Clusters" feature to the original dataset (data)
to associate each data point with its corresponding cluster.
In the last step, we will examine the clusters formed. Let's have a look at the 3-D distribution
of the clusters.

Figure 2.29

3.3. Evaluating Models
Since this is unsupervised clustering, we do not have a labeled feature to evaluate or
score our model. The purpose of this section is to study the patterns in the clusters
formed and determine the nature of those patterns.
For that, we will be having a look at the data in light of clusters via exploratory data analysis
and drawing conclusions.
Firstly, let us have a look at the group distribution of clustering.

Figure 2.30

Figure 2.31
The clusters seem to be fairly distributed. Next, we profile the clusters based on income
and spending (x = Income, y = Spent), where:
data["Spent"] = data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"] +
data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"]

Figure 2.32
The color differentiation allows for easy identification of clusters, and the legend aids in
interpreting the cluster assignments.

Figure 2.33
Income vs spending plot shows the clusters pattern
- group 0: high spending & average income
- group 1: high spending & high income
- group 2: low spending & low income
- group 3: high spending & low income
Next, I will be looking at the detailed distribution of clusters as per the various products in
the data. Namely: Wines, Fruits, Meat, Fish, Sweets and Gold.

Figure 2.34

Figure 2.35
From the above plot, it can be clearly seen that cluster 1 is our biggest set of customers
closely followed by cluster 0. We can explore what each cluster is spending on for the
targeted marketing strategies.
data["Total_Promos"]=data["AcceptedCmp1"]+data["AcceptedCmp2"]
+data["AcceptedCmp3"]+ data["AcceptedCmp4"]+ data["AcceptedCmp5"]
Let us next explore how our campaigns did in the past.

Figure 2.36
Created a new feature called "Total_Promos" by aggregating the number of advertising
campaigns that the customer has accepted. This feature is created by summing the values
from the "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", and
"AcceptedCmp5" columns in the dataframe data. Specifically, each value in the
"Total_Promos" column is the sum of the corresponding values from five accepted
advertising campaigns.

Draw a countplot to show the number of advertising campaigns accepted by the total number
of that campaign ("Total_Promos"). This chart is separated by previously classified clusters
and uses a different color for each cluster (clusters are identified by the "Clusters" column).

Figure 2.37
There has not been an overwhelming response to the campaigns so far: very few
participants overall, and no one took part in all 5 of them. Perhaps better-targeted and
well-planned campaigns are required to boost sales.

Figure 2.38

Figure 2.39
While campaigns didn't perform exceptionally well, the deals offered yielded the best
outcomes for clusters 0 and 3. Surprisingly, our star customers in cluster 1 showed less
interest in the deals. Cluster 2, on the other hand, did not exhibit a strong inclination towards
any particular offer.

Chapter 4. Result communication and
recommendation
4.1. Result
About cluster number 0:
1. All members in the group are confirmed parents.
2. The number of members in each family ranges from 2 to 4.
3. A small subset of the group consists of single parents.
4. Most families in the group have at least one teenager at home.
5. The age of family members is relatively young.
About cluster number 1:
1. Non-Parental Status: All individuals within this cluster are definitively identified
as non-parents.
2. Limited Family Size: Families within this cluster have a maximum of only 2
members, portraying a small household structure.
3. Couples Dominance: A slight majority of the individuals in this cluster are
identified as couples, suggesting a prevalent household structure of pairs over
single individuals.
4. Age Diversity: This cluster encompasses individuals spanning across all age
groups, indicating a wide distribution of ages.
5. High Socioeconomic Status: Notably, this cluster is associated with a high-income
level, reflecting a socioeconomically affluent group.
About cluster number 2:
1. Primarily Parents: The majority of individuals in this group clearly identify as
parents.
2. Family size: Families in this cluster are characterized by limited size, with a
maximum of only 3 members.
3. The majority have 1 child: most individuals in this cluster have only one child,
and these children are often not teenagers.
4. Members of this cluster are relatively younger in age.
About cluster number 3:

1. All individuals within this cluster are definitively identified as parents.
2. Family Size Range: Families in this cluster exhibit a family size ranging from a
minimum of 2 to a maximum of 5 members, suggesting a moderate-sized
household structure.
3. A majority of families within this cluster have a teenager residing at home
4. Lower-Income Status: this cluster is associated with a lower-income group

4.2. Recommendation
Create an In-Depth Marketing Strategy:
1. Building Strategy: Based on the analysis of the K-means results, develop in-depth
follow-up strategies for each group. Focus on each group's preferred communication
channels, languages and approaches.
2. Organize Special Advertising: Create special advertising campaigns for each
customer group. Use images, messages and compatibility levels appropriate to
each group.
Personalization of Services and Products:
1. Integrate Customer Information: Link information from K-means with your CRM
system or customer database to leverage detailed information about each
customer.
2. Personalize Promotions: Create promotions or special offers based on each
group's priorities and shopping patterns. This could include discounts, freebies or
exclusive offers.
3. Personal Interactive Interfaces: If possible, build personalized online interactive
interfaces or mobile applications to provide independent shopping or service
experiences for each group.
Organizing Promotion Program Functions:
1. Determine Priorities and Needs: Based on information from K-means, determine
the common priorities and needs of each customer group. This can be related to
different types of products, services or promotions.
2. Organize Special Promotions: Create special promotions or events for each group.
Use a combination of discounts, giveaways and incentives to optimize appeal.
3. Track and Evaluate Feedback: Continue to monitor the performance of the programs
and gather feedback from customers to evaluate their effectiveness and adjust them
over time.

Chapter 5. Conclusion
In conclusion, the research presented here demonstrates the utility of customer
segmentation in leveraging purchasing data from an E-commerce platform spanning
one year. By employing both descriptive and predictive analytics, particularly utilizing
K-means clustering and RFM segmentation models, the study effectively addresses
the intricacies of customer segmentation in the context of sales analytics. Customer
segmentation, as illuminated by the findings, allows the company to categorize its
diverse customer base into distinct groups with shared characteristics. This tailored
approach facilitates more precise marketing, personalized product offerings, and
improved customer engagement. The integration of K-means clustering and RFM
segmentation models enhances the precision of this process, providing the company
with a sophisticated understanding of its customer segments.

As a result, the study's findings and conclusions equip the company with actionable
insights to optimize its operational approaches. Armed with this knowledge, the
company can strategically tailor its marketing efforts, refine product offerings, and
enhance customer experiences, ultimately fostering sustained growth, operational
efficiency, and increased profitability in the competitive E-commerce landscape.

Reference

