
Course title: Introduction to Data Science

FINAL EXAM ASSIGNMENT


Topic: Student Analyst

Lecturer : Do Trung Tuan


Group member: Pham Phuong Anh - 22070410
Dong Huu Khanh Duy - 22070571
Le Thi Bich Ngoc - 22070439
Phan Thi Thanh Tam - 22070983
Tran Thi Trang - 22070501
Class : DS9-23

Hanoi, November 2023

Acknowledgement
We would like to express our appreciation to all those who helped us complete this report
successfully, especially Associate Professor Do Trung Tuan, Ph.D., for supporting and
encouraging us and teaching us throughout the study process.
Many thanks to VNU-IS for including this subject in the curriculum, helping students like us
understand what Data Science is and gain a deeper look into how it is used in life and its
importance in the 21st century.
A sincere thank you to all the lecturers who put in their best efforts in guiding our team to
achieve our goal, and deep gratitude to our classmates, especially those who took the time
to help and support us whenever we needed it during this project.

Table of content
Acknowledgement...............................................................................................ii
Table of content..................................................................................................iii
List of figures.....................................................................................................iv
Participants.........................................................................................................vi
Chapter 1. INTRODUCTION.............................................................................1
1.1. About the research problem.....................................................1
1.2. About Dataset...........................................................................................1
1.3. Motivation and purpose of the research...................................................2
Chapter 2. Analyze Dataset.................................................................................3
2.1. Data preparation.......................................................................................3
2.2. Data cleaning............................................................................................4
2.3. Deriving New Features.............................................................................6
2.4. Data descriptive........................................................................................9
Chapter 3. Modeling..........................................................................................13
3.1. Data Processing......................................................................................13
3.2. Clustering................................................................................................17
3.3. Evaluating Models..................................................................................20
Chapter 4. Result communication and recommendation...................................25
4.1. Result......................................................................................................25
4.2. Recommendation....................................................................................26
Chapter 5. Conclusion.......................................................................................27
Reference...........................................................................................................28

List of figures
Figure 2.1. Importing Libraries...........................................................................3
Figure 2.2. Read CSV..........................................................................................3
Figure 2.3. Data set..............................................................................................4
Figure 2.4. Remove the NA values......................................................................5
Figure 2.5. Create a feature from Dt. Customer..................................................5
Figure 2.6. Create a feature Customer_For.........................................................6
Figure 2.7. Visualize............................................................................................6
Figure 2.8. Result.................................................................................................7
Figure 2.9. Create new features...........................................................................8
Figure 2.10. Data describe...................................................................................9
Figure 2.10. Result...............................................................................................9
Figure 2.11. Create Graph.................................................................................10
Figure 2.11. Visualize Graph.............................................................................11
Figure 2.12. Remove the outliers data...............................................................11
Figure 2.13. Correlation Matrix.........................................................................11
Figure 2.14. Create Heatmap.............................................................................12
Figure 2.15. List of categorical..........................................................................13
Figure 2.16. Encode the object dtypes...............................................................13
Figure 2.17. Create a copy of data.....................................................................14
Figure 2.18.........................................................................................................14
Figure 2.19.........................................................................................................15
Figure 2.20.........................................................................................................15
Figure 2.21.........................................................................................................15
Figure 2.22.........................................................................................................15
Figure 2.23.........................................................................................................15

Figure 2.24. 3D Projection of data in the reduced dimension...........................16
Figure 2.25. Result.............................................................................................17
Figure 2.26.........................................................................................................17
Figure 2.27.........................................................................................................18
Figure 2.28.........................................................................................................18
Figure 2.29.........................................................................................................19
Figure 2.30.........................................................................................................20
Figure 2.31.........................................................................................................20
Figure 2.32.........................................................................................................21
Figure 2.33.........................................................................................................21
Figure 2.34.........................................................................................................22
Figure 2.35.........................................................................................................22
Figure 2.36.........................................................................................................22
Figure 2.37.........................................................................................................23
Figure 2.38.........................................................................................................23
Figure 2.39.........................................................................................................24

Participants
No | Name               | Student ID | Contribution | Comments
1  | Pham Phuong Anh    | 22070410   | 20%          | Leader, Analyze Dataset, Introduction
2  | Dong Huu Khanh Duy | 22070571   | 20%          | Analyze Dataset
3  | Le Thi Bich Ngoc   | 22070439   | 20%          | Modeling and conclusion
4  | Phan Thi Thanh Tam | 22070983   | 20%          | Abstract, Introduction
5  | Tran Thi Trang     | 22070501   | 20%          | Modeling and conclusion

Chapter 1. INTRODUCTION
1.1. About the research problem
In the dynamic landscape of today's business world, understanding and connecting
with customers on a personal level is pivotal for success. A strategic approach to this is
Customer Personality Analysis, a comprehensive examination of a company's ideal
customers. By delving into the intricacies of their needs, behaviors, and concerns, businesses
can tailor their products to resonate with different customer segments, fostering a more
personalized and targeted marketing strategy.
In this project, the focus is on performing an unsupervised clustering of data extracted
from a groceries firm's database. Through customer segmentation, the aim is to identify and
group customers who exhibit similarities in their preferences, behaviors, and purchasing
patterns. This segmentation enables businesses to assign greater significance to each
customer group, allowing for the customization of products that align with the unique needs
and behaviors of those segments.

1.2. About Dataset


The "Customer Segmentation" dataset that we analyze is taken from the Kaggle dataset
repository. Its attributes are:
- Customer’s Information:
1. ID
2. Year_Birth
3. Education
4. Marital_Status
5. Income
6. Kidhome
7. Teenhome
8. Dt_Customer
9. Recency
10. Complain
- Products: Amount spent on different products in last 2 years
1. MntWines
2. MntFruits
3. MntMeatProducts
4. MntFishProducts
5. MntSweetProducts
6. MntGoldProds
- Place:
1. NumWebPurchases
2. NumCatalogPurchases
3. NumStorePurchases
4. NumWebVisitsMonth
- Promotion:
1. NumDealsPurchases
2. AcceptedCmp1
3. AcceptedCmp2
4. AcceptedCmp3
5. AcceptedCmp4
6. AcceptedCmp5
7. Response

1.3. Motivation and purpose of the research


Customer Personality Analysis is a detailed analysis of a company’s ideal customers.
It helps a business to better understand its customers and makes it easier for them to modify
products according to the specific needs, behaviors and concerns of different types of
customers.

Customer personality analysis helps a business to modify its product based on its
target customers from different types of customer segments. For example, instead of spending
money to market a new product to every customer in the company’s database, a company can
analyze which customer segment is most likely to buy the product and then market the
product only on that particular segment.

Several benefits and improvements that the firm can create from the results of this
research include:

- Understanding customers' behaviors.

- Developing different customer types and personas.

- Promoting targeted, effective marketing campaigns.

- Developing more relevant policies for sales and pricing.

- Improving customer service.

Chapter 2. Analyze Dataset

2.1. Data preparation


First, we import the required libraries.

Figure 2.1. Importing Libraries


Next, we load the dataset by reading the CSV file.

Figure 2.2. Read CSV


In order to get a full grasp of what steps we should take to clean the dataset, let us
have a look at the information in the data.
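Since the code figures (Figures 2.1–2.3) are screenshots not reproduced here, a minimal self-contained sketch of loading and inspecting the data follows. The inline CSV text is an illustrative stand-in; in the actual project this would be `pd.read_csv` on the downloaded Kaggle file (which is tab-separated):

```python
import io
import pandas as pd

# Tiny stand-in for the real file so the snippet is self-contained;
# in the project: data = pd.read_csv("marketing_campaign.csv", sep="\t")
csv_text = (
    "ID\tYear_Birth\tIncome\tDt_Customer\n"
    "1\t1980\t58138\t04-09-2012\n"
    "2\t1990\t\t08-03-2014\n"
)
data = pd.read_csv(io.StringIO(csv_text), sep="\t")

# Inspect dtypes, non-null counts, and memory usage
data.info()
print("Missing Income values:", data["Income"].isna().sum())
```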

Figure 2.3. Data set
From the above output, we can conclude and note that:
1. There are missing values in income
2. Dt_Customer that indicates the date a customer joined the database is not parsed as
DateTime
3. There are some categorical features in our data frame; as there are some features in
dtype: object. So we will need to encode them into numeric forms later.

2.2. Data cleaning


First of all, for the missing values, I am simply going to drop the rows that have missing
income values.

Figure 2.4. Remove the NA values

This command cleans data by removing rows that contain at least one missing value, then
provides an indication of the amount of data remaining.
- data.dropna(): It is used to drop rows containing at least one NA value.
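The removal step (Figure 2.4) can be sketched as follows, using a tiny illustrative frame in place of the real data:

```python
import pandas as pd

# Illustrative frame with one missing Income value
data = pd.DataFrame({"ID": [1, 2, 3], "Income": [58138.0, None, 71613.0]})

# Drop every row containing at least one NA value
data = data.dropna()
print("The total number of data-points after removing the rows with missing values are:", len(data))
```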
Create a feature from "Dt_Customer" that displays the number of days the customer is
registered in the company database. However, for simplicity's sake, I'm taking this value
relative to the most recent customer on file.

Figure 2.5. Create a feature from Dt. Customer


pd.to_datetime(data["Dt_Customer"]): Use the pd.to_datetime function to convert the
"Dt_Customer" column from object data format to datetime data format. This makes it
possible to perform operations and calculations with temporal data.
An empty list dates is created to store date values.
The loop goes through each value in the "Dt_Customer" column and retrieves the date from
each datetime object.
The date values are added to the dates list.

Thus, this code converts the column "Dt_Customer" to datetime format and then extracts the
date from each datetime object, finally printing out the newest and oldest registration dates in
the dataset.
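The conversion described above (Figure 2.5) might look like the sketch below; the sample date strings are assumptions, and `dayfirst=True` reflects the day-month-year format used in the Kaggle file:

```python
import pandas as pd

# Illustrative stand-in for the "Dt_Customer" column
data = pd.DataFrame({"Dt_Customer": ["04-09-2012", "08-03-2014", "21-08-2013"]})

# Convert from object dtype to datetime so we can do date arithmetic
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], dayfirst=True)

# Collect the date part of each datetime object
dates = [d.date() for d in data["Dt_Customer"]]
print("Newest customer's enrolment date:", max(dates))
print("Oldest customer's enrolment date:", min(dates))
```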

2.3. Deriving New Features


Creating a feature ("Customer_For") of the number of days the customers started to shop in
the store relative to the last recorded date.

Figure 2.6. Create a feature Customer_For


 days: This list stores the number of days each customer has been associated with the
business.
 d1 = max(dates): d1 is the latest date in the customer date list, representing the
newest customer.
 for i in dates: delta = d1 - i; days.append(delta): A loop calculates the number of
days from the latest date (newest customer) to each customer's date and appends the
delta value to the days list.
 data["Customer_For"] = days: Adds a new column "Customer_For" to the data set
with values representing the number of days each customer has been associated with
the business.
 pd.to_numeric(data["Customer_For"], errors="coerce"): Converts the "Customer_For"
column to numeric format, with errors="coerce" turning values that cannot be
converted into NaN.
In short, this code creates a new feature "Customer_For" to represent the number of days
between the customer's registration date and the latest registration date in the data set.
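The steps above can be sketched as a short snippet; the sample dates are illustrative, standing in for the parsed "Dt_Customer" values:

```python
import pandas as pd
from datetime import date

data = pd.DataFrame({"ID": [1, 2, 3]})
dates = [date(2012, 9, 4), date(2014, 3, 8), date(2013, 8, 21)]  # assumed sample

days = []
d1 = max(dates)  # the newest customer on file
for i in dates:
    delta = d1 - i
    days.append(delta.days)  # days between this customer and the newest one

data["Customer_For"] = days
# Coerce to numeric so any unconvertible values become NaN
data["Customer_For"] = pd.to_numeric(data["Customer_For"], errors="coerce")
print(data["Customer_For"].tolist())
```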
Find out the unique values in the categorical features to get a clear idea of the data.

Figure 2.7. Visualize
- data["Education"].value_counts(): This function counts the occurrences of each unique
value in the "Education" column and returns a Series with the count of each value (the
same is done for "Marital_Status").
Print result:
1. Total categories in the feature Marital_Status: Prints the number of unique values in
the "Marital_Status" column and the number of occurrences of each value.
2. Total categories in the feature Education: Prints the number of unique values in the
"Education" column and the number of occurrences of each value.

Figure 2.8. Result


In the next bit, we will be performing the following steps to engineer some new features:
1. Extract the "Age" of a customer by the "Year_Birth" indicating the birth year of the
respective person.
2. Create another feature "Spent" indicating the total amount spent by the customer in
various categories over the span of two years.
3. Create another feature "Living_With" out of "Marital_Status" to extract the living
situation of couples.
4. Create a feature "Children" to indicate total children in a household that is, kids and
teenagers.
5. To get further clarity of household, Creating feature indicating "Family_Size"
6. Create a feature "Is_Parent" to indicate parenthood status
7. Lastly, I will create three categories in the "Education" by simplifying its value
counts.
8. Dropping some of the redundant features

Figure 2.9. Create new features
In the above code snippet, a series of feature-engineering decisions are made on the
DataFrame 'data':
1. Age of current customers: Create a new feature "Age", representing the customer's
age in 2021 based on their year of birth.
2. Total cost for different items: Create a new feature "Spent" by adding up the total
amount of money spent on the different product types.
3. Determine living situation based on marital status: Create a new feature
"Living_With" to specify a living situation based on marital status. For example,
"Alone" is assigned to marital statuses such as "Absurd," "Widow," "YOLO,"
"Divorced," and "Single."
4. Children living in the household: Create a new feature "Children" by adding the
columns "Kidhome" and "Teenhome."
5. Total number of family members: Create a new feature "Family_Size" based on the
living situation: for "Alone" the base size is 1, for "Partner" it is 2, and the number
of children is added to these values.
6. Parenthood: Create a new feature "Is_Parent" using a binary indicator (1 or 0)
based on whether the customer has children.
7. For clarity: Some column names are renamed to make them easier to understand.
8. Eliminate redundant features: Some columns are removed because they are
considered redundant.

These engineering steps aim to generate meaningful and informative features from the
existing data set, which can improve the performance of machine learning models trained on
this data.
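These engineering steps (Figure 2.9) can be sketched as below; the two sample rows and their values are illustrative, not taken from the real dataset:

```python
import pandas as pd

# Two illustrative customers
data = pd.DataFrame({
    "Year_Birth": [1980, 1995],
    "Marital_Status": ["Married", "Single"],
    "Kidhome": [1, 0], "Teenhome": [1, 0],
    "MntWines": [100, 20], "MntFruits": [10, 5], "MntMeatProducts": [50, 10],
    "MntFishProducts": [20, 5], "MntSweetProducts": [5, 2], "MntGoldProds": [15, 3],
})

# 1. Age relative to 2021
data["Age"] = 2021 - data["Year_Birth"]
# 2. Total amount spent across product categories
data["Spent"] = (data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"]
                 + data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"])
# 3. Living situation derived from marital status
data["Living_With"] = data["Marital_Status"].replace(
    {"Married": "Partner", "Together": "Partner", "Absurd": "Alone",
     "Widow": "Alone", "YOLO": "Alone", "Divorced": "Alone", "Single": "Alone"})
# 4. Total children in the household
data["Children"] = data["Kidhome"] + data["Teenhome"]
# 5. Family size = adults (1 or 2) + children
data["Family_Size"] = data["Living_With"].map({"Alone": 1, "Partner": 2}) + data["Children"]
# 6. Binary parenthood indicator
data["Is_Parent"] = (data["Children"] > 0).astype(int)
print(data[["Age", "Spent", "Family_Size", "Is_Parent"]])
```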

Figure 2.10. Data describe


The describe() function in Pandas is used to create a statistical summary of the quantitative
characteristics of a DataFrame. Here is a detailed explanation of the results you can see from
data.describe().

Figure 2.10. Result


The above stats show some discrepancies in the mean and max values of Income and Age.
Note that the max Age is 128 years: Age was calculated relative to today (i.e. 2021),
while the data itself is old.

2.4. Data descriptive


I must take a look at the broader view of the data. I will plot some of the selected features.

Figure 2.11. Create Graph
These functions are used to configure colors for graph objects. rc (resource
configuration) is used to set axis and figure properties. pallet is a list of colors used
to create the palette for the graph, and cmap is a Colormap object for use in graphs.
Draw Scatter Graph:
1. To_Plot is the list of features you want to plot.
2. sns.pairplot(data[To_Plot],hue="Is_Parent",palette=(["#682F2F","#F3AB60"])): Use
Seaborn's pairplot function to plot the scatter plot for pairs of features in To_Plot list.
The hue column is set to "Is_Parent", which means the color of the points will be
based on the value of the "Is_Parent" column. Colors are selected from a preset
palette.
3. plt.show(): Display the graph.

Figure 2.11. Visualize Graph
There are a few outliers in the Income and Age features. We will be deleting the outliers
in the data.

Figure 2.12. Remove the outliers data


The above code removes outliers from the DataFrame data by setting thresholds for the
"Age" and "Income" columns:
1. Remove outliers for the "Age" column: retain rows only if the value of the "Age"
column is less than 90; rows with an "Age" value greater than or equal to 90 are
discarded.
2. Remove outliers for the "Income" column: similarly, retain rows only if the value of
the "Income" column is less than 600,000; rows with an "Income" value greater than
or equal to 600,000 are removed.
3. Print the amount of data remaining after removing outliers.
The result of this code is that the DataFrame data contains only rows that fall within
the thresholds set for the "Age" and "Income" columns.
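A minimal sketch of this filtering (Figure 2.12), with assumed sample rows:

```python
import pandas as pd

data = pd.DataFrame({"Age": [35, 128, 41], "Income": [58138, 71613, 666666]})

# Keep rows below the chosen thresholds
data = data[data["Age"] < 90]
data = data[data["Income"] < 600000]
print("The total number of data-points after removing the outliers are:", len(data))
```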

Figure 2.13. Correlation Matrix


- plt.figure(figsize=(20, 20)): This command sets the size of the figure to create the
heatmap. Adjusting image size provides better visualization, especially when there are
many variables.
- sns.heatmap(corrmat, annot=True, cmap=cmap, center=0): This command creates a
heatmap using the Seaborn library. corrmat is the correlation matrix, annot=True adds
a numeric value to the heatmap, cmap specifies the colormap, and center=0 sets the
center of the colormap to zero.
The resulting heatmap provides a visual representation of the correlation between different
pairs of variables in the data set. Positive correlations are indicated by light colors, negative
correlations by dark colors, and correlations close to zero are indicated by neutral colors.
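The correlation computation behind the heatmap (Figures 2.13–2.14) can be sketched on synthetic data; the two generated columns and their relationship are assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(50000, 10000, 200)
spent = income * 0.01 + rng.normal(0, 50, 200)  # deliberately positively correlated
df = pd.DataFrame({"Income": income, "Spent": spent})

# Pairwise Pearson correlation matrix
corrmat = df.corr()
print(corrmat)

# Plotting (requires matplotlib/seaborn):
# plt.figure(figsize=(20, 20))
# sns.heatmap(corrmat, annot=True, cmap=cmap, center=0)
```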

Figure 2.14. Create Heatmap
We can see that there is a high correlation between Spent and Wines (0.89), Income
(0.79), Meat (0.85), NumCatalogPurchases (0.78), and NumStorePurchases (0.68); we can
use these variables to predict the customer segmentation. However, some correlations are
weak or negative, such as Children and Income (-0.34) and Spent and NumDealsPurchases
(-0.066); these are not strong enough to be good predictors.

Chapter 3. Modeling
3.1. Data Processing
Before performing the clustering analysis, data preprocessing is important to ensure
that the data is normalized and suitable for modeling.
The following steps are applied to preprocess the data:
1. Label encoding the categorical features
2. Scaling the features using the standard scaler
3. Creating a subset dataframe for dimensionality reduction

Figure 2.15. List of categorical


Categorical variables in the dataset: ['Education', 'Living_With']

Figure 2.16. Encode the object dtypes


All features are now numerical
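The label-encoding step (Figure 2.16) might look like the sketch below; the sample category values are assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    "Education": ["Graduate", "Postgraduate", "Undergraduate"],
    "Living_With": ["Partner", "Alone", "Partner"],
})

# Find the object-dtype (categorical) columns
object_cols = [c for c in data.columns if data[c].dtype == "object"]

# Encode each categorical column into integer codes
LE = LabelEncoder()
for col in object_cols:
    data[col] = LE.fit_transform(data[col])
print("All features are now numerical")
```

LabelEncoder assigns codes in sorted order of the category names, so e.g. "Alone" becomes 0 and "Partner" becomes 1.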

Figure 2.17. Create a copy of data
1. It creates a copy of the original dataset data and assigns it to the variable ds.
2. It creates a subset of the DataFrame by removing the specified features related to
accepted campaigns (AcceptedCmp3, AcceptedCmp4, AcceptedCmp5,
AcceptedCmp1, AcceptedCmp2), customer complaints (Complain), and customer
responses (Response).
3. It uses the StandardScaler from scikit-learn to standardize the remaining features in
the dataset. The scaled data is stored in the DataFrame scaled_ds.
4. A print statement indicating that all features have been successfully scaled.
All features are now scaled
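The copy-and-scale step (Figure 2.17) can be sketched as below; the two columns are illustrative stand-ins for the remaining numeric features:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Copy of the data with campaign/complaint/response columns already dropped
ds = pd.DataFrame({"Income": [40000, 50000, 60000], "Spent": [200, 500, 800]})

# Standardize: each column gets mean 0 and unit variance
scaler = StandardScaler()
scaler.fit(ds)
scaled_ds = pd.DataFrame(scaler.transform(ds), columns=ds.columns)
print("All features are now scaled")
```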

Figure 2.18
Dataframe to be used for further modeling:

Figure 2.19

Figure 2.20

Figure 2.21
Dimensionality Reduction

Figure 2.22

Figure 2.23
1. pca = PCA(n_components=3): Initializes PCA with the desired number of
components set to 3.
2. pca.fit(scaled_ds): Fits the PCA model to the scaled data.

3. PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=["col1", "col2",
"col3"]): Transforms the scaled data into a new DataFrame named PCA_ds,
containing the first three principal components as columns labeled "col1", "col2", and
"col3".
4. PCA_ds.describe().T: Provides a summary statistics table for the transformed data,
transposing the table for better readability.
The resulting PCA_ds DataFrame contains the reduced-dimensional representation of
the original scaled data.
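The PCA steps above can be sketched on random standardized data (the 50×6 input is an assumption; the real input is the scaled dataset):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
scaled_ds = pd.DataFrame(rng.normal(size=(50, 6)))  # stand-in for the scaled data

# Reduce to the first three principal components
pca = PCA(n_components=3)
pca.fit(scaled_ds)
PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=["col1", "col2", "col3"])

# Transposed summary statistics of the reduced data
print(PCA_ds.describe().T)
```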
The 3D scatter plot serves as a powerful visualization tool, offering a comprehensive
representation of the dataset in its reduced dimensionality. In this plot, each data point is
depicted as a marker within the three-dimensional space, with the x, y, and z axes
corresponding to the first three principal components obtained through Principal Component
Analysis (PCA).

Figure 2.24. 3D Projection of data in the reduced dimension


1. x = PCA_ds["col1"], y = PCA_ds["col2"], z = PCA_ds["col3"]: Extracts the values of
the first three principal components from the PCA_ds DataFrame for plotting.
2. fig = plt.figure(figsize=(10, 8)): Creates a new figure with a specified size.
3. ax = fig.add_subplot(111, projection="3d"): Adds a 3D subplot to the figure.
4. ax.scatter(x, y, z, c="maroon", marker="o"): Plots the 3D scatter plot using the
extracted components, with maroon-colored markers.
5. ax.set_title("A 3D Projection Of Data In The Reduced Dimension"): Sets the title of
the plot.
6. plt.show(): Displays the 3D projection plot.

The resulting plot visually represents the data in the reduced dimension using a 3D
scatter plot:

Figure 2.25. Result

3.2. Clustering
First, we will use the elbow criterion method to find the optimum number of clusters. Let’s
set a limit for the number of clusters at 10 and plot the distortion score value of each number.
The elbow method is useful for determining the optimal number of clusters by identifying the
point where further partitioning of data provides diminishing returns. In the visualization, the
"elbow" is typically the point where the distortion starts to decrease at a slower rate.

Figure 2.26
- print('Elbow Method to determine the number of clusters to be formed:'): A print
statement indicating the purpose of the following code.
- Elbow_M = KElbowVisualizer(KMeans(), k=10): Initializes the KElbowVisualizer
with the KMeans clustering model and sets the maximum number of clusters (k) to
10.
- Elbow_M.fit(PCA_ds): Fits the visualizer to the reduced-dimension dataset obtained
through PCA.
- Elbow_M.show(): Displays the elbow method plot, where the x-axis represents the
number of clusters, and the y-axis indicates the distortion (inertia) of the clusters.
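The report uses Yellowbrick's KElbowVisualizer; the same elbow curve can be computed with a plain KMeans inertia loop, shown here on synthetic two-blob data (the blobs are an assumption to make the elbow visible):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two well-separated blobs in 3-D, standing in for PCA_ds
X = np.vstack([rng.normal(0, 0.5, (40, 3)), rng.normal(5, 0.5, (40, 3))])

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # distortion for this k

# The "elbow": inertia drops sharply up to the true cluster count, then flattens
print(inertias)
```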

Figure 2.27
Based on the distortion score result on the plot, we will fit k-means with 4 clusters. Next, we
will be fitting the Agglomerative Clustering Model to get the final clusters.

Figure 2.28
- AC = AgglomerativeClustering(n_clusters=4): Initializes the Agglomerative
Clustering model with the specified number of clusters, in this case, 4.

- yhat_AC = AC.fit_predict(PCA_ds): Fits the Agglomerative Clustering model to the
reduced-dimension dataset obtained through PCA and predicts the cluster assignments
for each data point.
- PCA_ds["Clusters"] = yhat_AC: Adds a new column "Clusters" to the reduced-
dimension dataset (PCA_ds) containing the cluster assignments for each data point.
- data["Clusters"] = yhat_AC: Adds the "Clusters" feature to the original dataset (data)
to associate each data point with its corresponding cluster.
In the last step, we will examine the clusters formed. Let's have a look at the 3-D distribution
of the clusters.

Figure 2.29

3.3. Evaluating Models
Since this is unsupervised clustering, we do not have a labeled feature to evaluate or
score our model. The purpose of this section is to study the patterns in the clusters
formed and determine the nature of those patterns.
For that, we will be having a look at the data in light of clusters via exploratory data analysis
and drawing conclusions.
Firstly, let us have a look at the group distribution of clustering.

Figure 2.30

Figure 2.31
The clusters seem to be fairly distributed. Next, we profile the clusters based on income
and spending (x = Income, y = Spent), where:
data["Spent"] = data["MntWines"] + data["MntFruits"] + data["MntMeatProducts"] +
data["MntFishProducts"] + data["MntSweetProducts"] + data["MntGoldProds"]

Figure 2.32
The color differentiation allows for easy identification of clusters, and the legend aids in
interpreting the cluster assignments.

Figure 2.33
Income vs spending plot shows the clusters pattern
- group 0: high spending & average income
- group 1: high spending & high income
- group 2: low spending & low income
- group 3: high spending & low income
Next, I will be looking at the detailed distribution of clusters as per the various products in
the data. Namely: Wines, Fruits, Meat, Fish, Sweets and Gold.

Figure 2.34

Figure 2.35
From the above plot, it can be clearly seen that cluster 1 is our biggest set of customers
closely followed by cluster 0. We can explore what each cluster is spending on for the
targeted marketing strategies.
data["Total_Promos"]=data["AcceptedCmp1"]+data["AcceptedCmp2"]
+data["AcceptedCmp3"]+ data["AcceptedCmp4"]+ data["AcceptedCmp5"]
Let us next explore how our campaigns did in the past.

Figure 2.36
Created a new feature called "Total_Promos" by aggregating the number of advertising
campaigns that the customer has accepted. This feature is created by summing the values
from the "AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", and
"AcceptedCmp5" columns in the dataframe data. Specifically, each value in the
"Total_Promos" column is the sum of the corresponding values from five accepted
advertising campaigns.

Draw a countplot to show the number of advertising campaigns accepted by the total number
of that campaign ("Total_Promos"). This chart is separated by previously classified clusters
and uses a different color for each cluster (clusters are identified by the "Clusters" column).

Figure 2.37
There has not been an overwhelming response to the campaigns so far: very few
participants overall, and no one took part in all 5 of them. Perhaps better-targeted and
well-planned campaigns are required to boost sales.

Figure 2.38

Figure 2.39
While campaigns didn't perform exceptionally well, the deals offered yielded the best
outcomes for clusters 0 and 3. Surprisingly, our star customers in cluster 1 showed less
interest in the deals. Cluster 2, on the other hand, did not exhibit a strong inclination towards
any particular offer.

Chapter 4. Result communication and
recommendation
4.1. Result
About cluster number 0:
1. All members in the group are confirmed parents.
2. The number of members in each family ranges from 2 to 4.
3. A small subset of the group consists of single parents.
4. Most families in the group have at least one teenager at home.
5. The age of family members is relatively young.
About cluster number 1:
1. Non-Parental Status: All individuals within this cluster are definitively identified
as non-parents.
2. Limited Family Size: Families within this cluster have a maximum of only 2
members, portraying a small household structure.
3. Couples Dominance: A slight majority of the individuals in this cluster are
identified as couples, suggesting a prevalent household structure of pairs over
single individuals.
4. Age Diversity: This cluster encompasses individuals spanning across all age
groups, indicating a wide distribution of ages.
5. High Socioeconomic Status: Notably, this cluster is associated with a high-income
level, reflecting a socioeconomically affluent group.
About cluster number 2:
1. Primarily Parents: The majority of individuals in this group clearly identify as
parents.
2. Family size: Families in this cluster are characterized by limited size, with a
maximum of only 3 members.
3. The majority have 1 child: most individuals in this cluster have only one child,
and these children are often not teenagers.
4. Members of this cluster are relatively younger in age.
About cluster number 3:

1. All individuals within this cluster are definitively identified as parents.
2. Family Size Range: Families in this cluster exhibit a family size ranging from a
minimum of 2 to a maximum of 5 members, suggesting a moderate-sized
household structure.
3. A majority of families within this cluster have a teenager residing at home
4. Lower-Income Status: this cluster is associated with a lower-income group

4.2. Recommendation
Create an In-Depth Marketing Strategy:
1. Building Strategy: Based on the analysis of the K-means results, develop in-depth
follow-up strategies for each group. Focus on each group's preferred communication
channels, languages and approaches.
2. Organize Special Advertising: Create special advertising campaigns for each
customer group. Use images, messages and compatibility levels appropriate to
each group.
Personalization of Services and Products:
1. Integrate Customer Information: Link information from K-means with your CRM
system or customer database to leverage detailed information about each
customer.
2. Personalize Promotions: Create promotions or special offers based on each
group's priorities and shopping patterns. This could include discounts, freebies or
exclusive offers.
3. Personal Interactive Interfaces: If possible, build personalized online interactive
interfaces or mobile applications to provide independent shopping or service
experiences for each group.
Organizing Promotion Program Functions:
1. Determine Priorities and Needs: Based on information from K-means, determine
the common priorities and needs of each customer group. This can be related to
different types of products, services or promotions.
2. Organize Special Promotions: Create special promotions or events for each group.
Use a combination of discounts, giveaways and incentives to optimize appeal.
3. Track and Evaluate Feedback: Continue to monitor the performance of the programs
and gather feedback from customers to evaluate their effectiveness and adjust them
over time.

Chapter 5. Conclusion
In conclusion, the research presented here demonstrates the utility of customer
segmentation in leveraging purchasing data from an E-commerce platform spanning
one year. By employing both descriptive and predictive analytics, particularly utilizing
K-means clustering and RFM segmentation models, the study effectively addresses
the intricacies of customer segmentation in the context of sales analytics. Customer
segmentation, as illuminated by the findings, allows the company to categorize its
diverse customer base into distinct groups with shared characteristics. This tailored
approach facilitates more precise marketing, personalized product offerings, and
improved customer engagement. The integration of K-means clustering and RFM
segmentation models enhances the precision of this process, providing the company
with a sophisticated understanding of its customer segments.

As a result, the study's findings and conclusions equip the company with actionable
insights to optimize its operational approaches. Armed with this knowledge, the
company can strategically tailor its marketing efforts, refine product offerings, and
enhance customer experiences, ultimately fostering sustained growth, operational
efficiency, and increased profitability in the competitive E-commerce landscape.

Reference

