A Beginner's Guide To Customer Segmentation With Python - by Sigli Mumuni - Medium


08.03.2022, 16:42 A Beginner’s Guide to Customer Segmentation with Python | by Sigli Mumuni | Medium

Sigli Mumuni

Jan 6 · 7 min read

A Beginner’s Guide to Customer Segmentation with Python
A step-by-step introduction to clustering analysis

Photo by Olivier Le Moal on Shutterstock

Customer segmentation is the process of splitting your customer base into different
groups based on common characteristics. These characteristics are usually
demographic, like age, sex, and income, but psychographic or behavioral
characteristics like personality, interests, and habits are often considered as well.
Customer segmentation allows a business to deliver more targeted and effective
marketing that appeals to the different segments identified.

While customer segmentation has been around for as long as marketing itself, recent
advances in machine learning have made the process easier and more accurate. We can
easily implement customer segmentation using clustering analysis, a type of
unsupervised machine learning technique that places subjects in different groups (or
clusters) based on how closely associated they are with each other.
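
To make "how closely associated" a little more concrete, here is a minimal sketch of centroid-based grouping, the idea behind the K-means algorithm used later in this tutorial. The points and starting centers are made up for illustration, and this is a simplified version of the procedure rather than scikit-learn's actual implementation.

```python
import numpy as np

# Six made-up 2D points forming two obvious groups (illustrative data only)
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

# Two hypothetical starting centers
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

for _ in range(5):
    # Assignment step: distance from every point to every center,
    # then each point joins the cluster of its nearest center
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: move each center to the mean of the points assigned to it
    centers = np.array([points[labels == k].mean(axis=0)
                        for k in range(len(centers))])

print(labels)   # cluster index of each point
print(centers)  # final center of each cluster
```

Points that sit close together end up sharing a label, which is all that "cluster" means here.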

In this tutorial, we will implement customer segmentation using the K-means
clustering algorithm from the scikit-learn library in Python. We will be using the mall
customers dataset, which contains information about customers of an undisclosed
mall. It consists of 200 rows and five attributes:

Customer ID

Gender

Age

Annual Income (in thousands of dollars)

Spending Score (based on customer behavior and spending patterns).

You can download the dataset from the Kaggle website or my GitHub repository if you
want to follow along. All the relevant code used in this tutorial is also available in my
GitHub repository.

Importing the libraries


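For this tutorial, we will be using NumPy, Pandas, Matplotlib, Seaborn, mpl_toolkits
(which ships with Matplotlib), and Scikit-learn. If you don’t have these installed
already, you can install them with pip install; inside a Jupyter notebook, prefix the
command with an exclamation mark (!pip install).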

#Import the relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans

Loading the data


Next, we’ll need to load the dataset into pandas using the read_csv() function. Here, I
have provided the URL to the location of the dataset in my GitHub repository as the
argument. If you have downloaded the file to your computer, then be sure to enter the
file path instead.
#Load the dataset
df = pd.read_csv("https://raw.githubusercontent.com/siglimumuni/Datasets/master/Mall_Customers.c

#View the first 5 rows
df.head()

Preview of the mall customers dataset

We can get a glimpse of the dataset by using the head() method to display the first 5
rows of data. We can also use the info() method to get a quick breakdown of the
structure of the dataset including the number of rows and columns and data types of
all the columns as well as information on missing values.

#Check the structure of the dataset
df.info()


Structure of the mall customers dataset

We have a total of 200 rows of data and 5 columns: four hold integers and one (Gender)
holds strings, stored with the object dtype. The dataset contains no null values.
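
If you prefer to check these facts programmatically rather than reading them off the info() output, something like the following sketch may help. The small dataframe below is made up for illustration (it only borrows the column names of the mall dataset), so swap in your own df when following along.

```python
import pandas as pd

# A small made-up frame with the same column names as the mall dataset
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Female", "Male", "Female"],
    "Age": [25, 41, 33],
    "Annual Income (k$)": [30, 72, 55],
    "Spending Score (1-100)": [60, 35, 48],
})

# Number of rows and columns
print(sample.shape)

# Missing values per column
missing = sample.isna().sum()
print(missing)

# Names of the numeric columns only
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```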

Before we perform our cluster analysis, we will conduct an exploratory data analysis to
better understand the characteristics of the dataset and familiarize ourselves with the
relationships between the different variables.

Exploring the data


We can start exploring the data by using the describe() method. This gives us quick
summary statistics on all numeric variables, like the mean, median, max, and min
values. We can pass the result to the round() function to limit the output to two
decimal places. Before proceeding, we can drop the CustomerID column as it is
irrelevant for this analysis. The Gender column is left out automatically, since
describe() only summarizes numeric columns by default.

#Check the summary statistics for the numeric columns
round(df.drop(columns="CustomerID").describe(),2)
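
As a quick illustration of how describe() treats numeric and non-numeric columns, here is a small sketch on made-up data (the values are hypothetical and not taken from the mall dataset). It also shows that a categorical column can still be summarized when you ask for it explicitly.

```python
import pandas as pd

# Made-up data: one categorical column and one numeric column
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [20, 30, 40, 50],
})

# By default only the numeric column is summarized
numeric_summary = sample.describe().round(2)
print(numeric_summary)

# A categorical column can be summarized on its own
gender_summary = sample["Gender"].describe()
print(gender_summary)
```

For a text column, describe() reports the count, the number of unique values, the most frequent value, and how often it occurs.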


For the mall customers, the mean age is 38.85 and the mean annual income is around
$60,000. We can explore these variables in more depth by visualizing their
distributions with histograms. We can create multiple plots side by side by calling
plt.subplots() and then looping over the resulting axes, drawing one histogram on each
with seaborn's histplot() function.

#Create a subplot object with one row and three columns
fig,axes = plt.subplots(nrows=1,ncols=3,figsize=[16,4],sharey=True)

#Plot three histograms, Age, Annual Income and Spending Score
for i,col in enumerate(["Age","Annual Income (k$)","Spending Score (1-100)"]):
    sns.histplot(df[col],bins=20,ax=axes[i]).set(ylabel=" ")

There’s a wide range of different ages represented with most customers belonging to
the 20–40 year range. Also, the majority of customers are in the 60 to 80 thousand
dollars annual income bracket while most customers’ spending score is between 40 and
60.

Next, let’s check the proportion of males and females in the dataset. We can use the
countplot() method in seaborn to create a bar chart.

#Check the proportion of males and females in the dataset
ax = sns.countplot(x=df["Gender"])
total = len(df)

#Annotate bars with percentage values
for p in ax.patches:
    percentage = f'{100 * p.get_height() / total:.1f}%\n'
    x = p.get_x() + p.get_width() / 2
    y = p.get_height()
    ax.annotate(percentage, (x, y), ha='center', va='center')

ax.set(title="Proportion of Males and Females",xlabel="")
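
If you only need the numbers and not the chart, the same percentages can be computed directly with pandas. This is a small sketch on a made-up Gender column; with the real data you would call value_counts() on df["Gender"] instead.

```python
import pandas as pd

# Made-up gender values for illustration
gender = pd.Series(["Female", "Male", "Female", "Female", "Male"], name="Gender")

# normalize=True returns proportions instead of raw counts
shares = gender.value_counts(normalize=True).mul(100).round(1)
print(shares)
```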

In the mall dataset, female customers outnumber male customers. Finally, we can explore
the relationships between the different variables in the dataset. One great way to do
this is to use the corr() method to show the correlation between the different variables.

#Check the correlation between the different variables
corr_matrix = df[["Age","Annual Income (k$)","Spending Score (1-100)"]].corr()

#Visualize the correlation using seaborn
sns.heatmap(corr_matrix, cmap="coolwarm",annot=True)


There doesn’t seem to be any correlation between the different variables except for Age
and Spending Score, which share a weak negative correlation.
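
By default, corr() reports the Pearson correlation coefficient, which ranges from -1 to 1. To show what that number measures, here is a short sketch that computes it by hand and compares it with NumPy's built-in result. The ages and scores are made up for illustration and are not taken from the mall dataset.

```python
import numpy as np

# Hypothetical values: spending tends to fall as age rises
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
score = np.array([80.0, 70.0, 65.0, 40.0, 45.0])

# Deviations of each value from its mean
age_dev = age - age.mean()
score_dev = score - score.mean()

# Pearson r: co-variation divided by the product of the individual spreads
r_manual = (age_dev * score_dev).sum() / np.sqrt(
    (age_dev ** 2).sum() * (score_dev ** 2).sum()
)

# The same coefficient from NumPy
r_numpy = np.corrcoef(age, score)[0, 1]

print(r_manual, r_numpy)
```

A negative value means one variable tends to decrease as the other increases; values near zero mean there is little linear relationship.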

This concludes our exploratory data analysis. We can now move on to our main task.

Segmenting the data


In this section, we will focus on building a K-means model to segment our customers
based on two variables, Annual Income and Spending Score, as a start.

One of the key arguments we need to specify in a K-means clustering model is the
number of clusters. The optimal number of clusters varies from dataset to
dataset. Fortunately, there is a tried and tested way to arrive at this number,
known as the elbow method.

We begin by plotting the Within-Cluster Sum of Squared Errors (WCSS), which measures
how far the points lie from the center of their assigned cluster, as a function of the
number of clusters. We then pick out the value at the elbow of the curve as the number
of clusters to use.

#Create a subset of the dataframe with only Annual Income and Spending Score
X = df[["Annual Income (k$)","Spending Score (1-100)"]]

#Determine the variation in the data
wcss=[]
for i in range(1,11):
    km=KMeans(n_clusters=i)
    km.fit(X)
    wcss.append(km.inertia_)

#Plot the elbow curve
plt.figure(figsize=(12,6))
plt.plot(range(1,11),wcss, linewidth=2, color="blue", marker ="8")
plt.xlabel("Number of Clusters (K)")
plt.xticks(np.arange(1,11,1))
plt.title("The Elbow Method")
plt.ylabel("WCSS")
plt.show()
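
The inertia_ attribute used above is scikit-learn's name for the WCSS. To see what it contains, the sketch below recomputes it by hand on synthetic data. The two random blobs, the fixed random_state, and the explicit n_init are assumptions added for reproducibility and are not part of the original code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up blobs of 50 points each (synthetic data, for illustration only)
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0, 1, size=(50, 2)),
    rng.normal(6, 1, size=(50, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Look up, for every point, the center of the cluster it was assigned to
own_centers = km.cluster_centers_[km.labels_]

# WCSS: squared distance from each point to its own center, summed over all points
wcss_manual = ((data - own_centers) ** 2).sum()

print(wcss_manual, km.inertia_)
```

The two printed values should agree up to floating-point rounding.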


Our challenge now is to determine the optimal value of K from the elbow diagram. The
trick is to identify the value at which the WCSS suddenly stops decreasing significantly
compared to previous decreases. In our case, we notice that the drop after 5 is
relatively minimal so we choose 5 as our optimal value. With this information, we can
now build our model.
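
Reading the elbow off a chart is a judgment call. One rough way to back it up with numbers is to look at how much the WCSS falls, in relative terms, each time a cluster is added. The helper relative_drops below is a hypothetical function written for this sketch, and the WCSS values are invented rather than taken from the mall dataset.

```python
def relative_drops(wcss):
    """Return the fractional decrease in WCSS for each step from K to K+1."""
    return [(prev - curr) / prev for prev, curr in zip(wcss, wcss[1:])]

# Hypothetical WCSS values for K = 1..7 (not from the mall dataset)
example_wcss = [1000.0, 600.0, 400.0, 250.0, 150.0, 140.0, 132.0]

drops = relative_drops(example_wcss)
for k, drop in enumerate(drops, start=2):
    print(f"K={k - 1} -> K={k}: WCSS falls by {drop:.0%}")
```

In this invented series, the decrease shrinks sharply once K goes past 5, which is the kind of pattern the elbow method looks for. Treat it as a heuristic to support the chart, not a replacement for it.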

#Build the model with 5 clusters specified
kmeans_model=KMeans(n_clusters=5)

#Fit the input data to the model
kmeans_model.fit(X)

#Segment the input data by assigning labels
y = kmeans_model.predict(X)

#Create a new column in the original dataset for the labels
df["label"] = y

#The dataframe with clustering complete
df.head()
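
Fitting the model and then predicting on the same data can also be done in one call with fit_predict(), and the fitted model exposes the cluster centers through cluster_centers_. The sketch below shows both on a tiny made-up dataset; the fixed random_state and explicit n_init are assumptions added so the run is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up points in two well-separated groups
data = np.array([[1, 1], [1, 2], [2, 1],
                 [9, 9], [9, 10], [10, 9]], dtype=float)

model = KMeans(n_clusters=2, n_init=10, random_state=42)

# Fit the model and get the label of every training point in one step
labels = model.fit_predict(data)

# Predicting on the same data gives the same labels
same = np.array_equal(labels, model.predict(data))

# One row per cluster, one column per feature
centers = model.cluster_centers_

print(labels, same)
print(centers)
```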


Our dataframe now has a label column specifying which segment a given client
belongs to. We can visualize the different segments using a scatterplot.

#Create a scatterplot to show the different clusters
plt.figure(figsize=(12,7))
sns.scatterplot(data = df, x = 'Annual Income (k$)',y = 'Spending Score (1-100)',hue="label",pal
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Spending Score (1-100) vs Annual Income (k$)')
plt.show()


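Now we’re able to see the clusters more clearly. Clients in Cluster 0 (Blue) have the
lowest incomes and lowest spending scores, while clients in Cluster 2 (Green) have the
highest incomes and highest spending scores. Keep in mind that K-means numbers its
clusters arbitrarily, so the cluster numbers and colors may differ when you run the
code yourself.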
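
Because K-means numbers its clusters arbitrarily, it can be handy to renumber them in a predictable way, for example from lowest to highest average income. The sketch below is one possible way to do that on made-up data. The renumbering step is an addition for illustration and is not part of the original workflow.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (income, spending score) pairs in three well-separated groups
data = np.array([[15, 40], [17, 35], [16, 42],
                 [60, 50], [62, 55], [61, 48],
                 [100, 90], [105, 85], [102, 88]], dtype=float)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Sort the clusters by the income coordinate of their centers
order = np.argsort(model.cluster_centers_[:, 0])

# Build a lookup table so that remap[old_label] gives the new label
remap = np.empty_like(order)
remap[order] = np.arange(len(order))

# Cluster 0 is now always the lowest-income group, and so on
stable_labels = remap[model.labels_]
print(stable_labels)
```

With labels ordered like this, a statement such as "Cluster 0 has the lowest income" holds no matter how the algorithm happened to number the clusters.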

Adding another variable


So far, we have segmented our client base on two criteria: annual income and
spending score. In this section, we will perform another segmentation using three
variables. In addition to annual income and spending score, we will include the
age of the customers.

As we did previously, we will begin by calculating the values of WCSS, but this time
with the Age column included.

#Create a subset of the dataframe with only Age, Annual Income and Spending Score
X2 = df[["Age","Annual Income (k$)","Spending Score (1-100)"]]

#Determine the variation in the data
wcss=[]
for i in range(1,11):
    km=KMeans(n_clusters=i)
    km.fit(X2)
    wcss.append(km.inertia_)

#Plot the elbow curve
plt.figure(figsize=(12,6))
plt.plot(range(1,11),wcss, linewidth=2, color="blue", marker ="8")
plt.xlabel("Number of Clusters (K)")
plt.xticks(np.arange(1,11,1))
plt.title("The Elbow Method")
plt.ylabel("WCSS")
plt.show()


Again, we can select 5 as our optimal value of K. Let’s go ahead and build our second
model.

#Build the model with 5 clusters specified
kmeans_model3D = KMeans(n_clusters=5)

#Fit the input data to the model
kmeans_model3D.fit(X2)

#Segment the input data by assigning labels
y2 = kmeans_model3D.predict(X2)

#Update the "label" column in the original dataset with the new values
df["label"] = y2

#The dataframe with clustering complete
df.head()


To see the individual clusters more clearly, we will need to visualize them using a
scatterplot, except this time we will need to create a 3D plot since we are dealing with 3
dimensions or variables.

#Create a 3D scatter plot
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df["Age"][df["label"] == 0], df["Annual Income (k$)"][df["label"] == 0], df["Spendin
ax.scatter(df["Age"][df["label"] == 1], df["Annual Income (k$)"][df["label"] == 1], df["Spendin
ax.scatter(df["Age"][df["label"] == 2], df["Annual Income (k$)"][df["label"] == 2], df["Spendin
ax.scatter(df["Age"][df["label"] == 3], df["Annual Income (k$)"][df["label"] == 3], df["Spendin
ax.scatter(df["Age"][df["label"] == 4], df["Annual Income (k$)"][df["label"] == 4], df["Spendin
ax.view_init(35, 185)

plt.xlabel("Age")
plt.ylabel("Annual Income (k$)")
ax.set_zlabel('Spending Score (1-100)')
plt.show()
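
The five near-identical scatter() calls can also be written as a loop over the clusters, which keeps working if you later change the number of clusters. The sketch below is one way to do it, using randomly generated stand-in data (the demo dataframe, the marker size, and the legend are assumptions for illustration). The Agg backend line only makes the sketch run without a display, so leave it out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data with five made-up cluster labels
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Age": rng.integers(18, 70, size=60),
    "Annual Income (k$)": rng.integers(15, 140, size=60),
    "Spending Score (1-100)": rng.integers(1, 100, size=60),
    "label": np.repeat(np.arange(5), 12),
})

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")

# One scatter call per cluster; each call gets its own color automatically
for label, group in demo.groupby("label"):
    ax.scatter(group["Age"], group["Annual Income (k$)"],
               group["Spending Score (1-100)"], s=40, label=f"Cluster {label}")

ax.set_xlabel("Age")
ax.set_ylabel("Annual Income (k$)")
ax.set_zlabel("Spending Score (1-100)")
ax.legend()
```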


We can also get a good idea of how the different segments differ by calculating the
average values of the three variables for each segment as well as a count of the number
of clients in each segment. This can be done with the groupby() method.

#Check the count and mean values of all three variables for the different segments
round(df.groupby(by="label")\
    .agg({"CustomerID":"count","Age":"mean","Annual Income (k$)":"mean","Spending Score (1-1
    .reset_index()\
    .rename(columns={"label":"Segment","CustomerID":"No.of Clients"}))
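
To see the groupby() and agg() pattern in isolation, here is a small sketch on made-up data. It uses pandas' named aggregation syntax, which sets the output column names directly. The dataframe and the column names clients, mean_age, and mean_income are hypothetical choices for this example.

```python
import pandas as pd

# Six made-up customers already assigned to three segments
demo = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Age": [20, 24, 50, 54, 35, 37],
    "Annual Income (k$)": [20, 22, 40, 44, 90, 94],
    "label": [0, 0, 1, 1, 2, 2],
})

# Count the clients and average the numeric columns within each segment
summary = (
    demo.groupby("label")
        .agg(clients=("CustomerID", "count"),
             mean_age=("Age", "mean"),
             mean_income=("Annual Income (k$)", "mean"))
        .round(1)
        .reset_index()
        .rename(columns={"label": "Segment"})
)
print(summary)
```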

The results for the mall customers provide a lot of interesting insights. For example,
clients in Segment 0 are the youngest, with a low income but high spending score, while
clients in Segment 4 are the oldest, with a low income and low spending score. Segment 2
has the largest number of clients, with moderate incomes and moderate spending scores.
We can summarise the different segments as follows:

Segment 0: young, with low incomes and high spending

Segment 1: middle-aged, with high incomes and high spending

Segment 2: older, with moderate incomes and moderate spending

Segment 3: middle-aged, with high incomes and low spending

Segment 4: older, with low incomes and low spending

Using this information, we can go a step further by creating different personas for the
different segments. Then based on their unique characteristics, we can apply the
appropriate growth strategies which may include loyalty, referral, upselling, and
incentive programs among several others.

And with that, we come to the end of this tutorial. I hope that you learned something
new. If you have any questions or comments, please be sure to leave a note in the
comments section. Thank you very much for reading and all the best in your data
journey.
