Market Segmentation

This notebook loads a credit card marketing dataset of 8,950 rows and 18 columns, explores key variables such as balance, purchases, payments, and tenure, cleans missing values, visualizes the distributions, segments the customers with K-Means clustering (using the elbow method to choose the number of clusters), and projects the resulting clusters with PCA.


Import Libraries

In [1]:
import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, normalize

from sklearn.cluster import KMeans

from sklearn.decomposition import PCA

import warnings

warnings.filterwarnings("ignore")

Load the Data


In [2]:
df = pd.read_csv('Marketing_data.csv')

CUST_ID : Identification of the credit card holder (categorical)

BALANCE : Outstanding balance on the account (amount owed to the credit card company)

BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)

PURCHASES : Amount of purchases made from the account

ONEOFF_PURCHASES : Maximum purchase amount made in a single transaction

INSTALLMENTS_PURCHASES : Amount of purchases made in installments

CASH_ADVANCE : Cash advance amount taken by the user

PURCHASES_FREQUENCY : How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)

ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are made in one go (1 = frequently purchased, 0 = not frequently purchased)

PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being made (1 = frequently done, 0 = not frequently done)

CASH_ADVANCE_FREQUENCY : How frequently cash advances are taken, score between 0 and 1

CASH_ADVANCE_TRX : Number of cash advance transactions made

PURCHASES_TRX : Number of purchase transactions made

CREDIT_LIMIT : Credit card limit for the user

PAYMENTS : Amount of payments made by the user

MINIMUM_PAYMENTS : Minimum amount of payments made by the user

PRC_FULL_PAYMENT : Percentage of the full balance paid by the user

TENURE : Tenure of the credit card service for the user (in months)

In [3]:
# Display the dataset
df

Out[3]:
      CUST_ID      BALANCE  BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  ONEOFF_PURCHASES_FREQUENCY  ...
0      C10001    40.900749           0.818182      95.40              0.00                   95.40      0.000000             0.166667                    0.000000  ...
1      C10002  3202.467416           0.909091       0.00              0.00                    0.00   6442.945483             0.000000                    0.000000  ...
2      C10003  2495.148862           1.000000     773.17            773.17                    0.00      0.000000             1.000000                    1.000000  ...
3      C10004  1666.670542           0.636364    1499.00           1499.00                    0.00    205.788017             0.083333                    0.083333  ...
4      C10005   817.714335           1.000000      16.00             16.00                    0.00      0.000000             0.083333                    0.083333  ...
...       ...          ...                ...        ...               ...                     ...           ...                  ...                         ...  ...
8945   C19186    28.493517           1.000000     291.12              0.00                  291.12      0.000000             1.000000                    0.000000  ...
8946   C19187    19.183215           1.000000     300.00              0.00                  300.00      0.000000             1.000000                    0.000000  ...
8947   C19188    23.398673           0.833333     144.40              0.00                  144.40      0.000000             0.833333                    0.000000  ...
8948   C19189    13.457564           0.833333       0.00              0.00                    0.00     36.558778             0.000000                    0.000000  ...
8949   C19190   372.708075           0.666667    1093.25           1093.25                    0.00    127.040008             0.666667                    0.666667  ...

8950 rows × 18 columns

In [4]:
# Get some information about the data

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 8950 entries, 0 to 8949

Data columns (total 18 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 CUST_ID 8950 non-null object

1 BALANCE 8950 non-null float64

2 BALANCE_FREQUENCY 8950 non-null float64

3 PURCHASES 8950 non-null float64

4 ONEOFF_PURCHASES 8950 non-null float64

5 INSTALLMENTS_PURCHASES 8950 non-null float64

6 CASH_ADVANCE 8950 non-null float64

7 PURCHASES_FREQUENCY 8950 non-null float64

8 ONEOFF_PURCHASES_FREQUENCY 8950 non-null float64

9 PURCHASES_INSTALLMENTS_FREQUENCY 8950 non-null float64

10 CASH_ADVANCE_FREQUENCY 8950 non-null float64

11 CASH_ADVANCE_TRX 8950 non-null int64

12 PURCHASES_TRX 8950 non-null int64

13 CREDIT_LIMIT 8949 non-null float64

14 PAYMENTS 8950 non-null float64

15 MINIMUM_PAYMENTS 8637 non-null float64

16 PRC_FULL_PAYMENT 8950 non-null float64

17 TENURE 8950 non-null int64

dtypes: float64(14), int64(3), object(1)

memory usage: 1.2+ MB

In [5]:
# Describe the data

df.describe()

Out[5]:
             BALANCE  BALANCE_FREQUENCY     PURCHASES  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  ONEOFF_PURCHASES_FREQUENCY  ...
count    8950.000000        8950.000000   8950.000000       8950.000000             8950.000000   8950.000000          8950.000000                 8950.000000  ...
mean     1564.474828           0.877271   1003.204834        592.437371              411.067645    978.871112             0.490351                    0.202458  ...
std      2081.531879           0.236904   2136.634782       1659.887917              904.338115   2097.163877             0.401371                    0.298336  ...
min         0.000000           0.000000      0.000000          0.000000                0.000000      0.000000             0.000000                    0.000000  ...
25%       128.281915           0.888889     39.635000          0.000000                0.000000      0.000000             0.083333                    0.000000  ...
50%       873.385231           1.000000    361.280000         38.000000               89.000000      0.000000             0.500000                    0.083333  ...
75%      2054.140036           1.000000   1110.130000        577.405000              468.637500   1113.821139             0.916667                    0.300000  ...
max     19043.138560           1.000000  49039.570000      40761.250000            22500.000000  47137.211760             1.000000                    1.000000  ...

Mean BALANCE is ~$1,564

Average BALANCE_FREQUENCY is ~0.88, i.e. balances are updated frequently for most accounts

Average PURCHASES is ~$1,000

Average ONEOFF_PURCHASES is ~$600

Average PURCHASES_FREQUENCY is around 0.5

Average ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY, and CASH_ADVANCE_FREQUENCY are all comparatively low (well below 0.5)

Average CREDIT_LIMIT is ~$4,500

Average PRC_FULL_PAYMENT is ~15%

Average TENURE is ~11.5 months (most customers have the maximum tenure of 12 months)

Data Cleaning
Visualize and Explore Dataset
In [6]:
# Check the missing data

print(df.isnull().sum())

print('----------------------------------------\n Percentage of missing data: \n')

print((df[['MINIMUM_PAYMENTS', 'CREDIT_LIMIT']].isnull().sum()/df['CUST_ID'].count())*100)

CUST_ID 0

BALANCE 0

BALANCE_FREQUENCY 0

PURCHASES 0

ONEOFF_PURCHASES 0

INSTALLMENTS_PURCHASES 0

CASH_ADVANCE 0

PURCHASES_FREQUENCY 0

ONEOFF_PURCHASES_FREQUENCY 0

PURCHASES_INSTALLMENTS_FREQUENCY 0

CASH_ADVANCE_FREQUENCY 0

CASH_ADVANCE_TRX 0

PURCHASES_TRX 0

CREDIT_LIMIT 1

PAYMENTS 0

MINIMUM_PAYMENTS 313

PRC_FULL_PAYMENT 0

TENURE 0

dtype: int64

----------------------------------------

Percentage of missing data:

MINIMUM_PAYMENTS 3.497207

CREDIT_LIMIT 0.011173

dtype: float64

Above, the count of missing values per column and the percentage of missing data are calculated.

Two variables have missing data: CREDIT_LIMIT and MINIMUM_PAYMENTS. The missing values make up an insignificant share of the dataset, so they can be handled without a meaningful loss of information: CREDIT_LIMIT is missing in only one row (about 0.01% of the data) and MINIMUM_PAYMENTS in only around 3.5%. Rather than dropping these rows, the missing values are imputed with the column means in the next cell.
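If one preferred the deletion route instead of imputation, a minimal sketch using standard pandas dropna would look like this (the df_dropped name is just for illustration; it is not used later in the notebook):

# Sketch: drop the few rows with missing CREDIT_LIMIT or MINIMUM_PAYMENTS
# instead of imputing them (an alternative to the mean-fill used below)
df_dropped = df.dropna(subset=['CREDIT_LIMIT', 'MINIMUM_PAYMENTS'])
print(df_dropped.shape)  # a few hundred rows (~3.5%) would be removed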

In [7]:
# Fill the missing elements with the mean of MINIMUM_PAYMENTS
df.loc[df['MINIMUM_PAYMENTS'].isnull(), 'MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].mean()

# Fill the missing elements with the mean of CREDIT_LIMIT
df.loc[df['CREDIT_LIMIT'].isnull(), 'CREDIT_LIMIT'] = df['CREDIT_LIMIT'].mean()

df.isnull().sum()

Out[7]:
CUST_ID 0
BALANCE 0

BALANCE_FREQUENCY 0

PURCHASES 0

ONEOFF_PURCHASES 0

INSTALLMENTS_PURCHASES 0

CASH_ADVANCE 0

PURCHASES_FREQUENCY 0

ONEOFF_PURCHASES_FREQUENCY 0

PURCHASES_INSTALLMENTS_FREQUENCY 0

CASH_ADVANCE_FREQUENCY 0

CASH_ADVANCE_TRX 0

PURCHASES_TRX 0

CREDIT_LIMIT 0

PAYMENTS 0

MINIMUM_PAYMENTS 0

PRC_FULL_PAYMENT 0

TENURE 0

dtype: int64

In [8]:
# Check for duplicated entries in the data

df.duplicated().sum()

Out[8]: 0

In [9]:
# Drop Customer ID column 'CUST_ID'

df.drop('CUST_ID', axis = 1, inplace = True)

df.head()

Out[9]:
       BALANCE  BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  ONEOFF_PURCHASES_FREQUENCY  PURCHASES_INSTALLMENTS_FREQUENCY  ...
0    40.900749           0.818182      95.40              0.00                    95.4      0.000000             0.166667                    0.000000                               0.0  ...
1  3202.467416           0.909091       0.00              0.00                     0.0   6442.945483             0.000000                    0.000000                               0.0  ...
2  2495.148862           1.000000     773.17            773.17                     0.0      0.000000             1.000000                    1.000000                               0.0  ...
3  1666.670542           0.636364    1499.00           1499.00                     0.0    205.788017             0.083333                    0.083333                               0.0  ...
4   817.714335           1.000000      16.00             16.00                     0.0      0.000000             0.083333                    0.083333                               0.0  ...

EDA
In [10]:
df.columns

Out[10]:
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE'],
      dtype='object')

Seaborn's distplot() combines matplotlib's hist() function with seaborn's kdeplot().

KDE stands for Kernel Density Estimate.

KDE is used for visualizing the probability density of a continuous variable, i.e. it shows the estimated density at different values of that variable.

In [11]:
plt.rcParams['figure.figsize'] = (20, 40)

for num in range(0, 17):
    ax = plt.subplot(9, 2, num + 1)
    col = df.columns[num]
    sns.distplot(df[col], ax=ax, kde_kws={'color': 'b', 'lw': 3, 'label': 'KDE'})

plt.show()

Above are the distribution plots of every variable in the dataframe.

This gives an overview of the whole distribution of the dataframe. We can see right away that these distributions are heavily right-skewed: most values cluster near zero, with long tails toward large values.
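Note that distplot is deprecated in recent seaborn releases (0.11 and later). A minimal equivalent of the loop above using the newer histplot API would be the following sketch (assuming seaborn >= 0.11):

plt.rcParams['figure.figsize'] = (20, 40)

for num in range(0, 17):
    ax = plt.subplot(9, 2, num + 1)
    # histplot with kde=True reproduces the histogram + KDE overlay of distplot
    sns.histplot(df[df.columns[num]], ax=ax, kde=True, color='b')

plt.show()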

In [12]:
plt.rcParams['figure.figsize'] = (13,4)

sns.distplot(df['BALANCE'],bins=150,color = 'r')

plt.title('Distribution of Balance', size=20)

plt.xlabel('Balance')

Out[12]: Text(0.5, 0, 'Balance')

Keeping the balance low (in this case often zero) while the credit limit stays high lowers the credit utilization ratio, which in turn tends to improve the overall credit score.
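As an illustration of that ratio, here is a minimal sketch (the utilization name is just for illustration and is not part of the original dataset; it is computed as a separate Series so the dataframe is left unchanged):

# Sketch: credit utilization ratio = balance divided by credit limit
utilization = df['BALANCE'] / df['CREDIT_LIMIT']
print(utilization.describe())  # lower values generally correspond to better credit scores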

In [13]:
plt.rcParams['figure.figsize'] = (13,10)

sns.countplot(y=df['BALANCE_FREQUENCY'],order = df['BALANCE_FREQUENCY'].value_counts().index)

plt.ylabel('Balance Frequency Score (0-1)')

plt.title('Counts of Balance Frequency Score', fontsize=20)

Out[13]: Text(0.5, 1.0, 'Counts of Balance Frequency Score')

We can see that most accounts have a score of one, the best score: most people use their credit card frequently, and only a small number keep their cards relatively inactive.

In [14]:
plt.rcParams['figure.figsize'] = (13,4)

sns.distplot(df['PURCHASES'], color='orange', bins=150)

plt.title('Distribution of Purchases', size=20)

plt.xlabel('Purchases')

Out[14]: Text(0.5, 0, 'Purchases')

Many accounts have purchase amounts of 0, which is consistent with the large number of zero-balance cards seen earlier.

In [15]:
plt.subplot(1,2,1)

sns.distplot(df['ONEOFF_PURCHASES'],color='green')

plt.title('Distribution of One Off Purchase', fontsize = 20)

plt.xlabel('Amount')

plt.subplot(1,2,2)

sns.distplot(df['INSTALLMENTS_PURCHASES'], color='red')

plt.title('Distribution of Installment Purchase', fontsize = 20)

plt.xlabel('Amount')

Out[15]: Text(0.5, 0, 'Amount')

This follows the same pattern of zero-balance accounts. One-off purchases reach more than $40,000, while the largest installment purchases reach around $22,500.
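These maxima can also be read directly from the data; a quick check consistent with the describe() output above:

# Quick check of the maxima discussed above
print(df[['ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES']].max())
# ONEOFF_PURCHASES          40761.25
# INSTALLMENTS_PURCHASES    22500.00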

In [16]:
plt.rcParams['figure.figsize'] = (16,15)

plt.subplot(2,2,1)

sns.scatterplot(x=df['PURCHASES'], y=df['CREDIT_LIMIT'])

plt.title('Credit Limit And Purchases', fontsize =20)

plt.xlabel('Purchases')

plt.ylabel('Credit limit')

plt.subplot(2,2,2)

sns.scatterplot(x=df['BALANCE'], y=df['CREDIT_LIMIT'])

plt.title('Credit Limit And Balance', fontsize =20)

plt.xlabel('Balance')

plt.ylabel('Credit limit')

plt.subplot(2,2,3)

sns.scatterplot(x=df['ONEOFF_PURCHASES'], y=df['CREDIT_LIMIT'])

plt.title('Credit Limit And One Off Purchases', fontsize =20)


plt.xlabel('One off purchases')

plt.ylabel('Credit limit')

plt.subplot(2,2,4)

sns.scatterplot(x=df['INSTALLMENTS_PURCHASES'], y=df['CREDIT_LIMIT'])

plt.title('Credit Limit And Installments Purchases', fontsize =20)


plt.xlabel('Installment Purchases')

plt.ylabel('Credit limit')

Out[16]: Text(0, 0.5, 'Credit limit')

There seems to be no strong correlation between the credit limit and these purchase variables. For most people, the card appears to be a tool for credit utilization rather than a spending device.

As for balance, the relationship is somewhat stronger: as the credit limit goes up, the balance also tends to go up. However, there are clearly also accounts where the balance stays at zero even as the credit limit increases.

In [17]:
# Correlation matrix between features

correlations = df.corr()

f, ax = plt.subplots(figsize=(20,10))

sns.heatmap(correlations, annot = True);

Above is the correlation heatmap of the dataset.

Here we can take a closer look at the correlations within the dataset. PURCHASES and ONEOFF_PURCHASES have a very high correlation of about 0.92, as we would expect. The same holds for variables and their frequency counterparts, such as CASH_ADVANCE_TRX and CASH_ADVANCE_FREQUENCY at about 0.8. Not surprisingly, pairs like BALANCE and PAYMENTS show relatively weak correlation. This tells us the data are internally consistent.
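To read the specific pairs mentioned above directly from the matrix rather than off the heatmap, a short sketch using the correlations frame computed above:

# Sketch: look up individual correlations discussed above
print(correlations.loc['PURCHASES', 'ONEOFF_PURCHASES'])               # ~0.92
print(correlations.loc['CASH_ADVANCE_TRX', 'CASH_ADVANCE_FREQUENCY'])  # ~0.8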

In [18]:
plt.rcParams['figure.figsize'] = (6,4)

sns.countplot(x=df['TENURE'], palette='rainbow')

plt.title('Counts of Tenures', fontsize = 20)

plt.xlabel('Months')

Out[18]: Text(0.5, 0, 'Months')

TENURE is the length of the credit card service for each customer, ranging from 6 to 12 months. Most cards have the full 12-month tenure.

Model Building
Find the Optimal Number of Clusters Using Elbow Method

The elbow method is a heuristic method of interpretation and validation of consistency within cluster analysis designed to help find the appropriate number of clusters in a dataset.
If the line chart looks like an arm, then the "elbow" on the arm is the value of k that is the best.
Source:
https://en.wikipedia.org/wiki/Elbow_method_(clustering)
https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/

In [19]:
# Scale the data first

scaler = StandardScaler()

df_scaled = scaler.fit_transform(df)

df_scaled.shape

Out[19]: (8950, 17)

In [20]:
scores_1 = []
range_values = range(1, 11)

for i in range_values:
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=0)
    kmeans.fit(df_scaled)
    scores_1.append(kmeans.inertia_)

plt.figure(figsize=(10, 10))
plt.plot(range_values, scores_1, marker='o')  # plot inertia against the actual K values
plt.xlabel('Values of K', fontsize=10)
plt.title('The Elbow Method', fontsize=15);

Here is the graph depicting the elbow method used to find the optimum number of clusters for the K-Means analysis.

We tried cluster counts from 1 to 10 and graphed the inertia, or WCSS (within-cluster sum of squares), against the number of clusters. Inertia measures how close the data points in each cluster are to their centers, so the lower it is, the better the points fit their respective clusters. We are looking for the point where the WCSS is as low as possible while still keeping the number of clusters small.

Here the optimum number of clusters is 4, since that is where the curve starts to flatten out, meaning that adding more clusters will not yield a much better fit.
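As a side note, inertia can be reproduced by hand from the fitted centers. A minimal sketch, assuming df_scaled from above (the km, assignments, and wcss names are just for illustration):

from sklearn.cluster import KMeans

# Sketch: verify that KMeans inertia equals the within-cluster sum of squared distances (WCSS)
km = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=300, random_state=0)
assignments = km.fit_predict(df_scaled)

# Squared Euclidean distance from each point to its assigned cluster center, summed over all points
wcss = ((df_scaled - km.cluster_centers_[assignments]) ** 2).sum()
print(wcss, km.inertia_)  # the two values should agree up to floating-point error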

Apply K-Means Method


In [21]:
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=300, random_state=0)

kmeans.fit(df_scaled)

labels = kmeans.labels_ # Labels (cluster) associated to each data point

kmeans.cluster_centers_.shape

Out[21]: (4, 17)

In [22]:
cluster_centers = pd.DataFrame(data=kmeans.cluster_centers_, columns=df.columns)

cluster_centers

Out[22]:
    BALANCE  BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  ONEOFF_PURCHASES_FREQUENCY  ...
0 -0.321688           0.242574   0.109044          0.000926                0.255904     -0.366373             0.983721                    0.317153  ...
1  1.459578           0.384753  -0.234638         -0.163914               -0.253747      1.688972            -0.504848                   -0.212939  ...
2  0.954485           0.462694   3.125845          2.713251                2.406470     -0.155091             1.136338                    1.798653  ...
3 -0.265552          -0.368944  -0.343190         -0.230500               -0.387798     -0.182691            -0.797823                   -0.389437  ...

First customer cluster (Transactors): customers who pay the least in interest charges and are careful with their money; the cluster with the lowest balance, the lowest cash advance, and a percentage of full payment of about 23%.

Second customer cluster (Revolvers): customers who use the credit card as a loan (the most lucrative segment): highest balance and cash advance, low purchase frequency, high cash advance frequency (~0.5), many cash advance transactions (~16), and a low percentage of full payment (~3%).

Third customer cluster (VIP/Prime): high credit limit (~$16,000) and the highest percentage of full payment; a target group for credit limit increases and increased spending.

Fourth customer cluster (Low tenure): customers with low tenure (around 7 months) and low balance.
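One way to sanity-check these profiles is to average the original (unscaled) features per cluster. A minimal sketch using the labels array fitted above (the cluster_profile name is just for illustration):

# Sketch: per-cluster means in the original units of df
cluster_profile = df.groupby(labels).mean()
print(cluster_profile[['BALANCE', 'CASH_ADVANCE', 'CREDIT_LIMIT', 'PRC_FULL_PAYMENT', 'TENURE']])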

In [23]:
# Transform the scaled cluster centers back into the original units
cluster_centers = scaler.inverse_transform(cluster_centers)

cluster_centers = pd.DataFrame(data=cluster_centers, columns=df.columns)

cluster_centers

Out[23]:
       BALANCE  BALANCE_FREQUENCY    PURCHASES  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  ONEOFF_PURCHASES_FREQUENCY  ...
0   894.907458           0.934734  1236.178934        593.974874              642.478274    210.570626             0.885165                    0.297070  ...
1  4602.462714           0.968415   501.896219        320.373681              181.607404   4520.724309             0.287731                    0.138934  ...
2  3551.153761           0.986879  7681.620098       5095.878826             2587.208264    653.638891             0.946418                    0.739031  ...
3  1011.751528           0.789871   269.973466        209.853863               60.386625    595.759339             0.170146                    0.086281  ...

In [24]:
y_kmeans = kmeans.fit_predict(df_scaled)

y_kmeans

Out[24]: array([0, 1, 2, ..., 2, 0, 0])

In [25]:
# Concatenate the clusters labels to our original dataframe

df_cluster = pd.concat([df, pd.DataFrame({'cluster':labels})], axis = 1)

df_cluster.head()

Out[25]:
       BALANCE  BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  ONEOFF_PURCHASES_FREQUENCY  PURCHASES_INSTALLMENTS_FREQUENCY  ...
0    40.900749           0.818182      95.40              0.00                    95.4      0.000000             0.166667                    0.000000                               0.0  ...
1  3202.467416           0.909091       0.00              0.00                     0.0   6442.945483             0.000000                    0.000000                               0.0  ...
2  2495.148862           1.000000     773.17            773.17                     0.0      0.000000             1.000000                    1.000000                               0.0  ...
3  1666.670542           0.636364    1499.00           1499.00                     0.0    205.788017             0.083333                    0.083333                               0.0  ...
4   817.714335           1.000000      16.00             16.00                     0.0      0.000000             0.083333                    0.083333                               0.0  ...

In [26]:
# Plot the histogram of each feature within each of the four clusters
for i in df.columns:
    plt.figure(figsize=(35, 5))
    for j in range(4):
        plt.subplot(1, 4, j + 1)
        cluster = df_cluster[df_cluster['cluster'] == j]
        cluster[i].hist(bins=20)
        plt.title('{} \nCluster {}'.format(i, j))
    plt.show()

Apply PCA
Principal Component Analysis (PCA)
PCA is an unsupervised machine learning algorithm.
PCA performs dimensionality reduction while attempting to keep as much of the original information as possible.
PCA works by finding a new set of features called principal components.
The components are uncorrelated linear combinations of the original input features.

In [27]:
# Obtain the principal components

pca = PCA(n_components = 2)

principal_comp = pca.fit_transform(df_scaled)

principal_comp

Out[27]:
array([[-1.68222054, -1.07645217],
       [-1.13830193,  2.50644962],
       [ 0.96968128, -0.38351775],
       ...,
       [-0.9262004 , -1.81077471],
       [-2.33654769, -0.65795594],
       [-0.55642242, -0.40046607]])
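Since only two components are kept out of 17 scaled features, it is worth checking how much of the original variance they actually retain. A minimal sketch using the pca object fitted above (explained_variance_ratio_ is a standard sklearn PCA attribute):

# Sketch: fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance retained by the 2-D projection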

In [28]:
# Create a dataframe with the two components

pca_df = pd.DataFrame(data = principal_comp, columns = ['pca1', 'pca2'])

pca_df.head()

Out[28]: pca1 pca2

0 -1.682221 -1.076452

1 -1.138302 2.506450

2 0.969681 -0.383518

3 -0.873628 0.043159

4 -1.599433 -0.688578

In [29]:
# Concatenate the clusters labels to the dataframe

pca_df = pd.concat([pca_df, pd.DataFrame({'cluster':labels})], axis=1)

pca_df.head()

Out[29]: pca1 pca2 cluster

0 -1.682221 -1.076452 3

1 -1.138302 2.506450 1

2 0.969681 -0.383518 0

3 -0.873628 0.043159 3

4 -1.599433 -0.688578 3

In [30]:
label = kmeans.fit_predict(df_scaled)

df['label'] = label

plt.rcParams['figure.figsize'] = (12,8)

sns.scatterplot(x=df['BALANCE'], y=df['PURCHASES'], hue=df['label'], palette=['red', 'green', 'blue', 'yellow'])

plt.title('Clusters of Balance vs Purchases')

plt.xlabel('Balance')

plt.ylabel('Purchases')

Out[30]: Text(0, 0.5, 'Purchases')

Here is the scatter plot of balance vs. purchases, separated by cluster.

We can see that cluster 0 contains high spenders with the highest balances, while cluster 1 contains people with higher balances who are not as big spenders. Clusters 2 and 3 are people who do not spend as much and have relatively low balances (down to zero).
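For reference, the number of customers in each of these clusters can be checked directly; a short sketch using the label array fitted above:

# Sketch: number of customers assigned to each cluster
print(pd.Series(label).value_counts().sort_index())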

In [31]:
plt.figure(figsize=(12, 8))

ax = sns.scatterplot(x='pca1', y='pca2', hue='cluster', data=pca_df,
                     palette=['red', 'green', 'blue', 'yellow'])

plt.show()

CONCLUSION
In conclusion, with the information from this cluster analysis, a credit card company could focus its marketing campaigns on the right people. Customers in clusters 0 and 1 clearly have the capacity to spend, and since they are already spending, their spending habits could be used to optimize strategies that encourage them to spend even more. The analysis also shows untapped potential in clusters 2 and 3: these customers already carry some balance but are not purchasing much, so with the right push they might be persuaded to use the card for spending and become an important source of revenue.
