MARKET Segmentation
MARKET Segmentation
In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
BALANCE : Total amount of money that you owe to your credit card company
BALANCE_FREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
PURCHASES_FREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
In [3]:
# Display the dataset
df
Out[3]: CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLM
... ... ... ... ... ... ... ... ... ...
8945 C19186 28.493517 1.000000 291.12 0.00 291.12 0.000000 1.000000 0.000000
8946 C19187 19.183215 1.000000 300.00 0.00 300.00 0.000000 1.000000 0.000000
8947 C19188 23.398673 0.833333 144.40 0.00 144.40 0.000000 0.833333 0.000000
8948 C19189 13.457564 0.833333 0.00 0.00 0.00 36.558778 0.000000 0.000000
8949 C19190 372.708075 0.666667 1093.25 1093.25 0.00 127.040008 0.666667 0.666667
In [4]:
# Get some information about the data
df.info()
<class 'pandas.core.frame.DataFrame'>
In [5]:
# Describe the data
df.describe()
Out[5]: BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FR
Data Cleaning
Visualize and Explore Dataset
In [6]:
# Check the missing data
print(df.isnull().sum())
print((df[['MINIMUM_PAYMENTS', 'CREDIT_LIMIT']].isnull().sum()/df['CUST_ID'].count())*100)
CUST_ID 0
BALANCE 0
BALANCE_FREQUENCY 0
PURCHASES 0
ONEOFF_PURCHASES 0
INSTALLMENTS_PURCHASES 0
CASH_ADVANCE 0
PURCHASES_FREQUENCY 0
ONEOFF_PURCHASES_FREQUENCY 0
PURCHASES_INSTALLMENTS_FREQUENCY 0
CASH_ADVANCE_FREQUENCY 0
CASH_ADVANCE_TRX 0
PURCHASES_TRX 0
CREDIT_LIMIT 1
PAYMENTS 0
MINIMUM_PAYMENTS 313
PRC_FULL_PAYMENT 0
TENURE 0
dtype: int64
----------------------------------------
MINIMUM_PAYMENTS 3.497207
CREDIT_LIMIT 0.011173
dtype: float64
There are two variables with missing data, namely CREDIT_LIMIT and MINIMUM_PAYMENTS. The missing values in these columns make up a insignificant percentage of the data set and can be
safely deleted without risking a loss in data. The missing data in CREDIT_LIMIT make up less than 1% of the data and in MINIMUM_PAYMENTS only around 3%.
In [7]:
# Fill up the missing elements with mean of the MINIMUM_PAYMENTS
df.loc[(df['MINIMUM_PAYMENTS'].isnull() == True),
'MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].mean()
df.loc[(df['CREDIT_LIMIT'].isnull() == True),
'CREDIT_LIMIT'] = df['CREDIT_LIMIT'].mean()
df.isnull().sum()
CUST_ID 0
Out[7]:
BALANCE 0
BALANCE_FREQUENCY 0
PURCHASES 0
ONEOFF_PURCHASES 0
INSTALLMENTS_PURCHASES 0
CASH_ADVANCE 0
PURCHASES_FREQUENCY 0
ONEOFF_PURCHASES_FREQUENCY 0
PURCHASES_INSTALLMENTS_FREQUENCY 0
CASH_ADVANCE_FREQUENCY 0
CASH_ADVANCE_TRX 0
PURCHASES_TRX 0
CREDIT_LIMIT 0
PAYMENTS 0
MINIMUM_PAYMENTS 0
PRC_FULL_PAYMENT 0
TENURE 0
dtype: int64
In [8]:
# Chek duplicated entries in the data
df.duplicated().sum()
0
Out[8]:
In [9]:
# Drop Customer ID column 'CUST_ID'
df.head()
Out[9]: BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQU
EDA
In [10]:
df.columns
Out[10]:
'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
'TENURE'],
dtype='object')
In [11]:
plt.rcParams['figure.figsize'] =(20,40)
ax = plt.subplot(9,2,num+1)
col = df.columns[num]
plt.show()
Here we have the overview of the whole distribution of the dataframe. We can see right away that these distributions are very left skewed and there are a lot of zero values.
In [12]:
plt.rcParams['figure.figsize'] = (13,4)
sns.distplot(df['BALANCE'],bins=150,color = 'r')
plt.xlabel('Balance')
Text(0.5, 0, 'Balance')
Out[12]:
By keeping the balance low (in this case zero) but the credit limit high,would increase the credit utilization ratio and in turn increases overall credit score.
In [13]:
plt.rcParams['figure.figsize'] = (13,10)
sns.countplot(y=df['BALANCE_FREQUENCY'],order = df['BALANCE_FREQUENCY'].value_counts().index)
We can see here most of the accounts have the score of one, the best score,most people do use credit card frequently and only a small number of people keep their cards relatively inactive.
In [14]:
plt.rcParams['figure.figsize'] = (13,4)
plt.xlabel('Purchases')
Text(0.5, 0, 'Purchases')
Out[14]:
Many people have the purchase amounts of 0 since earlier alot of people are holding zero balance cards.
In [15]:
plt.subplot(1,2,1)
sns.distplot(df['ONEOFF_PURCHASES'],color='green')
plt.xlabel('Amount')
plt.subplot(1,2,2)
sns.distplot(df['INSTALLMENTS_PURCHASES'], color='red')
plt.xlabel('Amount')
Text(0.5, 0, 'Amount')
Out[15]:
This still follows that same trend of zeros balance account. One off purchases go up as high as more than 40,000 dollars while the highest installment purchases go up to around 25,000 dollars.
In [16]:
plt.rcParams['figure.figsize'] = (16,15)
plt.subplot(2,2,1)
sns.scatterplot(df['PURCHASES'],df['CREDIT_LIMIT'])
plt.xlabel('Purchases')
plt.ylabel('Credit limit')
plt.subplot(2,2,2)
sns.scatterplot(df['BALANCE'],df['CREDIT_LIMIT'])
plt.xlabel('Balance')
plt.ylabel('Credit limit')
plt.subplot(2,2,3)
sns.scatterplot(df['ONEOFF_PURCHASES'],df['CREDIT_LIMIT'])
plt.ylabel('Credit limit')
plt.subplot(2,2,4)
sns.scatterplot(df['INSTALLMENTS_PURCHASES'],df['CREDIT_LIMIT'])
plt.ylabel('Credit limit')
There seems to be no strong correlation between the credit limit and these variables.For most people, credit cards are tools for credit utilization rather than spending device.
As for balance, there seems to be a better correlation that as credit limit goes up balance also goes up but it is also clear to see that there are also points where balance stays at zeros but credit
limits do go up.
In [17]:
# Correlation matrix between features
correlations = df.corr()
f, ax = plt.subplots(figsize=(20,10))
Here we can take a closer look at the correlation with in the dataset. Purchases and one off purchase have very high correlation as we would expect at 0.92. This is the same for varaibles and their
frequency score counter parts such as cash advance trx and cash advance frequency at 0.8. Not surprisingly things like balance and payment have poor correlation. This tell us that the data do
make sense.
In [18]:
plt.rcParams['figure.figsize'] = (6,4)
sns.countplot(df['TENURE'], palette='rainbow')
plt.xlabel('Months')
Text(0.5, 0, 'Months')
Out[18]:
Tenure is the repayment period of the cards, ranging from 6-12 months.Most of the cards are 12 months cards.
Model Building
Find the Optimal Number of Clusters Using Elbow Method
The elbow method is a heuristic method of interpretation and validation of consistency within cluster analysis designed to help find the appropriate number of clusters in a dataset.
If the line chart looks like an arm, then the "elbow" on the arm is the value of k that is the best.
Source:
https://en.wikipedia.org/wiki/Elbow_method_(clustering)
https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
In [19]:
# Scale the data first
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled.shape
(8950, 17)
Out[19]:
In [20]:
scores_1 = []
range_values = range(1,11)
for i in range_values:
kmeans.fit(df_scaled)
scores_1.append(kmeans.inertia_)
plt.figure(figsize = (10,10))
plt.plot(scores_1,marker = 'o')
Here is the graph deplicting the elbow method used to find the optimum number clusters using kmean analysis
We tried different number of clusters from 1-10 and then we graph inertia or wcss (within clusters sum square) against the cluster number. Inertia is basically how close the datapoints in the clusters
are to the centers, which means the lower it is the more fitting the points are to their respective clusters. Here, we are trying to find the place where the wcss is as low as possible while still keeping
the number of clusters as low as possible.
Here the optimum number of clusters is 4 cluster since it is the place where the graph starts to flatten out meaning that having higher number of clusters will not yield a much more
fitting machine.
kmeans.fit(df_scaled)
kmeans.cluster_centers_.shape
(4, 17)
Out[21]:
In [22]:
cluster_centers = pd.DataFrame(data=kmeans.cluster_centers_,
columns = [df.columns])
cluster_centers
Out[22]: BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUEN
First customers cluster (Transactors): Those are customers who pay least amount of intrerest charges and careful with their money. Cluster with lowest balance, lowest cash advance, and percentage of full payment =
23%.
Second customers cluster (Revolvers): who use credit card as a loan (most lucrative sector): highest balance and cash advance, low purchase frequency, high cash advance frequency (0.5), high cash advance
transactions (16) and low percentage of full payment (3%).
Third customers cluster (VIP/Prime): high credit limit $16,000, and highest percentage of full payment, target for increase credit limit and increase spending habits.
Fourth customers cluster (Low tenure): these are customers with low tenure (7 years), low balance.
In [23]:
# Transformation
cluster_centers = scaler.inverse_transform(cluster_centers)
columns = [df.columns])
cluster_centers
Out[23]: BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQU
In [24]:
y_kmeans = kmeans.fit_predict(df_scaled)
y_kmeans
In [25]:
# Concatenate the clusters labels to our original dataframe
df_cluster.head()
Out[25]: BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQU
In [26]:
# Plot the histogram of various clusters
for i in df.columns:
plt.figure(figsize=(35,5))
for j in range(7):
plt.subplot(1,7,j+1)
cluster = df_cluster[df_cluster['cluster'] == j]
cluster[i].hist(bins = 20)
plt.show()
Apply PCA
Principal Component Analysis (PCA)
PCA is an unsupervised machine learning algorithm.
PCA performs dimensionality reductions while attemting at keeping the original information unchanged.
PCA works by trying to find a new set of features called components.
Components are composites of the uncorrelated given input features.
In [27]:
# Obtain the principal components
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(df_scaled)
principal_comp
array([[-1.68222054, -1.07645217],
Out[27]:
[-1.13830193, 2.50644962],
[ 0.96968128, -0.38351775],
...,
[-0.9262004 , -1.81077471],
[-2.33654769, -0.65795594],
[-0.55642242, -0.40046607]])
In [28]:
# Create a dataframe with the two components
pca_df.head()
0 -1.682221 -1.076452
1 -1.138302 2.506450
2 0.969681 -0.383518
3 -0.873628 0.043159
4 -1.599433 -0.688578
In [29]:
# Concatenate the clusters labels to the dataframe
pca_df.head()
0 -1.682221 -1.076452 3
1 -1.138302 2.506450 1
2 0.969681 -0.383518 0
3 -0.873628 0.043159 3
4 -1.599433 -0.688578 3
In [30]:
label = kmeans.fit_predict(df_scaled)
df['label'] = label
plt.rcParams['figure.figsize'] = (12,8)
plt.xlabel('Balance')
plt.ylabel('Purchases')
Here we can see that cluster 0 are high spenders with the highest balance while cluster 1 are people with higher balance but not as big of spenders. Cluster 2,3 are people who do not spend as
much and have relatively lower balance (down to zero).
In [31]:
plt.figure(figsize = (12,8))
data=pca_df,
palette = ['red','green','blue','yellow'])
plt.show()
CONCLUSION
In conclusion, with this information from our cluster analysis, as a credit card company, we could spend more time on marketing campaigns on the right people. People in cluster 0 and 1 clearly have the capacity to spend and
since they are already spending we could use their spending habits to optimize the strategies to get them to spend even more. The analysis also tells us that there is untapped potential in people from cluster 2 and 3. These
people already some balance but are not purchasing as much, with the right push we might be able to get them to use the card for spending and become important sources of revenue.