A manager at the bank is concerned that more and more customers are leaving the credit card
services. They would really appreciate it if someone could predict which customers are going to
churn, so they can proactively reach out to those customers, offer them better services, and turn
the customers' decisions in the opposite direction.
Data Description
This dataset consists of 10,000 customers with 18 features. In this analysis we will avoid
using categorical features and work with a subset of 12 numerical features to keep the
pre-processing simple.
We will use NumPy for numerical operations, pandas for dataframes, Matplotlib and Plotly for
plots, and scikit-learn for building machine learning models.
In [2]:
!pip install plotly
In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
Because the last two columns of the original dataset contain prediction outputs from a Naive
Bayes classifier, we will drop them for the purpose of our analysis.
In [4]:
df = pd.read_csv('C:/Users/bisha/OneDrive/Bishal/OneDrive/Python learning/Principal
In [5]:
df = df[df.columns[:-2]]  # drop the last two columns (Naive Bayes outputs)
df.head()
   CLIENTNUM     Attrition_Flag  Customer_Age Gender  Dependent_count Education_Level  ...
0  768805383  Existing Customer            45      M                3     High School  ...
1  818770008  Existing Customer            49      F                5        Graduate  ...
2  713982108  Existing Customer            51      M                3        Graduate  ...
3  769911858  Existing Customer            40      F                4     High School  ...
4  709106358  Existing Customer            40      M                3      Uneducated  ...
5 rows × 21 columns
We will look at the customer age distribution using Plotly (which allows for interactive plots).
In [6]:
fig = make_subplots(rows=2,cols=1)
tr1 = go.Box(x=df['Customer_Age'],name='Age Box Plot',boxmean='sd')
tr2 = go.Histogram(x=df['Customer_Age'], name='Age Histogram')
fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=500, width=600, title_text="Distribution of Customer Ages")
fig.show()
[Figure: Distribution of Customer Ages, box plot (top) and histogram (bottom)]
In [7]:
income = pd.DataFrame(df['Income_Category'].value_counts())
labelincome = df['Income_Category'].unique()
In [8]:
# explore income level and education level
# the tr3/tr4 pie-trace definitions were dropped from the export; reconstructed here
# (assumes the education column is named 'Education_Level')
tr3 = go.Pie(labels=income.index, values=income.iloc[:, 0], name='Income')
tr4 = go.Pie(labels=df['Education_Level'].value_counts().index, values=df['Education_Level'].value_counts().values, name='Education')
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(tr3, row=1, col=1)
fig.add_trace(tr4, row=1, col=2)
fig.update_layout(height=500, width=600, title_text="Distribution of Income and Education")
fig.show()
[Figure: pie charts of the income category and education level distributions]
For prediction purposes, we will encode existing customers as "1" and attrited
customers as "0".
In [9]:
x = df.iloc[:, 9:21]                   # columns 9 to 20: the numerical features
x = StandardScaler().fit_transform(x)  # standardize the features
In [10]:
df['Attrition_Flag'].replace('Existing Customer','1',inplace=True)
df['Attrition_Flag'].replace('Attrited Customer','0',inplace=True)
We will start by using only the first two principal components, and then explore three and four
principal components.
In [11]:
pca = PCA(n_components=2)
PC = pca.fit_transform(x)
principalDF = pd.DataFrame(data=PC, columns=['pc1','pc2'])
finalDf = pd.concat([principalDF, df[['Attrition_Flag']]], axis=1)  # attach the target for later use
finalDf.head()
        pc1       pc2  Attrition_Flag
0  0.276048 -0.617639               1
1 -0.612402  1.430502               1
2 -0.613733  1.098632               1
3 -2.499317  1.781346               1
4 -0.560120  0.924119               1
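As a quick sanity check (not part of the original notebook), we can also look at how much of the total variance the two leading components capture, which helps motivate the later comparison of two, three and four components:
print(pca.explained_variance_ratio_)        # share of variance explained by pc1 and pc2 individually
print(pca.explained_variance_ratio_.sum())  # cumulative share captured by the first two components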
To assess how much weight each feature will carry in the later predictions, we can construct a
loadings table. The loadings show how much each of our original features contributes to
each of the "new features", the principal components.
In [12]:
PCloadings = pca.components_.T * np.sqrt(pca.explained_variance_)
components = df.columns.tolist()
components = components[9:21]
loadingdf = pd.DataFrame(PCloadings,columns=('PC1','PC2'))
loadingdf["variable"]=components
loadingdf
Now we can plot the loadings and see which features have high weightings on principal
components 1 and 2:
In [13]:
fig = ex.scatter(x=loadingdf['PC1'], y=loadingdf['PC2'], text=loadingdf['variable'])
fig.update_traces(textposition='bottom center')
fig.add_shape(type="line", x0=0, y0=-0.5, x1=0, y1=2.5, line=dict(color="RoyalBlue", width=1))
fig.show()
[Figure: loadings plot of the 12 features in the PC1/PC2 plane. Total_Trans_Ct and Total_Trans_Amt sit well above the remaining features (Avg_Utilization_Ratio, Total_Revolving_Bal, Total_Ct_Chng_Q4_Q1, Total_Amt_Chng_Q4_Q1, Months_on_book, Months_Inactive_12_mon, Credit_Limit, Avg_Open_To_Buy, Contacts_Count_12_mon, Total_Relationship_Count), which lie closer to zero on PC2.]
It is clear that "total transaction count" and "total transaction amount" are two heavily weighted
features.
This means that they will play a big role in our next step of prediction (using logistic regression).
If the predictions turn out to be reasonably accurate, we can infer that these two
features are indeed important factors in determining customer churn.
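As a quick check (not in the original notebook), we can also rank the features by the magnitude of their loadings to confirm this reading of the plot:
# sort features by their combined loading magnitude on PC1 and PC2
loadingdf['magnitude'] = np.sqrt(loadingdf['PC1']**2 + loadingdf['PC2']**2)
print(loadingdf.sort_values('magnitude', ascending=False)[['variable', 'PC1', 'PC2']])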
But before moving on to prediction, since dimensionality reduction lets us visualize the
data in two dimensions, we can make a scatter plot of the principal components to
see how the data are distributed.
In [14]:
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    colors = {'1': 'pink', '0': 'blue'}
    plt.scatter(xs * scalex, ys * scaley, c=y.apply(lambda x: colors[x]))
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))

y = df['Attrition_Flag']  # point colours; the definition of 'y' was missing in the export
myplot(PC[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()
1. Split the data set into training set and test set
2. Apply logistic regression to the training set
3. Make prediction on test set
4. Output the prediction score on the test set
In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Xfinal = finalDf[['pc1','pc2']]
yfinal = finalDf['Attrition_Flag']
# the train/test split was omitted in the export; a 70/30 split is assumed here
X_train, X_test, y_train, y_test = train_test_split(Xfinal, yfinal, test_size=0.3)
logistic = LogisticRegression()
logistic.fit(X=X_train, y=y_train)
logistic.predict(X_test)
score_2 = logistic.score(X_test, y_test)
We also want to try using 3 principal components and 4 principal components to compare their
accuracy:
In [21]:
pca = PCA(n_components=3)
PC = pca.fit_transform(x)
principalDF = pd.DataFrame(data=PC, columns=['pc1','pc2','pc3'])
finalDf = pd.concat([principalDF, df[['Attrition_Flag']]], axis=1)
Xfinal = finalDf[['pc1','pc2','pc3']]
yfinal = finalDf['Attrition_Flag']
# the train/test split was omitted in the export; a 70/30 split is assumed here
X_train, X_test, y_train, y_test = train_test_split(Xfinal, yfinal, test_size=0.3)
logistic = LogisticRegression()
logistic.fit(X=X_train, y=y_train)
logistic.predict(X_test)
score_3 = logistic.score(X_test, y_test)

pca = PCA(n_components=4)
PC = pca.fit_transform(x)
principalDF = pd.DataFrame(data=PC, columns=['pc1','pc2','pc3','pc4'])
finalDf = pd.concat([principalDF, df[['Attrition_Flag']]], axis=1)  # rebuild finalDf with four components
Xfinal = finalDf[['pc1','pc2','pc3','pc4']]
yfinal = finalDf['Attrition_Flag']
X_train, X_test, y_train, y_test = train_test_split(Xfinal, yfinal, test_size=0.3)
logistic = LogisticRegression()
logistic.fit(X=X_train, y=y_train)
logistic.predict(X_test)
score_4 = logistic.score(X_test, y_test)
Finally, we can assess how accurate the predictions made by our models are:
In [23]:
scores=[score_2,score_3,score_4]
scores
In [24]:
ex.bar(y=scores, x=('pc2','pc3','pc4'), range_y=(0.7,0.9), title='PC prediction accuracy')
[Figure: bar chart of prediction accuracy for the 2-, 3- and 4-component models, y-axis from 0.7 to 0.9]
It turns out that three principal components gave the highest score; nevertheless, 84% accuracy is
already achieved with two principal components, which is quite a decent result.
Therefore we can infer that total transaction count and total transaction amount are good
predictors of customer churn, which is also intuitively reasonable when we think about what
factors might signal that a bank customer is about to leave.
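As an optional follow-up (not part of the original notebook), the logistic-regression coefficients, which live in principal-component space, can be folded back through the PCA components to get an effective weight for each standardized original feature; this is one way to double-check that the transaction-related features dominate the prediction:
# PC scores = x_std @ pca.components_.T, so the effective weight of each standardized
# feature in the decision function is pca.components_.T @ logistic.coef_
# (pca and logistic here are the last-fitted, 4-component versions)
feature_weights = pca.components_.T @ logistic.coef_[0]
for name, w in sorted(zip(components, feature_weights), key=lambda t: abs(t[1]), reverse=True):
    print(f"{name:28s} {w: .3f}")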