
PCA

A manager at the bank is disturbed by more and more customers leaving their credit card
services. They would really appreciate it if one could predict who is going to churn, so
they can proactively reach out to those customers, provide them better services, and turn
their decisions around.

Data Description
This dataset consists of 10,000 customers and 18 features, but in this analysis we will avoid
the categorical features and use a subset of 11 numerical features to keep the pre-processing
simple.

Target Variable description

The target variable that we're trying to predict is 'Attrition_Flag', with 84% of customers being
'Existing Customer' and 16% being 'Attrited Customer'.
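
As a quick sanity check of this class balance, a minimal sketch (not part of the original notebook, and assuming the data has already been read into the DataFrame df as in cell In [4] below):

df['Attrition_Flag'].value_counts(normalize=True)   # roughly 0.84 Existing Customer, 0.16 Attrited Customer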

We will use NumPy for numerical operations, Pandas for dataframes, Matplotlib and Plotly for
plots, and scikit-learn for building machine learning models.

In [2]:
!pip install plotly

Requirement already satisfied: plotly in c:\users\bisha\anaconda3\lib\site-packages (5.7.0)
Requirement already satisfied: six in c:\users\bisha\anaconda3\lib\site-packages (from plotly) (1.15.0)
Requirement already satisfied: tenacity>=6.2.0 in c:\users\bisha\anaconda3\lib\site-packages (from plotly) (8.0.1)

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as ex
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

Because the last two columns of the original dataset contain prediction outputs from a Naive
Bayes classifier, we will drop them for the purpose of our analysis.

In [4]:
df = pd.read_csv('C:/Users/bisha/OneDrive/Bishal/OneDrive/Python learning/Principal

In [5]:
df=df[df.columns[:-2]] # Drop the last two columns

df.head() # Inspect the first 5 rows

Out[5]:
   CLIENTNUM     Attrition_Flag  Customer_Age Gender  Dependent_count Education_Level Marital_Status ...
0  768805383  Existing Customer            45      M                3     High School        Married ...
1  818770008  Existing Customer            49      F                5        Graduate         Single ...
2  713982108  Existing Customer            51      M                3        Graduate        Married ...
3  769911858  Existing Customer            40      F                4     High School        Unknown ...
4  709106358  Existing Customer            40      M                3      Uneducated        Married ...

5 rows × 21 columns

We will look at the customer age distribution using Plotly (which allows for interactive plots).

In [6]:
fig = make_subplots(rows=2,cols=1)
tr1 = go.Box(x=df['Customer_Age'],name='Age Box Plot',boxmean='sd')
tr2 = go.Histogram(x=df['Customer_Age'], name='Age Histogram')

fig.add_trace(tr1,row=1,col=1)
fig.add_trace(tr2,row=2,col=1)
fig.update_layout(height=500, width=600, title_text="Distribution of Customer Ages")
fig.show()

[Figure: Distribution of Customer Ages — a box plot ("Age Box Plot") and a histogram ("Age Histogram") of Customer_Age, spanning roughly ages 30 to 70.]

Let's look at level of education and income:


In [7]:
education = pd.DataFrame(df['Education_Level'].value_counts())
labelsedu = education.index          # use the value_counts index so the labels stay aligned with the counts

income = pd.DataFrame(df['Income_Category'].value_counts())
labelincome = income.index

In [8]:
# explore education level and income level
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

tr3 = go.Pie(labels=labelsedu, values=education.iloc[:,0], name='Proportion of Education Levels')
tr4 = go.Pie(labels=labelincome, values=income.iloc[:,0], name='Proportion of Different Income Levels')

fig.add_trace(tr3, row=1, col=1)
fig.add_trace(tr4, row=1, col=2)
fig.update_layout(height=500, width=600, title_text="Distribution of Income and Education level")
fig.show()

[Figure: Distribution of Income and Education level — two pie charts showing the proportion of customers in each Education_Level and Income_Category.]

For prediction purposes, we will encode existing customers as "1" and attrited
customers as "0".

In [9]:
x = df.iloc[:, 9:21]                    # select the numerical feature columns as x - the features
x = StandardScaler().fit_transform(x)   # standardize the features to zero mean and unit variance

In [10]:
df['Attrition_Flag'].replace('Existing Customer','1',inplace=True)

df['Attrition_Flag'].replace('Attrited Customer','0',inplace=True)


# assign y variable - the target


y = df['Attrition_Flag']

We will start by using only the first 2 principal components, and then explore 3 and 4
principal components.

In [11]:
pca = PCA(n_components=2)

PC=pca.fit_transform(x)

principalDF=pd.DataFrame(data=PC,columns=['pc1','pc2'])

finalDf = pd.concat([principalDF, df[['Attrition_Flag']]], axis = 1)


finalDf.head()

Out[11]: pc1 pc2 Attrition_Flag

0 0.276048 -0.617639 1

1 -0.612402 1.430502 1

2 -0.613733 1.098632 1

3 -2.499317 1.781346 1

4 -0.560120 0.924119 1

To assess how much weight each feature will carry in later predictions, we can construct a
loadings table. The loadings show how much each of our original features contributes to
each of the "new features", the principal components (each loading is the component weight
scaled by the square root of that component's explained variance).

In [12]:
PCloadings = pca.components_.T * np.sqrt(pca.explained_variance_)
components = df.columns.tolist()
components = components[9:21]

loadingdf = pd.DataFrame(PCloadings,columns=('PC1','PC2'))
loadingdf["variable"]=components
loadingdf

Out[12]:
         PC1        PC2   variable
0  -0.012248  -0.084536   Months_on_book
1  -0.276207  -0.384630   Total_Relationship_Count
2  -0.030992  -0.105797   Months_Inactive_12_mon
3  -0.017396  -0.314187   Contacts_Count_12_mon
4   0.867614  -0.180299   Credit_Limit
5  -0.261374   0.402668   Total_Revolving_Bal
6   0.890865  -0.216361   Avg_Open_To_Buy
7  -0.012135   0.181603   Total_Amt_Chng_Q4_Q1
8   0.467479   0.763757   Total_Trans_Amt
9   0.359458   0.788716   Total_Trans_Ct
10 -0.012862   0.309368   Total_Ct_Chng_Q4_Q1
11 -0.718652   0.411826   Avg_Utilization_Ratio

Now we can plot the loadings and see which of them have high weightings on principal
components 1 and 2:

In [13]:
fig = ex.scatter(x=loadingdf['PC1'], y=loadingdf['PC2'], text=loadingdf['variable'])

fig.update_layout(height=600, width=500, title_text='loadings plot')
fig.update_traces(textposition='bottom center')

# the line width values were truncated in the original export; width=1 is assumed here
fig.add_shape(type="line", x0=0, y0=-0.5, x1=0, y1=2.5, line=dict(color="RoyalBlue", width=1))
fig.add_shape(type="line", x0=-1, y0=0, x1=1, y1=0, line=dict(color="RoyalBlue", width=1))

fig.show()

[Figure: loadings plot — scatter of the PC1 vs PC2 loadings with each variable labelled. Total_Trans_Ct and Total_Trans_Amt sit highest on PC2, Credit_Limit and Avg_Open_To_Buy furthest right on PC1, and Avg_Utilization_Ratio furthest left.]

It is clear that "total transaction count" and "total transaction amount" are two heavily weighted
features.

This means that they will play a big role in our next step of prediction (using logistic regression).

If the predictions come out with reasonably high accuracy, we can infer that these two
features are indeed important factors in determining customer churn.

But before going into the prediction, since dimension reduction has allowed us to visualize the
data in 2 dimensions, we can make a scatter plot with respect to the principal components to
see how the data are distributed.

In [14]:
def myplot(score, coeff, labels=None):
    # biplot: scatter of the PC scores coloured by class, with the loading vectors overlaid as arrows
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    colors = {'1': 'pink', '0': 'blue'}
    plt.scatter(xs * scalex, ys * scaley, c=y.apply(lambda x: colors[x]))
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))

myplot(PC[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()

We can proceed with prediction using logistic regression:

1. Split the data set into training set and test set
2. Apply logistic regression to the training set
3. Make prediction on test set
4. Output the prediction score on the test set

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Xfinal=finalDf[['pc1','pc2']]
yfinal=finalDf['Attrition_Flag']


X_train, X_test, y_train, y_test = train_test_split(Xfinal,yfinal,test_size=0.3)

logistic=LogisticRegression()
logistic.fit(X=X_train,y=y_train)
logistic.predict(X_test)
score_2=logistic.score(X_test,y_test)

We also want to try using 3 principal components and 4 principal components to compare their
accuracy:

In [21]:
pca=PCA(n_components=3)
PC=pca.fit_transform(x)

principalDF=pd.DataFrame(data=PC,columns=['pc1','pc2','pc3'])
finalDf = pd.concat([principalDF, df[['Attrition_Flag']]], axis = 1)

Xfinal=finalDf[['pc1','pc2','pc3']]
yfinal=finalDf['Attrition_Flag']

X_train, X_test, y_train, y_test = train_test_split(Xfinal,yfinal,test_size=0.3)

logistic=LogisticRegression()
logistic.fit(X=X_train,y=y_train)
logistic.predict(X_test)

score_3=logistic.score(X_test,y_test)

pca=PCA(n_components=4)
PC=pca.fit_transform(x)
principalDF=pd.DataFrame(data=PC,columns=['pc1','pc2','pc3','pc4'])

finalDf = pd.concat([principalDF, df[['Attrition_Flag']]], axis = 1)

Xfinal=finalDf[['pc1','pc2','pc3','pc4']]
yfinal=finalDf['Attrition_Flag']

X_train, X_test, y_train, y_test = train_test_split(Xfinal,yfinal,test_size=0.3)

logistic=LogisticRegression()
logistic.fit(X=X_train,y=y_train)
logistic.predict(X_test)

score_4=logistic.score(X_test,y_test)
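
The three fits above differ only in the number of components, so the sweep can also be written as a loop. A minimal sketch (not the author's code); a fixed random_state is assumed here so all three models are scored on the same train/test split:

scores_by_k = {}
for k in (2, 3, 4):
    pcs = PCA(n_components=k).fit_transform(x)                 # project onto the first k components
    X_tr, X_te, y_tr, y_te = train_test_split(pcs, y, test_size=0.3, random_state=42)
    model = LogisticRegression().fit(X_tr, y_tr)
    scores_by_k[k] = model.score(X_te, y_te)                   # test accuracy with k components
scores_by_k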

Finally, we can assess how accurate the predictions made by our model are:

In [23]:
scores=[score_2,score_3,score_4]
scores

Out[23]: [0.8538993089832182, 0.872326423165515, 0.8848305363606449]

In [24]:
ex.bar(y=scores, x=('pc2', 'pc3', 'pc4'), range_y=(0.7, 0.9), title='PC prediction accuracy')

[Figure: PC prediction accuracy — bar chart of the test accuracies for 2, 3, and 4 principal components, with the y-axis ranging from 0.7 to 0.9.]

It turns out that 4 principal components gave the highest score; nevertheless, roughly 85% accuracy is
already achieved with just 2 principal components, which is quite a decent result.

Therefore we can infer that total transaction count and total transaction amount are good
predictors of customer churn, which is also intuitively reasonable when we think about what
factors might signal that a bank customer is about to leave.
