Credit Card Analysis
November 6, 2023
2 Introduction
The sample dataset summarizes the usage behavior of about 9,000 active credit card holders over
the last 6 months. The file is at the customer level and has 18 behavioral variables. The
data dictionary for the credit card dataset follows:
CUST_ID : Identification of the credit card holder (categorical)
BALANCE : Balance amount left in the account to make purchases
BALANCE_FREQUENCY : How frequently the balance is updated, score between 0 and 1 (1 =
frequently updated, 0 = not frequently updated)
PURCHASES : Amount of purchases made from the account
ONEOFF_PURCHASES : Maximum purchase amount done in one go
INSTALLMENTS_PURCHASES : Amount of purchases done in installments
CASH_ADVANCE : Cash in advance given by the user
PURCHASES_FREQUENCY : How frequently purchases are being made, score between
0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
ONEOFF_PURCHASES_FREQUENCY : How frequently purchases are happening in one go
(1 = frequently purchased, 0 = not frequently purchased)
PURCHASES_INSTALLMENTS_FREQUENCY : How frequently purchases in installments
are being done (1 = frequently done, 0 = not frequently done)
CASH_ADVANCE_FREQUENCY : How frequently cash in advance is being paid
CASH_ADVANCE_TRX : Number of transactions made with "Cash in Advance"
PURCHASES_TRX : Number of purchase transactions made
CREDIT_LIMIT : Credit card limit for the user
PAYMENTS : Amount of payments made by the user
MINIMUM_PAYMENTS : Minimum amount of payments made by the user
PRC_FULL_PAYMENT : Percent of full payment paid by the user
TENURE : Tenure of the credit card service for the user
[3]: from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
3 Import Libraries
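[4]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import pylab
# Needed by the scaling, PCA, and clustering cells further down
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering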
4 Load Dataset
[5]: df = pd.read_csv("/content/drive/MyDrive/Datasets/CC General/CC GENERAL.csv")
df.head()
ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY \
0 0.000000 0.083333
1 0.000000 0.000000
2 1.000000 0.000000
3 0.083333 0.000000
4 0.083333 0.000000
[6]: df.shape
[7]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CUST_ID 8950 non-null object
1 BALANCE 8950 non-null float64
2 BALANCE_FREQUENCY 8950 non-null float64
3 PURCHASES 8950 non-null float64
4 ONEOFF_PURCHASES 8950 non-null float64
5 INSTALLMENTS_PURCHASES 8950 non-null float64
6 CASH_ADVANCE 8950 non-null float64
7 PURCHASES_FREQUENCY 8950 non-null float64
8 ONEOFF_PURCHASES_FREQUENCY 8950 non-null float64
9 PURCHASES_INSTALLMENTS_FREQUENCY 8950 non-null float64
10 CASH_ADVANCE_FREQUENCY 8950 non-null float64
11 CASH_ADVANCE_TRX 8950 non-null int64
12 PURCHASES_TRX 8950 non-null int64
13 CREDIT_LIMIT 8949 non-null float64
14 PAYMENTS 8950 non-null float64
15 MINIMUM_PAYMENTS 8637 non-null float64
16 PRC_FULL_PAYMENT 8950 non-null float64
17 TENURE 8950 non-null int64
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
[8]: df.columns
[9]: df.describe()
ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY \
count 8950.000000 8950.000000
mean 0.202458 0.364437
std 0.298336 0.397448
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.083333 0.166667
75% 0.300000 0.750000
max 1.000000 1.000000
CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT
…
std 0.200121 6.824647 24.857649 3638.815725
min 0.000000 0.000000 0.000000 50.000000
25% 0.000000 0.000000 1.000000 1600.000000
50% 0.000000 0.000000 7.000000 3000.000000
75% 0.222222 4.000000 17.000000 6500.000000
max 1.500000 123.000000 358.000000 30000.000000
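The describe() output above reports a maximum of 1.5 in what appears to be the CASH_ADVANCE_FREQUENCY column, while the data dictionary describes the frequency variables as scores between 0 and 1. A minimal sketch of a range check follows; the toy frame and the helper name `out_of_range_counts` are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the data dictionary.
toy = pd.DataFrame({
    "PURCHASES_FREQUENCY": [0.0, 0.5, 1.0],
    "CASH_ADVANCE_FREQUENCY": [0.0, 0.25, 1.5],
})

def out_of_range_counts(frame, lower=0.0, upper=1.0):
    """Count values outside [lower, upper] in every *_FREQUENCY column."""
    freq_cols = [c for c in frame.columns if c.endswith("_FREQUENCY")]
    outside = (frame[freq_cols] < lower) | (frame[freq_cols] > upper)
    return outside.sum()

counts = out_of_range_counts(toy)
print(counts)
```

On the real data, `out_of_range_counts(df)` would show how many rows break the documented range before deciding whether to cap or drop them.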
5 Univariate Analysis
[10]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CUST_ID 8950 non-null object
1 BALANCE 8950 non-null float64
2 BALANCE_FREQUENCY 8950 non-null float64
3 PURCHASES 8950 non-null float64
4 ONEOFF_PURCHASES 8950 non-null float64
5 INSTALLMENTS_PURCHASES 8950 non-null float64
6 CASH_ADVANCE 8950 non-null float64
7 PURCHASES_FREQUENCY 8950 non-null float64
8 ONEOFF_PURCHASES_FREQUENCY 8950 non-null float64
9 PURCHASES_INSTALLMENTS_FREQUENCY 8950 non-null float64
10 CASH_ADVANCE_FREQUENCY 8950 non-null float64
11 CASH_ADVANCE_TRX 8950 non-null int64
12 PURCHASES_TRX 8950 non-null int64
13 CREDIT_LIMIT 8949 non-null float64
14 PAYMENTS 8950 non-null float64
15 MINIMUM_PAYMENTS 8637 non-null float64
16 PRC_FULL_PAYMENT 8950 non-null float64
17 TENURE 8950 non-null int64
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
[11]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["BALANCE"], kde=True, color="orange", bins=10)
[12]: plt.figure(figsize=(6,4))
plt.scatter(x=df['BALANCE'], y=df.index)
plt.xlabel("BALANCE")
plt.show()
[13]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["BALANCE_FREQUENCY"], kde=True, color="orange", bins=10)
[14]: plt.figure(figsize=(6,4))
plt.scatter(x=df['BALANCE_FREQUENCY'], y=df.index)
plt.xlabel("BALANCE_FREQUENCY")
plt.show()
[15]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["PURCHASES"], kde=True, color="orange", bins=10)
[16]: plt.figure(figsize=(6,4))
plt.scatter(x=df['PURCHASES'], y=df.index)
plt.xlabel("PURCHASES")
plt.show()
[17]: df["PURCHASES"].value_counts()
[18]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["ONEOFF_PURCHASES"], kde=True, color="orange", bins=10)
[19]: plt.figure(figsize=(6,4))
plt.scatter(x=df['ONEOFF_PURCHASES'], y=df.index)
plt.xlabel("ONEOFF_PURCHASES")
plt.show()
[20]: df["ONEOFF_PURCHASES"].value_counts()
[21]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["INSTALLMENTS_PURCHASES"], kde=True, color="orange", bins=10)
[22]: plt.figure(figsize=(6,4))
plt.scatter(x=df['INSTALLMENTS_PURCHASES'], y=df.index)
plt.xlabel("INSTALLMENTS_PURCHASES")
plt.show()
[23]: df["INSTALLMENTS_PURCHASES"].value_counts()
[24]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["CASH_ADVANCE"], kde=True, color="orange", bins=10)
[25]: plt.figure(figsize=(6,4))
plt.scatter(x=df['CASH_ADVANCE'], y=df.index)
plt.xlabel("CASH_ADVANCE")
plt.show()
[26]: df["CASH_ADVANCE"].value_counts()
[27]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["PURCHASES_FREQUENCY"], kde=True, color="orange", bins=10)
[28]: plt.figure(figsize=(6,4))
plt.scatter(x=df['PURCHASES_FREQUENCY'], y=df.index)
plt.xlabel("PURCHASES_FREQUENCY")
plt.show()
[29]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["ONEOFF_PURCHASES_FREQUENCY"], kde=True, color="orange", bins=10)
[30]: plt.figure(figsize=(6,4))
plt.scatter(x=df['ONEOFF_PURCHASES_FREQUENCY'], y=df.index)
plt.xlabel("ONEOFF_PURCHASES_FREQUENCY")
plt.show()
[31]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["PURCHASES_INSTALLMENTS_FREQUENCY"], kde=True, color="orange", bins=10)
[32]: plt.figure(figsize=(6,4))
plt.scatter(x=df['PURCHASES_INSTALLMENTS_FREQUENCY'], y=df.index)
plt.xlabel("PURCHASES_INSTALLMENTS_FREQUENCY")
plt.show()
[33]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["CASH_ADVANCE_FREQUENCY"], kde=True, color="orange", bins=10)
[34]: plt.figure(figsize=(6,4))
plt.scatter(x=df['CASH_ADVANCE_FREQUENCY'], y=df.index)
plt.xlabel("CASH_ADVANCE_FREQUENCY")
plt.show()
[35]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["CASH_ADVANCE_TRX"], kde=True, color="orange", bins=10)
[36]: plt.figure(figsize=(6,4))
plt.scatter(x=df['CASH_ADVANCE_TRX'], y=df.index)
plt.xlabel("CASH_ADVANCE_TRX")
plt.show()
[37]: df["CASH_ADVANCE_TRX"].value_counts()
[37]: 0 4628
1 887
2 620
3 436
4 384
…
39 1
56 1
107 1
53 1
41 1
Name: CASH_ADVANCE_TRX, Length: 65, dtype: int64
[38]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["PURCHASES_TRX"], kde=True, color="orange", bins=10)
[39]: plt.figure(figsize=(6,4))
plt.scatter(x=df['PURCHASES_TRX'], y=df.index)
plt.xlabel("PURCHASES_TRX")
plt.show()
[40]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["CREDIT_LIMIT"], kde=True, color="orange", bins=10)
[41]: plt.figure(figsize=(6,4))
plt.scatter(x=df['CREDIT_LIMIT'], y=df.index)
plt.xlabel("CREDIT_LIMIT")
plt.show()
[42]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["PAYMENTS"], kde=True, color="orange", bins=10)
[43]: plt.figure(figsize=(6,4))
plt.scatter(x=df['PAYMENTS'], y=df.index)
plt.xlabel("PAYMENTS")
plt.show()
[44]: df["PAYMENTS"].value_counts()
[45]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["MINIMUM_PAYMENTS"], kde=True, color="orange", bins=10)
[46]: plt.figure(figsize=(6,4))
plt.scatter(x=df['MINIMUM_PAYMENTS'], y=df.index)
plt.xlabel("MINIMUM_PAYMENTS")
plt.show()
[47]: df["MINIMUM_PAYMENTS"].value_counts()
[47]: 299.351881 2
342.286490 1
184.464721 1
276.486072 1
309.140865 1
..
181.773223 1
711.894455 1
256.522546 1
127.799107 1
88.288956 1
Name: MINIMUM_PAYMENTS, Length: 8636, dtype: int64
[48]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["PRC_FULL_PAYMENT"], kde=True, color="orange", bins=10)
[49]: plt.figure(figsize=(6,4))
plt.scatter(x=df['PRC_FULL_PAYMENT'], y=df.index)
plt.xlabel("PRC_FULL_PAYMENT")
plt.show()
[50]: sns.set(rc={"figure.figsize":(6,4)})
sns.displot(df["TENURE"], kde=True, color="orange", bins=10)
[51]: plt.figure(figsize=(6,4))
plt.scatter(x=df['TENURE'], y=df.index)
plt.xlabel("TENURE")
plt.show()
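The histogram and index-scatter cells above inspect one column at a time. As a numeric complement, the sketch below summarises every numeric column in one table (skewness, share of zeros, number of distinct values). The synthetic frame and the helper name `univariate_summary` are assumptions for illustration; nothing here reproduces the notebook's actual figures.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the credit card frame (assumption, not real data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "BALANCE": rng.lognormal(mean=7.0, sigma=1.0, size=500),
    "TENURE": rng.integers(6, 13, size=500),
})

def univariate_summary(frame):
    """Return skewness, share of zeros, and distinct-value count per numeric column."""
    num = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "skew": num.skew(),
        "zero_share": (num == 0).mean(),
        "n_unique": num.nunique(),
    })

summary = univariate_summary(toy)
print(summary)
```

A large positive skew or a high zero share in this table flags the columns whose histograms are dominated by a single bin.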
6 EDA (Exploratory Data Analysis)
Remove Duplicates
[52]: duplicate = df.duplicated()
print(duplicate.sum())
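The cell above counts fully duplicated rows. The sketch below shows the same check on a toy frame, plus a check restricted to the customer identifier and the removal step itself. The toy values are hypothetical, and whether the notebook needed to drop anything is not shown in its output.

```python
import pandas as pd

# Hypothetical toy frame with one repeated customer row.
toy = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C2", "C3"],
    "BALANCE": [10.0, 20.0, 20.0, 30.0],
})

# Rows that repeat an earlier row in every column.
n_full_dupes = int(toy.duplicated().sum())
# Rows that repeat an earlier CUST_ID, regardless of the other columns.
n_id_dupes = int(toy.duplicated(subset="CUST_ID").sum())
# Keep the first occurrence of each duplicated row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_full_dupes, n_id_dupes, len(deduped))
```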
[53]: df.isnull().sum()
[53]: CUST_ID 0
BALANCE 0
BALANCE_FREQUENCY 0
PURCHASES 0
ONEOFF_PURCHASES 0
INSTALLMENTS_PURCHASES 0
CASH_ADVANCE 0
PURCHASES_FREQUENCY 0
ONEOFF_PURCHASES_FREQUENCY 0
PURCHASES_INSTALLMENTS_FREQUENCY 0
CASH_ADVANCE_FREQUENCY 0
CASH_ADVANCE_TRX 0
PURCHASES_TRX 0
CREDIT_LIMIT 1
PAYMENTS 0
MINIMUM_PAYMENTS 313
PRC_FULL_PAYMENT 0
TENURE 0
dtype: int64
[54]: df["CREDIT_LIMIT"].value_counts()
[56]: df.isnull().sum()
[56]: CUST_ID 0
BALANCE 0
BALANCE_FREQUENCY 0
PURCHASES 0
ONEOFF_PURCHASES 0
INSTALLMENTS_PURCHASES 0
CASH_ADVANCE 0
PURCHASES_FREQUENCY 0
ONEOFF_PURCHASES_FREQUENCY 0
PURCHASES_INSTALLMENTS_FREQUENCY 0
CASH_ADVANCE_FREQUENCY 0
CASH_ADVANCE_TRX 0
PURCHASES_TRX 0
CREDIT_LIMIT 0
PAYMENTS 0
MINIMUM_PAYMENTS 313
PRC_FULL_PAYMENT 0
TENURE 0
dtype: int64
[57]: df["MINIMUM_PAYMENTS"].value_counts()
[57]: 299.351881 2
342.286490 1
184.464721 1
276.486072 1
309.140865 1
..
181.773223 1
711.894455 1
256.522546 1
127.799107 1
88.288956 1
Name: MINIMUM_PAYMENTS, Length: 8636, dtype: int64
[59]: df.isnull().sum()
[59]: CUST_ID 0
BALANCE 0
BALANCE_FREQUENCY 0
PURCHASES 0
ONEOFF_PURCHASES 0
INSTALLMENTS_PURCHASES 0
CASH_ADVANCE 0
PURCHASES_FREQUENCY 0
ONEOFF_PURCHASES_FREQUENCY 0
PURCHASES_INSTALLMENTS_FREQUENCY 0
CASH_ADVANCE_FREQUENCY 0
CASH_ADVANCE_TRX 0
PURCHASES_TRX 0
CREDIT_LIMIT 0
PAYMENTS 0
MINIMUM_PAYMENTS 0
PRC_FULL_PAYMENT 0
TENURE 0
dtype: int64
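The null counts for CREDIT_LIMIT and MINIMUM_PAYMENTS drop to zero in the outputs above, but the cells that filled them are not shown. A minimal sketch of one common choice, median imputation, follows. The median is an assumption here (the notebook may have used the mean or another value), and the toy numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in the two affected columns.
toy = pd.DataFrame({
    "CREDIT_LIMIT": [1000.0, np.nan, 3000.0, 7000.0],
    "MINIMUM_PAYMENTS": [100.0, 250.0, np.nan, np.nan],
})

filled = toy.copy()
for col in ["CREDIT_LIMIT", "MINIMUM_PAYMENTS"]:
    # The median is robust to the heavy right tails typical of monetary columns.
    filled[col] = filled[col].fillna(filled[col].median())

print(filled.isnull().sum())
```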
Remove Outliers
[60]: plt.figure(figsize=(30,8))
sns.boxplot(data=df)
plt.show()
[61]: num_col = df.select_dtypes(["float64","int64"])
for i in num_col.columns:
    plt.boxplot(df[i])
    plt.xlabel(i)
    plt.show()
[62]: df["TENURE"].value_counts()
[62]: 12 7584
11 365
10 236
6 204
8 196
7 190
9 175
Name: TENURE, dtype: int64
[63]: print(df.columns.get_loc('TENURE'))
17
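The cells that actually treat the outliers are not shown. Two later outputs hint at what they did: df.info() lists CASH_ADVANCE_TRX and PURCHASES_TRX as float64 while TENURE stays int64, and X.head() shows ONEOFF_PURCHASES_FREQUENCY for row 2 as 0.75, which equals Q3 + 1.5 × IQR = 0.3 + 1.5 × 0.3 from the describe() quartiles. Both are consistent with IQR capping of every numeric column except TENURE. The sketch below illustrates that technique under this assumption; the helper name `cap_outliers_iqr` and the toy data are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(frame, columns, k=1.5):
    """Clip each listed column to [Q1 - k*IQR, Q3 + k*IQR]."""
    capped = frame.copy()
    for col in columns:
        q1 = capped[col].quantile(0.25)
        q3 = capped[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Cast to float first: the bounds are usually fractional.
        capped[col] = capped[col].astype(float).clip(lower=lower, upper=upper)
    return capped

# Hypothetical toy data: one extreme transaction count, TENURE left untouched.
toy = pd.DataFrame({
    "PURCHASES_TRX": [0, 1, 2, 3, 4, 100],
    "TENURE": [12, 12, 12, 11, 6, 12],
})
capped = cap_outliers_iqr(toy, columns=["PURCHASES_TRX"])
print(capped)
```

Capping keeps all 8950 rows, which matches the unchanged row count in the later df.info() output; dropping outlier rows would have reduced it.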
[66]: plt.figure(figsize=(30,8))
sns.boxplot(data=df)
plt.show()
[67]: num_col = df.select_dtypes(["float64","int64"])
for i in num_col.columns:
    plt.boxplot(df[i])
    plt.xlabel(i)
    plt.show()
Bivariate Analysis
[68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CUST_ID 8950 non-null object
1 BALANCE 8950 non-null float64
2 BALANCE_FREQUENCY 8950 non-null float64
3 PURCHASES 8950 non-null float64
4 ONEOFF_PURCHASES 8950 non-null float64
5 INSTALLMENTS_PURCHASES 8950 non-null float64
6 CASH_ADVANCE 8950 non-null float64
7 PURCHASES_FREQUENCY 8950 non-null float64
8 ONEOFF_PURCHASES_FREQUENCY 8950 non-null float64
9 PURCHASES_INSTALLMENTS_FREQUENCY 8950 non-null float64
10 CASH_ADVANCE_FREQUENCY 8950 non-null float64
11 CASH_ADVANCE_TRX 8950 non-null float64
12 PURCHASES_TRX 8950 non-null float64
13 CREDIT_LIMIT 8950 non-null float64
14 PAYMENTS 8950 non-null float64
15 MINIMUM_PAYMENTS 8950 non-null float64
16 PRC_FULL_PAYMENT 8950 non-null float64
17 TENURE 8950 non-null int64
dtypes: float64(16), int64(1), object(1)
memory usage: 1.2+ MB
[70]: sns.violinplot(x='TENURE', y='BALANCE', data=df)
plt.title('Balance by Tenure')
plt.show()
[71]: sns.barplot(x='TENURE', y='BALANCE_FREQUENCY', data=df)
plt.title('Balance Frequency by Tenure')
plt.show()
[72]: plt.scatter(df['PURCHASES'], df['ONEOFF_PURCHASES'], alpha=0.5, color='green')
plt.xlabel('Purchases')
plt.ylabel('One-off Purchases')
plt.title('Purchases vs. One-off Purchases')
plt.show()
[73]: plt.scatter(df['ONEOFF_PURCHASES'], df['INSTALLMENTS_PURCHASES'], alpha=0.5, color='orange')
plt.xlabel('One-off Purchases')
plt.ylabel('Installments Purchases')
plt.title('One-off Purchases vs. Installments Purchases')
plt.show()
[74]: plt.scatter(df['INSTALLMENTS_PURCHASES'], df['PURCHASES_FREQUENCY'], alpha=0.5, color='purple')
plt.xlabel('Installments Purchases')
plt.ylabel('Purchases Frequency')
plt.title('Installments Purchases vs. Purchases Frequency')
plt.show()
[75]: sns.violinplot(x='TENURE', y='CASH_ADVANCE', data=df)
plt.title('Cash Advance by Tenure')
plt.show()
[76]: sns.barplot(x='TENURE', y='CASH_ADVANCE', data=df)
plt.title('Cash Advance by Tenure')
plt.show()
[77]: sns.jointplot(x='BALANCE', y='PURCHASES', data=df, kind='reg', scatter_kws={'alpha':0.3})
plt.show()
[78]: sns.jointplot(x='CASH_ADVANCE', y='PAYMENTS', data=df, kind='hex')
plt.show()
[79]: sns.relplot(x='PURCHASES', y='PAYMENTS', hue='TENURE', data=df)
plt.title('Purchases vs. Payments by Tenure')
plt.show()
[80]: sns.pairplot(df)
plt.show()
[81]: fig, ax = plt.subplots(figsize=(18, 8))
# numeric_only=True skips the non-numeric CUST_ID column
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', ax=ax)
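An annotated heatmap of 17 numeric columns can be hard to read. The sketch below lists only the strongly correlated pairs from the same correlation matrix. The 0.8 threshold, the helper name `top_correlated_pairs`, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr(numeric_only=True)
    # Keep the strict upper triangle so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

# Synthetic stand-in: ONEOFF_PURCHASES is built to track PURCHASES closely.
rng = np.random.default_rng(0)
purchases = rng.gamma(2.0, 500.0, size=300)
toy = pd.DataFrame({
    "CUST_ID": [f"C{i}" for i in range(300)],
    "PURCHASES": purchases,
    "ONEOFF_PURCHASES": purchases * 0.6 + rng.normal(0, 20, size=300),
    "TENURE": rng.integers(6, 13, size=300),
})
strong = top_correlated_pairs(toy)
print(strong)
```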
[82]: num_cols = df.select_dtypes(include=["int64","float64"])
def plots(num_cols, variable):
    plt.figure(figsize=(15,6))
    # Left panel: histogram with KDE. histplot replaces distplot, which is
    # deprecated and scheduled for removal in seaborn v0.14.0.
    plt.subplot(1, 2, 1)
    sns.histplot(num_cols[variable], kde=True, bins=10)
    plt.title(variable)
    # Right panel: normal Q-Q plot
    plt.subplot(1, 2, 2)
    stats.probplot(num_cols[variable], dist="norm", plot=pylab)
    plt.title(variable)
    plt.show()
for i in num_cols.columns:
    plots(num_cols, i)
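The Q-Q plots from the cell above are judged by eye. `scipy.stats.probplot` also returns the correlation coefficient of the Q-Q fit, which gives a rough numeric measure of how close a column is to normal. The sketch below uses synthetic samples, and the helper name `qq_correlation` is hypothetical.

```python
import numpy as np
import scipy.stats as stats

def qq_correlation(values):
    """Correlation r of the normal Q-Q fit; values near 1 suggest normality."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r

# Synthetic samples (assumption): one normal, one right-skewed.
rng = np.random.default_rng(0)
normal_like = rng.normal(size=1000)
skewed = rng.lognormal(sigma=1.0, size=1000)

r_normal = qq_correlation(normal_like)
r_skewed = qq_correlation(skewed)
r_logged = qq_correlation(np.log1p(skewed))  # log1p pulls in the right tail
print(r_normal, r_skewed, r_logged)
```

Comparing r before and after a transform such as `np.log1p` is one way to decide whether a skewed monetary column benefits from it.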
7 Feature Engineering
[83]: X = df.iloc[:,1:]
[84]: X.head()
ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY \
0 0.000000 0.083333
1 0.000000 0.000000
2 0.750000 0.000000
3 0.083333 0.000000
4 0.083333 0.000000
Normalizing Data
[85]: sc = StandardScaler()
X = sc.fit_transform(X)
[86]: X
[86]: array([[-0.87782104, -1.02187519, -0.72968709, …, -0.79404745,
-0.62927738, 0.36067954],
[ 1.1785459 , -0.2027079 , -0.83815959, …, 0.83675109,
0.9739614 , 0.36067954],
[ 0.71848713, 0.61645939, 0.04095652, …, 0.05869355,
-0.62927738, 0.36067954],
…,
[-0.88920486, -0.88535181, -0.67397271, …, -0.89385616,
1.17436805, -4.12276757],
[-0.89567082, -0.88535181, -0.83815959, …, -0.94046866,
1.17436805, -4.12276757],
[-0.66200474, -1.88655177, 0.40489651, …, -0.88359305,
-0.62927738, -4.12276757]])
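StandardScaler replaces each value x in a column with z = (x - mean) / std, using the column's mean and population standard deviation. The sketch below checks this against a manual computation on a small made-up array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up two-column array with very different scales.
toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Manual z-score with the population standard deviation (ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single large-valued column such as CREDIT_LIMIT dominates the distance computations in PCA and clustering.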
PCA
[87]: pca = PCA(n_components=2)
X_principle = pca.fit_transform(X)
X_principle = pd.DataFrame(X_principle,columns=["P1","P2"])
X_principle.head()
[87]: P1 P2
0 -1.624789 -2.381615
1 -2.158390 2.289496
2 1.198983 0.280710
3 -0.495849 -0.185932
4 -1.632622 -1.597279
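Reducing to two components makes the clusters easy to plot, but it discards some variance. The notebook does not report how much is retained, so the sketch below shows how `explained_variance_ratio_` could be inspected. The synthetic features are an assumption and say nothing about the real dataset's ratio.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features driven by two latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
features = latent @ mixing + 0.1 * rng.normal(size=(500, 6))

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Fraction of total variance kept by the two components.
retained = pca.explained_variance_ratio_.sum()
print(pca.explained_variance_ratio_, retained)
```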
8 Model
KMeans Model
[89]: wscc = []
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
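Cell [89] initialises the list `wscc`, and the `n_init` warning output after it indicates that KMeans was fitted in the cells that follow, but the loop itself is not shown. A minimal sketch of the usual elbow method follows. The range of k from 1 to 10, the synthetic blobs, and the list name `wcss` (within-cluster sum of squares) are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for the PCA output (assumption).
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

wcss = []
for k in range(1, 11):
    # Setting n_init explicitly avoids the FutureWarning shown above.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wcss.append(model.inertia_)

for k, value in enumerate(wcss, start=1):
    print(k, round(value, 2))
```

Plotting `wcss` against k and looking for the bend where the curve flattens gives a candidate number of clusters.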
[92]: model_kmean = KMeans(n_clusters=3,random_state=0).fit(X_principle)
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
[94]: y_cluster
[95]: 0.43085099690525414
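The value printed by cell [95] appears without the code that produced it. If it is a silhouette score for the KMeans labels (an assumption), it could be computed as sketched below. The blobs are synthetic, so the number printed here is unrelated to the notebook's value.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D points standing in for the PCA output (assumption).
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
labels = model.labels_

# Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
score = silhouette_score(points, labels)
print(score)
```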
[96]: X_principle
[96]: P1 P2
0 -1.624789 -2.381615
1 -2.158390 2.289496
2 1.198983 0.280710
3 -0.495849 -0.185932
4 -1.632622 -1.597279
… … …
8945 0.320141 -2.739866
8946 -0.131128 -1.803644
8947 -0.463207 -2.990444
8948 -2.418519 -2.466634
8949 0.180087 -0.908287
[98]: plt.figure(figsize=(10,10))
ax = sns.scatterplot(x="P1", y="P2", hue=y_cluster, data=X_principle, palette=['red','green','blue'])
plt.show()
[99]: # Another way: hierarchical clustering with a linkage matrix
from scipy.cluster.hierarchy import dendrogram,linkage
import matplotlib.pyplot as plt
plt.figure(figsize=(10,7))
plt.title("Credit Card Customers")
dend = dendrogram(linkage(X,method="ward"))
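A dendrogram shows the merge structure, but it does not assign cluster labels. The sketch below cuts a Ward linkage into a fixed number of flat clusters with `scipy.cluster.hierarchy.fcluster`. Using three clusters mirrors the n_clusters=3 used elsewhere in the notebook, and the synthetic points are an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three well-separated synthetic groups of 30 points each (assumption).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(30, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.3, size=(30, 2)),
])

merges = linkage(points, method="ward")
# criterion="maxclust" cuts the tree so that at most t flat clusters remain.
labels = fcluster(merges, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```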
[100]: X_principle
[100]: P1 P2
0 -1.624789 -2.381615
1 -2.158390 2.289496
2 1.198983 0.280710
3 -0.495849 -0.185932
4 -1.632622 -1.597279
… … …
8945 0.320141 -2.739866
8946 -0.131128 -1.803644
8947 -0.463207 -2.990444
8948 -2.418519 -2.466634
8949 0.180087 -0.908287
Agglomerative Clustering
[102]: model_agg = AgglomerativeClustering(n_clusters=3).fit(X_principle)
[104]: 0.40260454393255907
[105]: df_y_cluster
[105]: P1 P2 cluster
0 -1.624789 -2.381615 0
1 -2.158390 2.289496 2
2 1.198983 0.280710 1
3 -0.495849 -0.185932 0
4 -1.632622 -1.597279 0
… … … …
8945 0.320141 -2.739866 0
8946 -0.131128 -1.803644 0
8947 -0.463207 -2.990444 0
8948 -2.418519 -2.466634 0
8949 0.180087 -0.908287 0
[8950 rows x 3 columns]
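The frame `df_y_cluster` holds P1, P2, and a cluster column, but the cell that builds it is not shown. The sketch below shows one way such a frame could be assembled from agglomerative labels, followed by cluster sizes and centroids. The variable names `X_demo` and `labelled` and the synthetic data are hypothetical.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for the PCA output (assumption).
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
X_demo = pd.DataFrame(points, columns=["P1", "P2"])

model = AgglomerativeClustering(n_clusters=3).fit(X_demo)

# Attach the labels to the coordinates, as in the df_y_cluster output above.
labelled = X_demo.assign(cluster=model.labels_)

sizes = labelled["cluster"].value_counts().sort_index()
centroids = labelled.groupby("cluster")[["P1", "P2"]].mean()
print(sizes)
print(centroids)
```

Joining the same labels back onto the original `df` and grouping by cluster would describe each segment in terms of the behavioral variables rather than the principal components.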
[107]: df_y_cluster
[109]: plt.figure(figsize=(10,10))
ax = sns.scatterplot(x="P1", y="P2", hue=y_pred, data=X_principle, palette=['red','green','blue'])
plt.show()