
Feature Engineering on Banks' Private Credit Data

import pandas as pd

# upload the dataset


from google.colab import files
files.upload()

credit.csv (text/csv), 551806 bytes, last modified: 9/17/2024
Saving credit.csv to credit.csv
{'credit.csv': b'Cust_No,Target,Nation,Birth_Place,Gender,Age,Marriage_State,Highest Education,House_State,Work_Years,Title,Duty,Industry,Year_Income,Couple_Year_Income,L12_Month_Pay_Amount,Couple_L12_Month_Pay_Amount,Ast

df=pd.read_csv('credit.csv', index_col=0)
df.head()

         Target  Nation  Birth_Place  Gender  Age  Marriage_State  Highest Education  House_State  Work_Years  Title  ...  ZX_Max_Overdue_Account
Cust_No
2             0     1.0       330621       1   55            40.0               71.0          1.0           0    9.0  ...                       1
4             0     1.0       330621       0   40            99.0               90.0          1.0           0    NaN  ...                       0
6             0     1.0       330621       1   45            20.0               71.0          1.0           0    NaN  ...                       1
7             0     NaN       330421       0   32            20.0               21.0          1.0           0    NaN  ...                       1
8             0     1.0       330621       0   46            20.0               71.0          NaN           0    NaN  ...                       0

5 rows × 31 columns
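Before cleaning, it can help to confirm the column dtypes and how imbalanced Target is. A minimal sketch using only the frame loaded above:

# quick profile of the loaded frame (a sketch; df is the DataFrame read above)
df.info()                                          # column dtypes and non-null counts
print(df['Target'].value_counts(normalize=True))   # class balance of the default label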

#data collection --> data cleansing


#view the missing values
import missingno
missingno.matrix(df)

<Axes: >  (missing-value matrix plot per column rendered here)

# missing-value rate per column: isnull().sum() divided by the number of rows
df_missing = pd.DataFrame(df.isnull().sum() / df.shape[0], columns=['missing_rate']).reset_index()
df_missing.sort_values(by='missing_rate', ascending=False)[:10]


    index                        missing_rate
9   Title                        0.603075
11  Industry                     0.594120
7   House_State                  0.424806
1   Nation                       0.356877
5   Marriage_State               0.334404
6   Highest Education            0.325786
10  Duty                         0.126563
25  ZX_Max_Credits               0.000000
21  ZX_Max_Overdue_Account       0.000000
22  ZX_Link_Max_Overdue_Amount   0.000000

# fill the missing values
# fillna() with a summary statistic: mean, median, or (for categorical codes) the mode

missing_col = ['Title', 'Industry', 'House_State', 'Nation', 'Marriage_State', 'Highest Education', 'Duty']

for col in missing_col:
    # mode() returns a Series; take its first value rather than calling int() on it,
    # which is deprecated and raises a FutureWarning
    df[col] = df[col].fillna(df[col].mode().iloc[0])
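The cell above imputes the listed categorical columns with their mode. For numeric columns one would more typically use the median; the sketch below is illustrative and not part of the run shown here:

# median imputation for the numeric columns (illustrative sketch; the run above
# only imputes the categorical columns with their mode)
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())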

# re-check the missing rate after imputation
df_missing_2 = pd.DataFrame(df.isnull().sum() / df.shape[0], columns=['missing_rate']).reset_index()
df_missing_2.sort_values(by='missing_rate', ascending=False)[:10]

    index                            missing_rate
0   Target                           0.0
16  Ast_Curr_Bal                     0.0
29  ZX_Credit_Total_Overdue_Months   0.0
28  ZX_Credit_Max_Overdu_Amount      0.0
27  ZX_Max_Overdue_Credits           0.0
26  ZX_Max_Credit_Banks              0.0
25  ZX_Max_Credits                   0.0
24  ZX_Max_Overdue_Duration          0.0
23  ZX_Total_Overdu_Months           0.0
22  ZX_Link_Max_Overdue_Amount       0.0

missingno.matrix(df)


<Axes: >  (missing-value matrix plot after imputation rendered here)

# feature selection: filter method
# cross-tabulate House_State against Target and convert counts to row percentages
cross_table = pd.crosstab(df.House_State, columns=df.Target, margins=True)
cross_table_rowpct = cross_table.div(cross_table["All"], axis=0)
cross_table_rowpct

Target              0         1  All
House_State
1.0          0.980996  0.019004  1.0
2.0          0.954545  0.045455  1.0
3.0          0.941176  0.058824  1.0
4.0          1.000000  0.000000  1.0
5.0          0.980392  0.019608  1.0
6.0          1.000000  0.000000  1.0
7.0          0.857143  0.142857  1.0
8.0          1.000000  0.000000  1.0
All          0.980399  0.019601  1.0
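The row percentages suggest the default rate varies by House_State (for example, 14.3% for code 7.0 versus about 2.0% overall). A formal test of independence can be run on the raw counts with scipy; this is a sketch, complementary to the sklearn chi2 scores computed in the next cell:

# chi-square test of independence between House_State and Target
# (sketch; uses raw counts without the 'All' margins)
from scipy.stats import chi2_contingency

observed = pd.crosstab(df.House_State, df.Target)
chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2_stat:.2f}, p={p_value:.4f}, dof={dof}")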


# perform the chi-square test

# separate independent vs. dependent variables:
#   Target --> y (dependent)
#   X      --> all other columns (independent)
#   X_category --> the categorical / coded columns only
X = df.drop('Target', axis=1)
y = df['Target']
X_category = df[['Nation', 'Birth_Place', 'Gender', 'Marriage_State', 'Highest Education', 'House_State', 'Work_Years', 'Title', 'Duty', 'Industry']]

from sklearn.feature_selection import chi2

# unpack into new names so the chi2 function itself is not overwritten
chi2_stats, pvals = chi2(X_category, y)
dict_feature = {}
for name, stat in zip(X_category.columns.values, chi2_stats):
    dict_feature[name] = stat
kai = sorted(dict_feature.items(), key=lambda item: item[1], reverse=True)
kai

[('Work_Years', 30037.98992988671),
('Birth_Place', 2337.714562647344),

('Marriage_State', 42.7575821276435),
('Duty', 30.50877073893663),
('Industry', 25.452013582101742),
('Nation', 5.256723621174332),
('Gender', 2.309012664555949),
('Highest Education', 1.2297808819626934),
('Title', 0.8774406202190365),
('House_State', 0.3184384017473372)]
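To act on these scores directly, SelectKBest can keep the top-k categorical features by the same chi-square statistic; k=5 below is an arbitrary illustration, not a choice made in this notebook:

# keep the top-k categorical features ranked by the chi-square score (sketch; k=5 is arbitrary)
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=5).fit(X_category, y)
print(list(X_category.columns[selector.get_support()]))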

# test correlation between the continuous variables

# flag variables that are highly correlated with one another
nominal_features = ['Nation', 'Birth_Place', 'Gender', 'Marriage_State', 'Highest Education', 'House_State', 'Work_Years', 'Title', 'Duty', 'Industry']
numerical_features = [col_ for col_ in df.columns if col_ not in nominal_features]
numerical_features.pop(0)   # drop 'Target' so only numeric predictors remain
X_num = df[numerical_features]

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation heatmap of the numeric features
plt.figure(figsize=(25, 15))
corr_matrix = X_num.corr(method='pearson')
sns.heatmap(corr_matrix, annot=True)

<Axes: >  (annotated Pearson correlation heatmap rendered here)

# collect pairs of distinct features with correlation >= 0.8, counting each pair once
cols_pair = []
for index_ in corr_matrix.index:
    for col_ in corr_matrix.columns:
        if corr_matrix.loc[index_, col_] >= 0.8 and index_ != col_ and (col_, index_) not in cols_pair:
            cols_pair.append((index_, col_))
cols_pair

[('ZX_Max_Account_Number', 'ZX_Max_Link_Banks'),
('ZX_Max_Credits', 'ZX_Max_Credit_Banks')]
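For each pair above the 0.8 threshold, one member is redundant and can be dropped; which one to keep is a modelling choice. A sketch that drops the second feature of each pair:

# drop the second feature of each highly correlated pair (illustrative policy)
to_drop = {second for _, second in cols_pair}
X_num_reduced = X_num.drop(columns=list(to_drop))
print(X_num_reduced.shape)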


# try the wrapper method: recursive feature elimination (RFE)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

x_rfe = RFE(estimator=LogisticRegression(), n_features_to_select=20).fit(X, y)
print(x_rfe.ranking_)
print(x_rfe.support_)

/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
(the same ConvergenceWarning is emitted at each elimination step)

[ 5  1  9  1  1  1 10  1  1  1  1  1  1  1 11  1  1  4  1  1  1  1  1  1
  3  8  7  2  1  6]
[False  True False  True  True  True False  True  True  True  True  True
  True  True False  True  True False  True  True  True  True  True  True
 False False False False  True False]
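The ConvergenceWarning concerns the logistic regression inside RFE, not the elimination itself; the warning's own advice is to scale the data or raise max_iter. The sketch below re-runs the elimination on standardized features and reads off the selected column names (the scaling step and max_iter=1000 are assumptions, not part of the run above):

# RFE on standardized features to address the ConvergenceWarning
# (sketch; StandardScaler and max_iter=1000 are assumptions)
from sklearn.preprocessing import StandardScaler

X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns, index=X.index)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=20).fit(X_scaled, y)
print(list(X.columns[rfe.support_]))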

# Embedded Method
from sklearn.ensemble import RandomForestClassifier
emb = RandomForestClassifier()
emb.fit(X,y)

RandomForestClassifier()

# rank features by the random forest's impurity-based importances
colss = [i for i in X.columns]

emb1 = sorted(zip(map(lambda x: round(x, 4), emb.feature_importances_), colss), reverse=True)
emb1

[(0.1313, 'Ast_Curr_Bal'),
(0.1137, 'Age'),
(0.0921, 'Year_Income'),
(0.0669, 'Std_Cred_Limit'),
(0.0421, 'ZX_Link_Max_Overdue_Amount'),
(0.0412, 'ZX_Max_Account_Number'),
(0.0373, 'Highest Education'),
(0.0349, 'Duty'),
(0.0347, 'ZX_Max_Link_Banks'),
(0.0341, 'ZX_Total_Overdu_Months'),
(0.0312, 'Industry'),
(0.0307, 'Birth_Place'),
(0.0295, 'ZX_Max_Overdue_Duration'),
(0.0295, 'ZX_Max_Overdue_Account'),
(0.0252, 'Loan_Curr_Bal'),
(0.0248, 'Couple_Year_Income'),
(0.0244, 'Marriage_State'),
(0.0222, 'L12_Month_Pay_Amount'),
(0.0198, 'ZX_Max_Credit_Banks'),
(0.0192, 'ZX_Max_Credits'),
(0.0188, 'ZX_Credit_Max_Overdu_Amount'),
(0.0155, 'Work_Years'),
(0.0142, 'ZX_Credit_Max_Overdue_Duration'),
(0.0141, 'Gender'),
(0.0124, 'Title'),

(0.012, 'ZX_Credit_Total_Overdue_Months'),
(0.0102, 'ZX_Max_Overdue_Credits'),
(0.0097, 'Nation'),
(0.0084, 'House_State'),
(0.0, 'Couple_L12_Month_Pay_Amount')]
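One simple way to act on these importances is to keep the N highest-ranked features; the cutoff of 20 below is an arbitrary illustration, not a threshold chosen in this notebook:

# keep the 20 features with the highest random-forest importance (sketch; cutoff is arbitrary)
importances = pd.Series(emb.feature_importances_, index=X.columns)
top20 = importances.sort_values(ascending=False).head(20).index
X_embedded = X[top20]
print(list(top20))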

# variable removal
deleted_col = ['ZX_Max_Overdue_Duration','Couple_L12_Month_Pay_Amount','House_State', 'ZX_Max_Overdue_Credits', 'Couple_Year_Income','Marriage
df_selected = df.drop(deleted_col, axis=1)
df_selected.head()

         Target  Nation  Birth_Place  Gender  Age  Highest Education  Duty  Industry  Year_Income  L12_Month_Pay_Amount  ...  Loan_Curr_Bal  ZX_M
Cust_No
2             0     1.0       330621       1   55               71.0   9.0      52.0     100000.0                   0.0  ...       560000.0
4             0     1.0       330621       0   40               90.0   2.0      51.0     300000.0                   0.0  ...            0.0
6             0     1.0       330621       1   45               71.0   0.0      17.0     150000.0                   0.0  ...      1350000.0
7             0     1.0       330421       0   32               21.0   0.0      83.0      80000.0                   0.0  ...       120000.0
8             0     1.0       330621       0   46               71.0   0.0      51.0      50000.0                   0.0  ...            0.0

5 rows × 22 columns
