
Supervised Learning Project.ipynb - Colab

Bank Customer Churn Prediction


In this project, we use supervised learning models to identify customers who are likely to churn in the future. Furthermore, we analyze the top factors that influence user retention (see the linked dataset information for details).

Things to know

For students with no prior background, the goal is mainly to understand the material.
You are not expected to be able to write this code from scratch.
The way of thinking behind this code is different from the code in the algorithms course.
Run the code yourself after class and try your best to understand it.
Tools are ultimately there to assist us; focus on learning one well first.
Gradually build up your ability to debug Python code.

Contents
Part 1: Data Exploration
Part 2: Feature Preprocessing
Part 3: Model Training and Results Evaluation

Part 0: Setup Google Drive Environment / Data Collection


check this link for more info

# install pydrive to load data


!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth


from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

WARNING:root:pydrive is deprecated and no longer maintained. We recommend that you migrate your projects to pydrive2, the maintained fork of pydrive
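Since the warning notes that PyDrive is deprecated, here is a minimal alternative sketch that mounts Google Drive directly with Colab's built-in helper; it assumes you have saved the CSV to your own Drive, and the path below is hypothetical.

# Alternative sketch: mount Google Drive instead of using the deprecated PyDrive
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')  # prompts for authorization inside Colab

# hypothetical path -- adjust to wherever the CSV lives in your Drive
churn_df = pd.read_csv('/content/drive/MyDrive/bank.data.csv')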

# the same way we get id from last class


# https://drive.google.com/file/d/1szdCZ98EK59cfJ4jG03g1HOv_OhC1oyN/view?usp=sharing

id = "1szdCZ98EK59cfJ4jG03g1HOv_OhC1oyN"

file = drive.CreateFile({'id':id})
file.GetContentFile('bank.data.csv')

import numpy as np
import pandas as pd

churn_df = pd.read_csv('bank.data.csv')
churn_df.head()


RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMemb

0 1 15634602 Hargrave 619 France Female 42 2 0.00 1 1

1 2 15647311 Hill 608 Spain Female 41 1 83807.86 1 0

2 3 15619304 Onio 502 France Female 42 8 159660.80 3 1

3 4 15701354 Boni 699 France Female 39 1 0.00 2 0

4 5 15737888 Mitchell 850 Spain Female 43 2 125510.82 1 1

Part 1: Data Exploration


Part 1.1: Understand the Raw Dataset
# check data info
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RowNumber 10000 non-null int64
1 CustomerId 10000 non-null int64
2 Surname 10000 non-null object
3 CreditScore 10000 non-null int64
4 Geography 10000 non-null object
5 Gender 10000 non-null object
6 Age 10000 non-null int64
7 Tenure 10000 non-null int64
8 Balance 10000 non-null float64
9 NumOfProducts 10000 non-null int64
10 HasCrCard 10000 non-null int64
11 IsActiveMember 10000 non-null int64
12 EstimatedSalary 10000 non-null float64
13 Exited 10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB

# check the unique values for each column


churn_df.nunique()

RowNumber 10000
CustomerId 10000
Surname 2932
CreditScore 460
Geography 3
Gender 2
Age 70
Tenure 11
Balance 6382
NumOfProducts 4
HasCrCard 2
IsActiveMember 2
EstimatedSalary 9999
Exited 2
dtype: int64

# Get target variable


y = churn_df['Exited']
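Before modeling, it is worth checking how imbalanced the target is; a quick sketch:

# Sketch: check the class balance of the target (Exited = 1 means the customer churned)
print(y.value_counts())                 # absolute counts per class
print(y.value_counts(normalize=True))   # churn rate as a proportion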

Part 1.2: Understand the Features


# check missing values
churn_df.isnull().sum()

RowNumber 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0

Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64

# understand numerical features

# discrete: 'CreditScore', 'Age', 'Tenure', 'NumOfProducts'
# continuous: 'Balance', 'EstimatedSalary'
churn_df[['CreditScore', 'Age', 'Tenure', 'NumOfProducts','Balance', 'EstimatedSalary']].describe()

CreditScore Age Tenure NumOfProducts Balance EstimatedSalary

count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000

mean 650.528800 38.921800 5.012800 1.530200 76485.889288 100090.239881

std 96.653299 10.487806 2.892174 0.581654 62397.405202 57510.492818

min 350.000000 18.000000 0.000000 1.000000 0.000000 11.580000

25% 584.000000 32.000000 3.000000 1.000000 0.000000 51002.110000

50% 652.000000 37.000000 5.000000 1.000000 97198.540000 100193.915000

75% 718.000000 44.000000 7.000000 2.000000 127644.240000 149388.247500

max 850.000000 92.000000 10.000000 4.000000 250898.090000 199992.480000

# check the feature distribution


# pandas.DataFrame.describe()
# boxplot, distplot, countplot
import matplotlib.pyplot as plt
import seaborn as sns

# boxplot for numerical feature


_,axss = plt.subplots(2,3, figsize=[20,10])
sns.boxplot(x='Exited', y ='CreditScore', data=churn_df, ax=axss[0][0])
sns.boxplot(x='Exited', y ='Age', data=churn_df, ax=axss[0][1])
sns.boxplot(x='Exited', y ='Tenure', data=churn_df, ax=axss[0][2])
sns.boxplot(x='Exited', y ='NumOfProducts', data=churn_df, ax=axss[1][0])
sns.boxplot(x='Exited', y ='Balance', data=churn_df, ax=axss[1][1])
sns.boxplot(x='Exited', y ='EstimatedSalary', data=churn_df, ax=axss[1][2])


<Axes: xlabel='Exited', ylabel='EstimatedSalary'>

# understand categorical feature


# 'Geography', 'Gender'
# 'HasCrCard', 'IsActiveMember'
_,axss = plt.subplots(2,2, figsize=[20,10])
sns.countplot(x='Exited', hue='Geography', data=churn_df, ax=axss[0][0])
sns.countplot(x='Exited', hue='Gender', data=churn_df, ax=axss[0][1])
sns.countplot(x='Exited', hue='HasCrCard', data=churn_df, ax=axss[1][0])
sns.countplot(x='Exited', hue='IsActiveMember', data=churn_df, ax=axss[1][1])


<Axes: xlabel='Exited', ylabel='count'>

Part 2: Feature Preprocessing


# Get the feature space by dropping useless features
to_drop = ['RowNumber','CustomerId','Surname','Exited']
X = churn_df.drop(to_drop, axis = 1)

X.head()

CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary

0 619 France Female 42 2 0.00 1 1 1 101348.88

1 608 Spain Female 41 1 83807.86 1 0 1 112542.58

2 502 France Female 42 8 159660.80 3 1 0 113931.57

3 699 France Female 39 1 0.00 2 0 0 93826.63

4 850 Spain Female 43 2 125510.82 1 1 1 79084.10

X.dtypes

CreditScore int64
Geography object
Gender object
Age int64
Tenure int64
Balance float64

NumOfProducts int64
HasCrCard int64
IsActiveMember int64
EstimatedSalary float64
dtype: object

cat_cols = X.columns[X.dtypes == 'object']


num_cols = X.columns[(X.dtypes == 'float64') | (X.dtypes == 'int64')]

num_cols

Index(['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',


'IsActiveMember', 'EstimatedSalary'],
dtype='object')

cat_cols

Index(['Geography', 'Gender'], dtype='object')

Split dataset

# Split data into training and testing


# 100 -> 75:y=1, 25:y=0
# training(80): 60 y=1; 20 y=0
# testing(20): 15 y=1; 5 y=0

from sklearn import model_selection

# Reserve 25% for testing


# stratify example:
# 100 -> y: 80 '0', 20 '1' -> 4:1
# 80% training 64: '0', 16:'1' -> 4:1
# 20% testing 16:'0', 4: '1' -> 4:1
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, stratify = y, random_state = 1) # stratified sampling

print('training data has ' + str(X_train.shape[0]) + ' observations with ' + str(X_train.shape[1]) + ' features')
print('test data has ' + str(X_test.shape[0]) + ' observations with ' + str(X_test.shape[1]) + ' features')

training data has 7500 observations with 10 features
test data has 2500 observations with 10 features

Example: 10000 samples -> 8000 '0' + 2000 '1', split into 75% training and 25% test.

Without stratified sampling (extreme case):

1. testing: 2000 '1' + 500 '0'
2. training: 7500 '0'

With stratified sampling:

1. testing: 2000 '0' + 500 '1'
2. training: 6000 '0' + 1500 '1'
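A quick sketch to confirm that the stratified split preserves the churn ratio in both subsets:

# Sketch: the churn proportion should be (roughly) the same in the full set, training set, and test set
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))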

Read more about handling categorical features; there is also a dedicated package for encoding.

X_train.head()


CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary

7971 633 Spain Male 42 10 0.00 1 0 1 79408.17

9152 708 Germany Female 23 4 71433.08 1 1 0 103697.57

6732 548 France Female 37 9 0.00 2 0 0 98029.58

902 645 France Female 48 7 90612.34 1 1 1 149139.13

2996 729 Spain Female 45 7 91091.06 2 1 0 71133.12

# One hot encoding


# another way: get_dummies
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoding(df, enc, categories):
    # transform the selected categorical columns into dummy columns and append them
    transformed = pd.DataFrame(enc.transform(df[categories]).toarray(), columns = enc.get_feature_names_out(categories))
    return pd.concat([df.reset_index(drop=True), transformed], axis=1).drop(categories, axis=1)

categories = ['Geography']
enc_ohe = OneHotEncoder()
enc_ohe.fit(X_train[categories])

X_train = OneHotEncoding(X_train, enc_ohe, categories)


X_test = OneHotEncoding(X_test, enc_ohe, categories)

X_train.head()

CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_France Geog

0 633 Male 42 10 0.00 1 0 1 79408.17 0.0

1 708 Female 23 4 71433.08 1 1 0 103697.57 0.0

2 548 Female 37 9 0.00 2 0 0 98029.58 1.0

3 645 Female 48 7 90612.34 1 1 1 149139.13 1.0

4 729 Female 45 7 91091.06 2 1 0 71133.12 0.0
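As the comment above notes, pandas get_dummies is another way to one-hot encode; a minimal sketch (unlike a fitted OneHotEncoder, it does not remember the training-time categories, so encoding train and test separately can produce mismatched columns):

# Sketch: one-hot encoding with pandas get_dummies, applied to the raw feature matrix X
X_dummies = pd.get_dummies(X, columns=['Geography'])
X_dummies.head()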

# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder

categories = ['Gender']
enc_oe = OrdinalEncoder()
enc_oe.fit(X_train[categories])

X_train[categories] = enc_oe.transform(X_train[categories])
X_test[categories] = enc_oe.transform(X_test[categories])

X_train.head()

CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_France Geog

0 633 1.0 42 10 0.00 1 0 1 79408.17 0.0

1 708 0.0 23 4 71433.08 1 1 0 103697.57 0.0

2 548 0.0 37 9 0.00 2 0 0 98029.58 1.0

3 645 0.0 48 7 90612.34 1 1 1 149139.13 1.0

4 729 0.0 45 7 91091.06 2 1 0 71133.12 0.0

Standardize/Normalize Data

# Scale the data, using standardization
# standardization (x-mean)/std
# normalization (x-x_min)/(x_max-x_min) ->[0,1]

# 1. speed up gradient descent


# 2. same scale
# 3. algorithm requirements

# for example, use training data to train the standardscaler to get mean and std
# apply mean and std to both training and testing data.
# fit_transform does the training and applying, transform only does applying.
# Because we can't use any info from test, and we need to do the same modification
# to testing data as well as training data

# https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-al
# https://scikit-learn.org/stable/modules/preprocessing.html

# min-max example: (x - x_min) / (x_max - x_min)
# scaler.fit(X_train) learns min and max from the training data only,
#   e.g. fitting on [1,2,3,4,5,6,100] gives min:1, max:100
# scaler.transform(X_train) applies that min/max to the training data
# scaler.transform(X_test) applies the same min/max to the test data

# standardization works the same way:
# scaler.fit(X_train) learns the mean and std from the training data only
# scaler.transform(X_train) applies that mean/std to the training data
# scaler.transform(X_test) applies the same mean/std to the test data
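For completeness, a minimal sketch of the min-max scaling described above, using scikit-learn's MinMaxScaler (the notebook itself uses StandardScaler below):

# Sketch: min-max normalization -- fit on training data only, then apply to both splits
from sklearn.preprocessing import MinMaxScaler

mm_scaler = MinMaxScaler()
mm_scaler.fit(X_train[num_cols])                      # learns x_min and x_max per column
X_train_mm = mm_scaler.transform(X_train[num_cols])
X_test_mm = mm_scaler.transform(X_test[num_cols])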

from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
scaler.fit(X_train[num_cols])

X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

X_train.head()

CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_Fran

0 -0.172985 1.0 0.289202 1.731199 -1.218916 -0.912769 -1.542199 0.968496 -0.352044

1 0.602407 0.0 -1.509319 -0.341156 -0.076977 -0.912769 0.648425 -1.032529 0.072315

2 -1.051762 0.0 -0.184093 1.385806 -1.218916 0.796109 -1.542199 -1.032529 -0.026711

3 -0.048922 0.0 0.857156 0.695022 0.229625 -0.912769 0.648425 0.968496 0.866221

4 0.819517 0.0 0.573179 0.695022 0.237278 0.796109 0.648425 -1.032529 -0.496617

Part 3: Model Training and Result Evaluation


Part 3.1: Model Training
build models
#@title build models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Logistic Regression
classifier_logistic = LogisticRegression()

# K Nearest Neighbors
classifier_KNN = KNeighborsClassifier()

# Random Forest
classifier_RF = RandomForestClassifier()

# Train the model


classifier_logistic.fit(X_train, y_train)

https://colab.research.google.com/drive/12WP7JHDfPnibQawRvSL_JVGW-quq1plj?usp=sharing#scrollTo=SIvRSRqAi0Md&printMode=true 8/14
2024/6/5 15:27 Supervised Learning Project.ipynb - Colab

▾ LogisticRegression
LogisticRegression()

# Prediction of test data


classifier_logistic.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

# Accuracy of test data


classifier_logistic.score(X_test, y_test)

0.8088
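Only the logistic regression model is trained above; a minimal sketch that fits and scores all three classifiers through the same interface (accuracy values will vary from run to run):

# Sketch: train and score each classifier on the same split
for name, clf in [('Logistic Regression', classifier_logistic),
                  ('KNN', classifier_KNN),
                  ('Random Forest', classifier_RF)]:
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))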

(Optional) Part 3.2: Use Grid Search to Find Optimal Hyperparameters
Alternative: random search

# Loss/cost function --> (wx + b - y)^2 + λ*|w| --> λ is a hyperparameter
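As an alternative to the exhaustive grid below, here is a minimal random-search sketch; the parameter distributions are illustrative assumptions, not the values used in this notebook:

# Sketch: random search samples n_iter hyperparameter combinations instead of trying the full grid
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    LogisticRegression(solver='liblinear'),
    param_distributions={'penalty': ['l1', 'l2'], 'C': uniform(0.01, 1)},
    n_iter=10, cv=5, random_state=1)
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)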

from sklearn.model_selection import GridSearchCV

# helper function for printing out grid search results


def print_grid_search_metrics(gs):
    print("Best score: " + str(gs.best_score_))
    print("Best parameters set:")
    best_parameters = gs.best_params_
    for param_name in sorted(best_parameters.keys()):
        print(param_name + ':' + str(best_parameters[param_name]))

Part 3.2.1: Find Optimal Hyperparameters - LogisticRegression

# Possible hyperparameter options for Logistic Regression regularization
# Penalty is chosen from 'l1' or 'l2'
# C is 1/lambda, the inverse of the regularization strength (weight) for L1 and L2
# solver: the algorithm used to find the weights that minimize the cost function

# ('l1', 0.01) ('l1', 0.05) ('l1', 0.1) ('l1', 0.2) ('l1', 1)
# ('l2', 0.01) ('l2', 0.05) ('l2', 0.1) ('l2', 0.2) ('l2', 1)
parameters = {
'penalty':('l2','l1'),
'C':(0.01, 0.05, 0.1, 0.2, 1)
}

Grid_LR = GridSearchCV(LogisticRegression(solver='liblinear'),parameters, cv = 5)
Grid_LR.fit(X_train, y_train)

▸ GridSearchCV
▸ estimator: LogisticRegression
▸ LogisticRegression

# the best hyperparameter combination


# C = 1/lambda
print_grid_search_metrics(Grid_LR)

Best score: 0.8125333333333333


Best parameters set:
C:1
penalty:l1

# best model
best_LR_model = Grid_LR.best_estimator_

best_LR_model.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])


best_LR_model.score(X_test, y_test)

0.8092

LR_models = pd.DataFrame(Grid_LR.cv_results_)
res = (LR_models.pivot(index='param_penalty', columns='param_C', values='mean_test_score'))
_ = sns.heatmap(res, cmap='viridis')

<ipython-input-37-990a4e8b9ba5>:2: FutureWarning: In a future version, the Index constructor will not infer numeric dtypes when passed object-dtype sequences (matching Series behavior)
  res = (LR_models.pivot(index='param_penalty', columns='param_C', values='mean_test_score'))

Part 3.2.2: Find Optimal Hyperparameters: KNN

# Possible hyperparameter options for KNN


# Choose k
parameters = {
'n_neighbors':[1,3,5,7,9]
}
Grid_KNN = GridSearchCV(KNeighborsClassifier(),parameters, cv=5)
Grid_KNN.fit(X_train, y_train)

▸ GridSearchCV
▸ estimator: KNeighborsClassifier
▸ KNeighborsClassifier

# best k
print_grid_search_metrics(Grid_KNN)

Best score: 0.8433333333333334


Best parameters set:
n_neighbors:9

best_KNN_model = Grid_KNN.best_estimator_

best_KNN_model.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

best_KNN_model.score(X_test, y_test)

0.8428

Part 3.2.3: Find Optimal Hyperparameters: Random Forest

# Possible hyperparameter options for Random Forest
# Choose the number of trees
parameters = {
'n_estimators' : [60,80,100],
'max_depth': [1,5,10]
}
Grid_RF = GridSearchCV(RandomForestClassifier(),parameters, cv=5)
Grid_RF.fit(X_train, y_train)

▸ GridSearchCV
▸ estimator: RandomForestClassifier
▸ RandomForestClassifier

# best number of trees


print_grid_search_metrics(Grid_RF)

Best score: 0.8650666666666667


Best parameters set:
max_depth:10
n_estimators:100

# best random forest


best_RF_model = Grid_RF.best_estimator_

best_RF_model

▾ RandomForestClassifier
RandomForestClassifier(max_depth=10)

best_RF_model.score(X_test, y_test)

0.8596

Part 3.3: Model Evaluation - Confusion Matrix (Precision, Recall, Accuracy)

We treat the class of interest (churn) as the positive class.

TP: a real churn customer correctly labeled as churn.

Precision (PPV, positive predictive value): tp / (tp + fp); the number of correctly predicted churns divided by the total number of predicted churns. High precision means low fp: not many retained users were predicted as churn users.

Recall (sensitivity, hit rate, true positive rate): tp / (tp + fn); the fraction of actual churn users that are predicted correctly. High recall means low fn: not many churn users were predicted as retained users.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

# calculate accuracy, precision and recall from a confusion matrix laid out as [[tn, fp], [fn, tp]]
def cal_evaluation(classifier, cm):
    tn = cm[0][0]
    fp = cm[0][1]
    fn = cm[1][0]
    tp = cm[1][1]
    accuracy = (tp + tn) / (tp + fp + fn + tn + 0.0)
    precision = tp / (tp + fp + 0.0)
    recall = tp / (tp + fn + 0.0)
    print(classifier)
    print("Accuracy is: " + str(accuracy))
    print("precision is: " + str(precision))
    print("recall is: " + str(recall))
    print()

# print evaluation metrics for each (classifier, confusion matrix) pair
def draw_confusion_matrices(confusion_matrices):
    class_names = ['Not', 'Churn']
    for cm in confusion_matrices:
        classifier, cm = cm[0], cm[1]
        cal_evaluation(classifier, cm)

# Confusion matrix, accuracy, precision and recall for random forest, logistic regression and KNN
confusion_matrices = [
("Random Forest", confusion_matrix(y_test,best_RF_model.predict(X_test))),
("Logistic Regression", confusion_matrix(y_test,best_LR_model.predict(X_test))),
("K nearest neighbor", confusion_matrix(y_test, best_KNN_model.predict(X_test)))
]

draw_confusion_matrices(confusion_matrices)

Random Forest
Accuracy is: 0.8596
precision is: 0.8038461538461539
recall is: 0.4106090373280943

Logistic Regression
Accuracy is: 0.8092
precision is: 0.5963855421686747
recall is: 0.1944990176817289

K nearest neighbor
Accuracy is: 0.8428
precision is: 0.7283464566929134
recall is: 0.36345776031434185
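The same numbers can also be obtained directly from the metric functions imported above; a minimal sketch for the random forest model:

# Sketch: precision/recall straight from scikit-learn, plus a per-class report
y_pred = best_RF_model.predict(X_test)
print(precision_score(y_test, y_pred))   # tp / (tp + fp)
print(recall_score(y_test, y_pred))      # tp / (tp + fn)
print(classification_report(y_test, y_pred, target_names=['Not churn', 'Churn']))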

Part 3.4: Model Evaluation - ROC & AUC

RandomForestClassifier, KNeighborsClassifier and LogisticRegression all provide a predict_proba() method.

Part 3.4.1: ROC of RF Model

from sklearn.metrics import roc_curve


from sklearn import metrics

# Use predict_proba to get the probability results of Random Forest


y_pred_rf = best_RF_model.predict_proba(X_test)[:, 1]
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_rf)

best_RF_model.predict_proba(X_test)

array([[0.71741715, 0.28258285],
[0.93310478, 0.06689522],
[0.73605511, 0.26394489],
...,
[0.8564215 , 0.1435785 ],

[0.91990008, 0.08009992],
[0.90302627, 0.09697373]])

# ROC curve of Random Forest result


import matplotlib.pyplot as plt
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_rf, tpr_rf, label='RF')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve - RF model')
plt.legend(loc='best')
plt.show()

from sklearn import metrics

# AUC score
metrics.auc(fpr_rf,tpr_rf)

0.8501547730997742
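Equivalently, roc_auc_score computes the AUC directly from the predicted probabilities; a one-line sketch:

# Sketch: AUC from probabilities, without building the ROC curve first
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_rf))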

Part 3.4.2: ROC of LR Model

# Use predict_proba to get the probability results of Logistic Regression


y_pred_lr = best_LR_model.predict_proba(X_test)[:, 1]
fpr_lr, tpr_lr, thresh = roc_curve(y_test, y_pred_lr)

best_LR_model.predict_proba(X_test)

array([[0.82435654, 0.17564346],
[0.93171599, 0.06828401],
[0.85520711, 0.14479289],
...,
[0.71449354, 0.28550646],
[0.89278295, 0.10721705],
[0.8556093 , 0.1443907 ]])

# ROC Curve
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lr, tpr_lr, label='LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve - LR Model')
plt.legend(loc='best')
plt.show()


# AUC score
metrics.auc(fpr_lr,tpr_lr)

0.7722018237274021

Part 4: Model Extra Functionality


Part 4.1: Logistic Regression Model
The correlated features that we are interested in:

X_with_corr = X.copy()

X_with_corr = OneHotEncoding(X_with_corr, enc_ohe, ['Geography'])


X_with_corr['Gender'] = enc_oe.transform(X_with_corr[['Gender']])
X_with_corr['SalaryInRMB'] = X_with_corr['EstimatedSalary'] * 6.4
X_with_corr.head()

CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveM

0 619 0.0 42 2 0.00 1 1

1 608 0.0 41 1 83807.86 1 0

2 502 0.0 42 8 159660.80 3 1

3 699 0.0 39 1 0.00 2 0

4 850 0.0 43 2 125510.82 1 1

# add L1 regularization to logistic regression


# check the coef for feature selection
scaler = StandardScaler()
X_l1 = scaler.fit_transform(X_with_corr)
LRmodel_l1 = LogisticRegression(penalty="l1", C = 0.04, solver='liblinear')
LRmodel_l1.fit(X_l1, y)

indices = np.argsort(abs(LRmodel_l1.coef_[0]))[::-1]

print ("Logistic Regression (L1) Coefficients")


for ind in range(X_with_corr.shape[1]):
    print("{0} : {1}".format(X_with_corr.columns[indices[ind]], round(LRmodel_l1.coef_[0][indices[ind]], 4)))

Logistic Regression (L1) Coefficients


Age : 0.7307
IsActiveMember : -0.5046
Geography_Germany : 0.3121
Gender : -0.2409
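For comparison, a hedged sketch that fits the same features with L2 regularization; unlike L1, L2 tends to shrink the two perfectly correlated salary columns together rather than driving one of them to exactly zero:

# Sketch: L2-regularized fit on the same scaled matrix, for comparison with the L1 coefficients above
LRmodel_l2 = LogisticRegression(penalty="l2", C = 0.04, solver='liblinear')
LRmodel_l2.fit(X_l1, y)

indices_l2 = np.argsort(abs(LRmodel_l2.coef_[0]))[::-1]
for ind in range(X_with_corr.shape[1]):
    print("{0} : {1}".format(X_with_corr.columns[indices_l2[ind]], round(LRmodel_l2.coef_[0][indices_l2[ind]], 4)))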

