Supervised Learning Project.ipynb - Colab
For students with zero background, focus on understanding rather than mastery.
You are not expected to be able to write this code from scratch.
The thinking behind this code differs from the code in the algorithms course.
After class, run the code yourself and try your best to understand it.
Tools are ultimately there to assist us, so start by learning one well.
Gradually build up your Python debugging skills.
Contents
Part 1: Data Exploration
Part 2: Feature Preprocessing
Part 3: Model Training and Results Evaluation
from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# Authenticate and create a PyDrive client for Google Drive access
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
WARNING:root:pydrive is deprecated and no longer maintained. We recommend that you migrate your projects to pydrive2, the maintained fork of pydrive
# Download the dataset from Google Drive by its file ID
id = "1szdCZ98EK59cfJ4jG03g1HOv_OhC1oyN"
file = drive.CreateFile({'id':id})
file.GetContentFile('bank.data.csv')
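Given the deprecation warning above, the same download works with pydrive2, the maintained fork; the API mirrors pydrive, so only the imports change (a sketch, assuming pydrive2 is installed):

# pydrive2 is a drop-in fork of pydrive; the rest of the cell stays unchanged
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive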
import numpy as np
import pandas as pd
churn_df = pd.read_csv('bank.data.csv')
churn_df.head()
RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
churn_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RowNumber 10000 non-null int64
1 CustomerId 10000 non-null int64
2 Surname 10000 non-null object
3 CreditScore 10000 non-null int64
4 Geography 10000 non-null object
5 Gender 10000 non-null object
6 Age 10000 non-null int64
7 Tenure 10000 non-null int64
8 Balance 10000 non-null float64
9 NumOfProducts 10000 non-null int64
10 HasCrCard 10000 non-null int64
11 IsActiveMember 10000 non-null int64
12 EstimatedSalary 10000 non-null float64
13 Exited 10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
churn_df.nunique()
RowNumber 10000
CustomerId 10000
Surname 2932
CreditScore 460
Geography 3
Gender 2
Age 70
Tenure 11
Balance 6382
NumOfProducts 4
HasCrCard 2
IsActiveMember 2
EstimatedSalary 9999
Exited 2
dtype: int64
churn_df.isnull().sum()
RowNumber 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64
# Separate the binary target from the features, dropping identifier columns
# that carry no predictive signal
# (reconstructed from the X.head() output below; the original cell was not exported)
y = churn_df['Exited']
X = churn_df.drop(['RowNumber', 'CustomerId', 'Surname', 'Exited'], axis=1)
X.head()
CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
X.dtypes
CreditScore int64
Geography object
Gender object
Age int64
Tenure int64
Balance float64
NumOfProducts int64
HasCrCard int64
IsActiveMember int64
EstimatedSalary float64
dtype: object
# Numerical and categorical column lists
# (assumed reconstruction; the defining cell was not exported)
num_cols = X.select_dtypes(include='number').columns.tolist()
cat_cols = X.select_dtypes(include='object').columns.tolist()
Split dataset
from sklearn.model_selection import train_test_split

# Hold out part of the data for testing
# (test_size and random_state are assumed; the defining cell was not exported)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print('training data has ' + str(X_train.shape[0]) + ' observations with ' + str(X_train.shape[1]) + ' features')
print('test data has ' + str(X_test.shape[0]) + ' observations with ' + str(X_test.shape[1]) + ' features')
Read more about handling categorical features; there is also an excellent package dedicated to encoding.
X_train.head()
CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
from sklearn.preprocessing import OneHotEncoder

# One-hot encoding for Geography: the countries have no inherent order
categories = ['Geography']
enc_ohe = OneHotEncoder()
enc_ohe.fit(X_train[categories])
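The cell that applies the fitted encoder was not exported; below is a minimal reconstruction, inferred from the Geography_* columns in the output that follows (the helper name ohe_new_features is hypothetical):

def ohe_new_features(df, features_name, enc):
    # Transform the categorical column(s) and densify the sparse result
    new_feats = enc.transform(df[features_name]).toarray()
    new_cols = pd.DataFrame(new_feats, dtype=int,
                            columns=enc.get_feature_names_out(features_name),
                            index=df.index)
    # Append the one-hot columns and drop the original categorical column(s)
    return pd.concat([df.drop(features_name, axis=1), new_cols], axis=1)

X_train = ohe_new_features(X_train, categories, enc_ohe)
X_test = ohe_new_features(X_test, categories, enc_ohe)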
X_train.head()
CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_France Geography_Germany Geography_Spain
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
categories = ['Gender']
enc_oe = OrdinalEncoder()
enc_oe.fit(X_train[categories])
X_train[categories] = enc_oe.transform(X_train[categories])
X_test[categories] = enc_oe.transform(X_test[categories])
X_train.head()
CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_France Geography_Germany Geography_Spain
Standardize/Normalize Data
# Scale the data using standardization
# standardization: (x - mean) / std
# normalization: (x - x_min) / (x_max - x_min) -> [0, 1]
# Fit the StandardScaler on the training data to learn the mean and std,
# then apply those statistics to both the training and the test data.
# fit_transform fits and applies; transform only applies.
# We must not use any information from the test set, and the test data
# needs exactly the same modification as the training data.
# https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
# https://scikit-learn.org/stable/modules/preprocessing.html
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
X_train.head()
CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_France Geography_Germany Geography_Spain
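To make the two formulas in the comments concrete, here is a toy comparison of the two scalers (illustrative values only, not part of the project data):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

toy = np.array([[1.0], [2.0], [3.0], [4.0]])
# standardization: zero mean, unit variance
print(StandardScaler().fit_transform(toy).ravel())   # [-1.34 -0.45  0.45  1.34]
# normalization: squeezed into [0, 1]
print(MinMaxScaler().fit_transform(toy).ravel())     # [0.   0.33 0.67 1.  ]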
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Logistic Regression
classifier_logistic = LogisticRegression()

# K Nearest Neighbors
classifier_KNN = KNeighborsClassifier()

# Random Forest
classifier_RF = RandomForestClassifier()
# Train the logistic regression model and check its accuracy on the test set
# (the fit/score cell was not exported; reconstructed from the output)
classifier_logistic.fit(X_train, y_train)
classifier_logistic.score(X_test, y_test)
0.8088
(Optional) Part 3.2: Use Grid Search to Find Optimal Hyperparameters
An alternative is random search; see the sketch below.
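Instead of trying every combination, RandomizedSearchCV samples a fixed number of settings from the grid; a minimal sketch (grid values are assumptions, matching the grid search below):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

# Sample 5 random (penalty, C) settings instead of evaluating all 10
Random_LR = RandomizedSearchCV(
    LogisticRegression(solver='liblinear'),
    param_distributions={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]},
    n_iter=5, cv=5, random_state=42)
Random_LR.fit(X_train, y_train)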
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for logistic regression
# (the grid values are assumed; the defining cell was not exported)
parameters = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10, 100]
}
Grid_LR = GridSearchCV(LogisticRegression(solver='liblinear'), parameters, cv=5)
Grid_LR.fit(X_train, y_train)
# best model
best_LR_model = Grid_LR.best_estimator_
best_LR_model.predict(X_test)
best_LR_model.score(X_test, y_test)
0.8092
import seaborn as sns

# Heatmap of mean CV accuracy for each (penalty, C) combination
LR_models = pd.DataFrame(Grid_LR.cv_results_)
res = LR_models.pivot(index='param_penalty', columns='param_C', values='mean_test_score')
_ = sns.heatmap(res, cmap='viridis')
# Grid search for KNN over the number of neighbors
# (the grid values are assumed; the defining cell was not exported)
parameters = {
    'n_neighbors': [1, 3, 5, 7, 9]
}
Grid_KNN = GridSearchCV(KNeighborsClassifier(), parameters, cv=5)
Grid_KNN.fit(X_train, y_train)
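The helper print_grid_search_metrics used below comes from an unexported cell; a plausible reconstruction of what it prints:

def print_grid_search_metrics(gs):
    # Report the best CV score and the hyperparameters that achieved it
    print('Best score: ' + str(gs.best_score_))
    print('Best parameters set:')
    best_parameters = gs.best_params_
    for param_name in sorted(best_parameters.keys()):
        print(param_name + ': ' + str(best_parameters[param_name]))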
# best k
print_grid_search_metrics(Grid_KNN)
best_KNN_model = Grid_KNN.best_estimator_
best_KNN_model.predict(X_test)
best_KNN_model.score(X_test, y_test)
0.8428
# Possible hyperparameter options for Random Forest
# Choose the number of trees
parameters = {
'n_estimators' : [60,80,100],
'max_depth': [1,5,10]
}
Grid_RF = GridSearchCV(RandomForestClassifier(),parameters, cv=5)
Grid_RF.fit(X_train, y_train)
best_RF_model = Grid_RF.best_estimator_
best_RF_model
RandomForestClassifier(max_depth=10)
best_RF_model.score(X_test, y_test)
0.8596
Part 3.3: Model Evaluation - Confusion Matrix (Precision, Recall, Accuracy)
Precision (PPV, positive predictive value): tp / (tp + fp). The number of correctly predicted churn users divided by the total number of users predicted as churn. High precision means a low fp: not many retained users were predicted as churn users.

Recall (sensitivity, hit rate, true positive rate): tp / (tp + fn). How many of the actual churn users we predict correctly. High recall means a low fn: not many churn users were predicted as retained users.
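A quick sanity check of these definitions on tiny illustrative vectors (here tp=2, fp=1, fn=2; the values are made up for the example):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.666...
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.5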
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
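The helper draw_confusion_matrices comes from an unexported cell; below is a minimal reconstruction that prints the metrics shown in the output (the original presumably also plotted each matrix):

def draw_confusion_matrices(confusion_matrices):
    # Each matrix is 2x2 in sklearn's layout: [[tn, fp], [fn, tp]]
    for name, cm in confusion_matrices:
        tn, fp, fn, tp = cm.ravel()
        print(name)
        print('Accuracy is: ' + str((tp + tn) / (tp + tn + fp + fn)))
        print('precision is: ' + str(tp / (tp + fp)))
        print('recall is: ' + str(tp / (tp + fn)))
        print()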
# Confusion matrix, accuracy, precision and recall for random forest and logistic regression
confusion_matrices = [
("Random Forest", confusion_matrix(y_test,best_RF_model.predict(X_test))),
("Logistic Regression", confusion_matrix(y_test,best_LR_model.predict(X_test))),
("K nearest neighbor", confusion_matrix(y_test, best_KNN_model.predict(X_test)))
]
draw_confusion_matrices(confusion_matrices)
Random Forest
Accuracy is: 0.8596
precision is: 0.8038461538461539
recall is: 0.4106090373280943
Logistic Regression
Accuracy is: 0.8092
precision is: 0.5963855421686747
recall is: 0.1944990176817289
K nearest neighbor
Accuracy is: 0.8428
precision is: 0.7283464566929134
recall is: 0.36345776031434185
best_RF_model.predict_proba(X_test)
array([[0.71741715, 0.28258285],
[0.93310478, 0.06689522],
[0.73605511, 0.26394489],
...,
[0.8564215 , 0.1435785 ],
[0.91990008, 0.08009992],
[0.90302627, 0.09697373]])
from sklearn import metrics

# ROC curve for the random forest, using the positive-class probabilities
# (the roc_curve cell was not exported; reconstructed from the variable names)
fpr_rf, tpr_rf, _ = metrics.roc_curve(y_test, best_RF_model.predict_proba(X_test)[:, 1])

# AUC score
metrics.auc(fpr_rf, tpr_rf)
0.8501547730997742
best_LR_model.predict_proba(X_test)
array([[0.82435654, 0.17564346],
[0.93171599, 0.06828401],
[0.85520711, 0.14479289],
...,
[0.71449354, 0.28550646],
[0.89278295, 0.10721705],
[0.8556093 , 0.1443907 ]])
import matplotlib.pyplot as plt

# ROC curve inputs for logistic regression
# (the roc_curve cell was not exported; reconstructed from the variable names)
fpr_lr, tpr_lr, _ = metrics.roc_curve(y_test, best_LR_model.predict_proba(X_test)[:, 1])

# ROC Curve
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lr, tpr_lr, label='LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve - LR Model')
plt.legend(loc='best')
plt.show()
# AUC score
metrics.auc(fpr_lr,tpr_lr)
0.7722018237274021
X_with_corr = X.copy()
# Rank features by the magnitude of their logistic regression coefficients
# (LRmodel_l1 is an L1-penalized model from an unexported cell; the penalty
# strength C below is an assumption)
LRmodel_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
LRmodel_l1.fit(X_train, y_train)
indices = np.argsort(abs(LRmodel_l1.coef_[0]))[::-1]
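The ranking can then be printed, pairing each feature name with its weight:

# Print features in decreasing order of |coefficient|
for i in indices:
    print(X_train.columns[i], ':', LRmodel_l1.coef_[0][i])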