Supervised Learning Project.ipynb - Colab
For students with zero background, focus on understanding rather than mastery.
You are not expected to be able to write this code from scratch.
The thinking behind this code differs from the code in the algorithms course.
After class, run the code yourself and try your best to understand it.
Tools are ultimately there to assist us, so start by learning one well.
Gradually build up your Python debugging skills.
Contents
Part 1: Data Exploration
Part 2: Feature Preprocessing
Part 3: Model Training and Results Evaluation
from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# Authenticate and create a PyDrive client for Google Drive access
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
WARNING:root:pydrive is deprecated and no longer maintained. We recommend that you migrate your projects to pydrive2, the maintained fork of pydrive
# Download the dataset from Google Drive by its file ID
id = "1szdCZ98EK59cfJ4jG03g1HOv_OhC1oyN"
file = drive.CreateFile({'id':id})
file.GetContentFile('bank.data.csv')
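Given the deprecation warning above, the same download works with pydrive2, the maintained fork; the API mirrors pydrive, so only the imports change (a sketch, assuming pydrive2 is installed):

# pydrive2 is a drop-in fork of pydrive; the rest of the cell stays unchanged
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive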
import numpy as np
import pandas as pd
churn_df = pd.read_csv('bank.data.csv')
churn_df.head()
RowNumber CustomerId Surname CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Exited
churn_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RowNumber 10000 non-null int64
1 CustomerId 10000 non-null int64
2 Surname 10000 non-null object
3 CreditScore 10000 non-null int64
4 Geography 10000 non-null object
5 Gender 10000 non-null object
6 Age 10000 non-null int64
7 Tenure 10000 non-null int64
8 Balance 10000 non-null float64
9 NumOfProducts 10000 non-null int64
10 HasCrCard 10000 non-null int64
11 IsActiveMember 10000 non-null int64
12 EstimatedSalary 10000 non-null float64
13 Exited 10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB
churn_df.nunique()
RowNumber 10000
CustomerId 10000
Surname 2932
CreditScore 460
Geography 3
Gender 2
Age 70
Tenure 11
Balance 6382
NumOfProducts 4
HasCrCard 2
IsActiveMember 2
EstimatedSalary 9999
Exited 2
dtype: int64
churn_df.isnull().sum()
RowNumber 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64
# Separate the binary target from the features, dropping identifier columns
# that carry no predictive signal
# (reconstructed from the X.head() output below; the original cell was not exported)
y = churn_df['Exited']
X = churn_df.drop(['RowNumber', 'CustomerId', 'Surname', 'Exited'], axis=1)
X.head()
CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
X.dtypes
CreditScore int64
Geography object
Gender object
Age int64
Tenure int64
Balance float64
NumOfProducts int64
HasCrCard int64
IsActiveMember int64
EstimatedSalary float64
dtype: object
# Numerical and categorical column lists
# (assumed reconstruction; the defining cell was not exported)
num_cols = X.select_dtypes(include='number').columns.tolist()
cat_cols = X.select_dtypes(include='object').columns.tolist()
Split dataset
from sklearn.model_selection import train_test_split

# Hold out part of the data for testing
# (test_size and random_state are assumed; the defining cell was not exported)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print('training data has ' + str(X_train.shape[0]) + ' observations with ' + str(X_train.shape[1]) + ' features')
print('test data has ' + str(X_test.shape[0]) + ' observations with ' + str(X_test.shape[1]) + ' features')
Read more about handling categorical features; there is also an excellent package dedicated to encoding.
X_train.head()
CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary
from sklearn.preprocessing import OneHotEncoder

# One-hot encoding for Geography: the countries have no inherent order
categories = ['Geography']
enc_ohe = OneHotEncoder()
enc_ohe.fit(X_train[categories])
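The cell that applies the fitted encoder was not exported; below is a minimal reconstruction, inferred from the Geography_* columns in the output that follows (the helper name ohe_new_features is hypothetical):

def ohe_new_features(df, features_name, enc):
    # Transform the categorical column(s) and densify the sparse result
    new_feats = enc.transform(df[features_name]).toarray()
    new_cols = pd.DataFrame(new_feats, dtype=int,
                            columns=enc.get_feature_names_out(features_name),
                            index=df.index)
    # Append the one-hot columns and drop the original categorical column(s)
    return pd.concat([df.drop(features_name, axis=1), new_cols], axis=1)

X_train = ohe_new_features(X_train, categories, enc_ohe)
X_test = ohe_new_features(X_test, categories, enc_ohe)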
X_train.head()
CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_France Geography_Germany Geography_Spain
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
categories = ['Gender']
enc_oe = OrdinalEncoder()
enc_oe.fit(X_train[categories])
X_train[categories] = enc_oe.transform(X_train[categories])
X_test[categories] = enc_oe.transform(X_test[categories])
X_train.head()
CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_France Geography_Germany Geography_Spain
Standardize/Normalize Data
# Scale the data using standardization
# standardization: (x - mean) / std
# normalization: (x - x_min) / (x_max - x_min) -> [0, 1]
# Fit the StandardScaler on the training data to learn the mean and std,
# then apply those statistics to both the training and the test data.
# fit_transform fits and applies; transform only applies.
# We must not use any information from the test set, and the test data
# needs exactly the same modification as the training data.
# https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
# https://scikit-learn.org/stable/modules/preprocessing.html
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
X_train.head()
CreditScore Gender Age Tenure Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary Geography_France Geography_Germany Geography_Spain
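To make the two formulas in the comments concrete, here is a toy comparison of the two scalers (illustrative values only, not part of the project data):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

toy = np.array([[1.0], [2.0], [3.0], [4.0]])
# standardization: zero mean, unit variance
print(StandardScaler().fit_transform(toy).ravel())   # [-1.34 -0.45  0.45  1.34]
# normalization: squeezed into [0, 1]
print(MinMaxScaler().fit_transform(toy).ravel())     # [0.   0.33 0.67 1.  ]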
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Logistic Regression
classifier_logistic = LogisticRegression()

# K Nearest Neighbors
classifier_KNN = KNeighborsClassifier()

# Random Forest
classifier_RF = RandomForestClassifier()
# Train the logistic regression model and check its accuracy on the test set
# (the fit/score cell was not exported; reconstructed from the output)
classifier_logistic.fit(X_train, y_train)
classifier_logistic.score(X_test, y_test)
0.8088
(Optional) Part 3.2: Use Grid Search to Find Optimal Hyperparameters
An alternative is random search; see the sketch below.
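Instead of trying every combination, RandomizedSearchCV samples a fixed number of settings from the grid; a minimal sketch (grid values are assumptions, matching the grid search below):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

# Sample 5 random (penalty, C) settings instead of evaluating all 10
Random_LR = RandomizedSearchCV(
    LogisticRegression(solver='liblinear'),
    param_distributions={'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]},
    n_iter=5, cv=5, random_state=42)
Random_LR.fit(X_train, y_train)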
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for logistic regression
# (the grid values are assumed; the defining cell was not exported)
parameters = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10, 100]
}
Grid_LR = GridSearchCV(LogisticRegression(solver='liblinear'), parameters, cv=5)
Grid_LR.fit(X_train, y_train)
# best model
best_LR_model = Grid_LR.best_estimator_
best_LR_model.predict(X_test)
best_LR_model.score(X_test, y_test)
0.8092
import seaborn as sns

# Heatmap of mean CV accuracy for each (penalty, C) combination
LR_models = pd.DataFrame(Grid_LR.cv_results_)
res = LR_models.pivot(index='param_penalty', columns='param_C', values='mean_test_score')
_ = sns.heatmap(res, cmap='viridis')
# Grid search for KNN over the number of neighbors
# (the grid values are assumed; the defining cell was not exported)
parameters = {
    'n_neighbors': [1, 3, 5, 7, 9]
}
Grid_KNN = GridSearchCV(KNeighborsClassifier(), parameters, cv=5)
Grid_KNN.fit(X_train, y_train)
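The helper print_grid_search_metrics used below comes from an unexported cell; a plausible reconstruction of what it prints:

def print_grid_search_metrics(gs):
    # Report the best CV score and the hyperparameters that achieved it
    print('Best score: ' + str(gs.best_score_))
    print('Best parameters set:')
    best_parameters = gs.best_params_
    for param_name in sorted(best_parameters.keys()):
        print(param_name + ': ' + str(best_parameters[param_name]))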
# best k
print_grid_search_metrics(Grid_KNN)
best_KNN_model = Grid_KNN.best_estimator_
best_KNN_model.predict(X_test)
best_KNN_model.score(X_test, y_test)
0.8428
# Possible hyperparameter options for Random Forest
# Choose the number of trees
parameters = {
'n_estimators' : [60,80,100],
'max_depth': [1,5,10]
}
Grid_RF = GridSearchCV(RandomForestClassifier(),parameters, cv=5)
Grid_RF.fit(X_train, y_train)
best_RF_model = Grid_RF.best_estimator_
best_RF_model
RandomForestClassifier(max_depth=10)
best_RF_model.score(X_test, y_test)
0.8596
Part 3.3: Model Evaluation - Confusion Matrix (Precision, Recall, Accuracy)
Precision (PPV, positive predictive value): tp / (tp + fp). The number of correctly predicted churn users divided by the total number of users predicted as churn. High precision means a low fp: not many retained users were predicted as churn users.

Recall (sensitivity, hit rate, true positive rate): tp / (tp + fn). How many of the actual churn users we predict correctly. High recall means a low fn: not many churn users were predicted as retained users.
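A quick sanity check of these definitions on tiny illustrative vectors (here tp=2, fp=1, fn=2; the values are made up for the example):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
print(precision_score(y_true, y_pred))  # 2 / (2 + 1) = 0.666...
print(recall_score(y_true, y_pred))     # 2 / (2 + 2) = 0.5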
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
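The helper draw_confusion_matrices comes from an unexported cell; below is a minimal reconstruction that prints the metrics shown in the output (the original presumably also plotted each matrix):

def draw_confusion_matrices(confusion_matrices):
    # Each matrix is 2x2 in sklearn's layout: [[tn, fp], [fn, tp]]
    for name, cm in confusion_matrices:
        tn, fp, fn, tp = cm.ravel()
        print(name)
        print('Accuracy is: ' + str((tp + tn) / (tp + tn + fp + fn)))
        print('precision is: ' + str(tp / (tp + fp)))
        print('recall is: ' + str(tp / (tp + fn)))
        print()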
# Confusion matrix, accuracy, precision and recall for random forest and logistic regression
confusion_matrices = [
("Random Forest", confusion_matrix(y_test,best_RF_model.predict(X_test))),
("Logistic Regression", confusion_matrix(y_test,best_LR_model.predict(X_test))),
("K nearest neighbor", confusion_matrix(y_test, best_KNN_model.predict(X_test)))
]
draw_confusion_matrices(confusion_matrices)
Random Forest
Accuracy is: 0.8596
precision is: 0.8038461538461539
recall is: 0.4106090373280943
Logistic Regression
Accuracy is: 0.8092
precision is: 0.5963855421686747
recall is: 0.1944990176817289
K nearest neighbor
Accuracy is: 0.8428
precision is: 0.7283464566929134
recall is: 0.36345776031434185
best_RF_model.predict_proba(X_test)
array([[0.71741715, 0.28258285],
[0.93310478, 0.06689522],
[0.73605511, 0.26394489],
...,
[0.8564215 , 0.1435785 ],
[0.91990008, 0.08009992],
[0.90302627, 0.09697373]])
from sklearn import metrics

# ROC curve for the random forest, using the positive-class probabilities
# (the roc_curve cell was not exported; reconstructed from the variable names)
fpr_rf, tpr_rf, _ = metrics.roc_curve(y_test, best_RF_model.predict_proba(X_test)[:, 1])

# AUC score
metrics.auc(fpr_rf, tpr_rf)
0.8501547730997742
best_LR_model.predict_proba(X_test)
array([[0.82435654, 0.17564346],
[0.93171599, 0.06828401],
[0.85520711, 0.14479289],
...,
[0.71449354, 0.28550646],
[0.89278295, 0.10721705],
[0.8556093 , 0.1443907 ]])
import matplotlib.pyplot as plt

# ROC curve inputs for logistic regression
# (the roc_curve cell was not exported; reconstructed from the variable names)
fpr_lr, tpr_lr, _ = metrics.roc_curve(y_test, best_LR_model.predict_proba(X_test)[:, 1])

# ROC Curve
plt.figure(1)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_lr, tpr_lr, label='LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve - LR Model')
plt.legend(loc='best')
plt.show()
# AUC score
metrics.auc(fpr_lr,tpr_lr)
0.7722018237274021
X_with_corr = X.copy()
# Rank features by the magnitude of their logistic regression coefficients
# (LRmodel_l1 is an L1-penalized model from an unexported cell; the penalty
# strength C below is an assumption)
LRmodel_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
LRmodel_l1.fit(X_train, y_train)
indices = np.argsort(abs(LRmodel_l1.coef_[0]))[::-1]
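The ranking can then be printed, pairing each feature name with its weight:

# Print features in decreasing order of |coefficient|
for i in indices:
    print(X_train.columns[i], ':', LRmodel_l1.coef_[0][i])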