Group 2 TH

The document outlines a data analysis project using Python, specifically in a Kaggle environment, where datasets related to customer information are loaded and explored. It includes details about the training and testing datasets, their structure, and basic statistical descriptions. The analysis also involves visualizations to understand the distribution of various features and their relationship with customer exit rates.

Uploaded by

Đơn Đặng


ieochc5ud

December 11, 2024

[390]: # This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/city-u-10-f-fun-ai-final-project/10F_train.csv
/kaggle/input/city-u-10-f-fun-ai-final-project/10F_sample_submission.csv
/kaggle/input/city-u-10-f-fun-ai-final-project/10F_test.csv

[391]: import pandas as pd

# Load the data
train = pd.read_csv('/kaggle/input/city-u-10-f-fun-ai-final-project/10F_train.csv')
test = pd.read_csv('/kaggle/input/city-u-10-f-fun-ai-final-project/10F_test.csv')

[392]: train.head()

[392]:    id  CustomerId        Surname  CreditScore  Geography  Gender   Age  Tenure  \
       0   1    15749177  Okwudiliolisa          627     France    Male  33.0       1
       1   2    15694510          Hsueh          678     France    Male  40.0      10
       2   3    15741417            Kao          581     France    Male  34.0       2
       3   4    15766172      Chiemenam          716      Spain    Male  33.0       5
       4   5    15771669       Genovese          588    Germany    Male  36.0       4

            Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  \
       0       0.00              2          1               1         49503.50
       1       0.00              2          1               0        184866.69
       2  148882.54              1          1               1         84560.88
       3       0.00              2          1               1         15068.83
       4  131778.58              1          1               0        136024.31

          Exited
       0       0
       1       0
       2       0
       3       0
       4       1

[393]: train.describe()

[393]:                   id    CustomerId    CreditScore            Age  \
       count  132027.000000  1.320270e+05  132027.000000  132027.000000
       mean    82432.091133  1.569183e+07     656.783832      38.120996
       std     47705.125906  7.137972e+04      80.043164       8.869802
       min         1.000000  1.556570e+07     350.000000      18.000000
       25%     41083.500000  1.563290e+07     598.000000      32.000000
       50%     82435.000000  1.569013e+07     660.000000      37.000000
       75%    123790.500000  1.575662e+07     710.000000      42.000000
       max    165033.000000  1.581569e+07     850.000000      92.000000

                     Tenure        Balance  NumOfProducts      HasCrCard  \
       count  132027.000000  132027.000000  132027.000000  132027.000000
       mean        5.021821   55609.625464       1.554682       0.753838
       std         2.808487   62860.390849       0.547018       0.430776
       min         0.000000       0.000000       1.000000       0.000000
       25%         3.000000       0.000000       1.000000       1.000000
       50%         5.000000       0.000000       2.000000       1.000000
       75%         7.000000  120107.645000       2.000000       1.000000
       max        10.000000  250898.090000       4.000000       1.000000

              IsActiveMember  EstimatedSalary        Exited
       count   132027.000000    132027.000000  132027.00000
       mean         0.497300    112683.672952       0.21182
       std          0.499995     50275.570007       0.40860
       min          0.000000        11.580000       0.00000
       25%          0.000000     74835.650000       0.00000
       50%          0.000000    118024.100000       0.00000
       75%          1.000000    155616.750000       0.00000
       max          1.000000    199992.480000       1.00000

[394]: print(train.shape)

(132027, 14)

[395]: test.head()

[395]:    id  CustomerId         Surname  CreditScore  Geography  Gender  Age  Tenure  \
       0   0    15674932  Okwudilichukwu          668     France    Male   33       3
       1  12    15717962           Rossi          759      Spain    Male   71       9
       2  20    15781496       Udegbulam          773      Spain    Male   35       9
       3  22    15759913        Trentini          553    Germany  Female   43       9
       4  24    15626012       Obidimkpa          714     France    Male   26       6

            Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary
       0       0.00              2          1               0        181449.97
       1       0.00              1          1               1         93081.87
       2       0.00              2          0               1         87549.36
       3   85200.82              1          1               0        160574.09
       4  149879.66              2          1               1         50016.17

[396]: # Show an overview of the data

print(train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132027 entries, 0 to 132026
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 132027 non-null int64
1 CustomerId 132027 non-null int64
2 Surname 132027 non-null object
3 CreditScore 132027 non-null int64
4 Geography 132027 non-null object
5 Gender 132027 non-null object
6 Age 132027 non-null float64
7 Tenure 132027 non-null int64
8 Balance 132027 non-null float64
9 NumOfProducts 132027 non-null int64
10 HasCrCard 132027 non-null int64
11 IsActiveMember 132027 non-null int64
12 EstimatedSalary 132027 non-null float64
13 Exited 132027 non-null int64

dtypes: float64(3), int64(8), object(3)
memory usage: 14.1+ MB
None

[397]: print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33007 entries, 0 to 33006
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 33007 non-null int64
1 CustomerId 33007 non-null int64
2 Surname 33007 non-null object
3 CreditScore 33007 non-null int64
4 Geography 33007 non-null object
5 Gender 33007 non-null object
6 Age 33007 non-null int64
7 Tenure 33007 non-null int64
8 Balance 33007 non-null float64
9 NumOfProducts 33007 non-null int64
10 HasCrCard 33007 non-null int64
11 IsActiveMember 33007 non-null int64
12 EstimatedSalary 33007 non-null float64
dtypes: float64(2), int64(8), object(3)
memory usage: 3.3+ MB
None

[398]: # Check for missing values

print(train.isnull().sum())

id 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64

[399]: print(test.isnull().sum())

id 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
dtype: int64

[400]: # Inspect the distribution of the Exited variable

print(train['Exited'].value_counts())

Exited
0 104061
1 27966
Name: count, dtype: int64
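
The counts above show an imbalanced target. As a minimal sketch in plain Python, with the two counts copied by hand from the output and with illustrative variable names that are not part of the notebook, the churn rate and the accuracy of a trivial "always predict 0" baseline can be worked out like this:

```python
# Counts copied from the value_counts() output above.
counts = {0: 104061, 1: 27966}

total = sum(counts.values())
churn_rate = counts[1] / total                     # share of customers who exited
majority_baseline = max(counts.values()) / total   # accuracy of always predicting the majority class

print(f"rows: {total}")
print(f"churn rate: {churn_rate:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Accuracy figures for the classifiers are easier to judge against this baseline than in isolation.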

[401]: import seaborn as sns

import matplotlib.pyplot as plt

[402]: # Plot the class distribution

sns.countplot(x='Exited', data=train)
plt.show()

[403]: cols = ['Gender', 'Geography', 'HasCrCard', 'IsActiveMember']

n_rows = 2
n_cols = 3

fig, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols * 3.5, n_rows * 3.5))

for r in range(n_rows):
    for c in range(n_cols):
        i = r * n_cols + c  # index into the list "cols"
        ax_i = ax[r, c]
        if i < len(cols):
            sns.countplot(data=train, x=cols[i], hue="Exited",
                          palette="Blues", ax=ax_i)
            ax_i.set_title(f"Figure {i+1}: Exited Rate vs {cols[i]}")
            ax_i.legend(title='', loc='upper right',
                        labels=['Not Exited', 'Exited'])
        else:
            ax_i.set_visible(False)  # hide the unused subplots

plt.tight_layout()
plt.show()

[404]: sns.histplot(data=train, x='Age', hue= 'Exited', bins = 40, kde=True)

/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning:
use_inf_as_na option is deprecated and will be removed in a future version.
Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1075: FutureWarning:
When grouping with a length-1 list-like, you will need to pass a length-1 tuple
to get_group in a future version of pandas. Pass `(name,)` instead of `name` to
silence this warning.
data_subset = grouped_data.get_group(pd_key)
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1075: FutureWarning:
When grouping with a length-1 list-like, you will need to pass a length-1 tuple
to get_group in a future version of pandas. Pass `(name,)` instead of `name` to
silence this warning.
data_subset = grouped_data.get_group(pd_key)

[404]: <Axes: xlabel='Age', ylabel='Count'>

[405]: sns.histplot(data=train, x='CreditScore', hue='Exited', bins=40)

[405]: <Axes: xlabel='CreditScore', ylabel='Count'>

[406]: sns.histplot(data=train, x='Tenure', hue='Exited', bins=40)

[406]: <Axes: xlabel='Tenure', ylabel='Count'>

[407]: sns.histplot(data=train, x='Balance', hue='Exited', bins=40)

[407]: <Axes: xlabel='Balance', ylabel='Count'>

[408]: sns.histplot(data=train, x='EstimatedSalary', hue='Exited', bins=40)

[408]: <Axes: xlabel='EstimatedSalary', ylabel='Count'>

[409]: sns.histplot(data=train, x='NumOfProducts', hue='Exited', bins=40)

[409]: <Axes: xlabel='NumOfProducts', ylabel='Count'>

[410]: # Drop the columns that are not needed
X = train.drop(['id', 'CustomerId', 'Surname', 'Exited'], axis=1)
y = train['Exited']
X_test = test.drop(['id', 'CustomerId', 'Surname'], axis=1)

# Identify the categorical and numerical columns.
# HasCrCard and EstimatedSalary appear in neither list, so the
# ColumnTransformer in cell [412] drops them (remainder='drop' is its default).
categorical_features = ['Geography', 'Gender', 'IsActiveMember']
numerical_features = ['CreditScore', 'Age', 'Tenure', 'Balance',
                      'NumOfProducts']

[411]: print(categorical_features)
print(numerical_features)

['Geography', 'Gender', 'IsActiveMember']

['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts']

[412]: from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Pipeline for the categorical variables
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Pipeline for the numerical variables
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Combine the preprocessing of both variable types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])
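
To make the two preprocessing steps concrete, here is a small plain-Python sketch of what they compute, using the five CreditScore and Geography values from the train.head() output. It imitates rather than calls StandardScaler (which divides by the population standard deviation) and OneHotEncoder (which sorts the categories by default); all variable names are illustrative and not part of the notebook.

```python
from statistics import mean, pstdev

# First five rows of two columns, copied from train.head().
credit_score = [627, 678, 581, 716, 588]
geography = ["France", "France", "France", "Spain", "Germany"]

# Standardisation: subtract the mean, divide by the population standard deviation.
mu, sigma = mean(credit_score), pstdev(credit_score)
scaled = [(x - mu) / sigma for x in credit_score]

# One-hot encoding: one 0/1 column per category, categories in sorted order.
categories = sorted(set(geography))
one_hot = [[1 if g == cat else 0 for cat in categories] for g in geography]

print("scaled CreditScore:", [round(z, 3) for z in scaled])
print("categories:", categories)
for g, row in zip(geography, one_hot):
    print(f"{g:<8} -> {row}")
```

In the real pipeline the mean, standard deviation and category list are learned from the training split only and then reused on the validation and test data.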

[413]: # Split into training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42, stratify=y)

print("Training set size:", X_train.shape)
print("Validation set size:", X_val.shape)

Training set size: (105621, 10)
Validation set size: (26406, 10)
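
The two sizes can be checked by hand. The sketch below is plain arithmetic, not a call into scikit-learn: it assumes the row count from train.shape, the exit count from value_counts(), and the class-1 support of 5593 that the classification reports show for the validation set. The variable names are illustrative.

```python
import math

n_rows = 132027      # rows in train, from train.shape
test_size = 0.2

# For a fractional test_size, scikit-learn rounds the test share up.
n_val = math.ceil(n_rows * test_size)
n_train = n_rows - n_val

overall_rate = 27966 / n_rows   # exits in the full training data
val_rate = 5593 / n_val         # exits among the validation rows

print(n_train, n_val)
print(f"overall exit rate:    {overall_rate:.4f}")
print(f"validation exit rate: {val_rate:.4f}")
```

The two rates agree to about four decimal places, which is what stratify=y is meant to achieve.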

[414]: from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Build the pipeline for KNN
knn_model = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', KNeighborsClassifier(n_neighbors=5))])

# Train the model
knn_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_knn = knn_model.predict(X_val)

# Evaluate the model
print("KNN Accuracy:", accuracy_score(y_val, y_pred_knn))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_knn))
print("Classification Report:\n", classification_report(y_val, y_pred_knn))

KNN Accuracy: 0.8485950162841779
Confusion Matrix:
 [[19366  1447]
 [ 2551  3042]]
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.93      0.91     20813
           1       0.68      0.54      0.60      5593

    accuracy                           0.85     26406
   macro avg       0.78      0.74      0.75     26406
weighted avg       0.84      0.85      0.84     26406
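
The class-1 row of the report follows directly from the confusion matrix. As a minimal sketch in plain Python, with the matrix copied by hand from the KNN output and illustrative variable names, the figures can be reproduced as follows:

```python
# Confusion matrix copied from the KNN output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = [[19366, 1447],
      [2551, 3042]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the predicted exits, how many really exited
recall = tp / (tp + fn)      # of the real exits, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The same arithmetic applies to the confusion matrices of the other models; the low class-1 recall is what the headline accuracy hides.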

[415]: # KNN test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(knn_model, X_train, y_train, cv=5,
                            scoring='accuracy')

# Print the per-fold results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Refit the model on the full training split and evaluate on the validation set
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_val)

# Evaluate the model
print("KNN Accuracy on Validation Set:", accuracy_score(y_val, y_pred_knn))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_knn))
print("Classification Report:\n", classification_report(y_val, y_pred_knn))

Cross-validation scores: [0.85031953 0.8500284  0.84903427 0.84605188 0.84557849]
Mean cross-validation accuracy: 0.848202515437165
KNN Accuracy on Validation Set: 0.8485950162841779
Confusion Matrix:
 [[19366  1447]
 [ 2551  3042]]
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.93      0.91     20813
           1       0.68      0.54      0.60      5593

    accuracy                           0.85     26406
   macro avg       0.78      0.74      0.75     26406
weighted avg       0.84      0.85      0.84     26406
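
The mean alone says nothing about how much the folds disagree. A minimal sketch, assuming the five KNN fold scores copied by hand from the output above and using only the standard library, adds the spread:

```python
from statistics import mean, stdev

# Fold scores copied from the KNN cross-validation output above.
fold_scores = [0.85031953, 0.8500284, 0.84903427, 0.84605188, 0.84557849]

fold_mean = mean(fold_scores)
fold_std = stdev(fold_scores)   # sample standard deviation across the 5 folds

print(f"mean = {fold_mean:.4f}")
print(f"std  = {fold_std:.4f}")
```

A spread this small relative to the gaps between models suggests the ranking is not an artefact of one lucky split, although that reading is a judgment call rather than a formal test.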

[416]: from sklearn.naive_bayes import GaussianNB

# Build the pipeline for Naive Bayes
nb_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GaussianNB())])

# Train the model
nb_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_nb = nb_model.predict(X_val)

# Evaluate the model
print("Naive Bayes Accuracy:", accuracy_score(y_val, y_pred_nb))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_nb))
print("Classification Report:\n", classification_report(y_val, y_pred_nb))

Naive Bayes Accuracy: 0.80807392259335
Confusion Matrix:
 [[18501  2312]
 [ 2756  2837]]
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.89      0.88     20813
           1       0.55      0.51      0.53      5593

    accuracy                           0.81     26406
   macro avg       0.71      0.70      0.70     26406
weighted avg       0.80      0.81      0.81     26406

[417]: # Naive Bayes test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(nb_model, X_train, y_train, cv=5,
                            scoring='accuracy')

# Print the per-fold results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Refit the model on the full training split and evaluate on the validation set
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_val)

# Evaluate the model
print("Naive Bayes Accuracy on Validation Set:", accuracy_score(y_val, y_pred_nb))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_nb))
print("Classification Report:\n", classification_report(y_val, y_pred_nb))

Cross-validation scores: [0.80426036 0.80444045 0.80818027 0.80604999 0.81035789]
Mean cross-validation accuracy: 0.8066577896198159
Naive Bayes Accuracy on Validation Set: 0.80807392259335
Confusion Matrix:
 [[18501  2312]
 [ 2756  2837]]
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.89      0.88     20813
           1       0.55      0.51      0.53      5593

    accuracy                           0.81     26406
   macro avg       0.71      0.70      0.70     26406
weighted avg       0.80      0.81      0.81     26406

[418]: from sklearn.linear_model import LogisticRegression

# Build the pipeline for Logistic Regression
lr_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))])

# Train the model
lr_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_lr = lr_model.predict(X_val)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_val, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_lr))
print("Classification Report:\n", classification_report(y_val, y_pred_lr))

Logistic Regression Accuracy: 0.8328410209800803
Confusion Matrix:
 [[19887   926]
 [ 3488  2105]]
Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.96      0.90     20813
           1       0.69      0.38      0.49      5593

    accuracy                           0.83     26406
   macro avg       0.77      0.67      0.69     26406
weighted avg       0.82      0.83      0.81     26406

[419]: # Logistic Regression test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(lr_model, X_train, y_train, cv=5,
                            scoring='accuracy')

# Print the per-fold results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Refit the model on the full training split and evaluate on the validation set
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_val)

# Evaluate the model
print("Logistic Regression Accuracy on Validation Set:", accuracy_score(y_val, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_lr))
print("Classification Report:\n", classification_report(y_val, y_pred_lr))

Cross-validation scores: [0.83427219 0.83132929 0.83274948 0.834217   0.83724673]
Mean cross-validation accuracy: 0.8339629400474402
Logistic Regression Accuracy on Validation Set: 0.8328410209800803
Confusion Matrix:
 [[19887   926]
 [ 3488  2105]]
Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.96      0.90     20813
           1       0.69      0.38      0.49      5593

    accuracy                           0.83     26406
   macro avg       0.77      0.67      0.69     26406
weighted avg       0.82      0.83      0.81     26406

[420]: from sklearn.tree import DecisionTreeClassifier

# Build the pipeline for Decision Tree
dt_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42, max_depth=10))])

# Train the model
dt_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_dt = dt_model.predict(X_val)

# Evaluate the model
print("Decision Tree Accuracy:", accuracy_score(y_val, y_pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_dt))
print("Classification Report:\n", classification_report(y_val, y_pred_dt))

Decision Tree Accuracy: 0.860978565477543
Confusion Matrix:
 [[19580  1233]
 [ 2438  3155]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.94      0.91     20813
           1       0.72      0.56      0.63      5593

    accuracy                           0.86     26406
   macro avg       0.80      0.75      0.77     26406
weighted avg       0.85      0.86      0.85     26406

[421]: # Decision Tree test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(dt_model, X_train, y_train, cv=5,
                            scoring='accuracy')

# Print the per-fold results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Refit the model on the full training split and evaluate on the validation set
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_val)

# Evaluate the model
print("Decision Tree Accuracy on Validation Set:", accuracy_score(y_val, y_pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_dt))
print("Classification Report:\n", classification_report(y_val, y_pred_dt))

Cross-validation scores: [0.85921893 0.85892823 0.85646658 0.85826548 0.85599318]
Mean cross-validation accuracy: 0.8577744819263879
Decision Tree Accuracy on Validation Set: 0.860978565477543
Confusion Matrix:
 [[19580  1233]
 [ 2438  3155]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.94      0.91     20813
           1       0.72      0.56      0.63      5593

    accuracy                           0.86     26406
   macro avg       0.80      0.75      0.77     26406
weighted avg       0.85      0.86      0.85     26406

[422]: # Random Forest test
from sklearn.ensemble import RandomForestClassifier

# Build the pipeline for Random Forest
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, n_estimators=100,
                                          max_depth=10))])

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_rf = rf_model.predict(X_val)

# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_val, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_rf))
print("Classification Report:\n", classification_report(y_val, y_pred_rf))

Random Forest Accuracy: 0.8629856850715747
Confusion Matrix:
 [[19879   934]
 [ 2684  2909]]
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.96      0.92     20813
           1       0.76      0.52      0.62      5593

    accuracy                           0.86     26406
   macro avg       0.82      0.74      0.77     26406
weighted avg       0.85      0.86      0.85     26406

[423]: # Random Forest test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5,
                            scoring='accuracy')

# Print the per-fold results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Refit the model on the full training split and evaluate on the validation set
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_val)  # store in y_pred_rf so y_pred_nb is not overwritten

# Evaluate the model
print("Random Forest Accuracy on Validation Set:", accuracy_score(y_val, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_rf))
print("Classification Report:\n", classification_report(y_val, y_pred_rf))

Cross-validation scores: [0.86338462 0.86295209 0.86209998 0.86219466 0.86318879]
Mean cross-validation accuracy: 0.8627640277919392
Random Forest Accuracy on Validation Set: 0.8629856850715747
Confusion Matrix:
 [[19879   934]
 [ 2684  2909]]
Classification Report:
              precision    recall  f1-score   support

           0       0.88      0.96      0.92     20813
           1       0.76      0.52      0.62      5593

    accuracy                           0.86     26406
   macro avg       0.82      0.74      0.77     26406
weighted avg       0.85      0.86      0.85     26406

[424]: from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Build the pipeline for XGBoost.
# The target is binary, so the evaluation metric is 'logloss' rather than the
# multi-class 'mlogloss'; the deprecated use_label_encoder argument is omitted.
xgb_model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing (if any)
    ('classifier', XGBClassifier(random_state=42,
                                 eval_metric='logloss'))])  # Instantiate the XGBoost model

# Train the model
xgb_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_xgb = xgb_model.predict(X_val)

# Evaluate the model
print("XGBoost Accuracy:", accuracy_score(y_val, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_xgb))
print("Classification Report:\n", classification_report(y_val, y_pred_xgb))

XGBoost Accuracy: 0.865295766113762
Confusion Matrix:
 [[19702  1111]
 [ 2446  3147]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.95      0.92     20813
           1       0.74      0.56      0.64      5593

    accuracy                           0.87     26406
   macro avg       0.81      0.75      0.78     26406
weighted avg       0.86      0.87      0.86     26406

[425]: # XGBClassifier test with cross-validation
from sklearn.model_selection import cross_val_score
# Cross-validate xgb_model. The recorded run passed nb_model here, so the fold
# scores printed below repeat the Naive Bayes scores from cell [417].
cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=5,
                            scoring='accuracy')

# Print the per-fold results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Refit the model on the full training split and evaluate on the validation set
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_val)  # store in y_pred_xgb so y_pred_nb is not overwritten

# Evaluate the model
print("XGBoost Accuracy on Validation Set:", accuracy_score(y_val, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_xgb))
print("Classification Report:\n", classification_report(y_val, y_pred_xgb))

Cross-validation scores: [0.80426036 0.80444045 0.80818027 0.80604999 0.81035789]
Mean cross-validation accuracy: 0.8066577896198159
XGBoost Accuracy on Validation Set: 0.865295766113762
Confusion Matrix:
 [[19702  1111]
 [ 2446  3147]]
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.95      0.92     20813
           1       0.74      0.56      0.64      5593

    accuracy                           0.87     26406
   macro avg       0.81      0.75      0.78     26406
weighted avg       0.86      0.87      0.86     26406

[ ]: pip install lightgbm

Requirement already satisfied: lightgbm in /opt/conda/lib/python3.10/site-packages (4.2.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from lightgbm) (1.26.4)
Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from lightgbm) (1.14.1)

[ ]: # LGBMClassifier test
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Build the pipeline for LightGBM
lgbm_model = Pipeline(steps=[
    ('preprocessor', preprocessor),                    # Preprocessing (if any)
    ('classifier', LGBMClassifier(random_state=42))])  # Instantiate the LightGBM model

# Train the model
lgbm_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_lgbm = lgbm_model.predict(X_val)

# Evaluate the model
print("LightGBM Accuracy:", accuracy_score(y_val, y_pred_lgbm))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_lgbm))
print("Classification Report:\n", classification_report(y_val, y_pred_lgbm))

[ ]: # LightGBM test with cross-validation
from sklearn.model_selection import cross_val_score
# Cross-validate lgbm_model (the original passed nb_model here)
cv_scores = cross_val_score(lgbm_model, X_train, y_train, cv=5,
                            scoring='accuracy')

# Print the per-fold results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Refit the model on the full training split and evaluate on the validation set
lgbm_model.fit(X_train, y_train)
y_pred_lgbm = lgbm_model.predict(X_val)

# Evaluate the model
print("LightGBM Accuracy on Validation Set:", accuracy_score(y_val, y_pred_lgbm))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_lgbm))
print("Classification Report:\n", classification_report(y_val, y_pred_lgbm))

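[ ]: models = {
    "KNN": (knn_model, y_pred_knn),
    "Naive Bayes": (nb_model, y_pred_nb),
    "Logistic Regression": (lr_model, y_pred_lr),
    "Decision Tree": (dt_model, y_pred_dt),
    "Random Forest": (rf_model, y_pred_rf),
    "XGBClassifier": (xgb_model, y_pred_xgb),
    "LGBMClassifier": (lgbm_model, y_pred_lgbm)
}

for name, (model, preds) in models.items():
    # Recompute the cross-validation scores per model; reusing the single
    # cv_scores variable would print the last model's scores for every entry.
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name} Model")
    print("Accuracy:", accuracy_score(y_val, preds))
    print(f"Cross-validation scores: {scores}")
    print(f"Mean cross-validation accuracy: {scores.mean()}")
    print("-" * 40)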
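
The comparison cell has no recorded output, so as a rough sketch the validation accuracies that the earlier cells did print can be ranked by hand. The numbers below are copied from those outputs; LightGBM is left out because its cells were not executed, and the dictionary name is illustrative.

```python
# Validation accuracies copied from the recorded outputs of cells [414] to [424].
val_accuracy = {
    "KNN": 0.8485950162841779,
    "Naive Bayes": 0.80807392259335,
    "Logistic Regression": 0.8328410209800803,
    "Decision Tree": 0.860978565477543,
    "Random Forest": 0.8629856850715747,
    "XGBClassifier": 0.865295766113762,
}

# Sort from best to worst validation accuracy.
ranking = sorted(val_accuracy.items(), key=lambda item: item[1], reverse=True)

for name, acc in ranking:
    print(f"{name:<20} {acc:.4f}")
```

The top three models sit within about half a percentage point of each other, so accuracy alone hardly separates them; the class-1 recall in each classification report is a useful second criterion.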

[ ]: # Predict on the test set
y_pred_test = lgbm_model.predict(X_test)

# Build a DataFrame with the predictions
submission = pd.DataFrame({
    'id': test['id'],        # id column taken from the test data
    'Exited': y_pred_test    # predicted labels
})

# Make sure the column names are correct
submission.columns = ['id', 'Exited']

# Save the result to a CSV file (written to the Kaggle working directory)
submission.to_csv('submission.csv', index=False)

# Preview the submission file
submission.head()

[ ]: # Preview the contents of the submission file

submission.head()
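
Before uploading, the file can be sanity-checked. The sketch below is a hypothetical helper, not part of the notebook: `check_submission` is an invented name, it works on CSV text with the standard library only, and it assumes the competition expects hard 0/1 labels in an `Exited` column, which is what `predict()` produces above but is not confirmed by the source.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Hypothetical helper: return a list of problems found in a submission CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != ["id", "Exited"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if any(row["Exited"] not in {"0", "1"} for row in rows):
        problems.append("Exited must be 0 or 1")
    if len({row["id"] for row in rows}) != len(rows):
        problems.append("duplicate ids")
    return problems

# Toy input with three rows; the real file should have one row per test record.
sample = "id,Exited\n0,0\n12,1\n20,0\n"
print(check_submission(sample, expected_rows=3))
```

For the real file, the text would come from reading `submission.csv` and `expected_rows` would be 33007, the number of entries reported by `test.info()`.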

Delhivery Mani
No ratings yet
Delhivery Mani
79 pages
HA250 EN Col18
100% (1)
HA250 EN Col18
117 pages
Design of A Reconfigurable Li-Ion Battery Management System (BMS)
No ratings yet
Design of A Reconfigurable Li-Ion Battery Management System (BMS)
7 pages
Data Analyzer
No ratings yet
Data Analyzer
10 pages
#Group: B (ML) : Numpy NP Pandas PD
No ratings yet
#Group: B (ML) : Numpy NP Pandas PD
9 pages
Geo Python Doc (1) 7,8 Bavesh
No ratings yet
Geo Python Doc (1) 7,8 Bavesh
9 pages
Exp3 Python
No ratings yet
Exp3 Python
15 pages
Data Preparation Project
No ratings yet
Data Preparation Project
23 pages
Python For Machine Learning
No ratings yet
Python For Machine Learning
66 pages
Complete Case Analysis (CCA) : Advantages
No ratings yet
Complete Case Analysis (CCA) : Advantages
6 pages
Data Visualization EDA-print
No ratings yet
Data Visualization EDA-print
18 pages
Malicious Coding
No ratings yet
Malicious Coding
4 pages
Ip Practical File
No ratings yet
Ip Practical File
20 pages
Data Cleaning
No ratings yet
Data Cleaning
13 pages
Rimjhim
No ratings yet
Rimjhim
21 pages
Ass 1 ML
No ratings yet
Ass 1 ML
21 pages
ML LAB Manual-1
No ratings yet
ML LAB Manual-1
33 pages
06 Seaborn
No ratings yet
06 Seaborn
13 pages
Creation of Series Using List, Dictionary & Ndarray
No ratings yet
Creation of Series Using List, Dictionary & Ndarray
65 pages
Class 12 Practical File Informatics Practices
No ratings yet
Class 12 Practical File Informatics Practices
28 pages
Analysis and Prediction of House Prices by Linear Regression Model
No ratings yet
Analysis and Prediction of House Prices by Linear Regression Model
91 pages
Pandas+With+Python+ +DATAhill+Solutions
No ratings yet
Pandas+With+Python+ +DATAhill+Solutions
24 pages
Practical File Class Xii
No ratings yet
Practical File Class Xii
25 pages
Step-by-Step Explanation of Python Data Preprocessing Script
No ratings yet
Step-by-Step Explanation of Python Data Preprocessing Script
9 pages
ML Projects
No ratings yet
ML Projects
22 pages
Practical 3
No ratings yet
Practical 3
8 pages
Data Science With Python
No ratings yet
Data Science With Python
12 pages
EDS - Python Cheat Sheet
0% (1)
EDS - Python Cheat Sheet
3 pages
Overview of Data Cleaning
No ratings yet
Overview of Data Cleaning
17 pages
EDA Plots Code
No ratings yet
EDA Plots Code
13 pages
Programs of Python Pandas
No ratings yet
Programs of Python Pandas
15 pages
Sales Data Clustering
No ratings yet
Sales Data Clustering
15 pages
Datascience PR 6 Veda
No ratings yet
Datascience PR 6 Veda
6 pages
Credit Card 1679991215
No ratings yet
Credit Card 1679991215
26 pages
Unit 6 Pyspark - MLlib
No ratings yet
Unit 6 Pyspark - MLlib
6 pages
Pyspark MLlib
No ratings yet
Pyspark MLlib
4 pages
Python Pandas II Notes XII
No ratings yet
Python Pandas II Notes XII
20 pages
West Rox
No ratings yet
West Rox
29 pages
List of Practical Ip065 Xii Session 2025 CKC Academy
No ratings yet
List of Practical Ip065 Xii Session 2025 CKC Academy
19 pages
Pandas Cheat Sheet
No ratings yet
Pandas Cheat Sheet
17 pages
4 PythonPandas
No ratings yet
4 PythonPandas
8 pages
Document (4) - 1
No ratings yet
Document (4) - 1
15 pages
Experiment 2 FDL - Jupyter Notebook
No ratings yet
Experiment 2 FDL - Jupyter Notebook
2 pages
Chapter 2 - Python Pandas II
No ratings yet
Chapter 2 - Python Pandas II
71 pages
EDA Diwali Sale Analysis Project
No ratings yet
EDA Diwali Sale Analysis Project
11 pages
List of Practical Ip065 Xii Session 2025 CKC Academy
No ratings yet
List of Practical Ip065 Xii Session 2025 CKC Academy
19 pages
Panda
No ratings yet
Panda
33 pages
5) Randomforest - Ipynb - Colaboratory
No ratings yet
5) Randomforest - Ipynb - Colaboratory
12 pages
188 Code Tugas 1
No ratings yet
188 Code Tugas 1
18 pages
K Means
No ratings yet
K Means
15 pages
Copy of Final Project
No ratings yet
Copy of Final Project
16 pages
Reading Data: #Importing Required Libraries
No ratings yet
Reading Data: #Importing Required Libraries
16 pages
EDA - Session-2 - Data Frame Basics-2
No ratings yet
EDA - Session-2 - Data Frame Basics-2
11 pages



[391]: import pandas as pd

Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary \

[393]: id CustomerId CreditScore Age \

Tenure Balance NumOfProducts HasCrCard \

IsActiveMember EstimatedSalary Exited

[395]: id CustomerId Surname CreditScore Geography Gender Age Tenure \

Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary

[396]: # Display an overview of the data

[398]: # Check for missing data
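The missing-data check can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (`toy` and its values are assumptions, not the Kaggle `train` data), using `isnull().sum()` to count missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the notebook itself works on the Kaggle `train` DataFrame.
toy = pd.DataFrame({
    "CreditScore": [619, 608, np.nan, 699],
    "Geography": ["France", "Spain", None, "France"],
    "Exited": [1, 0, 1, 0],
})

# Number of missing values in each column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

A column with a non-zero count would need imputation or row removal before modeling.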

[400]: # View the distribution of the Exited variable
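One way to inspect the target distribution is `value_counts`, sketched here on a hypothetical eight-row Series (the values are assumptions). For reference, the validation reports in this notebook list 20813 of 26406 rows in class 0, roughly 79%, so the classes are imbalanced.

```python
import pandas as pd

# Hypothetical stand-in for train["Exited"].
exited = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Exited")

counts = exited.value_counts()                # absolute counts per class
shares = exited.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```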

[401]: import seaborn as sns

[402]: # Plot the charts for the categorical variables

fig, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols*3.5, n_rows*3.5))

for r in range(0, n_rows):

ax_i.set_title(f"Figure {i+1}: Exited Rate vs {cols[i]}")

ax.flat[-2].set_visible(False) #Remove the last subplot
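The grid of "Exited Rate vs column" charts could be built along these lines. This is a sketch under assumptions: the toy data, the bar-chart style, and the two-column layout are hypothetical, and only the `plt.subplots` call and the title format come from the notebook.

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the Kaggle `train` DataFrame.
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France", "Germany", "Spain"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "IsActiveMember": [1, 0, 1, 0, 0, 1],
    "Exited": [0, 1, 1, 0, 1, 0],
})
cols = ["Geography", "Gender", "IsActiveMember"]
n_cols = 2  # assumed layout
n_rows = math.ceil(len(cols) / n_cols)

fig, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols * 3.5, n_rows * 3.5))
for i, col in enumerate(cols):
    ax_i = ax.flat[i]
    # Mean of the 0/1 target per category is the exit rate for that category.
    rate = toy.groupby(col)["Exited"].mean()
    ax_i.bar(rate.index.astype(str), rate.values)
    ax_i.set_title(f"Figure {i+1}: Exited Rate vs {cols[i]}")

# Hide any grid cells that have no chart.
for unused in ax.flat[len(cols):]:
    unused.set_visible(False)
fig.tight_layout()
```

Hiding the unused axes by slicing `ax.flat[len(cols):]` avoids hard-coding which subplot is empty.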

[404]: <Axes: xlabel='Age', ylabel='Count'>

[406]: sns.histplot(data=train, x='Tenure', hue='Exited', bins=40)

[406]: <Axes: xlabel='Tenure', ylabel='Count'>

[407]: sns.histplot(data=train, x='Balance', hue='Exited', bins=40)

[407]: <Axes: xlabel='Balance', ylabel='Count'>

[408]: <Axes: xlabel='EstimatedSalary', ylabel='Count'>

[410]: # Drop unnecessary columns

# Identify the categorical and numerical columns

[411]: print(categorical_features)

['Geography', 'Gender', 'IsActiveMember']

[412]: from sklearn.pipeline import Pipeline

# Pipeline for numerical variables

# Combine the preprocessing for both variable types
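A preprocessing step of this shape is commonly written with `ColumnTransformer`. The sketch below is an assumption about how the two branches might be combined: the categorical list matches the notebook's printed output, while the numerical column subset, `StandardScaler`, `OneHotEncoder`, and the toy rows are hypothetical choices for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = ["Geography", "Gender", "IsActiveMember"]  # as printed in the notebook
numerical_features = ["CreditScore", "Age", "Balance"]            # assumed subset

# Pipeline for numerical variables: scale to zero mean and unit variance.
numeric_pipeline = Pipeline([("scaler", StandardScaler())])

# Pipeline for categorical variables: one-hot encode each category.
categorical_pipeline = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Apply each pipeline to its own columns and concatenate the results.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numerical_features),
    ("cat", categorical_pipeline, categorical_features),
])

# Hypothetical rows for demonstration only.
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Age": [42, 41, 39, 50],
    "Balance": [0.0, 83807.86, 159660.8, 0.0],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "IsActiveMember": [1, 0, 1, 0],
})
transformed = preprocessor.fit_transform(toy)
print(transformed.shape)  # 3 scaled columns + 3 + 2 + 2 one-hot columns
```

Wrapping the preprocessor and the classifier in one pipeline keeps the scaler and encoder fitted on training data only.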

[413]: # Split into training and validation sets

print("Training set size:", X_train.shape)

Training set size: (105621, 10)
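The reported sizes (105621 training rows here and 26406 validation rows in the classification reports) are consistent with an 80/20 split. The sketch below reproduces those shapes with `train_test_split`, assuming `test_size=0.2`; the zero-filled arrays and `random_state=42` are placeholders, not the notebook's data or settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_total = 105621 + 26406  # training rows + validation rows reported in the notebook

# Placeholder arrays with the same shape as the feature matrix and target.
X = np.zeros((n_total, 10))
y = np.zeros(n_total, dtype=int)

# Assumed settings: 20% validation share and a fixed seed.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set size:", X_train.shape)
print("Validation set size:", X_val.shape)
```

With an imbalanced target, passing `stratify=y` is a common option to keep the class ratio equal in both parts; whether the notebook does so is not visible in this excerpt.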

[414]: from sklearn.neighbors import KNeighborsClassifier

# Create the KNN pipeline

# Train the model

# Predict on the validation set

# Evaluate the model

KNN Accuracy: 0.8485950162841779

0 0.88 0.93 0.91 20813

accuracy 0.85 26406
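The four commented steps (build the pipeline, train, predict, evaluate) might look like the following. This is a sketch on synthetic data from `make_classification`; the scaler step, `n_neighbors=5`, and the seeds are assumptions, and the numbers it prints will not match the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed customer data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Create the KNN pipeline (scaling matters because KNN is distance-based).
knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Train the model.
knn_pipeline.fit(X_train, y_train)

# Predict on the validation set.
y_pred_knn = knn_pipeline.predict(X_val)

# Evaluate the model.
knn_accuracy = accuracy_score(y_val, y_pred_knn)
print("KNN Accuracy:", knn_accuracy)
print(classification_report(y_val, y_pred_knn))
```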

[415]: #KNN test with cross validation

# Print the results from each fold

# Evaluate the model

print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_knn))

Cross-validation scores: [0.85031953 0.8500284 0.84903427 0.84605188

0 0.88 0.93 0.91 20813
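Per-fold scores like the ones printed above are typically obtained with `cross_val_score`. The sketch below assumes 5 folds and accuracy scoring on synthetic data; the fold count used in the notebook is not visible in this excerpt, so `cv=5` is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# Each fold refits the whole pipeline, so scaling never sees the held-out fold.
scores = cross_val_score(knn_pipeline, X, y, cv=5, scoring="accuracy")

# Print the results from each fold, then their average.
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean())
```

Fold scores that stay close to each other, as in the notebook's output, suggest the single validation split was not unusually lucky or unlucky.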

[416]: from sklearn.naive_bayes import GaussianNB

# Create the Naive Bayes pipeline

# Train the model

# Predict on the validation set

# Evaluate the model

Naive Bayes Accuracy: 0.80807392259335

0 0.87 0.89 0.88 20813

accuracy 0.81 26406

[417]: #Naive Bayes test with cross validation

# Print the results from each fold

# Evaluate the model

print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_nb))

Cross-validation scores: [0.80426036 0.80444045 0.80818027 0.80604999

0 0.87 0.89 0.88 20813

accuracy 0.81 26406

[418]: from sklearn.linear_model import LogisticRegression

# Create the Logistic Regression pipeline

# Train the model

# Predict on the validation set

# Evaluate the model

0 0.85 0.96 0.90 20813

accuracy 0.83 26406
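The notebook reports validation accuracies of about 0.85 for KNN, 0.81 for Naive Bayes, and 0.83 for Logistic Regression. A compact way to run the same three model families side by side is sketched below; the synthetic data, the shared scaler step, and `max_iter=1000` are assumptions, so the printed numbers will differ from those above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed customer data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    # Same preprocessing for every model keeps the comparison fair.
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_val)
    results[name] = accuracy_score(y_val, y_pred)
    print(f"{name} Accuracy: {results[name]:.4f}")
    print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred))
```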

[419]: #Linear test with cross validation

fig, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols*3.5, n_rows*3.5))