Group 2 TH

The document outlines a data analysis project using Python, specifically in a Kaggle environment, where datasets related to customer information are loaded and explored. It includes details about the training and testing datasets, their structure, and basic statistical descriptions. The analysis also involves visualizations to understand the distribution of various features and their relationship with customer exit rates.

Uploaded by

Đơn Đặng

ieochc5ud

December 11, 2024

[390]: # This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/city-u-10-f-fun-ai-final-project/10F_train.csv
/kaggle/input/city-u-10-f-fun-ai-final-project/10F_sample_submission.csv
/kaggle/input/city-u-10-f-fun-ai-final-project/10F_test.csv

[391]: import pandas as pd

# Load the data
train = pd.read_csv('/kaggle/input/city-u-10-f-fun-ai-final-project/10F_train.csv')
test = pd.read_csv('/kaggle/input/city-u-10-f-fun-ai-final-project/10F_test.csv')

[392]: train.head()

[392]: id CustomerId Surname CreditScore Geography Gender Age Tenure \
0 1 15749177 Okwudiliolisa 627 France Male 33.0 1
1 2 15694510 Hsueh 678 France Male 40.0 10
2 3 15741417 Kao 581 France Male 34.0 2
3 4 15766172 Chiemenam 716 Spain Male 33.0 5
4 5 15771669 Genovese 588 Germany Male 36.0 4

Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary \


0 0.00 2 1 1 49503.50
1 0.00 2 1 0 184866.69
2 148882.54 1 1 1 84560.88
3 0.00 2 1 1 15068.83
4 131778.58 1 1 0 136024.31

Exited
0 0
1 0
2 0
3 0
4 1

[393]: train.describe()

[393]: id CustomerId CreditScore Age \


count 132027.000000 1.320270e+05 132027.000000 132027.000000
mean 82432.091133 1.569183e+07 656.783832 38.120996
std 47705.125906 7.137972e+04 80.043164 8.869802
min 1.000000 1.556570e+07 350.000000 18.000000
25% 41083.500000 1.563290e+07 598.000000 32.000000
50% 82435.000000 1.569013e+07 660.000000 37.000000
75% 123790.500000 1.575662e+07 710.000000 42.000000
max 165033.000000 1.581569e+07 850.000000 92.000000

Tenure Balance NumOfProducts HasCrCard \


count 132027.000000 132027.000000 132027.000000 132027.000000
mean 5.021821 55609.625464 1.554682 0.753838
std 2.808487 62860.390849 0.547018 0.430776
min 0.000000 0.000000 1.000000 0.000000
25% 3.000000 0.000000 1.000000 1.000000
50% 5.000000 0.000000 2.000000 1.000000
75% 7.000000 120107.645000 2.000000 1.000000
max 10.000000 250898.090000 4.000000 1.000000

IsActiveMember EstimatedSalary Exited
count 132027.000000 132027.000000 132027.00000
mean 0.497300 112683.672952 0.21182
std 0.499995 50275.570007 0.40860
min 0.000000 11.580000 0.00000
25% 0.000000 74835.650000 0.00000
50% 0.000000 118024.100000 0.00000
75% 1.000000 155616.750000 0.00000
max 1.000000 199992.480000 1.00000

[394]: print(train.shape)

(132027, 14)

[395]: test.head()

[395]: id CustomerId Surname CreditScore Geography Gender Age Tenure \


0 0 15674932 Okwudilichukwu 668 France Male 33 3
1 12 15717962 Rossi 759 Spain Male 71 9
2 20 15781496 Udegbulam 773 Spain Male 35 9
3 22 15759913 Trentini 553 Germany Female 43 9
4 24 15626012 Obidimkpa 714 France Male 26 6

Balance NumOfProducts HasCrCard IsActiveMember EstimatedSalary


0 0.00 2 1 0 181449.97
1 0.00 1 1 1 93081.87
2 0.00 2 0 1 87549.36
3 85200.82 1 1 0 160574.09
4 149879.66 2 1 1 50016.17

[396]: # Display an overview of the data
print(train.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132027 entries, 0 to 132026
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 132027 non-null int64
1 CustomerId 132027 non-null int64
2 Surname 132027 non-null object
3 CreditScore 132027 non-null int64
4 Geography 132027 non-null object
5 Gender 132027 non-null object
6 Age 132027 non-null float64
7 Tenure 132027 non-null int64
8 Balance 132027 non-null float64
9 NumOfProducts 132027 non-null int64
10 HasCrCard 132027 non-null int64
11 IsActiveMember 132027 non-null int64
12 EstimatedSalary 132027 non-null float64
13 Exited 132027 non-null int64

dtypes: float64(3), int64(8), object(3)
memory usage: 14.1+ MB
None

[397]: print(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33007 entries, 0 to 33006
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 33007 non-null int64
1 CustomerId 33007 non-null int64
2 Surname 33007 non-null object
3 CreditScore 33007 non-null int64
4 Geography 33007 non-null object
5 Gender 33007 non-null object
6 Age 33007 non-null int64
7 Tenure 33007 non-null int64
8 Balance 33007 non-null float64
9 NumOfProducts 33007 non-null int64
10 HasCrCard 33007 non-null int64
11 IsActiveMember 33007 non-null int64
12 EstimatedSalary 33007 non-null float64
dtypes: float64(2), int64(8), object(3)
memory usage: 3.3+ MB
None

[398]: # Check for missing values
print(train.isnull().sum())

id 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
Exited 0
dtype: int64

[399]: print(test.isnull().sum())

id 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
dtype: int64

[400]: # View the distribution of the Exited variable
print(train['Exited'].value_counts())

Exited
0 104061
1 27966
Name: count, dtype: int64
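
The counts above show an imbalanced target, which is worth quantifying before any model is trained. The sketch below hard-codes the two counts printed by `value_counts()` and derives the exit rate and the accuracy of a trivial "always predict 0" baseline. The variable names are illustrative and do not appear in the notebook.

```python
# Counts copied from the value_counts() output above.
not_exited = 104061
exited = 27966
total = not_exited + exited

# Share of customers who exited (the positive class).
exit_rate = exited / total

# Accuracy of a model that always predicts "not exited".
baseline_accuracy = not_exited / total

print(f"Total rows: {total}")
print(f"Exit rate: {exit_rate:.4f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The exit rate agrees with the mean of `Exited` reported by `train.describe()` (about 0.21). Accuracy figures reported later are best read against the roughly 79% majority-class baseline rather than against 50%.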

[401]: import seaborn as sns


import matplotlib.pyplot as plt

[402]: # Plot the class counts
sns.countplot(x='Exited', data=train)
plt.show()

[403]: cols = ['Gender', 'Geography', 'HasCrCard', 'IsActiveMember']

n_rows = 2
n_cols = 3

fig, ax = plt.subplots(n_rows, n_cols, figsize=(n_cols*3.5, n_rows*3.5))

for r in range(0, n_rows):
    for c in range(0, n_cols):
        i = r*n_cols + c  # index into the list "cols"
        if i < len(cols):
            ax_i = ax[r, c]
            sns.countplot(data=train, x=cols[i], hue="Exited", palette="Blues", ax=ax_i)
            ax_i.set_title(f"Figure {i+1}: Exited Rate vs {cols[i]}")
            ax_i.legend(title='', loc='upper right', labels=['Not Exited', 'Exited'])

# Hide the subplots that have no column assigned (the last two of the 2x3 grid)
for unused_ax in ax.flat[len(cols):]:
    unused_ax.set_visible(False)

plt.tight_layout()
plt.show()

[404]: sns.histplot(data=train, x='Age', hue= 'Exited', bins = 40, kde=True)

/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning:
use_inf_as_na option is deprecated and will be removed in a future version.
Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1075: FutureWarning:
When grouping with a length-1 list-like, you will need to pass a length-1 tuple
to get_group in a future version of pandas. Pass `(name,)` instead of `name` to
silence this warning.
data_subset = grouped_data.get_group(pd_key)
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1075: FutureWarning:
When grouping with a length-1 list-like, you will need to pass a length-1 tuple
to get_group in a future version of pandas. Pass `(name,)` instead of `name` to
silence this warning.
data_subset = grouped_data.get_group(pd_key)

[404]: <Axes: xlabel='Age', ylabel='Count'>

[405]: sns.histplot (data=train, x='CreditScore', hue = 'Exited', bins = 40)

[405]: <Axes: xlabel='CreditScore', ylabel='Count'>

[406]: sns.histplot (data=train, x='Tenure', hue = 'Exited', bins = 40)

[406]: <Axes: xlabel='Tenure', ylabel='Count'>

[407]: sns.histplot (data=train, x='Balance', hue = 'Exited', bins = 40)

[407]: <Axes: xlabel='Balance', ylabel='Count'>

[408]: sns.histplot (data=train, x='EstimatedSalary', hue = 'Exited', bins = 40)

[408]: <Axes: xlabel='EstimatedSalary', ylabel='Count'>

[409]: sns.histplot (data=train, x='NumOfProducts', hue = 'Exited', bins = 40)

[409]: <Axes: xlabel='NumOfProducts', ylabel='Count'>

[410]: # Drop unnecessary columns
X = train.drop(['id', 'CustomerId', 'Surname', 'Exited'], axis=1)
y = train['Exited']
X_test = test.drop(['id', 'CustomerId', 'Surname'], axis=1)

# Define the categorical and numerical columns.
# HasCrCard and EstimatedSalary appear in neither list, so the ColumnTransformer
# defined later (default remainder='drop') leaves them out of the model input.
categorical_features = ['Geography', 'Gender', 'IsActiveMember']
numerical_features = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts']

[411]: print (categorical_features)


print (numerical_features)

['Geography', 'Gender', 'IsActiveMember']


['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts']

[412]: from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Pipeline for categorical variables
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Pipeline for numerical variables
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Combine the processing of both variable types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])

[413]: # Split into training and validation sets
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Training set size:", X_train.shape)
print("Validation set size:", X_val.shape)

Training set size: (105621, 10)
Validation set size: (26406, 10)
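
The two shapes printed above can be reproduced by hand. The sketch below assumes, as far as I understand scikit-learn's behaviour, that `train_test_split` rounds a fractional `test_size` up and gives the remaining rows to the training set. The variable names are illustrative.

```python
import math

n_samples = 132027   # rows in train, from train.shape
test_size = 0.2      # fraction passed to train_test_split

# Assumption: a fractional test_size is rounded up,
# and the training set takes whatever is left.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

# 14 original columns minus id, CustomerId, Surname and Exited.
n_features = 14 - 4

print("Training set size:", (n_train, n_features))
print("Validation set size:", (n_val, n_features))
```

Note that the 10 columns counted here are those of `X` before preprocessing; the `ColumnTransformer` only passes on the columns named in its two feature lists.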

[414]: from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Create the pipeline for KNN
knn_model = Pipeline(steps=[('preprocessor', preprocessor),
                            ('classifier', KNeighborsClassifier(n_neighbors=5))])

# Train the model
knn_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_knn = knn_model.predict(X_val)

# Evaluate the model
print("KNN Accuracy:", accuracy_score(y_val, y_pred_knn))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_knn))
print("Classification Report:\n", classification_report(y_val, y_pred_knn))

KNN Accuracy: 0.8485950162841779

Confusion Matrix:
[[19366 1447]
[ 2551 3042]]
Classification Report:
precision recall f1-score support

0 0.88 0.93 0.91 20813


1 0.68 0.54 0.60 5593

accuracy 0.85 26406


macro avg 0.78 0.74 0.75 26406
weighted avg 0.84 0.85 0.84 26406
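
The figures in the classification report follow directly from the confusion matrix. As a minimal sketch, the block below recomputes accuracy and the class-1 (exited) precision, recall and F1 from the four KNN counts printed above, using the scikit-learn convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
# Confusion matrix copied from the KNN output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
tn, fp = 19366, 1447
fn, tp = 2551, 3042

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of predicted exits, how many really exited
recall = tp / (tp + fn)      # of real exits, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.2f}")
print(f"recall    = {recall:.2f}")
print(f"f1-score  = {f1:.2f}")
```

Recall on the exited class is the number to watch here: the model misses a large share of the customers who actually leave even though overall accuracy looks high.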

[415]: # KNN test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(knn_model, X_train, y_train, cv=5, scoring='accuracy')

# Print the results from each fold
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Retrain the model on the full training set and evaluate on the validation set
knn_model.fit(X_train, y_train)
y_pred_knn = knn_model.predict(X_val)

# Evaluate the model
print("KNN Accuracy on Validation Set:", accuracy_score(y_val, y_pred_knn))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_knn))
print("Classification Report:\n", classification_report(y_val, y_pred_knn))

Cross-validation scores: [0.85031953 0.8500284 0.84903427 0.84605188 0.84557849]
Mean cross-validation accuracy: 0.848202515437165
KNN Accuracy on Validation Set: 0.8485950162841779
Confusion Matrix:
[[19366 1447]
[ 2551 3042]]
Classification Report:
precision recall f1-score support

0 0.88 0.93 0.91 20813


1 0.68 0.54 0.60 5593

accuracy 0.85 26406
macro avg 0.78 0.74 0.75 26406
weighted avg 0.84 0.85 0.84 26406
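
A mean alone hides how much the folds disagree. The sketch below takes the five KNN fold scores and the validation accuracy printed above and reports the mean, the sample standard deviation, and whether the hold-out accuracy falls within one standard deviation of the cross-validation mean. The one-standard-deviation check is an informal rule of thumb chosen for illustration, not a test used in the notebook.

```python
from statistics import mean, stdev

# Fold scores and validation accuracy copied from the KNN output above.
cv_scores = [0.85031953, 0.8500284, 0.84903427, 0.84605188, 0.84557849]
validation_accuracy = 0.8485950162841779

cv_mean = mean(cv_scores)
cv_std = stdev(cv_scores)  # sample standard deviation across folds

gap = validation_accuracy - cv_mean
consistent = abs(gap) <= cv_std

print(f"CV mean accuracy: {cv_mean:.4f}")
print(f"CV std deviation: {cv_std:.4f}")
print(f"Validation minus CV mean: {gap:+.4f}")
print(f"Within one std of the CV mean: {consistent}")
```

The same few lines can be applied to the fold scores of any of the other models.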

[416]: from sklearn.naive_bayes import GaussianNB

# Create the pipeline for Naive Bayes
nb_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GaussianNB())])

# Train the model
nb_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_nb = nb_model.predict(X_val)

# Evaluate the model
print("Naive Bayes Accuracy:", accuracy_score(y_val, y_pred_nb))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_nb))
print("Classification Report:\n", classification_report(y_val, y_pred_nb))

Naive Bayes Accuracy: 0.80807392259335


Confusion Matrix:
[[18501 2312]
[ 2756 2837]]
Classification Report:
precision recall f1-score support

0 0.87 0.89 0.88 20813


1 0.55 0.51 0.53 5593

accuracy 0.81 26406


macro avg 0.71 0.70 0.70 26406
weighted avg 0.80 0.81 0.81 26406

[417]: # Naive Bayes test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(nb_model, X_train, y_train, cv=5, scoring='accuracy')

# Print the results from each fold
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Retrain the model on the full training set and evaluate on the validation set
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_val)

# Evaluate the model
print("Naive Bayes Accuracy on Validation Set:", accuracy_score(y_val, y_pred_nb))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_nb))
print("Classification Report:\n", classification_report(y_val, y_pred_nb))

Cross-validation scores: [0.80426036 0.80444045 0.80818027 0.80604999 0.81035789]
Mean cross-validation accuracy: 0.8066577896198159
Naive Bayes Accuracy on Validation Set: 0.80807392259335
Confusion Matrix:
[[18501 2312]
[ 2756 2837]]
Classification Report:
precision recall f1-score support

0 0.87 0.89 0.88 20813


1 0.55 0.51 0.53 5593

accuracy 0.81 26406


macro avg 0.71 0.70 0.70 26406
weighted avg 0.80 0.81 0.81 26406

[418]: from sklearn.linear_model import LogisticRegression

# Create the pipeline for Logistic Regression
lr_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, max_iter=1000))])

# Train the model
lr_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_lr = lr_model.predict(X_val)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_val, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_lr))
print("Classification Report:\n", classification_report(y_val, y_pred_lr))

Logistic Regression Accuracy: 0.8328410209800803
Confusion Matrix:
[[19887 926]
[ 3488 2105]]
Classification Report:
precision recall f1-score support

0 0.85 0.96 0.90 20813


1 0.69 0.38 0.49 5593

accuracy 0.83 26406


macro avg 0.77 0.67 0.69 26406
weighted avg 0.82 0.83 0.81 26406

[419]: # Logistic Regression test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(lr_model, X_train, y_train, cv=5, scoring='accuracy')

# Print the results from each fold
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Retrain the model on the full training set and evaluate on the validation set
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_val)

# Evaluate the model
print("Logistic Regression Accuracy on Validation Set:", accuracy_score(y_val, y_pred_lr))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_lr))
print("Classification Report:\n", classification_report(y_val, y_pred_lr))

Cross-validation scores: [0.83427219 0.83132929 0.83274948 0.834217 0.83724673]
Mean cross-validation accuracy: 0.8339629400474402
Logistic Regression Accuracy on Validation Set: 0.8328410209800803
Confusion Matrix:
[[19887 926]
[ 3488 2105]]
Classification Report:
precision recall f1-score support

0 0.85 0.96 0.90 20813


1 0.69 0.38 0.49 5593

accuracy 0.83 26406
macro avg 0.77 0.67 0.69 26406
weighted avg 0.82 0.83 0.81 26406

[420]: from sklearn.tree import DecisionTreeClassifier

# Create the pipeline for Decision Tree
dt_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42, max_depth=10))])

# Train the model
dt_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_dt = dt_model.predict(X_val)

# Evaluate the model
print("Decision Tree Accuracy:", accuracy_score(y_val, y_pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_dt))
print("Classification Report:\n", classification_report(y_val, y_pred_dt))

Decision Tree Accuracy: 0.860978565477543


Confusion Matrix:
[[19580 1233]
[ 2438 3155]]
Classification Report:
precision recall f1-score support

0 0.89 0.94 0.91 20813


1 0.72 0.56 0.63 5593

accuracy 0.86 26406


macro avg 0.80 0.75 0.77 26406
weighted avg 0.85 0.86 0.85 26406

[421]: # Decision Tree test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(dt_model, X_train, y_train, cv=5, scoring='accuracy')

# Print the results from each fold
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Retrain the model on the full training set and evaluate on the validation set
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_val)

# Evaluate the model
print("Decision Tree Accuracy on Validation Set:", accuracy_score(y_val, y_pred_dt))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_dt))
print("Classification Report:\n", classification_report(y_val, y_pred_dt))

Cross-validation scores: [0.85921893 0.85892823 0.85646658 0.85826548 0.85599318]
Mean cross-validation accuracy: 0.8577744819263879
Decision Tree Accuracy on Validation Set: 0.860978565477543
Confusion Matrix:
[[19580 1233]
[ 2438 3155]]
Classification Report:
precision recall f1-score support

0 0.89 0.94 0.91 20813


1 0.72 0.56 0.63 5593

accuracy 0.86 26406


macro avg 0.80 0.75 0.77 26406
weighted avg 0.85 0.86 0.85 26406

[422]: # Random Forest test
from sklearn.ensemble import RandomForestClassifier

# Create the pipeline for Random Forest
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42, n_estimators=100, max_depth=10))])

# Train the model
rf_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_rf = rf_model.predict(X_val)

# Evaluate the model
print("Random Forest Accuracy:", accuracy_score(y_val, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_rf))
print("Classification Report:\n", classification_report(y_val, y_pred_rf))

Random Forest Accuracy: 0.8629856850715747


Confusion Matrix:
[[19879 934]
[ 2684 2909]]
Classification Report:
precision recall f1-score support

0 0.88 0.96 0.92 20813


1 0.76 0.52 0.62 5593

accuracy 0.86 26406


macro avg 0.82 0.74 0.77 26406
weighted avg 0.85 0.86 0.85 26406

[423]: # Random Forest test with cross-validation
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='accuracy')

# Print the results from each fold
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Retrain the model on the full training set and evaluate on the validation set
rf_model.fit(X_train, y_train)
# Store the predictions in y_pred_rf so the Naive Bayes predictions in y_pred_nb are not overwritten
y_pred_rf = rf_model.predict(X_val)

# Evaluate the model
print("Random Forest Accuracy on Validation Set:", accuracy_score(y_val, y_pred_rf))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_rf))
print("Classification Report:\n", classification_report(y_val, y_pred_rf))

Cross-validation scores: [0.86338462 0.86295209 0.86209998 0.86219466 0.86318879]
Mean cross-validation accuracy: 0.8627640277919392
Random Forest Accuracy on Validation Set: 0.8629856850715747
Confusion Matrix:
[[19879 934]
[ 2684 2909]]
Classification Report:
precision recall f1-score support

0 0.88 0.96 0.92 20813
1 0.76 0.52 0.62 5593

accuracy 0.86 26406


macro avg 0.82 0.74 0.77 26406
weighted avg 0.85 0.86 0.85 26406

[424]: from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create the pipeline for XGBoost
xgb_model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing (if any)
    ('classifier', XGBClassifier(random_state=42, use_label_encoder=False,
                                 eval_metric='mlogloss'))])  # Initialize the XGBoost model

# Train the model
xgb_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_xgb = xgb_model.predict(X_val)

# Evaluate the model
print("XGBoost Accuracy:", accuracy_score(y_val, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_xgb))
print("Classification Report:\n", classification_report(y_val, y_pred_xgb))

XGBoost Accuracy: 0.865295766113762


Confusion Matrix:
[[19702 1111]
[ 2446 3147]]
Classification Report:
precision recall f1-score support

0 0.89 0.95 0.92 20813


1 0.74 0.56 0.64 5593

accuracy 0.87 26406


macro avg 0.81 0.75 0.78 26406
weighted avg 0.86 0.87 0.86 26406

[425]: # XGBClassifier test with cross-validation
from sklearn.model_selection import cross_val_score
# Cross-validate the XGBoost pipeline. The scores recorded below come from a run
# that passed nb_model here, so they equal the Naive Bayes scores; rerun this cell
# to obtain XGBoost values.
cv_scores = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring='accuracy')

# Print the results from each fold
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Retrain the model on the full training set and evaluate on the validation set
xgb_model.fit(X_train, y_train)
# Store the predictions in y_pred_xgb so the Naive Bayes predictions in y_pred_nb are not overwritten
y_pred_xgb = xgb_model.predict(X_val)

# Evaluate the model
print("XGBoost Accuracy on Validation Set:", accuracy_score(y_val, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_xgb))
print("Classification Report:\n", classification_report(y_val, y_pred_xgb))

Cross-validation scores: [0.80426036 0.80444045 0.80818027 0.80604999 0.81035789]
Mean cross-validation accuracy: 0.8066577896198159
XGBoost Accuracy on Validation Set: 0.865295766113762
Confusion Matrix:
[[19702 1111]
[ 2446 3147]]
Classification Report:
precision recall f1-score support

0 0.89 0.95 0.92 20813


1 0.74 0.56 0.64 5593

accuracy 0.87 26406


macro avg 0.81 0.75 0.78 26406
weighted avg 0.86 0.87 0.86 26406

[ ]: pip install lightgbm

Requirement already satisfied: lightgbm in /opt/conda/lib/python3.10/site-packages (4.2.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from lightgbm) (1.26.4)
Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from lightgbm) (1.14.1)

[ ]: # LGBMClassifier test
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Create the pipeline for LightGBM
lgbm_model = Pipeline(steps=[
    ('preprocessor', preprocessor),  # Preprocessing (if any)
    ('classifier', LGBMClassifier(random_state=42))])  # Initialize the LightGBM model

# Train the model
lgbm_model.fit(X_train, y_train)

# Predict on the validation set
y_pred_lgbm = lgbm_model.predict(X_val)

# Evaluate the model
print("LightGBM Accuracy:", accuracy_score(y_val, y_pred_lgbm))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_lgbm))
print("Classification Report:\n", classification_report(y_val, y_pred_lgbm))

[ ]: # LightGBM test with cross-validation
from sklearn.model_selection import cross_val_score
# Cross-validate the LightGBM pipeline (the original passed nb_model here)
cv_scores = cross_val_score(lgbm_model, X_train, y_train, cv=5, scoring='accuracy')

# Print the results from each fold
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation accuracy: {cv_scores.mean()}")

# Retrain the model on the full training set and evaluate on the validation set
lgbm_model.fit(X_train, y_train)
y_pred_lgbm = lgbm_model.predict(X_val)

# Evaluate the model
print("LightGBM Accuracy on Validation Set:", accuracy_score(y_val, y_pred_lgbm))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred_lgbm))
print("Classification Report:\n", classification_report(y_val, y_pred_lgbm))

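[ ]: models = {
    "KNN": (knn_model, y_pred_knn),
    "Naive Bayes": (nb_model, y_pred_nb),
    "Logistic Regression": (lr_model, y_pred_lr),
    "Decision Tree": (dt_model, y_pred_dt),
    "Random Forest": (rf_model, y_pred_rf),
    "XGBClassifier": (xgb_model, y_pred_xgb),
    "LGBMClassifier": (lgbm_model, y_pred_lgbm)
}

for name, (model, preds) in models.items():
    # Compute cross-validation scores per model; the shared cv_scores variable
    # only holds the result of the most recently run cell.
    model_cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name} Model")
    print("Accuracy:", accuracy_score(y_val, preds))
    print(f"Cross-validation scores: {model_cv_scores}")
    print(f"Mean cross-validation accuracy: {model_cv_scores.mean()}")
    print("-" * 40)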
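
Because the comparison cell has no recorded output, the validation results printed earlier can be gathered by hand. The sketch below copies each model's confusion matrix from the outputs above as `(tn, fp, fn, tp)`, recomputes accuracy and recall on the exited class, and ranks the models by accuracy. LightGBM is left out because the notebook records no output for it. The `results` dictionary and the `summarize` helper are illustrative names, not part of the notebook.

```python
# Confusion matrices copied from the validation outputs above, as (tn, fp, fn, tp).
# LightGBM is omitted: no output was recorded for it.
results = {
    "KNN": (19366, 1447, 2551, 3042),
    "Naive Bayes": (18501, 2312, 2756, 2837),
    "Logistic Regression": (19887, 926, 3488, 2105),
    "Decision Tree": (19580, 1233, 2438, 3155),
    "Random Forest": (19879, 934, 2684, 2909),
    "XGBoost": (19702, 1111, 2446, 3147),
}


def summarize(tn, fp, fn, tp):
    """Return (accuracy, recall on the exited class) for one confusion matrix."""
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    return accuracy, recall


# Rank model names from highest to lowest validation accuracy.
ranked = sorted(results, key=lambda name: summarize(*results[name])[0], reverse=True)

for name in ranked:
    accuracy, recall = summarize(*results[name])
    print(f"{name:<20} accuracy={accuracy:.4f}  exited-recall={recall:.4f}")
```

Accuracy and exited-class recall do not rank the models identically: XGBoost has the highest accuracy, while the Decision Tree catches slightly more of the customers who actually exit (3155 versus 3147 true positives).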

[ ]: # Predict on the test set
y_pred_test = lgbm_model.predict(X_test)

# Create a DataFrame for the predictions
submission = pd.DataFrame({
    'id': test['id'],  # Take the id column from the test data
    'Exited': y_pred_test  # Predicted labels
})

# Make sure the column names are correct
submission.columns = ['id', 'Exited']

# Save the results to a CSV file (make sure the path is right for the Kaggle environment)
submission.to_csv('submission.csv', index=False)

# Preview the submission file
submission.head()
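
Before uploading, it can help to check the file's shape mechanically. The sketch below is a hypothetical helper, `check_submission`, that verifies the header, the row count and the label values of a submission CSV using only the standard library. It assumes the competition expects hard 0/1 labels in a column named `Exited`; the notebook does not show the sample submission, so that requirement is an assumption.

```python
import csv
import io


def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none).

    Hypothetical helper: assumes the header must be exactly id,Exited
    and that every label must be the string "0" or "1".
    """
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "Exited"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems

    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    bad_ids = [row["id"] for row in rows if row["Exited"] not in {"0", "1"}]
    if bad_ids:
        problems.append(f"non-binary labels for ids: {bad_ids}")
    return problems


# Tiny illustrative file; the real one should have 33007 rows, per test.info().
sample = "id,Exited\n0,0\n12,1\n20,0\n"
print(check_submission(sample, expected_rows=3))
```

For the real file, the text could be read with `open('submission.csv').read()` and checked against `expected_rows=33007`.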

[ ]: # Preview the contents of the submission file
submission.head()
