
Churn Predictions

April 22, 2025

0.1 Data loading


[2]: import pandas as pd

try:
    df = pd.read_csv('E Commerce Dataset_2024.csv')
    display(df.head())
except FileNotFoundError:
    print("Error: 'E Commerce Dataset_2024.csv' not found.")
    df = None
except Exception as e:
    print(f"An error occurred: {e}")
    df = None

CustomerID Churn Tenure PreferredLoginDevice CityTier WarehouseToHome \


0 50001 1 4.0 Mobile Phone 3 6.0
1 50002 1 NaN Phone 1 8.0
2 50003 1 NaN Phone 1 30.0
3 50004 1 0.0 Phone 3 15.0
4 50005 1 0.0 Phone 1 12.0

PreferredPaymentMode Gender HourSpendOnApp NumberOfDeviceRegistered \


0 Debit Card Female 3.0 3
1 UPI Male 3.0 4
2 Debit Card Male 2.0 4
3 Debit Card Male 2.0 4
4 CC Male NaN 3

PreferedOrderCat SatisfactionScore MaritalStatus NumberOfAddress \


0 Laptop & Accessory 2 Single 9
1 Mobile 3 Single 7
2 Mobile 3 Single 6
3 Laptop & Accessory 5 Single 8
4 Mobile 5 Single 3

Complain OrderAmountHikeFromlastYear CouponUsed OrderCount \


0 1 11.0 1.0 1.0
1 1 15.0 0.0 1.0

2 1 14.0 0.0 1.0
3 0 23.0 0.0 1.0
4 0 11.0 1.0 1.0

DaySinceLastOrder CashbackAmount
0 5.0 160
1 0.0 121
2 3.0 120
3 3.0 134
4 3.0 130

0.2 Data exploration


[3]: # Data Types
print("Data Types:\n", df.dtypes)

# Missing Values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
print("\nMissing Values:\n", missing_percentage)

# Descriptive Statistics
print("\nDescriptive Statistics:\n", df.describe(include='all'))

# Target Variable Analysis
churn_counts = df['Churn'].value_counts()
churn_percentage = (churn_counts / len(df)) * 100
print("\nChurn Distribution:\n", churn_percentage)

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
churn_counts.plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Distribution of Churn')
plt.xlabel('Churn (0: No, 1: Yes)')
plt.ylabel('Number of Customers')
plt.show()

Data Types:
CustomerID int64
Churn int64
Tenure float64
PreferredLoginDevice object
CityTier int64
WarehouseToHome float64
PreferredPaymentMode object
Gender object
HourSpendOnApp float64
NumberOfDeviceRegistered int64

PreferedOrderCat object
SatisfactionScore int64
MaritalStatus object
NumberOfAddress int64
Complain int64
OrderAmountHikeFromlastYear float64
CouponUsed float64
OrderCount float64
DaySinceLastOrder float64
CashbackAmount int64
dtype: object

Missing Values:
CustomerID 0.000000
Churn 0.000000
Tenure 4.689165
PreferredLoginDevice 0.000000
CityTier 0.000000
WarehouseToHome 4.458259
PreferredPaymentMode 0.000000
Gender 0.000000
HourSpendOnApp 4.529307
NumberOfDeviceRegistered 0.000000
PreferedOrderCat 0.000000
SatisfactionScore 0.000000
MaritalStatus 0.000000
NumberOfAddress 0.000000
Complain 0.000000
OrderAmountHikeFromlastYear 4.706927
CouponUsed 4.547069
OrderCount 4.582593
DaySinceLastOrder 5.452931
CashbackAmount 0.000000
dtype: float64

Descriptive Statistics:
CustomerID Churn Tenure PreferredLoginDevice \
count 5630.000000 5630.000000 5366.000000 5630
unique NaN NaN NaN 3
top NaN NaN NaN Mobile Phone
freq NaN NaN NaN 2765
mean 52815.500000 0.168384 10.189899 NaN
std 1625.385339 0.374240 8.557241 NaN
min 50001.000000 0.000000 0.000000 NaN
25% 51408.250000 0.000000 2.000000 NaN
50% 52815.500000 0.000000 9.000000 NaN
75% 54222.750000 0.000000 16.000000 NaN
max 55630.000000 1.000000 61.000000 NaN

CityTier WarehouseToHome PreferredPaymentMode Gender \
count 5630.000000 5379.000000 5630 5630
unique NaN NaN 7 2
top NaN NaN Debit Card Male
freq NaN NaN 2314 3384
mean 1.654707 15.639896 NaN NaN
std 0.915389 8.531475 NaN NaN
min 1.000000 5.000000 NaN NaN
25% 1.000000 9.000000 NaN NaN
50% 1.000000 14.000000 NaN NaN
75% 3.000000 20.000000 NaN NaN
max 3.000000 127.000000 NaN NaN

HourSpendOnApp NumberOfDeviceRegistered PreferedOrderCat \


count 5375.000000 5630.000000 5630
unique NaN NaN 6
top NaN NaN Laptop & Accessory
freq NaN NaN 2050
mean 2.931535 3.688988 NaN
std 0.721926 1.023999 NaN
min 0.000000 1.000000 NaN
25% 2.000000 3.000000 NaN
50% 3.000000 4.000000 NaN
75% 3.000000 4.000000 NaN
max 5.000000 6.000000 NaN

SatisfactionScore MaritalStatus NumberOfAddress Complain \


count 5630.000000 5630 5630.000000 5630.000000
unique NaN 3 NaN NaN
top NaN Married NaN NaN
freq NaN 2986 NaN NaN
mean 3.066785 NaN 4.214032 0.284902
std 1.380194 NaN 2.583586 0.451408
min 1.000000 NaN 1.000000 0.000000
25% 2.000000 NaN 2.000000 0.000000
50% 3.000000 NaN 3.000000 0.000000
75% 4.000000 NaN 6.000000 1.000000
max 5.000000 NaN 22.000000 1.000000

OrderAmountHikeFromlastYear CouponUsed OrderCount \


count 5365.000000 5374.000000 5372.000000
unique NaN NaN NaN
top NaN NaN NaN
freq NaN NaN NaN
mean 15.707922 1.751023 3.008004
std 3.675485 1.894621 2.939680
min 11.000000 0.000000 1.000000

25% 13.000000 1.000000 1.000000
50% 15.000000 1.000000 2.000000
75% 18.000000 2.000000 3.000000
max 26.000000 16.000000 16.000000

DaySinceLastOrder CashbackAmount
count 5323.000000 5630.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 4.543491 177.221492
std 3.654433 49.193869
min 0.000000 0.000000
25% 2.000000 146.000000
50% 3.000000 163.000000
75% 7.000000 196.000000
max 46.000000 325.000000

Churn Distribution:
Churn
0 83.161634
1 16.838366
Name: count, dtype: float64
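With roughly 83% of customers retained, plain accuracy is a weak yardstick: a model that always predicts "no churn" already scores about 83%. A minimal sketch of that majority-class baseline, using a small synthetic stand-in for the `Churn` column (the 83/17 split mirrors the distribution above):

```python
import pandas as pd

# Hypothetical stand-in for df['Churn']: ~83% non-churn, ~17% churn.
churn = pd.Series([0] * 83 + [1] * 17, name='Churn')

# A trivial model that always predicts the majority class achieves
# the majority-class proportion as its accuracy.
baseline_accuracy = (churn == churn.mode()[0]).mean()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Any classifier trained later should be judged against this baseline, which is why the modeling sections below lean on F1, ROC-AUC, and PR-AUC rather than accuracy alone.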

[4]: # Relationship with Target Variable (Numerical)
for col in ['Tenure', 'WarehouseToHome', 'HourSpendOnApp',
            'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount',
            'DaySinceLastOrder', 'CashbackAmount']:
    if df[col].dtype in ['int64', 'float64']:
        print(f"\nSummary Statistics for {col} grouped by Churn:\n",
              df.groupby('Churn')[col].describe())

        # df.boxplot creates its own figure, so no plt.figure call is needed
        df.boxplot(column=col, by='Churn', patch_artist=True,
                   showfliers=False)  # suppress outliers
        plt.title(f'{col} vs Churn')
        plt.suptitle('')  # remove default boxplot title
        plt.ylabel(col)
        plt.show()

# Correlation Analysis
import seaborn as sns

numerical_features = df.select_dtypes(include=['number'])
correlation_matrix = numerical_features.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()

# Categorical Variable Analysis
for col in ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender',
            'PreferedOrderCat', 'MaritalStatus', 'Complain']:
    print(f"\nValue Counts for {col}:\n", df[col].value_counts())

    plt.figure(figsize=(8, 6))
    df[col].value_counts().plot(kind='bar', color='skyblue')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.show()

    churn_by_cat = (df.groupby(col)['Churn']
                      .value_counts(normalize=True)
                      .unstack())
    churn_by_cat.plot(kind='bar', stacked=True, figsize=(8, 6))
    plt.title(f'Churn Rate by {col}')
    plt.xlabel(col)
    plt.ylabel('Proportion of Churn')
    plt.show()

Summary Statistics for Tenure grouped by Churn:


count mean std min 25% 50% 75% max

Churn
0 4499.0 11.502334 8.419217 0.0 5.0 10.0 17.0 61.0
1 867.0 3.379469 5.486089 0.0 0.0 1.0 3.0 21.0

Summary Statistics for WarehouseToHome grouped by Churn:


count mean std min 25% 50% 75% max
Churn
0 4515.0 15.353931 8.483276 5.0 9.0 13.0 19.0 127.0
1 864.0 17.134259 8.631132 5.0 9.0 15.0 24.0 36.0

Summary Statistics for HourSpendOnApp grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4485.0 2.925530 0.727184 0.0 2.0 3.0 3.0 5.0
1 890.0 2.961798 0.694427 2.0 2.0 3.0 3.0 4.0

Summary Statistics for OrderAmountHikeFromlastYear grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4431.0 15.724893 3.646256 11.0 13.0 15.0 18.0 26.0
1 934.0 15.627409 3.812084 11.0 13.0 14.0 18.0 26.0

Summary Statistics for CouponUsed grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4434.0 1.758232 1.893083 0.0 1.0 1.0 2.0 16.0
1 940.0 1.717021 1.902503 0.0 1.0 1.0 2.0 16.0

Summary Statistics for OrderCount grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4442.0 3.046601 2.964982 1.0 1.0 2.0 3.0 16.0
1 930.0 2.823656 2.809924 1.0 1.0 2.0 3.0 16.0

Summary Statistics for DaySinceLastOrder grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4429.0 4.807406 3.644758 0.0 2.0 4.0 8.0 31.0
1 894.0 3.236018 3.415137 0.0 1.0 2.0 5.0 46.0

Summary Statistics for CashbackAmount grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4682.0 180.633704 50.422799 0.0 147.0 166.0 201.0 325.0
1 948.0 160.369198 38.413534 110.0 132.0 150.0 175.0 324.0

Value Counts for PreferredLoginDevice:
PreferredLoginDevice
Mobile Phone 2765
Computer 1634
Phone 1231
Name: count, dtype: int64

Value Counts for PreferredPaymentMode:
PreferredPaymentMode
Debit Card 2314
Credit Card 1501
E wallet 614
UPI 414
COD 365
CC 273
Cash on Delivery 149
Name: count, dtype: int64

Value Counts for Gender:
Gender
Male 3384
Female 2246
Name: count, dtype: int64

Value Counts for PreferedOrderCat:
PreferedOrderCat
Laptop & Accessory 2050
Mobile Phone 1271
Fashion 826
Mobile 809
Grocery 410
Others 264
Name: count, dtype: int64

Value Counts for MaritalStatus:
MaritalStatus
Married 2986
Single 1796
Divorced 848
Name: count, dtype: int64

Value Counts for Complain:
Complain
0 4026
1 1604
Name: count, dtype: int64
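The value counts above hint at inconsistent labels for what may be the same category: 'Phone' alongside 'Mobile Phone' in PreferredLoginDevice, and 'CC'/'COD' alongside 'Credit Card'/'Cash on Delivery' in PreferredPaymentMode. If the dataset's data dictionary confirms these are synonyms, a sketch like the following (on a hypothetical mini-frame, with assumed mappings) would consolidate them before one-hot encoding, so each real category gets a single indicator column:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the notebook's df.
sample = pd.DataFrame({
    'PreferredLoginDevice': ['Phone', 'Mobile Phone', 'Computer'],
    'PreferredPaymentMode': ['CC', 'COD', 'Credit Card'],
})

# Assumed synonym mappings; verify against the data dictionary before
# applying them for real.
replacements = {
    'PreferredLoginDevice': {'Phone': 'Mobile Phone'},
    'PreferredPaymentMode': {'CC': 'Credit Card',
                             'COD': 'Cash on Delivery'},
}
sample = sample.replace(replacements)

print(sample['PreferredLoginDevice'].unique())  # consolidated labels
```

Without this step, the encoder below treats 'CC' and 'Credit Card' as unrelated features, diluting whatever signal the payment mode carries.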

0.3 Data visualization
Visualize the data distributions, relationships between variables, and the target variable's relationship with other features.
[5]: import matplotlib.pyplot as plt
import seaborn as sns

# Distributions
numerical_cols = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp',
                  'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount',
                  'DaySinceLastOrder', 'CashbackAmount']

for col in numerical_cols:
    plt.figure(figsize=(8, 6))
    sns.histplot(df[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

categorical_cols = ['PreferredLoginDevice', 'CityTier', 'PreferredPaymentMode',
                    'Gender', 'PreferedOrderCat', 'SatisfactionScore',
                    'MaritalStatus', 'Complain']

for col in categorical_cols:
    plt.figure(figsize=(8, 6))
    df[col].value_counts().plot(kind='bar')
    plt.title(f'Distribution of {col}')
    plt.show()

# Relationships between variables
sns.pairplot(df[numerical_cols], diag_kind='kde')
plt.show()

# Target Variable Relationships
for col in numerical_cols:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x='Churn', y=col, data=df)
    plt.title(f'{col} vs Churn')
    plt.show()

for col in categorical_cols:
    plt.figure(figsize=(8, 6))
    churn_rates = df.groupby(col)['Churn'].mean()
    churn_rates.plot(kind='bar')
    plt.title(f'Churn Rate by {col}')
    plt.show()

# Outlier Detection
for col in numerical_cols:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=col, data=df)
    plt.title(f'Boxplot of {col} for Outlier Detection')
    plt.show()

0.4 Data cleaning
[6]: import pandas as pd
import numpy as np

# Impute missing values
numerical_cols = df.select_dtypes(include=np.number).columns
categorical_cols = df.select_dtypes(exclude=np.number).columns

for col in numerical_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].median())

for col in categorical_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])

# Outlier handling: clip to fences built from the 5th and 95th percentiles
# (a wider variant of the classic IQR rule, which uses the 25th and 75th)
for col in numerical_cols:
    q_low = df[col].quantile(0.05)
    q_high = df[col].quantile(0.95)
    spread = q_high - q_low
    lower_bound = q_low - 1.5 * spread
    upper_bound = q_high + 1.5 * spread
    df[col] = np.clip(df[col], lower_bound, upper_bound)

display(df.head())

CustomerID Churn Tenure PreferredLoginDevice CityTier WarehouseToHome \


0 50001 1 4.0 Mobile Phone 3 6.0
1 50002 1 9.0 Phone 1 8.0
2 50003 1 9.0 Phone 1 30.0
3 50004 1 0.0 Phone 3 15.0
4 50005 1 0.0 Phone 1 12.0

PreferredPaymentMode Gender HourSpendOnApp NumberOfDeviceRegistered \


0 Debit Card Female 3.0 3
1 UPI Male 3.0 4
2 Debit Card Male 2.0 4
3 Debit Card Male 2.0 4
4 CC Male 3.0 3

PreferedOrderCat SatisfactionScore MaritalStatus NumberOfAddress \


0 Laptop & Accessory 2 Single 9
1 Mobile 3 Single 7
2 Mobile 3 Single 6
3 Laptop & Accessory 5 Single 8
4 Mobile 5 Single 3

Complain OrderAmountHikeFromlastYear CouponUsed OrderCount \


0 1 11.0 1.0 1.0
1 1 15.0 0.0 1.0
2 1 14.0 0.0 1.0
3 0 23.0 0.0 1.0
4 0 11.0 1.0 1.0

DaySinceLastOrder CashbackAmount
0 5.0 160
1 0.0 121
2 3.0 120
3 3.0 134
4 3.0 130
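After imputation and clipping, it is worth verifying that no NaNs remain and that every value sits inside the computed fences. A small self-contained sketch of that check on a toy column (the real notebook would loop over `df[numerical_cols]`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for one numeric column: a missing value and an outlier.
col = pd.Series([4.0, np.nan, 9.0, 0.0, 200.0])

col = col.fillna(col.median())  # median imputation, as in the cell above

# Same fence rule as above: 5th/95th percentiles widened by 1.5x the spread.
q_low, q_high = col.quantile(0.05), col.quantile(0.95)
spread = q_high - q_low
lower, upper = q_low - 1.5 * spread, q_high + 1.5 * spread
col = col.clip(lower, upper)

# Sanity checks: imputation and clipping both took effect.
assert col.isnull().sum() == 0
assert col.between(lower, upper).all()
```

Running the equivalent assertions on `df` before moving on catches any column the loops silently skipped.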

0.5 Data preparation
[7]: from sklearn.preprocessing import OneHotEncoder

# Identify categorical columns ('Churn' is numeric, so it is not included)
categorical_cols = df.select_dtypes(exclude=['number', 'bool']).columns.tolist()

# Create the OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform the categorical features
encoded_features = encoder.fit_transform(df[categorical_cols])

# Create a new DataFrame with the encoded features
encoded_df = pd.DataFrame(
    encoded_features,
    columns=encoder.get_feature_names_out(categorical_cols))

# Note: df.drop(columns=categorical_cols) already keeps the numeric 'Churn',
# so appending df['Churn'] again duplicates the label; later cells therefore
# select the target with .iloc[:, 0]
prepared_df = pd.concat([df.drop(columns=categorical_cols), encoded_df,
                         df['Churn']], axis=1)

display(prepared_df.head())

CustomerID Churn Tenure CityTier WarehouseToHome HourSpendOnApp \


0 50001 1 4.0 3 6.0 3.0
1 50002 1 9.0 1 8.0 3.0
2 50003 1 9.0 1 30.0 2.0
3 50004 1 0.0 3 15.0 2.0
4 50005 1 0.0 1 12.0 3.0

NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain \


0 3 2 9 1
1 4 3 7 1
2 4 3 6 1
3 4 5 8 0
4 3 5 3 0

… PreferedOrderCat_Fashion PreferedOrderCat_Grocery \
0 … 0.0 0.0
1 … 0.0 0.0
2 … 0.0 0.0
3 … 0.0 0.0
4 … 0.0 0.0

PreferedOrderCat_Laptop & Accessory PreferedOrderCat_Mobile \


0 1.0 0.0
1 0.0 1.0
2 0.0 1.0

3 1.0 0.0
4 0.0 1.0

PreferedOrderCat_Mobile Phone PreferedOrderCat_Others \


0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0

MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single Churn


0 0.0 0.0 1.0 1
1 0.0 0.0 1.0 1
2 0.0 0.0 1.0 1
3 0.0 0.0 1.0 1
4 0.0 0.0 1.0 1

[5 rows x 37 columns]
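The `OneHotEncoder` route above has the advantage of a fitted encoder that can be reused on new data at inference time. For one-off analysis, `pd.get_dummies` gives an equivalent expansion in a single call; a sketch on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame with one categorical and one numeric column.
toy = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                    'Tenure': [4.0, 9.0, 0.0]})

# Numeric columns pass through untouched; the listed object columns are
# expanded into one indicator column per category.
encoded = pd.get_dummies(toy, columns=['Gender'], dtype=float)
print(sorted(encoded.columns))
```

The trade-off: `get_dummies` has no memory of the training categories, so a category seen only at inference time would silently produce a different column set, whereas the fitted encoder with `handle_unknown='ignore'` keeps the schema stable.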

0.6 Feature engineering


[8]: # Interaction Features
prepared_df['Tenure_OrderAmountInteraction'] = (
    prepared_df['Tenure'] * prepared_df['OrderAmountHikeFromlastYear'])
prepared_df['HourSpend_OrderCountInteraction'] = (
    prepared_df['HourSpendOnApp'] * prepared_df['OrderCount'])

# Ratio Features
prepared_df['Cashback_OrderAmountRatio'] = (
    prepared_df['CashbackAmount'] / prepared_df['OrderAmountHikeFromlastYear'])
prepared_df['CouponUsed_OrderCountRatio'] = (
    prepared_df['CouponUsed'] / prepared_df['OrderCount'])

# Combined Features
prepared_df['CustomerExperienceScore'] = (
    prepared_df['SatisfactionScore'] * (1 - prepared_df['Complain']))

# Polynomial Features
prepared_df['TenureSquared'] = prepared_df['Tenure'] ** 2

# For simplicity, print the correlation matrix without sorting
numerical_features = prepared_df.select_dtypes(include=['number'])
corr_matrix = numerical_features.corr()

# Print the correlation of each feature with Churn
print("Correlation with Churn:")
print(corr_matrix.loc[:, 'Churn'])

# Further evaluation with visualization can be added if needed
display(prepared_df.head())

Correlation with Churn:


Churn Churn
CustomerID -0.019083 -0.019083
Churn 1.000000 1.000000
Tenure -0.337831 -0.337831
CityTier 0.084703 0.084703
WarehouseToHome 0.072331 0.072331
HourSpendOnApp 0.018816 0.018816
NumberOfDeviceRegistered 0.107939 0.107939
SatisfactionScore 0.105481 0.105481
NumberOfAddress 0.043931 0.043931
Complain 0.250188 0.250188
OrderAmountHikeFromlastYear -0.007075 -0.007075
CouponUsed -0.001602 -0.001602
OrderCount -0.024038 -0.024038
DaySinceLastOrder -0.159446 -0.159446
CashbackAmount -0.154161 -0.154161
PreferredLoginDevice_Computer 0.051099 0.051099
PreferredLoginDevice_Mobile Phone -0.111639 -0.111639
PreferredLoginDevice_Phone 0.078916 0.078916
PreferredPaymentMode_CC 0.028796 0.028796
PreferredPaymentMode_COD 0.083933 0.083933
PreferredPaymentMode_Cash on Delivery -0.006178 -0.006178
PreferredPaymentMode_Credit Card -0.064131 -0.064131
PreferredPaymentMode_Debit Card -0.032453 -0.032453
PreferredPaymentMode_E wallet 0.055751 0.055751
PreferredPaymentMode_UPI 0.004163 0.004163
Gender_Female -0.029264 -0.029264
Gender_Male 0.029264 0.029264
PreferedOrderCat_Fashion -0.014871 -0.014871
PreferedOrderCat_Grocery -0.089575 -0.089575
PreferedOrderCat_Laptop & Accessory -0.133353 -0.133353
PreferedOrderCat_Mobile 0.113364 0.113364
PreferedOrderCat_Mobile Phone 0.154387 0.154387
PreferedOrderCat_Others -0.054903 -0.054903
MaritalStatus_Divorced -0.024934 -0.024934
MaritalStatus_Married -0.151024 -0.151024
MaritalStatus_Single 0.180847 0.180847
Churn 1.000000 1.000000
Tenure_OrderAmountInteraction -0.321028 -0.321028
HourSpend_OrderCountInteraction -0.024667 -0.024667
Cashback_OrderAmountRatio -0.110923 -0.110923
CouponUsed_OrderCountRatio 0.001892 0.001892

CustomerExperienceScore -0.144989 -0.144989
TenureSquared -0.239627 -0.239627
CustomerID Churn Tenure CityTier WarehouseToHome HourSpendOnApp \
0 50001 1 4.0 3 6.0 3.0
1 50002 1 9.0 1 8.0 3.0
2 50003 1 9.0 1 30.0 2.0
3 50004 1 0.0 3 15.0 2.0
4 50005 1 0.0 1 12.0 3.0

NumberOfDeviceRegistered SatisfactionScore NumberOfAddress Complain \


0 3 2 9 1
1 4 3 7 1
2 4 3 6 1
3 4 5 8 0
4 3 5 3 0

… MaritalStatus_Divorced MaritalStatus_Married MaritalStatus_Single \


0 … 0.0 0.0 1.0
1 … 0.0 0.0 1.0
2 … 0.0 0.0 1.0
3 … 0.0 0.0 1.0
4 … 0.0 0.0 1.0

Churn Tenure_OrderAmountInteraction HourSpend_OrderCountInteraction \


0 1 44.0 3.0
1 1 135.0 3.0
2 1 126.0 2.0
3 1 0.0 2.0
4 1 0.0 3.0

Cashback_OrderAmountRatio CouponUsed_OrderCountRatio \
0 14.545455 1.0
1 8.066667 0.0
2 8.571429 0.0
3 5.826087 0.0
4 11.818182 1.0

CustomerExperienceScore TenureSquared
0 0 16.0
1 0 81.0
2 0 81.0
3 5 0.0
4 5 0.0

[5 rows x 43 columns]
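The ratio features divide by `OrderAmountHikeFromlastYear` (minimum 11 in this dataset) and `OrderCount` (minimum 1), so no zero denominators occur here, but the guard is cheap insurance if the preprocessing ever changes. A sketch on a toy frame where the denominator can be zero:

```python
import numpy as np
import pandas as pd

# Toy stand-in: a denominator column that contains a zero.
toy = pd.DataFrame({'CouponUsed': [1.0, 0.0, 2.0],
                    'OrderCount': [2.0, 0.0, 4.0]})

# Pandas division by zero yields inf (or NaN for 0/0) rather than raising;
# map both to 0 so downstream models never see non-finite values.
ratio = (toy['CouponUsed'] / toy['OrderCount']).replace(
    [np.inf, -np.inf], 0).fillna(0)
```

A quick `np.isfinite(prepared_df[ratio_cols]).all()` check after feature engineering would catch the same problem in the real pipeline.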

0.7 Data splitting
[9]: from sklearn.model_selection import train_test_split

# Define features (X) and target (y); dropping 'Churn' removes both
# duplicate columns, while selecting it returns a two-column frame
X = prepared_df.drop('Churn', axis=1)
y = prepared_df['Churn']

# Split data into training and temporary sets (validation + testing)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Split temporary set into validation and testing sets
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
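The two-stage split yields roughly 70% train, 15% validation, and 15% test, with `stratify` preserving the ~17% churn rate in each piece. A self-contained sketch verifying both properties on synthetic data with the same shape of imbalance:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with ~17% positives, like the dataset above.
rng = np.random.default_rng(42)
y = pd.Series(rng.random(1000) < 0.17).astype(int)
X = pd.DataFrame({'x': rng.random(1000)})

# Same two-stage scheme as the cell above: 70/30, then 50/50 on the remainder.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

print(len(X_tr), len(X_va), len(X_te))  # 700 150 150
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # near-identical rates
```

Stratifying both stages matters: with only ~17% positives, an unstratified 15% test slice could drift noticeably in churn rate by chance.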

0.8 Model training


[19]: from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Instantiate the models
logreg_model = LogisticRegression()
rf_model = RandomForestClassifier()
xgb_model = XGBClassifier()

# Train the models; y_train carries two identical 'Churn' columns
# (see the data preparation step), so select the first with .iloc[:, 0]
logreg_model.fit(X_train, y_train.iloc[:, 0])
rf_model.fit(X_train, y_train.iloc[:, 0])
xgb_model.fit(X_train, y_train.iloc[:, 0])

/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_logistic.py:465:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-
regression
n_iter_i = _check_optimize_result(

[19]: XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=None, …)
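The ConvergenceWarning above is the lbfgs solver hitting its iteration limit on unscaled features. Besides raising `max_iter` (as the optimization cell below does), standardizing the inputs usually resolves it; a sketch using a `Pipeline` on synthetic data, since the notebook's `X_train` is not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's training matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Scaling first typically lets lbfgs converge well within its default
# iteration budget, avoiding the warning shown above.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
print(model.score(X, y))
```

Bundling the scaler into the pipeline also guarantees the same transform is applied at prediction time, which matters once the model is evaluated on the held-out validation and test sets.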

0.9 Model Optimization


[23]: from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.metrics import (f1_score, precision_recall_curve, auc,
                             roc_auc_score, accuracy_score, precision_score,
                             recall_score, confusion_matrix)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Apply SMOTE to handle class imbalance
def apply_smote(X_train, y_train, random_state=42):
    print("Before SMOTE - Class distribution:")
    print(y_train.iloc[:, 0].value_counts(normalize=True) * 100)

    # Apply SMOTE
    smote = SMOTE(random_state=random_state)
    X_train_smote, y_train_smote = smote.fit_resample(X_train,
                                                      y_train.iloc[:, 0])

    # Convert back to DataFrame if needed
    if isinstance(X_train, pd.DataFrame):
        X_train_smote = pd.DataFrame(X_train_smote, columns=X_train.columns)
    y_train_smote = pd.DataFrame(y_train_smote, columns=['Churn'])

    print("After SMOTE - Class distribution:")
    print(y_train_smote.iloc[:, 0].value_counts(normalize=True) * 100)

    return X_train_smote, y_train_smote

# Apply SMOTE to the training data
X_train_smote, y_train_smote = apply_smote(X_train, y_train)

# Create a small validation set from the resampled training data for
# early stopping in XGBoost
X_train_es, X_valid_es, y_train_es, y_valid_es = train_test_split(
    X_train_smote, y_train_smote.iloc[:, 0], test_size=0.2, random_state=42,
    stratify=y_train_smote.iloc[:, 0]
)

# Define the hyperparameter search space for each model
space_logreg = {
    'C': hp.loguniform('C', -5, 5),
    'penalty': 'l2',  # use only the 'l2' penalty
    'solver': 'liblinear',  # explicitly set the solver to liblinear
    'class_weight': hp.choice('class_weight', [None, 'balanced'])
}

space_rf = {
    'n_estimators': hp.quniform('n_estimators', 50, 200, 10),
    'max_depth': hp.quniform('max_depth', 5, 20, 1),
    'min_samples_split': hp.quniform('min_samples_split', 2, 10, 1),
    'class_weight': hp.choice('class_weight',
                              [None, 'balanced', 'balanced_subsample'])
}

space_xgb = {
    'n_estimators': hp.quniform('n_estimators', 50, 200, 10),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'subsample': hp.uniform('subsample', 0.5, 1),
    # SMOTE already balanced the classes, so only mild weights are offered
    'scale_pos_weight': hp.choice('scale_pos_weight', [1, 5])
}

def find_optimal_threshold(model, X_val, y_val):
    # Get predicted probabilities
    y_prob = model.predict_proba(X_val)[:, 1]

    # Try different thresholds and keep the one with the best F1
    thresholds = np.linspace(0.1, 0.9, 100)
    best_f1 = 0
    best_threshold = 0.5

    for threshold in thresholds:
        y_pred = (y_prob >= threshold).astype(int)
        f1 = f1_score(y_val.iloc[:, 0], y_pred)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold

    return best_threshold, best_f1

def evaluate_model(model, X_val, y_val, find_threshold=True):
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate various metrics
    accuracy = accuracy_score(y_val.iloc[:, 0], y_pred)
    precision_val = precision_score(y_val.iloc[:, 0], y_pred)
    recall_val = recall_score(y_val.iloc[:, 0], y_pred)
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    roc_auc = roc_auc_score(y_val.iloc[:, 0], y_prob)

    # Calculate PR-AUC
    precision_curve, recall_curve, _ = precision_recall_curve(
        y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall_curve, precision_curve)

    result = {
        'accuracy': accuracy,
        'precision': precision_val,
        'recall': recall_val,
        'f1': f1,
        'roc_auc': roc_auc,
        'pr_auc': pr_auc,
    }

    # Find optimal threshold if requested
    if find_threshold:
        best_threshold, best_f1 = find_optimal_threshold(model, X_val, y_val)
        result['best_threshold'] = best_threshold
        result['best_f1'] = best_f1

    return result

def objective_logreg(params):
    model = LogisticRegression(**params, max_iter=1000)
    model.fit(X_train_smote, y_train_smote.iloc[:, 0])

    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics, weighting PR-AUC more heavily since it is better
    # suited to imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc

    return {'loss': -combined_score, 'status': STATUS_OK,
            'f1': f1, 'pr_auc': pr_auc}

def objective_rf(params):
    # Convert float parameters to int
    params = {
        'n_estimators': int(params['n_estimators']),
        'max_depth': int(params['max_depth']),
        'min_samples_split': int(params['min_samples_split']),
        'class_weight': params['class_weight']
    }

    model = RandomForestClassifier(**params)
    model.fit(X_train_smote, y_train_smote.iloc[:, 0])

    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics, weighting PR-AUC more heavily since it is better
    # suited to imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc

    return {'loss': -combined_score, 'status': STATUS_OK,
            'f1': f1, 'pr_auc': pr_auc}

def objective_xgb(params):
    # Convert float parameters to int where needed
    params = {
        'n_estimators': int(params['n_estimators']),
        'learning_rate': params['learning_rate'],
        'max_depth': int(params['max_depth']),
        'subsample': params['subsample'],
        'scale_pos_weight': params['scale_pos_weight']
    }

    model = XGBClassifier(
        **params,
        eval_metric='logloss',
        verbosity=0
    )

    # Create evaluation set for monitoring
    eval_set = [(X_valid_es, y_valid_es)]

    # Fit with the evaluation set (note: without early_stopping_rounds,
    # no actual early stopping takes place)
    model.fit(
        X_train_es,
        y_train_es,
        eval_set=eval_set,
        verbose=False
    )

    # Evaluate on validation set
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics, weighting PR-AUC more heavily since it is better
    # suited to imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc

    return {'loss': -combined_score, 'status': STATUS_OK,
            'f1': f1, 'pr_auc': pr_auc}

# Perform the hyperparameter search
print("Starting Logistic Regression optimization...")
trials_logreg = Trials()
best_params_logreg = fmin(fn=objective_logreg, space=space_logreg,
                          algo=tpe.suggest, max_evals=50,
                          trials=trials_logreg)

print("Starting Random Forest optimization...")
trials_rf = Trials()
best_params_rf = fmin(fn=objective_rf, space=space_rf, algo=tpe.suggest,
                      max_evals=50, trials=trials_rf)

print("Starting XGBoost optimization...")
trials_xgb = Trials()
best_params_xgb = fmin(fn=objective_xgb, space=space_xgb, algo=tpe.suggest,
                       max_evals=50, trials=trials_xgb)

# Print the best hyperparameters (hp.choice returns an index, so map it
# back to the corresponding option)
print("\nBest hyperparameters for Logistic Regression:")
final_params_logreg = {
    'C': np.exp(best_params_logreg['C']),
    'penalty': 'l2',
    'solver': 'liblinear',
    'class_weight': [None, 'balanced'][best_params_logreg['class_weight']]
}
print(final_params_logreg)

print("\nBest hyperparameters for Random Forest:")
final_params_rf = {
    'n_estimators': int(best_params_rf['n_estimators']),
    'max_depth': int(best_params_rf['max_depth']),
    'min_samples_split': int(best_params_rf['min_samples_split']),
    'class_weight': [None, 'balanced',
                     'balanced_subsample'][best_params_rf['class_weight']]
}
print(final_params_rf)

print("\nBest hyperparameters for XGBoost:")
final_params_xgb = {
    'n_estimators': int(best_params_xgb['n_estimators']),
    'learning_rate': np.exp(best_params_xgb['learning_rate']),
    'max_depth': int(best_params_xgb['max_depth']),
    'subsample': best_params_xgb['subsample'],
    'scale_pos_weight': [1, 5][best_params_xgb['scale_pos_weight']]
}
print(final_params_xgb)

# Train the final models with the best hyperparameters
print("\nTraining final models with best hyperparameters...")

# Logistic Regression (max_iter raised to prevent convergence warnings)
final_logreg = LogisticRegression(**final_params_logreg, max_iter=1000)
final_logreg.fit(X_train_smote, y_train_smote.iloc[:, 0])
logreg_metrics = evaluate_model(final_logreg, X_val, y_val)

# Random Forest
final_rf = RandomForestClassifier(**final_params_rf)
final_rf.fit(X_train_smote, y_train_smote.iloc[:, 0])
rf_metrics = evaluate_model(final_rf, X_val, y_val)

# XGBoost
final_xgb = XGBClassifier(**final_params_xgb, eval_metric='logloss',
                          verbosity=0)
final_xgb.fit(X_train_smote, y_train_smote.iloc[:, 0])
xgb_metrics = evaluate_model(final_xgb, X_val, y_val)

# Print the final evaluation metrics


print("\nFinal Logistic Regression metrics:")
print(f"Accuracy: {logreg_metrics['accuracy']:.4f}")
print(f"Precision: {logreg_metrics['precision']:.4f}")
print(f"Recall: {logreg_metrics['recall']:.4f}")
print(f"F1 Score: {logreg_metrics['f1']:.4f}")
print(f"ROC-AUC: {logreg_metrics['roc_auc']:.4f}")
print(f"PR-AUC: {logreg_metrics['pr_auc']:.4f}")
print(f"Optimal Threshold: {logreg_metrics['best_threshold']:.4f} (F1:␣
↪{logreg_metrics['best_f1']:.4f})")

print("\nFinal Random Forest metrics:")


print(f"Accuracy: {rf_metrics['accuracy']:.4f}")
print(f"Precision: {rf_metrics['precision']:.4f}")
print(f"Recall: {rf_metrics['recall']:.4f}")
print(f"F1 Score: {rf_metrics['f1']:.4f}")
print(f"ROC-AUC: {rf_metrics['roc_auc']:.4f}")
print(f"PR-AUC: {rf_metrics['pr_auc']:.4f}")
print(f"Optimal Threshold: {rf_metrics['best_threshold']:.4f} (F1:␣
↪{rf_metrics['best_f1']:.4f})")

print("\nFinal XGBoost metrics:")


print(f"Accuracy: {xgb_metrics['accuracy']:.4f}")
print(f"Precision: {xgb_metrics['precision']:.4f}")
print(f"Recall: {xgb_metrics['recall']:.4f}")
print(f"F1 Score: {xgb_metrics['f1']:.4f}")
print(f"ROC-AUC: {xgb_metrics['roc_auc']:.4f}")
print(f"PR-AUC: {xgb_metrics['pr_auc']:.4f}")
print(f"Optimal Threshold: {xgb_metrics['best_threshold']:.4f} (F1:␣
↪{xgb_metrics['best_f1']:.4f})")

# Create results dictionary in the required format


results = {}
models = {
'Logistic Regression': final_logreg,
'Random Forest': final_rf,
'XGBoost': final_xgb
}

82
metrics_mapping = {
'Logistic Regression': logreg_metrics,
'Random Forest': rf_metrics,
'XGBoost': xgb_metrics
}

# Store evaluation results


for model_name, model in models.items():
metrics = metrics_mapping[model_name]
results[model_name] = {
"Accuracy": metrics['accuracy'],
"Precision": metrics['precision'],
"Recall": metrics['recall'],
"F1-Score": metrics['f1'],
"AUC": metrics['roc_auc']
}

# Select the best model based on F1-score


best_model_name = max(results, key=lambda k: results[k]["F1-Score"])
best_model = models[best_model_name]
best_metrics = metrics_mapping[best_model_name]

print(f"\nBest model based on F1-Score: {best_model_name}")


print(f"F1-Score: {results[best_model_name]['F1-Score']:.4f}")
print(f"Optimal threshold: {best_metrics['best_threshold']:.4f}")

# Function to make predictions with threshold


def predict_with_threshold(model, X, threshold=0.5):
    y_prob = model.predict_proba(X)[:, 1]
    return (y_prob >= threshold).astype(int)
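The "Optimal Threshold" values reported above come from `evaluate_model`, defined earlier in the notebook. The core idea — sweep candidate thresholds and keep the one that maximizes F1 — can be sketched independently; `find_best_threshold` below is an illustrative helper, not the notebook's implementation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def find_best_threshold(y_true, y_prob):
    """Return the probability threshold that maximizes F1 (sketch)."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall have one more entry than thresholds; drop the final
    # (1, 0) point so every F1 value lines up with a concrete threshold.
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    best = int(np.argmax(f1))
    return thresholds[best], f1[best]
```

Predictions at the chosen threshold can then be made with `predict_with_threshold` as above.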

# Add code to evaluate on test set if available


try:
    # Evaluate the best model on the test set
    test_metrics = evaluate_model(best_model, X_test, y_test, find_threshold=False)

    print(f"\nPerformance of the Best Model ({best_model_name}) on the Test Set:")
    print(f"Accuracy: {test_metrics['accuracy']:.4f}")
    print(f"Precision: {test_metrics['precision']:.4f}")
    print(f"Recall: {test_metrics['recall']:.4f}")
    print(f"F1-Score: {test_metrics['f1']:.4f}")
    print(f"AUC: {test_metrics['roc_auc']:.4f}")

    # Use the optimized threshold for predictions
    y_pred_optimized = predict_with_threshold(best_model, X_test,
                                              threshold=best_metrics['best_threshold'])
    opt_f1 = f1_score(y_test.iloc[:, 0], y_pred_optimized)
    print(f"F1-Score with optimized threshold ({best_metrics['best_threshold']:.4f}): {opt_f1:.4f}")

    # Plot the confusion matrix
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test.iloc[:, 0], y_pred_optimized)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0, 1], yticklabels=[0, 1])
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title('Confusion Matrix')
    plt.show()

except NameError:
    print("\nTest set not found. To evaluate on a test set, define X_test and y_test variables.")

Before SMOTE - Class distribution:


Churn
0 83.151484
1 16.848516
Name: proportion, dtype: float64
After SMOTE - Class distribution:
Churn
0 50.0
1 50.0
Name: proportion, dtype: float64
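The class-distribution printout above reflects the SMOTE oversampling applied earlier in the notebook (via imbalanced-learn). SMOTE's core interpolation idea can be sketched with NumPy alone; the toy helper below is illustrative, not the library's implementation:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Generate synthetic minority-class rows by interpolating each picked
    point toward one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest-neighbour indices per row
    base = rng.integers(0, len(X_min), n_new)   # base point for each synthetic row
    nbr = nn[base, rng.integers(0, k, n_new)]   # one random neighbour of that base
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])
```

Every synthetic row lies on a segment between two real minority points, which is why the resampled 50/50 split above contains no out-of-range feature values.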
Starting Logistic Regression optimization…
100%|██████████| 50/50 [00:15<00:00, 3.21trial/s, best loss:
-0.5831375717140068]
Starting Random Forest optimization…
100%|██████████| 50/50 [01:41<00:00, 2.03s/trial, best loss:
-0.8692813211066912]
Starting XGBoost optimization…
100%|██████████| 50/50 [01:29<00:00, 1.79s/trial, best loss:
-0.8848868863872488]

Best hyperparameters for Logistic Regression:


{'C': np.float64(1.0067670502521642), 'penalty': 'l2', 'solver': 'liblinear',
'class_weight': 'balanced'}

Best hyperparameters for Random Forest:


{'n_estimators': 180, 'max_depth': 18, 'min_samples_split': 2, 'class_weight':
None}

Best hyperparameters for XGBoost:
{'n_estimators': 110, 'learning_rate': np.float64(1.3764408103898214),
'max_depth': 8, 'subsample': np.float64(0.9104790831666846), 'scale_pos_weight':
5}
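The searches above used Hyperopt's TPE algorithm (hence the `trial/s` progress bars). The overall loop — sample hyperparameters, cross-validate, keep the best — can be conveyed with a plain random search. This sketch uses synthetic data and a reduced trial count, and is not the notebook's search code:

```python
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

rng = random.Random(0)
best_score, best_params = -1.0, None
for _ in range(5):  # 5 trials keeps the sketch fast; the notebook ran 50 per model
    params = {"n_estimators": rng.randrange(50, 200, 10),
              "max_depth": rng.randrange(4, 20)}
    model = RandomForestClassifier(random_state=0, **params)
    score = cross_val_score(model, X_demo, y_demo, cv=3, scoring="f1").mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 4))
```

TPE differs by modeling which regions of the space have produced good trials and sampling there preferentially, but the evaluate-and-keep-best skeleton is the same.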

Training final models with best hyperparameters…

Final Logistic Regression metrics:


Accuracy: 0.7725
Precision: 0.4053
Recall: 0.7535
F1 Score: 0.5271
ROC-AUC: 0.8499
PR-AUC: 0.5991
Optimal Threshold: 0.7465 (F1: 0.5776)

Final Random Forest metrics:


Accuracy: 0.9408
Precision: 0.8651
Recall: 0.7676
F1 Score: 0.8134
ROC-AUC: 0.9727
PR-AUC: 0.8824
Optimal Threshold: 0.4960 (F1: 0.8266)

Final XGBoost metrics:


Accuracy: 0.9360
Precision: 0.8014
Recall: 0.8239
F1 Score: 0.8125
ROC-AUC: 0.9565
PR-AUC: 0.8824
Optimal Threshold: 0.4798 (F1: 0.8166)

Best model based on F1-Score: Random Forest


F1-Score: 0.8134
Optimal threshold: 0.4960

Performance of the Best Model (Random Forest) on the Test Set:


Accuracy: 0.9491
Precision: 0.8898
Recall: 0.7958
F1-Score: 0.8401
AUC: 0.9832
F1-Score with optimized threshold (0.4960): 0.8444

0.10 Model Analysis
[24]: # Feature Importance
if isinstance(best_model, (RandomForestClassifier, XGBClassifier)):
    importances = best_model.feature_importances_
    feature_names = X_train.columns
    feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

    print("\nFeature Importances:")
    print(feature_importances)

    # Visualize feature importances
    plt.figure(figsize=(10, 6))
    sns.barplot(x=feature_importances, y=feature_importances.index)
    plt.title("Feature Importances")
    plt.xlabel("Importance Score")
    plt.ylabel("Feature")
    plt.show()
else:
    print("Feature importance is not directly available for this model type.")

Feature Importances:
Tenure 0.115815
TenureSquared 0.103631
Tenure_OrderAmountInteraction 0.082694
MaritalStatus_Single 0.063758
MaritalStatus_Married 0.044249
PreferedOrderCat_Laptop & Accessory 0.043390
CustomerExperienceScore 0.037941
DaySinceLastOrder 0.030060
CashbackAmount 0.029917
PreferredLoginDevice_Mobile Phone 0.028004
WarehouseToHome 0.026585
Gender_Male 0.023279
PreferredPaymentMode_Debit Card 0.022893
Cashback_OrderAmountRatio 0.022797
PreferredLoginDevice_Computer 0.022608
Gender_Female 0.022308
PreferredPaymentMode_Credit Card 0.022090
CustomerID 0.021090
NumberOfAddress 0.020599
OrderAmountHikeFromlastYear 0.018337
PreferedOrderCat_Mobile Phone 0.017166
SatisfactionScore 0.016677
CouponUsed_OrderCountRatio 0.014937
PreferedOrderCat_Mobile 0.014571
HourSpend_OrderCountInteraction 0.014262
CouponUsed 0.013399
PreferredLoginDevice_Phone 0.013123
OrderCount 0.011329
NumberOfDeviceRegistered 0.010503
PreferredPaymentMode_E wallet 0.010039
HourSpendOnApp 0.009492
CityTier 0.008710
PreferedOrderCat_Fashion 0.008695
Complain 0.008372
MaritalStatus_Divorced 0.008145
PreferredPaymentMode_COD 0.004896
PreferredPaymentMode_CC 0.004189
PreferredPaymentMode_UPI 0.004181
PreferredPaymentMode_Cash on Delivery 0.001825
PreferedOrderCat_Grocery 0.001813

PreferedOrderCat_Others 0.001628
dtype: float64

[25]: from sklearn.metrics import roc_curve, auc


import matplotlib.pyplot as plt

# Get predicted probabilities for the validation set


y_pred_prob = best_model.predict_proba(X_val)[:, 1]

# Calculate ROC curve and AUC


fpr, tpr, thresholds = roc_curve(y_val.iloc[:, 0], y_pred_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC curve


plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')


plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

[26]: from sklearn.metrics import confusion_matrix
import seaborn as sns

# Get predictions for the validation set


y_pred = best_model.predict(X_val)

# Create confusion matrix


cm = confusion_matrix(y_val.iloc[:, 0], y_pred)

# Plot confusion matrix


plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

0.11 SHAP Analysis
[27]: import shap

# Create a SHAP explainer for the best model


explainer = shap.Explainer(best_model)

# Calculate SHAP values for a subset of the data


shap_values = explainer(X_test)

# Visualize SHAP values


shap.summary_plot(shap_values, X_test)

<Figure size 640x480 with 0 Axes>

[28]: # Individual SHAP plots
shap.plots.waterfall(shap_values[0, :, 1])  # first instance in X_test, class 1 (Churn)

# Dependence plots
shap.dependence_plot("Tenure", shap_values.values[:, :, 1], X_test)  # SHAP values for class 1
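The summary plot above orders features by mean absolute SHAP value. That ordering can be reproduced directly from an attribution matrix; the array below is a synthetic stand-in, not the notebook's `shap_values`:

```python
import numpy as np
import pandas as pd

def rank_by_mean_abs_shap(attributions, feature_names):
    # Mean |attribution| per feature -- the statistic behind summary-plot order.
    return (pd.Series(np.abs(attributions).mean(axis=0), index=feature_names)
              .sort_values(ascending=False))

# Synthetic attribution matrix: 2 instances x 3 features
attr = np.array([[0.5, -0.1, 0.0],
                 [-0.4, 0.2, 0.1]])
ranking = rank_by_mean_abs_shap(attr, ["Tenure", "CashbackAmount", "CityTier"])
print(ranking)
```

Applying the same statistic to the real `shap_values` would recover the Tenure-first ordering visible in the plot.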

0.12 Recommendations
[31]: report = f"""
# Customer Churn Prediction Report

## Model Performance Comparison

We evaluated three different models: Logistic Regression, Random Forest, and␣


↪XGBoost. The models were tuned using hyperparameter optimization to maximize␣

↪their performance. The table below summarizes the results on the validation␣

↪dataset.

| Model | Accuracy | Precision | Recall | F1-Score | AUC |


|-----------------------|----------|-----------|---------|----------|--------|
| Logistic Regression | {results['Logistic Regression']['Accuracy']:.4f} |␣
↪{results['Logistic Regression']['Precision']:.4f} | {results['Logistic␣

↪Regression']['Recall']:.4f} | {results['Logistic Regression']['F1-Score']:.

↪4f} | {results['Logistic Regression']['AUC']:.4f} |

92
| Random Forest | {results['Random Forest']['Accuracy']:.4f} |␣
↪{results['Random Forest']['Precision']:.4f} | {results['Random␣

↪Forest']['Recall']:.4f} | {results['Random Forest']['F1-Score']:.4f} |␣

↪{results['Random Forest']['AUC']:.4f} |

| XGBoost | {results['XGBoost']['Accuracy']:.4f} |␣
↪{results['XGBoost']['Precision']:.4f} | {results['XGBoost']['Recall']:.4f} |␣

↪{results['XGBoost']['F1-Score']:.4f} | {results['XGBoost']['AUC']:.4f} |

Based on these results, {best_model_name} was chosen as the best-performing␣


↪model due to its high F1-Score. This score reflects a balance between the␣
↪model's ability to correctly identify customers who will churn and its␣

↪accuracy in avoiding false positives.

## Actionable Recommendations

Based on the model's findings and feature importance, we recommend focusing␣


↪retention efforts on the following:

# Top 5 Recommendations Based on SHAP Analysis

# Top 5 Customer Retention Tips Backed by SHAP Insights

## 1. Give Extra Love to New Customers


**What the Data Says:** Customers who’ve been around for less than 6 months are␣
↪the most likely to leave. SHAP values show a strong connection between short␣

↪tenure and higher churn risk.

**What You Can Do:** Create a warm, welcoming experience for new customers—
think personalized onboarding, proactive support, and loyalty perks during␣
↪their first few months.

---

## 2. Reward Long-Term Loyalty


**What the Data Says:** Customers who stick around and spend more over time are␣
↪less likely to churn. SHAP shows tenure and spending patterns are powerful␣

↪predictors.

**What You Can Do:** Launch a loyalty program that gets better with time—offer␣
↪growing rewards or exclusive perks the longer they stay and the more they␣

↪spend.

---

## 3. Don’t Overlook Single Customers

93
**What the Data Says:** Being single is linked to different churn behavior␣
↪compared to married customers, with SHAP highlighting it as a key factor.

**What You Can Do:** Create campaigns and experiences that speak directly to␣
↪single customers’ preferences. Tailor your messaging and offers to better␣

↪match their lifestyle.

---

## 4. Track and Act on Customer Experience


**What the Data Says:** Customer Experience Score has a big influence on␣
↪whether people stay or go. Low scores are a red flag for churn.

**What You Can Do:** Regularly monitor experience scores and respond quickly␣
↪when they dip. Make it easy for customers to give feedback—and show them␣

↪you’re listening.

---

## 5. Make Mobile Work Smoothly


**What the Data Says:** The device people use to log in matters. SHAP analysis␣
↪shows mobile users may have different churn patterns than desktop users.

**What You Can Do:** Review your mobile experience closely. Fix bugs, speed␣
↪things up, and remove any friction that could push mobile users away.

"""
from IPython.display import Markdown
Markdown(report)
[31]:
1 Customer Churn Prediction Report
1.1 Model Performance Comparison
We evaluated three different models: Logistic Regression, Random Forest, and XGBoost. The
models were tuned using hyperparameter optimization to maximize their performance. The table
below summarizes the results on the validation dataset.

Model Accuracy Precision Recall F1-Score AUC


Logistic Regression 0.7725 0.4053 0.7535 0.5271 0.8499
Random Forest 0.9408 0.8651 0.7676 0.8134 0.9727
XGBoost 0.9360 0.8014 0.8239 0.8125 0.9565

Based on these results, Random Forest was chosen as the best-performing model due to its high
F1-Score. This score reflects a balance between the model’s ability to correctly identify customers
who will churn and its accuracy in avoiding false positives.

1.2 Actionable Recommendations
Based on the model’s findings and feature importance, we recommend focusing retention efforts on
the following:

2 Top 5 Customer Retention Tips Backed by SHAP Insights


2.1 1. Give Extra Love to New Customers
What the Data Says: Customers who’ve been around for less than 6 months are the most likely
to leave. SHAP values show a strong connection between short tenure and higher churn risk.
What You Can Do: Create a warm, welcoming experience for new customers—think personalized
onboarding, proactive support, and loyalty perks during their first few months.

2.2 2. Reward Long-Term Loyalty


What the Data Says: Customers who stick around and spend more over time are less likely to
churn. SHAP shows tenure and spending patterns are powerful predictors.
What You Can Do: Launch a loyalty program that gets better with time—offer growing rewards
or exclusive perks the longer they stay and the more they spend.

2.3 3. Don’t Overlook Single Customers


What the Data Says: Being single is linked to different churn behavior compared to married
customers, with SHAP highlighting it as a key factor.
What You Can Do: Create campaigns and experiences that speak directly to single customers’
preferences. Tailor your messaging and offers to better match their lifestyle.

2.4 4. Track and Act on Customer Experience


What the Data Says: Customer Experience Score has a big influence on whether people stay or
go. Low scores are a red flag for churn.
What You Can Do: Regularly monitor experience scores and respond quickly when they dip.
Make it easy for customers to give feedback—and show them you’re listening.

2.5 5. Make Mobile Work Smoothly


What the Data Says: The device people use to log in matters. SHAP analysis shows mobile
users may have different churn patterns than desktop users.
What You Can Do: Review your mobile experience closely. Fix bugs, speed things up, and
remove any friction that could push mobile users away.

[ ]: