Churn Predictions
import pandas as pd

try:
    df = pd.read_csv('E Commerce Dataset_2024.csv')
    display(df.head())
except FileNotFoundError:
    print("Error: 'E Commerce Dataset_2024.csv' not found.")
    df = None
except Exception as e:
    print(f"An error occurred: {e}")
    df = None
[Output: first five rows of df; only the trailing DaySinceLastOrder and CashbackAmount columns survive in the export]
# Data Types
print("Data Types:\n", df.dtypes)

# Missing Values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
print("\nMissing Values:\n", missing_percentage)

# Descriptive Statistics
print("\nDescriptive Statistics:\n", df.describe(include='all'))
Data Types:
CustomerID int64
Churn int64
Tenure float64
PreferredLoginDevice object
CityTier int64
WarehouseToHome float64
PreferredPaymentMode object
Gender object
HourSpendOnApp float64
NumberOfDeviceRegistered int64
PreferedOrderCat object
SatisfactionScore int64
MaritalStatus object
NumberOfAddress int64
Complain int64
OrderAmountHikeFromlastYear float64
CouponUsed float64
OrderCount float64
DaySinceLastOrder float64
CashbackAmount int64
dtype: object
Missing Values:
CustomerID 0.000000
Churn 0.000000
Tenure 4.689165
PreferredLoginDevice 0.000000
CityTier 0.000000
WarehouseToHome 4.458259
PreferredPaymentMode 0.000000
Gender 0.000000
HourSpendOnApp 4.529307
NumberOfDeviceRegistered 0.000000
PreferedOrderCat 0.000000
SatisfactionScore 0.000000
MaritalStatus 0.000000
NumberOfAddress 0.000000
Complain 0.000000
OrderAmountHikeFromlastYear 4.706927
CouponUsed 4.547069
OrderCount 4.582593
DaySinceLastOrder 5.452931
CashbackAmount 0.000000
dtype: float64
Descriptive Statistics:
CustomerID Churn Tenure PreferredLoginDevice \
count 5630.000000 5630.000000 5366.000000 5630
unique NaN NaN NaN 3
top NaN NaN NaN Mobile Phone
freq NaN NaN NaN 2765
mean 52815.500000 0.168384 10.189899 NaN
std 1625.385339 0.374240 8.557241 NaN
min 50001.000000 0.000000 0.000000 NaN
25% 51408.250000 0.000000 2.000000 NaN
50% 52815.500000 0.000000 9.000000 NaN
75% 54222.750000 0.000000 16.000000 NaN
max 55630.000000 1.000000 61.000000 NaN
CityTier WarehouseToHome PreferredPaymentMode Gender \
count 5630.000000 5379.000000 5630 5630
unique NaN NaN 7 2
top NaN NaN Debit Card Male
freq NaN NaN 2314 3384
mean 1.654707 15.639896 NaN NaN
std 0.915389 8.531475 NaN NaN
min 1.000000 5.000000 NaN NaN
25% 1.000000 9.000000 NaN NaN
50% 1.000000 14.000000 NaN NaN
75% 3.000000 20.000000 NaN NaN
max 3.000000 127.000000 NaN NaN
[… describe() output for the middle columns is truncated in the export; a surviving fragment (likely OrderAmountHikeFromlastYear, CouponUsed, OrderCount):]
25% 13.000000 1.000000 1.000000
50% 15.000000 1.000000 2.000000
75% 18.000000 2.000000 3.000000
max 26.000000 16.000000 16.000000
DaySinceLastOrder CashbackAmount
count 5323.000000 5630.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 4.543491 177.221492
std 3.654433 49.193869
min 0.000000 0.000000
25% 2.000000 146.000000
50% 3.000000 163.000000
75% 7.000000 196.000000
max 46.000000 325.000000
Churn Distribution:
Churn
0 83.161634
1 16.838366
Name: count, dtype: float64
[4]: # Relationship with Target Variable (Numerical)
import matplotlib.pyplot as plt
import seaborn as sns

for col in ['Tenure', 'WarehouseToHome', 'HourSpendOnApp',
            'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount',
            'DaySinceLastOrder', 'CashbackAmount']:
    print(f"\nSummary Statistics for {col} grouped by Churn:")
    print(df.groupby('Churn')[col].describe())
    plt.figure(figsize=(8, 6))
    df.boxplot(column=col, by='Churn', patch_artist=True, showfliers=False)  # suppress outliers
    plt.title(f'{col} vs Churn')
    plt.suptitle('')  # remove default boxplot title
    plt.ylabel(col)
    plt.show()

# Correlation Analysis
numerical_features = df.select_dtypes(include=['number'])
correlation_matrix = numerical_features.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Summary Statistics for Tenure grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4499.0 11.502334 8.419217 0.0 5.0 10.0 17.0 61.0
1 867.0 3.379469 5.486089 0.0 0.0 1.0 3.0 21.0
[Boxplot: Tenure vs Churn]
Summary Statistics for HourSpendOnApp grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4485.0 2.925530 0.727184 0.0 2.0 3.0 3.0 5.0
1 890.0 2.961798 0.694427 2.0 2.0 3.0 3.0 4.0
[Boxplot: HourSpendOnApp vs Churn]
Summary Statistics for OrderAmountHikeFromlastYear grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4431.0 15.724893 3.646256 11.0 13.0 15.0 18.0 26.0
1 934.0 15.627409 3.812084 11.0 13.0 14.0 18.0 26.0
[Boxplot: OrderAmountHikeFromlastYear vs Churn]
Summary Statistics for CouponUsed grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4434.0 1.758232 1.893083 0.0 1.0 1.0 2.0 16.0
1 940.0 1.717021 1.902503 0.0 1.0 1.0 2.0 16.0
[Boxplot: CouponUsed vs Churn]
Summary Statistics for OrderCount grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4442.0 3.046601 2.964982 1.0 1.0 2.0 3.0 16.0
1 930.0 2.823656 2.809924 1.0 1.0 2.0 3.0 16.0
[Boxplot: OrderCount vs Churn]
Summary Statistics for DaySinceLastOrder grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4429.0 4.807406 3.644758 0.0 2.0 4.0 8.0 31.0
1 894.0 3.236018 3.415137 0.0 1.0 2.0 5.0 46.0
[Boxplot: DaySinceLastOrder vs Churn]
Summary Statistics for CashbackAmount grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4682.0 180.633704 50.422799 0.0 147.0 166.0 201.0 325.0
1 948.0 160.369198 38.413534 110.0 132.0 150.0 175.0 324.0
[Boxplot: CashbackAmount vs Churn]
[Heatmap: Correlation Matrix of Numerical Features]
Value Counts for PreferredLoginDevice:
PreferredLoginDevice
Mobile Phone 2765
Computer 1634
Phone 1231
Name: count, dtype: int64
Value Counts for PreferredPaymentMode:
PreferredPaymentMode
Debit Card 2314
Credit Card 1501
E wallet 614
UPI 414
COD 365
CC 273
Cash on Delivery 149
Name: count, dtype: int64
Value Counts for Gender:
Gender
Male 3384
Female 2246
Name: count, dtype: int64
Value Counts for PreferedOrderCat:
PreferedOrderCat
Laptop & Accessory 2050
Mobile Phone 1271
Fashion 826
Mobile 809
Grocery 410
Others 264
Name: count, dtype: int64
Value Counts for MaritalStatus:
MaritalStatus
Married 2986
Single 1796
Divorced 848
Name: count, dtype: int64
Value Counts for Complain:
Complain
0 4026
1 1604
Name: count, dtype: int64
0.3 Data visualization
Visualize the data distributions, relationships between variables, and the target variable’s relationship with other features.
[5]: import matplotlib.pyplot as plt
import seaborn as sns

# Distributions
numerical_cols = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp',
                  'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount',
                  'DaySinceLastOrder', 'CashbackAmount']
categorical_cols = ['PreferredLoginDevice', 'CityTier', 'PreferredPaymentMode',
                    'Gender', 'PreferedOrderCat', 'SatisfactionScore', 'MaritalStatus',
                    'Complain']

# Outlier Detection
for col in numerical_cols:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=col, data=df)
    plt.title(f'Boxplot of {col} for Outlier Detection')
    plt.show()
[Boxplots of each numerical feature for outlier detection]
0.4 Data cleaning
[6]: import pandas as pd
import numpy as np

# Outlier handling: clip each numerical column to fences built from its
# 5th and 95th percentiles (an IQR-style rule applied to a wider spread)
for col in numerical_cols:
    Q1 = df[col].quantile(0.05)
    Q3 = df[col].quantile(0.95)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = np.clip(df[col], lower_bound, upper_bound)

display(df.head())
[Output: first five rows after clipping; only the trailing DaySinceLastOrder and CashbackAmount columns survive in the export]
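The missing-value analysis earlier shows that Tenure, WarehouseToHome, HourSpendOnApp, OrderAmountHikeFromlastYear, CouponUsed, OrderCount and DaySinceLastOrder each carry roughly 4 to 6% missing values, and the scikit-learn models used later cannot train on NaNs. The cell that handles this is not visible in the extracted text; the lines below are a minimal sketch assuming simple median imputation, not the notebook's actual method.

# Hedged sketch (assumed step): median-impute the numeric columns with missing values.
cols_with_missing = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp',
                     'OrderAmountHikeFromlastYear', 'CouponUsed',
                     'OrderCount', 'DaySinceLastOrder']
for col in cols_with_missing:
    df[col] = df[col].fillna(df[col].median())

print(df[cols_with_missing].isnull().sum())  # should report zero missing values afterwards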
0.5 Data preparation
[7]: from sklearn.preprocessing import OneHotEncoder
display(prepared_df.head())
[Output: first five rows of prepared_df showing one-hot columns such as PreferedOrderCat_Fashion and PreferedOrderCat_Grocery; 5 rows x 37 columns, truncated in the export]
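The cell that actually builds prepared_df (encoding the categorical columns) did not survive the export; only the OneHotEncoder import and the displayed head remain. The sketch below is one plausible construction using pd.get_dummies on the five object-typed columns; the notebook may instead have used the imported OneHotEncoder, so treat the exact call as an assumption.

# Hedged sketch (assumed construction): one-hot encode the object-typed columns.
# The real cell may have used OneHotEncoder instead of pd.get_dummies.
prepared_df = pd.get_dummies(
    df,
    columns=['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender',
             'PreferedOrderCat', 'MaritalStatus'],
    dtype=float,
)
print(prepared_df.shape)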
# Interaction Features
prepared_df['HourSpend_OrderCountInteraction'] = prepared_df['HourSpendOnApp'] * prepared_df['OrderCount']

# Ratio Features
prepared_df['Cashback_OrderAmountRatio'] = prepared_df['CashbackAmount'] / prepared_df['OrderAmountHikeFromlastYear']
prepared_df['CouponUsed_OrderCountRatio'] = prepared_df['CouponUsed'] / prepared_df['OrderCount']

# Combined Features
prepared_df['CustomerExperienceScore'] = prepared_df['SatisfactionScore'] * (1 - prepared_df['Complain'])

# Polynomial Features
prepared_df['TenureSquared'] = prepared_df['Tenure'] ** 2

# For simplicity, let's just print the correlation matrix without sorting
numerical_features = prepared_df.select_dtypes(include=['number'])
corr_matrix = numerical_features.corr()

# Further evaluation with visualization can be added if needed.
display(prepared_df.head())
[Output: a truncated slice of the correlation matrix; CustomerExperienceScore -0.145, TenureSquared -0.240]
CustomerID Churn Tenure CityTier WarehouseToHome HourSpendOnApp \
0 50001 1 4.0 3 6.0 3.0
1 50002 1 9.0 1 8.0 3.0
2 50003 1 9.0 1 30.0 2.0
3 50004 1 0.0 3 15.0 2.0
4 50005 1 0.0 1 12.0 3.0
Cashback_OrderAmountRatio CouponUsed_OrderCountRatio \
0 14.545455 1.0
1 8.066667 0.0
2 8.571429 0.0
3 5.826087 0.0
4 11.818182 1.0
CustomerExperienceScore TenureSquared
0 0 16.0
1 0 81.0
2 0 81.0
3 5 0.0
4 5 0.0
[5 rows x 43 columns]
0.7 Data splitting
[9]: from sklearn.model_selection import train_test_split
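Only the import from this cell survives in the extracted text. The sketch below shows a split consistent with the variables used later (X_train, X_val, X_test and DataFrame targets accessed via .iloc[:, 0]); the split proportions and random_state are assumptions.

# Hedged sketch (assumed split sizes): stratified train / validation / test split.
X = prepared_df.drop(columns=['Churn'])
y = prepared_df[['Churn']]  # kept as a DataFrame, matching the y.iloc[:, 0] usage later

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y['Churn'], random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp['Churn'], random_state=42)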
/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_logistic.py:465:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
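The warning above is raised during the logistic-regression fit. Its own advice is to raise max_iter (which the notebook later does with max_iter=1000) or to scale the features; the lines below sketch the scaling route with a StandardScaler fitted on the training split only, as an illustration rather than something the notebook is shown doing.

# Hedged sketch (illustrative only): standardize features to help lbfgs converge.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)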
[19]: XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=None, …)
    # Apply SMOTE
    smote = SMOTE(random_state=random_state)
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train.iloc[:, 0])
    return X_train_smote, y_train_smote
# Create a small validation set from the resampled training data for early stopping in XGBoost
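The code that followed this comment was lost in the export. A plausible sketch of such a split is shown below; the variable names, split size and random_state are assumptions.

# Hedged sketch (assumed names and sizes): hold out part of the resampled
# training data as an early-stopping set for XGBoost.
X_train_es, X_es_val, y_train_es, y_es_val = train_test_split(
    X_train_smote, y_train_smote, test_size=0.1,
    stratify=y_train_smote, random_state=42)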
space_rf = {
    'n_estimators': hp.quniform('n_estimators', 50, 200, 10),
    'max_depth': hp.quniform('max_depth', 5, 20, 1),
    'min_samples_split': hp.quniform('min_samples_split', 2, 10, 1),
    'class_weight': hp.choice('class_weight', [None, 'balanced', 'balanced_subsample'])
}
space_xgb = {
    'n_estimators': hp.quniform('n_estimators', 50, 200, 10),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'subsample': hp.uniform('subsample', 0.5, 1),
    'scale_pos_weight': hp.choice('scale_pos_weight', [1, 5])  # SMOTE already balanced classes
}
        y_pred = (y_prob >= threshold).astype(int)
        f1 = f1_score(y_val.iloc[:, 0], y_pred)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold

    # Calculate PR-AUC
    precision_curve, recall_curve, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)

    result = {
        'accuracy': accuracy,
        'precision': precision_val,
        'recall': recall_val,
        'f1': f1,
        'roc_auc': roc_auc,
        'pr_auc': pr_auc,
    }
    return result
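Only the tail of the evaluation helper survives above. The sketch below fills in a complete version consistent with the metric keys read later; the threshold grid is an assumption, and best_threshold (read later via best_metrics['best_threshold']) is folded into the return value here even though the surviving dict above does not show it.

# Hedged sketch of the full helper; the fragment above is its tail.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, precision_recall_curve, auc)

def evaluate_model(model, X_val, y_val):
    y_true = y_val.iloc[:, 0]
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Search a threshold grid for the best F1 (grid spacing is an assumption)
    best_f1, best_threshold = 0.0, 0.5
    for threshold in np.arange(0.1, 0.91, 0.01):
        f1_t = f1_score(y_true, (y_prob >= threshold).astype(int))
        if f1_t > best_f1:
            best_f1, best_threshold = f1_t, threshold

    precision_curve, recall_curve, _ = precision_recall_curve(y_true, y_prob)
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'roc_auc': roc_auc_score(y_true, y_prob),
        'pr_auc': auc(recall_curve, precision_curve),
        'best_threshold': best_threshold,
    }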
def objective_logreg(params):
    model = LogisticRegression(**params, max_iter=1000)
    model.fit(X_train_smote, y_train_smote.iloc[:, 0])
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics with more weight on PR-AUC which is better for imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc
    return -combined_score  # hyperopt's fmin minimizes the returned value
def objective_rf(params):
    # Convert float parameters to int
    params = {
        'n_estimators': int(params['n_estimators']),
        'max_depth': int(params['max_depth']),
        'min_samples_split': int(params['min_samples_split']),
        'class_weight': params['class_weight']
    }
    model = RandomForestClassifier(**params)
    model.fit(X_train_smote, y_train_smote.iloc[:, 0])
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics with more weight on PR-AUC which is better for imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc
    return -combined_score  # hyperopt's fmin minimizes the returned value
def objective_xgb(params):
    # Convert float parameters to int where needed
    params = {
        'n_estimators': int(params['n_estimators']),
        'learning_rate': params['learning_rate'],
        'max_depth': int(params['max_depth']),
        'subsample': params['subsample'],
        'scale_pos_weight': params['scale_pos_weight']
    }
    model = XGBClassifier(
        **params,
        eval_metric='logloss',
        verbosity=0
    )
    # Fit and predict (these lines mirror the other objectives; the originals were cut in the export)
    model.fit(X_train_smote, y_train_smote.iloc[:, 0])
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics with more weight on PR-AUC which is better for imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc
    return -combined_score  # hyperopt's fmin minimizes the returned value
print("Starting XGBoost optimization...")
trials_xgb = Trials()
best_params_xgb = fmin(fn=objective_xgb, space=space_xgb, algo=tpe.suggest,␣
↪max_evals=50, trials=trials_xgb)
}
print(final_params_rf)
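Only the closing brace and the print of the parameter-assembly cell survive above. Because hp.choice entries make fmin return category indices rather than values, the tuned parameters need to be mapped back before the final models are built; the sketch below shows one way to do that with hyperopt's space_eval (the name best_params_rf mirrors best_params_xgb and is an assumption).

# Hedged sketch: map fmin's raw output back to usable hyperparameters.
from hyperopt import space_eval

final_params_rf = space_eval(space_rf, best_params_rf)
final_params_rf['n_estimators'] = int(final_params_rf['n_estimators'])
final_params_rf['max_depth'] = int(final_params_rf['max_depth'])
final_params_rf['min_samples_split'] = int(final_params_rf['min_samples_split'])

final_params_xgb = space_eval(space_xgb, best_params_xgb)
final_params_xgb['n_estimators'] = int(final_params_xgb['n_estimators'])
final_params_xgb['max_depth'] = int(final_params_xgb['max_depth'])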
# Logistic Regression
final_logreg = LogisticRegression(**final_params_logreg, max_iter=1000)  # Added max_iter to prevent convergence warnings

# Random Forest
final_rf = RandomForestClassifier(**final_params_rf)
final_rf.fit(X_train_smote, y_train_smote.iloc[:, 0])
rf_metrics = evaluate_model(final_rf, X_val, y_val)
# XGBoost
final_xgb = XGBClassifier(**final_params_xgb, eval_metric='logloss', verbosity=0)
metrics_mapping = {
'Logistic Regression': logreg_metrics,
'Random Forest': rf_metrics,
'XGBoost': xgb_metrics
}
print(f"Accuracy: {test_metrics['accuracy']:.4f}")
print(f"Precision: {test_metrics['precision']:.4f}")
print(f"Recall: {test_metrics['recall']:.4f}")
print(f"F1-Score: {test_metrics['f1']:.4f}")
print(f"AUC: {test_metrics['roc_auc']:.4f}")
y_pred_optimized = predict_with_threshold(best_model, X_test, threshold=best_metrics['best_threshold'])
opt_f1 = f1_score(y_test.iloc[:, 0], y_pred_optimized)
print(f"F1-Score with optimized threshold ({best_metrics['best_threshold']:.4f}): {opt_f1:.4f}")
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
except NameError:
    print("\nTest set not found. To evaluate on a test set, define X_test and y_test variables.")
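predict_with_threshold is defined in a cell that did not survive the export. A minimal sketch consistent with how it is called above (and with the thresholding logic in the evaluation helper):

# Hedged sketch: apply a custom decision threshold to positive-class probabilities.
def predict_with_threshold(model, X, threshold=0.5):
    y_prob = model.predict_proba(X)[:, 1]
    return (y_prob >= threshold).astype(int)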
Best hyperparameters for XGBoost:
{'n_estimators': 110, 'learning_rate': np.float64(1.3764408103898214),
'max_depth': 8, 'subsample': np.float64(0.9104790831666846), 'scale_pos_weight':
5}
0.10 Model Analysis
[24]: # Feature Importance
if isinstance(best_model, (RandomForestClassifier, XGBClassifier)):
    importances = best_model.feature_importances_
    feature_names = X_train.columns
    feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

    print("\nFeature Importances:")
    print(feature_importances)
plt.ylabel("Feature")
plt.show()
else:
print("Feature importance is not directly available for this model type.")
Feature Importances:
Tenure 0.115815
TenureSquared 0.103631
Tenure_OrderAmountInteraction 0.082694
MaritalStatus_Single 0.063758
MaritalStatus_Married 0.044249
PreferedOrderCat_Laptop & Accessory 0.043390
CustomerExperienceScore 0.037941
DaySinceLastOrder 0.030060
CashbackAmount 0.029917
PreferredLoginDevice_Mobile Phone 0.028004
WarehouseToHome 0.026585
Gender_Male 0.023279
PreferredPaymentMode_Debit Card 0.022893
Cashback_OrderAmountRatio 0.022797
PreferredLoginDevice_Computer 0.022608
Gender_Female 0.022308
PreferredPaymentMode_Credit Card 0.022090
CustomerID 0.021090
NumberOfAddress 0.020599
OrderAmountHikeFromlastYear 0.018337
PreferedOrderCat_Mobile Phone 0.017166
SatisfactionScore 0.016677
CouponUsed_OrderCountRatio 0.014937
PreferedOrderCat_Mobile 0.014571
HourSpend_OrderCountInteraction 0.014262
CouponUsed 0.013399
PreferredLoginDevice_Phone 0.013123
OrderCount 0.011329
NumberOfDeviceRegistered 0.010503
PreferredPaymentMode_E wallet 0.010039
HourSpendOnApp 0.009492
CityTier 0.008710
PreferedOrderCat_Fashion 0.008695
Complain 0.008372
MaritalStatus_Divorced 0.008145
PreferredPaymentMode_COD 0.004896
PreferredPaymentMode_CC 0.004189
PreferredPaymentMode_UPI 0.004181
PreferredPaymentMode_Cash on Delivery 0.001825
PreferedOrderCat_Grocery 0.001813
PreferedOrderCat_Others 0.001628
dtype: float64
[26]: from sklearn.metrics import confusion_matrix
import seaborn as sns
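The body of this cell is missing from the extracted text; the imports suggest a confusion-matrix plot. A minimal sketch, assuming it is drawn for the best model's threshold-optimized test predictions:

# Hedged sketch (assumed inputs): confusion matrix of the optimized test predictions.
cm = confusion_matrix(y_test.iloc[:, 0], y_pred_optimized)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()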
0.11 SHAP Analysis
[27]: import shap
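The construction of shap_values does not survive in the extracted text. The indexing in the next cell (shap_values[0, :, 1] and shap_values.values[:, :, 1]) implies an Explanation object with a class dimension; a minimal sketch, assuming a TreeExplainer over the best model and X_test:

# Hedged sketch (assumed explainer): per-class SHAP values for the test set.
explainer = shap.TreeExplainer(best_model)
shap_values = explainer(X_test)   # shape: (n_samples, n_features, n_classes)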
[28]: # Individual SHAP plots
shap.plots.waterfall(shap_values[0, :, 1])  # example: first instance in X_test, class 1 (Churn)

# Dependence plots
shap.dependence_plot("Tenure", shap_values.values[:, :, 1], X_test)  # SHAP values for class 1
0.12 Recommendations
[31]: report = f"""
# Customer Churn Prediction Report

## Model Performance Comparison

We evaluated three different models: Logistic Regression, Random Forest, and XGBoost. The models
were tuned using hyperparameter optimization to maximize their performance. The table below
summarizes the results on the validation dataset.

| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| Logistic Regression | {results['Logistic Regression']['Accuracy']:.4f} | {results['Logistic Regression']['Precision']:.4f} | {results['Logistic Regression']['Recall']:.4f} | {results['Logistic Regression']['F1-Score']:.4f} | {results['Logistic Regression']['AUC']:.4f} |
| Random Forest | {results['Random Forest']['Accuracy']:.4f} | {results['Random Forest']['Precision']:.4f} | {results['Random Forest']['Recall']:.4f} | {results['Random Forest']['F1-Score']:.4f} | {results['Random Forest']['AUC']:.4f} |
| XGBoost | {results['XGBoost']['Accuracy']:.4f} | {results['XGBoost']['Precision']:.4f} | {results['XGBoost']['Recall']:.4f} | {results['XGBoost']['F1-Score']:.4f} | {results['XGBoost']['AUC']:.4f} |

Based on these results, Random Forest was chosen as the best-performing model due to its high
F1-Score. This score reflects a balance between the model’s ability to correctly identify customers
who will churn and its accuracy in avoiding false positives.

## Actionable Recommendations

Based on the model’s findings and feature importance, we recommend focusing retention efforts on
the following:

### Top 5 Recommendations Based on SHAP Analysis

**What You Can Do:** Create a warm, welcoming experience for new customers—think personalized
onboarding, proactive support, and loyalty perks during their first few months.

---

**What the Data Says:** […] predictors.

**What You Can Do:** Launch a loyalty program that gets better with time—offer growing rewards
or exclusive perks the longer they stay and the more they spend.

---

**What the Data Says:** Being single is linked to different churn behavior compared to married
customers, with SHAP highlighting it as a key factor.

**What You Can Do:** Create campaigns and experiences that speak directly to single customers’
preferences. Tailor your messaging and offers to better […]

---

**What You Can Do:** Regularly monitor experience scores and respond quickly when they dip.
Make it easy for customers to give feedback—and show them you’re listening.

---

**What You Can Do:** Review your mobile experience closely. Fix bugs, speed things up, and
remove any friction that could push mobile users away.
"""
from IPython.display import Markdown
Markdown(report)
[31]:
1 Customer Churn Prediction Report
1.1 Model Performance Comparison
We evaluated three different models: Logistic Regression, Random Forest, and XGBoost. The
models were tuned using hyperparameter optimization to maximize their performance. The table
below summarizes the results on the validation dataset.
Based on these results, Random Forest was chosen as the best-performing model due to its high
F1-Score. This score reflects a balance between the model’s ability to correctly identify customers
who will churn and its accuracy in avoiding false positives.
1.2 Actionable Recommendations
Based on the model’s findings and feature importance, we recommend focusing retention efforts on
the following:
Top 5 Recommendations Based on SHAP Analysis