Churn Predictions
import pandas as pd

try:
    df = pd.read_csv('E Commerce Dataset_2024.csv')
    display(df.head())
except FileNotFoundError:
    print("Error: 'E Commerce Dataset_2024.csv' not found.")
    df = None
except Exception as e:
    print(f"An error occurred: {e}")
    df = None
[Output: first five rows of df; only the trailing DaySinceLastOrder and CashbackAmount columns survive in the export]
# Data Types
print("Data Types:\n", df.dtypes)

# Missing Values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100
print("\nMissing Values:\n", missing_percentage)

# Descriptive Statistics
print("\nDescriptive Statistics:\n", df.describe(include='all'))
Data Types:
CustomerID int64
Churn int64
Tenure float64
PreferredLoginDevice object
CityTier int64
WarehouseToHome float64
PreferredPaymentMode object
Gender object
HourSpendOnApp float64
NumberOfDeviceRegistered int64
PreferedOrderCat object
SatisfactionScore int64
MaritalStatus object
NumberOfAddress int64
Complain int64
OrderAmountHikeFromlastYear float64
CouponUsed float64
OrderCount float64
DaySinceLastOrder float64
CashbackAmount int64
dtype: object
Missing Values:
CustomerID 0.000000
Churn 0.000000
Tenure 4.689165
PreferredLoginDevice 0.000000
CityTier 0.000000
WarehouseToHome 4.458259
PreferredPaymentMode 0.000000
Gender 0.000000
HourSpendOnApp 4.529307
NumberOfDeviceRegistered 0.000000
PreferedOrderCat 0.000000
SatisfactionScore 0.000000
MaritalStatus 0.000000
NumberOfAddress 0.000000
Complain 0.000000
OrderAmountHikeFromlastYear 4.706927
CouponUsed 4.547069
OrderCount 4.582593
DaySinceLastOrder 5.452931
CashbackAmount 0.000000
dtype: float64
Descriptive Statistics:
CustomerID Churn Tenure PreferredLoginDevice \
count 5630.000000 5630.000000 5366.000000 5630
unique NaN NaN NaN 3
top NaN NaN NaN Mobile Phone
freq NaN NaN NaN 2765
mean 52815.500000 0.168384 10.189899 NaN
std 1625.385339 0.374240 8.557241 NaN
min 50001.000000 0.000000 0.000000 NaN
25% 51408.250000 0.000000 2.000000 NaN
50% 52815.500000 0.000000 9.000000 NaN
75% 54222.750000 0.000000 16.000000 NaN
max 55630.000000 1.000000 61.000000 NaN
CityTier WarehouseToHome PreferredPaymentMode Gender \
count 5630.000000 5379.000000 5630 5630
unique NaN NaN 7 2
top NaN NaN Debit Card Male
freq NaN NaN 2314 3384
mean 1.654707 15.639896 NaN NaN
std 0.915389 8.531475 NaN NaN
min 1.000000 5.000000 NaN NaN
25% 1.000000 9.000000 NaN NaN
50% 1.000000 14.000000 NaN NaN
75% 3.000000 20.000000 NaN NaN
max 3.000000 127.000000 NaN NaN
[… describe() output for the middle columns is truncated in the export; a surviving fragment (likely OrderAmountHikeFromlastYear, CouponUsed, OrderCount):]
25% 13.000000 1.000000 1.000000
50% 15.000000 1.000000 2.000000
75% 18.000000 2.000000 3.000000
max 26.000000 16.000000 16.000000
DaySinceLastOrder CashbackAmount
count 5323.000000 5630.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 4.543491 177.221492
std 3.654433 49.193869
min 0.000000 0.000000
25% 2.000000 146.000000
50% 3.000000 163.000000
75% 7.000000 196.000000
max 46.000000 325.000000
Churn Distribution:
Churn
0 83.161634
1 16.838366
Name: count, dtype: float64
[4]: # Relationship with Target Variable (Numerical)
import matplotlib.pyplot as plt
import seaborn as sns

for col in ['Tenure', 'WarehouseToHome', 'HourSpendOnApp',
            'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount',
            'DaySinceLastOrder', 'CashbackAmount']:
    print(f"\nSummary Statistics for {col} grouped by Churn:")
    print(df.groupby('Churn')[col].describe())
    plt.figure(figsize=(8, 6))
    df.boxplot(column=col, by='Churn', patch_artist=True, showfliers=False)  # suppress outliers
    plt.title(f'{col} vs Churn')
    plt.suptitle('')  # remove default boxplot title
    plt.ylabel(col)
    plt.show()

# Correlation Analysis
numerical_features = df.select_dtypes(include=['number'])
correlation_matrix = numerical_features.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Summary Statistics for Tenure grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4499.0 11.502334 8.419217 0.0 5.0 10.0 17.0 61.0
1 867.0 3.379469 5.486089 0.0 0.0 1.0 3.0 21.0
[Boxplot: Tenure vs Churn]
Summary Statistics for HourSpendOnApp grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4485.0 2.925530 0.727184 0.0 2.0 3.0 3.0 5.0
1 890.0 2.961798 0.694427 2.0 2.0 3.0 3.0 4.0
[Boxplot: HourSpendOnApp vs Churn]
Summary Statistics for OrderAmountHikeFromlastYear grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4431.0 15.724893 3.646256 11.0 13.0 15.0 18.0 26.0
1 934.0 15.627409 3.812084 11.0 13.0 14.0 18.0 26.0
[Boxplot: OrderAmountHikeFromlastYear vs Churn]
Summary Statistics for CouponUsed grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4434.0 1.758232 1.893083 0.0 1.0 1.0 2.0 16.0
1 940.0 1.717021 1.902503 0.0 1.0 1.0 2.0 16.0
[Boxplot: CouponUsed vs Churn]
Summary Statistics for OrderCount grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4442.0 3.046601 2.964982 1.0 1.0 2.0 3.0 16.0
1 930.0 2.823656 2.809924 1.0 1.0 2.0 3.0 16.0
[Boxplot: OrderCount vs Churn]
Summary Statistics for DaySinceLastOrder grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4429.0 4.807406 3.644758 0.0 2.0 4.0 8.0 31.0
1 894.0 3.236018 3.415137 0.0 1.0 2.0 5.0 46.0
[Boxplot: DaySinceLastOrder vs Churn]
Summary Statistics for CashbackAmount grouped by Churn:
count mean std min 25% 50% 75% max
Churn
0 4682.0 180.633704 50.422799 0.0 147.0 166.0 201.0 325.0
1 948.0 160.369198 38.413534 110.0 132.0 150.0 175.0 324.0
[Boxplot: CashbackAmount vs Churn]
[Heatmap: Correlation Matrix of Numerical Features]
Value Counts for PreferredLoginDevice:
PreferredLoginDevice
Mobile Phone 2765
Computer 1634
Phone 1231
Name: count, dtype: int64
Value Counts for PreferredPaymentMode:
PreferredPaymentMode
Debit Card 2314
Credit Card 1501
E wallet 614
UPI 414
COD 365
CC 273
Cash on Delivery 149
Name: count, dtype: int64
Value Counts for Gender:
Gender
Male 3384
Female 2246
Name: count, dtype: int64
Value Counts for PreferedOrderCat:
PreferedOrderCat
Laptop & Accessory 2050
Mobile Phone 1271
Fashion 826
Mobile 809
Grocery 410
Others 264
Name: count, dtype: int64
Value Counts for MaritalStatus:
MaritalStatus
Married 2986
Single 1796
Divorced 848
Name: count, dtype: int64
Value Counts for Complain:
Complain
0 4026
1 1604
Name: count, dtype: int64
0.3 Data visualization
Visualize the data distributions, relationships between variables, and the target variable’s relationship with other features.
[5]: import matplotlib.pyplot as plt
import seaborn as sns

# Distributions
numerical_cols = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp',
                  'OrderAmountHikeFromlastYear', 'CouponUsed', 'OrderCount',
                  'DaySinceLastOrder', 'CashbackAmount']
categorical_cols = ['PreferredLoginDevice', 'CityTier', 'PreferredPaymentMode',
                    'Gender', 'PreferedOrderCat', 'SatisfactionScore', 'MaritalStatus',
                    'Complain']

# Outlier Detection
for col in numerical_cols:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=col, data=df)
    plt.title(f'Boxplot of {col} for Outlier Detection')
    plt.show()
[Boxplots of each numerical feature for outlier detection]
0.4 Data cleaning
[6]: import pandas as pd
import numpy as np

# Outlier handling: clip each numerical column to fences built from its
# 5th and 95th percentiles (an IQR-style rule applied to a wider spread)
for col in numerical_cols:
    Q1 = df[col].quantile(0.05)
    Q3 = df[col].quantile(0.95)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[col] = np.clip(df[col], lower_bound, upper_bound)

display(df.head())
[Output: first five rows after clipping; only the trailing DaySinceLastOrder and CashbackAmount columns survive in the export]
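The missing-value analysis earlier shows that Tenure, WarehouseToHome, HourSpendOnApp, OrderAmountHikeFromlastYear, CouponUsed, OrderCount and DaySinceLastOrder each carry roughly 4 to 6% missing values, and the scikit-learn models used later cannot train on NaNs. The cell that handles this is not visible in the extracted text; the lines below are a minimal sketch assuming simple median imputation, not the notebook's actual method.

# Hedged sketch (assumed step): median-impute the numeric columns with missing values.
cols_with_missing = ['Tenure', 'WarehouseToHome', 'HourSpendOnApp',
                     'OrderAmountHikeFromlastYear', 'CouponUsed',
                     'OrderCount', 'DaySinceLastOrder']
for col in cols_with_missing:
    df[col] = df[col].fillna(df[col].median())

print(df[cols_with_missing].isnull().sum())  # should report zero missing values afterwards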
0.5 Data preparation
[7]: from sklearn.preprocessing import OneHotEncoder
display(prepared_df.head())
[Output: first five rows of prepared_df showing one-hot columns such as PreferedOrderCat_Fashion and PreferedOrderCat_Grocery; 5 rows x 37 columns, truncated in the export]
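The cell that actually builds prepared_df (encoding the categorical columns) did not survive the export; only the OneHotEncoder import and the displayed head remain. The sketch below is one plausible construction using pd.get_dummies on the five object-typed columns; the notebook may instead have used the imported OneHotEncoder, so treat the exact call as an assumption.

# Hedged sketch (assumed construction): one-hot encode the object-typed columns.
# The real cell may have used OneHotEncoder instead of pd.get_dummies.
prepared_df = pd.get_dummies(
    df,
    columns=['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender',
             'PreferedOrderCat', 'MaritalStatus'],
    dtype=float,
)
print(prepared_df.shape)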
# Interaction Features
prepared_df['HourSpend_OrderCountInteraction'] = prepared_df['HourSpendOnApp'] * prepared_df['OrderCount']

# Ratio Features
prepared_df['Cashback_OrderAmountRatio'] = prepared_df['CashbackAmount'] / prepared_df['OrderAmountHikeFromlastYear']
prepared_df['CouponUsed_OrderCountRatio'] = prepared_df['CouponUsed'] / prepared_df['OrderCount']

# Combined Features
prepared_df['CustomerExperienceScore'] = prepared_df['SatisfactionScore'] * (1 - prepared_df['Complain'])

# Polynomial Features
prepared_df['TenureSquared'] = prepared_df['Tenure'] ** 2

# For simplicity, let's just print the correlation matrix without sorting
numerical_features = prepared_df.select_dtypes(include=['number'])
corr_matrix = numerical_features.corr()

# Further evaluation with visualization can be added if needed.
display(prepared_df.head())
[Output: a truncated slice of the correlation matrix; CustomerExperienceScore -0.145, TenureSquared -0.240]
CustomerID Churn Tenure CityTier WarehouseToHome HourSpendOnApp \
0 50001 1 4.0 3 6.0 3.0
1 50002 1 9.0 1 8.0 3.0
2 50003 1 9.0 1 30.0 2.0
3 50004 1 0.0 3 15.0 2.0
4 50005 1 0.0 1 12.0 3.0
Cashback_OrderAmountRatio CouponUsed_OrderCountRatio \
0 14.545455 1.0
1 8.066667 0.0
2 8.571429 0.0
3 5.826087 0.0
4 11.818182 1.0
CustomerExperienceScore TenureSquared
0 0 16.0
1 0 81.0
2 0 81.0
3 5 0.0
4 5 0.0
[5 rows x 43 columns]
0.7 Data splitting
[9]: from sklearn.model_selection import train_test_split
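Only the import from this cell survives in the extracted text. The sketch below shows a split consistent with the variables used later (X_train, X_val, X_test and DataFrame targets accessed via .iloc[:, 0]); the split proportions and random_state are assumptions.

# Hedged sketch (assumed split sizes): stratified train / validation / test split.
X = prepared_df.drop(columns=['Churn'])
y = prepared_df[['Churn']]  # kept as a DataFrame, matching the y.iloc[:, 0] usage later

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y['Churn'], random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp['Churn'], random_state=42)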
/usr/local/lib/python3.11/dist-packages/sklearn/linear_model/_logistic.py:465:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
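The warning above is raised during the logistic-regression fit. Its own advice is to raise max_iter (which the notebook later does with max_iter=1000) or to scale the features; the lines below sketch the scaling route with a StandardScaler fitted on the training split only, as an illustration rather than something the notebook is shown doing.

# Hedged sketch (illustrative only): standardize features to help lbfgs converge.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on training data only
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)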
[19]: XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=None, …)
    # Apply SMOTE
    smote = SMOTE(random_state=random_state)
    X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train.iloc[:, 0])
    return X_train_smote, y_train_smote
# Create a small validation set from the resampled training data for early stopping in XGBoost
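The code that followed this comment was lost in the export. A plausible sketch of such a split is shown below; the variable names, split size and random_state are assumptions.

# Hedged sketch (assumed names and sizes): hold out part of the resampled
# training data as an early-stopping set for XGBoost.
X_train_es, X_es_val, y_train_es, y_es_val = train_test_split(
    X_train_smote, y_train_smote, test_size=0.1,
    stratify=y_train_smote, random_state=42)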
space_rf = {
    'n_estimators': hp.quniform('n_estimators', 50, 200, 10),
    'max_depth': hp.quniform('max_depth', 5, 20, 1),
    'min_samples_split': hp.quniform('min_samples_split', 2, 10, 1),
    'class_weight': hp.choice('class_weight', [None, 'balanced', 'balanced_subsample'])
}
space_xgb = {
    'n_estimators': hp.quniform('n_estimators', 50, 200, 10),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'subsample': hp.uniform('subsample', 0.5, 1),
    'scale_pos_weight': hp.choice('scale_pos_weight', [1, 5])  # SMOTE already balanced classes
}
        y_pred = (y_prob >= threshold).astype(int)
        f1 = f1_score(y_val.iloc[:, 0], y_pred)
        if f1 > best_f1:
            best_f1 = f1
            best_threshold = threshold

    # Calculate PR-AUC
    precision_curve, recall_curve, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)

    result = {
        'accuracy': accuracy,
        'precision': precision_val,
        'recall': recall_val,
        'f1': f1,
        'roc_auc': roc_auc,
        'pr_auc': pr_auc,
    }
    return result
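Only the tail of the evaluation helper survives above. The sketch below fills in a complete version consistent with the metric keys read later; the threshold grid is an assumption, and best_threshold (read later via best_metrics['best_threshold']) is folded into the return value here even though the surviving dict above does not show it.

# Hedged sketch of the full helper; the fragment above is its tail.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, precision_recall_curve, auc)

def evaluate_model(model, X_val, y_val):
    y_true = y_val.iloc[:, 0]
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Search a threshold grid for the best F1 (grid spacing is an assumption)
    best_f1, best_threshold = 0.0, 0.5
    for threshold in np.arange(0.1, 0.91, 0.01):
        f1_t = f1_score(y_true, (y_prob >= threshold).astype(int))
        if f1_t > best_f1:
            best_f1, best_threshold = f1_t, threshold

    precision_curve, recall_curve, _ = precision_recall_curve(y_true, y_prob)
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'roc_auc': roc_auc_score(y_true, y_prob),
        'pr_auc': auc(recall_curve, precision_curve),
        'best_threshold': best_threshold,
    }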
def objective_logreg(params):
    model = LogisticRegression(**params, max_iter=1000)
    model.fit(X_train_smote, y_train_smote.iloc[:, 0])
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics with more weight on PR-AUC which is better for imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc
    return -combined_score  # hyperopt's fmin minimizes the returned value
def objective_rf(params):
    # Convert float parameters to int
    params = {
        'n_estimators': int(params['n_estimators']),
        'max_depth': int(params['max_depth']),
        'min_samples_split': int(params['min_samples_split']),
        'class_weight': params['class_weight']
    }
    model = RandomForestClassifier(**params)
    model.fit(X_train_smote, y_train_smote.iloc[:, 0])
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics with more weight on PR-AUC which is better for imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc
    return -combined_score  # hyperopt's fmin minimizes the returned value
def objective_xgb(params):
    # Convert float parameters to int where needed
    params = {
        'n_estimators': int(params['n_estimators']),
        'learning_rate': params['learning_rate'],
        'max_depth': int(params['max_depth']),
        'subsample': params['subsample'],
        'scale_pos_weight': params['scale_pos_weight']
    }
    model = XGBClassifier(
        **params,
        eval_metric='logloss',
        verbosity=0
    )
    # Fit and predict (these lines mirror the other objectives; the originals were cut in the export)
    model.fit(X_train_smote, y_train_smote.iloc[:, 0])
    y_pred = model.predict(X_val)
    y_prob = model.predict_proba(X_val)[:, 1]

    # Calculate metrics
    f1 = f1_score(y_val.iloc[:, 0], y_pred)
    precision, recall, _ = precision_recall_curve(y_val.iloc[:, 0], y_prob)
    pr_auc = auc(recall, precision)

    # Combine metrics with more weight on PR-AUC which is better for imbalanced data
    combined_score = 0.4 * f1 + 0.6 * pr_auc
    return -combined_score  # hyperopt's fmin minimizes the returned value
print("Starting XGBoost optimization...")
trials_xgb = Trials()
best_params_xgb = fmin(fn=objective_xgb, space=space_xgb, algo=tpe.suggest,␣
↪max_evals=50, trials=trials_xgb)
}
print(final_params_rf)
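Only the closing brace and the print of the parameter-assembly cell survive above. Because hp.choice entries make fmin return category indices rather than values, the tuned parameters need to be mapped back before the final models are built; the sketch below shows one way to do that with hyperopt's space_eval (the name best_params_rf mirrors best_params_xgb and is an assumption).

# Hedged sketch: map fmin's raw output back to usable hyperparameters.
from hyperopt import space_eval

final_params_rf = space_eval(space_rf, best_params_rf)
final_params_rf['n_estimators'] = int(final_params_rf['n_estimators'])
final_params_rf['max_depth'] = int(final_params_rf['max_depth'])
final_params_rf['min_samples_split'] = int(final_params_rf['min_samples_split'])

final_params_xgb = space_eval(space_xgb, best_params_xgb)
final_params_xgb['n_estimators'] = int(final_params_xgb['n_estimators'])
final_params_xgb['max_depth'] = int(final_params_xgb['max_depth'])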
# Logistic Regression
final_logreg = LogisticRegression(**final_params_logreg, max_iter=1000)  # Added max_iter to prevent convergence warnings

# Random Forest
final_rf = RandomForestClassifier(**final_params_rf)
final_rf.fit(X_train_smote, y_train_smote.iloc[:, 0])
rf_metrics = evaluate_model(final_rf, X_val, y_val)
# XGBoost
final_xgb = XGBClassifier(**final_params_xgb, eval_metric='logloss', verbosity=0)
metrics_mapping = {
'Logistic Regression': logreg_metrics,
'Random Forest': rf_metrics,
'XGBoost': xgb_metrics
}
print(f"Accuracy: {test_metrics['accuracy']:.4f}")
print(f"Precision: {test_metrics['precision']:.4f}")
print(f"Recall: {test_metrics['recall']:.4f}")
print(f"F1-Score: {test_metrics['f1']:.4f}")
print(f"AUC: {test_metrics['roc_auc']:.4f}")
y_pred_optimized = predict_with_threshold(best_model, X_test, threshold=best_metrics['best_threshold'])
opt_f1 = f1_score(y_test.iloc[:, 0], y_pred_optimized)
print(f"F1-Score with optimized threshold ({best_metrics['best_threshold']:.4f}): {opt_f1:.4f}")
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
except NameError:
    print("\nTest set not found. To evaluate on a test set, define X_test and y_test variables.")
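predict_with_threshold is defined in a cell that did not survive the export. A minimal sketch consistent with how it is called above (and with the thresholding logic in the evaluation helper):

# Hedged sketch: apply a custom decision threshold to positive-class probabilities.
def predict_with_threshold(model, X, threshold=0.5):
    y_prob = model.predict_proba(X)[:, 1]
    return (y_prob >= threshold).astype(int)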
Best hyperparameters for XGBoost:
{'n_estimators': 110, 'learning_rate': np.float64(1.3764408103898214),
'max_depth': 8, 'subsample': np.float64(0.9104790831666846), 'scale_pos_weight':
5}
0.10 Model Analysis
[24]: # Feature Importance
if isinstance(best_model, (RandomForestClassifier, XGBClassifier)):
    importances = best_model.feature_importances_
    feature_names = X_train.columns
    feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

    print("\nFeature Importances:")
    print(feature_importances)
plt.ylabel("Feature")
plt.show()
else:
print("Feature importance is not directly available for this model type.")
Feature Importances:
Tenure 0.115815
TenureSquared 0.103631
Tenure_OrderAmountInteraction 0.082694
MaritalStatus_Single 0.063758
MaritalStatus_Married 0.044249
PreferedOrderCat_Laptop & Accessory 0.043390
CustomerExperienceScore 0.037941
DaySinceLastOrder 0.030060
CashbackAmount 0.029917
PreferredLoginDevice_Mobile Phone 0.028004
WarehouseToHome 0.026585
Gender_Male 0.023279
PreferredPaymentMode_Debit Card 0.022893
Cashback_OrderAmountRatio 0.022797
PreferredLoginDevice_Computer 0.022608
Gender_Female 0.022308
PreferredPaymentMode_Credit Card 0.022090
CustomerID 0.021090
NumberOfAddress 0.020599
OrderAmountHikeFromlastYear 0.018337
PreferedOrderCat_Mobile Phone 0.017166
SatisfactionScore 0.016677
CouponUsed_OrderCountRatio 0.014937
PreferedOrderCat_Mobile 0.014571
HourSpend_OrderCountInteraction 0.014262
CouponUsed 0.013399
PreferredLoginDevice_Phone 0.013123
OrderCount 0.011329
NumberOfDeviceRegistered 0.010503
PreferredPaymentMode_E wallet 0.010039
HourSpendOnApp 0.009492
CityTier 0.008710
PreferedOrderCat_Fashion 0.008695
Complain 0.008372
MaritalStatus_Divorced 0.008145
PreferredPaymentMode_COD 0.004896
PreferredPaymentMode_CC 0.004189
PreferredPaymentMode_UPI 0.004181
PreferredPaymentMode_Cash on Delivery 0.001825
PreferedOrderCat_Grocery 0.001813
PreferedOrderCat_Others 0.001628
dtype: float64
[26]: from sklearn.metrics import confusion_matrix
import seaborn as sns
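The body of this cell is missing from the extracted text; the imports suggest a confusion-matrix plot. A minimal sketch, assuming it is drawn for the best model's threshold-optimized test predictions:

# Hedged sketch (assumed inputs): confusion matrix of the optimized test predictions.
cm = confusion_matrix(y_test.iloc[:, 0], y_pred_optimized)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()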
0.11 SHAP Analysis
[27]: import shap
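The construction of shap_values does not survive in the extracted text. The indexing in the next cell (shap_values[0, :, 1] and shap_values.values[:, :, 1]) implies an Explanation object with a class dimension; a minimal sketch, assuming a TreeExplainer over the best model and X_test:

# Hedged sketch (assumed explainer): per-class SHAP values for the test set.
explainer = shap.TreeExplainer(best_model)
shap_values = explainer(X_test)   # shape: (n_samples, n_features, n_classes)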
[28]: # Individual SHAP plots
shap.plots.waterfall(shap_values[0, :, 1])  # example: first instance in X_test, class 1 (Churn)

# Dependence plots
shap.dependence_plot("Tenure", shap_values.values[:, :, 1], X_test)  # SHAP values for class 1
0.12 Recommendations
[31]: report = f"""
# Customer Churn Prediction Report

## Model Performance Comparison

We evaluated three different models: Logistic Regression, Random Forest, and XGBoost. The models
were tuned using hyperparameter optimization to maximize their performance. The table below
summarizes the results on the validation dataset.

| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| Logistic Regression | {results['Logistic Regression']['Accuracy']:.4f} | {results['Logistic Regression']['Precision']:.4f} | {results['Logistic Regression']['Recall']:.4f} | {results['Logistic Regression']['F1-Score']:.4f} | {results['Logistic Regression']['AUC']:.4f} |
| Random Forest | {results['Random Forest']['Accuracy']:.4f} | {results['Random Forest']['Precision']:.4f} | {results['Random Forest']['Recall']:.4f} | {results['Random Forest']['F1-Score']:.4f} | {results['Random Forest']['AUC']:.4f} |
| XGBoost | {results['XGBoost']['Accuracy']:.4f} | {results['XGBoost']['Precision']:.4f} | {results['XGBoost']['Recall']:.4f} | {results['XGBoost']['F1-Score']:.4f} | {results['XGBoost']['AUC']:.4f} |

Based on these results, Random Forest was chosen as the best-performing model due to its high
F1-Score. This score reflects a balance between the model’s ability to correctly identify customers
who will churn and its accuracy in avoiding false positives.

## Actionable Recommendations

Based on the model’s findings and feature importance, we recommend focusing retention efforts on
the following:

### Top 5 Recommendations Based on SHAP Analysis

**What You Can Do:** Create a warm, welcoming experience for new customers—think personalized
onboarding, proactive support, and loyalty perks during their first few months.

---

**What the Data Says:** […] predictors.

**What You Can Do:** Launch a loyalty program that gets better with time—offer growing rewards
or exclusive perks the longer they stay and the more they spend.

---

**What the Data Says:** Being single is linked to different churn behavior compared to married
customers, with SHAP highlighting it as a key factor.

**What You Can Do:** Create campaigns and experiences that speak directly to single customers’
preferences. Tailor your messaging and offers to better […]

---

**What You Can Do:** Regularly monitor experience scores and respond quickly when they dip.
Make it easy for customers to give feedback—and show them you’re listening.

---

**What You Can Do:** Review your mobile experience closely. Fix bugs, speed things up, and
remove any friction that could push mobile users away.
"""
from IPython.display import Markdown
Markdown(report)
[31]:
1 Customer Churn Prediction Report
1.1 Model Performance Comparison
We evaluated three different models: Logistic Regression, Random Forest, and XGBoost. The
models were tuned using hyperparameter optimization to maximize their performance. The table
below summarizes the results on the validation dataset.
Based on these results, Random Forest was chosen as the best-performing model due to its high
F1-Score. This score reflects a balance between the model’s ability to correctly identify customers
who will churn and its accuracy in avoiding false positives.
1.2 Actionable Recommendations
Based on the model’s findings and feature importance, we recommend focusing retention efforts on
the following:
Top 5 Recommendations Based on SHAP Analysis