Phase 3
1. Problem Statement
Customer churn occurs when clients stop doing business with a company. In highly
competitive industries, understanding why customers churn is crucial for retaining them. This
project aims to build a machine learning model that classifies whether a customer is likely to
churn, using behavioral and demographic data from a structured dataset. Accurate churn
prediction lets businesses act proactively to retain customers and reduce revenue loss.
2. Abstract
This project applies machine learning to customer churn prediction using real-world telecom
data. The dataset includes customer demographics, subscription details, billing patterns, and
service usage. After preprocessing and exploratory analysis, we trained three models (Logistic
Regression, Random Forest, and XGBoost); XGBoost achieved the highest accuracy (86%) and
F1-score (0.82). The model's predictions were interpreted using SHAP values for transparency.
The resulting system helps telecom companies identify and retain at-risk customers.
3. System Requirements
○ Hardware:
  - Minimum 4 GB RAM (8 GB recommended)
  - Standard processor (Intel i3/i5 or AMD equivalent)
○ Software:
  - Python 3.10+
  - Libraries: pandas, numpy, matplotlib, seaborn, scikit-learn, xgboost, shap, plotly
  - IDE: Google Colab / Jupyter Notebook
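A quick way to confirm that an environment meets these requirements is sketched below. The helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of the project code. Note that scikit-learn is installed as `scikit-learn` but imported as `sklearn`.

```python
import importlib.util
import sys

# Import names for the libraries listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "xgboost", "shap", "plotly"]


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# True when the running interpreter is Python 3.10 or newer.
python_ok = sys.version_info >= (3, 10)
missing = missing_packages(REQUIRED)

print("Python 3.10+:", "yes" if python_ok else "no")
print("Missing libraries:", ", ".join(missing) if missing else "none")
```

Running this in Colab or Jupyter before the notebook proper shows which libraries still need to be installed.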
4. Objectives
Make the system interpretable and usable for non-technical business teams.
● Contract type, tenure, and monthly charges had strong correlations with churn
● Visualizations: Histograms, Boxplots, Correlation Heatmaps
● Insights: Customers with short contracts and high bills churn more; fiber internet users
show higher churn probability
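The kind of comparison behind these insights can be sketched with a pandas group-by. The rows below are toy values made up for illustration, not project data, and the column names `Contract`, `MonthlyCharges`, and `Churn` are assumptions about how the dataset is labeled.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from the project dataset.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year"],
    "MonthlyCharges": [95.0, 80.5, 70.0, 55.0, 60.0, 25.0],
    "Churn": [1, 1, 0, 0, 1, 0],  # 1 = churned, 0 = retained
})

# Churn rate per contract type: the mean of a 0/1 column is the churned share.
churn_by_contract = df.groupby("Contract")["Churn"].mean()

# Median monthly charge for churned (1) versus retained (0) customers.
median_charge_by_churn = df.groupby("Churn")["MonthlyCharges"].median()

print(churn_by_contract)
print(median_charge_by_churn)
```

On real data, the same two group-bys give the numbers that the boxplots and histograms visualize.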
9. Feature Engineering
- Created new features: Total Services Used, Engagement Level
- Interaction terms: e.g., contract type × charges
- Feature selection via SelectKBest
- PCA for dimensionality reduction while retaining interpretability
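A minimal sketch of the first three steps follows, under stated assumptions: the data is a toy table, the column names are hypothetical, contract type is represented by its length in months so that it can be multiplied by charges, and `f_classif` is one possible scoring function for SelectKBest (the report does not say which was used). Engagement Level is left out because its definition is not given here.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "PhoneService":   [1, 1, 0, 1, 1, 0, 1, 1],
    "OnlineSecurity": [0, 1, 0, 0, 1, 1, 0, 1],
    "StreamingTV":    [1, 0, 0, 1, 1, 0, 1, 0],
    "ContractMonths": [1, 12, 1, 1, 24, 12, 1, 24],
    "MonthlyCharges": [90.0, 55.0, 30.0, 85.0, 60.0, 45.0, 95.0, 50.0],
    "Churn":          [1, 0, 1, 1, 0, 0, 1, 0],
})

# New feature: number of services each customer subscribes to.
service_cols = ["PhoneService", "OnlineSecurity", "StreamingTV"]
df["TotalServicesUsed"] = df[service_cols].sum(axis=1)

# Interaction term: contract length multiplied by monthly charges.
df["ContractXCharges"] = df["ContractMonths"] * df["MonthlyCharges"]

X = df.drop(columns="Churn")
y = df["Churn"]

# Keep the k features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X, y)
selected = list(X.columns[selector.get_support()])
print(selected)
```

The value of `k` is a tuning choice; 3 is used here only to keep the example small.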
10. Model Building
- Models: Logistic Regression, Random Forest, XGBoost
- Train-test split: 80-20
- Best model: XGBoost
- Accuracy: 86%
- F1-Score: 0.82
- AUC: 0.88
- Random seed: train_test_split(random_state=42)
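The comparison summarized above can be sketched as a loop over candidate models that reports the same three metrics. This is an illustration under assumptions, not the project's training code: the data comes from `make_classification` rather than the telecom records, the split is stratified (the report does not say whether it was), and XGBoost is left out to keep dependencies minimal. `xgboost.XGBClassifier` exposes the same `fit`/`predict`/`predict_proba` interface and could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the project uses telecom records instead.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=42)

# 80-20 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard class labels
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_proba),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

The scores printed here describe the synthetic data only and are not comparable to the figures reported above.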
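import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
warnings.filterwarnings('ignore')
# NOTE: the original listing is truncated. Lines marked "assumed" are
# placeholders added so the script runs; they are not the original code.
n_samples = 20
rng = np.random.default_rng(42)  # assumed
contract_column = rng.choice(['Month-to-month', 'One year', 'Two year'], n_samples)  # assumed
data = {
    'Contract': contract_column,
    'MonthlyCharges': rng.uniform(20, 120, n_samples).round(2),  # assumed
    'Churn': np.tile(['Yes', 'No'], n_samples // 2),  # assumed, balanced classes
}
df = pd.DataFrame(data)
# Step 3: Preprocessing (label-encode the text columns)
for col in ['Contract', 'Churn']:
    df[col] = LabelEncoder().fit_transform(df[col])
X = df.drop(columns='Churn')
y = df['Churn']
# Step 6: Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 80-20 train-test split (assumed; matches the split stated under Model Building)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of class 1 ("Yes")
# Step 9: Evaluation
# Confusion Matrix
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_proba):.2f}")
plt.plot([0, 1], [0, 1], linestyle='--', label="Chance")
plt.title("ROC Curve")
plt.legend()
plt.grid()
plt.show()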