Iranian Churn
Submitted to
MASTER’S DEGREE IN
COMPUTER APPLICATION
Submitted
By
Lakshya Namdeo 23700220
Ishani Banerjee 2370187
Dibya Prakash Dash 2370152
Dataset Description
Problem Statement
Methodology
Data Splitting
Classification Method
Coding Process
Conclusion
Confusion Matrix
AUC-ROC Curve
References
2. Dataset Description
Overview
The dataset used in this project is the Iranian telecom churn dataset. It includes
various features that capture customer behaviour and demographics, which can help
predict whether a customer is likely to leave the service (churn) or remain
subscribed. The main aim of the dataset is to identify patterns and correlations that
can be utilised to develop a predictive model.
● Features:
○ Call Failure: Number of call failures experienced by the customer.
○ Complaints: Number of complaints raised by the customer.
○ Subscription Length: Duration of the customer’s subscription in
months.
○ Charge Amount: The total amount charged to the customer.
○ Frequency of Use: Frequency with which the customer uses the
service.
○ Other behavioural and usage metrics.
● Target Variable:
○ Churn: A binary indicator where 0 represents a non-churned customer
and 1 represents a churned customer.
● Class Distribution:
○ Non-Churned (0): 525 instances
○ Churned (1): 105 instances
○ Class Imbalance: Non-churned customers outnumber churned customers
5 to 1 (about 16.7% of instances are churned). If not handled properly,
this imbalance can bias the model towards the majority class.
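To make the imbalance concrete, the sketch below recomputes the churn rate and ratio from the class counts reported above and derives "balanced" class weights using the formula n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for class_weight="balanced". The report does not say which imbalance-handling technique was used, so class weighting here is an assumption for illustration, and the helper name class_weights is hypothetical.

```python
from collections import Counter


def class_weights(labels):
    """Return 'balanced' weights: n_samples / (n_classes * count) per class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Class counts reported in the dataset description:
# 525 non-churned (0) and 105 churned (1).
labels = [0] * 525 + [1] * 105
counts = Counter(labels)

churn_rate = counts[1] / len(labels)       # share of churned customers
imbalance_ratio = counts[0] / counts[1]    # majority-to-minority ratio
weights = class_weights(labels)            # assumed handling: class weighting

print(f"churn rate: {churn_rate:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
print(f"class weights: {weights}")
```

With weights like these, each churned example counts more during training, which is one common way to keep a model from ignoring the minority class.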
3. Problem Statement
The primary goal of this project is to predict customer churn for a telecom company
in Iran. Churn prediction models help businesses identify customers likely to leave,
enabling them to develop strategies to retain these customers. By accurately
predicting churn, the company can improve customer retention rates, reduce losses,
and increase profitability. Specifically, this project aims to develop a model that can
effectively classify whether a customer will churn based on their usage patterns,
complaints, and demographic data.
4. Methodology
Data Preprocessing
A series of preprocessing steps were carried out to clean and prepare the data:
1. Data Cleaning:
○ Missing values were imputed with the mean or median for numerical
columns and with the mode for categorical columns.
○ Categorical features were transformed into numerical values via
one-hot encoding where necessary.
2. Handling Outliers:
○ Outliers were detected through statistical methods (e.g., Z-score, IQR).
○ Outliers that could affect model performance were capped or removed
to ensure they didn’t distort the learning process.
3. Feature Scaling:
○ Features were standardised with StandardScaler, which matters most
for models that are sensitive to feature scale (e.g., logistic
regression).
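The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: the column values are made up, median imputation and IQR capping are picked from the options the list mentions, and the helper names (impute_median, cap_iqr, standardise) are hypothetical. The standardise helper divides by the population standard deviation, which is what StandardScaler does.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def cap_iqr(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]


def standardise(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


# Hypothetical numerical column with one missing and one extreme value.
raw = [2, 4, None, 6, 8, 10, 100]

filled = impute_median(raw)   # the missing entry becomes the median
capped = cap_iqr(filled)      # the extreme value is pulled to the upper fence
scaled = standardise(capped)  # zero mean, unit variance

print(filled)
print(capped)
print([round(v, 3) for v in scaled])
```

In a real pipeline the imputation values, fences, mean, and standard deviation would be estimated on the training set only and then reused on the test set, so that no test information leaks into training.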
5. Data Splitting
The dataset was split into training and testing sets to evaluate the model’s
performance.
This split allowed the model to learn from the majority of the data while providing a
separate set for unbiased performance evaluation.
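A stratified split keeps the churn rate the same in both sets, which matters for an imbalanced target. The sketch below is an illustration under assumptions: the report does not state the split ratio, the random seed, or whether stratification was used, so the 80/20 ratio, seed=42, and the helper name stratified_split are all hypothetical. It uses the class counts from the Dataset Description as example labels.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local generator keeps the split reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Example labels: 525 non-churned (0) and 105 churned (1).
labels = [0] * 525 + [1] * 105
train_idx, test_idx = stratified_split(labels)  # assumed 80/20 split

train_churn = sum(labels[i] for i in train_idx) / len(train_idx)
test_churn = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx))
print(f"train churn rate: {train_churn:.3f}, test churn rate: {test_churn:.3f}")
```

With scikit-learn, the equivalent would typically be train_test_split with the stratify argument set to the target column.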
6. Classification Method
Each model was evaluated using the model_metrics function, which computed
performance metrics on both training and test sets.
7. Coding Process
Libraries Used
1. Data Loading and Preprocessing: Loaded the dataset and performed initial
data cleaning.
2. Model Training: Each model was trained on the training data (xtrain,
ytrain).
3. Evaluation: The model_metrics function iteratively fitted each model and
computed accuracy, precision, recall, and F1 scores, storing the results in a
DataFrame.
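The original model_metrics function is not reproduced in this report, so the following is only a sketch of what such a function might look like. The signature, the xtest and ytest names, the list-of-dicts return value, and the ComplaintRule stand-in classifier are assumptions made for illustration. Metrics are computed by hand from confusion-matrix counts so the example runs without extra libraries; any object with fit and predict methods, such as a scikit-learn estimator, could be passed in the models dictionary.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def model_metrics(models, xtrain, ytrain, xtest, ytest):
    """Fit each model, then score it on both the training and test sets."""
    rows = []
    for name, model in models.items():
        model.fit(xtrain, ytrain)
        for split, x, y in (("train", xtrain, ytrain), ("test", xtest, ytest)):
            row = {"model": name, "split": split}
            row.update(scores(y, model.predict(x)))
            rows.append(row)
    return rows


class ComplaintRule:
    """Hypothetical stand-in classifier: predicts churn if complaints > 0."""

    def fit(self, x, y):
        return self  # nothing to learn for this fixed rule

    def predict(self, x):
        return [1 if row[0] > 0 else 0 for row in x]


# Made-up data: one feature (number of complaints) per customer.
xtrain, ytrain = [[0], [0], [1], [2], [0], [1]], [0, 0, 1, 1, 1, 0]
xtest, ytest = [[0], [1], [0], [3]], [0, 1, 0, 1]

results = model_metrics({"complaint_rule": ComplaintRule()},
                        xtrain, ytrain, xtest, ytest)
for row in results:
    print(row)
```

The returned rows can be passed directly to pandas.DataFrame to obtain a results table like the one described above. Comparing the train and test rows for each model is a quick check for overfitting.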
8. Performance Analysis
Confusion Matrix
AUC-ROC Curve
The ROC curve in the accompanying figure plots the true positive rate
(sensitivity) against the false positive rate for XGBoost (XGB) and the Random
Forest Classifier (RFC). Both models achieved an AUC of 0.983, indicating
excellent discrimination between the churn and non-churn classes.
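One way to read an AUC value: it is the probability that a randomly chosen churned customer receives a higher predicted score than a randomly chosen non-churned customer. The sketch below computes AUC directly from that definition on made-up scores. It is an illustration only: the data are invented, the function name roc_auc is hypothetical, and the project's reported AUC was presumably obtained with a library routine rather than this pairwise loop.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = churn) and predicted churn probabilities.
y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.20, 0.80, 0.90]

auc = roc_auc(y_true, y_score)
print(f"AUC: {auc:.3f}")
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.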
Performance Summary
Key Observations and Insights
10. Conclusion
Summary
The Iranian Churn Prediction Project achieved its objective of developing a robust
model to predict customer churn with high accuracy and recall. The analysis shows
that ensemble methods, particularly Random Forest and XGBoost, are highly
effective for this dataset, likely due to their ability to capture complex patterns and
interactions among features. These models are not only accurate but also stable and
reliable, making them well-suited for deployment in a real-world telecom business
context.
The churn prediction model developed in this project has significant potential
to help the telecom company minimise customer attrition. Its high performance,
particularly in AUC and recall, indicates that it can serve as a reliable tool
for churn prediction. By acting on these predictions, the company can implement
targeted retention strategies, enhancing customer satisfaction and ultimately
supporting sustainable business growth.
SOURCE CODE
3) MODEL BUILDING
4) GETTING METRICS FOR THE MODEL
5) CALCULATING ACCURACY, PRECISION, RECALL
7) DETERMINING CHURN VALUE
REFERENCES