
A REPORT on

Iranian Churn Prediction

Submitted to

KIIT Deemed to be University


In Partial Fulfilment of the Requirement for the Award of

MASTER’S DEGREE IN
COMPUTER APPLICATION

Submitted

By
Lakshya Namdeo 23700220
Ishani Banerjee 2370187
Dibya Prakash Dash 2370152

SCHOOL OF COMPUTER APPLICATION

KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY


BHUBANESWAR, ODISHA -751024
November 2024
INDEX

SNO. TOPIC

1 Dataset Description
2 Problem Statement
3 Methodology
4 Data Splitting
5 Classification Method
6 Coding Process
7 Performance Analysis
8 Results and Discussion
9 Conclusion
10 Source Code
11 Confusion Matrix
12 AUC-ROC Curve
13 References

2. Dataset Description

Overview

The dataset used in this project is the Iranian telecom churn dataset. It includes
various features that capture customer behaviour and demographics, which can help
predict whether a customer is likely to leave the service (churn) or remain
subscribed. The main aim of the dataset is to identify patterns and correlations that
can be utilised to develop a predictive model.

● Features:
○ Call Failure: Number of call failures experienced by the customer.
○ Complaints: Number of complaints raised by the customer.
○ Subscription Length: Duration of the customer’s subscription in
months.
○ Charge Amount: The total amount charged to the customer.
○ Frequency of Use: Frequency with which the customer uses the
service.
○ Other behavioural and usage metrics.
● Target Variable:
○ Churn: A binary indicator where 0 represents a non-churned customer
and 1 represents a churned customer.
● Class Distribution:
○ Non-Churned (0): 525 instances
○ Churned (1): 105 instances
○ Class Imbalance: There is a significant imbalance in the data, with
five times as many non-churned instances as churned (a 5:1 ratio). This
imbalance can affect the model’s performance if not handled properly.

Summary Statistics and Preprocessing

Each feature underwent preprocessing steps to prepare it for model training:

1. Handling Missing Values: Missing values were either imputed with
mean/median values or removed based on the extent of missing data.
2. Outlier Detection and Treatment: Outliers were detected using statistical
methods such as the Interquartile Range (IQR) and treated to prevent them
from skewing model predictions.
3. Feature Scaling: StandardScaler was used to standardise the features to
zero mean and unit variance, so that features measured on larger scales do
not dominate model training, especially for distance-based algorithms.
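
The three steps above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the project: the column names and toy values are made up, median imputation and capping at 1.5 times the IQR are one possible choice among those the report mentions, and cap_outliers_iqr is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def cap_outliers_iqr(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Clip every column to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return df.clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)


# Made-up toy data; the column names are assumptions for illustration.
toy = pd.DataFrame({
    "call_failure": [0, 1, 2, 3, 50],
    "subscription_length": [10, 20, 30, np.nan, 40],
})

filled = toy.fillna(toy.median())                # 1. impute with the median
capped = cap_outliers_iqr(filled)                # 2. cap outliers using the IQR
scaled = StandardScaler().fit_transform(capped)  # 3. zero mean, unit variance
```

In practice the scaler would be fitted on the training split only and then applied to the test split, so that no test information leaks into training.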

3. Problem Statement

The primary goal of this project is to predict customer churn for a telecom company
in Iran. Churn prediction models help businesses identify customers likely to leave,
enabling them to develop strategies to retain these customers. By accurately
predicting churn, the company can improve customer retention rates, reduce losses,
and increase profitability. Specifically, this project aims to develop a model that can
effectively classify whether a customer will churn based on their usage patterns,
complaints, and demographic data.

4. Methodology

Data Preprocessing

A series of preprocessing steps were carried out to clean and prepare the data:

1. Data Cleaning:
○ Missing values in the dataset were addressed using imputation
(mean/median) for numerical columns and mode for categorical
columns.
○ Categorical features were transformed into numerical values via
one-hot encoding where necessary.
2. Handling Outliers:
○ Outliers were detected through statistical methods (e.g., Z-score, IQR).
○ Outliers that could affect model performance were capped or removed
to ensure they didn’t distort the learning process.
3. Feature Scaling:
○ Standardisation using StandardScaler was applied to scale the
features, especially for models sensitive to feature scales (e.g., logistic
regression).
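
One way to combine imputation, one-hot encoding and scaling is a scikit-learn ColumnTransformer, sketched below. The column names (charge_amount, frequency_of_use, tariff_plan) and the toy rows are assumptions for illustration only; the report does not say how these steps were wired together.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups; the real dataset's column names may differ.
numeric_cols = ["charge_amount", "frequency_of_use"]
categorical_cols = ["tariff_plan"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols),
     ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,  # always return a dense array
)

# Made-up rows with one missing numeric and one missing categorical value.
toy = pd.DataFrame({
    "charge_amount": [1.0, 2.0, np.nan, 4.0],
    "frequency_of_use": [10.0, 20.0, 30.0, 40.0],
    "tariff_plan": pd.Series(["A", "B", "A", np.nan], dtype=object),
})

features = preprocess.fit_transform(toy)
```

The result has two scaled numeric columns followed by one indicator column per category.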

Model Evaluation Metrics

The following metrics were calculated to evaluate model performance:

● Accuracy: Measures the overall proportion of correct predictions.
● Precision: Indicates how many of the predicted churned customers were
actually churned.
● Recall: Reflects how well the model identifies actual churned customers.
● F1 Score: Balances precision and recall, useful in scenarios with class
imbalance.
● AUC-ROC: Measures the model's ability to distinguish between the churn and
non-churn classes.
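
The first four metrics follow directly from the confusion-matrix counts. The sketch below computes them for made-up counts (not results from this project), chosen only to match the 525/105 class sizes quoted earlier; classification_metrics is a hypothetical helper name.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts only: 105 churned (90 caught) and 525 non-churned.
metrics = classification_metrics(tp=90, tn=515, fp=10, fn=15)
```

With these counts the accuracy is 605/630 (about 0.96) while recall is 90/105 (about 0.86), which shows why accuracy alone can look flattering on imbalanced data.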

5. Data Splitting

The dataset was split into training and testing sets to evaluate the model’s
performance:

● Split Ratio: 80% for training, 20% for testing.
● Random Seed: random_state=12 was set to ensure the split is
reproducible.
● Data Shapes:
○ Training set: (xtrain, ytrain) where xtrain includes features,
and ytrain includes churn labels.
○ Testing set: (xtest, ytest) for model evaluation.

This split allowed the model to learn from the majority of the data while providing a
separate set for unbiased performance evaluation.
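
A minimal sketch of such a split with scikit-learn's train_test_split is shown below. The feature matrix is synthetic, the labels simply reuse the class counts quoted earlier, and stratify=y is an added assumption (the report does not state whether the split was stratified) that keeps the churn ratio the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 630 rows, 5 made-up features, 525/105 labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(630, 5))
y = np.array([0] * 525 + [1] * 105)

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y,
    test_size=0.2,    # 80% training, 20% testing
    random_state=12,  # same seed as stated in the report
    stratify=y,       # assumption: preserve the class ratio in both sets
)
```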

6. Classification Method

To tackle the problem, several machine learning algorithms were used:

1. Logistic Regression: A baseline linear model for binary classification. It
assumes a linear relationship between the features and the log odds of the
target class.
2. Decision Tree Classifier: Captures non-linear relationships by recursively
splitting the data into homogeneous sets based on feature values.
3. Random Forest Classifier: An ensemble of decision trees that reduces
overfitting and improves generalisation by averaging multiple trees'
predictions.
4. XGBoost Classifier: A powerful gradient-boosted tree algorithm known for
high accuracy, especially in classification problems with structured data.
5. Gradient Boosting Classifier: A boosting algorithm that sequentially builds
weak learners to minimise prediction errors.
6. AdaBoost Classifier: A boosting technique that reweights training instances
so that later learners focus on previously misclassified ones, which can help
on hard-to-classify cases such as the minority class.

Each model was evaluated using the model_metrics function, which computed
performance metrics on both training and test sets.

7. Coding Process

Libraries Used

● Data Handling: pandas, numpy
● Modelling and Metrics: sklearn, xgboost
● Visualisation: matplotlib, seaborn for plotting metrics and performance
graphs

Key Functions and Process

1. Data Loading and Preprocessing: Loaded the dataset and performed initial
data cleaning.
2. Model Training: Each model was trained on the training data (xtrain,
ytrain).
3. Evaluation: The model_metrics function iteratively fitted each model and
computed accuracy, precision, recall, and F1 scores, storing the results in a
DataFrame.
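
The project's own model_metrics code is not reproduced in this text, so the following is only a sketch of what such a helper might look like. Its signature, the column names of the result, the model settings and the synthetic data are all assumptions; XGBoost is left out so that the sketch depends on scikit-learn and pandas alone.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def model_metrics(models, xtrain, ytrain, xtest, ytest):
    """Fit each model and collect train/test metrics in a DataFrame."""
    rows = []
    for name, model in models.items():
        model.fit(xtrain, ytrain)
        for split, x, y in (("train", xtrain, ytrain), ("test", xtest, ytest)):
            pred = model.predict(x)
            rows.append({
                "model": name,
                "split": split,
                "accuracy": accuracy_score(y, pred),
                "precision": precision_score(y, pred, zero_division=0),
                "recall": recall_score(y, pred, zero_division=0),
                "f1": f1_score(y, pred, zero_division=0),
            })
    return pd.DataFrame(rows)


# Synthetic, imbalanced stand-in data (not the Iranian churn dataset).
x, y = make_classification(n_samples=630, n_features=8, weights=[0.83],
                           random_state=12)
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=12, stratify=y)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=12),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=12),
    "GradientBoosting": GradientBoostingClassifier(random_state=12),
    "AdaBoost": AdaBoostClassifier(random_state=12),
}
results = model_metrics(models, xtrain, ytrain, xtest, ytest)
```

Reporting the training and test rows side by side makes overfitting visible: a model with near-perfect training scores but clearly lower test scores has memorised the training data.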

8. Performance Analysis

Confusion Matrix

The confusion matrix summarises the performance for each model:

● True Positives (TP): Churned customers correctly predicted as churned.
● True Negatives (TN): Non-churned customers correctly predicted as
non-churned.
● False Positives (FP): Non-churned customers incorrectly predicted as
churned.
● False Negatives (FN): Churned customers incorrectly predicted as
non-churned.
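
The four counts can be read off scikit-learn's confusion_matrix, as in this small sketch. The labels and predictions are made up for illustration and are not the project's results.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions (1 = churned, 0 = non-churned).
ytest = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
ypred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(ytest, ypred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # prints: 5 1 1 3
```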

For instance, in a sample confusion matrix, RandomForestClassifier achieved the
following:

AUC-ROC Curve

The ROC curve plots the true positive rate (sensitivity) against the false
positive rate for XGBoost (XGB) and Random Forest Classifier (RFC) as the
decision threshold varies. Both models achieved an AUC score of 0.983,
indicating excellent discrimination between churn and non-churn classes.
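
As a small worked example of how an AUC value arises, the sketch below uses made-up labels and scores (not the project's predictions). The AUC equals the fraction of (churned, non-churned) pairs in which the churned customer receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted churn probabilities.
ytrue = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

# Points of the ROC curve, one per distinct threshold.
fpr, tpr, thresholds = roc_curve(ytrue, scores)

# 15 of the 16 (churned, non-churned) pairs are ranked correctly.
auc = roc_auc_score(ytrue, scores)
print(auc)  # prints: 0.9375
```

The only misranked pair is the churned customer scored 0.4 and the non-churned customer scored 0.6.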

Performance Summary

9. Results and Discussion

Key Observations and Insights

1. High AUC and ROC Performance:
○ Both the Random Forest and XGBoost models achieved an AUC of
0.983, indicating a strong ability to distinguish between churned and
non-churned customers. A high AUC means the models rank churned
customers above non-churned ones at most thresholds, so a decision
threshold can be chosen that keeps both false positives and false
negatives low, making them reliable for churn prediction.
2. Accuracy and Recall Balance:
○ While accuracy is an important metric, recall is critical in churn
prediction to ensure we capture as many actual churn cases as
possible. The ensemble models (Random Forest and XGBoost)
achieved a good balance between accuracy and recall, which is
beneficial for business applications where failing to identify churned
customers can lead to revenue loss.
3. Impact of Ensemble Techniques:
○ Random Forest and XGBoost both utilise ensemble techniques,
which combine multiple decision trees to improve robustness and
reduce overfitting. The superior performance of these models over
single estimators like Decision Trees indicates that ensemble methods
are particularly effective for this dataset, which may contain complex
patterns that are better captured through an ensemble approach.
4. Model Stability:
○ The ensemble models also showed greater stability across multiple
runs, with consistent results in terms of performance metrics (accuracy,
precision, recall). This stability is essential for real-world applications
where the model might be deployed and updated periodically. Stability
reduces the need for constant retraining and fine-tuning, lowering
maintenance costs.
5. Performance of Boosting Algorithms:
○ Gradient Boosting and AdaBoost performed well but were slightly
less effective than XGBoost. XGBoost is a regularised, heavily optimised
implementation of gradient boosting, which may explain its higher
scores. This suggests that while boosting methods in general are
suitable, XGBoost may be preferable when working with complex datasets.
6. Effect of Class Imbalance:
○ Although the dataset was imbalanced, the models still achieved high
recall rates for the minority class (churned customers). In a real-world
setting, additional techniques such as SMOTE (Synthetic Minority
Over-sampling Technique) could be used to further balance the
dataset, possibly enhancing recall without compromising precision.
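
The sketch below illustrates the balancing idea with plain random oversampling, which is a simpler stand-in and not SMOTE itself: it duplicates existing churned rows, whereas SMOTE creates new synthetic points by interpolating between neighbouring minority samples (an implementation is available in the imbalanced-learn library). The helper name oversample_minority and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.utils import resample


def oversample_minority(x, y, minority_label=1, random_state=12):
    """Duplicate random minority rows until both classes are the same size."""
    x = np.asarray(x)
    y = np.asarray(y)
    minority = y == minority_label
    n_extra = int((~minority).sum() - minority.sum())
    if n_extra <= 0:
        return x, y  # already balanced, nothing to add
    x_extra, y_extra = resample(
        x[minority], y[minority],
        replace=True, n_samples=n_extra, random_state=random_state)
    return np.vstack([x, x_extra]), np.concatenate([y, y_extra])


# Synthetic features with the 525/105 class counts quoted in this report.
rng = np.random.default_rng(0)
x = rng.normal(size=(630, 3))
y = np.array([0] * 525 + [1] * 105)

xbal, ybal = oversample_minority(x, y)
```

Any resampling of this kind should be applied to the training set only, so that the test set keeps the real class distribution.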

10. Conclusion

Summary

The Iranian Churn Prediction Project achieved its objective of developing a robust
model to predict customer churn with high accuracy and recall. The analysis shows
that ensemble methods, particularly Random Forest and XGBoost, are highly
effective for this dataset, likely due to their ability to capture complex patterns and
interactions among features. These models are not only accurate but also stable and
reliable, making them well-suited for deployment in a real-world telecom business
context.

In conclusion, the churn prediction model developed in this project has significant
potential to assist the telecom company in minimising customer attrition. The model’s
high performance, particularly in AUC and recall, demonstrates its ability to serve as
a reliable tool for churn prediction. By acting on these predictions, the company can
implement targeted retention strategies, enhancing customer satisfaction and
ultimately supporting sustainable business growth.

SOURCE CODE

1) IMPORTING DATASETS AND LIBRARIES

2) EXTRACTING DATASET FEATURES

3) MODEL BUILDING

4) GETTING METRICS FOR THE MODEL

5) CALCULATING ACCURACY, PRECISION, RECALL

6) BUILDING CONFUSION MATRIX

7) DETERMINING CHURN VALUE

8) PLOTTING THE ROC-AUC CURVE

REFERENCES

1) UCI Machine Learning Repository
2) Dataset source: Iranian Telecom Dataset
3) Documentation and tutorials from the Scikit-Learn and XGBoost libraries
for model implementation.
