Churn Prediction in Telecom Using Machine Learning in R
Table of Contents
1. Introduction
   a. Background
   b. Problem Identification
2. Data
   a. Data Structure
   b. Preprocessing
3. Methodology
   a. Data Splitting
   b. Model Training
   c. Model Evaluation
   d. Feature Importance
4. Conclusion
1. Introduction
a. Background
In the highly competitive telecom industry, retaining existing customers has become more important than attracting new ones. Customer churn is one of the biggest challenges telecom companies face: revenue falls as subscribers move to competitors. Because retaining a customer is cheaper than acquiring a new one, reducing churn has become critical. Over the past decade, the rise of machine learning (ML) has greatly improved customer behaviour analysis, revealing patterns that support early action to retain customers. These analytics allow operators to recognise when customers are likely to leave and to develop customised retention strategies, increasing customer loyalty and satisfaction.
b. Problem Identification
Customer churn directly affects firm profitability, customer lifetime value, and overall market stability. Classical analytical approaches are often unable to capture the complex relationships between customer characteristics such as billing patterns, contract details, usage patterns, and demographics and the likelihood of churn. This gap leads to delayed responses, or to churn risks not being forecast at all. Telecom companies hold massive pools of data spanning client backgrounds, service engagements, and billing history, yet they cannot take full advantage of it without churn prediction systems. The major challenges are predicting churn accurately, understanding why customers churn, and identifying which factors have the greatest impact on churn.
The project also examines feature importance, explaining the key drivers of churn, and visualises model outputs through ROC curves and confusion matrices. The work is confined to binary churn classification on structured data and uses the R programming language and its ML ecosystem, including caret, randomForest, nnet, pROC, and ggplot2. The main purpose is to provide actionable insights that help telecom businesses create successful retention strategies and make data-driven decisions.
2. Data
a. Data Structure
The dataset for this study contains 7043 records and 21 attributes, each record describing a single telecom customer. The attributes cover demographics (gender, senior citizen status), account details (tenure, contract type), the services a customer uses (phone, internet), and financial metrics (monthly and total charges). The main objective of this research is to model these features in order to predict the binary target variable Churn, which indicates whether a customer has defected from the service (Yes) or not (No). Categorical variables such as InternetService and gender are stored as factors, alongside continuous variables such as MonthlyCharges and tenure. An initial review using the R function str() found that most columns had been assigned the correct data types, except for the TotalCharges column, which was read as a character field because of blank entries.
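A minimal sketch of this initial inspection, assuming the data are read into a data frame named telco (the file name is illustrative):

# Read the raw data and inspect the column types
telco <- read.csv("Telco-Customer-Churn.csv")
str(telco)   # TotalCharges appears as character because of blank entries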
b. Preprocessing
Handling Missing Values
Two primary procedures were carried out during data cleaning. First, missing values were addressed. The TotalCharges column contained empty fields, which became NA values once the column was converted to numeric. The na.omit() function was then used to remove these rows, deleting only a minuscule percentage of the data while ensuring that the modeling records were complete and consistent. A minimal sketch of this step, assuming the data frame is named telco:
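# Coerce TotalCharges to numeric; blank entries become NA (with a warning)
telco$TotalCharges <- as.numeric(telco$TotalCharges)

# Remove the small number of rows containing NA values
telco <- na.omit(telco)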
Outlier Detection
Second, outlier detection was performed. Boxplots of MonthlyCharges and TotalCharges were used to look for extreme values that could bias the models. Although some unusually high values were detected, they were consistent with genuine customer behavior (for example, long-tenured customers accumulating high total charges). Consequently, these observations were not treated as outliers and were retained in the dataset. Instead of excluding them, the numerical features were normalized prior to training to support the neural network. The boxplots show that MonthlyCharges mostly falls in the range from 20 to 90, while TotalCharges is more widely spread. These checks confirmed that the dataset was free of obvious data-entry errors and problematic outliers, providing a good basis for building the models.
Boxplot for MonthlyCharges:
Boxplot for TotalCharges:
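A minimal sketch of how such boxplots can be drawn with base R graphics (variable names follow the dataset):

boxplot(telco$MonthlyCharges, main = "Boxplot for MonthlyCharges", ylab = "Monthly charges")
boxplot(telco$TotalCharges, main = "Boxplot for TotalCharges", ylab = "Total charges")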
3. Methodology
a. Data Splitting
To evaluate model effectiveness, the dataset was divided into two sets: 70% for training and 30% for validation. The createDataPartition() function from the caret package was used so that the class balance of the target variable Churn was preserved in both the training and validation sets. Such stratified sampling is important for reducing sampling bias and keeping churned and retained customers proportionally represented during model evaluation.
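A minimal sketch of this split with caret (the seed value and object names are illustrative):

library(caret)

# Convert the categorical text columns (including the target Churn) to factors
telco[sapply(telco, is.character)] <- lapply(telco[sapply(telco, is.character)], factor)

set.seed(123)  # illustrative seed, for reproducibility
train_index <- createDataPartition(telco$Churn, p = 0.70, list = FALSE)
train_data <- telco[train_index, ]
test_data  <- telco[-train_index, ]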
b. Model Training
Three classification algorithms were developed to predict customer churn. The chosen models balance interpretability and reliability with the capacity to capture non-linear relationships in the data.
Logistic Regression: One of the most widely used approaches for binary classification problems. By applying a logistic function to a linear combination of the input features, it estimates the probability that a customer will leave the service. Its main strength is interpretability: it shows clearly how individual factors contribute to churn.
Random Forest: This ensemble technique constructs many decision trees during training and takes the predicted class as the mode of the individual tree outputs. The method improves accuracy and reduces overfitting, and it also quantifies the role each feature plays in determining the outcome.
Neural Network: A feedforward model with a single hidden layer of five nodes was used, trained with the nnet package on standardized numerical features. Although the inputs must be preprocessed carefully (in this case, by scaling the features), neural networks are useful for learning fine-grained patterns.
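A minimal sketch of how the three models can be fitted (object names are illustrative; the customer identifier column name is assumed):

library(randomForest)
library(nnet)

# Drop the customer identifier (column name assumed); it carries no predictive signal
train_data$customerID <- NULL
test_data$customerID  <- NULL

# Logistic regression on the full set of predictors
log_model <- glm(Churn ~ ., data = train_data, family = binomial)

# Random forest with the package default of 500 trees
rf_model <- randomForest(Churn ~ ., data = train_data)

# Standardize the numeric predictors before fitting the neural network
num_cols <- c("tenure", "MonthlyCharges", "TotalCharges")
train_scaled <- train_data
train_scaled[num_cols] <- scale(train_scaled[num_cols])

# Feedforward network with a single hidden layer of five nodes
nn_model <- nnet(Churn ~ ., data = train_scaled, size = 5, maxit = 200)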
c. Model Evaluation
Performance was measured quantitatively using three main evaluation metrics:
• Accuracy: the proportion of correct predictions out of all predictions made.
• Sensitivity (Recall): the proportion of customers who actually churned that the model correctly identifies.
• Specificity: the proportion of customers who did not churn that the model correctly identifies.
To allow a direct comparison, each model was evaluated on the same validation dataset. The results for each model are given in the table below.
Model                  Accuracy    Sensitivity    Specificity
Logistic Regression    0.8140      0.5911         0.8947
Random Forest          0.8027      0.5482         0.8947
Neural Network         0.8050      0.5393         0.9012
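A minimal sketch of how these metrics can be obtained with caret's confusionMatrix(), using the logistic regression model as an example (object names follow the earlier sketches; the 0.5 cut-off is an assumption):

library(caret)
library(pROC)

# Predicted churn probabilities and classes on the validation set
log_prob <- predict(log_model, newdata = test_data, type = "response")
log_pred <- factor(ifelse(log_prob > 0.5, "Yes", "No"), levels = levels(test_data$Churn))

# Accuracy, sensitivity and specificity, with "Yes" (churn) as the positive class
confusionMatrix(log_pred, test_data$Churn, positive = "Yes")

# ROC curve for the same predictions
plot(roc(test_data$Churn, log_prob))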
d. Feature Importance
Feature importance was calculated from the random forest model using the mean decrease in Gini impurity. This score indicates how much each feature contributes to reducing the impurity, and hence the unpredictability, of the classification task. The analysis showed that TotalCharges, MonthlyCharges, tenure, and Contract contributed most to predicting the churn outcome. These findings are intuitive: they show that a customer's financial commitment and loyalty to the firm are closely linked to churn.
To display the importance ranking, the randomForest package's built-in helpers can be used; a minimal sketch, assuming the fitted model is named rf_model:
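# Mean decrease in Gini impurity for each feature, and a plot of the ranking
importance(rf_model)
varImpPlot(rf_model)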
4. Conclusion
Among the three models, Logistic Regression achieved the best overall performance on the validation set, with the highest accuracy (0.8140) and sensitivity (0.5911). Random Forest displayed strong results, but its accuracy and sensitivity were both inferior to those reached by Logistic Regression.
Because Logistic Regression also offers clear insights into how individual features influence churn, it is of particular value to business stakeholders, who can use these insights to make strategic decisions. This clarity makes it possible to design workable business strategies and to explain the results to stakeholders without technical knowledge. Consequently, Logistic Regression is the preferred model for predicting customer churn on this telecom dataset.