Customer Churn Data - A Project Based On Logistic Regression
Project Objective
This case study is prepared for a mobile operator so that it can predict which customers are likely to leave
and switch to another provider. The objective is to build the best model, using predictive modeling techniques,
that can identify such customers.
We will perform the steps below and analyze the data using predictive modeling techniques to identify
such customers:
1. EDA
1.1 Explore how the data looks through univariate and bivariate analysis, with plots and charts that
illustrate the relationships between variables.
1.2 Look for outliers and missing values.
1.3 Check multicollinearity and treat it.
1.4 Summarize the insights gained from EDA.
2. Build Various Predictive Models and compare them to get to the best one
2.1 Building Logistic Regression Model and its interpretation.
2.2 Building KNN Model and its interpretation.
2.3 Building Naive Bayes and its interpretation.
2.4 Performing Model Comparison using model performance metrics.
3. Actionable Insights
3.1 Interpretation & Recommendations from the best model
The complete case study is performed on the given dataset (Cellphone.csv) to build a suitable prediction model
using predictive modeling techniques such as Logistic Regression, KNN and Naive Bayes, and finally to measure
model performance with various metrics:
Confusion Matrix (for all models)
AUC – ROC (for all models)
Gini Coefficient (only for Logistic regression)
Kolmogorov Smirnov (KS) Chart (only for Logistic regression)
Predictive modeling is the process of using data mining and probability to forecast outcomes. A model is made up
of several predictors, which are variables that are likely to influence future results. Once data has been collected for
the relevant predictors, a statistical model is formulated. There are different methodologies for building a predictive
model, but in this project we will use Logistic Regression along with KNN and Naïve Bayes classifiers.
Predictive modeling works on a constructive feedback principle: when we build a model, it is crucial to get
feedback from metrics, improve the trained model, and continue until we achieve a desirable accuracy.
Evaluation metrics explain the performance of a model; an important aspect of evaluation metrics is their
capability to discriminate among model results.
Predictive Modeling Process
Customer churn is a subscriber cancelling their mobile or fixed-line service or porting out to another service provider.
Churn is a loss to the service provider, as it is difficult to win back a lost customer, and it is bad marketing if
he or she leaves due to a bad customer or user experience.
There are several reasons why customers switch from their existing network provider:
- Network issues
- Billing issues
- Competitive offers from another Telecom Service Provider
- Financial hardship
- Bad user experience
- High roaming charges
- Overage charge
In this case study we will build a model from the existing data on customers who have already churned: their
behavior before they actually switched to another operator and the reasons behind the switch. This will let us
identify customers who may churn in the future, so that the service provider can take the necessary steps to
retain them.
Data Dictionary
The dataset contains records for 3333 customers, with information about their usage behavior, contract
details and payment details. The data also indicates which customers cancelled their service.
Variables Description
Churn 1 if customer cancelled service, 0 if not
Account Weeks number of weeks customer has had active account
Contract Renewal 1 if customer recently renewed contract, 0 if not
Data Plan 1 if customer has data plan, 0 if not
Data Usage gigabytes of monthly data usage
Cust Serv Calls number of calls into customer service
Day Mins average daytime minutes per month
Day Calls average number of daytime calls
Monthly Charge average monthly bill
Overage Fee largest overage fee in last 12 months
Roam Mins average number of roaming minutes
'data.frame': 3333 obs. of 11 variables:
$ Churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ AccountWeeks : int 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 1 ...
$ DataPlan : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 2 1 1 2 ...
$ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : int 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num 265 162 243 299 167 ...
$ DayCalls : int 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
Summary of Data
1. Churn
Conclusion: -
1. Churn has 2 values, 0 and 1.
2. Out of 3333 customers, 2850 customers had not churned (value 0) and 483 had churned (value 1).
3. The ratio of 1s to 0s is roughly 1:6, so data balancing is not required.
4. The distribution is No (0) – 85.5% and Yes (1) – 14.5%.
5. The churn rate is therefore 14.5%.
2. Account Weeks
Conclusion: -
1. AccountWeeks is a numeric variable.
2. On average, customers have had an active account for 100+ weeks.
3. Most customers in the dataset have had active accounts for 50 to 150 weeks.
4. Customers who have been with the provider longer appear less likely to leave, but no clear
interpretation can be made from this variable alone.
5. The outliers are expected: customers who wish to stay with the current provider will naturally
accumulate a higher number of active account weeks.
3. Contract Renewal
Conclusion: -
1. Contract Renewal has two values: 1 for a recent contract renewal by the customer and 0 for no
renewal.
2. The distribution is clearly imbalanced. It seems people are keen on renewing their contract with
the company, which can be a plus point for the company.
3. 90% of customers have renewed their contract.
4. It is very clear that customers who have not renewed their contract are more likely to churn.
4. Data Plan
Conclusion: -
1. Data Plan is a binary categorical variable: 1 if the customer has a data plan, 0 if not.
2. Only around 30% of customers have a data plan.
3. Customers are more likely to churn if they do not have a data plan.
4. This is somewhat unexpected and indicates that there may be a problem with the data plan
offering that needs to be checked.
5. Data Usage
Conclusion: -
1. Data Usage is a numeric variable (gigabytes of monthly data usage).
2. Data usage has a very interesting distribution. It is clearly visible that usage peaks in the
0 to 0.5 GB bucket; beyond that it is roughly normally distributed with a high standard deviation.
3. Customers who use less data are more likely to churn.
6. Customer Service Call
Conclusion: -
1. CustServCalls is a numeric variable.
2. Most customers have called customer service three times or fewer.
7. Daily Minutes
Conclusion: -
1. DayMins shows the average number of daytime minutes per month the customer was on calls.
2. The majority of customers use between 100 and 300 minutes, with a mean of about 180 minutes.
3. Average daytime minutes per month is normally distributed with some outliers. The outliers are
expected because average daytime minutes vary from person to person: some customers may use the
phone for business, while the lower end may be students.
8. Day Calls
Conclusion: -
1. The average number of daytime calls is slightly left skewed with some outliers.
2. Most customers make between 50 and 150 daytime calls.
9. Monthly Charges
Conclusion: -
1. The mean monthly charge is around $56.
2. Most customers have a monthly charge between $0 and $100.
3. The average monthly bill is slightly right skewed, with some outliers toward the right end that
may correspond to business users.
10. Overage Fee
Conclusion: -
1. The mean overage fee is about $10.
2. Most customers pay an overage fee in the $6 to $12 bracket.
3. The largest overage fee in the last 12 months is normally distributed with some outliers.
11. Roaming Minutes
Conclusion: -
1. 90% of customers have between 5 and 15 roaming minutes.
2. The average number of roaming minutes is normally distributed with some outliers.
Checking Correlation
Let's look at the correlation between all the variables before we proceed further with creating the Logistic
Regression model, and treat highly correlated variables accordingly to build the regression model.
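A minimal sketch of this check, mirroring the appendix code (cellphone1 is assumed to be the raw numeric copy of the data, kept before the factor conversion):
library(corrplot)
corrplot(cor(cellphone1)) # correlation matrix of all numeric variables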
Correlation Interpretation
- Data Usage and Data Plan are highly correlated.
- Monthly Charge is also highly correlated with Data Usage, Data Plan and Day Mins.
- Churn does not seem to be highly correlated with any of the variables.
- Churn has its highest correlations with Contract Renewal, Customer Service Calls and Day Mins.
Since the correlations alone do not clearly identify the variables on which customer churn can be predicted, we
will perform logistic regression to dig deeper into the given dataset and find the variables that play a
significant role.
Logistic Regression
We will start with Logistic Regression analysis, as it will give us clear insight into which variables are
significant for building the predictive model, so that we can achieve more precision by eliminating irrelevant
variables. Before proceeding, we will split the available customer data into Train and Test datasets and then
perform logistic regression.
Before we split the data into Train and Test, we will convert the categorical variables (Churn, ContractRenewal
and DataPlan) into factor variables.
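A minimal sketch of this preprocessing and the 70:30 split, mirroring the appendix code (the seed value is the one used there):
cellphone$Churn = as.factor(cellphone$Churn)
cellphone$ContractRenewal = as.factor(cellphone$ContractRenewal)
cellphone$DataPlan = as.factor(cellphone$DataPlan)
library(caTools)
set.seed(1000)
spl = sample.split(cellphone$Churn, SplitRatio = 0.7) # stratified on Churn
train = subset(cellphone, spl == TRUE)
test = subset(cellphone, spl == FALSE)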
Building Logistic Regression Model based upon all the given variables
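A sketch of this first model, fit on the training split (the name logreg1 is an assumption; the appendix lists the reduced models that follow):
logreg1 = glm(Churn ~ ., data = train, family = binomial)
summary(logreg1)
library(car)
vif(logreg1) # variance inflation factors, to check multicollinearity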
Interpretation from the logistic model using all the available variables, after checking multicollinearity:
- Multicollinearity has inflated the VIF values of the correlated variables, making this model unreliable.
- We can clearly see that DataUsage and MonthlyCharge contribute most to the inflated values, so let us
build another model after reducing variables based on their VIF values.
We will not use the MonthlyCharge and DataUsage variables, as they inflate the VIF results, and will create a
new model (Model 2, sketched below). AccountWeeks and DayCalls are also insignificant for the model, so we will
then remove them as well and create Model 3 based upon the rest of the variables.
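A sketch of Model 2, dropping the two high-VIF variables (mirroring the appendix, fit on the training split):
logreg2 = glm(Churn ~ . - MonthlyCharge - DataUsage, data = train, family = binomial)
summary(logreg2)
vif(logreg2) # check that the inflation has come down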
Creating Model 3 - Logistic Regression built after Removing all insignificant variables
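A sketch of Model 3, with all the insignificant variables removed (mirroring the appendix, fit on the training split):
logreg3 = glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls,
              data = train, family = binomial)
summary(logreg3)
vif(logreg3) # VIF values should now be within range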
Based on this newly built model we can see that all variables are significant, and we can verify this by
checking the multicollinearity as well.
The VIF values are now within range, all variables are significant, and the results make more sense and are in
line with the results we obtained from the EDA.
1. Confusion Matrix: -
We will start model evaluation on the train and test data by executing the R code below and will see how
accurately we can predict whether a customer will churn or not.
Calculating Confusion Matrix on Train and Test Data: - We predict a classification of 0 or 1 for each row and
then put the actual and predicted values into a table to build a confusion matrix, which shows how accurate our
model is.
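A minimal sketch of this step (the 0.5 probability cutoff and the object names are assumptions):
predTrain = predict(logreg3, newdata = train, type = "response") # churn probabilities
predTest = predict(logreg3, newdata = test, type = "response")
table(Actual = train$Churn, Predicted = predTrain > 0.5) # confusion matrices
table(Actual = test$Churn, Predicted = predTest > 0.5)
mean((predTrain > 0.5) == (train$Churn == "1")) # train accuracy
mean((predTest > 0.5) == (test$Churn == "1")) # test accuracy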
Confusion Matrix Output: -
From the confusion matrix we can clearly see that our model is 86.11% accurate on the train data, and the test
data confirms this with 86.40% accuracy. There is a slight variation, but it is within range, so we can conclude
that our model is a good one.
2. ROC
The ROC curve is the plot of sensitivity against (1 - specificity).
(1 - specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate.
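A sketch using the ROCR package (mirroring the appendix; the train-side names pred1 and perf1 are assumptions, paralleling pred2 and perf2 for test):
library(ROCR)
predROCTrain = predict(logreg3, newdata = train, type = "response")
pred1 = prediction(predROCTrain, train$Churn)
perf1 = performance(pred1, "tpr", "fpr")
plot(perf1, colorize = T)
as.numeric(performance(pred1, "auc")@y.values) # AUC on train
predROCTest = predict(logreg3, newdata = test, type = "response")
pred2 = prediction(predROCTest, test$Churn)
perf2 = performance(pred2, "tpr", "fpr")
plot(perf2, colorize = T)
as.numeric(performance(pred2, "auc")@y.values) # AUC on test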
ROC Output Analysis: -
From the plot we can see that the curve covers a large area, so the model discriminates well on the true
positive side.
On the train data the area under the curve is 82.01% and on the test data it is 81.11%. There is no major
variation between test and train, which shows that our model is stable.
3. K-S chart
K-S measures the degree of separation between the customers likely to churn and those who will continue using
services from the current operator.
By executing the code below on the train and test predictions, we can see the K-S result:
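The K-S statistic is the maximum vertical separation between the TPR and FPR curves; a sketch using the ROC objects built above (object names match the appendix):
KSLRTrain = max(attr(perf1, 'y.values')[[1]] - attr(perf1, 'x.values')[[1]])
KSLRTrain
KSLRTest = max(attr(perf2, 'y.values')[[1]] - attr(perf2, 'x.values')[[1]])
KSLRTest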
4. Gini chart
The Gini coefficient is the ratio of the area between the ROC curve and the diagonal line to the area of the
upper triangle; equivalently, Gini = 2*AUC - 1.
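Since Gini = 2*AUC - 1, it can be computed directly from the AUC values obtained above (a sketch; the object names are assumptions):
giniTrain = 2 * as.numeric(performance(pred1, "auc")@y.values) - 1
giniTrain
giniTest = 2 * as.numeric(performance(pred2, "auc")@y.values) - 1
giniTest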
k-NN Classification
k-NN is a supervised learning algorithm: it uses labeled input data to learn a function that produces an
appropriate output when given new, unlabeled data. So, let's build our classification model by following the
steps below:
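A minimal sketch of the data preparation and fit, mirroring the appendix (cellphone1 is the raw numeric copy of the data; the predictors are standardized so that no single feature dominates the distance calculation):
data = cellphone1[, -1] # drop the Churn column
norm.data = scale(data) # standardize the predictors
usable.data = as.data.frame(cbind(churn = cellphone1[, 1], norm.data))
library(caTools)
spl = sample.split(usable.data$churn, SplitRatio = 0.7)
nbtrain = subset(usable.data, spl == TRUE)
nbtest = subset(usable.data, spl == FALSE)
library(class)
pred3 = knn(nbtrain[-1], nbtest[-1], nbtrain[, 1], k = 3) # value of k can be chosen near sqrt(n)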
1. Confusion Matrix: -
2. ROC
In the test data the true positive rate is 79.64%, which suggests the model is stable.
k-NN Classification with k = 5
1. Confusion Matrix: -
From the confusion matrix we can clearly see that the accuracy of the k = 5 model is 90.20%.
2. ROC
In the test data the true positive rate is 85.09%, which again indicates a stable model.
Naïve Bayes Classification
Naïve Bayes is a probabilistic classifier that applies Bayes' theorem with a naïve assumption of independence
between the features.
1. Confusion Matrix: -
Calculating Confusion Matrix on Train and Test Data: -
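A sketch of this step using the e1071 package (the model fit mirrors the appendix; the prediction object names are assumptions):
library(e1071)
nbtrain$churn = as.factor(nbtrain$churn)
nbtest$churn = as.factor(nbtest$churn)
NB1 = naiveBayes(churn ~ ., data = nbtrain)
predNBtrain = predict(NB1, newdata = nbtrain)
predNBtest = predict(NB1, newdata = nbtest)
table(Actual = nbtrain$churn, Predicted = predNBtrain)
table(Actual = nbtest$churn, Predicted = predNBtest)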
Confusion Matrix Output: -
From the confusion matrix we can clearly see that the model predicts the churn rate with 85.04% accuracy on the
train data and 84.7% accuracy on the test data.
Conclusion
Attributes contributing to customer churn in the company:
1. The Logistic Regression, k-NN and Naïve Bayes models are all able to predict customer churn with around
85% accuracy or better; the k-NN model with 5 nearest neighbours has the highest accuracy, at about 90%.
2. MonthlyCharge, DataUsage, DayCalls and AccountWeeks do not have a significant effect on customer churn,
so these attributes can be dropped when building further models, which may give better accuracy.
3. Customer service calls have the highest correlation with customer churn. The bivariate analysis also shows
that as the number of customer service calls increases, the churn rate increases. This is a very useful
observation: customers who have to call customer service repeatedly become frustrated, which leads to
churn. This is an area the company should focus on to reduce the churn rate.
4. Contract Renewal has a high negative correlation with churn: customers who have not recently renewed
their contract churn at a higher rate. This is a meaningful insight for the company, which should focus on
contract renewal and provide offers that make renewing plans more attractive to customers.
5. DayMins, i.e. average daytime minutes per month, is also a key feature contributing to customer churn.
The company should offer discounts and a variety of plans so that customers stick with the company's plans
and the churn rate is reduced.
Appendix
R Code
# Read the data (file per the project brief; the path is assumed)
cellphone = read.csv("Cellphone.csv")
cellphone1 = cellphone # untouched numeric copy, used later for correlation, k-NN and NB
dim(cellphone)
str(cellphone)
summary(cellphone)
attach(cellphone)
colSums(is.na(cellphone))
cellphone$Churn = as.factor(cellphone$Churn)
cellphone$ContractRenewal = as.factor(cellphone$ContractRenewal)
cellphone$DataPlan = as.factor(cellphone$DataPlan)
library(ggplot2)
# Churn#
table(cellphone$Churn)
prop.table(table(cellphone$Churn))
ggplot(cellphone, aes(Churn)) + geom_bar(fill="aquamarine4")
# Account Weeks#
summary(cellphone$AccountWeeks)
boxplot(AccountWeeks, main= "Boxplot of account age in Weeks")
ActWeeks <- cut(cellphone$AccountWeeks, breaks = seq(0, 250, by = 25))
ggplot(cellphone, aes(ActWeeks, ..count.., fill = Churn)) + geom_bar(position="dodge")
# Contract Renewal#
ggplot(cellphone, aes(ContractRenewal, ..count.., fill = Churn)) + geom_bar(position="dodge")
# Data Plan#
ggplot(cellphone, aes(DataPlan, ..count.., fill = Churn)) + geom_bar(position="dodge")
table(cellphone$Churn, cellphone$DataPlan)
# Data Usage#
summary(cellphone$DataUsage)
boxplot(DataUsage, main= "Boxplot of Data Usage Per Month")
dataUsage = cut(cellphone$DataUsage, include.lowest = TRUE, breaks = seq(0, 5.5, by = 0.5))
ggplot(cellphone, aes(dataUsage, ..count.., fill = Churn)) + geom_bar(position="dodge")
# CustServCall#
summary(cellphone$CustServCalls)
boxplot(CustServCalls, main= "Boxplot of Number of calls made to Customer Service")
custServCalls = cut(cellphone$CustServCalls, include.lowest = TRUE, breaks = seq(0, 9, by = 1))
ggplot(cellphone, aes(custServCalls, ..count.., fill = Churn)) + geom_bar(position="dodge")
# DayMins#
summary(cellphone$DayMins)
boxplot(DayMins, main= "Boxplot of Avg. Day Time Mins")
dayMins = cut(cellphone$DayMins, include.lowest = TRUE, breaks = seq(0, 385, by = 35))
ggplot(cellphone, aes(dayMins, ..count.., fill = Churn)) + geom_bar(position="dodge")
# DayCalls#
summary(cellphone$DayCalls)
boxplot(DayCalls, main= "Boxplot of Avg. Day Time Calls")
dayCalls = cut(cellphone$DayCalls, include.lowest = TRUE, breaks = seq(0, 165, by = 16.5))
ggplot(cellphone, aes(dayCalls, ..count.., fill = Churn)) + geom_bar(position="dodge")
# MonthlyCharge#
summary(cellphone$MonthlyCharge)
boxplot(MonthlyCharge, main= "Boxplot of Avg. Monthly Bills")
monthlyCharge = cut(cellphone$MonthlyCharge, include.lowest = TRUE, breaks = seq(14, 114, by = 10))
ggplot(cellphone, aes(monthlyCharge, ..count.., fill = Churn)) + geom_bar(position="dodge")
# OverageFee#
summary(cellphone$OverageFee)
boxplot(OverageFee, main= "Boxplot of Overage fees")
overageFee = cut(cellphone$OverageFee, include.lowest = TRUE, breaks = seq(0, 19, by = 1.9))
ggplot(cellphone, aes(overageFee, ..count.., fill = Churn)) + geom_bar(position="dodge")
# RoamMins#
summary(cellphone$RoamMins)
boxplot(RoamMins, main= "Boxplot of Calls during roaming")
roamMins = cut(cellphone$RoamMins, include.lowest = TRUE, breaks = seq(0, 20, by = 2))
ggplot(cellphone, aes(roamMins, ..count.., fill = Churn)) + geom_bar(position="dodge")
# Correlation#
library(corrplot)
corrplot(cor(cellphone1))
# Create Training and Testing Datasets (70:30) #
library(caTools)
set.seed(1000)
spl = sample.split(cellphone$Churn, SplitRatio = 0.7)
train = subset(cellphone, spl==TRUE)
test = subset(cellphone, spl==FALSE)
##### Creating Model 1 - Logistic Regression based upon all the given variables #####
# Note: the models are fit on the train split created above
logreg1 = glm(Churn ~ ., data = train, family = binomial)
summary(logreg1)
library(car)
vif(logreg1) # check multicollinearity
## Creating Model 2 - Logistic Regression built after removing MonthlyCharge and DataUsage ##
logreg2 = glm(Churn ~ . - MonthlyCharge - DataUsage, data = train, family = binomial)
summary(logreg2)
vif(logreg2)
## Creating Model 3 - Logistic Regression built after Removing all insignificant variables#
logreg3 = glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls, data = train, family = binomial)
summary(logreg3)
library(car)
vif(logreg3)
library(ROCR)
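# ROC on train data (this block is assumed: perf1 is needed by the KS
# computation below, and it mirrors the test-side block that follows)
predictROC1 = predict(logreg3, newdata = train, type = "response")
pred1 = prediction(predictROC1, train$Churn)
perf1 = performance(pred1, "tpr", "fpr")
plot(perf1, colorize = T)
as.numeric(performance(pred1, "auc")@y.values)
# ROC on test data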
predictROC2 = predict(logreg3, newdata = test, type = "response")
pred2 = prediction(predictROC2, test$Churn)
perf2 = performance(pred2, "tpr", "fpr")
plot(perf2,colorize =T)
as.numeric(performance(pred2, "auc")@y.values)
# KS Test Validation #
library(ineq)
#KS on train#
KSLRTrain = max(attr(perf1, 'y.values')[[1]]-attr(perf1, 'x.values')[[1]])
KSLRTrain
#KS on test#
KSLRTest = max(attr(perf2, 'y.values')[[1]]-attr(perf2, 'x.values')[[1]])
KSLRTest
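# Gini coefficient (assumed computation, using the identity Gini = 2*AUC - 1)
GiniLRTrain = 2 * as.numeric(performance(pred1, "auc")@y.values) - 1
GiniLRTrain
GiniLRTest = 2 * as.numeric(performance(pred2, "auc")@y.values) - 1
GiniLRTest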
str(cellphone1)
summary(cellphone1)
data = cellphone1[,-1]
norm.data = scale(data)
usable.data = cbind(churn = cellphone1[,1], norm.data)
usable.data = as.data.frame(usable.data)
str(usable.data)
library(caTools)
spl = sample.split(usable.data$churn, SplitRatio = 0.7 )
nbtrain = subset(usable.data, spl == T)
nbtest = subset(usable.data, spl == F)
dim(nbtrain)
dim(nbtest)
library(class)
pred3 = knn(nbtrain[-1], nbtest[-1], nbtrain[,1], k = 3) # Value of K can be chosen basis the sqrt of n
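# k = 5 variant (assumed block: the conclusion reports the 5-nearest-neighbour
# model as the most accurate, at about 90%)
pred5 = knn(nbtrain[-1], nbtest[-1], nbtrain[, 1], k = 5)
table(Actual = nbtest[, 1], Predicted = pred5) # confusion matrix on test
mean(pred5 == nbtest[, 1]) # test accuracy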
library(e1071)
nbtrain$churn = as.factor(nbtrain$churn)
nbtest$churn = as.factor(nbtest$churn)
NB1 = naiveBayes(churn ~ ., data = nbtrain)
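# Predictions and confusion matrices on train and test (assumed step, matching
# the accuracies reported in the body of the report)
predNBtrain = predict(NB1, newdata = nbtrain)
predNBtest = predict(NB1, newdata = nbtest)
table(Actual = nbtrain$churn, Predicted = predNBtrain)
table(Actual = nbtest$churn, Predicted = predNBtest)
mean(predNBtrain == nbtrain$churn) # train accuracy
mean(predNBtest == nbtest$churn) # test accuracy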