Customer Churn Data - A Project Based On Logistic Regression
Project Objective
This case study is prepared for a mobile operator so that it can predict which customers are likely to leave
and switch to another provider. The objective is to build the best model, using predictive modeling techniques,
that can identify such customers.
We will perform the steps below and analyze the data using predictive modeling techniques to identify
such customers:
1. EDA
1.1 Explore how the data looks through univariate and bivariate analysis, with plots and charts that
illustrate the relationships between variables.
1.2 Look for outliers and missing values.
1.3 Check multicollinearity and treat it.
1.4 Summarize the insights gained from EDA.
2. Build Various Predictive Models and compare them to get to the best one
2.1 Building Logistic Regression Model and its interpretation.
2.2 Building KNN Model and its interpretation.
2.3 Building Naive Bayes and its interpretation.
2.4 Performing Model Comparison using model performance metrics.
3. Actionable Insights
3.1 Interpretation & Recommendations from the best model
The complete case study is performed on the given dataset (Cellphone.csv) to build a suitable prediction model
using predictive modeling techniques such as Logistic Regression, KNN and Naive Bayes, and finally to measure
model performance with various metrics:
Confusion Matrix (for all models)
AUC – ROC (for all models)
Gini Coefficient (only for Logistic regression)
Kolmogorov Smirnov (KS) Chart (only for Logistic regression)
Predictive modeling is the process of using data mining and probability to forecast outcomes. A model is made up
of several predictors, which are variables that are likely to influence future results. Once data has been collected for
the relevant predictors, a statistical model is formulated. There are different methodologies for building a predictive
model, but in this project we will use Logistic Regression along with KNN and Naïve Bayes classifiers.
Predictive modeling works on a constructive feedback principle: when we build a model, it is crucial to get
feedback from metrics, improve the trained model, and continue until we achieve a desirable accuracy.
Evaluation metrics explain the performance of a model; an important aspect of evaluation metrics is their
capability to discriminate among model results.
Predictive Modeling Process
Customer churn is a subscriber cancelling their mobile or fixed-line service or porting out to another service provider.
Churn is a loss to the service provider, as it is difficult to win back a lost customer, and it is bad marketing if
he or she leaves due to a bad customer or user experience.
There are several reasons why customers switch from their existing network provider:
- Network issues
- Billing issues
- Competitive offers from another Telecom Service Provider
- Financial hardship
- Bad user experience
- High roaming charges
- Overage charge
In this case study we will build a model from the existing data on customers who have already churned: their
behavior before they actually switched to another operator and the reasons behind the switch. This will let us
identify customers who may churn in the future, so that the service provider can take the necessary steps to
retain them.
Data Dictionary
The dataset contains records for 3333 customers, with information about their usage behavior, contract
details and payment details. The data also indicates which customers cancelled their service.
Variables Description
Churn 1 if customer cancelled service, 0 if not
Account Weeks number of weeks customer has had active account
Contract Renewal 1 if customer recently renewed contract, 0 if not
Data Plan 1 if customer has data plan, 0 if not
Data Usage gigabytes of monthly data usage
Cust Serv Calls number of calls into customer service
Day Mins average daytime minutes per month
Day Calls average number of daytime calls
Monthly Charge average monthly bill
Overage Fee largest overage fee in last 12 months
Roam Mins average number of roaming minutes
'data.frame': 3333 obs. of 11 variables:
$ Churn : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ AccountWeeks : int 128 107 137 84 75 118 121 147 117 141 ...
$ ContractRenewal : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 1 ...
$ DataPlan : Factor w/ 2 levels "0","1": 2 2 1 1 1 1 2 1 1 2 ...
$ DataUsage : num 2.7 3.7 0 0 0 0 2.03 0 0.19 3.02 ...
$ CustServCalls : int 1 1 0 2 3 0 3 0 1 0 ...
$ DayMins : num 265 162 243 299 167 ...
$ DayCalls : int 110 123 114 71 113 98 88 79 97 84 ...
$ MonthlyCharge : num 89 82 52 57 41 57 87.3 36 63.9 93.2 ...
$ OverageFee : num 9.87 9.78 6.06 3.1 7.42 ...
$ RoamMins : num 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
Summary of Data
1. Churn
Conclusion: -
1. Churn has 2 values, 0 and 1.
2. Out of 3333 customers, 2850 customers had not churned (value 0) and 483 had churned (value 1).
3. The ratio of 1s to 0s is roughly 1:6, so data balancing is not required.
4. The distribution is No (0) – 85.5% and Yes (1) – 14.5%.
5. The churn rate is therefore 14.5%.
2. Account Weeks
Conclusion: -
1. AccountWeeks is a numeric variable.
2. On average, customers have had an active account for 100+ weeks.
3. Most customers in the dataset have had active accounts for 50 to 150 weeks.
4. Customers who have been with the provider longer appear less likely to leave, but no clear
interpretation can be made from this variable alone.
5. The outliers are expected: customers who wish to stay with the current provider will naturally
accumulate a higher number of active account weeks.
3. Contract Renewal
Conclusion: -
1. Contract Renewal has two values: 1 for a recent contract renewal by the customer and 0 for no
renewal.
2. The distribution is clearly imbalanced. It seems people are keen on renewing their contract with
the company, which can be a plus point for the company.
3. 90% of customers have renewed their contract.
4. It is very clear that customers who have not renewed their contract are more likely to churn.
4. Data Plan
Conclusion: -
1. Data Plan is a binary categorical variable: 1 if the customer has a data plan, 0 if not.
2. Only around 30% of customers have a data plan.
3. Customers are more likely to churn if they do not have a data plan.
4. This is somewhat unexpected and indicates that there may be a problem with the data plan
offering that needs to be checked.
5. Data Usage
Conclusion: -
1. Data Usage is a numeric variable (gigabytes of monthly data usage).
2. Data usage has a very interesting distribution. It is clearly visible that usage peaks in the
0 to 0.5 GB bucket; beyond that it is roughly normally distributed with a high standard deviation.
3. Customers who use less data are more likely to churn.
6. Customer Service Call
Conclusion: -
1. CustServCalls is a numeric variable.
2. Most customers have called customer service three times or fewer.
7. Daily Minutes
Conclusion: -
1. DayMins shows the average number of daytime minutes per month the customer was on calls.
2. The majority of customers use between 100 and 300 minutes, with a mean of about 180 minutes.
3. Average daytime minutes per month is normally distributed with some outliers. The outliers are
expected because average daytime minutes vary from person to person: some customers may use the
phone for business, while the lower end may be students.
8. Day Calls
Conclusion: -
1. The average number of daytime calls is slightly left skewed with some outliers.
2. Most customers make between 50 and 150 daytime calls.
9. Monthly Charges
Conclusion: -
1. The mean monthly charge is around $56.
2. Most customers have a monthly charge between $0 and $100.
3. The average monthly bill is slightly right skewed, with some outliers toward the right end that
may correspond to business users.
10. Overage Fee
Conclusion: -
1. The mean overage fee is about $10.
2. Most customers pay an overage fee in the $6 to $12 bracket.
3. The largest overage fee in the last 12 months is normally distributed with some outliers.
11. Roaming Minutes
Conclusion: -
1. 90% of customers have between 5 and 15 roaming minutes.
2. The average number of roaming minutes is normally distributed with some outliers.
Checking Correlation
Let's look at the correlation between all the variables before we proceed further with creating the Logistic
Regression model, and treat highly correlated variables accordingly to build the regression model.
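A minimal sketch of this check, mirroring the appendix code (cellphone1 is assumed to be the raw numeric copy of the data, kept before the factor conversion):
library(corrplot)
corrplot(cor(cellphone1)) # correlation matrix of all numeric variables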
Correlation Interpretation
- Data Usage and Data Plan are highly correlated.
- Monthly Charge is also highly correlated with Data Usage, Data Plan and Day Mins.
- Churn does not seem to be highly correlated with any of the variables.
- Churn has its highest correlations with Contract Renewal, Customer Service Calls and Day Mins.
Since the correlations alone do not clearly identify the variables on which customer churn can be predicted, we
will perform logistic regression to dig deeper into the given dataset and find the variables that play a
significant role.
Logistic Regression
We will start with Logistic Regression analysis, as it will give us clear insight into which variables are
significant for building the predictive model, so that we can achieve more precision by eliminating irrelevant
variables. Before proceeding, we will split the available customer data into Train and Test datasets and then
perform logistic regression.
Before we split the data into Train and Test, we will convert the categorical variables (Churn, ContractRenewal
and DataPlan) into factor variables.
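A minimal sketch of this preprocessing and the 70:30 split, mirroring the appendix code (the seed value is the one used there):
cellphone$Churn = as.factor(cellphone$Churn)
cellphone$ContractRenewal = as.factor(cellphone$ContractRenewal)
cellphone$DataPlan = as.factor(cellphone$DataPlan)
library(caTools)
set.seed(1000)
spl = sample.split(cellphone$Churn, SplitRatio = 0.7) # stratified on Churn
train = subset(cellphone, spl == TRUE)
test = subset(cellphone, spl == FALSE)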
Building Logistic Regression Model based upon all the given variables
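A sketch of this first model, fit on the training split (the name logreg1 is an assumption; the appendix lists the reduced models that follow):
logreg1 = glm(Churn ~ ., data = train, family = binomial)
summary(logreg1)
library(car)
vif(logreg1) # variance inflation factors, to check multicollinearity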
Interpretation from the logistic model using all the available variables, after checking multicollinearity:
- Multicollinearity has inflated the VIF values of the correlated variables, making this model unreliable.
- We can clearly see that DataUsage and MonthlyCharge contribute most to the inflated values, so let us
build another model after reducing variables based on their VIF values.
We will not use the MonthlyCharge and DataUsage variables, as they inflate the VIF results, and will create a
new model (Model 2, sketched below). AccountWeeks and DayCalls are also insignificant for the model, so we will
then remove them as well and create Model 3 based upon the rest of the variables.
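A sketch of Model 2, dropping the two high-VIF variables (mirroring the appendix, fit on the training split):
logreg2 = glm(Churn ~ . - MonthlyCharge - DataUsage, data = train, family = binomial)
summary(logreg2)
vif(logreg2) # check that the inflation has come down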
Creating Model 3 - Logistic Regression built after Removing all insignificant variables
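A sketch of Model 3, with all the insignificant variables removed (mirroring the appendix, fit on the training split):
logreg3 = glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls,
              data = train, family = binomial)
summary(logreg3)
vif(logreg3) # VIF values should now be within range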
Based on this newly built model we can see that all variables are significant, and we can verify this by
checking the multicollinearity as well.
The VIF values are now within range, all variables are significant, and the results make more sense and are in
line with the results we obtained from the EDA.
1. Confusion Matrix: -
We will start model evaluation on the train and test data by executing the R code below and will see how
accurately we can predict whether a customer will churn or not.
Calculating Confusion Matrix on Train and Test Data: - We predict a classification of 0 or 1 for each row and
then put the actual and predicted values into a table to build a confusion matrix, which shows how accurate our
model is.
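A minimal sketch of this step (the 0.5 probability cutoff and the object names are assumptions):
predTrain = predict(logreg3, newdata = train, type = "response") # churn probabilities
predTest = predict(logreg3, newdata = test, type = "response")
table(Actual = train$Churn, Predicted = predTrain > 0.5) # confusion matrices
table(Actual = test$Churn, Predicted = predTest > 0.5)
mean((predTrain > 0.5) == (train$Churn == "1")) # train accuracy
mean((predTest > 0.5) == (test$Churn == "1")) # test accuracy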
Confusion Matrix Output: -
From the confusion matrix we can clearly see that our model is 86.11% accurate on the train data, and the test
data confirms this with 86.40% accuracy. There is a slight variation, but it is within range, so we can conclude
that our model is a good one.
2. ROC
The ROC curve is the plot of sensitivity against (1 - specificity).
(1 - specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate.
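A sketch using the ROCR package (mirroring the appendix; the train-side names pred1 and perf1 are assumptions, paralleling pred2 and perf2 for test):
library(ROCR)
predROCTrain = predict(logreg3, newdata = train, type = "response")
pred1 = prediction(predROCTrain, train$Churn)
perf1 = performance(pred1, "tpr", "fpr")
plot(perf1, colorize = T)
as.numeric(performance(pred1, "auc")@y.values) # AUC on train
predROCTest = predict(logreg3, newdata = test, type = "response")
pred2 = prediction(predROCTest, test$Churn)
perf2 = performance(pred2, "tpr", "fpr")
plot(perf2, colorize = T)
as.numeric(performance(pred2, "auc")@y.values) # AUC on test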
ROC Output Analysis: -
From the plot we can see that the curve covers a large area, so the model discriminates well on the true
positive side.
On the train data the area under the curve is 82.01% and on the test data it is 81.11%. There is no major
variation between test and train, which shows that our model is stable.
3. K-S chart
K-S measures the degree of separation between the customers likely to churn and those who will continue using
services from the current operator.
By executing the code below on the train and test predictions, we can see the K-S result:
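The K-S statistic is the maximum vertical separation between the TPR and FPR curves; a sketch using the ROC objects built above (object names match the appendix):
KSLRTrain = max(attr(perf1, 'y.values')[[1]] - attr(perf1, 'x.values')[[1]])
KSLRTrain
KSLRTest = max(attr(perf2, 'y.values')[[1]] - attr(perf2, 'x.values')[[1]])
KSLRTest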
4. Gini chart
The Gini coefficient is the ratio of the area between the ROC curve and the diagonal line to the area of the
upper triangle; equivalently, Gini = 2*AUC - 1.
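Since Gini = 2*AUC - 1, it can be computed directly from the AUC values obtained above (a sketch; the object names are assumptions):
giniTrain = 2 * as.numeric(performance(pred1, "auc")@y.values) - 1
giniTrain
giniTest = 2 * as.numeric(performance(pred2, "auc")@y.values) - 1
giniTest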
k-NN Classification
k-NN is a supervised learning algorithm: it uses labeled input data to learn a function that produces an
appropriate output when given new, unlabeled data. So, let's build our classification model by following the
steps below:
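A minimal sketch of the data preparation and fit, mirroring the appendix (cellphone1 is the raw numeric copy of the data; the predictors are standardized so that no single feature dominates the distance calculation):
data = cellphone1[, -1] # drop the Churn column
norm.data = scale(data) # standardize the predictors
usable.data = as.data.frame(cbind(churn = cellphone1[, 1], norm.data))
library(caTools)
spl = sample.split(usable.data$churn, SplitRatio = 0.7)
nbtrain = subset(usable.data, spl == TRUE)
nbtest = subset(usable.data, spl == FALSE)
library(class)
pred3 = knn(nbtrain[-1], nbtest[-1], nbtrain[, 1], k = 3) # value of k can be chosen near sqrt(n)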
1. Confusion Matrix: -
2. ROC
In the test data the true positive rate is 79.64%, which suggests the model is stable.
k-NN Classification with k = 5
1. Confusion Matrix: -
From the confusion matrix we can clearly see that the accuracy of the k = 5 model is 90.20%.
2. ROC
In the test data the true positive rate is 85.09%, which again indicates a stable model.
Naïve Bayes Classification
Naïve Bayes is a probabilistic classifier that applies Bayes' theorem with a naïve assumption of independence
between the features.
1. Confusion Matrix: -
Calculating Confusion Matrix on Train and Test Data: -
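A sketch of this step using the e1071 package (the model fit mirrors the appendix; the prediction object names are assumptions):
library(e1071)
nbtrain$churn = as.factor(nbtrain$churn)
nbtest$churn = as.factor(nbtest$churn)
NB1 = naiveBayes(churn ~ ., data = nbtrain)
predNBtrain = predict(NB1, newdata = nbtrain)
predNBtest = predict(NB1, newdata = nbtest)
table(Actual = nbtrain$churn, Predicted = predNBtrain)
table(Actual = nbtest$churn, Predicted = predNBtest)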
Confusion Matrix Output: -
From the confusion matrix we can clearly see that the model predicts the churn rate with 85.04% accuracy on the
train data and 84.7% accuracy on the test data.
Conclusion
Attributes contributing to customer churn in the company:
1. The Logistic Regression, k-NN and Naïve Bayes models are all able to predict customer churn with around
85% accuracy or better; the k-NN model with 5 nearest neighbours has the highest accuracy, at about 90%.
2. MonthlyCharge, DataUsage, DayCalls and AccountWeeks do not have a significant effect on customer churn,
so these attributes can be dropped when building further models, which may give better accuracy.
3. Customer service calls have the highest correlation with customer churn. The bivariate analysis also shows
that as the number of customer service calls increases, the churn rate increases. This is a very useful
observation: customers who have to call customer service repeatedly become frustrated, which leads to
churn. This is an area the company should focus on to reduce the churn rate.
4. Contract Renewal has a high negative correlation with churn: customers who have not recently renewed
their contract churn at a higher rate. This is a meaningful insight for the company, which should focus on
contract renewal and provide offers that make renewing plans more attractive to customers.
5. DayMins, i.e. average daytime minutes per month, is also a key feature contributing to customer churn.
The company should offer discounts and a variety of plans so that customers stick with the company's plans
and the churn rate is reduced.
Appendix
R Code
# Read the data (file per the project brief; the path is assumed)
cellphone = read.csv("Cellphone.csv")
cellphone1 = cellphone # untouched numeric copy, used later for correlation, k-NN and NB
dim(cellphone)
str(cellphone)
summary(cellphone)
attach(cellphone)
colSums(is.na(cellphone))
cellphone$Churn = as.factor(cellphone$Churn)
cellphone$ContractRenewal = as.factor(cellphone$ContractRenewal)
cellphone$DataPlan = as.factor(cellphone$DataPlan)
library(ggplot2)
# Churn#
table(cellphone$Churn)
prop.table(table(cellphone$Churn))
ggplot(cellphone, aes(Churn)) + geom_bar(fill="aquamarine4")
# Account Weeks#
summary(cellphone$AccountWeeks)
boxplot(AccountWeeks, main= "Boxplot of account age in Weeks")
ActWeeks <- cut(cellphone$AccountWeeks, breaks = seq(0, 250, by = 25))
ggplot(cellphone, aes(ActWeeks, ..count.., fill = Churn)) + geom_bar(position="dodge")
# Contract Renewal#
ggplot(cellphone, aes(ContractRenewal, ..count.., fill = Churn)) + geom_bar(position="dodge")
# Data Plan#
ggplot(cellphone, aes(DataPlan, ..count.., fill = Churn)) + geom_bar(position="dodge")
table(cellphone$Churn, cellphone$DataPlan)
# Data Usage#
summary(cellphone$DataUsage)
boxplot(DataUsage, main= "Boxplot of Data Usage Per Month")
dataUsage = cut(cellphone$DataUsage, include.lowest = TRUE, breaks = seq(0, 5.5, by = 0.5))
ggplot(cellphone, aes(dataUsage, ..count.., fill = Churn)) + geom_bar(position="dodge")
# CustServCall#
summary(cellphone$CustServCalls)
boxplot(CustServCalls, main= "Boxplot of Number of calls made to Customer Service")
custServCalls = cut(cellphone$CustServCalls, include.lowest = TRUE, breaks = seq(0, 9, by = 1))
ggplot(cellphone, aes(custServCalls, ..count.., fill = Churn)) + geom_bar(position="dodge")
# DayMins#
summary(cellphone$DayMins)
boxplot(DayMins, main= "Boxplot of Avg. Day Time Mins")
dayMins = cut(cellphone$DayMins, include.lowest = TRUE, breaks = seq(0, 385, by = 35))
ggplot(cellphone, aes(dayMins, ..count.., fill = Churn)) + geom_bar(position="dodge")
# DayCalls#
summary(cellphone$DayCalls)
boxplot(DayCalls, main= "Boxplot of Avg. Day Time Calls")
dayCalls = cut(cellphone$DayCalls, include.lowest = TRUE, breaks = seq(0, 165, by = 16.5))
ggplot(cellphone, aes(dayCalls, ..count.., fill = Churn)) + geom_bar(position="dodge")
# MonthlyCharge#
summary(cellphone$MonthlyCharge)
boxplot(MonthlyCharge, main= "Boxplot of Avg. Monthly Bills")
monthlyCharge = cut(cellphone$MonthlyCharge, include.lowest = TRUE, breaks = seq(14, 114, by = 10))
ggplot(cellphone, aes(monthlyCharge, ..count.., fill = Churn)) + geom_bar(position="dodge")
# OverageFee#
summary(cellphone$OverageFee)
boxplot(OverageFee, main= "Boxplot of Overage fees")
overageFee = cut(cellphone$OverageFee, include.lowest = TRUE, breaks = seq(0, 19, by = 1.9))
ggplot(cellphone, aes(overageFee, ..count.., fill = Churn)) + geom_bar(position="dodge")
# RoamMins#
summary(cellphone$RoamMins)
boxplot(RoamMins, main= "Boxplot of Calls during roaming")
roamMins = cut(cellphone$RoamMins, include.lowest = TRUE, breaks = seq(0, 20, by = 2))
ggplot(cellphone, aes(roamMins, ..count.., fill = Churn)) + geom_bar(position="dodge")
# Correlation#
library(corrplot)
corrplot(cor(cellphone1))
# Create Training and Testing Datasets (70:30) #
library(caTools)
set.seed(1000)
spl = sample.split(cellphone$Churn, SplitRatio = 0.7)
train = subset(cellphone, spl==TRUE)
test = subset(cellphone, spl==FALSE)
##### Creating Model 1 - Logistic Regression based upon all the given variables #####
# Note: the models are fit on the train split created above
logreg1 = glm(Churn ~ ., data = train, family = binomial)
summary(logreg1)
library(car)
vif(logreg1) # check multicollinearity
## Creating Model 2 - Logistic Regression built after removing MonthlyCharge and DataUsage ##
logreg2 = glm(Churn ~ . - MonthlyCharge - DataUsage, data = train, family = binomial)
summary(logreg2)
vif(logreg2)
## Creating Model 3 - Logistic Regression built after Removing all insignificant variables#
logreg3 = glm(Churn ~ . - MonthlyCharge - DataUsage - AccountWeeks - DayCalls, data = train, family = binomial)
summary(logreg3)
library(car)
vif(logreg3)
library(ROCR)
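# ROC on train data (this block is assumed: perf1 is needed by the KS
# computation below, and it mirrors the test-side block that follows)
predictROC1 = predict(logreg3, newdata = train, type = "response")
pred1 = prediction(predictROC1, train$Churn)
perf1 = performance(pred1, "tpr", "fpr")
plot(perf1, colorize = T)
as.numeric(performance(pred1, "auc")@y.values)
# ROC on test data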
predictROC2 = predict(logreg3, newdata = test, type = "response")
pred2 = prediction(predictROC2, test$Churn)
perf2 = performance(pred2, "tpr", "fpr")
plot(perf2,colorize =T)
as.numeric(performance(pred2, "auc")@y.values)
# KS Test Validation #
library(ineq)
#KS on train#
KSLRTrain = max(attr(perf1, 'y.values')[[1]]-attr(perf1, 'x.values')[[1]])
KSLRTrain
#KS on test#
KSLRTest = max(attr(perf2, 'y.values')[[1]]-attr(perf2, 'x.values')[[1]])
KSLRTest
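# Gini coefficient (assumed computation, using the identity Gini = 2*AUC - 1)
GiniLRTrain = 2 * as.numeric(performance(pred1, "auc")@y.values) - 1
GiniLRTrain
GiniLRTest = 2 * as.numeric(performance(pred2, "auc")@y.values) - 1
GiniLRTest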
str(cellphone1)
summary(cellphone1)
data = cellphone1[,-1]
norm.data = scale(data)
usable.data = cbind(churn = cellphone1[,1], norm.data)
usable.data = as.data.frame(usable.data)
str(usable.data)
library(caTools)
spl = sample.split(usable.data$churn, SplitRatio = 0.7 )
nbtrain = subset(usable.data, spl == T)
nbtest = subset(usable.data, spl == F)
dim(nbtrain)
dim(nbtest)
library(class)
pred3 = knn(nbtrain[-1], nbtest[-1], nbtrain[,1], k = 3) # Value of K can be chosen basis the sqrt of n
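# k = 5 variant (assumed block: the conclusion reports the 5-nearest-neighbour
# model as the most accurate, at about 90%)
pred5 = knn(nbtrain[-1], nbtest[-1], nbtrain[, 1], k = 5)
table(Actual = nbtest[, 1], Predicted = pred5) # confusion matrix on test
mean(pred5 == nbtest[, 1]) # test accuracy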
library(e1071)
nbtrain$churn = as.factor(nbtrain$churn)
nbtest$churn = as.factor(nbtest$churn)
NB1 = naiveBayes(churn ~ ., data = nbtrain)
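# Predictions and confusion matrices on train and test (assumed step, matching
# the accuracies reported in the body of the report)
predNBtrain = predict(NB1, newdata = nbtrain)
predNBtest = predict(NB1, newdata = nbtest)
table(Actual = nbtrain$churn, Predicted = predNBtrain)
table(Actual = nbtest$churn, Predicted = predNBtest)
mean(predNBtrain == nbtrain$churn) # train accuracy
mean(predNBtest == nbtest$churn) # test accuracy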