0% found this document useful (0 votes)

239 views38 pages

Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

This document describes a case study using logistic regression to predict customer churn for a telecom company. It involves importing and merging customer data from multiple sources to create a consolidated dataframe with 21 predictor variables. The goal is to build a model to predict whether a customer will switch providers based on demographics, services used, expenses, and other past information. The data is prepared by converting some binary yes/no variables to 0/1 for modeling.

Uploaded by

sad

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

239 views38 pages

Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

Uploaded by

sad

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

Telecom Churn Case Study

With 21 predictor variables we need to predict whether a particular customer will switch to another telecom
provider or not. In telecom terminology, this is referred to as churning and not churning, respectively.

Problem Statment

You have a telecom firm which has collected data of all its customers. The main types of attributes are:

Demographics (age, gender etc.)

Services availed (internet packs purchased, special offers taken etc.)
Expenses (amount of recharge done per month etc.)

Based on all this past information, you want to build a model which will predict whether a particular customer
will churn, i.e. whether they will switch to a different service provider or not.

So the variable of interest, i.e. the target variable here is ‘Churn’ which will tell us whether or not a particular
customer has churned.

Churn is a binary variable:

1 means that the customer has churned &

0 means the customer has not churned

Step 1: Importing and Merging Data

In [1]:

# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:

# Importing Pandas and NumPy

import pandas as pd, numpy as np

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 1/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [3]:

# Importing all datasets

churn_data = pd.read_csv("churn_data.csv")
churn_data.head()

Out[3]:

customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharg

7590- Month-
0 1 No Yes Electronic check 29
VHVEG to-month

5575-
1 34 Yes One year No Mailed check 56
GNVDE

3668- Month-
2 2 Yes Yes Mailed check 53
QPYBK to-month

7795- Bank transfer

3 45 No One year No 42
CFOCW (automatic)

9237- Month-
4 2 Yes Yes Electronic check 70
HQITU to-month

In [4]:

customer_data = pd.read_csv("customer_data.csv")
customer_data.head()

Out[4]:

customerID gender SeniorCitizen Partner Dependents

0 7590-VHVEG Female 0 Yes No

1 5575-GNVDE Male 0 No No

2 3668-QPYBK Male 0 No No

3 7795-CFOCW Male 0 No No

4 9237-HQITU Female 0 No No

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 2/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [5]:

internet_data = pd.read_csv("internet_data.csv")
internet_data.head()

Out[5]:

customerID MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection T

7590- No phone
0 DSL No Yes No
VHVEG service

5575-
1 No DSL Yes No Yes
GNVDE

3668-
2 No DSL Yes Yes No
QPYBK

7795- No phone
3 DSL Yes No Yes
CFOCW service

9237-
4 No Fiber optic No No No
HQITU

Combining all data files into one consolidated dataframe

In [6]:

# Merging on 'customerID'
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')

In [7]:

# Final dataframe with all predictor variables

telecom = pd.merge(df_1, internet_data, how='inner', on='customerID')

Step 2: Inspecting the Dataframe

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 3/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [8]:

# Let's see the head of our master dataset

telecom.head()

Out[8]:

customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharg

7590- Month-
0 1 No Yes Electronic check 29
VHVEG to-month

5575-
1 34 Yes One year No Mailed check 56
GNVDE

3668- Month-
2 2 Yes Yes Mailed check 53
QPYBK to-month

7795- Bank transfer

3 45 No One year No 42
CFOCW (automatic)

9237- Month-
4 2 Yes Yes Electronic check 70
HQITU to-month

5 rows × 21 columns

In [9]:

# Let's check the dimensions of the dataframe

telecom.shape

Out[9]:

(7043, 21)

In [10]:

# let's look at the statistical aspects of the dataframe

telecom.describe()

Out[10]:

tenure MonthlyCharges SeniorCitizen

count 7043.000000 7043.000000 7043.000000

mean 32.371149 64.761692 0.162147

std 24.559481 30.090047 0.368612

min 0.000000 18.250000 0.000000

25% 9.000000 35.500000 0.000000

50% 29.000000 70.350000 0.000000

75% 55.000000 89.850000 0.000000

max 72.000000 118.750000 1.000000

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 4/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [11]:

# Let's see the type of each column

telecom.info()

Int64Index: 7043 entries, 0 to 7042

Data columns (total 21 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 customerID 7043 non-null object

1 tenure 7043 non-null int64

2 PhoneService 7043 non-null object

3 Contract 7043 non-null object

4 PaperlessBilling 7043 non-null object

5 PaymentMethod 7043 non-null object

6 MonthlyCharges 7043 non-null float64

7 TotalCharges 7043 non-null object

8 Churn 7043 non-null object

9 gender 7043 non-null object

10 SeniorCitizen 7043 non-null int64

11 Partner 7043 non-null object

12 Dependents 7043 non-null object

13 MultipleLines 7043 non-null object

14 InternetService 7043 non-null object

15 OnlineSecurity 7043 non-null object

16 OnlineBackup 7043 non-null object

17 DeviceProtection 7043 non-null object

18 TechSupport 7043 non-null object

19 StreamingTV 7043 non-null object

20 StreamingMovies 7043 non-null object

dtypes: float64(1), int64(2), object(18)

memory usage: 1.2+ MB

Step 3: Data Preparation

Converting some binary variables (Yes/No) to 0/1

In [12]:

# List of variables to map

varlist = ['PhoneService', 'PaperlessBilling', 'Churn', 'Partner', 'Dependents']

# Defining the map function

def binary_map(x):
return x.map({'Yes': 1, "No": 0})

# Applying the function to the housing list

telecom[varlist] = telecom[varlist].apply(binary_map)

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 5/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [13]:

telecom.head()

Out[13]:

customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharg

7590- Month-
0 1 0 1 Electronic check 29
VHVEG to-month

5575-
1 34 1 One year 0 Mailed check 56
GNVDE

3668- Month-
2 2 1 1 Mailed check 53
QPYBK to-month

7795- Bank transfer

3 45 0 One year 0 42
CFOCW (automatic)

9237- Month-
4 2 1 1 Electronic check 70
HQITU to-month

5 rows × 21 columns

For categorical variables with multiple levels, create dummy features (one-hot encoded)

In [14]:

# Creating a dummy variable for some of the categorical variables and dropping the first on
dummy1 = pd.get_dummies(telecom[['Contract', 'PaymentMethod', 'gender', 'InternetService']]

# Adding the results to the master dataframe

telecom = pd.concat([telecom, dummy1], axis=1)

In [15]:

telecom.head()

Out[15]:

customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharg

7590- Month-
0 1 0 1 Electronic check 29
VHVEG to-month

5575-
1 34 1 One year 0 Mailed check 56
GNVDE

3668- Month-
2 2 1 1 Mailed check 53
QPYBK to-month

7795- Bank transfer

3 45 0 One year 0 42
CFOCW (automatic)

9237- Month-
4 2 1 1 Electronic check 70
HQITU to-month

5 rows × 29 columns

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 6/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [16]:

# Creating dummy variables for the remaining categorical variables and dropping the level w

# Creating dummy variables for the variable 'MultipleLines'

ml = pd.get_dummies(telecom['MultipleLines'], prefix='MultipleLines')
# Dropping MultipleLines_No phone service column
ml1 = ml.drop(['MultipleLines_No phone service'], 1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ml1], axis=1)

# Creating dummy variables for the variable 'OnlineSecurity'.

os = pd.get_dummies(telecom['OnlineSecurity'], prefix='OnlineSecurity')
os1 = os.drop(['OnlineSecurity_No internet service'], 1)
# Adding the results to the master dataframe
telecom = pd.concat([telecom,os1], axis=1)

# Creating dummy variables for the variable 'OnlineBackup'.

ob = pd.get_dummies(telecom['OnlineBackup'], prefix='OnlineBackup')
ob1 = ob.drop(['OnlineBackup_No internet service'], 1)
# Adding the results to the master dataframe
telecom = pd.concat([telecom,ob1], axis=1)

# Creating dummy variables for the variable 'DeviceProtection'.

dp = pd.get_dummies(telecom['DeviceProtection'], prefix='DeviceProtection')
dp1 = dp.drop(['DeviceProtection_No internet service'], 1)
# Adding the results to the master dataframe
telecom = pd.concat([telecom,dp1], axis=1)

# Creating dummy variables for the variable 'TechSupport'.

ts = pd.get_dummies(telecom['TechSupport'], prefix='TechSupport')
ts1 = ts.drop(['TechSupport_No internet service'], 1)
# Adding the results to the master dataframe
telecom = pd.concat([telecom,ts1], axis=1)

# Creating dummy variables for the variable 'StreamingTV'.

st =pd.get_dummies(telecom['StreamingTV'], prefix='StreamingTV')
st1 = st.drop(['StreamingTV_No internet service'], 1)
# Adding the results to the master dataframe
telecom = pd.concat([telecom,st1], axis=1)

# Creating dummy variables for the variable 'StreamingMovies'.

sm = pd.get_dummies(telecom['StreamingMovies'], prefix='StreamingMovies')
sm1 = sm.drop(['StreamingMovies_No internet service'], 1)
# Adding the results to the master dataframe
telecom = pd.concat([telecom,sm1], axis=1)

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 7/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [17]:

telecom.head()

Out[17]:

customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharg

7590- Month-
0 1 0 1 Electronic check 29
VHVEG to-month

5575-
1 34 1 One year 0 Mailed check 56
GNVDE

3668- Month-
2 2 1 1 Mailed check 53
QPYBK to-month

7795- Bank transfer

3 45 0 One year 0 42
CFOCW (automatic)

9237- Month-
4 2 1 1 Electronic check 70
HQITU to-month

5 rows × 43 columns

Dropping the repeated variables

In [18]:

# We have created dummies for the below variables, so we can drop them
telecom = telecom.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetServic
'TechSupport', 'StreamingTV', 'StreamingMovies'], 1)

In [19]:

#The varaible was imported as a string we need to convert it to float

telecom['TotalCharges'] = pd.to_numeric(telecom['TotalCharges'], errors='coerce')

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 8/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [20]:

telecom.info()

Int64Index: 7043 entries, 0 to 7042

Data columns (total 32 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 customerID 7043 non-null object

1 tenure 7043 non-null int64

2 PhoneService 7043 non-null int64

3 PaperlessBilling 7043 non-null int64

4 MonthlyCharges 7043 non-null float64

5 TotalCharges 7032 non-null float64

6 Churn 7043 non-null int64

7 SeniorCitizen 7043 non-null int64

8 Partner 7043 non-null int64

9 Dependents 7043 non-null int64

10 Contract_One year 7043 non-null uint8

11 Contract_Two year 7043 non-null uint8

12 PaymentMethod_Credit card (automatic) 7043 non-null uint8

13 PaymentMethod_Electronic check 7043 non-null uint8

14 P tM th d M il d h k 7043 ll i t8
Now you can see that you have all variables as numeric.

Checking for Outliers

In [21]:

# Checking for outliers in the continuous variables

num_telecom = telecom[['tenure','MonthlyCharges','SeniorCitizen','TotalCharges']]

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2BC… 9/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [22]:

# Checking outliers at 25%, 50%, 75%, 90%, 95% and 99%

num_telecom.describe(percentiles=[.25, .5, .75, .90, .95, .99])

Out[22]:

tenure MonthlyCharges SeniorCitizen TotalCharges

count 7043.000000 7043.000000 7043.000000 7032.000000

mean 32.371149 64.761692 0.162147 2283.300441

std 24.559481 30.090047 0.368612 2266.771362

min 0.000000 18.250000 0.000000 18.800000

25% 9.000000 35.500000 0.000000 401.450000

50% 29.000000 70.350000 0.000000 1397.475000

75% 55.000000 89.850000 0.000000 3794.737500

90% 69.000000 102.600000 1.000000 5976.640000

95% 72.000000 107.400000 1.000000 6923.590000

99% 72.000000 114.729000 1.000000 8039.883000

max 72.000000 118.750000 1.000000 8684.800000

From the distribution shown above, you can see that there no outliers in your data. The numbers are gradually
increasing.

Checking for Missing Values and Inputing Them

In [23]:

# Adding up the missing values (column-wise)

telecom.isnull().sum()

Out[23]:

customerID 0

tenure 0

PhoneService 0

PaperlessBilling 0

MonthlyCharges 0

TotalCharges 11

Churn 0

SeniorCitizen 0

Partner 0

Dependents 0

Contract_One year 0

Contract_Two year 0

PaymentMethod_Credit card (automatic) 0

PaymentMethod_Electronic check 0

PaymentMethod_Mailed check 0

gender_Male 0

InternetService_Fiber optic 0

InternetService No 0

It means that 11/7043 = 0.001561834 i.e 0.1%, best is to remove these observations from the analysis

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 10/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [ ]:

In [24]:

# Checking the percentage of missing values

round(100*(telecom.isnull().sum()/len(telecom.index)), 2)

Out[24]:

customerID 0.00

tenure 0.00

PhoneService 0.00

PaperlessBilling 0.00

MonthlyCharges 0.00

TotalCharges 0.16

Churn 0.00

SeniorCitizen 0.00

Partner 0.00

Dependents 0.00

Contract_One year 0.00

Contract_Two year 0.00

PaymentMethod_Credit card (automatic) 0.00

PaymentMethod_Electronic check 0.00

PaymentMethod_Mailed check 0.00

gender_Male 0.00

InternetService_Fiber optic 0.00

InternetService No 0.00

In [25]:

# Removing NaN TotalCharges rows

telecom = telecom[~np.isnan(telecom['TotalCharges'])]

In [26]:

# Checking percentage of missing values after removing the missing values

round(100*(telecom.isnull().sum()/len(telecom.index)), 2)

Out[26]:

customerID 0.0

tenure 0.0

PhoneService 0.0

PaperlessBilling 0.0

MonthlyCharges 0.0

TotalCharges 0.0

Churn 0.0

SeniorCitizen 0.0

Partner 0.0

Dependents 0.0

Contract_One year 0.0

Contract_Two year 0.0

PaymentMethod_Credit card (automatic) 0.0

PaymentMethod_Electronic check 0.0

PaymentMethod_Mailed check 0.0

gender_Male 0.0

InternetService_Fiber optic 0.0

InternetService No 0.0

Now we don't have any missing values

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 11/38
10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook
o e do t a e a y ss g a ues

Step 4: Test-Train Split

In [ ]:

# Section 1
from sklearn.model_selection import train_test_split

# Section 2
X = telecom.drop(['Churn','customerID'], axis=1)

print(X.head())

y = telecom['Churn']

y.head()

# Section 3
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, ra

In [27]:

from sklearn.model_selection import train_test_split

In [28]:

# Putting feature variable to X

X = telecom.drop(['Churn','customerID'], axis=1)

X.head()

Out[28]:

tenure PhoneService PaperlessBilling MonthlyCharges TotalCharges SeniorCitizen Partner

0 1 0 1 29.85 29.85 0 1

1 34 1 0 56.95 1889.50 0 0

2 2 1 1 53.85 108.15 0 0

3 45 0 0 42.30 1840.75 0 0

4 2 1 1 70.70 151.65 0 0

5 rows × 30 columns

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 12/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [29]:

# Putting response variable to y

y = telecom['Churn']

y.head()

Out[29]:

0 0

1 0

2 1

3 0

4 1

Name: Churn, dtype: int64

In [30]:

# Splitting the data into train and test

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, ra

Step 5: Feature Scaling

In [31]:

from sklearn.preprocessing import StandardScaler

In [32]:

scaler = StandardScaler()

X_train[['tenure','MonthlyCharges','TotalCharges']] = scaler.fit_transform(X_train[['tenure

X_train.head()

Out[32]:

tenure PhoneService PaperlessBilling MonthlyCharges TotalCharges SeniorCitizen P

879 0.019693 1 1 -0.338074 -0.276449 0

5790 0.305384 0 1 -0.464443 -0.112702 0

6498 -1.286319 1 1 0.581425 -0.974430 0

880 -0.919003 1 1 1.505913 -0.550676 0

2784 -1.163880 1 1 1.106854 -0.835971 0

5 rows × 30 columns

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 13/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [33]:

### Checking the Churn Rate

churn = (sum(telecom['Churn'])/len(telecom['Churn'].index))*100
churn

Out[33]:

26.578498293515356

We have almost 27% churn rate

Step 6: Looking at Correlations

In [34]:

# Importing matplotlib and seaborn

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [35]:

# Let's see the correlation matrix

plt.figure(figsize = (20,10)) # Size of the figure
sns.heatmap(telecom.corr(),annot = True)
plt.show()

Dropping highly correlated dummy variables

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 14/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [36]:

X_test = X_test.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No','DeviceProte
'StreamingTV_No','StreamingMovies_No'], 1)
X_train = X_train.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No','DevicePro
'StreamingTV_No','StreamingMovies_No'], 1)

Checking the Correlation Matrix

After dropping highly correlated variables now let's check the correlation matrix again.

In [37]:

plt.figure(figsize = (20,10))
sns.heatmap(X_train.corr(),annot = True)
plt.show()

Step 7: Model Building

Let's start by splitting our data into a training set and a test set.

Running Your First Training Model

In [38]:

import statsmodels.api as sm

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 15/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [39]:

# Logistic regression model

logm1 = sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

Out[39]:

Generalized Linear Model Regression Results

Dep. Variable: Churn No. Observations: 4922

Model: GLM Df Residuals: 4898

Model Family: Binomial Df Model: 23

Link Function: logit Scale: 1.0000

Method: IRLS Log-Likelihood: -2004.7

Date: Tue, 07 Dec 2021 Deviance: 4009.4

Time: 15:54:13 Pearson chi2: 6.07e+03

No. Iterations: 7

Covariance Type: nonrobust

coef std err z P>|z| [0.025 0.975]

Step 8: Feature Selection Using RFE

In [40]:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

In [41]:

from sklearn.feature_selection import RFE

rfe = RFE(logreg, 15) # running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)

In [42]:

# Whether the column was in top 15 ?

rfe.support_

Out[42]:

array([ True, False, True, True, True, True, False, False, True,

True, True, False, True, False, True, True, True, False,

False, False, True, True, True])

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 16/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [43]:

#LIsting of top 15 columns

list(zip(X_train.columns, rfe.support_, rfe.ranking_))

Out[43]:

[('tenure', True, 1),

('PhoneService', False, 3),

('PaperlessBilling', True, 1),

('MonthlyCharges', True, 1),

('TotalCharges', True, 1),

('SeniorCitizen', True, 1),

('Partner', False, 7),

('Dependents', False, 6),

('Contract_One year', True, 1),

('Contract_Two year', True, 1),

('PaymentMethod_Credit card (automatic)', True, 1),

('PaymentMethod_Electronic check', False, 4),

('PaymentMethod_Mailed check', True, 1),

('gender_Male', False, 8),

('InternetService_Fiber optic', True, 1),

('InternetService_No', True, 1),

('MultipleLines_Yes', True, 1),

('OnlineSecurity_Yes', False, 2),

('OnlineBackup_Yes', False, 5),

('DeviceProtection_Yes', False, 9),

('TechSupport_Yes', True, 1),

('StreamingTV_Yes', True, 1),

('StreamingMovies_Yes', True, 1)]

In [44]:

col = X_train.columns[rfe.support_]

In [45]:

#columns rejected by RFE

X_train.columns[~rfe.support_]

Out[45]:

Index(['PhoneService', 'Partner', 'Dependents',

'PaymentMethod_Electronic check', 'gender_Male', 'OnlineSecurity_Ye

s',

'OnlineBackup_Yes', 'DeviceProtection_Yes'],

dtype='object')

Assessing the model with StatsModels

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 17/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [46]:

X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()

Out[46]:

Generalized Linear Model Regression Results

Dep. Variable: Churn No. Observations: 4922

Model: GLM Df Residuals: 4906

Model Family: Binomial Df Model: 15

Link Function: logit Scale: 1.0000

Method: IRLS Log-Likelihood: -2011.1

Date: Tue, 07 Dec 2021 Deviance: 4022.2

Time: 15:54:13 Pearson chi2: 6.25e+03

No. Iterations: 7

Covariance Type: nonrobust

coef std err z P>|z| [0.025 0.975]

const -2.2462 0.189 -11.879 0.000 -2.617 -1.876

tenure -1.5596 0.187 -8.334 0.000 -1.926 -1.193

PaperlessBilling 0.3436 0.090 3.832 0.000 0.168 0.519

MonthlyCharges -0.9692 0.199 -4.878 0.000 -1.359 -0.580

TotalCharges 0.7421 0.197 3.764 0.000 0.356 1.128

SeniorCitizen 0.4296 0.100 4.312 0.000 0.234 0.625

Contract_One year -0.6830 0.128 -5.342 0.000 -0.934 -0.432

Contract_Two year -1.2931 0.211 -6.138 0.000 -1.706 -0.880

PaymentMethod_Credit card (automatic) -0.3724 0.113 -3.308 0.001 -0.593 -0.152

PaymentMethod_Mailed check -0.3723 0.111 -3.345 0.001 -0.591 -0.154

InternetService_Fiber optic 1.5865 0.216 7.342 0.000 1.163 2.010

InternetService_No -1.6897 0.216 -7.830 0.000 -2.113 -1.267

MultipleLines_Yes 0.3779 0.104 3.640 0.000 0.174 0.581

TechSupport_Yes -0.2408 0.109 -2.210 0.027 -0.454 -0.027

StreamingTV_Yes 0.5796 0.114 5.102 0.000 0.357 0.802

StreamingMovies_Yes 0.4665 0.111 4.197 0.000 0.249 0.684

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 18/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [47]:

# Getting the predicted values on the train set

y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]

Out[47]:

879 0.192642

5790 0.275624

6498 0.599507

880 0.513571

2784 0.648233

3874 0.414846

5387 0.431184

6623 0.801788

4465 0.228194

5364 0.504575

dtype: float64

In [48]:

y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]

Out[48]:

array([0.19264205, 0.27562384, 0.59950707, 0.51357126, 0.64823272,

0.41484553, 0.43118361, 0.80178789, 0.22819404, 0.50457542])

Creating a dataframe with the actual churn flag and the predicted probabilities

In [49]:

y_train_pred_final = pd.DataFrame({'Churn':y_train.values, 'Churn_Prob':y_train_pred})

y_train_pred_final['CustID'] = y_train.index
y_train_pred_final.head()

Out[49]:

Churn Churn_Prob CustID

0 0 0.192642 879

1 0 0.275624 5790

2 1 0.599507 6498

3 1 0.513571 880

4 1 0.648233 2784

Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 19/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [50]:

y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5

# Let's see the head

y_train_pred_final.head()

Out[50]:

Churn Churn_Prob CustID predicted

0 0 0.192642 879 0

1 0 0.275624 5790 0

2 1 0.599507 6498 1

3 1 0.513571 880 1

4 1 0.648233 2784 1

In [51]:

from sklearn import metrics

In [52]:

# Confusion matrix
confusion = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.predicted
print(confusion)

[[3275 360]

[ 574 713]]

In [53]:

# Predicted not_churn churn

# Actual
# not_churn 3270 365
# churn 579 708

In [54]:

# Let's check the overall accuracy.

print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))

0.8102397399431126

Checking VIFs

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 20/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [55]:

# Check for the VIF values of the feature variables.

from statsmodels.stats.outliers_influence import variance_inflation_factor

In [56]:

# Create a dataframe that will contain the names of all the feature variables and their res
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Out[56]:

Features VIF

2 MonthlyCharges 14.85

3 TotalCharges 10.42

0 tenure 7.38

9 InternetService_Fiber optic 5.61

10 InternetService_No 5.27

6 Contract_Two year 3.14

13 StreamingTV_Yes 2.79

14 StreamingMovies_Yes 2.79

1 PaperlessBilling 2.76

11 MultipleLines_Yes 2.38

12 TechSupport_Yes 1.95

5 Contract_One year 1.85

8 PaymentMethod_Mailed check 1.73

7 PaymentMethod_Credit card (automatic) 1.45

4 SeniorCitizen 1.33

There are a few variables with high VIF. It's best to drop these variables as they aren't helping much with
prediction and unnecessarily making the model complex. The variable 'MonthlyCharges' has the highest VIF. So
let's start by dropping that.

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 21/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [57]:

col = col.drop('MonthlyCharges', 1)
col

Out[57]:

Index(['tenure', 'PaperlessBilling', 'TotalCharges', 'SeniorCitizen',

'Contract_One year', 'Contract_Two year',

'PaymentMethod_Credit card (automatic)', 'PaymentMethod_Mailed chec

k',

'InternetService_Fiber optic', 'InternetService_No',

'MultipleLines_Yes', 'TechSupport_Yes', 'StreamingTV_Yes',

'StreamingMovies_Yes'],

dtype='object')

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 22/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [58]:

# Let's re-run the model using the selected variables

X_train_sm = sm.add_constant(X_train[col])
logm3 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm3.fit()
res.summary()

Out[58]:

Generalized Linear Model Regression Results

Dep. Variable: Churn No. Observations: 4922

Model: GLM Df Residuals: 4907

Model Family: Binomial Df Model: 14

Link Function: logit Scale: 1.0000

Method: IRLS Log-Likelihood: -2023.1

Date: Tue, 07 Dec 2021 Deviance: 4046.2

Time: 15:54:14 Pearson chi2: 5.80e+03

No. Iterations: 7

Covariance Type: nonrobust

coef std err z P>|z| [0.025 0.975]

const -1.5971 0.130 -12.296 0.000 -1.852 -1.343

tenure -1.3286 0.180 -7.401 0.000 -1.681 -0.977

PaperlessBilling 0.3533 0.089 3.958 0.000 0.178 0.528

TotalCharges 0.4347 0.186 2.340 0.019 0.071 0.799

SeniorCitizen 0.4569 0.099 4.601 0.000 0.262 0.652

Contract_One year -0.7289 0.127 -5.729 0.000 -0.978 -0.480

Contract_Two year -1.3277 0.210 -6.322 0.000 -1.739 -0.916

PaymentMethod_Credit card (automatic) -0.3870 0.112 -3.442 0.001 -0.607 -0.167

PaymentMethod_Mailed check -0.3618 0.110 -3.274 0.001 -0.578 -0.145

InternetService_Fiber optic 0.6888 0.109 6.297 0.000 0.474 0.903

InternetService_No -0.9555 0.156 -6.120 0.000 -1.262 -0.649

MultipleLines_Yes 0.1700 0.094 1.814 0.070 -0.014 0.354

TechSupport_Yes -0.4371 0.101 -4.307 0.000 -0.636 -0.238

StreamingTV_Yes 0.2881 0.096 2.996 0.003 0.100 0.477

StreamingMovies_Yes 0.1944 0.096 2.031 0.042 0.007 0.382

In [59]:

y_train_pred = res.predict(X_train_sm).values.reshape(-1)

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 23/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [60]:

y_train_pred[:10]

Out[60]:

array([0.22790197, 0.22864388, 0.67489226, 0.61586836, 0.66226032,

0.41819928, 0.28813321, 0.7951366 , 0.17433167, 0.51908788])

In [61]:

y_train_pred_final['Churn_Prob'] = y_train_pred

In [62]:

# Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0

y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5
y_train_pred_final.head()

Out[62]:

Churn Churn_Prob CustID predicted

0 0 0.227902 879 0

1 0 0.228644 5790 0

2 1 0.674892 6498 1

3 1 0.615868 880 1

4 1 0.662260 2784 1

In [63]:

# Let's check the overall accuracy.

print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))

0.8057700121901666

So overall the accuracy hasn't dropped much.

Let's check the VIFs again

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 24/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [64]:

vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Out[64]:

Features VIF

2 TotalCharges 7.46

0 tenure 6.90

5 Contract_Two year 3.07

8 InternetService_Fiber optic 2.96

13 StreamingMovies_Yes 2.62

12 StreamingTV_Yes 2.59

1 PaperlessBilling 2.55

9 InternetService_No 2.44

10 MultipleLines_Yes 2.27

11 TechSupport_Yes 1.95

4 Contract_One year 1.79

7 PaymentMethod_Mailed check 1.63

6 PaymentMethod_Credit card (automatic) 1.42

3 SeniorCitizen 1.31

In [65]:

# Let's drop TotalCharges since it has a high VIF

col = col.drop('TotalCharges')
col

Out[65]:

Index(['tenure', 'PaperlessBilling', 'SeniorCitizen', 'Contract_One year',

'Contract_Two year', 'PaymentMethod_Credit card (automatic)',

'PaymentMethod_Mailed check', 'InternetService_Fiber optic',

'InternetService_No', 'MultipleLines_Yes', 'TechSupport_Yes',

'StreamingTV_Yes', 'StreamingMovies_Yes'],

dtype='object')

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 25/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [66]:

# Let's re-run the model using the selected variables

X_train_sm = sm.add_constant(X_train[col])
logm4 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm4.fit()
res.summary()

Out[66]:

Generalized Linear Model Regression Results

Dep. Variable: Churn No. Observations: 4922

Model: GLM Df Residuals: 4908

Model Family: Binomial Df Model: 13

Link Function: logit Scale: 1.0000

Method: IRLS Log-Likelihood: -2025.9

Date: Tue, 07 Dec 2021 Deviance: 4051.9

Time: 15:54:14 Pearson chi2: 5.25e+03

No. Iterations: 7

Covariance Type: nonrobust

coef std err z P>|z| [0.025 0.975]

const -1.6577 0.127 -13.094 0.000 -1.906 -1.410

tenure -0.9426 0.065 -14.480 0.000 -1.070 -0.815

PaperlessBilling 0.3455 0.089 3.877 0.000 0.171 0.520

SeniorCitizen 0.4597 0.100 4.613 0.000 0.264 0.655

Contract_One year -0.7218 0.127 -5.702 0.000 -0.970 -0.474

Contract_Two year -1.2987 0.208 -6.237 0.000 -1.707 -0.891

PaymentMethod_Credit card (automatic) -0.3874 0.113 -3.442 0.001 -0.608 -0.167

PaymentMethod_Mailed check -0.3307 0.110 -3.020 0.003 -0.545 -0.116

InternetService_Fiber optic 0.8052 0.097 8.272 0.000 0.614 0.996

InternetService_No -0.9726 0.155 -6.261 0.000 -1.277 -0.668

MultipleLines_Yes 0.2097 0.092 2.279 0.023 0.029 0.390

TechSupport_Yes -0.4046 0.101 -4.019 0.000 -0.602 -0.207

StreamingTV_Yes 0.3390 0.094 3.619 0.000 0.155 0.523

StreamingMovies_Yes 0.2428 0.093 2.598 0.009 0.060 0.426

In [67]:

y_train_pred = res.predict(X_train_sm).values.reshape(-1)

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 26/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [68]:

y_train_pred[:10]

Out[68]:

array([0.24581699, 0.26536078, 0.66940978, 0.63097033, 0.68291606,

0.39952622, 0.27582791, 0.79816753, 0.19878625, 0.52911878])

In [69]:

y_train_pred_final['Churn_Prob'] = y_train_pred

In [70]:

# Creating new column 'predicted' with 1 if Churn_Prob > 0.5 else 0

y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5
y_train_pred_final.head()

Out[70]:

Churn Churn_Prob CustID predicted

0 0 0.245817 879 0

1 0 0.265361 5790 0

2 1 0.669410 6498 1

3 1 0.630970 880 1

4 1 0.682916 2784 1

In [71]:

# Let's check the overall accuracy.

print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))

0.8061763510767981

The accuracy is still practically the same.

Let's now check the VIFs again

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 27/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [72]:

Out[72]:

Features VIF

4 Contract_Two year 2.98

7 InternetService_Fiber optic 2.67

12 StreamingMovies_Yes 2.54

11 StreamingTV_Yes 2.51

1 PaperlessBilling 2.45

9 MultipleLines_Yes 2.24

0 tenure 2.04

8 InternetService_No 2.03

10 TechSupport_Yes 1.92

3 Contract_One year 1.78

6 PaymentMethod_Mailed check 1.63

5 PaymentMethod_Credit card (automatic) 1.42

2 SeniorCitizen 1.31

All variables have a good value of VIF. So we need not drop any more variables and we can proceed with
making predictions using this model only

In [73]:

# Let's take a look at the confusion matrix again

confusion = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.predicted
confusion

Out[73]:

array([[3278, 357],

[ 597, 690]], dtype=int64)

In [74]:

# Actual/Predicted not_churn churn

# not_churn 3269 366
# churn 595 692

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 28/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [75]:

# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted)

Out[75]:

0.8061763510767981

Metrics beyond simply accuracy

In [76]:

TP = confusion[1,1] # true positive

TN = confusion[0,0] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives

In [77]:

# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

Out[77]:

0.5361305361305362

In [78]:

# Let us calculate specificity

TN / float(TN+FP)

Out[78]:

0.9017881705639614

In [79]:

# Calculate false postive rate - predicting churn when customer does not have churned
print(FP/ float(TN+FP))

0.09821182943603851

In [80]:

# positive predictive value

print (TP / float(TP+FP))

0.6590257879656161

In [81]:

# Negative predictive value

print (TN / float(TN+ FN))

0.8459354838709677

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 29/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

Step 9: Plotting the ROC Curve

ROC Curve: Receiver Operating Characteristic Curve

An ROC curve demonstrates several things:

It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by
a decrease in specificity).
The closer the curve follows the left-hand border and then the top border of the ROC space, the more
accurate the test.
The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

In [82]:

def draw_roc( actual, probs ):

fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
drop_intermediate = False )
auc_score = metrics.roc_auc_score( actual, probs )
plt.figure(figsize=(5, 5))
plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

return None

In [83]:

fpr, tpr, thresholds = metrics.roc_curve( y_train_pred_final.Churn, y_train_pred_final.Chur

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 30/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [84]:

draw_roc(y_train_pred_final.Churn, y_train_pred_final.Churn_Prob)

Step 10: Finding Optimal Cutoff Point

Optimal cutoff probability is that prob where we get balanced sensitivity and specificity

In [85]:

# Let's create columns with different probability cutoffs

numbers = [float(x)/10 for x in range(10)]
for i in numbers:
y_train_pred_final[i]= y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()

Out[85]:

Churn Churn_Prob CustID predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

0 0 0.245817 879 0 1 1 1 0 0 0 0 0 0 0

1 0 0.265361 5790 0 1 1 1 0 0 0 0 0 0 0

2 1 0.669410 6498 1 1 1 1 1 1 1 1 0 0 0

3 1 0.630970 880 1 1 1 1 1 1 1 1 0 0 0

4 1 0.682916 2784 1 1 1 1 1 1 1 1 0 0 0

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 31/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [86]:

# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix

# TP = confusion[1,1] # true positive

# TN = confusion[0,0] # true negatives
# FP = confusion[0,1] # false positives
# FN = confusion[1,0] # false negatives

num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
cm1 = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final[i] )
total1=sum(sum(cm1))
accuracy = (cm1[0,0]+cm1[1,1])/total1

speci = cm1[0,0]/(cm1[0,0]+cm1[0,1])
sensi = cm1[1,1]/(cm1[1,0]+cm1[1,1])
cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

prob accuracy sensi speci

0.0 0.0 0.261479 1.000000 0.000000

0.1 0.1 0.614994 0.943279 0.498762

0.2 0.2 0.721861 0.846154 0.677854

0.3 0.3 0.770012 0.776224 0.767813

0.4 0.4 0.790532 0.636364 0.845117

0.5 0.5 0.806176 0.536131 0.901788

0.6 0.6 0.798050 0.380730 0.945805

0.7 0.7 0.776310 0.196581 0.981568

0.8 0.8 0.747867 0.041181 0.998074

0.9 0.9 0.738521 0.000000 1.000000

In [87]:

# Let's plot accuracy sensitivity and specificity for various probabilities.

cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
plt.show()

From the curve above, 0.3 is the optimum point to take it as a cutoff probability.

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 32/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [88]:

y_train_pred_final['final_predicted'] = y_train_pred_final.Churn_Prob.map( lambda x: 1 if x

y_train_pred_final.head()

Out[88]:

Churn Churn_Prob CustID predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 final_p

0 0 0.245817 879 0 1 1 1 0 0 0 0 0 0 0

1 0 0.265361 5790 0 1 1 1 0 0 0 0 0 0 0

2 1 0.669410 6498 1 1 1 1 1 1 1 1 0 0 0

3 1 0.630970 880 1 1 1 1 1 1 1 1 0 0 0

4 1 0.682916 2784 1 1 1 1 1 1 1 1 0 0 0

In [89]:

# Let's check the overall accuracy.

metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.final_predicted)

Out[89]:

0.7700121901665989

In [90]:

confusion2 = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.final_pr

confusion2

Out[90]:

array([[2791, 844],

[ 288, 999]], dtype=int64)

In [91]:

TP = confusion2[1,1] # true positive

TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [92]:

# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

Out[92]:

0.7762237762237763

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 33/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [93]:

# Let us calculate specificity

TN / float(TN+FP)

Out[93]:

0.7678129298486933

In [94]:

# Calculate false postive rate - predicting churn when customer does not have churned
print(FP/ float(TN+FP))

0.23218707015130674

In [95]:

# Positive predictive value

print (TP / float(TP+FP))

0.5420510037981552

In [96]:

# Negative predictive value

print (TN / float(TN+ FN))

0.9064631373822669

Step 11: Making predictions on the test set

In [109]:

X_test[['tenure','MonthlyCharges','TotalCharges']] = scaler.transform(X_test[['tenure','Mon

In [110]:

X_test = X_test[col]
X_test.head()

Out[110]:

Contract_One Contract_Two PaymentMethod_C

tenure PaperlessBilling SeniorCitizen
year year card (autom

942 -0.347623 1 0 0 0

3730 0.999203 1 0 0 0

1761 1.040015 1 0 0 1

2283 -1.286319 1 0 0 0

1872 0.346196 0 0 0 1

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 34/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [111]:

X_test_sm = sm.add_constant(X_test)

Making predictions on the test set

In [112]:

y_test_pred = res.predict(X_test_sm)

In [113]:

y_test_pred[:10]

Out[113]:

942 0.419725

3730 0.260232

1761 0.008650

2283 0.592626

1872 0.013989

1970 0.692893

2532 0.285289

1616 0.008994

2485 0.602307

5914 0.145153

dtype: float64

In [114]:

# Converting y_pred to a dataframe which is an array

y_pred_1 = pd.DataFrame(y_test_pred)

In [115]:

# Let's see the head

y_pred_1.head()

Out[115]:

942 0.419725

3730 0.260232

1761 0.008650

2283 0.592626

1872 0.013989

In [116]:

# Converting y_test to dataframe

y_test_df = pd.DataFrame(y_test)

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 35/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [117]:

# Putting CustID to index

y_test_df['CustID'] = y_test_df.index

In [118]:

# Removing index for both dataframes to append them side by side

y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)

In [119]:

# Appending y_test_df and y_pred_1

y_pred_final = pd.concat([y_test_df, y_pred_1],axis=1)

In [120]:

y_pred_final.head()

Out[120]:

Churn CustID 0

0 0 942 0.419725

1 1 3730 0.260232

2 0 1761 0.008650

3 1 2283 0.592626

4 0 1872 0.013989

In [121]:

# Renaming the column

y_pred_final= y_pred_final.rename(columns={ 0 : 'Churn_Prob'})

In [122]:

# Rearranging the columns

y_pred_final = y_pred_final.reindex(['CustID','Churn','Churn_Prob'], axis=1)

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 36/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [123]:

# Let's see the head of y_pred_final

y_pred_final.head()

Out[123]:

CustID Churn Churn_Prob

0 942 0 0.419725

1 3730 1 0.260232

2 1761 0 0.008650

3 2283 1 0.592626

4 1872 0 0.013989

In [131]:

y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.3 else 0

In [132]:

y_pred_final.head()

Out[132]:

CustID Churn Churn_Prob final_predicted

0 942 0 0.419725 1

1 3730 1 0.260232 0

2 1761 0 0.008650 0

3 2283 1 0.592626 1

4 1872 0 0.013989 0

In [133]:

# Let's check the overall accuracy.

metrics.accuracy_score(y_pred_final.Churn, y_pred_final.final_predicted)

Out[133]:

0.7407582938388626

In [134]:

confusion2 = metrics.confusion_matrix(y_pred_final.Churn, y_pred_final.final_predicted )

confusion2

Out[134]:

array([[1144, 384],

[ 163, 419]], dtype=int64)

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 37/38

10/13/22, 10:59 AM Logistic+Regression+-+Telecom+Churn+Case+Study - Jupyter Notebook

In [135]:

TP = confusion2[1,1] # true positive

TN = confusion2[0,0] # true negatives
FP = confusion2[0,1] # false positives
FN = confusion2[1,0] # false negatives

In [136]:

# Let's see the sensitivity of our logistic regression model

TP / float(TP+FN)

Out[136]:

0.7199312714776632

In [130]:

# Let us calculate specificity

TN / float(TN+FP)

Out[130]:

0.8416230366492147

localhost:8888/notebooks/01 Subjects/Subj- DA- Data Analytics/Slides/03 Logistic Regression/Logistic%2BRegression%2B-%2BTelecom%2B… 38/38

QAQC Electrical Inspection: A Beginner's Guide
From Everand
QAQC Electrical Inspection: A Beginner's Guide
RATHEESH VIDYADHARAN
4.5/5 (2)
QuantEconlectures Python3 PDF
100% (1)
QuantEconlectures Python3 PDF
1,125 pages
KPMG Data
50% (2)
KPMG Data
3,723 pages
KPMG - Data Set
100% (1)
KPMG - Data Set
1,685 pages
Blank: CFC Cumulative Forecast Error or Bias Error
100% (1)
Blank: CFC Cumulative Forecast Error or Bias Error
2 pages
A Star Search PDF
100% (1)
A Star Search PDF
6 pages
EFFIE 2002 Case Studies
100% (1)
EFFIE 2002 Case Studies
16 pages
January 1, 1983 1990 5 July 1994 1930 1960
100% (1)
January 1, 1983 1990 5 July 1994 1930 1960
13 pages
Logistic Regression
100% (1)
Logistic Regression
56 pages
Taller Practica Churn
50% (2)
Taller Practica Churn
6 pages
Unit - V
100% (1)
Unit - V
75 pages
Project 5 PDF
100% (1)
Project 5 PDF
48 pages
Homework 2
100% (1)
Homework 2
14 pages
Python Vs R in Data and Machine Learning PDF
100% (1)
Python Vs R in Data and Machine Learning PDF
6 pages
Logistic Regression Model Study Assignment
100% (1)
Logistic Regression Model Study Assignment
5 pages
Case Study 2
100% (1)
Case Study 2
12 pages
KPMG
100% (1)
KPMG
2 pages
Logistic Regression
100% (1)
Logistic Regression
17 pages
Poly
100% (1)
Poly
108 pages
M&A Deal of ABC Inc. and XYZ Inc.: Insert Your Title Here
100% (1)
M&A Deal of ABC Inc. and XYZ Inc.: Insert Your Title Here
25 pages
Preparation and Evaluation of Polyherbal Hair Oil
100% (1)
Preparation and Evaluation of Polyherbal Hair Oil
13 pages
EDA Lecture Module 2
100% (1)
EDA Lecture Module 2
42 pages
CS229 Lecture 3 PDF
100% (1)
CS229 Lecture 3 PDF
35 pages
Tutor
100% (1)
Tutor
309 pages
1.1 Simple Linear Regression Model
100% (1)
1.1 Simple Linear Regression Model
15 pages
Photon Prog Guide
100% (1)
Photon Prog Guide
919 pages
Logistic Regression: Gunjan Bharadwaj Assistant Professor Dept of CEA
100% (1)
Logistic Regression: Gunjan Bharadwaj Assistant Professor Dept of CEA
42 pages
Risk Return Summery
100% (1)
Risk Return Summery
85 pages
Scip y Lectures
100% (1)
Scip y Lectures
329 pages
EDA On Titanic Dataset
100% (1)
EDA On Titanic Dataset
39 pages
CPE412 Pattern Recognition (Week 8)
100% (1)
CPE412 Pattern Recognition (Week 8)
25 pages
Forecasting of Stock Prices Using Multi Layer Perceptron
100% (1)
Forecasting of Stock Prices Using Multi Layer Perceptron
6 pages
Wine Quality Prediction Using ML PPR
100% (1)
Wine Quality Prediction Using ML PPR
8 pages
Name: Reg. No.: Lab Exercise:: Shivam Batra 19BPS1131
100% (1)
Name: Reg. No.: Lab Exercise:: Shivam Batra 19BPS1131
10 pages
Ivy - Data Science and Data Visualization Certification Course
100% (1)
Ivy - Data Science and Data Visualization Certification Course
10 pages
Quest Stat
100% (1)
Quest Stat
2 pages
Course Title: Data Pre-Processing and Visualization
100% (2)
Course Title: Data Pre-Processing and Visualization
11 pages
Logistic Regression Example
100% (1)
Logistic Regression Example
22 pages
8multiple Linear Regression
100% (1)
8multiple Linear Regression
21 pages
Linear Discriminant Analysis - Credit Card Default Analysis
No ratings yet
Linear Discriminant Analysis - Credit Card Default Analysis
7 pages
Homework 2
100% (1)
Homework 2
12 pages
Employee Attrition Miniblogs
100% (1)
Employee Attrition Miniblogs
15 pages
Stats For Managers - Intro
100% (1)
Stats For Managers - Intro
101 pages
Sas Notes Module 4-Categorical Data Analysis Testing Association Between Categorical Variables
100% (1)
Sas Notes Module 4-Categorical Data Analysis Testing Association Between Categorical Variables
16 pages
Using Statistical Techniq Ues in Analyzing Data
100% (1)
Using Statistical Techniq Ues in Analyzing Data
40 pages
Import As
100% (1)
Import As
27 pages
Project LDA
100% (1)
Project LDA
32 pages
Python For You and Me: Release 0.3.alpha1
100% (1)
Python For You and Me: Release 0.3.alpha1
143 pages
Airbnbs in Seattle, Wa: Questions
100% (1)
Airbnbs in Seattle, Wa: Questions
5 pages
Leer Los Datos: Import As Import As Import As From Import From Import
100% (1)
Leer Los Datos: Import As Import As Import As From Import From Import
14 pages
Decision Tree Classification
100% (1)
Decision Tree Classification
11 pages
7. Heteroscedasticity: y = β + β x + · · · + β x + u
100% (1)
7. Heteroscedasticity: y = β + β x + · · · + β x + u
21 pages
Merge +1
No ratings yet
Merge +1
107 pages
Regression Models Course Project
100% (1)
Regression Models Course Project
4 pages
Correlation & Regression
100% (1)
Correlation & Regression
53 pages
TASK 1 Data - Quality - Analysis
No ratings yet
TASK 1 Data - Quality - Analysis
2 pages
Community Medicine Trans - Epidemic Investigation 2
100% (1)
Community Medicine Trans - Epidemic Investigation 2
10 pages
Customer Churn Prediction
No ratings yet
Customer Churn Prediction
16 pages
DS Capestone PDF
No ratings yet
DS Capestone PDF
41 pages
PMI-SP® Exam Focus: Study Guide with Practice Tests
From Everand
PMI-SP® Exam Focus: Study Guide with Practice Tests
SUJAN
No ratings yet
SD Short Note @LazyDiv-Adithya Bandara
No ratings yet
SD Short Note @LazyDiv-Adithya Bandara
50 pages
Basic C MCQ - Final
No ratings yet
Basic C MCQ - Final
51 pages
Single Linked List
No ratings yet
Single Linked List
6 pages
Algorithmic Thinking. Learn Algorithms... Your Coding Skills 2ed 2024 Zingaro D. Download PDF
100% (3)
Algorithmic Thinking. Learn Algorithms... Your Coding Skills 2ed 2024 Zingaro D. Download PDF
62 pages
Unit 3 Basic Computer Organization and Design - 1
No ratings yet
Unit 3 Basic Computer Organization and Design - 1
62 pages
C Program To Implement AVL Tree
No ratings yet
C Program To Implement AVL Tree
7 pages
JM Appendix C
No ratings yet
JM Appendix C
32 pages
Algorithm Unit 2 Notes
No ratings yet
Algorithm Unit 2 Notes
12 pages
Fuzzy Logic Systems
No ratings yet
Fuzzy Logic Systems
118 pages
Binary Numbers Representation
No ratings yet
Binary Numbers Representation
4 pages
Ada Unit-1 To 5 Assignment PP Question It
No ratings yet
Ada Unit-1 To 5 Assignment PP Question It
5 pages
Tensor Flow
No ratings yet
Tensor Flow
12 pages
Traffic Assignment
No ratings yet
Traffic Assignment
5 pages
Dsa 7
No ratings yet
Dsa 7
6 pages
Chap1 Boolean Algebra - Unit 3
No ratings yet
Chap1 Boolean Algebra - Unit 3
25 pages
Java Learning Guide
No ratings yet
Java Learning Guide
3 pages
Cube Analyst 2.0: An Introduction To The Next Generation of Matrix Estimation Software
No ratings yet
Cube Analyst 2.0: An Introduction To The Next Generation of Matrix Estimation Software
12 pages
Lab Manual Dsa
No ratings yet
Lab Manual Dsa
36 pages
戰力計算
No ratings yet
戰力計算
7 pages
The Process of Summation
100% (1)
The Process of Summation
4 pages
Implementation of Binary Tree
No ratings yet
Implementation of Binary Tree
4 pages
CSE357 Workbook
No ratings yet
CSE357 Workbook
298 pages
Python
No ratings yet
Python
47 pages
USCSP 369-Java II Labbook
No ratings yet
USCSP 369-Java II Labbook
56 pages
Macro " B " Programming: Makino Asia Pte LTD Application Department
85% (13)
Macro " B " Programming: Makino Asia Pte LTD Application Department
71 pages
Assignment 3 - Solution
No ratings yet
Assignment 3 - Solution
3 pages
BE VII & VIII SEM Syllabus AY 23-24
No ratings yet
BE VII & VIII SEM Syllabus AY 23-24
51 pages
FinalProject Checkpoint3
No ratings yet
FinalProject Checkpoint3
15 pages
Report Plag
No ratings yet
Report Plag
68 pages
A Survey On Self-Supervised Learning Algorithms Applications and Future Trends
No ratings yet
A Survey On Self-Supervised Learning Algorithms Applications and Future Trends
20 pages