Logistic Regression - Telecom Churn Case Study
Problem Statement
You have a telecom firm that has collected data on all of its customers. The main types of attributes are customer demographics (customer_data.csv), the services each customer has availed (internet_data.csv), and account information such as tenure, contract type, payment method, monthly charges and churn status (churn_data.csv).
Based on all this past information, you want to build a model that predicts whether a particular customer will churn, i.e. whether they will switch to a different service provider.
The variable of interest, i.e. the target variable, is 'Churn', which tells us whether or not a particular customer has churned.
In [1]:
# Suppressing Warnings
import warnings
warnings.filterwarnings('ignore')
In [2]:
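The contents of this cell were not preserved in the export; a minimal sketch of the imports the rest of the notebook relies on:

# Importing the libraries used throughout the notebook (sketch; original cell not preserved)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns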
In [3]:
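The code here is also missing; by analogy with the next two cells it presumably reads churn_data.csv and previews it:

# Reading the churn data (sketch; original cell not preserved)
churn_data = pd.read_csv("churn_data.csv")
churn_data.head()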
Out[3]:
[churn_data.head(): 5 rows showing customerID (7590-VHVEG, 5575-GNVDE, 3668-QPYBK, ...), tenure, PhoneService, Contract, PaperlessBilling, PaymentMethod and MonthlyCharges; table flattened in the export]
In [4]:
customer_data = pd.read_csv("customer_data.csv")
customer_data.head()
Out[4]:
1 5575-GNVDE Male 0 No No
2 3668-QPYBK Male 0 No No
3 7795-CFOCW Male 0 No No
4 9237-HQITU Female 0 No No
In [5]:
internet_data = pd.read_csv("internet_data.csv")
internet_data.head()
Out[5]:
7590- No phone
0 DSL No Yes No
VHVEG service
5575-
1 No DSL Yes No Yes
GNVDE
3668-
2 No DSL Yes Yes No
QPYBK
7795- No phone
3 DSL Yes No Yes
CFOCW service
9237-
4 No Fiber optic No No No
HQITU
In [6]:
# Merging on 'customerID'
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')
In [7]:
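This cell's code was lost; it presumably merges the result with internet_data to form the combined telecom dataframe used below (a sketch, assuming the same merge key):

# Merging df_1 with internet_data on 'customerID' (sketch; original cell not preserved)
telecom = pd.merge(df_1, internet_data, how='inner', on='customerID')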
In [8]:
Out[8]:
[telecom.head() for the merged dataframe: 5 rows × 21 columns; table flattened in the export]
In [9]:
Out[9]:
(7043, 21)
In [10]:
Out[10]:
In [11]:
<class 'pandas.core.frame.DataFrame'>
In [12]:
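The code in this cell didn't survive the export; judging by the 1/0 values in the next preview, it presumably maps the binary Yes/No columns to 1/0. A sketch (the exact column list is an assumption):

# Mapping binary Yes/No variables to 1/0 (sketch; column list is an assumption)
varlist = ['PhoneService', 'PaperlessBilling', 'Churn', 'Partner', 'Dependents']
telecom[varlist] = telecom[varlist].apply(lambda x: x.map({'Yes': 1, 'No': 0}))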
In [13]:
telecom.head()
Out[13]:
[telecom.head() after the binary mapping: 5 rows × 21 columns, with the Yes/No columns now shown as 1/0; table flattened in the export]
For categorical variables with multiple levels, create dummy features (one-hot encoded)
In [14]:
# Creating dummy variables for some of the categorical variables and dropping the first one
dummy1 = pd.get_dummies(telecom[['Contract', 'PaymentMethod', 'gender', 'InternetService']], drop_first=True)
# Adding the results to the master dataframe
telecom = pd.concat([telecom, dummy1], axis=1)
In [15]:
telecom.head()
Out[15]:
[telecom.head() after adding the first set of dummies: 5 rows × 29 columns; table flattened in the export]
In [16]:
# Creating dummy variables for the remaining categorical variables and dropping the 'No phone service' / 'No internet service' level
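The code of this cell was not preserved; a sketch of the pattern for one such variable, MultipleLines (the original cell handled all the remaining service variables the same way):

# Example for one of the remaining variables (sketch; the original cell covered all of them)
ml = pd.get_dummies(telecom['MultipleLines'], prefix='MultipleLines')
ml1 = ml.drop(['MultipleLines_No phone service'], axis=1)
telecom = pd.concat([telecom, ml1], axis=1)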
In [17]:
telecom.head()
Out[17]:
[telecom.head() after adding dummies for the remaining service variables: 5 rows × 43 columns; table flattened in the export]
In [18]:
# We have created dummies for the below variables, so we can drop them
telecom = telecom.drop(['Contract','PaymentMethod','gender','MultipleLines','InternetService','OnlineSecurity',
                        'OnlineBackup','DeviceProtection','TechSupport','StreamingTV','StreamingMovies'], axis=1)
In [19]:
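This cell's code was lost; given that TotalCharges appears as numeric with 11 missing values a little further down, it presumably converts that column from text to numbers. A sketch (pd.to_numeric with errors='coerce' is an assumption):

# Converting TotalCharges to numeric; non-numeric entries become NaN (sketch; original cell not preserved)
telecom['TotalCharges'] = pd.to_numeric(telecom['TotalCharges'], errors='coerce')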
In [20]:
telecom.info()
<class 'pandas.core.frame.DataFrame'>
[telecom.info() column listing garbled in the export]
Now you can see that you have all variables as numeric.
In [21]:
In [22]:
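The code behind this check wasn't preserved; a plausible sketch of inspecting the spread of the continuous variables (the percentile choice is an assumption):

# Checking the continuous variables for outliers (sketch; original cell not preserved)
num_telecom = telecom[['tenure', 'MonthlyCharges', 'TotalCharges']]
num_telecom.describe(percentiles=[.25, .50, .75, .90, .95, .99])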
Out[22]:
From the distribution shown above, you can see that there are no outliers in your data: the values increase gradually across the percentiles.
In [23]:
Out[23]:
customerID 0
tenure 0
PhoneService 0
PaperlessBilling 0
MonthlyCharges 0
TotalCharges 11
Churn 0
SeniorCitizen 0
Partner 0
Dependents 0
Contract_One year 0
Contract_Two year 0
PaymentMethod_Electronic check 0
PaymentMethod_Mailed check 0
gender_Male 0
InternetService_Fiber optic 0
InternetService No 0
This means 11/7043 ≈ 0.0016, i.e. roughly 0.16% of observations, so it is best to simply remove these rows from the analysis.
In [ ]:
In [24]:
Out[24]:
customerID 0.00
tenure 0.00
PhoneService 0.00
PaperlessBilling 0.00
MonthlyCharges 0.00
TotalCharges 0.16
Churn 0.00
SeniorCitizen 0.00
Partner 0.00
Dependents 0.00
gender_Male 0.00
InternetService No 0.00
In [25]:
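The removal step itself is missing from the export; a minimal sketch, assuming the rows with missing TotalCharges are simply dropped:

# Removing the 11 rows with missing TotalCharges (sketch; original cell not preserved)
telecom = telecom.dropna(subset=['TotalCharges'])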
In [26]:
Out[26]:
customerID 0.0
tenure 0.0
PhoneService 0.0
PaperlessBilling 0.0
MonthlyCharges 0.0
TotalCharges 0.0
Churn 0.0
SeniorCitizen 0.0
Partner 0.0
Dependents 0.0
gender_Male 0.0
InternetService No 0.0
In [ ]:
# Section 1
from sklearn.model_selection import train_test_split
# Section 2
X = telecom.drop(['Churn','customerID'], axis=1)
print(X.head())
y = telecom['Churn']
y.head()
# Section 3
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)  # seed value not preserved in the export; 100 is an assumption
In [27]:
In [28]:
X.head()
Out[28]:
[X.head(): 5 rows × 30 columns (tenure, PhoneService, PaperlessBilling, MonthlyCharges, TotalCharges, SeniorCitizen, Partner, ...); table flattened in the export]
In [29]:
y.head()
Out[29]:
0 0
1 0
2 1
3 0
4 1
In [30]:
In [31]:
In [32]:
from sklearn.preprocessing import StandardScaler  # import added here; it may have lived in an earlier cell

scaler = StandardScaler()
X_train[['tenure','MonthlyCharges','TotalCharges']] = scaler.fit_transform(X_train[['tenure','MonthlyCharges','TotalCharges']])
X_train.head()
Out[32]:
5 rows × 30 columns
In [33]:
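The code here was lost; the value below is presumably the overall churn rate (in percent). A sketch:

# Overall churn rate in the data (sketch; original cell not preserved)
churn_rate = (sum(telecom['Churn']) / len(telecom['Churn'].index)) * 100
churn_rate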
Out[33]:
26.578498293515356
In [34]:
In [35]:
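The cell contents weren't preserved; given the later remark about checking the correlation matrix "again", this cell presumably plots the first correlation heatmap. A sketch mirroring the later cell:

# Heatmap of correlations among the training features (sketch; original cell not preserved)
plt.figure(figsize=(20, 10))
sns.heatmap(X_train.corr(), annot=True)
plt.show()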
In [36]:
X_test = X_test.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No','DeviceProtection_No',
                      'TechSupport_No','StreamingTV_No','StreamingMovies_No'], axis=1)
X_train = X_train.drop(['MultipleLines_No','OnlineSecurity_No','OnlineBackup_No','DeviceProtection_No',
                        'TechSupport_No','StreamingTV_No','StreamingMovies_No'], axis=1)
After dropping highly correlated variables now let's check the correlation matrix again.
In [37]:
plt.figure(figsize = (20,10))
sns.heatmap(X_train.corr(),annot = True)
plt.show()
In [38]:
import statsmodels.api as sm
In [39]:
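The model-building code was not preserved; a sketch of the first logistic regression fit on all features, mirroring the later cells (the name logm1 is an assumption):

# First logistic regression model on all training features (sketch; original cell not preserved)
X_train_sm = sm.add_constant(X_train)
logm1 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
logm1.fit().summary()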
Out[39]:
[GLM regression summary; only 'No. Iterations: 7' survived the export]
In [40]:
In [41]:
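The feature-selection cells were lost; a sketch of the RFE step that produces the support mask below (keeping 15 features is inferred from the VIF table further down):

# Selecting 15 features with RFE (sketch; original cells not preserved)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)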
In [42]:
Out[42]:
array([ True, False, True, True, True, True, False, False, True,
In [43]:
Out[43]:
In [44]:
col = X_train.columns[rfe.support_]
In [45]:
Out[45]:
[Index of the 15 RFE-selected columns; listing truncated in the export, ending with 'OnlineBackup_Yes', 'DeviceProtection_Yes']
In [46]:
X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train,X_train_sm, family = sm.families.Binomial())
res = logm2.fit()
res.summary()
Out[46]:
[GLM regression summary; only 'No. Iterations: 7' survived the export]
In [47]:
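The prediction code is missing; a sketch of generating the predicted churn probabilities on the training set:

# Predicted probabilities on the training set (sketch; original cell not preserved)
y_train_pred = res.predict(X_train_sm)
y_train_pred[:10]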
Out[47]:
879 0.192642
5790 0.275624
6498 0.599507
880 0.513571
2784 0.648233
3874 0.414846
5387 0.431184
6623 0.801788
4465 0.228194
5364 0.504575
dtype: float64
In [48]:
y_train_pred = y_train_pred.values.reshape(-1)
y_train_pred[:10]
Out[48]:
Creating a dataframe with the actual churn flag and the predicted probabilities
In [49]:
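A sketch of building that dataframe (the original cell was not preserved; the construction is inferred from the output below):

# Dataframe with the actual churn flag and the predicted probabilities (sketch)
y_train_pred_final = pd.DataFrame({'Churn': y_train.values, 'Churn_Prob': y_train_pred})
y_train_pred_final['CustID'] = y_train.index
y_train_pred_final.head()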
Out[49]:
0 0 0.192642 879
1 0 0.275624 5790
2 1 0.599507 6498
3 1 0.513571 880
4 1 0.648233 2784
In [50]:
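This cell adds a hard prediction; a sketch, assuming the conventional 0.5 cutoff (consistent with the values shown below):

# Predicted churn flag at a 0.5 cutoff (sketch; cutoff value is an assumption)
y_train_pred_final['predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_train_pred_final.head()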
Out[50]:
0 0 0.192642 879 0
1 0 0.275624 5790 0
2 1 0.599507 6498 1
3 1 0.513571 880 1
4 1 0.648233 2784 1
In [51]:
In [52]:
from sklearn import metrics  # import added here; it may have lived in an earlier cell

# Confusion matrix
confusion = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.predicted)
print(confusion)
[[3275 360]
[ 574 713]]
In [53]:
In [54]:
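The accuracy computation was lost; a sketch that reproduces the figure printed below:

# Overall accuracy at the 0.5 cutoff (sketch; original cell not preserved)
print(metrics.accuracy_score(y_train_pred_final.Churn, y_train_pred_final.predicted))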
0.8102397399431126
Checking VIFs
In [55]:
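This cell presumably holds the import needed for the VIF computation below; a sketch:

# Import for computing variance inflation factors (sketch; original cell not preserved)
from statsmodels.stats.outliers_influence import variance_inflation_factor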
In [56]:
# Create a dataframe that will contain the names of all the feature variables and their respective VIFs
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
Out[56]:
Features VIF
2 MonthlyCharges 14.85
3 TotalCharges 10.42
0 tenure 7.38
10 InternetService_No 5.27
13 StreamingTV_Yes 2.79
14 StreamingMovies_Yes 2.79
1 PaperlessBilling 2.76
11 MultipleLines_Yes 2.38
12 TechSupport_Yes 1.95
4 SeniorCitizen 1.33
There are a few variables with high VIF. It's best to drop these variables, as they aren't helping much with prediction and are unnecessarily making the model complex. The variable 'MonthlyCharges' has the highest VIF, so let's start by dropping that.
In [57]:
col = col.drop('MonthlyCharges')
col
Out[57]:
[Index of the remaining selected columns after dropping 'MonthlyCharges'; listing truncated in the export]
In [58]:
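The re-fit after dropping 'MonthlyCharges' was not preserved; a sketch mirroring the earlier model cell (the name logm3 is an assumption):

# Re-fitting the model on the reduced column set (sketch; original cell not preserved)
X_train_sm = sm.add_constant(X_train[col])
logm3 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm3.fit()
res.summary()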
Out[58]:
[GLM regression summary; only 'No. Iterations: 7' survived the export]
In [59]:
y_train_pred = res.predict(X_train_sm).values.reshape(-1)
In [60]:
y_train_pred[:10]
Out[60]:
In [61]:
y_train_pred_final['Churn_Prob'] = y_train_pred
In [62]:
Out[62]:
0 0 0.227902 879 0
1 0 0.228644 5790 0
2 1 0.674892 6498 1
3 1 0.615868 880 1
4 1 0.662260 2784 1
In [63]:
0.8057700121901666
In [64]:
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
Out[64]:
Features VIF
2 TotalCharges 7.46
0 tenure 6.90
13 StreamingMovies_Yes 2.62
12 StreamingTV_Yes 2.59
1 PaperlessBilling 2.55
9 InternetService_No 2.44
10 MultipleLines_Yes 2.27
11 TechSupport_Yes 1.95
3 SeniorCitizen 1.31
In [65]:
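This cell's code was lost; comparing the VIF tables before and after, 'TotalCharges' is the variable that disappears, so it was presumably dropped here (followed by another re-fit in the next cell, as before). A sketch:

# Dropping 'TotalCharges', the variable with the highest remaining VIF (sketch; original cell not preserved)
col = col.drop('TotalCharges')
col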
Out[65]:
[Index of the remaining selected columns; listing truncated in the export]
In [66]:
Out[66]:
[GLM regression summary; only 'No. Iterations: 7' survived the export]
In [67]:
y_train_pred = res.predict(X_train_sm).values.reshape(-1)
In [68]:
y_train_pred[:10]
Out[68]:
In [69]:
y_train_pred_final['Churn_Prob'] = y_train_pred
In [70]:
Out[70]:
0 0 0.245817 879 0
1 0 0.265361 5790 0
2 1 0.669410 6498 1
3 1 0.630970 880 1
4 1 0.682916 2784 1
In [71]:
0.8061763510767981
In [72]:
vif = pd.DataFrame()
vif['Features'] = X_train[col].columns
vif['VIF'] = [variance_inflation_factor(X_train[col].values, i) for i in range(X_train[col].shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif
Out[72]:
Features VIF
12 StreamingMovies_Yes 2.54
11 StreamingTV_Yes 2.51
1 PaperlessBilling 2.45
9 MultipleLines_Yes 2.24
0 tenure 2.04
8 InternetService_No 2.03
10 TechSupport_Yes 1.92
2 SeniorCitizen 1.31
All the variables now have acceptable VIF values, so we need not drop any more variables and can proceed to making predictions using this model.
In [73]:
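The code producing the matrix below is missing; a sketch, assuming the same confusion-matrix call as before applied to the final model's predictions:

# Confusion matrix for the final model at the 0.5 cutoff (sketch; original cell not preserved)
confusion = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final.predicted)
confusion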
Out[73]:
array([[3278, 357],
In [74]:
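This cell presumably unpacks the confusion matrix for the metrics that follow (sensitivity TP/(TP+FN), specificity TN/(TN+FP), and so on); the variable names are taken from the false-positive-rate cell below:

# Unpacking the confusion matrix shown above (sketch; 'confusion' assumed to hold that matrix)
TN = confusion[0, 0]   # true negatives: predicted no-churn, actually no-churn
FP = confusion[0, 1]   # false positives: predicted churn, actually no-churn
FN = confusion[1, 0]   # false negatives: predicted no-churn, actually churned
TP = confusion[1, 1]   # true positives: predicted churn, actually churned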
In [75]:
Out[75]:
0.8061763510767981
In [77]:
Out[77]:
0.5361305361305362
In [78]:
Out[78]:
0.9017881705639614
In [79]:
# Calculate false positive rate - predicting churn when the customer has not churned
print(FP/ float(TN+FP))
0.09821182943603851
In [80]:
0.6590257879656161
In [81]:
0.8459354838709677
The ROC curve shows the trade-off between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
In [82]:
return None
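Only the trailing return None of this cell survived the export. Below is a sketch of a typical ROC-plotting helper consistent with the draw_roc call that follows; everything apart from the function name and its two arguments is an assumption:

# ROC-plotting helper (sketch; assumed implementation)
def draw_roc(actual, probs):
    fpr, tpr, thresholds = metrics.roc_curve(actual, probs, drop_intermediate=False)
    auc_score = metrics.roc_auc_score(actual, probs)
    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score)
    plt.plot([0, 1], [0, 1], 'k--')  # reference diagonal
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc='lower right')
    plt.show()
    return None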
In [83]:
In [84]:
draw_roc(y_train_pred_final.Churn, y_train_pred_final.Churn_Prob)
The optimal cutoff probability is the one at which sensitivity and specificity are balanced.
In [85]:
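The code for this cell is missing; a sketch of tagging each observation with a predicted flag for every cutoff from 0.0 to 0.9, which is what the output below shows:

# Creating a column of predicted flags for each probability cutoff (sketch; original cell not preserved)
numbers = [float(x) / 10 for x in range(10)]
for i in numbers:
    y_train_pred_final[i] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > i else 0)
y_train_pred_final.head()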
Out[85]:
Churn Churn_Prob CustID predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
0 0 0.245817 879 0 1 1 1 0 0 0 0 0 0 0
1 0 0.265361 5790 0 1 1 1 0 0 0 0 0 0 0
2 1 0.669410 6498 1 1 1 1 1 1 1 1 0 0 0
3 1 0.630970 880 1 1 1 1 1 1 1 1 0 0 0
4 1 0.682916 2784 1 1 1 1 1 1 1 1 0 0 0
In [86]:
# Now let's calculate accuracy sensitivity and specificity for various probability cutoffs.
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_train_pred_final.Churn, y_train_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0,0] + cm1[1,1]) / total1
    speci = cm1[0,0] / (cm1[0,0] + cm1[0,1])
    sensi = cm1[1,1] / (cm1[1,0] + cm1[1,1])
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
In [87]:
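The plotting cell was not preserved; a sketch of the curve the next remark refers to:

# Plotting accuracy, sensitivity and specificity against the probability cutoffs (sketch; original cell not preserved)
cutoff_df.plot.line(x='prob', y=['accuracy', 'sensi', 'speci'])
plt.show()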
From the curve above, 0.3 appears to be the optimal cutoff probability.
In [88]:
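A sketch of applying the chosen 0.3 cutoff to obtain the final training-set predictions shown below (the original cell was not preserved):

# Final predicted churn flag using the 0.3 cutoff (sketch)
y_train_pred_final['final_predicted'] = y_train_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.3 else 0)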
y_train_pred_final.head()
Out[88]:
Churn Churn_Prob CustID predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 final_predicted
0 0 0.245817 879 0 1 1 1 0 0 0 0 0 0 0
1 0 0.265361 5790 0 1 1 1 0 0 0 0 0 0 0
2 1 0.669410 6498 1 1 1 1 1 1 1 1 0 0 0
3 1 0.630970 880 1 1 1 1 1 1 1 1 0 0 0
4 1 0.682916 2784 1 1 1 1 1 1 1 1 0 0 0
In [89]:
Out[89]:
0.7700121901665989
In [90]:
Out[90]:
array([[2791, 844],
In [91]:
In [92]:
Out[92]:
0.7762237762237763
In [93]:
Out[93]:
0.7678129298486933
In [94]:
# Calculate false positive rate - predicting churn when the customer has not churned
print(FP/ float(TN+FP))
0.23218707015130674
In [95]:
0.5420510037981552
In [96]:
0.9064631373822669
In [109]:
X_test[['tenure','MonthlyCharges','TotalCharges']] = scaler.transform(X_test[['tenure','MonthlyCharges','TotalCharges']])
In [110]:
X_test = X_test[col]
X_test.head()
Out[110]:
[X_test.head() restricted to the final model columns (scaled tenure plus the selected indicator variables); table flattened in the export]
In [111]:
X_test_sm = sm.add_constant(X_test)
In [112]:
y_test_pred = res.predict(X_test_sm)
In [113]:
y_test_pred[:10]
Out[113]:
942 0.419725
3730 0.260232
1761 0.008650
2283 0.592626
1872 0.013989
1970 0.692893
2532 0.285289
1616 0.008994
2485 0.602307
5914 0.145153
dtype: float64
In [114]:
In [115]:
Out[115]:
942 0.419725
3730 0.260232
1761 0.008650
2283 0.592626
1872 0.013989
In [116]:
In [117]:
In [118]:
In [119]:
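The intermediate cells here were lost in the export; a sketch of the steps that plausibly produce the dataframe below (the intermediate variable names are assumptions):

# Assembling actual test labels and predicted probabilities into one dataframe (sketch)
y_pred_1 = pd.DataFrame(y_test_pred)      # predicted probabilities, column named 0
y_test_df = pd.DataFrame(y_test)          # actual churn flags
y_test_df['CustID'] = y_test_df.index     # keep the customer index as a column
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
y_pred_final = pd.concat([y_test_df, y_pred_1], axis=1)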
In [120]:
y_pred_final.head()
Out[120]:
Churn CustID 0
0 0 942 0.419725
1 1 3730 0.260232
2 0 1761 0.008650
3 1 2283 0.592626
4 0 1872 0.013989
In [121]:
In [122]:
In [123]:
Out[123]:
   CustID  Churn  Churn_Prob
0  942     0      0.419725
1  3730    1      0.260232
2  1761    0      0.008650
3  2283    1      0.592626
4  1872    0      0.013989
In [131]:
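The code here wasn't preserved; a sketch of applying the cutoff chosen on the training data to the test probabilities (0.3 is an assumption consistent with the predictions shown below, and the Churn_Prob column name assumes the rename done in the cells above):

# Final predicted churn flag on the test set (sketch; 0.3 cutoff assumed from the training analysis)
y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.3 else 0)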
In [132]:
y_pred_final.head()
Out[132]:
   CustID  Churn  Churn_Prob  final_predicted
0  942     0      0.419725    1
1  3730    1      0.260232    0
2  1761    0      0.008650    0
3  2283    1      0.592626    1
4  1872    0      0.013989    0
In [133]:
Out[133]:
0.7407582938388626
In [134]:
Out[134]:
array([[1144, 384],
In [135]:
In [136]:
Out[136]:
0.7199312714776632
In [130]:
Out[130]:
0.8416230366492147