Building Logistic regression model in python
Building Logistic regression model in python
Import Libraries
Tip: Intall the packages, which are not avaliable by using the below command line trick in
the juypter note by using "!" mark
It is essential that the dependent variable is binary and that the level 1 of the dependent
variable signifies the intended outcome
It is imperative to include only the relevant variables and ensure that the independent
variables are not highly correlated, that is, multicollinearity should be minimized
Additionally, the independent variables should exhibit a linear relationship with the log odds,
and a substantial sample size is required for logistic regression to be effective
The objective of the classification exercise is to anticipate whether the customer will agree
to subscribe (1/0) to a term deposit, as denoted by the variable y
In [7]: df =pd.read_csv("banking.csv",header=0)
df =df.dropna()
print("The shape of the data: ",df.shape)
print("columns:",list(df.columns))
Out[9]: ('job',
0 blue-collar
1 technician
2 management
3 services
4 retired
8 admin.
10 housemaid
19 unemployed
25 entrepreneur
68 self-employed
70 unknown
103 student
Name: job, dtype: object)
In [10]: cat_cols
Out[10]: ['job',
'marital',
'education',
'default',
'housing',
'loan',
'contact',
'month',
'day_of_week',
'poutcome']
The above datfarme gives you the list of values withn each categorical columns. From the
columns, the esucation column of the dataset has fot many categories, hence, we need to
reduce tose categorires.
Out[11]: ('education',
0 basic.4y
1 unknown
2 university.degree
3 high.school
7 basic.9y
23 professional.course
28 basic.6y
3059 illiterate
Name: education, dtype: object)
We could group all the categories with basic* into one group and call them "basic"
Out[13]: 0 Basic
1 unknown
2 university.degree
3 high.school
23 professional.course
3059 illiterate
Name: education, dtype: object
2. Data Exploration
In [14]: df['y'].value_counts()
Out[14]: 0 36548
1 4640
Name: y, dtype: int64
In [15]: sns.countplot(x="y",data=df,palette='pastel')
plt.show()
The distribution of instances across our classes is imbalanced, with a ratio of 89:11 between no-
subscription and subscription cases.
Prior to taking steps to rectify this issue, we must conduct further investigation.
In [17]: df.groupby('y').mean()
Customers who purchased the term deposit have a higher average age in comparison to
those who did not
Additionally, the pdays (i.e., the number of days since the last time the customer was
contacted) is lower for the customers who agreed to the term deposit offer, which is
expected since customers are more likely to remember the previous call and increase the
likelihood of a successful sale
Interestingly, the number of contacts or calls made during the current campaign is lower for
customers who subscribed to the term deposit, which is somewhat unexpected
In [18]: df.groupby('job').mean()
job
self-
39.949331 264.142153 2.660802 976.621393 0.143561 0.094159 93.5599
employed
marital
In [20]: df.groupby('education').mean()
education
3. Visualizations
generate functions for visualizations since we will be using teh same line of script multiple
times for many columns
In [21]: def visuals(df, title,xla):
"""get the bar plots for the given columns"""
pd.crosstab(df.job,df.y).plot(kind='bar')
plt.title(title)
plt.xlabel(xla)
plt.ylabel("Frequency of Purchase")
def visuals_2(df, catcol,title,xla,yla):
table=pd.crosstab(df[catcol],df.y)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True
plt.title(title)
plt.xlabel(xla)
plt.ylabel(yla)
The frequency of purchase depends on job title. Hence, job title can be a good predictor of
the outcome
3.1.2 on marital status
marital status does not appear to be a stroong predictor for the outcome variable, y
In [27]: df.age.hist()
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
most of the customers of the bank are in the age range of 30-40 years old
3.1.6 Poutcome
In [28]: visuals(df, "Purchase Frequency for month","Poutcome")
In [29]: data=df
In [30]: cat_vars=['job','marital','education','default','housing','loan','contact','mon
for var in cat_vars:
cat_list='var'+'_'+var
cat_list = pd.get_dummies(data[var], prefix=var)
data1=data.join(cat_list)
data=data1
cat_vars=['job','marital','education','default','housing','loan','contact','mon
data_vars=data.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]
In [31]: df.columns
In [32]: data_final=data[to_keep]
data_final.columns.values
I will apply the SMOTE algorithm (Synthetic Minority Oversampling Technique) to up-
sample the no-subscription class
Essentially, SMOTE functions by generating artificial samples of the underrepresented
class instead of duplicating existing ones
This is accomplished by selecting one of the k-nearest-neighbors at random and utilizing it
to produce a comparable but randomly modified new instance
Our implementation of SMOTE will take place in Python.
In [35]: X = data_final.loc[:, data_final.columns != 'y']
y = data_final.loc[:, data_final.columns == 'y']
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random
columns = X_train.columns
os_data_X,os_data_y=os.fit_resample(X_train, y_train)
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['y'])
# we can Check the numbers of our data
print("length of oversampled data is ",len(os_data_X))
print("Number of no subscription in oversampled data",len(os_data_y[os_data_y[
print("Number of subscription",len(os_data_y[os_data_y['y']==1]))
print("Proportion of no subscription data in oversampled data is ",len(os_data_
print("Proportion of subscription data in oversampled data is ",len(os_data_y[o
In [42]: data_final_vars=data_final.columns.values.tolist()
y =['y']
X =[i for i in data_final_vars if i not in y]
logreg = LogisticRegression()
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=20)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
...
In [50]: X = os_data_X[os_data_X.columns[rfe.support_].tolist()]
y = os_data_y['y']
In [53]: import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
Current function value: 0.457815
Iterations 7
Results: Logit
=============================================================================
===============================================
Model: Logit
Pseudo R-squared: 0.340
Dependent Variable: y
AIC: 46857.8129
Date: 2023-03-07 19:42
BIC: 47025.8148
No. Observations: 51134
Log-Likelihood: -23410.
Df Model: 18
LL-Null: -35443.
Df Residuals: 51115
LLR p-value: 0.0000
Converged: 1.0000
Scale: 1.0000
No. Iterations: 7.0000
-----------------------------------------------------------------------------
-----------------------------------------------
Coef. Std.Err. z P>|z|
[0.025 0.975]
-----------------------------------------------------------------------------
-----------------------------------------------
marital_divorced 0.2603 0.0589 4.4206 0.0000
0.1449 0.3757
marital_married 0.7914 0.0338 23.4427 0.0000
0.7252 0.8576
marital_single 0.9827 0.0384 25.6005 0.0000
0.9075 1.0579
marital_unknown 0.3845 0.3688 1.0427 0.2971
-0.3383 1.1072
education_Basic -2.1012 nan nan nan
nan nan
education_high.school -1.8820 0.0237 -79.5746 0.0000
-1.9284 -1.8357
education_professional.course -2.1144 0.0398 -53.1173 0.0000
-2.1924 -2.0364
education_university.degree -1.4377 0.0071 -202.0387 0.0000
-1.4517 -1.4238
education_unknown -2.0225 0.0728 -27.7899 0.0000
-2.1651 -1.8798
housing_no -0.0918 0.0606 -1.5153 0.1297
-0.2105 0.0269
housing_unknown 0.9029 10346142767106344.0000 0.0000 1.0000
-20278067202438012.0000 20278067202438012.0000
housing_yes 0.1163 0.0590 1.9724 0.0486
0.0007 0.2319
loan_no 2.6996 0.0025 1095.2871 0.0000
2.6948 2.7045
loan_unknown 0.9029 10346142767106344.0000 0.0000 1.0000
-20278067202438012.0000 20278067202438012.0000
loan_yes 2.0478 0.0494 41.4185 0.0000
1.9509 2.1447
day_of_week_fri -2.9526 0.0200 -147.8848 0.0000
-2.9917 -2.9134
day_of_week_mon -3.1085 0.0207 -150.1916 0.0000
-3.1491 -3.0679
day_of_week_thu -2.7495 0.0066 -418.1946 0.0000
-2.7624 -2.7366
day_of_week_tue -2.8561 0.0154 -185.9759 0.0000
-2.8862 -2.8260
day_of_week_wed -2.7589 0.0133 -208.0982 0.0000
-2.7849 -2.7329
=============================================================================
===============================================
Out[64]: LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust
the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with
nbviewer.org.
Confusion matrix
In [67]: confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=[
sns.heatmap(confusion_matrix, annot=True, fmt='g')
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
#plt.savefig('5samples_down_regulated.png')
plt.show()
Accuracy: 0.8296069356626035
The precision is defined as tp / (tp + fp), where tp is the number of true positives and fp is
the number of false positives. It represents the classifier's ability to avoid labeling negative
samples as positive
The recall is calculated as tp / (tp + fn), where tp is the number of true positives and fn is
the number of false negatives. It measures the classifier's ability to identify all positive
samples
The F-beta score is a weighted harmonic mean of precision and recall, with the optimal
value at 1 and the worst score at 0. The F-beta score assigns more weight to recall than
precision, with the weight determined by the beta factor. A beta value of 1.0 indicates that
recall and precision have equal importance
Finally, the support corresponds to the number of instances of each class in y_test.
Interpretation:
Out of the complete set of test data, 83% of the term deposits that were promoted corresponded
to the deposits that were preferred by the customers
ROC curve
In [74]:
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
#plt.savefig('Log_ROC_5samples_down_regulated_with_features.png')
plt.show()
The ROC (Receiver Operating Characteristic) curve is a widely used tool for evaluating
binary classifiers
It is plotted on a graph where the dotted line represents the ROC curve of a completely
random classifier
A reliable classifier should aim to stay as distant as possible from this line, preferably
towards the top-left corner of the graph
Conclusions
Prior to applying the data to the LR model, several steps were taken,
Including data visualization to identify the most effective predictors and the least effective
predictors
The imbalance in the y variable was resolved using the SMOTE method
Recursive feature elimination was also conducted to select the best features that would
improve the accuracy of the model
Additionally, columns with insignificant P-values (>=0.5) were removed after the
implementation of the LR method.