logistic-regression
March 24, 2024
[2]: import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score  # metrics for classification
import seaborn as sns
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport  # report generation (newer releases of this package are published as ydata-profiling)
[31]: #load dataset
data=pd.read_csv("C:/Users/ramaleer/Desktop/Practical 2/Datasets/Bank.CSV")
data.head()
[31]:    age  duration  emp_var_rate  cons_price_idx  cons_conf_idx  euribor3m  nr_employed  y
      0   44       210           1.4          93.444          -36.1      4.963       5228.1  0
      1   53       138          -0.1          93.200          -42.0      4.021       5195.8  0
      2   28       339          -1.7          94.055          -39.8      0.729       4991.6  1
      3   39       185          -1.8          93.075          -47.1      1.405       5099.1  0
      4   55       137          -2.9          92.201          -31.4      0.869       5076.2  1
0.0.1 Data Description
• 1 - Age (numeric)
• 2 - Duration: last contact duration, in seconds (numeric)
• 3 - Emp.var.rate: employment variation rate - quarterly indicator (numeric)
• 4 - Cons.price.idx: consumer price index - monthly indicator (numeric)
• 5 - Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
• 6 - Euribor3m: euribor 3 month rate - daily indicator (numeric)
• 7 - Nr.employed: number of employees - quarterly indicator (numeric)
The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
The variable y is encoded as yes = 1 and no = 0.
[4]: # Check whether the dataset is balanced
data.y.value_counts()
# The counts show that the classes are imbalanced
[4]: 0 36548
1 4640
Name: y, dtype: int64
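To put the imbalance in numbers, the sketch below computes the share of each class from the counts above, together with the weights produced by the common "balanced" heuristic, n_samples / (n_classes * class_count). As far as I know this is the formula behind scikit-learn's `class_weight="balanced"` option; the notebook itself does not use class weights, so treat this as an optional illustration rather than part of the workflow.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 36548, 1: 4640}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class in the full dataset.
shares = {label: n / n_samples for label, n in counts.items()}

# "Balanced" weighting heuristic: n_samples / (n_classes * class_count).
# The rarer class receives a proportionally larger weight.
balanced_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

print("class shares:", shares)
print("balanced weights:", balanced_weights)
```

Roughly 11% of the clients subscribed, so the positive class would be weighted about eight times more heavily than the negative class under this heuristic.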
[5]: #get information about data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 41188 non-null int64
1 duration 41188 non-null int64
2 emp_var_rate 41188 non-null float64
3 cons_price_idx 41188 non-null float64
4 cons_conf_idx 41188 non-null float64
5 euribor3m 41188 non-null float64
6 nr_employed 41188 non-null float64
7 y 41188 non-null int64
dtypes: float64(5), int64(3)
memory usage: 2.5 MB
[6]: #get the column and row count
print("Columns:",data.shape[1])
print("Rows:",data.shape[0])
Columns: 8
Rows: 41188
[7]: #define x, y
# Separating the independent data matrix and the response vector
x = data.drop(columns=['y'])  # independent variables
y = data.y  # target variable
[8]: # Splitting data into training & testing sets (Validation set approach)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)  # 20% of the data is held out as the test (validation) set
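As a quick sanity check on the split, the arithmetic below reproduces the expected set sizes. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the training set receives the remainder; the support of 8238 shown later in the classification report is consistent with that assumption.

```python
import math

n_rows = 41188    # rows reported by data.info()
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test-set size is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print("test rows:", n_test)
print("train rows:", n_train)
```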
1 Creating the logistic regression model
[9]: model=LogisticRegression()
2 Training the model with training data
[10]: model.fit(x_train,y_train)
[10]: LogisticRegression()
3 Estimated coefficients for parameters and intercept of the model
[11]: model.coef_
[11]: array([[ 0.00102983, 0.00453544, -0.21667 , 0.4244128 , 0.05623385,
-0.27695426, -0.0078621 ]])
[12]: model.intercept_
[12]: array([0.00389662])
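The coefficients and intercept define a linear score z = intercept + sum(coef * feature), which the logistic (sigmoid) function maps to a probability of y = 1. The sketch below applies the rounded values printed above to the first row shown by data.head(); that row is used only as an illustrative input, and because the coefficients are rounded the result is approximate.

```python
import math

# Rounded coefficients and intercept as printed above;
# the order follows the columns of x.
features = ["age", "duration", "emp_var_rate", "cons_price_idx",
            "cons_conf_idx", "euribor3m", "nr_employed"]
coef = [0.00102983, 0.00453544, -0.21667, 0.4244128, 0.05623385,
        -0.27695426, -0.0078621]
intercept = 0.00389662

# First row shown by data.head(), used purely as an example input.
row = [44, 210, 1.4, 93.444, -36.1, 4.963, 5228.1]

def sigmoid(z):
    """Map a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

z = intercept + sum(w * v for w, v in zip(coef, row))
p_yes = sigmoid(z)

# exp(coefficient) is the multiplicative change in the odds of y = 1
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(w) for name, w in zip(features, coef)}

print("linear score z:", z)
print("P(y = 1):", p_yes)
print("odds ratios:", odds_ratios)
```

A negative coefficient (for example emp_var_rate) gives an odds ratio below 1, meaning higher values of that feature lower the odds of subscription; a positive coefficient (for example cons_price_idx) does the opposite.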
4 Predict the class of the unseen data
[13]: #here we use our test set for validation
y_pred=model.predict(x_test)
y_pred
[13]: array([0, 0, 0, …, 0, 0, 0], dtype=int64)
5 Predicted probabilities for each observation
[14]: #get the probabilities
y_pred_probs=model.predict_proba(x_test)
y_pred_probs #Left side column for 0 and right side column for 1
[14]: array([[0.93744048, 0.06255952],
[0.67198681, 0.32801319],
[0.9914955 , 0.0085045 ],
…,
[0.9921623 , 0.0078377 ],
[0.94365365, 0.05634635],
[0.99442537, 0.00557463]])
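For a binary model, the class returned by predict corresponds to thresholding the probability of class 1 at 0.5. The sketch below applies that rule to the six probability rows displayed above and then shows how a lower, purely hypothetical threshold of 0.3 would change the labels.

```python
# Probability rows copied from the predict_proba output above: [P(y=0), P(y=1)].
probs = [
    [0.93744048, 0.06255952],
    [0.67198681, 0.32801319],
    [0.9914955, 0.0085045],
    [0.9921623, 0.0078377],
    [0.94365365, 0.05634635],
    [0.99442537, 0.00557463],
]

def classify(prob_rows, threshold):
    """Label a row 1 when P(y=1) exceeds the threshold, else 0."""
    return [1 if p_yes > threshold else 0 for _, p_yes in prob_rows]

labels_default = classify(probs, 0.5)  # the default decision rule
labels_lower = classify(probs, 0.3)    # hypothetical, more lenient threshold

print(labels_default)
print(labels_lower)
```

With the default threshold all six rows are labelled 0, matching the displayed predictions; at 0.3 the second row would be labelled 1 instead.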
[15]: # The two class probabilities in each row sum to 1.
# Example (values close to the first row above):
0.93744108 + 0.06255892
# i.e. P(outcome = 0, "no") + P(outcome = 1, "yes")
[15]: 1.0
6 Confusion matrix
[16]: # Obtain the confusion matrix (rows: actual class, columns: predicted class)
confusion_matrix(y_test,y_pred)
[16]: array([[7157, 168],
[ 606, 307]], dtype=int64)
[17]: sns.heatmap(confusion_matrix(y_test,y_pred),annot=True,fmt="g")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
7 Values inside the Confusion Matrix
[18]: tn,fp,fn,tp=confusion_matrix(y_test,y_pred).ravel()
[19]: tn  # True negatives: predicted as no (0) and actually no (0)
[19]: 7157
[20]: fn  # False negatives: predicted as no (0) but actually yes (1)
[20]: 606
[21]: tp  # True positives: predicted as yes (1) and actually yes (1)
[21]: 307
[22]: fp  # False positives: predicted as yes (1) but actually no (0)
[22]: 168
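The four counts are enough to derive the usual per-class metrics by hand. The sketch below computes precision, recall, specificity and F1 for the positive class (y = 1) using the standard definitions; the variable names are chosen for illustration.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 7157, 168, 606, 307

precision = tp / (tp + fp)      # of predicted positives, how many are correct
recall = tp / (tp + fn)         # of actual positives, how many are found
specificity = tn / (tn + fp)    # of actual negatives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print("precision:", round(precision, 2))
print("recall:", round(recall, 2))
print("specificity:", round(specificity, 2))
print("f1:", round(f1, 2))
```

The low recall shows that the model misses about two thirds of the clients who actually subscribed, even though most negatives are classified correctly.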
8 Accuracy & Misclassification Error
[23]: #accuracy values from manual calculation
accuracy=(np.diag(confusion_matrix(y_test,y_pred)).sum())/len(y_test)
accuracy
[23]: 0.9060451565914057
[24]: #accuracy values: from function
accuracy_score(y_test,y_pred)
[24]: 0.9060451565914057
[25]: 1-accuracy_score(y_test,y_pred)
[25]: 0.09395484340859428
[26]: # MCE = misclassification error: share of incorrect classifications (FP and FN)
MCE=1-accuracy
MCE
[26]: 0.09395484340859428
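Because the classes are imbalanced, it helps to compare the accuracy with a trivial baseline that always predicts the majority class. The sketch below recomputes accuracy and MCE from the confusion-matrix counts and adds that baseline; the baseline is an illustration added here, not something the notebook computes.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 7157, 168, 606, 307
n_test = tn + fp + fn + tp

accuracy = (tn + tp) / n_test   # correct predictions
mce = (fp + fn) / n_test        # incorrect predictions, equals 1 - accuracy

# Accuracy of always predicting the more frequent class in the test set.
baseline_accuracy = max(tn + fp, fn + tp) / n_test

print("accuracy:", accuracy)
print("MCE:", mce)
print("majority-class baseline:", baseline_accuracy)
```

The model's accuracy of about 0.906 is only modestly above the baseline of about 0.889, which is why the class-specific metrics that follow are more informative here.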
9 Classification report with more metrics
[27]: print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.92      0.98      0.95      7325
           1       0.65      0.34      0.44       913

    accuracy                           0.91      8238
   macro avg       0.78      0.66      0.70      8238
weighted avg       0.89      0.91      0.89      8238
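The two average rows in the report differ only in how the per-class values are combined: the macro average is a plain mean over classes, while the weighted average uses each class's support as its weight. The sketch below reproduces both for precision and recall from the confusion-matrix counts; the helper names are illustrative.

```python
# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 7157, 168, 606, 307

support = {0: tn + fp, 1: fn + tp}
precision = {0: tn / (tn + fn), 1: tp / (tp + fp)}
recall = {0: tn / (tn + fp), 1: tp / (tp + fn)}

def macro_avg(metric):
    """Unweighted mean over classes."""
    return sum(metric.values()) / len(metric)

def weighted_avg(metric):
    """Mean over classes, weighted by the number of true instances per class."""
    total = sum(support.values())
    return sum(metric[c] * support[c] for c in metric) / total

print("macro precision:", round(macro_avg(precision), 2))
print("weighted precision:", round(weighted_avg(precision), 2))
print("macro recall:", round(macro_avg(recall), 2))
print("weighted recall:", round(weighted_avg(recall), 2))
```

The weighted recall equals the overall accuracy, while the macro values are pulled down by the weaker performance on the minority class.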
10 Receiver operating characteristic Curve (ROC Curve)
[28]: fpr, tpr, _ = roc_curve(y_test, y_pred_probs[:,1])
plt.plot(fpr,tpr)
plt.title("ROC Curve")
plt.show()
[29]: #Area under Curve
auc = roc_auc_score(y_test, y_pred_probs[:,1])
auc
[29]: 0.9121246014152794
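One way to read the AUC is as the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one. The sketch below computes it that way on a small set of hypothetical labels and scores (not taken from the notebook), counting ties as half a win.

```python
# Hypothetical toy labels and scores, not taken from the notebook.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.58, 0.60, 0.80, 0.20, 0.55]

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Compare every positive with every negative; a tie counts as half a win.
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc_toy = wins / (len(pos) * len(neg))

print("pairwise AUC:", auc_toy)
```

Under this reading, the value of about 0.91 reported above means the model ranks a subscriber above a non-subscriber in roughly 91% of such pairs.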
[30]: # Homework: Check whether there is a way to calculate the best threshold.
# Homework: Experiment with other available classifier models on this data to classify the subscription of a term deposit.
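As a starting point for the threshold question, one common criterion (among several) is Youden's J statistic, which picks the threshold maximising TPR - FPR. The sketch below sweeps the candidate thresholds on hypothetical toy data; the data, the variable names and the choice of criterion are all assumptions made for illustration.

```python
# Hypothetical toy labels and scores, not taken from the notebook.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.35, 0.40, 0.58, 0.60, 0.80, 0.20, 0.55]

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos

best_threshold, best_j = None, -1.0
for t in sorted(set(scores)):
    # Predict 1 when the score is at least the candidate threshold.
    pred = [1 if s >= t else 0 for s in scores]
    tp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, y_true) if p == 1 and y == 0)
    j = tp / n_pos - fp / n_neg  # Youden's J = TPR - FPR
    if j > best_j:
        best_threshold, best_j = t, j

print("best threshold:", best_threshold)
print("Youden's J:", best_j)
```

On the real data the same sweep could be run over the thresholds returned as the third output of roc_curve, which the notebook currently discards as `_`.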
10.1 Generate Report using Pandas Profiling
[32]: profile = ProfileReport(data)
profile
Summarize dataset: 0%| | 0/5 [00:00<?, ?it/s]
Generate report structure: 0%| | 0/1 [00:00<?, ?it/s]
Render HTML: 0%| | 0/1 [00:00<?, ?it/s]
<IPython.core.display.HTML object>