
Logistic Regression

This document describes using logistic regression for classification. It loads banking data, splits it into training and test sets, builds a logistic regression model, and evaluates the model's performance using metrics such as accuracy, the confusion matrix, and the ROC curve. Key steps include preprocessing the data, training and testing the model, interpreting coefficients, predicting probabilities and classes, and generating classification reports.


logistic-regression

March 24, 2024

[2]: import pandas as pd
     import numpy as np
     from sklearn.linear_model import LogisticRegression  # This is for logistic regression
     from sklearn.model_selection import train_test_split
     from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score  # Metrics for classification
     import seaborn as sns
     import matplotlib.pyplot as plt
     from pandas_profiling import ProfileReport  # pandas profiling for report generation

[31]: #load dataset


data=pd.read_csv("C:/Users/ramaleer/Desktop/Practical 2/Datasets/Bank.CSV")
data.head()

[31]: age duration emp_var_rate cons_price_idx cons_conf_idx euribor3m \


0 44 210 1.4 93.444 -36.1 4.963
1 53 138 -0.1 93.200 -42.0 4.021
2 28 339 -1.7 94.055 -39.8 0.729
3 39 185 -1.8 93.075 -47.1 1.405
4 55 137 -2.9 92.201 -31.4 0.869

nr_employed y
0 5228.1 0
1 5195.8 0
2 4991.6 1
3 5099.1 0
4 5076.2 1

0.0.1 Data Description


• 1 - Age (numeric)
• 2 - Duration: last contact duration, in seconds (numeric)
• 3 - Emp.var.rate: employment variation rate - quarterly indicator (numeric)
• 4 - Cons.price.idx: consumer price index - monthly indicator (numeric)
• 5 - Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
• 6 - Euribor3m: euribor 3 month rate - daily indicator (numeric)
• 7 - Nr.employed: number of employees - quarterly indicator (numeric)

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y), where y is encoded as yes = 1 and no = 0.
[4]: #check if the data set is balanced or not.
data.y.value_counts()

# the class counts show the data set is imbalanced

[4]: 0 36548
1 4640
Name: y, dtype: int64
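
Expressed as proportions (an added sketch, not part of the original notebook), the imbalance is roughly 89% of clients with y = 0 versus 11% with y = 1:

[ ]: # Added sketch: class shares rather than raw counts
     data.y.value_counts(normalize=True)  # approximately 0.89 for class 0 and 0.11 for class 1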

[5]: #get information about data


data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 41188 non-null int64
1 duration 41188 non-null int64
2 emp_var_rate 41188 non-null float64
3 cons_price_idx 41188 non-null float64
4 cons_conf_idx 41188 non-null float64
5 euribor3m 41188 non-null float64
6 nr_employed 41188 non-null float64
7 y 41188 non-null int64
dtypes: float64(5), int64(3)
memory usage: 2.5 MB

[6]: #get the column and row count

print("Columns:",data.shape[1])
print("Rows:",data.shape[0])

Columns: 8
Rows: 41188

[7]: # define x, y
     # Separating the independent variable matrix & the response vector

     x = data.drop(columns = ['y'], axis=1)  # independent variables selected
     y = data.y  # target variable is y

[8]: # Splitting data into training & testing sets (validation set approach)

     x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)  # 20% of the data is held out as the test (validation) set
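
As a quick sanity check (an added sketch, not part of the original run), the split sizes can be inspected; with 41188 rows and test_size=0.2, the test set should contain 8238 rows, which matches the support shown later in the classification report.

[ ]: # Added check: confirm the 80/20 split produced the expected row counts
     print(x_train.shape, x_test.shape)  # expected: (32950, 7) and (8238, 7)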

1 Creating the logistic regression model


[9]: model=LogisticRegression()

2 Training the model with training data


[10]: model.fit(x_train,y_train)

[10]: LogisticRegression()

3 Estimated coefficients for parameters and intercept of the model


[11]: model.coef_

[11]: array([[ 0.00102983,  0.00453544, -0.21667   ,  0.4244128 ,  0.05623385,
              -0.27695426, -0.0078621 ]])

[12]: model.intercept_

[12]: array([0.00389662])
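
The raw coefficient array is hard to read on its own. A small added sketch (not part of the original notebook) pairs each coefficient with its feature name; exponentiating a coefficient gives the odds ratio, i.e. the multiplicative change in the odds of y = 1 for a one-unit increase in that feature.

[ ]: # Added sketch: label the coefficients and convert them to odds ratios
     coef_table = pd.DataFrame({
         "feature": x.columns,
         "coefficient": model.coef_[0],
         "odds_ratio": np.exp(model.coef_[0]),  # e.g. exp(0.4244) ≈ 1.53 for cons_price_idx
     })
     coef_table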

4 Predict the class of the unseen data


[13]: #here we use our test set for validation
y_pred=model.predict(x_test)
y_pred

[13]: array([0, 0, 0, …, 0, 0, 0], dtype=int64)

5 Predicted probabilities for each observation


[14]: #get the probabilities
y_pred_probs=model.predict_proba(x_test)
y_pred_probs #Left side column for 0 and right side column for 1

[14]: array([[0.93744048, 0.06255952],
             [0.67198681, 0.32801319],
             [0.9914955 , 0.0085045 ],
             ...,
             [0.9921623 , 0.0078377 ],
             [0.94365365, 0.05634635],
             [0.99442537, 0.00557463]])

[15]: # the two class probabilities for each observation sum to 1
      # Example, using the first row of y_pred_probs above:
      0.93744048 + 0.06255952
      # probability of outcome = 0 (no) plus probability of outcome = 1 (yes)

[15]: 1.0
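
The same check can be applied to every row at once (an added sketch, not in the original notebook): each row of predict_proba should sum to 1.

[ ]: # Added sketch: every row of predict_proba should sum to 1 (up to floating-point error)
     np.allclose(y_pred_probs.sum(axis=1), 1.0)  # expected: True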

6 Confusion matrix
[16]: # obtain the confusion matrix
      confusion_matrix(y_test, y_pred)

[16]: array([[7157,  168],
             [ 606,  307]], dtype=int64)

[17]: sns.heatmap(confusion_matrix(y_test,y_pred),annot=True,fmt="g")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

7 Values inside the Confusion Matrix
[18]: tn,fp,fn,tp=confusion_matrix(y_test,y_pred).ravel()

[19]: tn  # True Negatives : negatives predicted correctly

[19]: 7157

[20]: fn  # False Negatives : predicted as No (0) but actually y = Yes (1)

[20]: 606

[21]: tp  # True Positives : positives predicted correctly

[21]: 307

[22]: fp  # False Positives : predicted as positive (1) but actually negative (0)

[22]: 168

8 Accuracy & Misclassification Error
[23]: #accuracy values from manual calculation
accuracy=(np.diag(confusion_matrix(y_test,y_pred)).sum())/len(y_test)
accuracy

[23]: 0.9060451565914057

[24]: #accuracy values: from function


accuracy_score(y_test,y_pred)

[24]: 0.9060451565914057

[25]: 1-accuracy_score(y_test,y_pred)

[25]: 0.09395484340859428

[26]: # MCE = misclassification error : proportion of incorrect classifications (FP and FN)

      MCE = 1 - accuracy
      MCE

[26]: 0.09395484340859428
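
Equivalently, since the misclassified cases are exactly the false positives and false negatives, the same error rate can be computed from the confusion-matrix counts obtained earlier (an added sketch):

[ ]: # Added sketch: MCE from the confusion-matrix components
     (fp + fn) / len(y_test)  # (168 + 606) / 8238 ≈ 0.0940, matching 1 - accuracy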

9 Classification report with more metrics


[27]: print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95      7325
           1       0.65      0.34      0.44       913

    accuracy                           0.91      8238
   macro avg       0.78      0.66      0.70      8238
weighted avg       0.89      0.91      0.89      8238
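
For the positive class (y = 1), the report's precision and recall follow directly from the confusion-matrix counts: precision = TP / (TP + FP) and recall = TP / (TP + FN). A short added sketch reproducing those two values:

[ ]: # Added sketch: class-1 precision and recall from tp, fp, fn computed earlier
     precision_1 = tp / (tp + fp)  # 307 / (307 + 168) ≈ 0.65
     recall_1 = tp / (tp + fn)     # 307 / (307 + 606) ≈ 0.34
     print(precision_1, recall_1)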

10 Receiver Operating Characteristic (ROC) Curve


[28]: fpr, tpr, _ = roc_curve(y_test, y_pred_probs[:,1])
plt.plot(fpr,tpr)
plt.title("ROC Curve")
plt.show()

[29]: #Area under Curve
auc = roc_auc_score(y_test, y_pred_probs[:,1])
auc

[29]: 0.9121246014152794

[30]: # Homework : Check if there is a way for you to calculate the best threshold
      # Homework : Experiment with other available classifier models on this data to classify the subscription of a term deposit.
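
One common way to approach the first homework item is Youden's J statistic, which selects the threshold maximizing TPR − FPR along the ROC curve; the specific criterion is an assumption here, and other choices (e.g. maximizing F1) are equally valid. A hedged sketch reusing the objects already defined above:

[ ]: # Added sketch: candidate "best" threshold via Youden's J statistic (TPR - FPR)
     fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs[:, 1])
     best_idx = np.argmax(tpr - fpr)
     best_threshold = thresholds[best_idx]
     print("Best threshold (Youden's J):", best_threshold)

     # Re-classify the test set with the chosen threshold and compare against the default 0.5
     y_pred_tuned = (y_pred_probs[:, 1] >= best_threshold).astype(int)
     print(accuracy_score(y_test, y_pred_tuned))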

10.1 Generate Report using Pandas Profiling


[32]: profile = ProfileReport(data)
profile

<IPython.core.display.HTML object>

[32]:
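
The report can also be saved as a standalone HTML file using ProfileReport.to_file (a brief added note; the file name below is only an example):

[ ]: # Added sketch: write the profiling report to an HTML file for sharing
     profile.to_file("bank_profile_report.html")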

[ ]:

[ ]:
