0% found this document useful (0 votes)

37 views24 pages

Building Logistic Regression Model in Python

The document outlines the process of performing binary logistic regression using a dataset from a Portuguese banking institution, focusing on predicting customer subscription to term deposits. It includes steps for data import, preprocessing, exploration, and visualization, highlighting key variables and their relationships to the target outcome. The analysis reveals imbalances in subscription rates and identifies potential predictors such as job title, education, and previous contact outcomes.

Uploaded by

leandrobalbi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views24 pages

Building Logistic Regression Model in Python

Uploaded by

leandrobalbi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

1.

Import Libraries

Author: Alva Rani James, PhD, Year: 2023

In [75]: import pandas as pd

import numpy as np
import [Link] as plt
[Link]("font", size=14)
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from [Link] import classification_report
from [Link] import roc_auc_score
from [Link] import roc_curve

Tip: Intall the packages, which are not avaliable by using the below command line trick in
the juypter note by using "!" mark

In [4]: #!pip install imblearn

Logistic regression assumptions

To perform binary logistic regression:

It is essential that the dependent variable is binary and that the level 1 of the dependent
variable signifies the intended outcome
It is imperative to include only the relevant variables and ensure that the independent
variables are not highly correlated, that is, multicollinearity should be minimized
Additionally, the independent variables should exhibit a linear relationship with the log odds,
and a substantial sample size is required for logistic regression to be effective

Given these assumptions, we can now examine our dataset

Download the data

In [4]: ! wget [Link]

--2023-03-07 [Link]-- [Link]

-marketing-predictive-engine/master/[Link] ([Link]
[Link]/madmashup/targeted-marketing-predictive-engine/master/[Link])
Resolving [Link] ([Link])... 185.199.10
9.133, [Link], [Link], ...
Connecting to [Link] ([Link])|185.199.1
09.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4882918 (4.7M) [text/plain]
Saving to: ‘[Link]’

100%[======================================>] 4,882,918 13.7MB/s in 0.3s

2023-03-07 [Link] (13.7 MB/s) - ‘[Link]’ saved [4882918/4882918]

2. Export the data

The data originates from the UCI Machine Learning repository and pertains to phone-based
direct marketing campaigns carried out by a Portuguese banking institution.

The objective of the classification exercise is to anticipate whether the customer will agree
to subscribe (1/0) to a term deposit, as denoted by the variable y

In [7]: df =pd.read_csv("[Link]",header=0)
df =[Link]()
print("The shape of the data: ",[Link])
print("columns:",list([Link]))

The shape of the data: (41188, 21)

columns: ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previou
s', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3
m', 'nr_employed', 'y']
Preprocessing or Data cleaning

1.1 Input variables

In [8]: [Link]

Out[8]: age int64

job object
marital object
education object
default object
housing object
loan object
contact object
month object
day_of_week object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
emp_var_rate float64
cons_price_idx float64
cons_conf_idx float64
euribor3m float64
nr_employed float64
y int64
dtype: object
1.2 Check the categorical columns and reduce the
catogires for a better modelling
In [9]: cat_cols= list(df.select_dtypes(include=['object']).columns)
cat_df=[]
for cat_col in cat_cols:
df_cate = cat_col,df[cat_col].drop_duplicates()
cat_df.append(df_cate)
cat_df[0]

Out[9]: ('job',
0 blue-collar
1 technician
2 management
3 services
4 retired
8 admin.
10 housemaid
19 unemployed
25 entrepreneur
68 self-employed
70 unknown
103 student
Name: job, dtype: object)

In [10]: cat_cols

Out[10]: ['job',
'marital',
'education',
'default',
'housing',
'loan',
'contact',
'month',
'day_of_week',
'poutcome']

The above datfarme gives you the list of values withn each categorical columns. From the
columns, the esucation column of the dataset has fot many categories, hence, we need to
reduce tose categorires.

The following categories are their in education column:

In [11]: cat_df[2]

Out[11]: ('education',
0 basic.4y
1 unknown
2 [Link]
3 [Link]
7 basic.9y
23 [Link]
28 basic.6y
3059 illiterate
Name: education, dtype: object)

We could group all the categories with basic* into one group and call them "basic"

In [12]: df['education']=[Link](df['education'] =='basic.9y', 'Basic', df['education'

df['education']=[Link](df['education'] =='basic.6y', 'Basic', df['education'
df['education']=[Link](df['education'] =='basic.4y', 'Basic', df['education'

1.2.1 After grouping

In [13]: df['education'].drop_duplicates()

Out[13]: 0 Basic
1 unknown
2 [Link]
3 [Link]
23 [Link]
3059 illiterate
Name: education, dtype: object

2. Data Exploration
In [14]: df['y'].value_counts()

Out[14]: 0 36548
1 4640
Name: y, dtype: int64
In [15]: [Link](x="y",data=df,palette='pastel')
[Link]()

2.1 Get the percentage of subscriptions

In [16]: count_no_sub = len(df[df['y']==0])
count_sub = len(df[df['y']==1])
pct_of_no_sub = count_no_sub/(count_no_sub+count_sub)
print("percentage of no subscription is", pct_of_no_sub*100)
pct_of_sub = count_sub/(count_no_sub+count_sub)
print("percentage of subscription", pct_of_sub*100)

percentage of no subscription is 88.73458288821988

percentage of subscription 11.265417111780131

The distribution of instances across our classes is imbalanced, with a ratio of 89:11 between no-
subscription and subscription cases.

Prior to taking steps to rectify this issue, we must conduct further investigation.

2.1.1 Mean on neumerical columns

In [17]: [Link]('y').mean()

Out[17]: age duration campaign pdays previous emp_var_rate cons_price_idx cons_c

0 39.911185 220.844807 2.633085 984.113878 0.132374 0.248875 93.603757 -40

1 40.913147 553.191164 2.051724 792.035560 0.492672 -1.233448 93.354386 -39

2.1.1 Observations

Customers who purchased the term deposit have a higher average age in comparison to
those who did not
Additionally, the pdays (i.e., the number of days since the last time the customer was
contacted) is lower for the customers who agreed to the term deposit offer, which is
expected since customers are more likely to remember the previous call and increase the
likelihood of a successful sale
Interestingly, the number of contacts or calls made during the current campaign is lower for
customers who subscribed to the term deposit, which is somewhat unexpected

2.1.2 Mean on categorical columns to get more sense of our data

In [18]: [Link]('job').mean()

Out[18]: age duration campaign pdays previous emp_var_rate cons_price_

job

admin. 38.187296 254.312128 2.623489 954.319229 0.189023 0.015563 93.5340

blue-collar 39.555760 264.542360 2.558461 985.160363 0.122542 0.248995 93.6566

entrepreneur 41.723214 263.267857 2.535714 981.267170 0.138736 0.158723 93.6053

housemaid 45.500000 250.454717 2.639623 960.579245 0.137736 0.433396 93.6765

management 42.362859 257.058140 2.476060 962.647059 0.185021 -0.012688 93.5227

retired 62.027326 273.712209 2.476744 897.936047 0.327326 -0.698314 93.4307

self-
39.949331 264.142153 2.660802 976.621393 0.143561 0.094159 93.5599
employed

services 37.926430 258.398085 2.587805 979.974049 0.154951 0.175359 93.6346

student 25.894857 283.683429 2.104000 840.217143 0.524571 -1.408000 93.3316

technician 38.507638 250.232241 2.577339 964.408127 0.153789 0.274566 93.5614

unemployed 39.733728 249.451677 2.564103 935.316568 0.199211 -0.111736 93.5637

unknown 45.563636 239.675758 2.648485 938.727273 0.154545 0.357879 93.7189

In [19]: [Link]('marital').mean()

Out[19]: age duration campaign pdays previous emp_var_rate cons_price_idx

marital

divorced 44.899393 253.790330 2.61340 968.639853 0.168690 0.163985 93.606563

married 42.307165 257.438623 2.57281 967.247673 0.155608 0.183625 93.597367

single 33.158714 261.524378 2.53380 949.909578 0.211359 -0.167989 93.517300

unknown 40.275000 312.725000 3.18750 937.100000 0.275000 -0.221250 93.471250

In [20]: [Link]('education').mean()

Out[20]: age duration campaign pdays previous emp_var_rate cons_

education

Basic 42.163910 263.043874 2.559498 974.877967 0.141053 0.191329

[Link] 37.998213 260.886810 2.568576 964.358382 0.185917 0.032937

illiterate 48.500000 276.777778 2.277778 943.833333 0.111111 -0.133333

[Link] 40.080107 252.533855 2.586115 960.765974 0.163075 0.173012

[Link] 38.879191 253.223373 2.563527 951.807692 0.192390 -0.028090

unknown 43.481225 262.390526 2.596187 942.830734 0.226459 0.059099

3. Visualizations

3.1.1 on job title

generate functions for visualizations since we will be using teh same line of script multiple
times for many columns
In [21]: def visuals(df, title,xla):
"""get the bar plots for the given columns"""
[Link]([Link],df.y).plot(kind='bar')
[Link](title)
[Link](xla)
[Link]("Frequency of Purchase")
def visuals_2(df, catcol,title,xla,yla):
table=[Link](df[catcol],df.y)
[Link]([Link](1).astype(float), axis=0).plot(kind='bar', stacked=True
[Link](title)
[Link](xla)
[Link](yla)

In [22]: %matplotlib inline

visuals(df, "Purchage Frequency for job title","Job")

The frequency of purchase depends on job title. Hence, job title can be a good predictor of
the outcome
3.1.2 on marital status

In [23]: visuals_2(df, "marital",'Stacked Bar Chart of Marital Status vs Purchase','Mari

marital status does not appear to be a stroong predictor for the outcome variable, y

3.1.3 on eductaion status

In [24]: visuals_2(df, "education",'Stacked Bar Chart of Education vs Purchase','Educati

Education appears to be a good predictor of the outcome variable

3.1.3 Day of the week

In [25]: visuals(df, "Purchase Frequency for Day of Week","Day of Week")

Day of week may appear to be not a good predictor of the outcome

3.1.4 month
In [26]: visuals(df, "Purchase Frequency for month","Month")

Month might be a good predictor

3.1.5 Age

In [27]: [Link]()
[Link]('Histogram of Age')
[Link]('Age')
[Link]('Frequency')

Out[27]: Text(0, 0.5, 'Frequency')

most of the customers of the bank are in the age range of 30-40 years old
3.1.6 Poutcome
In [28]: visuals(df, "Purchase Frequency for month","Poutcome")

Poutcome seems to be a good predictor of the outcome variable.

Convert the categorical variables into dummy

binaries
We are gonna use the following categorical columns

In [29]: data=df

In [30]: cat_vars=['job','marital','education','default','housing','loan','contact','mon
for var in cat_vars:
cat_list='var'+'_'+var
cat_list = pd.get_dummies(data[var], prefix=var)
data1=[Link](cat_list)
data=data1
cat_vars=['job','marital','education','default','housing','loan','contact','mon
data_vars=[Link]()
to_keep=[i for i in data_vars if i not in cat_vars]
In [31]: [Link]

Out[31]: Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',

'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
dtype='object')

Our final data columns will be:

In [32]: data_final=data[to_keep]
data_final.[Link]

Out[32]: array(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp_var_rate',

'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y',
'job_admin.', 'job_blue-collar', 'job_entrepreneur',
'job_housemaid', 'job_management', 'job_retired',
'job_self-employed', 'job_services', 'job_student',
'job_technician', 'job_unemployed', 'job_unknown',
'marital_divorced', 'marital_married', 'marital_single',
'marital_unknown', 'education_Basic', 'education_high.school',
'education_illiterate', 'education_professional.course',
'education_university.degree', 'education_unknown', 'default_no',
'default_unknown', 'default_yes', 'housing_no', 'housing_unknown',
'housing_yes', 'loan_no', 'loan_unknown', 'loan_yes',
'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug',
'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may',
'month_nov', 'month_oct', 'month_sep', 'day_of_week_fri',
'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue',
'day_of_week_wed', 'poutcome_failure', 'poutcome_nonexistent',
'poutcome_success'], dtype=object)

Over-sampling to overcome the sample imbalance

between subscriptions and non-subscriptions

After constructing our training data,

I will apply the SMOTE algorithm (Synthetic Minority Oversampling Technique) to up-
sample the no-subscription class
Essentially, SMOTE functions by generating artificial samples of the underrepresented
class instead of duplicating existing ones
This is accomplished by selecting one of the k-nearest-neighbors at random and utilizing it
to produce a comparable but randomly modified new instance
Our implementation of SMOTE will take place in Python.
In [35]: X = data_final.loc[:, data_final.columns != 'y']
y = data_final.loc[:, data_final.columns == 'y']
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random
columns = X_train.columns
os_data_X,os_data_y=os.fit_resample(X_train, y_train)
os_data_X = [Link](data=os_data_X,columns=columns )
os_data_y= [Link](data=os_data_y,columns=['y'])
# we can Check the numbers of our data
print("length of oversampled data is ",len(os_data_X))
print("Number of no subscription in oversampled data",len(os_data_y[os_data_y[
print("Number of subscription",len(os_data_y[os_data_y['y']==1]))
print("Proportion of no subscription data in oversampled data is ",len(os_data_
print("Proportion of subscription data in oversampled data is ",len(os_data_y[o

length of oversampled data is 51134

Number of no subscription in oversampled data 25567
Number of subscription 25567
Proportion of no subscription data in oversampled data is 0.5
Proportion of subscription data in oversampled data is 0.5

The data after performing SMOTE for balancing

In [52]: [Link](data = os_data_y,x='y',palette='hls')

[Link]()

We now have an ideally balanced dataset!

I applied oversampling to the training data
This was done intentionally to prevent information from the test data being utilized in
creating synthetic instances
By only oversampling the training data, we ensure that no information is transferred from
the test data to the model training process
Construct the model with the best performing feature

How can we choose the best performing feature?

One way to do so is by using Recursive feature elimination method (Readmore

[Link]
([Link]

The Recursive Feature Elimination (RFE) algorithm operates by constructing a model

multiple times and selecting the best or worst performing feature in each iteration
The chosen feature is then removed from the dataset, and the process is repeated with the
remaining features
This recursive process continues until all features in the dataset have been evaluated
The primary objective of RFE is to choose features by iteratively considering smaller and
smaller subsets of features

In [42]: data_final_vars=data_final.[Link]()
y =['y']
X =[i for i in data_final_vars if i not in y]
logreg = LogisticRegression()
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=20)
rfe = [Link](os_data_X, os_data_y.[Link]())
print(rfe.support_)
print(rfe.ranking_)

...

Columns which the RFE metho supports?

In [48]: os_data_X[os_data_X.columns[rfe.support_].tolist()].columns

Out[48]: Index(['marital_divorced', 'marital_married', 'marital_single',

'marital_unknown', 'education_Basic', 'education_high.school',
'education_professional.course', 'education_university.degree',
'education_unknown', 'housing_no', 'housing_unknown', 'housing_yes',
'loan_no', 'loan_unknown', 'loan_yes', 'day_of_week_fri',
'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue',
'day_of_week_wed'],
dtype='object')

In [50]: X = os_data_X[os_data_X.columns[rfe.support_].tolist()]
y = os_data_y['y']
In [53]: import [Link] as sm
logit_model=[Link](y,X)
result=logit_model.fit()
print(result.summary2())
Optimization terminated successfully.
Current function value: 0.457815
Iterations 7
Results: Logit
=============================================================================
===============================================
Model: Logit
Pseudo R-squared: 0.340
Dependent Variable: y
AIC: 46857.8129
Date: 2023-03-07 19:42
BIC: 47025.8148
No. Observations: 51134
Log-Likelihood: -23410.
Df Model: 18
LL-Null: -35443.
Df Residuals: 51115
LLR p-value: 0.0000
Converged: 1.0000
Scale: 1.0000
No. Iterations: 7.0000
-----------------------------------------------------------------------------
-----------------------------------------------
Coef. [Link]. z P>|z|
[0.025 0.975]
-----------------------------------------------------------------------------
-----------------------------------------------
marital_divorced 0.2603 0.0589 4.4206 0.0000
0.1449 0.3757
marital_married 0.7914 0.0338 23.4427 0.0000
0.7252 0.8576
marital_single 0.9827 0.0384 25.6005 0.0000
0.9075 1.0579
marital_unknown 0.3845 0.3688 1.0427 0.2971
-0.3383 1.1072
education_Basic -2.1012 nan nan nan
nan nan
education_high.school -1.8820 0.0237 -79.5746 0.0000
-1.9284 -1.8357
education_professional.course -2.1144 0.0398 -53.1173 0.0000
-2.1924 -2.0364
education_university.degree -1.4377 0.0071 -202.0387 0.0000
-1.4517 -1.4238
education_unknown -2.0225 0.0728 -27.7899 0.0000
-2.1651 -1.8798
housing_no -0.0918 0.0606 -1.5153 0.1297
-0.2105 0.0269
housing_unknown 0.9029 10346142767106344.0000 0.0000 1.0000
-20278067202438012.0000 20278067202438012.0000
housing_yes 0.1163 0.0590 1.9724 0.0486
0.0007 0.2319
loan_no 2.6996 0.0025 1095.2871 0.0000
2.6948 2.7045
loan_unknown 0.9029 10346142767106344.0000 0.0000 1.0000
-20278067202438012.0000 20278067202438012.0000
loan_yes 2.0478 0.0494 41.4185 0.0000
1.9509 2.1447
day_of_week_fri -2.9526 0.0200 -147.8848 0.0000
-2.9917 -2.9134
day_of_week_mon -3.1085 0.0207 -150.1916 0.0000
-3.1491 -3.0679
day_of_week_thu -2.7495 0.0066 -418.1946 0.0000
-2.7624 -2.7366
day_of_week_tue -2.8561 0.0154 -185.9759 0.0000
-2.8862 -2.8260
day_of_week_wed -2.7589 0.0133 -208.0982 0.0000
-2.7849 -2.7329
=============================================================================
===============================================

In [ ]: ## remove columns with Pvalue greater than alpha (0.05)

In [63]: cols=['euribor3m', 'job_blue-collar', 'job_housemaid',

'month_apr', 'month_aug' ,'month_jul', 'month_jun', 'month_mar',
'month_may', 'month_nov', 'month_oct', "poutcome_success"]
X=os_data_X[cols]
y=os_data_y['y']
logit_model=[Link](y,X)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.

Current function value: 0.547612
Iterations 7
Results: Logit
=================================================================
Model: Logit Pseudo R-squared: 0.210
Dependent Variable: y AIC: 56027.2286
Date: 2023-03-07 19:46 BIC: 56133.3350
No. Observations: 51134 Log-Likelihood: -28002.
Df Model: 11 LL-Null: -35443.
Df Residuals: 51122 LLR p-value: 0.0000
Converged: 1.0000 Scale: 1.0000
No. Iterations: 7.0000
-----------------------------------------------------------------
Coef. [Link]. z P>|z| [0.025 0.975]
-----------------------------------------------------------------
euribor3m 0.1724 0.0055 31.3940 0.0000 0.1616 0.1832
job_blue-collar -1.0763 0.0359 -29.9385 0.0000 -1.1467 -1.0058
job_housemaid -1.6958 0.1293 -13.1176 0.0000 -1.9492 -1.4424
month_apr -0.2657 0.0406 -6.5414 0.0000 -0.3453 -0.1861
month_aug -1.6702 0.0390 -42.7906 0.0000 -1.7467 -1.5937
month_jul -1.6076 0.0389 -41.3749 0.0000 -1.6838 -1.5315
month_jun -1.3558 0.0392 -34.6062 0.0000 -1.4326 -1.2790
month_mar 0.7349 0.0856 8.5880 0.0000 0.5672 0.9026
month_may -1.5306 0.0294 -51.9744 0.0000 -1.5883 -1.4729
month_nov -1.7494 0.0457 -38.3212 0.0000 -1.8389 -1.6600
month_oct 0.4949 0.0746 6.6378 0.0000 0.3488 0.6410
poutcome_success 3.1883 0.0593 53.7274 0.0000 3.0720 3.3046
=================================================================
Linear regression implementaion
In [64]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random
logreg = LogisticRegression()
[Link](X_train, y_train)

Out[64]: LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust
the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with
[Link].

Predicting the test set results and the accuracy

In [65]: y_pred = [Link](X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(l

Accuracy of logistic regression classifier on test set: 0.83

Confusion matrix
In [67]: confusion_matrix = [Link](y_test, y_pred, rownames=['Actual'], colnames=[
[Link](confusion_matrix, annot=True, fmt='g')
print('Accuracy: ',metrics.accuracy_score(y_test, y_pred))
#[Link]('5samples_down_regulated.png')

[Link]()

Accuracy: 0.8296069356626035

The result is telling us that we have 6810+5917 correct predictions

1758+856 incorrect predictions.

Compute precision, recall, F-measure and support

The precision is defined as tp / (tp + fp), where tp is the number of true positives and fp is
the number of false positives. It represents the classifier's ability to avoid labeling negative
samples as positive
The recall is calculated as tp / (tp + fn), where tp is the number of true positives and fn is
the number of false negatives. It measures the classifier's ability to identify all positive
samples
The F-beta score is a weighted harmonic mean of precision and recall, with the optimal
value at 1 and the worst score at 0. The F-beta score assigns more weight to recall than
precision, with the weight determined by the beta factor. A beta value of 1.0 indicates that
recall and precision have equal importance
Finally, the support corresponds to the number of instances of each class in y_test.

In [68]: print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.79 0.89 0.84 7666

1 0.87 0.77 0.82 7675

accuracy 0.83 15341

macro avg 0.83 0.83 0.83 15341
weighted avg 0.83 0.83 0.83 15341

Interpretation:
Out of the complete set of test data, 83% of the term deposits that were promoted corresponded
to the deposits that were preferred by the customers
ROC curve
In [74]:
logit_roc_auc = roc_auc_score(y_test, [Link](X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
[Link]()
[Link](fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
[Link]([0, 1], [0, 1],'r--')
[Link]([0.0, 1.0])
[Link]([0.0, 1.05])
[Link]('False Positive Rate')
[Link]('True Positive Rate')
[Link]('Receiver operating characteristic')
[Link](loc="lower right")
#[Link]('Log_ROC_5samples_down_regulated_with_features.png')
[Link]()

The ROC (Receiver Operating Characteristic) curve is a widely used tool for evaluating
binary classifiers
It is plotted on a graph where the dotted line represents the ROC curve of a completely
random classifier
A reliable classifier should aim to stay as distant as possible from this line, preferably
towards the top-left corner of the graph

Conclusions

Prior to applying the data to the LR model, several steps were taken,

Including data visualization to identify the most effective predictors and the least effective
predictors
The imbalance in the y variable was resolved using the SMOTE method
Recursive feature elimination was also conducted to select the best features that would
improve the accuracy of the model
Additionally, columns with insignificant P-values (>=0.5) were removed after the
implementation of the LR method.

Asg One
No ratings yet
Asg One
10 pages
Bank Marketing Classification Models
No ratings yet
Bank Marketing Classification Models
23 pages
Kunal Assignment 3
No ratings yet
Kunal Assignment 3
19 pages
Bank Marketing Ingles
No ratings yet
Bank Marketing Ingles
37 pages
Articles Xgboost Classification With Smote-Enn Algorithm
No ratings yet
Articles Xgboost Classification With Smote-Enn Algorithm
11 pages
Data Pre Processing and Cleaning
No ratings yet
Data Pre Processing and Cleaning
56 pages
Ensemble Techniques Project
100% (2)
Ensemble Techniques Project
28 pages
Zindi Financial Inclusion Guide
No ratings yet
Zindi Financial Inclusion Guide
12 pages
Data Analysis in The Banking Sector: Pandas Fundamentals
No ratings yet
Data Analysis in The Banking Sector: Pandas Fundamentals
16 pages
Germany Credit Analysis
No ratings yet
Germany Credit Analysis
41 pages
LDA CreditCardDefault Code N
No ratings yet
LDA CreditCardDefault Code N
11 pages
Kritika Sejwal 24MCI10023 ML Lab Project Report
No ratings yet
Kritika Sejwal 24MCI10023 ML Lab Project Report
10 pages
Exp 343
No ratings yet
Exp 343
18 pages
Analyzing Customer Data with NumPy
No ratings yet
Analyzing Customer Data with NumPy
9 pages
Data Visualization & Preprocessing Guide
No ratings yet
Data Visualization & Preprocessing Guide
18 pages
Project Paarth
No ratings yet
Project Paarth
21 pages
BankX Marketing 1744722258
No ratings yet
BankX Marketing 1744722258
29 pages
Data Pre Processing and Cleaning
No ratings yet
Data Pre Processing and Cleaning
23 pages
Python Code For Loan Default Prediction
No ratings yet
Python Code For Loan Default Prediction
4 pages
Naive Bayes Vs Logistic Regression
No ratings yet
Naive Bayes Vs Logistic Regression
16 pages
Supervised Decision Trees A Case Study For AllLife Bank
No ratings yet
Supervised Decision Trees A Case Study For AllLife Bank
50 pages
Cleaning Data in Python
No ratings yet
Cleaning Data in Python
8 pages
Quadexp IDS Project
No ratings yet
Quadexp IDS Project
22 pages
Aiml
No ratings yet
Aiml
27 pages
Predictive+Modelling+-+Logistic+Regression+-+Student+Version-New2.3.ipynb - Colaboratory
No ratings yet
Predictive+Modelling+-+Logistic+Regression+-+Student+Version-New2.3.ipynb - Colaboratory
12 pages
Pandas
No ratings yet
Pandas
35 pages
Loan Interest Prediction Using Linear Regression
No ratings yet
Loan Interest Prediction Using Linear Regression
26 pages
Banking Marketing Target Prediction
No ratings yet
Banking Marketing Target Prediction
13 pages
Logistic Regression Model in Jupyter
No ratings yet
Logistic Regression Model in Jupyter
22 pages
57 - AI2 - PRAC 6.ipynb - Colab
No ratings yet
57 - AI2 - PRAC 6.ipynb - Colab
3 pages
Churn Prediction Model
No ratings yet
Churn Prediction Model
36 pages
Modelling and Simmulation Assignment - Ipynb - Colab
No ratings yet
Modelling and Simmulation Assignment - Ipynb - Colab
7 pages
LendingClub Loan Default Prediction Model
No ratings yet
LendingClub Loan Default Prediction Model
18 pages
Advanced Machine Learning Course Guide
No ratings yet
Advanced Machine Learning Course Guide
36 pages
Credit Risk Prediction Model Overview
No ratings yet
Credit Risk Prediction Model Overview
19 pages
Pandas
No ratings yet
Pandas
32 pages
Credit - Defaulters - Prediction Using Logostic Regression
No ratings yet
Credit - Defaulters - Prediction Using Logostic Regression
17 pages
EDA Basics: Python for Data Analysis
100% (1)
EDA Basics: Python for Data Analysis
30 pages
DS Food
No ratings yet
DS Food
23 pages
Exp 8 - LM
No ratings yet
Exp 8 - LM
10 pages
Machine Learning Paper BD
No ratings yet
Machine Learning Paper BD
16 pages
Loan Prediction
No ratings yet
Loan Prediction
33 pages
Pandas For Machine Learning
No ratings yet
Pandas For Machine Learning
10 pages
Exercises 2
No ratings yet
Exercises 2
10 pages
All Life Bank - AIML - ML - Project - Low - Code - Notebook
No ratings yet
All Life Bank - AIML - ML - Project - Low - Code - Notebook
78 pages
Observation: Import As Import As Import As Import As
No ratings yet
Observation: Import As Import As Import As Import As
31 pages
Python Pandas: 12 Data Manipulation Techniques
100% (2)
Python Pandas: 12 Data Manipulation Techniques
19 pages
DSBDA Practicals
No ratings yet
DSBDA Practicals
16 pages
ML LAB Manual-1
No ratings yet
ML LAB Manual-1
33 pages
Data Preprocess Steps
No ratings yet
Data Preprocess Steps
2 pages
Da Lab Mannual
No ratings yet
Da Lab Mannual
25 pages
Kunal DA-12 Assignment-4
No ratings yet
Kunal DA-12 Assignment-4
26 pages
DataAnalytics Lab Manual
No ratings yet
DataAnalytics Lab Manual
35 pages
Census Income Data Analysis Guide
No ratings yet
Census Income Data Analysis Guide
22 pages
Machine Learning Record VR19
No ratings yet
Machine Learning Record VR19
46 pages
Unit7 Working With Pandas - Solved
No ratings yet
Unit7 Working With Pandas - Solved
12 pages
Eda - 1@3pm 8th Nov
No ratings yet
Eda - 1@3pm 8th Nov
2 pages
Machine Learning With PySpark and MLlib - Solving A Binary Classification Problem - by Susan Li - Towards Data Science
No ratings yet
Machine Learning With PySpark and MLlib - Solving A Binary Classification Problem - by Susan Li - Towards Data Science
10 pages
Complete Case Analysis (CCA) : Advantages
No ratings yet
Complete Case Analysis (CCA) : Advantages
6 pages
000-First-Use-of-English-Part-1-With-Answers (1) B2 Ingles PAU Practicar
No ratings yet
000-First-Use-of-English-Part-1-With-Answers (1) B2 Ingles PAU Practicar
189 pages
Civil Engineering Calculus Exam
No ratings yet
Civil Engineering Calculus Exam
4 pages
Elizabeth Bennet
No ratings yet
Elizabeth Bennet
31 pages
Celtx vs Word: Script Writing Guide
No ratings yet
Celtx vs Word: Script Writing Guide
2 pages
Class VI Term II Syllabus 2021-22
No ratings yet
Class VI Term II Syllabus 2021-22
3 pages
Guidelines For Preparation of VTU MTech Project
No ratings yet
Guidelines For Preparation of VTU MTech Project
2 pages
MTH 211: Set Theory & Abstract Algebra
No ratings yet
MTH 211: Set Theory & Abstract Algebra
77 pages
Writing Tips
No ratings yet
Writing Tips
11 pages
Evolution of Holy Spirit Interpretations
No ratings yet
Evolution of Holy Spirit Interpretations
2 pages
JCHVol39#1Final With Cover
100% (1)
JCHVol39#1Final With Cover
131 pages
Gs33j64e10-01en 009
No ratings yet
Gs33j64e10-01en 009
19 pages
Decoding The Perpetuation of Patriarchal Culture in The Barbie Movie..
No ratings yet
Decoding The Perpetuation of Patriarchal Culture in The Barbie Movie..
13 pages
English Tenses and Grammar Guide
100% (2)
English Tenses and Grammar Guide
9 pages
Meeting 13 - English 1
No ratings yet
Meeting 13 - English 1
7 pages
Daniel 5
No ratings yet
Daniel 5
5 pages
AI & Data Science Projects by Sanjay S
No ratings yet
AI & Data Science Projects by Sanjay S
1 page
Huawei: VRP User Manual - Configuration Guide
No ratings yet
Huawei: VRP User Manual - Configuration Guide
9 pages
Literacylesson Foreshadowing
0% (1)
Literacylesson Foreshadowing
3 pages
A Job Interview Learner Worksheets
100% (1)
A Job Interview Learner Worksheets
7 pages
Sanet - ST Debt-Free Art Degree Foundations in Drawing-9780760391617
75% (16)
Sanet - ST Debt-Free Art Degree Foundations in Drawing-9780760391617
332 pages
English Grammar in Use
No ratings yet
English Grammar in Use
2 pages
Cross and Scepter The Rise of The Scandinavian Kingdoms From The Vikings To The Reformation Course Book Sverre Bagge PDF Download
No ratings yet
Cross and Scepter The Rise of The Scandinavian Kingdoms From The Vikings To The Reformation Course Book Sverre Bagge PDF Download
35 pages
Introduction to SAS PROC IML
No ratings yet
Introduction to SAS PROC IML
50 pages
Bibliography On Autobiographical Studies
No ratings yet
Bibliography On Autobiographical Studies
4 pages
Pspcljeee2021mornwww Exambook Co
No ratings yet
Pspcljeee2021mornwww Exambook Co
26 pages
402 Information Tech SQP
No ratings yet
402 Information Tech SQP
10 pages
135 To 150 Mcqs Word Best Mcqs
No ratings yet
135 To 150 Mcqs Word Best Mcqs
3 pages
HTML Input Types
No ratings yet
HTML Input Types
13 pages
Experiment - 6: Write C Program To Compute The First and Follow Sets For The Given Grammar
No ratings yet
Experiment - 6: Write C Program To Compute The First and Follow Sets For The Given Grammar
6 pages
At The Beach
No ratings yet
At The Beach
1 page