Loan Eligibility Prediction using Machine Learning Models in Python
Last Updated : 23 Jul, 2025
Have you ever wondered how apps predict whether your loan will be approved? In this article, we will develop one such model, which predicts whether a person's loan will be approved using background information about the applicant, such as gender, marital status and income.
Step 1: Importing Libraries
In this step, we will be importing libraries like NumPy, Pandas, Matplotlib, etc.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics
from sklearn.svm import SVC
from imblearn.over_sampling import RandomOverSampler
import warnings
warnings.filterwarnings('ignore')
Step 2: Loading Dataset
Python
df = pd.read_csv('loan_data.csv')
df.head()
Output:
To see the shape of the dataset, we can use the shape attribute.
Python
df.shape
Output:
(598, 6)
To print the information of the dataset, we can use the info() method.
Python
df.info()
Output:
To get values like the mean, count and min of each column, we can use the describe() method.
Python
df.describe()
Output:
Step 3: Exploratory Data Analysis
EDA refers to the detailed analysis of the dataset, using plots such as distplots and bar plots.
Let's start by plotting the pie chart for the Loan_Status column.
Python
temp = df['Loan_Status'].value_counts()
plt.pie(temp.values,
        labels=temp.index,
        autopct='%1.1f%%')
plt.show()
Output:
Here we have an imbalanced dataset. We will have to balance it before training any model on this data.
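To put a number on the imbalance rather than reading it off the pie chart, the class proportions can be computed directly. The sketch below uses a small made-up DataFrame named toy (its values are illustrative assumptions, not the article's dataset). With the real data, the same calls would apply to df['Loan_Status'].

```python
import pandas as pd

# Hypothetical stand-in for the Loan_Status column (not the real dataset).
toy = pd.DataFrame({'Loan_Status': ['Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y']})

# Absolute counts and relative share of each class.
counts = toy['Loan_Status'].value_counts()
shares = toy['Loan_Status'].value_counts(normalize=True)

# Ratio of majority to minority class; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print('Imbalance ratio:', imbalance_ratio)
```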
We specify the DataFrame df as the data source for the sb.countplot() function. The x parameter is set to the column name from which the count plot is to be created, and hue is set to 'Loan_Status' to create count bars based on the 'Loan_Status' categories.
Python
plt.subplots(figsize=(15, 5))
for i, col in enumerate(['Gender', 'Married']):
    plt.subplot(1, 2, i+1)
    sb.countplot(data=df, x=col, hue='Loan_Status')
plt.tight_layout()
plt.show()
Output:
One of the main observations we can draw here is that the chances of getting a loan approved are noticeably lower for married people than for those who are not married.
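A comparison like this can also be expressed as an approval rate per group instead of being judged from bar heights. The following is a minimal sketch on made-up rows (the toy DataFrame and its values are assumptions for illustration and say nothing about the real dataset). It uses pd.crosstab with normalize='index' so that each row sums to 1.

```python
import pandas as pd

# Hypothetical sample rows (illustrative only, not the real dataset).
toy = pd.DataFrame({
    'Married':     ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
    'Loan_Status': ['Y',   'N',   'N',   'N',   'Y',  'Y',  'Y',  'N'],
})

# Row-normalised crosstab: each row sums to 1, so the 'Y' column
# is the approval rate within that marital-status group.
rates = pd.crosstab(toy['Married'], toy['Loan_Status'], normalize='index')
print(rates)
```

Running the same crosstab on df['Married'] and df['Loan_Status'] would give the actual rates for the dataset.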
Python
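plt.subplots(figsize=(15, 5))
for i, col in enumerate(['ApplicantIncome', 'LoanAmount']):
    plt.subplot(1, 2, i+1)
    # Note: distplot is deprecated in recent seaborn releases;
    # sb.histplot(df[col], kde=True) is the suggested replacement.
    sb.distplot(df[col])
plt.tight_layout()
plt.show()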
Output:

To find the outliers in these columns, we can use a boxplot.
Python
plt.subplots(figsize=(15, 5))
for i, col in enumerate(['ApplicantIncome', 'LoanAmount']):
    plt.subplot(1, 2, i+1)
    sb.boxplot(df[col])
plt.tight_layout()
plt.show()
Output:
There are some extreme outliers in the data, and we need to remove them.
Python
df = df[df['ApplicantIncome'] < 25000]
df = df[df['LoanAmount'] < 400000]
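The cut-offs above are fixed values chosen by eye from the boxplots. As an alternative, the fences a boxplot itself draws can be computed with the 1.5 × IQR rule. This is a different technique from the one used in the article, sketched here on made-up incomes. The helper iqr_bounds and the toy values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical incomes with one obvious outlier (illustrative values).
toy = pd.DataFrame(
    {'ApplicantIncome': [2500, 3000, 3200, 4000, 4100, 5000, 81000]}
)

lower, upper = iqr_bounds(toy['ApplicantIncome'])

# Keep only the rows that fall inside the fences.
filtered = toy[toy['ApplicantIncome'].between(lower, upper)]

print('Fences:', lower, upper)
print(len(toy), 'rows ->', len(filtered), 'rows')
```

A data-driven rule like this adapts to each column's scale, but it can be aggressive on heavily skewed columns such as income, so a hand-picked threshold remains a reasonable choice.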
Let's see the mean loan amount for males as well as females. For that, we will use the groupby() method.
Python
df.groupby('Gender').mean(numeric_only=True)['LoanAmount']
Output:
Gender LoanAmount
Female 126.697248
Male 146.872294
dtype: float64
On average, the loan amount requested by males is higher than that requested by females.
Python
df.groupby(['Married', 'Gender']).mean(numeric_only=True)['LoanAmount']
Output:
Here is one more interesting observation in addition to the previous one: the loan amount requested by married people is generally higher than that requested by unmarried people. This may be one of the reasons why, as observed earlier, the chances of loan approval are lower for a married person than for an unmarried person.
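Grouping by two columns returns a Series with a two-level index, which can be harder to compare at a glance. As a small sketch, assuming made-up rows in a hypothetical toy DataFrame, unstack() turns the inner level into columns so the four group means sit in one table.

```python
import pandas as pd

# Hypothetical rows (illustrative values, not the real dataset).
toy = pd.DataFrame({
    'Married':    ['Yes', 'Yes', 'No', 'No', 'Yes', 'No'],
    'Gender':     ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'LoanAmount': [150.0, 130.0, 120.0, 100.0, 170.0, 110.0],
})

# Mean loan amount per (Married, Gender) pair, as a Series with a MultiIndex.
mean_by_group = toy.groupby(['Married', 'Gender'])['LoanAmount'].mean()

# unstack() pivots the inner index level (Gender) into columns.
table = mean_by_group.unstack()
print(table)
```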
Python
# Function to apply label encoding to every text (object) column
def encode_labels(data):
    for col in data.columns:
        if data[col].dtype == 'object':
            le = LabelEncoder()
            data[col] = le.fit_transform(data[col])
    return data

# Applying the function to the whole DataFrame
df = encode_labels(df)

# Generating a heatmap that marks feature pairs with correlation above 0.8
sb.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()
Output:
Step 4: Data Preprocessing
In this step, we will split the data for training and testing. After that, we will preprocess the training data.
Python
features = df.drop('Loan_Status', axis=1)
target = df['Loan_Status'].values
X_train, X_val, Y_train, Y_val = train_test_split(features, target,
                                                  test_size=0.2,
                                                  random_state=10)
# As the data was highly imbalanced we will balance
# it by adding repetitive rows of minority class.
ros = RandomOverSampler(sampling_strategy='minority',
                        random_state=0)
X, Y = ros.fit_resample(X_train, Y_train)
X_train.shape, X.shape
Output:
((456, 5), (638, 5))
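The growth from 456 to 638 rows is consistent with a majority class of 319 training rows and a minority class of 137 rows that has been topped up with 182 repeated rows until both classes are the same size. The sketch below mimics that idea with plain pandas on a made-up training set. It is not the imblearn implementation, and the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical imbalanced training set: 6 approved ('Y'), 2 rejected ('N').
train = pd.DataFrame({
    'ApplicantIncome': [3000, 4200, 5100, 2800, 6100, 3900, 2500, 3300],
    'Loan_Status':     ['Y',  'Y',  'Y',  'Y',  'Y',  'Y',  'N',  'N'],
})

counts = train['Loan_Status'].value_counts()
minority_label = counts.idxmin()
n_extra = counts.max() - counts.min()

# Draw extra minority rows with replacement until both classes match.
minority_rows = train[train['Loan_Status'] == minority_label]
extra = minority_rows.sample(n=n_extra, replace=True, random_state=0)
balanced = pd.concat([train, extra], ignore_index=True)

print(balanced['Loan_Status'].value_counts())
```

Only the training split is resampled. The validation split keeps its original class balance, so the evaluation still reflects the real distribution.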
We will now use standard scaling to normalize the data.
Python
# Normalizing the features for stable and fast training.
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_val = scaler.transform(X_val)
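Note that the scaler is fitted on the training data only and then reused on the validation data. A minimal sketch of what that means, using small made-up arrays (X_train_toy and X_val_toy are hypothetical names and values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices (illustrative values only).
X_train_toy = np.array([[1000.0, 100.0], [2000.0, 150.0], [3000.0, 200.0]])
X_val_toy = np.array([[2000.0, 300.0]])

scaler = StandardScaler()
# The mean and standard deviation are learned from the training data only ...
X_train_scaled = scaler.fit_transform(X_train_toy)
# ... and reused unchanged on the validation data.
X_val_scaled = scaler.transform(X_val_toy)

print(X_train_scaled.mean(axis=0))  # approximately 0 per column
print(X_train_scaled.std(axis=0))   # approximately 1 per column
print(X_val_scaled)
```

The scaled validation values need not have zero mean or unit variance, because they are expressed relative to the training statistics. Fitting the scaler on the validation data as well would leak information from it into preprocessing.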
Step 5: Model Development
We will use Support Vector Classifier for training the model.
Python
from sklearn.metrics import roc_auc_score
model = SVC(kernel='rbf')
model.fit(X, Y)
# The metric reported here is ROC AUC (computed from predicted labels), not accuracy.
print('Training ROC AUC : ', metrics.roc_auc_score(Y, model.predict(X)))
print('Validation ROC AUC : ', metrics.roc_auc_score(Y_val, model.predict(X_val)))
print()
Output:
Training ROC AUC : 0.6300940438871474
Validation ROC AUC : 0.48198198198198194
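The scores above are computed from model.predict, which returns hard class labels. With hard labels the ROC curve has a single operating point, so roc_auc_score reduces to the mean of sensitivity and specificity. ROC AUC is more commonly computed from continuous scores, which SVC exposes through decision_function. The sketch below contrasts the two on synthetic data from make_classification. The data and the variable names are assumptions for illustration, not the article's dataset or results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=10)

clf = SVC(kernel='rbf').fit(X_tr, y_tr)

# Hard 0/1 predictions: suitable for accuracy.
hard_preds = clf.predict(X_te)
# Continuous scores: signed distance from the decision boundary.
scores = clf.decision_function(X_te)

acc = accuracy_score(y_te, hard_preds)
auc_from_labels = roc_auc_score(y_te, hard_preds)
auc_from_scores = roc_auc_score(y_te, scores)

print('Accuracy           :', acc)
print('ROC AUC from labels:', auc_from_labels)
print('ROC AUC from scores:', auc_from_scores)
```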
Step 6: Model Evaluation
Model evaluation can be done using a confusion matrix.
The SVC model has already been trained on the training data X and Y, so we first recompute the ROC AUC scores for both the training and validation datasets. We then build the confusion matrix for the validation data using the confusion_matrix function from sklearn.metrics. Finally, we plot the confusion matrix as a heatmap using seaborn.
Python
from sklearn.metrics import confusion_matrix, roc_auc_score

# ROC AUC computed from the model's predicted labels
training_roc_auc = roc_auc_score(Y, model.predict(X))
validation_roc_auc = roc_auc_score(Y_val, model.predict(X_val))
print('Training ROC AUC Score:', training_roc_auc)
print('Validation ROC AUC Score:', validation_roc_auc)
print()

# Confusion matrix for the validation data (rows: true labels, columns: predicted labels)
cm = confusion_matrix(Y_val, model.predict(X_val))
Output:
Training ROC AUC Score: 0.6300940438871474
Validation ROC AUC Score: 0.48198198198198194
Python
plt.figure(figsize=(6, 6))
sb.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Output:
Confusion Matrix
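The four cells of a binary confusion matrix are enough to derive the usual summary metrics by hand. As a sketch on made-up labels (y_true and y_pred are hypothetical, with 1 standing for an approved loan and 0 for a rejected one):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = approved, 0 = rejected).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])

# Rows are true labels, columns are predicted labels.
cm_toy = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm_toy.ravel()

precision = tp / (tp + fp)   # of the predicted approvals, how many were right
recall = tp / (tp + fn)      # of the actual approvals, how many were found
accuracy = (tp + tn) / cm_toy.sum()

print(cm_toy)
print('Precision:', precision)
print('Recall   :', recall)
print('Accuracy :', accuracy)
```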
Python
from sklearn.metrics import classification_report
print(classification_report(Y_val, model.predict(X_val)))
Output:
As this dataset contains few features, the performance of the model is not up to the mark. With a larger and richer dataset, we may be able to achieve better accuracy.