Python Code For Loan Default Prediction

The document details a machine learning project to classify loan applicants as defaulters or non-defaulters using Python. It covers loading and exploring the data, preparing it for modeling, building and evaluating a logistic regression model, and concludes with next steps to improve the model performance.

Uploaded by sitaramr54

Python code for loan default prediction

A simple machine learning project to classify loan applicants as defaulters or non-defaulters

Importing libraries
• # We need pandas to manipulate data frames
• import pandas as pd
• # We need numpy to perform numerical operations
• import numpy as np
• # We need sklearn to use machine learning models and metrics
• from sklearn.linear_model import LogisticRegression
• from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
• # We need matplotlib and seaborn to visualize the data
• import matplotlib.pyplot as plt
• import seaborn as sns

Loading and exploring the data


• # We load the data from a csv file into a pandas data frame
• data = pd.read_csv('loan_data.csv')
• # We check the shape and the first five rows of the data
• print(data.shape)
• print(data.head())
• # We see that the data has 13 columns and 614 rows
• # The columns are: Loan_ID, Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, Property_Area, Loan_Status
• # The target variable is Loan_Status, which indicates whether the loan was approved (Y) or not (N)
• # We check the summary statistics of the numerical columns
• print(data.describe())
• # We see that ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term are on very different scales, with widely differing means and standard deviations
• # We check the distribution of the categorical columns
• print(data['Gender'].value_counts())
• print(data['Married'].value_counts())
• print(data['Dependents'].value_counts())
• print(data['Education'].value_counts())
• print(data['Self_Employed'].value_counts())
• print(data['Credit_History'].value_counts())
• print(data['Property_Area'].value_counts())
• print(data['Loan_Status'].value_counts())
• # We see that the categories are imbalanced: males, married applicants, graduates, non-self-employed applicants, applicants with a credit history, and approved loans each outnumber their opposite category
• # Some columns, such as Gender, Married, Dependents, Self_Employed, and Credit_History, have missing values; value_counts() excludes NaN by default, so their counts sum to fewer than 614 (pass dropna=False to list NaN explicitly)
• # We visualize the relationship between the target variable and the categorical variables using bar plots (count plots)
• plt.figure(figsize=(15,10))
• plt.subplot(2,4,1)
• sns.countplot(x='Gender', hue='Loan_Status', data=data)
• plt.subplot(2,4,2)
• sns.countplot(x='Married', hue='Loan_Status', data=data)
• plt.subplot(2,4,3)
• sns.countplot(x='Dependents', hue='Loan_Status', data=data)
• plt.subplot(2,4,4)
• sns.countplot(x='Education', hue='Loan_Status', data=data)
• plt.subplot(2,4,5)
• sns.countplot(x='Self_Employed', hue='Loan_Status', data=data)
• plt.subplot(2,4,6)
• sns.countplot(x='Credit_History', hue='Loan_Status', data=data)
• plt.subplot(2,4,7)
• sns.countplot(x='Property_Area', hue='Loan_Status', data=data)
• plt.tight_layout()
• plt.show()
• # The plots suggest that the loan approval rate is higher for applicants who are male, married, have 0 or 1 dependents, are graduates, are not self-employed, have a credit history, or live in semi-urban areas
• # We visualize the relationship between the target variable and the numerical variables using box plots
• plt.figure(figsize=(15,5))
• plt.subplot(1,4,1)
• sns.boxplot(x='Loan_Status', y='ApplicantIncome', data=data)
• plt.subplot(1,4,2)
• sns.boxplot(x='Loan_Status', y='CoapplicantIncome', data=data)
• plt.subplot(1,4,3)
• sns.boxplot(x='Loan_Status', y='LoanAmount', data=data)
• plt.subplot(1,4,4)
• sns.boxplot(x='Loan_Status', y='Loan_Amount_Term', data=data)
• plt.tight_layout()
• plt.show()
• # We see no obvious difference in the median income or loan amount between the approved and rejected loans
• # We also see that there are some outliers in the income and loan amount columns
• # We see that the loan amount term is mostly 360 months, with some exceptions
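
The exploration steps above read missing values and approval rates off printed counts and plots. A minimal sketch of how the same checks could be made explicit is shown below. It assumes a small made-up frame (`toy`) in place of loan_data.csv, so the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for loan_data.csv (values are made up).
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", np.nan, "Male", "Female"],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, np.nan, 1.0],
    "Loan_Status": ["Y", "Y", "N", "Y", "N", "N"],
})

# value_counts() skips NaN unless dropna=False is passed.
gender_counts = toy["Gender"].value_counts(dropna=False)

# Missing values per column, counted directly.
missing_per_column = toy.isnull().sum()

# Row-normalized cross-tabulation: the share of N and Y within each Gender.
# Rows with a missing Gender are left out of the table.
approval_by_gender = pd.crosstab(
    toy["Gender"], toy["Loan_Status"], normalize="index"
)

print(gender_counts)
print(missing_per_column)
print(approval_by_gender)
```

A normalized cross-tabulation gives approval rates directly, whereas a count plot shows raw counts that have to be compared by eye.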

Preparing the data for modeling


• # We split the data into features (X) and target (y)
• X = data.drop(['Loan_ID', 'Loan_Status'], axis=1)
• y = data['Loan_Status']
• # We encode the target variable as 1 for Y and 0 for N
• y = y.map({'Y':1, 'N':0})
• # We check the percentage of missing values in each column
• print(X.isnull().sum()/len(X)*100)
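• # We see that the missing values are less than 10% in each column, so we can impute them with the mode (most frequent value)
• # We assign the result back to the column, because fillna(..., inplace=True) on a column selection may not update X in recent pandas versions
• for col in X.columns:
•     X[col] = X[col].fillna(X[col].mode()[0])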
• # We check that there are no more missing values
• print(X.isnull().sum())
• # We encode the categorical variables as dummy variables (0 or 1)
• X = pd.get_dummies(X, drop_first=True)
• # We check the new shape and columns of X
• print(X.shape)
• print(X.columns)
• # We see that X now has 14 columns, after creating dummy variables for Gender, Married, Dependents, Education, Self_Employed, and Property_Area (Credit_History is already numeric, so get_dummies leaves it unchanged)
• # We split the data into train and test sets, with 80% for train and 20% for test
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
• # We check the shape of the train and test sets
• print(X_train.shape, y_train.shape)
• print(X_test.shape, y_test.shape)
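
The imputation and encoding steps above can be tried on a small example to see what they produce. The sketch below assumes a made-up frame (`X_toy`) with a few of the same column names. It is an illustration of mode imputation followed by `pd.get_dummies(..., drop_first=True)`, not a reproduction of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features (made-up values), mixing text and numeric columns.
X_toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Credit_History": [1.0, np.nan, 1.0, 0.0],
    "LoanAmount": [120.0, 100.0, np.nan, 100.0],
})

# Fill each column's gaps with its most frequent value, assigning the result
# back instead of relying on inplace=True on a column selection.
for col in X_toy.columns:
    X_toy[col] = X_toy[col].fillna(X_toy[col].mode()[0])

# Only the text columns are expanded into dummy columns; numeric columns pass
# through unchanged. drop_first=True drops one level per encoded column.
X_encoded = pd.get_dummies(X_toy, drop_first=True)

print(X_encoded.columns.tolist())
print(X_encoded.isnull().sum().sum())
```

Dropping the first level means a column with k categories becomes k - 1 dummy columns, which is why six encoded columns plus five numeric ones can add up to 14 features.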

Building and evaluating the model


• # We create a logistic regression model, which is a simple and widely used classification algorithm
• model = LogisticRegression()
• # We fit the model on the train data
• model.fit(X_train, y_train)
• # We make predictions on the test data
• y_pred = model.predict(X_test)
• # We evaluate the model performance using accuracy, confusion matrix, and classification report
• print('Accuracy:', accuracy_score(y_test, y_pred))
• print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
• print('Classification report:\n', classification_report(y_test, y_pred))
• # We see that the model has an accuracy of 0.78, which means it correctly classified 78% of the test data
• # The confusion matrix shows 79 true positives, 18 true negatives, 15 false positives, and 11 false negatives
• # The classification report shows a precision of 0.84, a recall of 0.87, and an f1-score of 0.85 for the positive class (loan approved)
• # For the negative class (loan rejected), it shows a precision of 0.55, a recall of 0.47, and an f1-score of 0.51
• # We see that the model is better at predicting the positive class than the negative class, which is expected given the imbalanced data
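
The metrics in the classification report follow directly from the four confusion-matrix counts. As a rough check, the sketch below recomputes them from the counts quoted above; the helper name `precision_recall_f1` is made up for illustration. With these counts the accuracy is about 0.79, and the negative-class values come out somewhat higher than the quoted figures, so the quoted numbers are best treated as approximate and rechecked against your own run.

```python
# Counts quoted in the comments above.
tp, tn, fp, fn = 79, 18, 15, 11


def precision_recall_f1(true_hits, false_hits, misses):
    """Precision, recall, and F1 for one class.

    true_hits:  samples of the class that were predicted as the class
    false_hits: samples of the other class that were predicted as the class
    misses:     samples of the class that were predicted as the other class
    """
    precision = true_hits / (true_hits + false_hits)
    recall = true_hits / (true_hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tp + tn) / (tp + tn + fp + fn)
positive = precision_recall_f1(tp, fp, fn)  # class 1: loan approved
negative = precision_recall_f1(tn, fn, fp)  # class 0: loan rejected

print("accuracy:", round(accuracy, 3))
print("positive class:", [round(v, 3) for v in positive])
print("negative class:", [round(v, 3) for v in negative])
```

For the negative class the roles swap: a false negative counts as a wrong "rejected" prediction, and a false positive counts as a missed rejection.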

Conclusion
• # We have completed a simple machine learning project to predict loan default using Python
• # We have imported the necessary libraries, loaded and explored the data, prepared the data for modeling, and built and evaluated a logistic regression model
• # We have learned some basic steps and concepts of data analysis and machine learning, such as data manipulation, data visualization, data imputation, data encoding, data splitting, model fitting, model prediction, and model evaluation
• # We have seen that the model has a decent accuracy, but it can be improved by using more complex algorithms, tuning the hyperparameters, or balancing the data
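
Two of the suggested improvements, tuning the hyperparameters and balancing the data, can be sketched with scikit-learn alone. The example below is one possible approach, not the author's method. It assumes synthetic data from `make_classification` in place of the loan data, reweights the classes with `class_weight="balanced"` instead of resampling, and searches a small, arbitrarily chosen grid of `C` values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data: roughly 30% class 0 and 70% class 1.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.3, 0.7], random_state=42
)

# stratify keeps the class ratio the same in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling helps the solver converge; class_weight="balanced" weights each
# class inversely to its frequency so the minority class counts for more.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Illustrative grid over the inverse regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Macro-averaged F1 gives both classes equal weight in model selection.
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["logisticregression__C"])
print(classification_report(y_te, search.predict(X_te)))
```

Selecting on macro F1 rather than accuracy is a deliberate choice here: on imbalanced data, accuracy can stay high even when the minority class is predicted poorly.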
