Proceedings of the International Conference on Electronics and Sustainable Communication Systems (ICESC 2020)

IEEE Xplore Part Number: CFP20V66-ART; ISBN: 978-1-7281-4108-4

An Approach for Prediction of Loan Approval using Machine Learning Algorithm

Mohammad Ahmad Sheikh
School of Computing Science & Engineering
Galgotias University
Greater Noida, India
[email protected]

Amit Kumar Goel
Professor, School of Computing Science & Engineering
Galgotias University
Greater Noida, India
[email protected]

Tapas Kumar
Professor, School of Computing Science & Engineering
Galgotias University
Greater Noida, India
[email protected]

Abstract— In our banking system, banks have many products to sell, but the main source of income of any bank is its credit line, since it earns interest on the loans it extends. A bank's profit or loss depends to a large extent on loans, i.e., on whether customers pay back their loans or default. By predicting loan defaulters, a bank can reduce its Non-Performing Assets, which makes the study of this phenomenon very important. Previous research in this area has shown that there are many methods for studying the problem of controlling loan default. Because correct predictions are very important for maximizing profit, it is essential to study the nature of the different methods and compare them. An important approach in predictive analytics, the logistic regression model, is used here to study the problem of predicting loan defaulters. The data is collected from Kaggle for study and prediction. Logistic regression models are fitted and different performance measures are computed. The models are compared on the basis of performance measures such as sensitivity and specificity. The final results show that the models produce different results. The model is marginally better because it includes variables (personal attributes of the customer such as age, purpose, credit history, credit amount, credit duration, etc.) other than checking account information (which shows the wealth of a customer) that should be taken into account to calculate the probability of default on a loan correctly. Therefore, by using a logistic regression approach, the right customers to target for granting a loan can be detected by evaluating their likelihood of default. The model concludes that a bank should not only target rich customers for granting loans but should also assess the other attributes of a customer, which play a very important part in credit granting decisions and in predicting loan defaulters.

Keywords—loan, outlier, prediction, component, overfitting, transform

I. INTRODUCTION
This paper uses the data of previous customers of various banks whose loans were approved on the basis of a set of parameters. The machine learning model is trained on that record to get accurate results. The main objective of this research is to predict the safety of a loan [1][3]. To predict loan safety, the logistic regression algorithm is used. First, the data is cleaned so that the data set has no missing values. To train the model, a data set of 1500 cases with 10 numerical and 8 categorical attributes has been taken. To credit a loan to a customer, various parameters such as CIBIL Score (Credit History), Business Value, Assets of Customer, etc. have been considered. The list of parameters is shown below:

Parameter                                    Type
Qualification                                Categorical
In Service / Business Owner                  Categorical
Individual income of Applicant               Qualitative
Individual income of Co-Applicant (if any)   Qualitative
Amount of Loan required                      Qualitative
Term for which loan Required                 Qualitative
Credit History of Applicant                  Qualitative
Area of Property                             Categorical

II. LITERATURE SURVEY
Logistic regression is a popular and very useful machine learning algorithm for classification problems. Its advantage is that it is a predictive analysis: it is used to describe data and to explain the relationship between a single binary variable and one or more nominal, ordinal, and ratio-level variables that are independent in nature.

The prediction model is developed using the sigmoid function in logistic regression, since the target outcome is binary, either 0 or 1 [11][15]. The dataset of bank customers has been divided into training and test sets. The training set contains approximately 600+ rows and 13+ columns, whereas the test set contains 300+ rows and 12+ columns; the test set does not contain the target variable. Both datasets have missing values in their rows. The mean, median, or mode is used to fill the missing values rather than removing the rows completely, because the datasets are already small. The project then proceeds with feature engineering techniques and moves on to exploratory data analysis, where the dependent and independent variables are studied through statistical concepts such as the normal distribution and the probability density function. Univariate, bivariate, and multivariate analysis gives insight into the dependent and independent variables [13][14]. The model focuses on targeting those customers who are
eligible for loans; logistic regression with the sigmoid function is therefore used, as its output probability can be divided into a binary outcome. In this way the prediction model can be developed.

III. PROBLEM STATEMENT
Banks, housing finance companies, and some NBFCs deal in various types of loans, such as housing loans, personal loans, and business loans, in all parts of the country. These companies operate in rural, semi-urban, and urban areas. After a customer applies for a loan, these companies validate whether the customer is eligible for it. This paper provides a solution to automate this process by employing a machine learning algorithm. The customer fills in an online loan application form. This form consists of details such as Sex, Marital Status, Qualification, Details of Dependents, Annual Income, Amount of Loan, Credit History of Applicant, and others. To automate the process, the algorithm first identifies those segments of customers who are eligible for loan amounts, so that the bank can focus on these customers [4][7].

Loan prediction is a very common real-life problem that every finance company faces in its lending operations. If the loan approval process is automated, it can save a lot of man-hours and improve the speed of service to customers. The increase in customer satisfaction and the savings in operational costs are significant [9]. However, the benefits can only be reaped if the bank has a robust model to accurately predict which customers' loans it should approve and which it should reject, in order to minimize the risk of loan default.

IV. PROPOSED MODEL
The proposed model predicts whether the bank should grant a loan to a customer. Because the target is classification, logistic regression with the sigmoid function is used to develop the model. Preprocessing is the major part of the work and consumes the most time; it is followed by exploratory data analysis, feature engineering, and then model selection. The two separate datasets are then fed to the model.

Logistic regression is a statistical machine learning technique that classifies data by considering outcome variables on extreme ends and fitting a logistic (S-shaped) curve that distinguishes between them. In this way, predictions can be made through logistic regression.

A. Data Collection
The data has been collected from Kaggle, one of the most popular data source providers for learning purposes. It comes as two data sets, one for training and one for testing [12]. The training dataset is further divided into two parts in a ratio such as 80:20 or 70:30: the larger part is used to train the model and the smaller part is used to test it, and from this the accuracy of the developed model is calculated.

B. Pre Processing
Data mining techniques have been used in pre-processing to transform the raw data, which is collected through an online form, into useful and efficient formats. This conversion is needed because the raw data may contain irrelevant, missing, and noisy information. Data cleaning techniques have been used to deal with this problem.

Before data mining, data reduction techniques are used to deal with the huge volume of data. Data analysis then becomes easier and tends to give accurate results, storage capacity is freed up, and the cost of analyzing the data is reduced.

The size of the data can be reduced by encoding mechanisms, which may be lossy or lossless. If the original data can be obtained after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Wavelet transforms and PCA (Principal Component Analysis) are effective methods for reduction.

Attribute                     Missing values
ID                            0
Sex                           13
Married                       3
No_Dependents                 15
Qualification                 0
In Service / Self_Employed    32
Annual_Income_Applicant       0
Annual_income_Coapplicant     0
Amount_Loan                   22
Term                          14
Credit_History_Applicant      50
Assets                        0
Status_Loan                   0

C. Feature Engineering
In feature engineering, a proper input dataset that is compatible with the requirements of the machine learning algorithm is prepared, so that the performance of the machine learning model improves. In our model, the Pandas and NumPy libraries have been imported.

    import pandas as pd
    import numpy as np

D. List of Techniques:
1) Imputation: Missing values are one more major problem when data is prepared for a machine learning model. There may be many reasons for missing values, such as human errors, interruptions in the flow of data, security concerns, and so on. Missing values severely affect the performance of a machine learning model.

    # Fill missing categorical values with the most frequent value (mode) of each column
    train['Gender'] = train['Gender'].fillna(train['Gender'].mode()[0])
    train['Married'] = train['Married'].fillna(train['Married'].mode()[0])
    train['Dependents'] = train['Dependents'].fillna(train['Dependents'].mode()[0])

Attribute                     Missing values
ID                            0
Sex                           0
Married                       0
No_Dependents                 0
Qualification                 0
In Service / Self_Employed    0
Annual_Income_Applicant       0
Annual_income_Coapplicant     0
Amount_Loan                   22
Term                          14
Credit_History_Applicant      0
Assets                        0
Status_Loan                   0
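The imputation step can be illustrated with a minimal sketch. It assumes a toy data frame whose values are invented for illustration, a hypothetical helper named impute, and one possible policy: mode for categorical columns and median for the numeric loan amount. The paper states only that the mean, median, or mode is used, so the choice per column here is an assumption rather than the authors' exact procedure.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; all values are invented for illustration.
train = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Married": ["Yes", "No", None, "Yes"],
    "Amount_Loan": [128.0, np.nan, 66.0, 120.0],
    "Term": [360.0, 360.0, np.nan, 180.0],
})

def impute(df, mode_cols, median_cols):
    """Return a copy of df with gaps filled by the column mode or median.

    Hypothetical helper: which columns get which statistic is an assumption.
    """
    out = df.copy()
    for col in mode_cols:
        # mode() ignores missing values and returns the most frequent entry first
        out[col] = out[col].fillna(out[col].mode()[0])
    for col in median_cols:
        # median() skips NaN by default
        out[col] = out[col].fillna(out[col].median())
    return out

filled = impute(train, mode_cols=["Gender", "Married", "Term"],
                median_cols=["Amount_Loan"])

# Per-column count of missing values after imputation
missing_after = filled.isnull().sum()
print(missing_after)
```

Working on a copy keeps the raw data available for comparison. The median is often preferred over the mean for skewed amounts such as loan sizes, but either is consistent with the description in the text.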

2) Handling Outliers: To detect outliers, the data is first visualized and the outliers are then handled. Decisions about outliers that are made from a visualization are precise and accurate. Percentiles are another mathematical method for detecting outliers: a certain percentage of the values from the top or from the bottom is treated as outliers. The key point is again how to set the percentage value, and this depends on the distribution of the data, as mentioned earlier.

Fig. 1. Application income vs Loan Status

3) Binning: Binning is a trade-off between performance and overfitting. For numerical columns, except in a few obvious overfitting cases, binning might be redundant for some kinds of algorithms because of its effect on model performance. For categorical columns, however, labels with low frequencies might negatively affect the robustness of statistical models. Assigning a common category to all these less frequent values helps to keep the model robust.

Fig. 2. Heat Map

4) Log Transform: Logarithm transformation (or log transform) is a very common mathematical transformation technique in feature engineering. Its benefit is that it handles skewed data: after transformation, the distribution becomes closer to normal. The log transform also decreases the effect of outliers by normalizing magnitude differences, so the machine learning model becomes more robust.

(Figure panels: Before Log Transform; After Log Transform)

5) One Hot Encoding: One hot encoding is a commonly used encoding method in machine learning. It spreads the values of a column over multiple columns holding the values 0 and 1. These values express the relationship between the grouped column and the encoded columns. The method converts categorical data, which is difficult for algorithms to understand, into a numerical format and makes it possible to group the categorical data without losing any information.

    from sklearn.preprocessing import LabelEncoder
    # Note: LabelEncoder maps each category to an integer code; it does not create 0/1 indicator columns
    number = LabelEncoder()

G   M   D   E   SE  AI    CAI   LA   CH   LAT  PA  LAL
1   0   0   0   0   5849  0     128  360  1    2   4.85203
1   1   1   0   0   4583  1508  128  360  1    0   4.85203

G    Sex
M    Married
D    No_Dependents
E    Qualification
SE   In Service / Self_Employed
AI   Annual_Income_Applicant
CAI  Annual_income_Coapplicant
LA   Amount_Loan
CH   Credit_History_Applicant
LAT  Loan Amount Transfer
PA   Assets
LAL  Loan Amount Log

V. MODEL SELECTION
Model selection is the process of selecting a final machine learning model from among a group of candidate models for a particular training dataset, here that of loan customers.

There are different types of models, such as logistic regression, SVM, KNN, etc. All these models have merits and demerits; for example, predictive error arises from the statistical noise in the data, the incompleteness of the data sample, and the limitations of each model type. The chosen model must meet the requirements and constraints of the project stakeholders (bank and customers). A model should be:
• Skillful as compared to naive models.
• Skillful relative to other tested models.
• Skillful relative to the state-of-the-art.
Thus, prediction of loan approval is a classification problem, and hence this model is used.

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    # x_train holds the feature columns, y_train the loan status labels
    model.fit(x_train, y_train)

VI. MODEL EVALUATION
Model evaluation is the technique used to assess the performance of the model under some constraints. While evaluating, it should be kept in mind that the model must neither underfit nor overfit. Various methods are available to evaluate the performance of the model, such as the confusion matrix, accuracy, precision, recall, F1 score, etc.

1) Confusion Matrix:

Fig. 3. Confusion Matrix

2) Accuracy:
The accuracy of the model has been measured by predefined metrics. With balanced classes the model shows high accuracy, but in the case of unbalanced classes the accuracy is much lower.

3) Precision:
Precision is the ratio, expressed as a percentage, of true positive instances to the total predicted positive instances. In the equation below, the denominator represents all positive predictions the model makes over the whole dataset. The precision value tells how exact the model's positive predictions are. On our data set, a good precision value has been obtained.

    Precision = TP / (TP + FP)

Here TP, FP, and FN denote the numbers of true positives, false positives, and false negatives.

4) Recall:
Recall is the ratio, expressed as a percentage, of true positive instances to the actual total of positive instances. Here the denominator (TP + FN) is the total number of positive instances present in the whole dataset. Recall therefore tells how many of the right ones the model missed when it reported its right ones.

    Recall = TP / (TP + FN)

5) F1 Score:
The harmonic mean (HM) of the precision and recall values is called the F1 score. The model with the maximum F1 score is the best performer. The numerator contains the product of precision and recall, so if either precision or recall goes low, the final F1 score goes down significantly. A model therefore does well on F1 score if what it predicts as positive is actually positive (precision) and it does not miss positives by predicting them as negative (recall).

    F1 = 2 × Precision × Recall / (Precision + Recall)

VII. CONCLUSION
The prediction process starts with cleaning and processing the data and imputing missing values, continues with experimental analysis of the data set and model building, and ends with evaluation of the model and testing on the test data. The best-case accuracy obtained on the original data set is 0.811. The analysis leads to the following conclusions. Applicants with the worst credit scores fail to get loan approval, owing to a higher probability of not paying back the loan amount. Most of the time, applicants who have a high income and ask for a lower loan amount are more likely to get approved, which makes sense because they are more likely to pay back their loans. Some other characteristics, such as gender and marital status, seem not to be taken into consideration by the company.

REFERENCES
[1] Toby Segaran, "Programming Collective Intelligence: Building Smart Web 2.0 Applications," O'Reilly Media.
[2] Drew Conway and John Myles White, "Machine Learning for Hackers: Case Studies and Algorithms to Get You Started," O'Reilly Media.
[3] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction," Springer, Kindle.
[4] PhilHyo Jin Do, Ho-Jin Choi, "Sentiment analysis of real-life situations using location, people and time as contextual features," International Conference on Big Data and Smart Computing (BIGCOMP), pp. 39–42, IEEE, 2015.
[5] Bing Liu, "Sentiment Analysis and Opinion Mining," Morgan & Claypool Publishers, May 2012.
[6] Bing Liu, "Sentiment Analysis: Mining Opinions, Sentiments, and Emotions," Cambridge University Press, ISBN: 978-1-107-01789-4.
[7] Shiyang Liao, Junbo Wang, Ruiyun Yu, Koichi Sato, and Zixue Cheng, "CNN for situations understanding based on sentiment analysis of twitter data," Procedia Computer Science, 111:376–381, 2017.
[8] K I Rahmani, M.A. Ansari, Amit Kumar Goel, "An Efficient Indexing Algorithm for CBIR," IEEE International Conference on Computational Intelligence & Communication Technology, 13-14 Feb 2015.
[9] Gurlove Singh, Amit Kumar Goel, "Face Detection and Recognition System using Digital Image Processing," 2nd International Conference on Innovative Mechanism for Industry Application (ICMIA 2020), 5-7 March 2020, IEEE Publisher.
[10] Amit Kumar Goel, Kalpana Batra, Poonam Phogat, "Manage big data using optical networks," Journal of Statistics and Management Systems, Volume 23, 2020, Issue 2, Taylor & Francis.
[11] Raj, J. S., & Ananthi, J. V., "Recurrent neural networks and nonlinear prediction in support vector machine," Journal of Soft Computing Paradigm (JSCP), 1(01), 33-40, 2019.
[12] Aakanksha Saha, Tamara Denning, Vivek Srikumar, Sneha Kumar Kasera, "Secrets in Source Code: Reducing False Positives using Machine Learning," 2020 International Conference on Communication Systems & Networks (COMSNETS), 2020.
[13] X. Frencis Jensy, V.P. Sumathi, Janani Shiva Shri, "An exploratory Data Analysis for Loan Prediction based on nature of clients," International Journal of Recent Technology and Engineering (IJRTE), Volume-7 Issue-4S, November 2018.
[14] Pidikiti Supriya, Myneedi Pavani, Nagarapu Saisushma, Namburi Vimala Kumari, K Vikash, "Loan Prediction by using Machine Learning Models," International Journal of Engineering and Techniques, Volume 5 Issue 2, Mar-Apr 2019.
[15] Nikhil Madane, Siddharth Nanda, "Loan Prediction using Decision tree," Journal of the Gujrat Research History, Volume 21 Issue 14s, December 2019.