Loan Default Prediction Using Machine Learning
Abstract - The project titled "Loan Default Prediction Using Machine Learning" has been developed with the aim of enhancing the evaluation of credit risk in financial institutions. Traditional models for credit scoring often struggle to capture intricate financial behaviours, which motivates the use of more advanced machine learning techniques. Using a comprehensive dataset that combines borrower information with historical financial data, this project employs algorithms such as Random Forest and Gradient Boosting. The dataset is prepared through preprocessing and feature engineering, and the performance of the models is evaluated using metrics such as accuracy and precision. The selected models are further refined through hyperparameter tuning to improve their predictive capabilities. The success of this project lies in providing financial institutions with a reliable tool for predicting the risks associated with loan defaults. Once validated, the final model is intended to integrate into existing loan processing systems, helping lenders make better-informed decisions. By contributing to the advancement of credit risk assessment methodologies through machine learning, this project aims to change the way financial institutions manage their loan portfolios, offering improved accuracy and efficiency in the identification of potential loan defaults.

Keywords – Loan Defaults, financial data, credit scores, machine learning, credit risk.

I. INTRODUCTION
In the ever-changing realm of financial institutions, accurately predicting the risks associated with loan defaults plays a crucial role in ensuring the stability and longevity of lending practices. Conventional models for evaluating credit scores often struggle with the intricacies of borrowers' financial behaviours, necessitating a shift towards more advanced and adaptable approaches.

This project is a response to this challenge and aims to harness the capabilities of state-of-the-art machine learning techniques. It centres on leveraging a diverse range of characteristics, including borrower information, historical financial data, and credit history, to construct a robust predictive model. By incorporating algorithms like Random Forest, Gradient Boosting, and Neural Networks, we aspire to overcome the limitations of traditional models and enhance the accuracy of credit risk assessment. The significance of this endeavour lies in its potential to improve credit risk management within financial institutions.

Through meticulous data preprocessing, feature engineering, and model optimization, our project endeavours to provide lenders with a dependable and efficient tool for identifying and managing loan default risks.

The subsequent sections cover the methodology employed, the evaluation metrics considered, and the broader implications of integrating machine learning into the crucial domain of credit risk assessment, which could bring positive changes to the current system of assessing loan credibility. The result would not only be a crucial financial tool for lenders but would also make the system more transparent to the people it serves.

II. OBJECTIVES
A. Make a Predictive Model for Loan Defaults:
Create a machine learning model that identifies likely loan defaults efficiently and with high accuracy.
B. Providing a Complete Analysis of the Outcome:
Explain how the result was calculated and which factors were primarily involved in the final outcome.
C. Providing a Report to the Borrower:
If the result is negative, provide the borrower with a detailed report on the possible reasons the loan is predicted to default, together with possible ways of improving the chances of acquiring the loan.
D. Advise on the Amount of Loan That Could Be Acquired (if result is negative):
If the outcome is negative, give an estimate of the amount the borrower is eligible for and could acquire without defaulting.
E. Reducing the Time Constraint:
Minimize the time taken to approve a loan by evaluating the loan applicant's profile and eligibility with high accuracy using a predictive machine learning algorithm.
F. Transparency of the Work:
Transparency of the activities performed in financial institutes is a major problem for both the institutes and the loan applicants; providing a proper report would solve this problem to some extent.

III. LITERATURE REVIEW
The scholarly discourse on loan default prediction using machine learning underscores a noteworthy shift towards refining methodologies for assessing credit risk. Scholars emphasize the integration of diverse datasets, including unconventional attributes such as social and online behaviour, to augment predictive models, recognizing the dynamic nature of borrowers' financial activities.

The selection of algorithms is a central focus, with studies highlighting the efficacy of ensemble methodologies such as Random Forest and Gradient Boosting.
Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on July 09,2024 at 14:37:16 UTC from IEEE Xplore. Restrictions apply.
IV. PROPOSED METHOD
The methodologies employed for the prediction of loan defaults through the use of machine learning encompass a comprehensive and systematic process aimed at developing a precise and robust predictive model.

Fig. 1. The initial steps involving data collection, exploratory data analysis and feature engineering.

exploring different combinations of hyperparameters to find the optimal configuration. Additionally, cross-validation is employed to assess the model's performance across multiple subsets of the dataset.

E. Model Training:
Once the optimal hyperparameters have been determined, the selected model undergoes iterative training on the historical dataset. During training, the model's parameters are adjusted to minimize the difference between predicted and actual outcomes, thereby improving its predictive accuracy.

F. Evaluation Metrics:
Accuracy is used to measure the overall correctness of the model's predictions. Additionally, precision, recall, and the F1 score provide insights into the model's ability to correctly identify positive cases. Furthermore, the area under the ROC curve (AUC-ROC) is utilized to assess the model's ability to discriminate between default and non-default cases.
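The metrics named under Evaluation Metrics can be computed directly from a model's predictions, as in the minimal sketch below. It is an illustration and not the paper's code: the labels, scores and the 0.5 decision threshold are made-up values, and label 1 is assumed to mean "default". AUC-ROC is computed here as the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher score, with ties counted as half, which equals the area under the ROC curve.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1, treating label 1 (default) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # repayments recognised
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed defaults
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random defaulter outscores a random non-defaulter."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and predicted default probabilities.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

results = binary_metrics(y_true, y_pred)
results["roc_auc"] = roc_auc(y_true, scores)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

On this toy data one default is missed and one repayment is flagged, so accuracy, precision, recall and F1 all come out at 0.75, while the AUC is 0.9375 because 15 of the 16 defaulter/non-defaulter pairs are ranked correctly. Accuracy alone can look good on imbalanced loan data, which is why the threshold-free AUC is reported alongside it.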
the median value. Additionally, categorical variables like 'term', 'verification_status', and 'purpose' undergo numerical encoding through Label Encoding, while numerical values are extracted from pertinent columns such as 'term', 'total_acc', and 'duration_of_loan'.

Following data preprocessing, the dataset is split into training and testing sets, with 80% of the data reserved for training the model and 20% for evaluating its performance.

The Random Forest Classifier, configured with 100 estimators and a fixed random state of 42, is then trained on the pre-processed data. Model evaluation involves making predictions on the test set and analysing performance metrics such as accuracy, a confusion matrix, and a classification report. To gain deeper insights into the model's behaviour, visualizations are incorporated.

Fig. 4. This Graph represents the actual, predicted and the F1-Score of the model

Fig. 5. This Graph Represents the Important Features That Were Selected During the Random Forest Model and Filtered According to Their Importance

A heatmap of the confusion matrix vividly illustrates the model's ability to correctly classify instances of loan repayment, offering valuable information on true positives, true negatives, false positives, and false negatives.

Additionally, a bar chart portraying feature importance aids in understanding which variables significantly influence the model's predictions. In conclusion, this project successfully explores the application of a Random Forest Classifier for loan repayment prediction, providing valuable insights into the model's performance and the relative importance of various features. The success of the model hinges on meticulous data preprocessing and thoughtful consideration of the dataset's unique characteristics.

Future iterations of the project may involve further fine-tuning of hyperparameters, exploration of alternative models for comparison, and more in-depth feature engineering to enhance predictive capabilities.

TABLE II. A BRIEF RESULT EVALUATION WITH OTHER RESEARCHES THAT HAVE BEEN PERFORMED ON THIS TOPIC

S. No.   Parameters           Comparison with Other Researches
1        Data Processing      Handling missing values and outliers for cleaner data.
2        Model Selection      Use of Random Forest Algorithm rather than a traditional model like logistic regression.
3        Evaluation Metrics   Use of confusion matrix and classification report.
4        Feature Importance   Extensive feature engineering and evaluation for better understanding of the prominent features that contribute.
5        Explainability       Proper explanation of the outcome for better understandability to both the financial institute and the loan applicants.

VI. CONCLUSION
The Random Forest Classifier implemented for the purpose of predicting loan repayment demonstrates promising performance and offers a strong framework for evaluating the likelihood of borrowers repaying their loans. The model utilizes a combination of important features, such as loan terms, interest rates, annual income, and verification status, to generate accurate predictions. By carefully preprocessing the data, managing missing values, and encoding categorical variables, the model proves its ability to effectively learn from the given dataset.

The evaluation metrics, which include accuracy, a confusion matrix, and a classification report, collectively provide evidence of the model's informed predictions. The visualizations, notably the confusion matrix heatmap and the feature importance bar chart, contribute to a comprehensive understanding of the model's behaviour and highlight key factors that influence its decision-making.

As a tool for financial institutions and lenders, this model has the potential to improve the loan approval decision-making process by identifying applicants who pose a high risk. The success of this project emphasizes the importance of careful feature engineering and the interpretability of the model, which opens up possibilities for further refinements and potential applications in real-world lending scenarios. Future iterations could explore additional models, hyperparameter tuning, and expanded feature engineering to continuously enhance the accuracy and robustness of the predictions.

VII. FUTURE SCOPE
The expansive future potential of the loan repayment prediction model lies in its ability to enhance predictive capabilities and contribute to the evolving financial decision-making landscape. By further refining the model through advanced feature engineering, incorporating alternative machine learning algorithms, and hyperparameter tuning, its accuracy and versatility can be elevated. To adapt to changing economic conditions and borrower behaviours,
integration of real-time data streams and continuous model training would enable adaptive responses. Furthermore, exploring interpretability techniques and model explainability tools can foster greater trust and transparency in the decision-making process for stakeholders. The model's applicability extends beyond the binary prediction framework, with potential adaptations for risk stratification, dynamic interest rate determination, and personalized financial counselling. By embracing innovations in fintech and artificial intelligence, this model has the potential to play a pivotal role in fostering more informed and equitable lending practices in the ever-evolving financial landscape.

REFERENCES
[1] Loan Default Prediction Using Machine Learning Techniques. Indian Scientific Journal of Research in Engineering and Management, (2023). doi: 10.55041/ijsrem24519
[2] Hongyun Jin. Loan Risk Prediction based on Random Forest Model. (2023). doi: 10.21203/rs.3.rs-3094217/v1
[3] Muhamad Abdul Aziz Muhamad Saleh Jumaa, Mohammed Saqib. Improving Credit Risk Assessment through Deep Learning-based Consumer Loan Default Prediction Model. International Journal of Finance & Banking Studies, (2023). doi: 10.20525/ijfbs.v12i1.2579
[4] Yash Diwate, Prashant Singh Rana, Pratik A. Chavan. Loan approval prediction using machine learning. International Research Journal of Modernization in Engineering Technology and Science, (2023). doi: 10.56726/irjmets39658
[5] Xiang Zhang. Loan Risk Prediction Model based on Random Forest. Advances in Economics, Management and Political Sciences, (2023). doi: 10.54254/2754-1169/5/20220082
[6] José Antonio Núñez Mora, Pilar Madrazo-Lemarroy. Loan Default Prediction: A Complete Revision of LendingClub. Revista mexicana de economía y finanzas, (2023). doi: 10.21919/remef.v18i3.886
[7] Simiao Wang. Loan Prediction Using Machine Learning Methods. Advances in Economics, Management and Political Sciences, (2023). doi: 10.54254/2754-1169/5/20220081
[8] Shasha Liu, Ming Shan Guan, Yang Li, Menglu Wang, HuiMin Zhu. A Bayesian deep learning method based on loan default rate detection. (2023). doi: 10.1117/12.2678879
[9] Ebenezer Owusu, Richard Quainoo, Solomon Kuuku Mensah, Justice Kwame Appati. A Deep Learning Approach for Loan Default Prediction Using Imbalanced Dataset. International Journal of Intelligent Information Technologies, (2023). doi: 10.4018/ijiit.318672
[10] Jovanne C. Alejandrino, Jovito Jr. P. Bolacoy, John Vianne Murcia. Supervised and unsupervised data mining approaches in loan default prediction. International Journal of Electrical and Computer Engineering, (2023). doi: 10.11591/ijece.v13i2.pp1837-1847
[11] Zixuan Zhang. Credit Card Default Prediction based on Machine Learning Techniques. BCP business & management, (2023). doi: 10.54691/bcpbm.v44i.4954
[12] Jiaqi Fan. Predicting of Credit Default by SVM and Decision Tree Model Based on Credit Card Data. BCP business & management, (2023). doi: 10.54691/bcpbm.v38i.3666
[13] Loan Default Prediction Based on Machine Learning Methods. (2023). doi: 10.4108/eai.2-12-2022.2328740