ML Report1
VARUN SURYA - AP22110010900 (feature engineering and model training and testing)
KARTHIK - AP22110010909 (feature engineering and model training and testing)
N. TARUN KUMAR - AP22110010903 (pre-processing)
P. SIDDHART - AP22110010978 (pre-processing)
Acknowledgment
Abstract
Project Background
Model Architecture
Experimentation Details
• Dataset Used
• Model Comparisons
References
Acknowledgment
I would like to express my heartfelt gratitude to all those who have contributed to the successful
completion of this project.
First and foremost, I extend my sincere thanks to my project guide, Krishna Siva Prasad Mudigonda,
for their invaluable guidance, continuous support, and encouragement throughout the course of this
work. Their expertise and insightful feedback have been instrumental in shaping the direction and
outcome of this project.
I am also deeply grateful to SRM University – AP, particularly the Department of Computer Science
and Engineering, School of Engineering and Sciences, for providing the necessary resources, facilities,
and a conducive environment for research and development.
A special note of thanks to my peers and friends for their constant support, valuable suggestions, and
motivation during every stage of this journey.
Lastly, I owe my deepest gratitude to my family for their unwavering love, patience, and
encouragement, without which this project would not have been possible.
This accomplishment is the result of the collective support and contributions of all the above
individuals, to whom I remain truly thankful.
Abstract
In today’s digital era, automating the loan approval process is crucial for improving efficiency,
minimizing manual errors, and ensuring unbiased decision-making. This project, titled "Loan Approval
and Prediction using Machine Learning," aims to develop a predictive model that can assess the
eligibility of loan applicants based on their financial and personal information.
The system utilizes historical loan application data and applies various supervised machine learning
algorithms, including Logistic Regression, K-Nearest Neighbours (KNN), and Random Forest
Classifier. Each model was trained and evaluated using standard performance metrics such as accuracy,
precision, recall, and F1-score to determine its effectiveness.
Among the models tested, the Random Forest Classifier outperformed the others, achieving an
accuracy of 83%, demonstrating its robustness in handling classification tasks with both numerical
and categorical data. Logistic Regression and KNN also provided competitive results, with accuracies
of 80% and 75%, respectively.
The project showcases how machine learning can be effectively used in the financial sector to
automate and enhance the loan approval process. It also provides insights into the key factors
influencing loan decisions, making the system not only predictive but also interpretable and practical
for real-world deployment.
Project Background
In the banking and financial sector, loan approval is a critical process that involves evaluating a
borrower's eligibility based on various financial, personal, and credit-related parameters. Traditionally,
this process has relied heavily on manual evaluation by loan officers, which is time-consuming, prone
to human error, and often lacks consistency. With the rise in digital transformation and the availability
of large datasets, there is a growing need to automate and optimize this decision-making process using
data-driven approaches.
Machine learning (ML) offers powerful tools for predictive modelling and classification tasks, making
it highly suitable for loan approval systems. By training ML models on historical data, it is possible to
uncover patterns and relationships that can help in accurately predicting whether a loan application
should be approved or not. This reduces the time taken for processing applications, improves accuracy,
and ensures fair and objective decision-making.
This project focuses on predicting whether a loan application will be approved or not using machine
learning. The goal is to build a smart system that can help banks or financial institutions make faster
and more accurate decisions based on the details provided by the applicants.
We use a dataset that contains information about loan applicants, such as their gender, marital status,
income, loan amount, credit history, and more. By analysing this data, we train machine learning
models to recognize patterns that usually lead to loan approval or rejection.
To achieve this, we applied different models like Random Forest, Logistic Regression, and K-Nearest
Neighbours, and compared their performance. We also used techniques like data cleaning, feature
engineering, and scaling to prepare the data properly for training.
This system can help reduce the workload for loan officers and make the process more efficient,
accurate, and fair for everyone involved. With proper training and testing, the model can act as a
supportive tool in the decision-making process.
In addition, a small user input system was developed to allow applicants to check their eligibility
instantly by entering a few details. This interactive feature adds a practical aspect to the project and
demonstrates how machine learning can be integrated into real-world applications.
PROBLEM STATEMENT
In today’s fast-paced digital world, financial institutions are increasingly relying on data-driven
strategies to optimize their decision-making processes. One of the most critical and frequent decisions
banks face is whether to approve or reject a loan application. Traditionally, this process has been
manual, time-consuming, and often influenced by human bias, leading to inconsistent and suboptimal
outcomes.
This project aims to automate and enhance the loan approval process using supervised machine
learning algorithms. By analysing historical loan data, the system learns complex patterns and
relationships among multiple features such as applicant income, credit history, loan amount,
employment status, and more. The goal is to build a predictive model that can classify whether a new
loan application is likely to be approved or not.
To address this challenge, the project explores multiple machine learning models—including Random
Forest, Logistic Regression, and K-Nearest Neighbours (KNN)—and compares their performance
using key evaluation metrics such as accuracy, precision, recall, and F1 score. Feature engineering
techniques such as deriving Total Income, Loan-to-Income Ratio, and Log-Transformed Loan Amount
are implemented to boost model accuracy.
The final system not only predicts loan approval with high reliability but also provides an interactive
interface where users can input new applicant data and receive instant predictions. This contributes to a
more efficient, transparent, and scalable loan evaluation process, ultimately supporting better financial
decision-making.
To solve the problem of predicting loan approval, we propose the use of supervised machine learning
techniques. These models learn from historical loan data to find patterns and relationships between
various applicant features (such as income, employment status, loan amount, etc.) and the final loan
approval decision.
The dataset is first cleaned and pre-processed to handle missing values and convert categorical data
into numerical form. Feature engineering is then applied to create new meaningful variables like total
income, loan-to-income ratio, and more — which help the model understand the applicant's financial
situation better.
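As a rough illustration of the derived features named in this report (total income, loan-to-income ratio, and log-transformed loan amount), the sketch below computes them for a single applicant record. The exact formulas and the column name CoapplicantIncome are assumptions made for illustration; the report does not state them.

```python
import math


def add_engineered_features(applicant):
    """Return a copy of the applicant record with three derived features.

    Assumes LoanAmount is positive (for example, after missing values are filled).
    """
    total_income = applicant["ApplicantIncome"] + applicant["CoapplicantIncome"]
    enriched = dict(applicant)
    enriched["TotalIncome"] = total_income
    # Guard against division by zero for applicants with no recorded income.
    enriched["LoanToIncome"] = (
        applicant["LoanAmount"] / total_income if total_income > 0 else 0.0
    )
    # The log transform compresses the long right tail of loan amounts.
    enriched["LoanAmountLog"] = math.log(applicant["LoanAmount"])
    return enriched


example = add_engineered_features(
    {"ApplicantIncome": 4000, "CoapplicantIncome": 1000, "LoanAmount": 100}
)
print(example)
```

A ratio feature of this kind lets the model compare applicants with very different incomes on the same scale.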
We experimented with multiple ML algorithms, including:
Random Forest Classifier: Known for its high accuracy and ability to handle complex data.
Logistic Regression: A simple yet effective model for binary classification problems like loan
approval.
K-Nearest Neighbours (KNN): A distance-based algorithm that makes decisions based on
similar past applications.
Each model is trained and evaluated using metrics like accuracy, precision, recall, and F1 score. Based
on performance comparison, we identify the best-suited model for real-world use.
This approach not only automates the loan screening process but also ensures consistency and fairness
in decision-making by removing human bias. It can significantly speed up loan processing while
improving reliability and trust in the system.
Model Architecture
1) Data Collection:
We begin by importing a loan dataset from Kaggle that contains information about applicants, such as
income, credit history, education, and loan amount.
2) Data Preprocessing:
This includes handling missing values, encoding categorical variables, and normalizing the data. It
ensures the dataset is clean and suitable for training.
The dataset has 7 categorical columns. loan_ID is a unique identifier and has no correlation with any
other feature, so it can be removed.
3) Feature Engineering
Feature engineering is the process of creating new input features or modifying existing ones to help the
machine learning model perform better. In this project, we applied a few basic but important techniques:
1. Handling Categorical Data:
Some columns, such as Gender, Married, Education, and Property_Area, had text values. Since most
ML models work with numbers, we converted these categories into numerical values using label
encoding or one-hot encoding wherever needed.
2. Filling Missing Values:
The dataset had missing values in columns such as LoanAmount, Loan_Amount_Term, and
Self_Employed. We filled these using either the mean or the mode (the most frequent value) of
that column to avoid errors during training.
3. Scaling/Normalization:
For numerical features such as ApplicantIncome and LoanAmount, we applied scaling to bring all
values to a similar range. This is especially helpful for algorithms that are sensitive to feature
magnitudes, such as KNN.
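The three techniques above can be sketched with the Python standard library alone. This is a minimal illustration on made-up values, not the code used in the project. The helper names are hypothetical, and standardization (zero mean, unit variance) is only one possible scaling choice, since the report does not say which scaler was applied.

```python
from statistics import mean, mode, pstdev


def fill_missing(values, strategy):
    """Replace None entries with the column mean or mode."""
    present = [v for v in values if v is not None]
    fill = mean(present) if strategy == "mean" else mode(present)
    return [fill if v is None else v for v in values]


def label_encode(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def standardize(values):
    """Rescale to zero mean and unit variance (fails on a constant column)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up sample columns; None marks a missing entry.
loan_amount = fill_missing([100, None, 200, 150], "mean")        # numeric column
self_employed = fill_missing(["No", None, "Yes", "No"], "mode")  # categorical column
encoded, mapping = label_encode(self_employed)
scaled = standardize(loan_amount)
print(loan_amount, self_employed, encoded, mapping, scaled)
```

In practice, the fill values and scaling statistics are usually computed on the training split only and then reused on the test split, so that no information leaks from the test data.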
4) Splitting the Data
1) Once the dataset has been cleaned and transformed, the next crucial step is to split the data into
training and testing subsets. This process ensures that we can evaluate how well our machine learning
model generalizes to unseen data.
The training set is used to train the model, allowing it to learn patterns and relationships from
the input features.
The testing set is used to evaluate the model's performance on new, unseen data and helps
detect overfitting.
2) In this project, the dataset is split in a 60:40 ratio, where:
60% of the data is used for training (X_train, Y_train)
40% is used for testing (X_test, Y_test)
3) We used the train_test_split() function from Scikit-learn, with a fixed random_state for
reproducibility of results.
4) This split allows us to effectively validate the performance of multiple machine learning algorithms
like Random Forest, Logistic Regression, and K-Nearest Neighbours under consistent conditions.
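A minimal sketch of the 60:40 split with train_test_split() is shown below. The toy data and the value random_state=42 are assumptions; the report only states that a fixed random_state was used.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared feature matrix and the Loan_Status labels.
X = [[i, i * 2] for i in range(10)]
Y = [0, 1, 0, 1, 1, 1, 0, 1, 1, 1]

# test_size=0.4 keeps 60% of the rows for training and 40% for testing.
# random_state=42 is an arbitrary fixed seed chosen for this example.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.4, random_state=42
)

print(len(X_train), len(X_test))
```

Because the seed is fixed, repeating the call returns the same partition, which is what keeps the comparison between models consistent.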
5) Model Training
After data preprocessing and splitting, the core part of the machine learning workflow begins:
training the model. In this project, we explored and trained multiple classification algorithms to
predict loan approval status based on applicant details.
The following models were trained using the training dataset:
Random Forest Classifier – An ensemble method that builds multiple decision trees and
merges their results for better accuracy and control over overfitting.
Logistic Regression – A statistical model used for binary classification problems. It estimates
the probability of a loan being approved.
K-Nearest Neighbors (KNN) – A simple and effective model that classifies data based on the
majority vote of its neighbors.
Each model was trained using the fit() method on the training set (X_train, Y_train). The training
process involves the model learning patterns and correlations between input features and the target
label (Loan_Status).
These trained models were then evaluated on the test set to measure their generalization performance
using metrics like Accuracy, Precision, Recall, and F1-Score.
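The training step can be sketched as follows. Synthetic data from make_classification stands in for the prepared loan features, so the predictions here say nothing about the real dataset. The Random Forest settings (max_depth=3, n_estimators=90) are the values reported in this document; max_iter=1000, n_neighbors=5, and the random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for the prepared loan features.
X, Y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.4, random_state=0
)

models = {
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=90, max_depth=3, random_state=0
    ),
    "LogisticRegression": LogisticRegression(max_iter=1000),  # assumed setting
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),  # assumed k
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, Y_train)                # learn from the training split
    predictions[name] = model.predict(X_test)  # predict on the held-out split

print({name: preds[:5] for name, preds in predictions.items()})
```

Training every model on the same split is what makes the later metric comparison fair.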
6) Model Evaluation
Model evaluation is a crucial step to determine how well our trained machine learning models perform
on unseen data. After training the models, we evaluated them using the test dataset to ensure they can
make accurate predictions for new loan applications.
We used the following performance metrics:
Accuracy: Measures the overall correctness of the model by calculating the ratio of correctly
predicted instances to the total number of predictions.
Precision: Indicates how many of the loans predicted as approved were actually approved. It's
important when minimizing false approvals is critical.
Recall: Shows how many actual approved loans were correctly identified by the model. It's
important when it's risky to miss true approvals.
F1 Score: The harmonic mean of precision and recall, which balances the two in a single value.
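To make the definitions concrete, the sketch below computes accuracy, precision, recall, and the F1 score (the fourth metric used throughout this report) from true and predicted labels, with 1 meaning approved. The labels are made up for illustration and the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = approved)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # approvals correctly predicted
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # rejections correctly predicted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false approvals
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed approvals

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 4 true positives, 2 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```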
7) Prediction
After training and evaluating multiple machine learning models, the final step was to use them for
predicting loan approval based on user input.
We built a simple Python interface where users enter details like gender, marital status, dependents,
education, employment, income, loan amount and term, credit history, and property area. These inputs
are processed and passed to the chosen model for prediction.
Prediction = 1 → Loan Approved
Prediction = 0 → Loan Not Approved
We evaluated models like Random Forest, Logistic Regression, and KNN using the F1 Score, which
balances precision and recall. After comparison, Random Forest showed the highest accuracy and was
selected for making predictions.
This system can be integrated into a larger application to assist banks in automating loan screening.
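The prediction step can be sketched as below. This is not the project's interface code: StubModel is a hypothetical placeholder with a scikit-learn style predict() method, its rule is invented for the example, and the column names Dependents, CoapplicantIncome, and Credit_History are assumed from the fields described in the text. The inputs are assumed to be already encoded as numbers.

```python
# Assumed order in which features are passed to the model.
FEATURE_ORDER = [
    "Gender", "Married", "Dependents", "Education", "Self_Employed",
    "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
    "Loan_Amount_Term", "Credit_History", "Property_Area",
]


class StubModel:
    """Placeholder for the trained classifier; NOT the model from this report."""

    def predict(self, rows):
        # Invented rule for demonstration: approve when credit history is 1.
        credit_idx = FEATURE_ORDER.index("Credit_History")
        return [1 if row[credit_idx] == 1 else 0 for row in rows]


def predict_loan(model, applicant):
    """Turn one encoded applicant record into a readable decision."""
    row = [applicant[name] for name in FEATURE_ORDER]
    prediction = model.predict([row])[0]
    return "Loan Approved" if prediction == 1 else "Loan Not Approved"


applicant = {
    "Gender": 1, "Married": 1, "Dependents": 0, "Education": 1,
    "Self_Employed": 0, "ApplicantIncome": 5000, "CoapplicantIncome": 1500,
    "LoanAmount": 120, "Loan_Amount_Term": 360, "Credit_History": 1,
    "Property_Area": 2,
}
print(predict_loan(StubModel(), applicant))
```

Swapping StubModel for a fitted classifier leaves predict_loan unchanged, provided the inputs go through the same encoding and scaling used during training.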
Machine Learning Models Used
In this project, we implemented and evaluated three different machine learning models to predict loan
approval. These models were selected based on their effectiveness in binary classification tasks and
their ability to handle different types of data.
🔹 Logistic Regression
A linear model used for binary classification.
It estimates the probability of loan approval by applying a logistic function to the weighted sum
of input features.
It is simple, interpretable, and provides good baseline performance.
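As a small worked example of the logistic function applied to a weighted sum, the sketch below uses made-up weights and feature values; they are not the coefficients learned in this project.

```python
import math


def approval_probability(features, weights, bias):
    """Apply the logistic (sigmoid) function to the weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Made-up numbers: z = -1.5 + 2.0 * 1.0 + (-1.0) * 0.5 = 0.0, so p = 0.5.
p = approval_probability([1.0, 0.5], weights=[2.0, -1.0], bias=-1.5)
print(p)
```

A probability above a chosen threshold (commonly 0.5) is then read as approval.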
🔹 Random Forest Classifier
An ensemble learning technique that builds multiple decision trees and combines their outputs
for more accurate and stable predictions.
It reduces overfitting and handles both numerical and categorical data efficiently.
Suitable for capturing complex feature interactions.
We tried different combinations of maximum tree depth and number of estimators, and found that a
depth of 3 with 90 estimators gave the best accuracy.
🔹 K-Nearest Neighbors (KNN)
A non-parametric, instance-based learning algorithm.
It classifies a data point based on the majority vote of its ‘k’ nearest neighbors in the training set.
Easy to implement and effective in scenarios where decision boundaries are non-linear.
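The majority-vote idea can be sketched in a few lines. The points, labels, and k=3 are made up for illustration, and Euclidean distance is an assumption, since the report does not state the distance metric or the value of k that was used.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two made-up clusters: label 1 near (1, 1), label 0 near (8, 8).
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
train_y = [1, 1, 1, 0, 0]

print(knn_predict(train_X, train_y, (1.5, 1.5)))
print(knn_predict(train_X, train_y, (8.5, 8.0)))
```

Because the vote depends on raw distances, features with large magnitudes dominate unless the data is scaled first.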
Each model was trained and evaluated using the same dataset to ensure a fair comparison. Their
performance was assessed using accuracy, precision, recall, and F1 score.
Experimentation Details
In order to build an efficient Loan Approval Prediction system, the dataset was thoroughly prepared
and evaluated through multiple stages of experimentation. Below are the key elements of the
experimentation process:
Dataset Used
The dataset, loan.csv, was sourced from Kaggle for this project.
It contains various features such as applicant income, co-applicant income, credit history, loan
amount, and more.
The target variable is Loan_Status, indicating whether a loan was approved (1) or not approved
(0).
Model Comparisons
In this section, we compare the performance of three machine learning models: Random Forest
Classifier (RFC), K-Nearest Neighbors (KNN), and Logistic Regression (LR). The goal is to evaluate
and compare their predictive performance using key evaluation metrics.
Key Observations:
Accuracy:
o Random Forest achieved the highest accuracy (82.92%), indicating the most correct
overall predictions.
o Logistic Regression followed closely with 80.83%.
o KNN performed the lowest with 63.75%.
Precision:
o Random Forest had the best precision (83.87%), meaning fewer false positives.
o Logistic Regression was slightly lower at 80.71%.
o KNN had the lowest precision at 69.42%.
Recall:
o Logistic Regression led in recall (98.21%), identifying most actual positive cases.
o Random Forest also had strong recall (93.41%).
o KNN had lower recall (85.63%).
F1 Score:
o Random Forest achieved the highest F1 score (88.39%), showing a strong balance
between precision and recall.
o Logistic Regression followed with 87.36%.
o KNN had the lowest F1 score (76.68%).
Model Summary:
Random Forest Classifier showed the best overall performance, with high accuracy, precision, recall,
and F1 score, making it the most reliable choice for loan approval prediction.
Logistic Regression also performed well, especially in recall, but was slightly behind RFC in other
metrics.
K-Nearest Neighbors had the weakest performance among the three, with significantly lower accuracy
and precision, making it less suitable for deployment in this case.
Observation: After feature engineering, the Random Forest's accuracy decreased slightly, but its recall
improved, indicating that the model now identifies more positive samples. Its precision is still slightly
lower, but the overall F1 score has improved, showing a better balance between precision and recall.
2. LogisticRegression:
o Accuracy: 82.50%
o Precision: 80.77%
o Recall: 98.82%
o F1 Score: 88.89%
Observation: This model has a very high recall, almost perfect at 98.82%, with a solid F1 score of
88.89%. The accuracy has slightly improved compared to the RandomForest, and it strikes a good
balance between precision and recall.
3. KNeighborsClassifier:
o Accuracy: 75.00%
o Precision: 78.95%
o Recall: 88.24%
o F1 Score: 83.33%
Observation: The KNN model's performance has improved slightly after feature engineering, but it
still lags behind RandomForest and LogisticRegression. The recall is decent, but it could benefit from
further improvements in precision.
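As a quick arithmetic check, the F1 scores listed above for LogisticRegression and KNeighborsClassifier can be recomputed from the reported precision and recall using F1 = 2PR / (P + R). The sketch uses only the figures given in this section. Small differences in the last digit are expected because the inputs are already rounded.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in this section.
reported = {
    "LogisticRegression": (0.8077, 0.9882, 0.8889),
    "KNeighborsClassifier": (0.7895, 0.8824, 0.8333),
}

recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(f"{name}: reported {f1:.4f}, recomputed {recomputed[name]:.4f}")
```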
Conclusion:
LogisticRegression seems to be performing the best after feature engineering with high recall
and a solid F1 score.
RandomForestClassifier is still a good option, showing a well-balanced F1 score and decent
recall.
KNeighborsClassifier lags a bit, and could potentially be improved with more refined feature
engineering or hyperparameter tuning.
Conclusion and Future Recommendations
Conclusion:
This project successfully demonstrated the application of machine learning techniques in
predicting loan approval status based on applicant details. By preprocessing the data,
engineering relevant features, and evaluating multiple classification models, we were able to
build an effective system for automated loan decision-making. Among the models tested, the
Random Forest Classifier outperformed others with the highest accuracy and F1 score, making
it the most reliable choice for this use case.
The implementation of user input prediction functionality further showcased the real-world
usability of the model, simulating how such a system could assist financial institutions in
streamlining their approval processes.
Future Recommendations:
1. Dataset Expansion: Incorporate a larger and more diverse dataset to enhance model
generalization and robustness.
2. Integration with UI: Develop a user-friendly web or mobile interface where users can input
their data and get instant predictions.
3. Real-time Data Updates: Include live data integration to make the system dynamic and
responsive to real-time changes in user profiles and financial trends.
4. Bias and Fairness Analysis: Assess the model for any potential biases and ensure fair lending
practices by addressing discrimination based on gender, income, or region.
References
1. Scikit-learn: Machine Learning in Python
https://scikit-learn.org/
2. Pandas: Python Data Analysis Library
https://pandas.pydata.org/
3. Kaggle: Loan Prediction Problem Dataset
https://www.kaggle.com/datasets/altruistdelhite04/loan-prediction-problem-dataset
4. NumPy: Fundamental Package for Scientific Computing
https://numpy.org/