final project documentation
Identifying Risky Bank Loan Using Python
A REPORT SUBMITTED
IN PARTIAL FULFILMENT FOR THE DEGREE OF
BACHELOR OF TECHNOLOGY
in
Computer Science and Engineering
by
Aritra Bhattacharjee, Pallab Midya, Ishita Das, Pritam Debnath
Pursued in the
Department of Computer Science and Engineering
Bengal Institute of Technology and Management, Santiniketan
CERTIFICATE
This is to certify that the project report entitled Identifying Risky Bank Loan Using Python
submitted by Aritra Bhattacharjee, Pallab Midya, Ishita Das, and Pritam Debnath to the
Bengal Institute of Technology and Management, Santiniketan, in partial fulfilment for the
award of the degree of B.Tech in Computer Science and Engineering is a bona fide record
of project work carried out by them under my supervision. The contents of this report, in full
or in parts, have not been submitted to any other Institution or University for the award of
any degree or diploma.
DECLARATION
I declare that this project report titled Identifying Risky Bank Loan Using Python
submitted in partial fulfilment of the degree of B. Tech in Computer Science and
Engineering is a record of original work carried out by us under the supervision of Prof.
Soumen Bhowmik and has not formed the basis for the award of any other degree or
diploma, in this or any other Institution or University. In keeping with the ethical
practice in reporting scientific information, due acknowledgements have been made
wherever the findings of others have been cited.
Aritra Bhattacharjee
University Roll No: 16300120051
University Registration No: 201630100120004
Pallab Midya
University Roll No: 16300120032
University Registration No: 201630100120023
Ishita Das
University Roll No: 16300119023
University Registration No: 039880
Pritam Debnath
University Roll No: 16300120049
University Registration No: 201630100120006
Santiniketan-731236
ACKNOWLEDGMENTS
This work would not have been possible without the constant support, guidance, and
assistance of our supervisor, Prof. Soumen Bhowmik; his patience, knowledge, and
ingenuity are something we will always aspire to.
We take this opportunity to thank our guide and the Head of the Department of Computer
Science and Engineering, Prof. Soumen Bhowmik of B.I.T.M., under whose guidance we
completed this project. Our deep appreciation goes to him.
We extend our sincere thanks to one and all of the BITM family for their help in the
completion of this project report.
We also wish to express our thanks to all the teachers and friends of the Department of
Computer Science and Engineering who helped in many ways with the completion of the
project.
Aritra Bhattacharjee
Pallab Midya
Ishita Das
Pritam Debnath
ABSTRACT
Loans are no longer considered a last resort to buy a sought-after smartphone or a dream
house. Over the last decade or so, people have become less hesitant in applying for a
loan, whether personal, vehicle, education, business, or home, especially when they do not
have a lump sum at their disposal. Besides, home and education loans provide tax
advantages that reduce tax liability and increase the cash in hand from salary income. To
offer loans with minimal paperwork, quick eligibility checks, and competitive interest
rates, banks have opened online channels to apply and submit documents for the
approval process. For the bank, credit history is indicative of future repayment behaviour,
based on the applicant's pattern in settling past loans; it helps the bank to know whether the
borrower will be punctual and regular with payments. Banks also weigh employment
history and current engagement to ensure that the source of income is reliable. In this
project we apply exploratory data analysis and a Support Vector Machine classifier in
Python to such loan application data in order to identify risky bank loans.
TABLE OF CONTENTS
CERTIFICATE i
DECLARATION ii
ACKNOWLEDGEMENTS iii
ABSTRACT iv
TABLE OF CONTENTS v
CHAPTER 1: INTRODUCTION 1
1.1 Motivation 2
CHAPTER 2: ANALYSIS 10
2.3 Purpose 15
2.4 Scope 15
3.7 Model deployment 25
CHAPTER 4: IMPLEMENTATION 26
4.1.1 Python 27
4.2.1 NumPy 28
4.2.2 Pandas 29
4.2.3 Seaborn 30
4.2.4 Sklearn 31
4.2.6 SVM 32
4.2.7 Accuracy_Score 33
SCREENSHOTS 36
CONCLUSION 43
REFERENCES 44
LIST OF FIGURES
ABBREVIATIONS/ NOTATIONS/ NOMENCLATURE
EDA Exploratory Data Analysis
ML Machine Learning
RBL Risky Bank Loan
SVM Support Vector Machine
CHAPTER 1
INTRODUCTION
1.1 Motivation:
We have discovered with time and experience that a large number of companies are looking
for insights and value that come from fundamentally descriptive activities. This means that
companies are often willing to allocate resources to acquire the necessary awareness of the
phenomenon that we analysts are going to study.
If we are able to investigate the data and ask the right questions, the EDA process becomes
extremely powerful. Combined with data visualization skills, a skilled analyst can build a
career by leveraging these skills alone, without even going into modelling. A good
approach to EDA therefore allows us to provide added value in many business contexts,
especially where our client or boss finds it difficult to interpret or access the data. This is
the basic idea that led us to take up this kind of analysis for our project.
Flow Chart
This process describes how we keep moving on to new questions until we are satisfied: this
is the process of exploratory data analysis.
We will see some of the most common and important features of Pandas and also some
techniques to manipulate the data in order to understand it thoroughly.
I usually open Excel or create a text file to put some notes down, in this fashion:
2. Type: the type or format of the variable. This can be categorical, numeric,
Boolean, and so on
3. Context: useful information to understand the semantic space of the variable. In
the case of our dataset, the context is always the chemical-physical one, so it’s
easy. In another context, for example that of real estate, a variable could belong
to a particular segment, such as the anatomy of the material or the social one
(how many neighbours are there?)
4. Expectation: how relevant is this variable with respect to our task? We can use
a scale “High, Medium, Low”.
According to the Engineering Statistics Handbook, Exploratory Data Analysis is 'a
philosophy' of how the data is to be analyzed.
And despite its apparent importance (after all, who wouldn't want more effective models?),
because of this lack of rigid structure and its elusive nature, EDA isn't used nearly as often
as it should be.
Let's methodically demystify the EDA concept, starting from the basic differences between
EDA and classical data analysis and moving through the exploratory data analysis steps and
techniques, sprinkled with a few simple Python examples you can try yourself. And hopefully,
by the end of it, you’ll agree that EDA is not as intimidating as it is often made out to be
when working with data science and machine learning projects.
Why not call EDA just plain classical data analysis? If the answer to that question isn't
obvious already, let's just put it out there: because it's not!
Since you've made it this far, thankfully, you won't be left having to take that statement
with a pinch of salt. While you must have already vaguely sensed some of the differences
between classical data analysis and EDA, the following explanation should give you a
clearer picture.
EDA is indeed a data analysis approach, however, it differs starkly from the classical
approach in the very way it seeks to find a solution to a problem, or for that matter the way it
addresses one.
In the classical approach, the model is imposed on the data and the analysis and testing
follow. With EDA, on the other hand, the collected data set is first analyzed to infer what
model would be best suited for the data by investigating its underlying structure.
EDA is a data-focused approach – both in its structure and the models it suggests. On the flip
side, classical data analysis is aimed at generating predictions from models and is generally
quantitative in nature. Even the rigidity and formality that are prevalent in classical
techniques are absent in EDA. The two methods differ even in the way they deal with
information: classical estimation techniques focus only on a few important characteristics,
resulting in a loss of information, whereas EDA techniques make almost no assumptions
and often make use of all available data.
1.2.3 Graphical Exploratory Data Analysis Techniques
Some of the commonly used graphical techniques are:
Histogram:
Histograms can be used to summarize both continuous and discrete data. They help to
visualize the data distribution. They serve especially well to indicate gaps in the data and
even outliers.
Scatter plot:
Unlike the histogram, which is univariate (i.e. involving only a single variable), a scatter
plot reveals the relationship between two variables. The relationships reveal themselves in
the form of structures in the plot, such as lines or curves, that cannot simply be explained as
randomness.
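As a minimal sketch of these two techniques, the snippet below draws a histogram and a
scatter plot with Seaborn on a small made-up table; the column names 'income' and
'loan_amount' are illustrative and not taken from the project dataset.
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Small illustrative dataset
df = pd.DataFrame({'income': [2500, 4000, 3200, 5800, 4100, 3900, 7200, 3100],
                   'loan_amount': [100, 150, 120, 220, 160, 140, 260, 110]})

# Histogram: shows the distribution of a single variable, including gaps and outliers
sns.histplot(df['income'], bins=5)
plt.show()

# Scatter plot: reveals the relationship between two variables
sns.scatterplot(x='income', y='loan_amount', data=df)
plt.show()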
Hypothesis testing is accomplished in a series of steps. In this process, a null hypothesis,
which is initially assumed to be true, is replaced by an alternative hypothesis if the testing
results in the null hypothesis being rejected. This is done by comparing a quantitative
measure called the ‘test statistic’, which shows whether sample data is in agreement with the
null hypothesis, to a critical value to decide on the rejection of the null hypothesis.
The EDA techniques we have gone over in this section are by no means an exhaustive list of
techniques that can be used for accomplishing EDA. On venturing to use EDA and exploring
on your own you are bound to discover other techniques and also find that some work for you
better than the others do.
1. Importing and cleaning the imported data: The first step in any analysis is to
import the data, then clean and transform it into the required readable format (a
short sketch of these first steps follows this list).
2. Univariate analysis: It is logical to start the analysis by considering one
variable at a time, learning each variable's distribution and summary statistics.
3. Pair exploration: The next step would be to identify relationships between
pairs of variables using simple two-dimensional graphs.
4. Multivariate analysis: Having analyzed the variables in pairs, the relationships
between larger groups can be analyzed to investigate and identify more
complex relationships.
5. Hypothesis testing and estimation: The assumptions made regarding the data
set can be tested at this stage and estimations are made regarding the
variability of variables.
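Below is a minimal sketch of the first of these steps in pandas; the file name
'loan_data.csv' and the column handling are assumptions for illustration, not the project's
actual files.
python
import pandas as pd

df = pd.read_csv('loan_data.csv')            # 1. import the data
df = df.drop_duplicates().dropna()           #    basic cleaning
print(df.describe(include='all'))            # 2. univariate summary statistics
print(df.select_dtypes('number').corr())     # 3. pairwise relationships between numeric variables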
Exploratory Data Analysis, in short, is a method for evaluating and drawing conclusions
from large amounts of data.
Exploratory Data Analysis is essential for any business. It allows data scientists to analyze
the data before coming to any assumption. It ensures that the results produced are valid and
applicable to business outcomes and goals.
The primary objective of Exploratory Data Analysis is to uncover the underlying structure
of the data. The structure of the various data sets determines the trends, patterns, and
relationships among them. A business cannot come to a final conclusion or draw
assumptions from a huge quantity of data directly; rather, it requires an exhaustive look at
the data set through an analytical lens.
Therefore, performing an Exploratory Data Analysis allows data scientists to detect errors,
debunk assumptions, and much more to ultimately select an appropriate predictive model.
CHAPTER 2
ANALYSIS
2.1 Proposed system
To build this project on risky bank loans, we use several modern algorithms that help us
understand what kinds of risk can arise during any bank loan. The technologies built around
these algorithms are the ones mainly proposed in our project.
With this type of dataset, you could explore various questions related to bank loans, such as
what factors are associated with loan default, which loan characteristics are most strongly
correlated with interest rates, or how loan amounts vary by borrower demographics.
Data processing:
Bank loan Data processing typically involves the collection, analysis, and management of
information related to loan applications and approvals. This process may include several
steps, such as verifying the borrower's creditworthiness, assessing the risk of lending money,
and determining the terms and conditions of the loan.
Here are some of the key steps involved in bank loan data processing:
1. Application submission: Borrowers submit loan applications, either in person or
online, providing details about their financial situation, employment, and credit
history.
2. Credit check: The bank checks the borrower's credit score and credit history to
determine their creditworthiness. This involves reviewing the borrower's payment
history, outstanding debts, and other financial information.
3. Income verification: The bank verifies the borrower's income, employment status,
and other financial information to assess their ability to repay the loan.
4. Loan assessment: Based on the borrower's creditworthiness and ability to repay the
loan, the bank assesses the risk of lending money and determines the terms and
conditions of the loan.
5. Loan approval or rejection: The bank either approves or rejects the loan application
based on the borrower's creditworthiness, ability to repay the loan, and other factors.
6. Loan disbursal: If the loan is approved, the bank disburses the funds to the borrower.
7. Loan management: The bank manages the loan account, including collecting
payments, tracking the borrower's payment history, and handling any issues that
arise.
Throughout this process, the bank may use various software applications to help automate
and streamline the loan processing workflow, such as loan origination systems, credit scoring
models, and loan servicing platforms. Additionally, data privacy and security are important
considerations in bank loan data processing, as banks must protect borrower information and
comply with applicable regulations.
Training Dataset:
A training dataset in the context of bank loans would typically include information about past
loan applications, including details such as the applicant's credit history, income, employment
status, loan amount requested, and whether the loan was approved or denied.
The purpose of using a training dataset in this context would be to build a machine learning
model that can predict the likelihood of a new loan application being approved or denied
based on the applicant's characteristics and loan details. The model would be trained on the
historical loan data to identify patterns and relationships between these variables and loan
approval outcomes, and would then use these patterns to make predictions on new loan
applications.
It is important to note that the accuracy and effectiveness of the model would depend heavily
on the quality and relevance of the training dataset used. The dataset should be representative
of the population being considered for loans and should be large enough to capture a diverse
range of loan scenarios and outcomes. Additionally, the dataset should be properly labelled
and cleaned to ensure that the model is not biased by inaccurate or incomplete data.
Test Dataset:
A test dataset in a risky bank loan scenario would typically contain a set of data points
representing loan applications that have been approved or denied based on certain criteria.
The dataset may include a variety of features such as the applicant's credit score, income
level, employment status, debt-to-income ratio, loan amount, loan term, and other relevant
information.
The purpose of the test dataset is to evaluate the effectiveness of a machine learning model or
algorithm in predicting the likelihood of loan default or delinquency based on the available
features. The model may be trained on a separate dataset and then tested on this new dataset
to determine its accuracy and generalizability.
It is important to ensure that the test dataset is representative of the population of loan
applications that the model is intended to be applied to. The dataset should also be carefully
curated to prevent any biases or inconsistencies that could affect the accuracy of the model's
predictions.
Model:
The process of making a decision about whether or not to approve a risky bank loan typically
involves several steps and considerations. Here is a general overview of the model-making
process:
Assessment of the borrower's creditworthiness:
The first step is to evaluate the borrower's creditworthiness by examining their credit history,
income, debt-to-income ratio, employment status, and other relevant financial information.
This assessment helps determine the likelihood of the borrower being able to repay the loan.
Analysis of the loan terms:
The bank will analyze the proposed loan terms, such as the interest rate, repayment period,
and collateral requirements, to determine whether they are appropriate for the level of risk
associated with the borrower.
Evaluation of the purpose of the loan:
The bank will consider the purpose of the loan, such as whether it is for a business investment
or a personal expense, to determine the level of risk associated with the loan.
Examination of the borrower's financial stability:
The bank will assess the stability of the borrower's financial situation, including their job
security, savings, and investment portfolio, to determine the likelihood of them being able to
repay the loan.
Review of industry and market conditions:
The bank will also consider industry and market conditions that may impact the borrower's
ability to repay the loan, such as changes in interest rates, economic conditions, and
competition.
Risk analysis and decision-making:
Based on the information gathered during the previous steps, the bank will perform a risk
analysis to determine the level of risk associated with the loan. Based on this analysis, the
bank will make a decision about whether or not to approve the loan, and if so, under what
terms and conditions.
Overall, the model-making process for risky bank loans involves a thorough evaluation of
various factors that can impact the borrower's ability to repay the loan. The goal is to balance
the level of risk associated with the loan with the potential benefits of approving it, such as
generating revenue for the bank and helping the borrower achieve their financial goals.
g) Security: The RBL system should be designed with security in mind to protect
confidential customer information. The system should ensure that user data is
protected against unauthorized access, theft, and tampering.
h) Reliability: The RBL system should be reliable and available 24/7. The system should
be able to handle unexpected errors and ensure data integrity.
i) Scalability: The RBL system should be designed to accommodate future growth. The
system should be scalable to handle an increasing number of loan applications without
affecting its performance.
j) Usability: The RBL system should be easy to use for bank staff with minimal training.
The system should have a user-friendly interface that is intuitive and easy to navigate.
Conclusion:
The Risky Bank Loan system is a software system that uses the SVM algorithm to evaluate
the risk of bank loans. This document outlined the functional and non-functional
requirements for the system. These requirements should serve as a guide for the development
team to ensure the successful implementation of the system.
2.3 Purpose
The purpose of developing a risky bank loan system using SVM (Support Vector Machine)
algorithm is to help banks and financial institutions make more informed decisions when
granting loans to their clients.
The SVM algorithm is a powerful machine learning technique that can effectively identify
patterns and relationships within large datasets, making it well-suited for analyzing complex
financial data. By using SVM, the system can accurately predict the likelihood of a borrower
defaulting on their loan, based on a range of factors such as credit history, income,
employment status, and other relevant data.
With this system in place, banks can better assess the risk associated with lending money to a
particular borrower and make more informed decisions about whether to grant or deny a loan.
This can help to minimize the risk of loan defaults and improve the overall financial health of
the institution.
Furthermore, this system can also help to reduce the time and resources required for manual
loan analysis, by automating the loan evaluation process. This can lead to faster loan
processing times, increased efficiency, and a more streamlined lending process.
Overall, the purpose of developing a risky bank loan system using SVM algorithm is to
enable banks and financial institutions to make more informed, data-driven lending decisions,
and to minimize the risk of loan defaults and financial losses.
2.4 Scope
Introduction:
Risk management is a critical component of any financial institution, especially banks. To
prevent losses and ensure profitability, banks rely on risk assessment techniques to evaluate
and mitigate risk. One such technique is the use of machine learning algorithms, such as
Support Vector Machines (SVMs), to identify high-risk loans. The aim of this section is to
describe the scope of making a project for a risky bank loan system using the SVM algorithm.
It provides a comprehensive overview of the SVM algorithm, its application in
risk assessment, and the scope of using the algorithm to identify high-risk loans.
Background:
The banking industry is characterized by high levels of risk due to the nature of its operations.
Banks lend money to individuals and businesses, and these loans have to be repaid with
interest. However, not all borrowers are able to repay their loans, which can result in losses
for the bank. To minimize these losses, banks use risk assessment techniques to evaluate the
creditworthiness of borrowers and the potential risks associated with lending to them.
Support Vector Machines (SVMs) are a type of machine learning algorithm that can be used
for classification and regression analysis. The SVM algorithm works by identifying a
hyperplane that separates the data into different classes. The algorithm then finds the
maximum margin hyperplane, which is the hyperplane that is furthest from the data points of
both classes. SVMs are widely used in risk assessment and fraud detection, and they have
been shown to be effective in identifying high-risk loans.
Scope of the Project:
The scope of the project is to develop a risky bank loan system that uses the SVM algorithm
to identify high-risk loans. The system will be designed to evaluate loan applications based
on various parameters, including the borrower's credit history, income, and debt-to-income
ratio. The system will then assign a risk score to each loan application, indicating the
likelihood of the loan defaulting.
The SVM algorithm will be used to train the system on a dataset of historical loan data. The
dataset will include information about the borrower, the loan amount, the loan term, the
interest rate, and other relevant parameters. The system will use this data to identify patterns
and relationships that can be used to predict the likelihood of a loan defaulting.
The system will be designed to be user-friendly and easy to use. Bank employees will be able
to input loan application data into the system, and the system will generate a risk score for
each loan application. The system will also provide recommendations on whether to approve
or reject a loan application based on the risk score.
Challenges and Limitations:
One of the main challenges of developing a risky bank loan system using the SVM algorithm
is the availability of high-quality data. The system will require a large dataset of historical
loan data to train the SVM algorithm effectively. However, obtaining high-quality data can be
challenging, especially for smaller banks that may not have access to large datasets.
Another challenge is the complexity of the SVM algorithm. The SVM algorithm can be
difficult to understand and implement, especially for non-technical users. The system will
need to be designed in a way that is easy to use and understand for bank employees.
The accuracy of the system is another limitation. The accuracy of the system will depend on
the quality and quantity of data used to train the SVM algorithm. If the system is not trained
on a sufficiently large and diverse dataset, it may not be accurate in identifying high-risk
loans.
CHAPTER 3
3.1 Data collection
The data set collected for prediction is split into a Training set and a Test set. Generally, a
7:3 ratio is applied to split the Training set and the Test set. The data model created using
machine learning algorithms is fitted on the Training set, and, based on the resulting test
accuracy, predictions are made on the Test set.
8. Checking for duplicate data.
9. Checking missing values of the data frame.
10. Checking unique values of the data frame.
11. Checking count values of the data frame.
12. Renaming and dropping columns of the given data frame.
13. Specifying the type of values.
14. Creating extra columns.
The steps and techniques for data cleaning will vary from dataset to dataset. The primary goal
of data cleaning is to detect and remove errors and anomalies to increase the value of data in
analytics and decision making.
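As a small sketch of the checks listed above, the snippet below applies them to a tiny
made-up data frame; the column names are assumptions for illustration only.
python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', None, 'Male'],
                   'LoanAmount': [120.0, 100.0, 150.0, 120.0]})

print(df.duplicated().sum())        # check for duplicate rows
print(df.isnull().sum())            # check missing values per column
print(df['Gender'].unique())        # unique values of a column
print(df['Gender'].value_counts())  # count values of a column
df = df.rename(columns={'LoanAmount': 'Loan_Amount'}).dropna()  # rename and drop
df['Loan_Amount'] = df['Loan_Amount'].astype(int)               # specify the type of values
df['Loan_Amount_Scaled'] = df['Loan_Amount'] / 100              # create an extra column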
Figure 3.4: Flow Chart (Data Collection → Data Pre-Processing → Data Cleaning → Model
Selection → Model Deployment)
Some research has been done using the SVM algorithm to classify potential customers of
loan companies as able to pay their debts or not. However, the evaluation of the model used
is often only the accuracy and error values. Therefore, the authors are interested in using
other model evaluation measures as well, such as the AUC value. This research analyzes
credit risk using a machine learning method, namely the SVM algorithm, to classify
prospective customers into a good credit or bad credit class. The classification process of
the dataset was done using the Python programming language. The dataset was taken
randomly from 2015 to 2018 at Bank XX, comprising 610 records. The dataset was divided
into two parts, with 80% training data and 20% testing data. The variables used were
gender, plafond, rate, term of time, job, income, face amount, warranty, and loan history as
independent variables, with credit status as the dependent variable. This research was done
in three stages: data pre-processing, model building, and model testing. The model was
built using the training data and then checked using the testing data to discover its
performance. Model evaluation was computed using the confusion matrix, giving accuracy,
sensitivity, specificity, precision, F1-score, false positive rate, false negative rate, and AUC
(Area Under the Curve) values.
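The sketch below shows how such evaluation measures can be computed from a confusion
matrix with scikit-learn; the labels and scores are made up for illustration and are not the
Bank XX data.
python
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual classes (1 = bad credit)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # predicted classes
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # predicted probabilities for the AUC

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)           # true positive rate (recall)
specificity = tn / (tn + fp)           # true negative rate
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      f1_score(y_true, y_pred), roc_auc_score(y_true, scores),
      sensitivity, specificity, false_positive_rate, false_negative_rate)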
Support Vector Machines (SVM) is a popular supervised machine learning algorithm used for
classification and regression tasks. It works by finding the hyperplane in a high-dimensional
space that maximally separates the data points into different classes. In Python, you can
implement SVM using the Scikit-learn library, which provides a user-friendly interface for
training and testing SVM models.
Here's an example of how to use SVM for classification in Python:
First, you need to import the necessary libraries:
python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
Next, you can load a sample dataset, such as the iris dataset, using the `load_iris()` function
from Scikit-learn:
python
iris = datasets.load_iris()
X = iris.data
y = iris.target
You can then split the dataset into training and testing sets using the `train_test_split()`
function:
python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Next, you can create an instance of the SVM classifier by calling the `SVC()` function:
python
svm = SVC(kernel='linear')
You can then train the classifier on the training data using the `fit()` method:
python
svm.fit(X_train, y_train)
Finally, you can make predictions on the test data using the `predict()` method and calculate
the accuracy of the model using the `accuracy_score()` function:
python
y_pred = svm.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
This is just a simple example of using SVM for classification in Python. There are many
different hyperparameters you can tune in SVM to optimize the model's performance, such
as the choice of kernel, the regularization parameter, and the gamma value.
Figure 3.6: Flow Chart of SVM Model
We first load the Iris dataset and then split the data into training and testing sets with an
80:20 ratio. We create an SVM model with a linear kernel and regularization parameter
`C=1` and train the model using the training data. Finally, we evaluate the performance of
the trained SVM model on the testing data by computing the accuracy and classification
report.
Note that in practice, you may need to tune the hyperparameters of the SVM model to
obtain the best performance on the given dataset. This can be done using cross-validation or
grid search techniques.
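As a minimal sketch of such tuning, the snippet below runs a grid search with 5-fold
cross-validation over the iris/SVC example used above; the parameter ranges are
illustrative, not values used in the project.
python
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

# Candidate hyperparameter values to try
param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10], 'gamma': ['scale', 0.1]}
grid = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation over the grid
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))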
3.7 Model deployment
Save the trained model: After training, you can persist the SVM model to a file using the
`joblib` module. For example:
python
import joblib
joblib.dump(svm, 'svm_model.pkl')
Load the saved model: To use the trained SVM model for making predictions, you need to
load the model from the file using the `joblib` module. For example:
python
svm = joblib.load('svm_model.pkl')
Prepare new data: To make predictions on new data, you need to pre-process the data in the
same way as the training data; for example, you may need to scale the data or perform
feature selection.
Make predictions: Once you have loaded the trained SVM model and pre-processed the new
data, you can use the `predict()` method to make predictions on the new data.
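A minimal sketch of this prediction step is shown below; the feature values are placeholders
for a pre-processed loan application, not a real applicant, and the model is assumed to have
been saved as 'svm_model.pkl' as above.
python
import joblib
import numpy as np

svm = joblib.load('svm_model.pkl')    # load the trained model
# One pre-processed application with the same number of features used in training (illustrative values)
new_application = np.array([[1, 1, 0, 1, 0, 4500, 0.0, 128, 360, 1, 2]])
print(svm.predict(new_application))   # predicted loan status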
CHAPTER 4
IMPLEMENTATION
4.1 Introduction to technologies used
In our project the technologies we have used are:
Technologies used: NumPy, Pandas, Seaborn, Sklearn.
First of all, we had to choose a programming language well suited to our project
requirements: one that is flexible, easy to write and read, and has rich libraries available. So
we decided to use the Python programming language.
4.1.1 Python
Python is a high-level programming language that has become increasingly popular in recent
years. Created in the late 1980s by Guido van Rossum, Python is known for its simplicity,
readability, and flexibility. It is an interpreted language, which means that the code can be
executed without being compiled beforehand, making it ideal for rapid prototyping and
development.
Python's popularity stems from its ability to be used in a wide variety of applications,
including web development, scientific computing, artificial intelligence, data analysis, and
automation. It also has a vast array of libraries and frameworks that make development faster
and more efficient. Python's syntax is designed to be simple and easy to understand, making
it accessible to new programmers and allowing them to focus on solving problems rather than
worrying about the intricacies of the language.
Python's versatility and ease of use have made it a favourite among developers, students and
businesses alike. It is widely used by tech giants such as Google, Facebook, and Amazon, as
well as small start-ups and individual developers. The language's community is also active
and vibrant, with a large number of resources available online, including tutorials,
documentation, and support forums.
In summary, Python is a powerful and flexible language that is easy to learn and use. Its
widespread adoption and active community make it an excellent choice for a wide range of
applications, from simple scripts to complex software systems.
We use the Python programming language in our project because of its rich libraries and its
flexibility: thanks to the predefined functions and classes in Python libraries like NumPy,
pandas, and sklearn, we could process and implement our project quickly and successfully.
After deciding on the programming language, it was time for us to decide on an integrated
development environment (IDE) for our project. Since we chose the Python programming
language, we had a vast number of IDEs to choose from, such as Visual Studio Code,
PyCharm, and Spyder, but we wanted a platform that is cloud based, so we could easily
share, store, and present our project without having to carry it around physically. We found
a platform named Google Colaboratory, which allows us to write Python code online, stores
it in our cloud storage, and also provides us with CPU and RAM resources on demand.
4.1.2 Google Colab
Google Colab is a cloud-based platform that provides an interactive environment for writing
and running Python code. It is a free service offered by Google that allows users to access
powerful computing resources, including GPUs and TPUs, from anywhere with an internet
connection. With Google Colab, you can create and share documents called notebooks, which
contain code, text, images, and visualizations. These notebooks are stored on Google Drive
and can be easily shared and collaborated on with others. Google Colab also includes a wide
range of pre-installed libraries and tools, making it an excellent platform for data analysis,
machine learning, and artificial intelligence. With its easy-to-use interface and powerful
features, Google Colab is an ideal tool for developers, researchers, and students alike.
4.2 Modules & Modules Description
One of the main reasons we chose the Python programming language for this project of
ours is its vast collection of libraries/modules, which can be used in various fields such as
data analysis, AI, and scientific computation. In our project, the specific applications for
which we needed those libraries/modules were data analysis, data integration, data
visualisation, and machine learning algorithm implementation. So we chose our Python
libraries according to those specific needs: we used the "NumPy" and "pandas" libraries for
data integration and data analysis, the "seaborn" library for data visualisation, and finally
the "sklearn" library for the machine learning/algorithmic implementations in our project.
Now let's get to know each of these modules a little better.
4.2.1 NumPy
NumPy is a Python library that is commonly used for numerical computations and data
analysis. It offers a high-performance multidimensional array object and a wide variety of
tools for working with these arrays. Its popularity stems from its ability to efficiently perform
mathematical operations on entire arrays of data, which can often be much faster than
working with individual elements. NumPy also comes equipped with several built-in
functions for mathematical operations, including linear algebra, Fourier transforms, and
random number generation.
The main focus of NumPy is its ndarray (n-dimensional array) object, which is a
multidimensional container that holds data of the same type. This allows for fast
mathematical computations and operations to be carried out on entire arrays of data, which is
beneficial in scientific and engineering applications as well as data analysis and machine
learning. NumPy is open source and freely available, and it can be installed using a package
manager such as pip. It is also compatible with many other Python libraries, including
pandas.
We had to load and handle the dataset of bank customers in our project and also analyse it,
so that we could later modify the data as we wanted. For these reasons we chose the NumPy
library: it converts our dataset into an N-dimensional array and provides various useful
predefined functions and classes that operate on that whole array far more efficiently than
applying individual operations on the dataset element by element.
To use NumPy in Python, you need to import the library first. You can install NumPy using
pip by running the following command in your terminal:
“pip install numpy”
After installation, you can import NumPy in a Google Colab notebook using the following
line of code:
"import numpy as np"
This line of code imports the NumPy library and gives it the alias "np" for convenience.
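A small sketch of the kind of vectorized operations described above is given below; the
numbers are made up for illustration.
python
import numpy as np

loan_amounts = np.array([120, 100, 150, 200], dtype=float)  # loan amounts (in thousands)
incomes = np.array([5.8, 4.2, 6.1, 9.0])                    # applicant incomes (in thousands)

ratios = loan_amounts / incomes      # element-wise operation on whole arrays at once
print(ratios.mean(), ratios.max())   # built-in aggregate functions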
4.2.2 Pandas
We used the Pandas library in our data analysis project because Pandas is a popular open-
source library for data manipulation and analysis in Python. It provides high-level data
structures and functions designed to make working with structured data (which we created
before with NumPy) fast and easy.
The two main data structures in Pandas are Series and Data Frame. A Series is a one-
dimensional array-like object that can hold any data type, while a Data Frame is a two-
dimensional tabular data structure with labelled axes (rows and columns) that can also hold
any data type.
Pandas offers a wide range of functions for data manipulation, including filtering, sorting,
merging, grouping, and aggregation which are some exact operations that we needed for our
project.
Pandas can handle various data formats such as CSV (we used a CSV file as the dataset in
our project), Excel, SQL databases, and more. It is widely used in industries such as finance,
economics, social sciences, and engineering, as well as in data science and machine learning
applications.
To use Pandas in Python, you need to import the library first. You can install Pandas using
pip by running the following command in your terminal:
“pip install pandas”
After installation, you can import Pandas in a Google Colab notebook using the following
line of code:
"import pandas as pd"
This line of code imports the Pandas library and gives it the alias "pd" for convenience.
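The snippet below sketches the filtering, sorting, and grouping operations mentioned above
on a tiny hypothetical loan table; the column names and values are illustrative.
python
import pandas as pd

df = pd.DataFrame({'Education': ['Graduate', 'Not Graduate', 'Graduate'],
                   'LoanAmount': [120, 100, 150],
                   'Loan_Status': ['Y', 'N', 'Y']})

approved = df[df['Loan_Status'] == 'Y']               # filtering
print(df.sort_values('LoanAmount'))                   # sorting
print(df.groupby('Education')['LoanAmount'].mean())   # grouping and aggregation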
4.2.3 Seaborn
In our EDA project we used Seaborn as our data visualization library. Seaborn is based on
Matplotlib, which is also a Python data visualisation library; being built on top of
Matplotlib means you can still use Matplotlib functions when working with Seaborn. The
Seaborn library provides a high-level interface for creating informative and visually
appealing statistical graphics, and it also makes data visualisation operations easier to carry
out.
Seaborn is particularly useful for exploring and understanding data through visualizations. It
offers a wide range of statistical plots, such as scatter plots, line plots, bar plots, histograms,
box plots, violin plots, and many others. These plots are designed to showcase relationships,
distributions, and patterns in your data.
Here are some key features and benefits of using Seaborn:
1. High-level interface: Seaborn provides a simple and intuitive API for creating complex
statistical visualizations with minimal code.
2. Attractive default styles: Seaborn comes with pleasant and visually appealing default
styles, making your visualizations look professional without much customization.
3. Statistical enhancements: Seaborn extends the capabilities of Matplotlib by adding
additional statistical features to the plots. For example, it can automatically compute and
display confidence intervals or fit regression models to your data.
4. Easy integration with Pandas: Seaborn works seamlessly with Pandas data structures,
making it easy to visualize and explore datasets stored in DataFrames.
5. Categorical data support: Seaborn includes functions that handle categorical data
effectively, allowing you to create informative visualizations based on different categories.
To use Seaborn, you first need to install it. You can install Seaborn using pip by running the
following command in your terminal:
“pip install seaborn”
After installation, you can import Seaborn in a Google Colab notebook using the following
line of code:
"import seaborn as sns"
By convention, Seaborn is often imported with the alias "sns" for brevity. Once
imported, you can start using Seaborn's functions and create visually appealing and
informative plots to explore and communicate your data effectively.
4.2.4 Sklearn
Sklearn is a popular open-source library for machine learning in Python. It is built on top of
NumPy, SciPy, and Matplotlib and provides a wide range of tools for supervised and
unsupervised machine learning tasks such as classification, regression, clustering,
dimensionality reduction, and model selection.
It was natural for us to use the sklearn library to implement machine learning in our project
because of its interdependencies and connectivity with the libraries we had already used
(the NumPy library, and the Seaborn library which is built on Matplotlib). Since sklearn
itself is built on NumPy and Matplotlib, choosing it as our machine learning library gave us
far better integration than any other library would have.
Here are some key features and benefits of using Sklearn:
1. Simple and consistent API: Sklearn provides a simple and consistent API that makes it
easy to use for beginners and experts alike.
2. Wide range of algorithms: Sklearn includes a wide range of state-of-the-art machine
learning algorithms, from simple linear regression to advanced ensemble methods like
random forests and gradient boosting.
3. Integration with other Python libraries: Sklearn integrates well with other Python libraries
like Pandas and NumPy, making it easy to handle and manipulate data.
4. Model selection and evaluation: Sklearn provides tools for model selection and evaluation,
such as cross-validation and grid search, which helps you choose the best model for your
data.
5. Active development and community: Sklearn is under active development and has a large
community of contributors and users who provide support, documentation, and examples.
To use Sklearn, you first need to install it. You can install Sklearn using pip by running the
following command in your terminal:
“pip install scikit-learn”
After installation, you can import Sklearn in a Google Colab notebook using the following
line of code:
"import sklearn"
Sklearn is a large library with many modules and functions. To use a specific module or
function, you need to import it explicitly. For example, we used the train_test_split function
from sklearn.model_selection, which you can import as follows:
"from sklearn.model_selection import train_test_split"
With Sklearn, we can start building and training machine learning models quickly and
efficiently, even if we are beginners.
4.2.5 Train_test_split:
Now, about the modules under Sklearn. The `train_test_split` module is a function provided
by the scikit-learn (sklearn) library, which is widely used for machine learning tasks in
Python. This function allows you to split a dataset into training and testing subsets, which is
crucial for evaluating and validating the performance of machine learning models.
The `train_test_split` function takes in one or more arrays or matrices as input, representing
the features and labels of your dataset. It randomly shuffles and splits the data into two or
more portions based on the specified test size or train size. The typical usage splits the data
into a training set and a testing set, but you can also use it for more advanced scenarios like
cross-validation.
Here is the syntax of the `train_test_split` function as used in our code:
python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2)
In the example above, `X` represents the input features, `y` represents the corresponding
labels or target values, and `test_size=0.1` indicates that 10% of the data will be allocated
for testing, while the remaining 90% will be used for training. The `random_state`
parameter is optional and allows you to set a seed value for random shuffling (the value
shown here is only an example), ensuring reproducibility of the same train-test split.
The `train_test_split` function returns four subsets: `X_train` (training features), `X_test`
(testing features), `y_train` (training labels), and `y_test` (testing labels). These subsets can
be used for training and evaluating machine learning models.
It's important to split our data into separate training and testing sets to evaluate how well
our model generalizes to unseen data. The training set is used to train the model, while the
testing set is used to assess its performance. This helps to identify potential issues like
overfitting, where the model performs well on the training data but poorly on unseen data.
By using `train_test_split`, we conveniently split our dataset into training and testing
subsets and proceed with building, training, and evaluating our machine learning model
with confidence.
4.2.6 SVM
Support Vector Machines (SVM) is a supervised machine learning algorithm used for both
classification and regression tasks. SVM aims to find an optimal hyperplane in a high-
dimensional feature space that separates different classes or predicts continuous values.
The key idea behind SVM is to maximize the margin between the decision boundary (the
hyperplane) and the closest data points from each class. These closest data points are called
support vectors, hence the name "Support Vector Machines." By maximizing the margin,
SVM seeks to achieve better generalization and robustness against noise in the data.
SVM has several advantages:
1. Effective in high-dimensional spaces: SVM performs well even in cases where the number
of features is greater than the number of samples. It is suitable for tasks with a large number
of features.
2. Versatility: SVM supports various kernel functions, such as linear, polynomial, radial basis
function (RBF), and sigmoid, allowing flexibility in capturing complex relationships in the
data.
3. Robust against overfitting: By maximizing the margin, SVM tends to generalize well and
be less prone to overfitting. It works particularly well in cases with clear margin separation.
However, SVM also has some considerations:
1. Computational complexity: SVM can be computationally expensive, especially for large
datasets, as it requires solving a quadratic programming problem.
2. Sensitivity to hyperparameters: The performance of SVM can be sensitive to the choice of
hyperparameters, such as the kernel type and C-parameter. Proper tuning is important for
optimal results.
Sklearn is a popular Python library that provides an SVM implementation with the `SVC`
class for classification tasks and `SVR` class for regression tasks. These classes offer various
configuration options and allow us to train SVM models on our datasets.
4.2.7 Accuracy_score
The `accuracy_score` function is a utility provided by scikit-learn (sklearn), a popular Python
library for machine learning. It is used to calculate the accuracy of a classification model by
comparing the predicted labels with the true labels.
The `accuracy_score` function takes two arguments: `y_true` and `y_pred`. Here's the basic
syntax:
python
accuracy = accuracy_score(y_true, y_pred)
- `y_true` represents the true labels or target values of the dataset.
- `y_pred` represents the predicted labels obtained from a classification model.
The `accuracy_score` function compares the corresponding elements in `y_true` and `y_pred`
and calculates the accuracy of the predictions. It returns a single floating-point number
representing the accuracy score, which indicates the percentage of correct predictions.
The `accuracy_score` function can handle multiclass classification as well, where it
computes the accuracy by considering all classes. It accepts 1D arrays of class labels, and
for multilabel problems it also accepts 2D label-indicator arrays.
The `accuracy_score` function is a useful tool for evaluating the performance of classification
models. It provides a simple and intuitive metric to measure the accuracy of predictions,
making it easier to assess the model's effectiveness.
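A tiny illustration with made-up labels:
python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))   # 0.8, i.e. 4 out of 5 predictions are correct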
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

loan_dataset['Dependents'].value_counts()

# Feature selection / Data visualization
# Education & Loan Status
sns.countplot(x='Education', hue='Loan_Status', data=loan_dataset)
# Marital status & Loan Status
sns.countplot(x='Married', hue='Loan_Status', data=loan_dataset)

# Model selection
# Convert categorical columns to numerical values
loan_dataset.replace({'Married': {'No': 0, 'Yes': 1},
                      'Gender': {'Male': 1, 'Female': 0},
                      'Self_Employed': {'No': 0, 'Yes': 1},
                      'Property_Area': {'Rural': 0, 'Semiurban': 1, 'Urban': 2},
                      'Education': {'Graduate': 1, 'Not Graduate': 0}}, inplace=True)
loan_dataset.head()

# Separating the data and label
X = loan_dataset.drop(columns=['Loan_ID', 'Loan_Status'], axis=1)
Y = loan_dataset['Loan_Status']

(The Python code above is a sample from our project, included only up to the model
selection part; the machine learning part that follows it is excluded from this sample.)
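For completeness, a minimal sketch of the machine learning part that follows is given
below, based on the steps described in the screenshots section; it continues from the sample
above (reusing its imports, `X`, and `Y`), assumes all remaining columns of `X` have been
converted to numeric values, and the test size and random seed shown are assumptions
rather than the project's exact settings.
python
# Split the prepared data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=2)

# Train a linear-kernel SVM classifier
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)

# Accuracy on the training and the testing data
print('Training accuracy:', accuracy_score(Y_train, classifier.predict(X_train)))
print('Test accuracy:', accuracy_score(Y_test, classifier.predict(X_test)))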
SCREENSHOTS
Importing the dependencies/modules:
Here we are loading the .CSV file, i.e. the dataset that we are going to work on, initializing
it as "loan_dataset" and also printing its type.
Exploratory data analysis:
Here we are showing the first five rows of our dataset by using the "head()" function on
"loan_dataset". Then, by using "loan_dataset.shape", we can see how many rows and
columns there actually are in our data set; in our case it is 614 rows and 13 columns.
Figure 5.4: describe() implementation
Here we are evaluating the sum of all the null data in our data set by applying the "isnull()"
and "sum()" functions on our dataset, meaning wherever in our data set there is a missing or
unrecognisable value we are finding and counting it, to see in total how much null data is
present in our data set. As the figure above shows, in our data set the "Gender" column has
a total of 13 missing or null values.
Figure 5.6: dropna() implementation
Here, by using the "dropna()" function, we are dropping or deleting the null values which
were previously present in our dataset.
Then, after executing the "dropna()" function, if we again use the "isnull()" and "sum()"
functions, we can clearly see that there are no more null values present in our dataset.
Figure 5.7: replace() & head() implementation
Here we are using the "replace()" function. Using this function we can replace an existing
value with a value of our choosing in a particular column of our dataset. We pass the
arguments of the "replace()" function as follows: first we put the name of the column where
we want to operate, then "{", the existing value, ":", and the replacement value (repeated in
the same way if there is more than one replacement), then we close with "}", give
"inplace=True" to actually change the values, and lastly close the function parenthesis ")";
now we just have to run the line.
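A purely hypothetical illustration of a "replace()" call of this kind is shown below; the
column name and values are examples only and not necessarily what the figure replaces.
python
# Replace the text value '3+' in the 'Dependents' column with the number 4 (illustrative)
loan_dataset.replace({'Dependents': {'3+': 4}}, inplace=True)
loan_dataset.head()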
Feature selection and data visualization:
Here we are using the Seaborn library's "countplot()" function, which plots a graphical
representation of the values in our dataset. We are comparing the values of the "Education"
and "Loan_Status" columns from our dataset; the blue bars represent the rejected loans and
the orange bars represent the approved loans. As we can clearly see from the graphical
comparison, the applicants who hold graduate degrees have far more approved loans than
the non-graduates.
Here we are again using the Seaborn library's "countplot()" function, which plots a
graphical representation of the values in our dataset. This time we are comparing the values
of the "Married" and "Loan_Status" columns from our dataset; again the blue bars represent
the rejected loans and the orange bars represent the approved loans. As we can clearly see
from the graphical comparison, the applicants who are married have far more approved
loans than the unmarried applicants.
Figure 5.2.1: replace() function implementation
Here, by using the same "replace()" function as before, we are replacing all of the text
values in our dataset with integer values, so that we can use as much of the data available in
the data set as possible for the training and testing process.
Here we are using the "drop()" function to drop two columns from our data set: one of
them, the "Loan_ID" column, we do not need, and the other one we need to store separately
in the variable named "Y". So now we have our default dataset without the "Loan_ID"
column in the "X" variable and we have the "Loan_Status" column in the "Y" variable.
Therefore we can now implement the machine learning model (SVM) on the variables "X"
and "Y": it learns a decision boundary from the input variables, and by applying that model
to a test data set we can see the accuracy score of how well our model can predict the loan
status of the applicants.
Figure 5.2.3: train_test_split module implementation
Here we are splitting the data into a training set and a test set (the train_test_split module is
described under point 4.2.5).
Figure 5.2.4: SVM implementation
Here we are implementing the Support Vector Machine (SVM) model, storing the accuracy
score on the training data set and printing it, then doing the same on the testing data set, and
lastly printing the prediction accuracy of our model, which in our case is 0.83, i.e. above 80
percent. Here we can also see that the prediction scores on the training data set and the
testing data set are very close, so we can also say positively that there is no overfitting
problem in our model.
CONCLUSION
In this project of ours, the analytical process started from data cleaning and processing and
the handling of missing values, moved through exploratory analysis, and finished with
model building and evaluation. The model with the best accuracy on the held-out test set is
the one selected. This application can help with risky bank loan customer identification.
In this project we tried to describe and implement the process of data analysis and machine
learning prediction as well as we could. We went through every step, from importing the
library files and the data set, through data integration and data analysis, all the way to
machine learning model selection, algorithm implementation, and prediction, and we
described each step as we understood it.
This model can also be adapted to many different uses; for example, customers could use it
to find a suitable bank of their choice by analysing and comparing the data of different
banks, where such data is available. In this way the model can be used in various scenarios
depending on the needs of the user.
REFERENCES
[1] Amruta S. Aphale, Dr. Sandeep R. Shinde, "Predict Loan Approval in Banking System: Machine
Learning Approach for Cooperative Banks Loan Approval", International Journal of Engineering Research &
Technology (IJERT), Volume 09, Issue 08, August 2020.
[2] Ashwini S. Kadam, Shraddha R. Nikam, Ankita A. Aher, Gayatri V. Shelke, Amar S. Chandgude,
"Prediction for Loan Approval using Machine Learning Algorithm", International Research Journal of
Engineering and Technology (IRJET), Volume 08, Issue 04, April 2021.
[3] M. A. Sheikh, A. K. Goel and T. Kumar, "An Approach for Prediction of Loan Approval using Machine
Learning Algorithm", 2020 International Conference on Electronics and Sustainable Communication Systems
(ICESC), 2020, pp. 490-494, doi: 10.1109/ICESC48915.2020.9155614.
[4] Rath, Golak & Das, Debasish & Acharya, Biswaranjan, "Modern Approach for Loan Sanctioning in
Banks Using Machine Learning", 2021, pp. 179-188, doi: 10.1007/978-981-15-5243-4_15.
[5] Vincenzo Moscato, Antonio Picariello, Giancarlo Sperlí, "A benchmark of machine learning approaches
for credit score prediction", Expert Systems with Applications, Volume 165, 2021, 113986, ISSN 0957-4174.
[6] Yash Divate, Prashant Rana, Pratik Chavan, "Loan Approval Prediction Using Machine Learning",
International Research Journal of Engineering and Technology (IRJET), Volume 08, Issue 05, May 2021.